AUTONOMOUS MOBILE ROBOT NAVIGATION IN URBAN ENVIRONMENT

Chin-Kai Chang

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

May 13, 2016

Copyright 2016 Chin-Kai Chang

Dedication

To all iLab members, friends, and families.

Acknowledgements

To iLab: these have been the greatest ten years of my life, fully dedicated to robotics and science. I thank my advisor, Dr. Laurent Itti, for allowing me to explore all the cutting-edge technology. I really appreciate Christian's relentless mentoring. Finally, I thank my parents for all their support and love.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

1 Introduction
  1.1 Mobile Robot Platform
  1.2 Visual Localization
  1.3 Visual Navigation
  1.4 Hierarchical Map

2 Beobot 2.0 Autonomous Navigation Experimental Platform
  2.1 Introduction
    2.1.1 Current Mobile Robot Platforms
  2.2 Electrical System Design and Implementation
    2.2.1 Computing Modules
  2.3 Mechanical System Design and Implementation
    2.3.1 Locomotion System
    2.3.2 Computing Cluster Case
      2.3.2.1 Vibration Attenuation and Shock Absorption System
      2.3.2.2 Cooling System
  2.4 Software Design
  2.5 Testing and Results
    2.5.1 Navigation Algorithm Implementation
    2.5.2 Computational Capabilities
      2.5.2.1 SIFT Object Recognition System Test
      2.5.2.2 Distributed Visual Saliency Algorithm Test
      2.5.2.3 Biologically-Inspired Robot Vision Localization Algorithm Test
  2.6 Discussions and Conclusions

3 Gist and Saliency Based Navigation
  3.1 Introduction
  3.2 Design and Implementations
    3.2.1 Navigation System
      3.2.1.1 Salient Region Tracker
      3.2.1.2 Lateral Difference Estimation
      3.2.1.3 Path Planning Biasing
  3.3 Testing and Results
  3.4 Discussion and Conclusions

4 Road Region and Boundary Estimation based Navigation
  4.1 Introduction
  4.2 Design and Implementations
    4.2.1 Road Region Estimation
    4.2.2 Road Boundary Estimation
    4.2.3 Combining Road Region and Boundary Estimation
  4.3 Testing and Results
    4.3.1 Visual Road Recognition
    4.3.2 Autonomous Navigation
  4.4 Discussion and Conclusions

5 Hierarchical Map Representation for Heterogeneous Sensor Fusion
  5.1 Introduction
  5.2 Design and Implementation
    5.2.1 Localization System
    5.2.2 Navigation System
      5.2.2.1 Visual Road Recognition
      5.2.2.2 Constructing the Grid Occupancy Map
      5.2.2.3 Path Generation
  5.3 Testing and Results
    5.3.1 Localization Testing
    5.3.2 Navigation Testing
  5.4 Discussion and Conclusion
    5.4.1 Performance Evaluation
    5.4.2 System Comparison
    5.4.3 Failure Cases and Future Improvements

6 DeepVP: Deep Learning for Vanishing Point Detection on 1 Million Street View Images
  6.1 Introduction
  6.2 Vanishing Point Dataset
    6.2.1 New Dataset for Deep Learning Training
    6.2.2 Data Collection Method
    6.2.3 Image Collection from a Single GPS Location
    6.2.4 Align Road Heading with Image Heading
    6.2.5 Mapping Camera Angle to Pixel Coordinate
  6.3 Network Architecture and Its Adaptation to Vanishing Point Detection
    6.3.1 Network Architecture
    6.3.2 Vanishing Point Detection
  6.4 Experiments
    6.4.1 Dataset Setup
    6.4.2 CNN Setup
    6.4.3 Quantitative Comparison
    6.4.4 Independent Routes Comparison
    6.4.5 VP Coordinate Comparison
    6.4.6 Visualization
  6.5 Conclusions

7 Conclusion and Future Works

References

List of Tables

2.1 Sensors Provided in Beobot 2.0
2.2 Beobot 2.0 Electrical System Features
2.3 The Vision Toolkit Features
2.4 Beobot 2.0 Sub-System Testings
2.5 Nearness Diagram Rules
2.6 Beobot 2.0 Nearness-Diagram (ND) Navigation Testing
2.7 Beobot 2.0 SIFT Object Recognition Time Breakdown
2.8 Beobot 2.0 SIFT Database Matching Algorithm Testing Results
2.9 Beobot 2.0 Distributed Saliency Algorithm Time Breakdown
2.10 Beobot 2.0 Visual Saliency Algorithm Testing Results
2.11 Beobot 2.0 Localization System Time Breakdown
2.12 Beobot 2.0 Vision Localization System Testing
2.13 Beobot 2.0 Vision Localization System Testing Results
3.1 Navigation and Localization System Performance Results
4.1 Experiment Sites Information
4.2 Autonomous Navigation Performance Results
5.1 Experiment Sites Information
6.1 Overview of Most Available Vanishing Point Datasets
6.2 Comparison of Results with Hand-Crafted Feature Approach

List of Figures

1.1 Examples of tasks for service robots, illustrated from left to right: patrolling alongside law enforcement personnel, searching for target objects, providing information to visitors, and assessing the situations and conditions of the robot's area of responsibility.

1.2 Various features of Beobot 2.0. Beobot 2.0 utilizes an electric wheelchair platform to carry a high-performance computing cluster of 16 processor cores, 2.2GHz each. The robot is equipped with various sensors such as a Firewire camera, Laser Range Finder, sonar array, IMU, compass, and GPS. In addition, panel-mount water-proof USB connectors are available for new sensors, along with RJ45 Ethernet for wired internet connection and panel-mount KVM inputs for a regular-sized monitor, keyboard, and mouse. There is also a touchscreen LCD for a convenient user interface. Furthermore, the kill-switches at each corner of the robot are available as a last resort to stop it in emergency situations.

2.1 Beobot 2.0 electrical system. On the right side of the diagram, there are 2 baseboards, each housing 4 COM Express modules, and each module has its own SATA hard-drive. The backbone inter-computer communication is Gigabit Ethernet connected through a switch. For visual interface to individual computers, a KVM (Keyboard, Video, and Mouse) is used to connect to either an 8-inch LCD touchscreen or an external monitor, mouse, and keyboard. In addition, a PCI Express - Firewire interface card is used to connect to a low-latency camera. The other sensors are connected via the many USB connectors that are panel-mounted on top of the robot as well as on the baseboard. The whole system is powered by a 24V battery circuit supply (with kill-switches for safety purposes) and is regulated through a set of dedicated Pico-ATX power modules. The same battery circuit also powers the motors as well as the liquid-cooling system.

2.2 Baseboard.
The image on the left is a fully assembled baseboard with 4 COM Express modules. The black plates are the heat-spreaders attached to the processors. There are also an Ethernet jack and two USB jacks placed on the right side of the board. The layout on the right is the circuit done in the Altium [5] Printed Circuit Board (PCB) design software.

2.3 Optional caption for list of figures

2.4 A SolidWorks [160] model of the robot shown from its side and cut through its center. The bottom of the image displays the Liberty 312 wheelchair base, and above it is the robot body in cardinal color. The robot body is divided into two chambers by the dust-proof firewall. The back chamber completely seals the computers within from the elements. The front of the robot, which houses part of the cooling system, is open to allow for heat dissipation. The heat from the computers is transferred by the liquid in the cooling block, which is attached to the heat-spreaders on each module. The liquid then moves through the radiator, which is cooled by the fans, before going back to the cooling block. In addition, the computing block is shock-mounted on four cylindrical mounts that are used to absorb shocks and vibration.

2.5 Graphical User Interface Display of the ND Navigation system. The system identifies 9 (indexed from 0 to 8; note that 4 and 7 are cut off as they are drawn outside the frame of the GUI) different regions. The robot's next direction of motion is denoted by the blue line. The yellow lines delineate the boundaries of the navigable region. The red line indicates the directions 60 degrees to the left and right of the robot's next direction. We also display the robot's translational and rotational motor command. Both of these numbers range from -1.0 to 1.0 (negative values indicate moving backwards and counter-clockwise rotation, respectively).

2.6 Snapshot of the constructed environment for ND navigation testing.

2.7 Optional caption for list of figures

2.8 Flow of the distributed SIFT database matching algorithm, denoted in increasing alphabetical order in the figure and referred to below in parentheses. First the camera passes in the High Definition (1920 by 1080 pixel) frame to the SIFT Master module (A). This module takes about 18 seconds to extract the SIFT keypoints from the input image before sending them to the SIFT Worker processes utilized (denoted as B_i, i being the total number of workers). Depending on the number of workers, each takes between 16.25 and 353.22 seconds to return a match to the SIFT Master (C_i).

2.9 Allocation of the different programs of the distributed SIFT database matching algorithm in Beobot 2.0. The SIFT Master module is run on one of the cores in computer COM E1, while the various SIFT Worker modules are allocated throughout the cluster.

2.10 Results for SIFT database matching testing on Beobot 2.0.

2.11 Flow of the distributed saliency algorithm, denoted in increasing alphabetical order in the figure and referred to below in parentheses. First the camera passes in the high resolution 4000 by 4000 pixel image to the SalMaster module (A).
SalMaster pre-processes the image, which takes 100ms, before sending out the image to various sub-channel SaliencyWorker processes (denoted as B_i, i being the total number of workers). The color, intensity, and flicker sub-channels take up to 100ms, while the orientation sub-channels take up to 300ms. These results are then re-combined by SalMaster (C_i), which takes less than 10ms.

2.12 Allocation of the different programs of the distributed saliency algorithm in Beobot 2.0. The Saliency Master module is run on one of the cores in computer COM E1, while the various Saliency Worker modules are allocated throughout the cluster.

2.13 Results for Saliency Algorithm Testing on Beobot 2.0.

2.14 Flow of the localization system, denoted in increasing alphabetical order in the figure and referred to below in parentheses. First the camera passes in a 160 by 120 pixel image to the Gist and Saliency Extraction module (A), which takes 30-40ms, before sending out to the Localization Master module (B). This module then allocates search commands/jobs in the form of a priority queue to be sent to a number of Localization Workers (C_i, i being the total number of workers) to perform the landmark database matching. A search command job specifies which input salient region is to be compared to which database entry. This takes 50-150ms, depending on the size of the database. The results are then sent back to the Localization Master (D_i) to make the determination of the robot location given the visual matches. The last step takes less than 10ms.

2.15 Allocation of the different programs of the localization system in Beobot 2.0. The Gist and Saliency Extraction (GistSal) and Localization Master modules are allocated to computer COM E1, while the various Localization Worker modules are assigned to cores throughout the cluster. Note that there are also 2 worker modules in COM E1. This is because they only run when the GistSal and Localization Master modules do not, and vice versa.

2.16 Examples of images in the ACB (first row), AnFpark (second row), and FDFpark (third row).

3.1 Beobot 2.0 navigating in a cluttered environment.

3.2 Diagram of the overall vision navigation and localization system. From an input image the system extracts low-level features consisting of center-surround color, intensity, and orientation that are computed in separate channels. They are then further processed to produce gist features and salient regions. We then compare them with previously obtained environment visual information to estimate the robot's location. Furthermore, the identified salient regions are used to direct the robot's heading, to navigate itself to its destination.

3.3 Diagram for salient region tracking. The system uses conspicuity maps from 7 sub-channels to perform template matching tracking. Before selecting the predicted new location, the system weighs the result with respect to the proximity of the point in the previous time step. In addition, we also adapt the templates over time for tracking robustness.

3.4 Forward projection procedure. A detected region R_i from time T = t_i (top left of the figure) is successfully matched to salient region L_{j,k}, p frames later.
This is the region that belongs to Landmark L_j with index k (denoted in solid red at the right side of the figure), where k is one of the multiple views of the landmark recorded during training. At the current time T = t_{i+p}, the region R_i has changed its appearance as the robot moves (now denoted as R_{i+p} on the bottom left side of the figure). We project forward the matched region R_i by finding a corresponding match for R_{i+p} in landmark L_j, in this case with region L_{j,k+l} (denoted in solid green at the right side of the figure), which was found further along the path during training.

3.5 Diagram for the voting scheme of the navigation algorithm. There are three scenarios where the robot either moves left, straight, or right. For each scenario, there are two cases depending on the sign of X_d, the corresponding point of the landmark database region. If X_d is to the left of the center of the image (the dashed line that goes through the origin of the image coordinate system), then X_d < 0. The same applies for X_t, the horizontal coordinate of the keypoint in a matched input region. If X_t < X_d and X_t < 0, then turn left; else if X_t > X_d and X_t > 0, then turn right. If neither, then go straight.

3.6 A snapshot of the localization and navigation system. The input (top left) image contains the salient region windows. Green windows mean a database match, while red ones are not found. Below the input image, to the left, is the segment estimation computed from gist features. There are four segments in the environment. Here the actual segment 4 is classified as second most likely (to segment 1) by the system. To the right of the segment estimation result are the matched salient regions drawn at the original detected image coordinate. Next to the input image is the robot state projected onto the map: cyan points are the location particles (hypotheses) generated by the Monte Carlo Localization algorithm, the yellow points are the locations of the matched database salient regions, and the green circle (the center of the blue circle) is the most likely location. The radius of the blue circle is equivalent to 3m. The robot believes that it is 4 meters from the intersection, which is correct to within 20cm. On the navigation side (underneath the map), the robot is turning to the left as decided by the keypoint votes.

3.7 Scene examples of the HNB and Equad environments.

3.8 Result for the indoor HNB testing. Graph labels are described in the text.

3.9 Result for the outdoor Equad environment. Graph labels are described in the text.

4.1 Beobot 2.0 performing autonomous navigation in an unknown environment. There are two characteristics of the road that are exploited by our system. One is the appearance similarity within the road, the color of the concrete, which can be compromised by shadows. The other cue is the vanishing point at the end of the road, noted by the yellow dot. The difficulty of vanishing point-based systems is their instability, as a small location shift can result in large shape changes on the road area closest to the robot, making it less suitable for navigation tasks.

4.2 Overall visual road recognition system. The algorithm starts by inputting the camera image to both the road region and road boundary estimation modules.
The road region estimator utilizes a graph-based segmenter, while the road boundary estimator relies on a real-time vanishing point detection. Both modules output the road color average and the road center location, which are fused together by a Kalman Filter.

4.3 Road region selection process. The estimator searches for the road region in the segmented input image using the center W_fixed window if the robot is moving steadily in the middle of the road. However, if the robot starts to swerve from the center of the road significantly, it then switches to W_float. The system switches back to W_fixed when the robot successfully re-centers itself.

4.4 Vanishing point detection module. The module first computes Gabor pyramids in 4 directions and 3 scales. It then computes the edge direction confidence scores to select only the most robust edges. The module then votes for the vanishing point location using robust edges within 0.35*image diagonal below each respective vertical coordinate [75]. To speed up the process, the module performs a non-uniform sampling of the candidate locations. That is, it densely samples the locations around the vanishing points from the last 10 frames, as well as sparsely samples the rest of the locations.

4.5 Finding Road Boundaries. From the vanishing point, we extend rays with 5 degrees spacing and evaluate how likely each ray is a road boundary. Refer to text for details.

4.6 Gabor Response Variance Score (GRVS). The Gabor responses of the points along a candidate road boundary line l are indicated by the cyan bars. The table to the right tabulates the total count of dominant angles for each orientation. Here, the 45° angle wins out, and the result is used to calculate GRVS(l).

4.7 Example images from each testing environment. We display 4 example figures for sites HNB, AnF, Equad, and SSL. Below each figure, we display the system's road segmentation estimation, denoted by the red shading. In addition, we also added the vanishing point estimation, denoted by the blue dot, with the extended green line being the road boundary. Also, we added a yellow line on the bottom of each image that points in the direction of the robot heading. As we can see, the system works on various road appearances and widths, and under many lighting conditions. Furthermore, it is also able to overcome various challenges, such as pedestrians and sporadic shadows.

4.8 Road Recognition Results. The bar graphs illustrate the performance of the different algorithms in recognizing the road in different environments. Performance is measured by the difference between the ground truth and estimated road center in cm. We report the results by the individual algorithms in our approach, road region and road boundary estimation, and the overall system. In addition, we also include implementations of [135] and [75]. Note: shorter bars are better; the shortest in each group is denoted by a star.

4.9 System Comparison Examples.
The different systems' outputs are organized into different rows, with the first three rows occupied, in order, by the individual algorithms in our approach, road color region detection and road boundary estimation, and then the overall system. In addition, we also include an implementation of the systems by [135] (fourth row) as well as [75] (fifth row).

4.10 Example trajectory of autonomous navigation using the presented visual road recognition system. Here Beobot 2.0 has to traverse a 60.54m route. The autonomous navigation is indicated by the green trail, while the remote control (to rescue the robot from hitting a wall) is colored in red. The reason why the robot moved to the right is that it decided to take the right turn on a road that suddenly decreases its width by half. In addition, we also added a few example images to show the different parts of the route, where we see that the robot has to manage its path around people and other vehicles.

5.1 Diagram for the overall localization and navigation system. We run our system on a wheelchair-based robot called the Beobot 2.0. As shown in the sensor column, the system utilizes an IMU, camera, Laser Range Finder (LRF), and wheel encoders. We use the camera to perform vision localization by integrating results from salient landmark recognition and place recognition. For navigation we use visual road recognition for estimating the road heading, and LRF-based grid-occupancy mapping for obstacle avoidance. The system utilizes a topological map for global location representation and a grid-occupancy map for local surrounding mapping. The grid map, which is represented by the blue rectangle on the topological map, is connected with the topological map by the intermediate goal (green dot).

5.2 Diagram of the vision localization system. The system takes an input image and extracts low-level features in the color, intensity, and orientation domains. They are further processed to produce gist features and salient regions. The localization system then uses these features for segment classification and salient landmark matching (respectively), along with fused odometry, to estimate the robot's location using a Monte-Carlo framework.

5.3 Matching process of two salient regions using SIFT keypoints (drawn as red disks) and the salient feature vector, which is a set of feature map values taken at the salient point location (yellow disk). The lines indicate the keypoint correspondences. The fused image is added to show alignment.

5.4 Diagram of the localization sub-system timeline. The figure lists all the localization sub-system components: two sensors, four perceptual processing steps, and a back-end localization estimation. Please see the text for explanation.

5.5 Overall visual road recognition system. The algorithm starts by computing a Canny edgemap from the input image. The system has two ways to estimate the road. One is using the slower full recognition step (the top pipeline), where it performs road segment detection. Here, the segments are used to vote for the vanishing point (VP) as well as to extract road lines. The second (bottom) is by tracking the previously discovered road lines to update the VP location as well as the robot lateral position.
The tracker is also used to project forward output road lines from the top pipeline, which may fall behind the camera frame rate. The road recognition system then outputs the VP-based road heading direction as well as the lateral position to the navigation system to generate motor commands.

5.6 Vanishing point (VP) voting. The VP candidates are illustrated by disks on the calibrated horizon line, with radii proportional to their respective accumulated votes from the detected segments. For clarity, the figure only displays segments that support the winning VP. A segment s contributes to a vanishing point p by the product of its length and the distance between p and the intersection point of the horizon line and a line extended from s, labeled as hintercept(s).

5.7 Line tracking process. The system tracks a line equation from the previous frame (denoted in yellow) in the current edgemap by finding an optimally fit line (orange) among a set of lines obtained by shifting the horizontal coordinate of the previous line's bottom point (bottom of the image) and horizon support point (intersection point of the line and the horizon support line) by +/- 10 pixels with 2 pixel spacing. The set of candidate shifted points is shown in red on the bottom and on the horizon support line.

5.8 The conversion of the vanishing point (VP) shift from the center of the image and of the lateral deviation from the original location to robot lateral position. In the VP shift, each pixel (in this case 120 pixels) away from the center is multiplied by 0.16875°, while for lateral deviation, every pixel (in this case 22.8 pixels) away from the original position equals 9.525mm. Since the lateral deviation δ is of low priority, it only needs to be corrected by the time the robot arrives at the top of the local map, or within the goal longitudinal distance long(g), which is 7.2m. The resulting angle computes to atan(δ/long(g)) = atan(22.8 * 0.009525 / 7.2) = 1.73°.

5.9 Diagram of the road recognition timeline. The figure lists all the components in the system: camera and IMU, edgemap computation, the two visual recognition pipelines, and the final road heading estimation.

5.10 Three stages of the grid occupancy map. The first, leftmost, image is the discrete map D of all the discovered occupancy values, with white being unoccupied, black fully occupied, and cyan unknown. The second map is the blurred map B, while the last occupancy map F is biased by the most recently produced path to encourage temporal path consistency. Also notice that the unknown grid cells, such as the ones below the robot, do not extend their influence out in either the blurred or the biased map.

5.11 Diagram of the grid map construction timeline. The figure lists all the components: the three utilized sensors, the road heading estimation, and the grid map construction process.

5.12 The three stages of the path trajectory and motion command generation. The figure displays both the full local map on the left and a zoomed part within the red rectangle on the right for a clearer picture. The system first runs A* to generate a rigid path, shown in green in both images. The system then deforms the rigid path (in blue) to optimally and smoothly avoid obstacles. Finally, the system uses the Dynamic Window Approach (DWA) to generate the actual command, one that accounts for the current robot dynamics. That resulting command is modeled accurately as a curve in orange.
In addition, the figure also displays the predicted location of the robot, noted by the red disk.

5.13 Diagram of the full system timeline. The final motor command to be executed by the robot (bottom row) is generated by the path planning sub-system. It utilizes all the previously described sub-systems, with grid map construction being the trigger.

5.14 Map of the experiment routes at the University of Southern California. They are the libraries (LIBR) route, indicated in red, the professional schools (PROF) route in blue, and the athletic complex (ATHL) route in green. The remaining routes in black are not used in the documented experiment but have been traversed during system development.

5.15 A snapshot of the system test-run. The top-left (main) image displays the robot moving towards the goal. Below is the road recognition system, which identifies the road heading (indicated by the green disk) and the lateral position of the robot (blue stub on the bottom of the image). To the right are three salient landmark matches as well as the segment estimation histogram below. The top-right of the figure shows the global topological map of the campus, with the current robot location indicated by a red square, and the goal by a blue star. It also draws the current segment in pink, matching the segment estimation output underneath the landmark matching. Finally, the bottom right of the figure displays the local grid occupancy map, which shows a robot path to avoid an approaching pedestrian.

5.16 Examples of scenes from the libraries (LIBR), professional schools (PROF), and athletic complex (ATHL) routes, each displayed in its respective row. The figure also depicts various challenges that the robot encountered, such as pedestrians, service vehicles, and shadows. These images are adjusted to remove the bluish tint in the original images for better display.

5.17 Boxplot of metric localization error from different routes, for both manual and autonomous navigation, and in both low and high crowd conditions. In addition, the figure also reports the total performance accuracies for each condition. The figure shows that the overall autonomous mode error median is 0.97m, while the medians of the various conditions remain low, from 0.75 to 1.5m. Each box plot notes the median (the thicker middle line), the first and third quartiles, and the end whiskers, which are either the minima/maxima or 1.5 * IQR (inter-quartile range). The stars above the lines joining various box plots note the statistical significance, with one star indicating p<0.05, two stars p<0.01, and three stars p<0.005.

5.18 Detailed diagram of localization error while the robot autonomously navigates the professional schools (PROF) route during the low crowd condition. The error is indicated by a gradation of red color, with the darker hue being higher error.

5.19 Landmark recognition in a building corridor at the last segment of the PROF route. The opening at the end is the only salient landmark that is consistently recognized at different distances, with minimally observed changes throughout the traversal.

5.20 The localization error box plot for lateral deviations of 0m, 1.33m, 2.33m, and 3.33m from the center of the road.
All the error differences compared to that of driving in the middle of the road are statistically significant based on the Wilcoxon rank sum test (p = 1.093*10^-7 vs. 1.33m, p = 0.015 vs. 2.33m, and p = 2.476*10^-5 vs. 3.33m).

5.21 Box plot of localization errors during obstacle avoidance on the different routes, and in both low and high crowd conditions. Note that the result for the PROF route during low crowd is missing because the robot did not execute a single obstacle avoidance maneuver on that run.

5.22 Plot of turning locations throughout the experiment. The black partially-blocked arrow coming from the bottom of the figure indicates the direction the robot is coming from. It is pointing toward the gray and black circle representing the center of the intersection, which is the ideal turn location. The color of the dots, which corresponds to the color of the arrows, indicates where the robot is trying to go: red is for turning left, green for right, and yellow for straight. Each red, yellow, and green dot represents where the robot was when turning (the robot turns in place).

5.23 Results of the navigation system in each site and under low and high crowding conditions. We measure the error as the lateral deviation of the robot from the road center. The far right bars display the distributions of low vs. high crowd performance, which is statistically significant (Friedman test, p<0.005).

5.24 Trajectory of the robot on the PROF route trial run under the low crowd condition. Here the deviation from the ground truth center of the road is indicated by the gradation of the red color, with the darker hue being higher error. In addition, the green bar behind the trajectory indicates that the error is within 1.5m of the center of the road, while the wider black bar is for 5m.

5.25 Successive avoidance maneuvers by the robot. It has to avoid two people (first image), street signs (second), and a long bike rack (third).

5.26 Navigation errors during obstacle avoidance from different routes, and in both low and high crowd conditions. Note that the result for the PROF route during low crowd is missing because the robot did not execute a single obstacle avoidance maneuver during that time.

5.27 Views of the environment as the robot is moving through a water fountain area.

6.1 Autonomous car navigation in a highway environment. The system needs to estimate the road heading as well as the road boundary.

6.2 DeepVP dataset collected from worldwide routes.

6.3 Discretized VP labels, in a total of 15x15 = 225 labels.

6.4 Computing the angle difference between camera heading and road heading.

6.5 The CNN architecture used for vanishing point detection. We closely follow the architecture of AlexNet [80], only changing the filters of the last two fully connected layers to 1024. The architecture in our notation form is: C96-P-C256-P-C384-C384-C256-P-F1024-F1024-O. Refer to text (section 6.3) for description.

6.6 Route-wise vanishing point detection accuracy of the two algorithms. As seen, the DeepVP algorithm consistently beats the Gabor VP method over all routes, with significant accuracy improvements.
This is a clear demonstration of the superiority of DeepVP over the traditional hand-crafted algorithms.

6.7 The location-dependent accuracy of vanishing point estimation by the two algorithms. As seen, the Gabor VP algorithm is biased, since it achieves better prediction accuracy when the ground-truth VPs are near the image center, while the prediction accuracy of DeepVP is relatively location-insensitive.

6.8 Effect of network capacity on VP detection accuracy. We show empirically that when the number of neurons in the last two fully connected layers is as large as the number of output layer labels, the CNN achieves as good a performance as a larger capacity network.

6.9 Visualization of receptive fields of filters at different layers. Top: top 25 images with the strongest activations for filters at a few layers (pool1, pool2, conv3, and pool5); Bottom: 25 image patches from the corresponding images on the 1st row that excite the filter the most.

6.10 Example images from different sites.

Abstract

Unmanned ground vehicles (UGVs) are highly versatile carriers for transportation, surveillance, search, and rescue tasks. A service-type mobile robot that must travel through both indoor and outdoor environments encounters complex challenges different from those of street vehicles. The urban pedestrian environment is typically GPS-denied, which demands a tightly integrated approach of perception, estimation, planning, and motion control to surmount. In this thesis, we present the design and implementation of Beobot 2.0, an autonomous mobile robot that operates in unconstrained urban environments. We developed a distributed architecture to satisfy the requirements of computationally intensive algorithms. Furthermore, we propose several biologically-inspired visual recognition methodologies for indoor and outdoor navigation. We describe novel vision algorithms based on saliency, gist, image contours, and region segments to construct several perception modules such as place recognition, landmark recognition, and road lane detection. To cope with the latencies and update frequencies of each perception module while performing real-time navigation tasks, we further investigate a hierarchical map representation that fuses the quick, yet limited, state information while time-consuming but more discriminating data remains in processing. We validated our system using a ground vehicle that autonomously traversed multiple outdoor routes several times, each route 400m or longer, on a university campus. The routes featured different road types, environmental hazards, moving pedestrians, and service vehicles. In total, the robot logged over 10km of successfully recorded experiments, driving within a median of 1.37m laterally of the center of the road, and localizing within 0.97m (median) longitudinally of its true location along the route.

Chapter 1
Introduction

The ability to localize, or recognize one's location, and to navigate, or move about one's environment, are critical building blocks in creating truly autonomous robots. In the past decade, there has been significant progress in this aspect of autonomy, particularly indoors [169, 43, 95] and on streets and highways for autonomous cars [110, 168].
However, despite these successes, ample opportunity remains for substantial progress in less constrained pedestrian environments, such as a university campus or an outdoor shopping center. This type of setting encompasses a major portion of the working environments for human-sized service robots, whose tasks are exemplified in figure 1.1. One major hurdle is that sensors and techniques which are reliable in other areas are not as effective here.

Figure 1.1: Examples of tasks for service robots, illustrated from left to right: patrolling alongside law enforcement personnel, searching for target objects, providing information to visitors, and assessing the situations and conditions of the robot's area of responsibility.

For one, current indoor localization [171, 43] and navigation [104, 95] systems rely primarily on planar proximity sensors such as the Laser Range Finder (LRF). These proximity sensors exploit the confined nature of indoor space, e.g., narrow hallways, to create a unique location signature for localization, as well as to direct the robot heading. In the open-space outdoors, however, these sensors are less effective because, first, surfaces detected by laser may not be as structurally simple as indoors (e.g., trees, shrubs), and, second, often the nearest solid surfaces may be out of the sensor's range. Thus, it is desirable to find other ways to robustly sense the environment. Two primary sub-problems in navigation are to estimate the road heading, so that the robot stays on its route, and to avoid static and dynamic obstacles. Because obstacle avoidance is readily and identically solved by proximity sensors both indoors and outdoors, road recognition becomes the more pressing navigational concern.

Another commonly-used group of devices are depth sensors, which produce dense three dimensional proximity information by exploiting a number of physical properties. However, many of them, e.g., the Swiss-ranger or Kinect, do not work robustly outdoors, especially in the presence of sunlight. Even for the ones that do, such as stereo cameras, the system still has to perform the difficult task of extracting the shape and appearance of the road, possibly without the help of surrounding walls.

The problem of road recognition can now be robustly solved on urban streets and highways because of distinguishing cues such as clear lane markings and bankings. One key breakthrough of the DARPA Grand Challenge [110] was the introduction of a sensor called the Velodyne [47]. It provides combined proximity and appearance information, measured densely in all 360°, with a 120 meter range. The sensor has played a major role in enabling many teams to finish the race. Furthermore, it has also allowed the Google Car [168] to travel more than 300,000 miles over many months in the San Francisco area. However, because of its cost, the Velodyne has not been widely utilized. In addition, pedestrian roads, which are less regulated and whose delineations are more subtle and varied, pose a different challenge, one that still has not been generally solved even with a Velodyne.

Another aspect of autonomous cars that does not transfer to human-sized service robots is how localization is performed. Unlike faster moving cars driving on a highway, which can afford to make Global Positioning System (GPS) location queries that are far apart, service robots operate on a smaller environment scale and move at the speed of humans.
A GPS reading, with its relatively coarse resolution, would not be able to localize a robot accurately enough to turn correctly at a typical campus road intersection. Furthermore, many pedestrian roads have tall trees or buildings alongside them, which denies the satellite visibility that GPS relies upon.

Given these factors, vision, the primary sense of humans, is an ideal perceptual modality to localize and navigate in most outdoor places that humans can travel to. In addition, it is also an attractive alternative because of the affordability and availability of cameras. However, one drawback of these versatile devices is the associated processing complexity, further exacerbated by the real-time requirements that robotic systems demand.

1.1 Mobile Robot Platform

In the past decade, researchers in the field of mobile robotics have increasingly embraced probabilistic approaches to solving hard problems such as localization [43, 170, 171], vision [52, 192], and multi-robot cooperation [172, 41]. These algorithms are far more sophisticated and robust than the previous generation of techniques [17, 93, 124]. This is because these contemporary techniques can simultaneously consider many hypotheses in the form of multi-modal distributions. Because of that, however, they are also far more computationally demanding. For example, a visual recognition task, wherein we need to compare the current input image captured by a camera against a large database of sample images [90, 100, 10], requires not only that robust visual features be extracted from the input image (already a computationally demanding task), but also that these features be matched against those stored in the database (an even more demanding task when the database is large). As a point of reference, comparing two 320x240 images using SIFT features [90] can take 1 to 2 seconds on a typical 3GHz single-core machine. To be able to run such algorithms in near real-time, we need a mobile robot equipped with a computing platform significantly more powerful than a standard laptop or desktop computer.

However, existing indoor and/or outdoor mobile robot platforms commercially available to the general research community still appear to put little emphasis on computational power. In fact, many robots, such as the Segway RMP series [145], have to be separately furnished with a computer. On the other hand, robots that come equipped with multiple on-board computers either do not use the most powerful computers available today (e.g., the Seekur [105], which relies on the less powerful PC/104 standard) or have fewer computers (e.g., [19, 189]) than our proposed solution, Beobot 2.0.

Beobot 2.0 is the next iteration of the Beobot system developed in our lab [26]. The original Beobot integrated two full-sized dual-CPU motherboards for a total of four 1GHz processors. For Beobot 2.0, we use eight dual-core Computer-on-Module (COM) systems. Each COM measures just 125mm x 95mm x 18mm and nominally consumes only 24 Watts of power. Nonetheless, with a 2.2GHz dual-core processor, a COM has computing power equivalent to current dual-core laptop systems. Despite this state-of-the-art computing platform, we have managed to keep the overall cost of our research-level cluster-based mobile robot to under $25,000 (detailed in [149]).
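To make the computational argument concrete, the sketch below shows one way a landmark-database search can be spread across the cluster's cores, in the spirit of the master/worker decompositions evaluated in chapter 2. It is a minimal illustration using Python's standard multiprocessing module; the match_score() function and the toy database are hypothetical placeholders, not the toolkit's actual API.

```python
# Minimal master/worker sketch: distribute landmark-database matching over
# all available cores. match_score() and the database entries are stand-ins
# for a real SIFT keypoint matcher; only the decomposition pattern matters.
from multiprocessing import Pool

def match_score(args):
    query_features, db_entry = args
    # Placeholder similarity: count of shared "keypoints". A real worker
    # would run descriptor matching plus geometric verification here.
    return db_entry["id"], len(query_features & db_entry["features"])

def find_best_landmark(query_features, database, workers=16):
    # The master farms out one job per database entry, then reduces.
    jobs = [(query_features, entry) for entry in database]
    with Pool(processes=workers) as pool:
        scores = pool.map(match_score, jobs)
    return max(scores, key=lambda s: s[1])

if __name__ == "__main__":
    db = [{"id": i, "features": set(range(i, i + 50))} for i in range(200)]
    query = set(range(120, 170))
    print(find_best_landmark(query, db))   # entry with the largest overlap
```

Chapter 2 measures exactly this kind of decomposition, for SIFT matching, saliency computation, and localization, on the robot's 16-core cluster.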
One aspect of a COM system to underscore here is the ease with which its components can be upgraded. Since the input and output signals are routed through just two high-density connectors, one need only remove the current module and replace it with an upgraded one. Thus, as more and more powerful processors become available, Beobot 2.0's computer systems can keep pace, making it somewhat more resistant to the rapid obsolescence that is characteristic of computer systems. The ability to keep pace with processor technology is important because robotic algorithms are expected to continue to evolve and become ever more complex, thus requiring commensurate levels of computing power.

Beobot 2.0's computer system is mounted on an electric wheelchair base (figure 1.2), with an overall size that is close to that of a human. This allows the robot to navigate through corridors and sidewalks and creates an embodiment which is ideal for interacting with people. We assume that the majority of these pertinent locations are wheelchair accessible, as required by law. We believe that even with this locomotion limitation, there are still enough physically reachable locations to perform comprehensive real-world experiments. Figure 1.2 shows the finished robot.

Figure 1.2: Various features of Beobot 2.0. Beobot 2.0 utilizes an electric wheelchair platform to carry a high-performance computing cluster of 16 processor cores, 2.2GHz each. The robot is equipped with various sensors such as a Firewire camera, Laser Range Finder, sonar array, IMU, compass, and GPS. In addition, panel-mount water-proof USB connectors are available for new sensors, along with RJ45 Ethernet for wired internet connection and panel-mount KVM inputs for a regular-sized monitor, keyboard, and mouse. There is also a touchscreen LCD for a convenient user interface. Furthermore, the kill-switches at each corner of the robot are available as a last resort to stop it in emergency situations.

1.2 Visual Localization

The key to a successful vision localization system is to be able to match observed visual landmarks despite outside interference or visual variability. Since the recent introduction of visual keypoints, such as SIFT [90] and SURF [10], as well as subsequent techniques such as the visual bag-of-words [84], systems can now robustly perform this critical matching step in many more environments, including outdoors.

There are now many systems that are capable of accurate coordinate-level localization [144, 143, 114, 136], as well as Simultaneous Localization And Mapping (SLAM) [34, 7, 164]. In fact, many systems are capable of localizing in large scale areas the size of a few city blocks [29, 141], as well as across many seasons [182]. However, many of these systems have thus far only been tested off-line, as the computational complexity can be prohibitive.

An interesting recent effort focuses on a lower-resolution localization task called place recognition [179, 175, 126, 37]. Here, the problem is framed as a classification task, to distinguish places in the environment, not the exact coordinate. Because of the coarser resolution, these systems easily run in real time.

A practical way to exploit the speed of place recognition is by combining it with the accuracy of metric localization systems. Many systems [201, 186, 188, 155] do so through a coarse-to-fine hierarchical approach, where place recognition primes possible areas in the map before refining the location hypothesis to a coordinate level.
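As a rough illustration of this coarse-to-fine idea (not the specific algorithm of any of the cited systems), the sketch below first scores which place, or segment, the robot is most likely in, and only then runs the expensive landmark matching against that segment's database entries. The classifier, matcher, and data layout are hypothetical placeholders.

```python
# Coarse-to-fine localization sketch: a cheap place classifier prunes the
# map, then a costly landmark matcher refines to a metric hypothesis.
# classify_place() and match_landmark() are illustrative stand-ins.

def classify_place(gist_vector, segment_models):
    # Coarse step: pick the segment whose model is closest to the gist vector.
    def dist(model):
        return sum((g - m) ** 2 for g, m in zip(gist_vector, model["mean_gist"]))
    return min(segment_models, key=dist)["segment_id"]

def match_landmark(image_features, landmarks):
    # Fine step: exhaustive matching, but only within one segment.
    best = max(landmarks, key=lambda lm: len(image_features & lm["features"]))
    return best["xy"]  # metric coordinate attached to the matched landmark

def localize(gist_vector, image_features, segment_models, landmark_db):
    seg = classify_place(gist_vector, segment_models)
    candidates = [lm for lm in landmark_db if lm["segment_id"] == seg]
    return seg, match_landmark(image_features, candidates)
```

The localization system of chapter 5 follows the same pattern, using gist-based segment classification as the coarse step and salient-landmark matching within a Monte Carlo framework as the fine step.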
For completeness, it should be noted that many systems combine vision-based localization with other modalities such as proximity [128, 118, 121], GPS [4, 183], or visual odometry [120, 166]. Each modality provides something that complements vision localization. For example, even though GPS is coarser, has a lower update frequency, and is less reliable in some places, it provides clean global absolute location information that can be used to limit the matching process to a smaller geographic area. Note that we make a distinction between vision localization and visual odometry. The former is a technique that tries to visually recognize one's global location, while the latter only visually estimates one's motion. However, if an initial global location is available, visual odometry can be critical in maintaining localization.

1.3 Visual Navigation

A critical task in navigating in pedestrian environments is to recognize and stay on the road. It is important to distinguish road lane navigation from systems that simply drive through any flat terrain. The former specifically enforces proper and socially acceptable ways to travel in pedestrian environments, e.g., avoiding excessive zigzags. In addition, it is often important for the robot to travel in the middle of the road, because it provides an optimal vantage point to observe signs or landmarks. Also, by trying to recognize roads in general, instead of performing teach-and-replay, the system is not limited to a set path that has previously been manually driven. Instead, the system has the freedom to move about its surroundings and explore unmapped areas.

Aside from using vision, systems have been proposed that use proximity cues from an LRF [58, 193], stereo cameras [57], and Kinect [30]. They have shown a measure of success in restricted conditions, without the presence of sunlight or where the surface characteristics of the road and its flanking areas are sufficiently dissimilar. Monocular visual cues, on the other hand, readily encode road information in most environments.

One visual approach relies on modeling the road appearance using color histograms [135, 82], which assumes that the road is contiguous, uniform, and visually different from the flanking areas [138]. In addition, to ease the recognition process, the technique usually simplifies the road shape, as viewed in the image, to a triangle. These assumptions are oftentimes violated in cases where there are complex markings, shadows, or pedestrians on the road. Furthermore, they also do not hold when the road appearance is similar to the flanking areas.

Another way to visually recognize the road direction is by detecting the vanishing point (VP). Most systems [75, 108, 101] use the consensus direction of local textures and image edgels (edge pixels) to vote for the most likely VP. However, edgels, because of their limited scope of support, can lead to an unstable result. Furthermore, many systems also attach a triangle to the VP to idealize the road shape. Our road recognition module [151] improves upon these limitations by detecting the VP using more robust contour segments and not assuming the triangular shape.
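To make the voting idea concrete, here is a minimal sketch of the classic edgel-based scheme described above (not the contour-segment method of [151]): each edge pixel with a reliable orientation casts a vote for the point where its extended line crosses an assumed horizon row, and the most-voted column is taken as the vanishing point. The horizon row, edgel format, and bin width are illustrative assumptions.

```python
# Edgel-voting vanishing point sketch. Each edgel is (x, y, theta), where
# theta is the local edge orientation in radians. Every edgel votes for the
# column where its extended line intersects an assumed horizon row.
import math
from collections import Counter

def vote_vanishing_point(edgels, horizon_y, image_width, bin_width=4):
    votes = Counter()
    for x, y, theta in edgels:
        dy = math.sin(theta)
        if abs(dy) < 1e-3:          # near-horizontal edgels are uninformative
            continue
        # Parametric line through (x, y) with direction (cos, sin); find t
        # where it reaches the horizon row, then the corresponding column.
        t = (horizon_y - y) / dy
        x_h = x + t * math.cos(theta)
        if 0 <= x_h < image_width:
            votes[int(x_h) // bin_width] += 1
    if not votes:
        return None
    best_bin, _ = votes.most_common(1)[0]
    return (best_bin * bin_width + bin_width // 2, horizon_y)

# Example: synthetic edgels lying on two lines that converge near x = 160.
edgels = [(160 + d * math.cos(a), 120 + d * math.sin(a), a)
          for a in (math.radians(60), math.radians(120)) for d in range(5, 80, 5)]
print(vote_vanishing_point(edgels, horizon_y=120, image_width=320))
```

The instability the text mentions comes from the fact that every short edgel votes with equal weight; the thesis' own module instead votes with longer contour segments along a calibrated horizon line (cf. figures 5.5 and 5.6), which makes the estimate more stable.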
1.4 Hierarchical Map

Generally, a path planning system needs a mapping of the surroundings to generate motion commands. However, a few systems do not explicitly model the nearby obstacles, but simply detect and avoid potentially hazardous conditions, for example, using optical flow [28, 139] or other visual features [98, 59, 115]. A standard representation allows a system to utilize multiple perceptual modalities and naturally fuse them together.

The mapping is usually in the form of a grid occupancy map, whether it be the traditional ground-view two dimensional [113] map, or the more recent three dimensional [95] version. These maps are primarily filled by planar proximity sensors such as the LRF, which can also be swept to create a three dimensional perception. Another way to create a three dimensional map is by using the aforementioned Velodyne [47]. From our experience, because of the openness of the outdoors, a well placed planar LRF is sufficient to avoid most obstacles. For more aggressive maneuvers through narrow openings, such as under a table, a three dimensional representation would be needed.

Given an occupancy map, there are two approaches to compute a motor command: directional space control and velocity space control. In the former, the algorithm selects a command that moves the robot to the best neighbor location in terms of advancing to the goal, while avoiding obstacles. A class of this is called the Vector Field Histogram [15, 178], which combines an attractive force from the goal with repelling forces from nearby obstacles. However, this technique, as well as related approaches such as the Nearness-Diagram algorithm [104], suffers from local minima problems because they do not explicitly search for complete paths that eventually arrive at the goal. On the other hand, shortest path algorithms such as A* [190] and its derivatives, e.g., D*-lite [73], ensure path completion to the goal.
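As a minimal illustration of such grid-based shortest-path planning (a generic textbook A*, not the specific planner described in chapter 5), the following sketch searches a small binary occupancy grid using 4-connected moves and a Manhattan-distance heuristic; the grid encoding and unit step costs are assumptions made for this example.

```python
# Generic A* over a binary occupancy grid: 0 = free, 1 = occupied.
# 4-connected moves, unit step cost, Manhattan-distance heuristic.
import heapq

def astar(grid, start, goal):
    rows, cols = len(grid), len(grid[0])
    h = lambda c: abs(c[0] - goal[0]) + abs(c[1] - goal[1])
    open_set = [(h(start), 0, start)]
    came_from = {start: None}
    g_cost = {start: 0}
    closed = set()
    while open_set:
        _, g, cell = heapq.heappop(open_set)
        if cell in closed:
            continue
        closed.add(cell)
        if cell == goal:                 # reconstruct the path back to start
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        r, c = cell
        for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nb
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0 and nb not in closed:
                ng = g + 1
                if ng < g_cost.get(nb, float("inf")):
                    g_cost[nb] = ng
                    came_from[nb] = cell
                    heapq.heappush(open_set, (ng + h(nb), ng, nb))
    return None                          # goal unreachable

grid = [[0, 0, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0],
        [0, 1, 1, 0]]
print(astar(grid, (0, 0), (3, 3)))       # a path threading the free cells
```

In the thesis system, A* similarly produces a rigid path on the local grid map, which is then deformed around obstacles and handed to the Dynamic Window Approach for the final motor command (cf. figure 5.12).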
Anotherwayistocompute a full path in the directional space, then refine an immediate portion in the velocity space [95]. The combination allows a system to gain the benefit of both techniques. Our presented system is an example of such an algorithm. 11 Chapter 2 Beobot 2.0 Autonomous Navigation Experimental Platform 2.1 Introduction We survey the current trendsin themobile robot market andidentify themost desirable features in an ideal research robot (aside from our central requirement of powerful com- putational facilities). Note that some of the robots discussed below may no longer be available (or may never have been) to the general public. We include them nonetheless for completeness of our analysis. The design and implementation of our proposed platform, Beobot 2.0, a powerful mobile robot platform equipped with a cluster of sixteen 2.2GHz processing cores. Our robotusescompact Computeron Module(COM) processorswithmodest powerrequire- ments, thus accommodating various robot design constraints while still satisfying the requirement for computationally intensive algorithms. Our complete design specifications, including supplier and cost information for al- most all the materials, are freely available on the Internet [149]. As the manufacturing, 12 assembly and machining details are available on-line, in this paper, we focus on (1) the design decisions we made, the implementational issues we faced and how we resolved them, and (2) experimental testing of the robot in diverse tasks. 2.1.1 Current Mobile Robot Platforms In the current state of robotics, researchers utilize a variety of mobile robots, from the commercially available [106, 61, 189] to thecustom-made[19]. These robotsarebuilt for many different environments such as underwater [180, 62], aerial [49, 39], as well as land [131, 137]. Here we focus on land robots of a size close to that of an adult human that can traverse most urban environments, both indoors and outdoors, and for considerable distances. In addition, it is versatile enough for research in many sub-fields such as localization/navigation, to human-robot interaction, and multi-robot cooperation. Furthermore,weprimarilyfocusonsitesthataresimilartoacollegecampussettings, which is mostly paved with some rough/uneven roads, not terrains that one would see in combat zones. Nowadays, because there is a concerted effort by most governments to make pertinent locations accessible to the disabled (using wheelchairs), legged robots (from the small QRIO and AIBO by Sony [162] to the human-sized Honda Asimo [6]) are no longer a must. A wheeled platform would suffice for the target environments. However, ability to traverse reasonably sloped terrain (about 10 degrees) should also be expected. Also, some form of weather protection in the outdoors is essential. While the robotisnotexpectedtooperateinallkindsofharshweather(pouringrain,forexample), like Seekur and Seekur jr. by MobileRobots [105] and IRobot’s PackBot [60], it should nevertheless be able to handle most reasonable conditions. 13 An overall size that is close to that of an adult human is ideal because the robot would be small enough to go through narrow building corridors, and yet large enough to travel between buildings in a timely manner. And thus we exclude small robots such as the Khepara [1] or large robotized cars such as the entries to the DARPA Grand Challenge. 
Smaller robots such as the Roomba [61] and Rovio [36], and slightly larger ones such as the Pioneer [106], are also excluded because of their lower payload capacity, which limits the amount of computing that can be carried to a single laptop. Aside from mobility, there are a few other important features that contribute to the usability of the robot: battery life, sensors, interfaces, and available software. An ideal battery system would be one that enables the user to run for a whole day without having to recharge. The two factors that matter here are the total charge carried and the amount of charge required to operate the robot. The latter is dictated by the total weight of the robot and the power consumption of the computers and devices. These requirements should be decided first. Based on that, the former can then be adjusted by selecting the proper battery system (type and quantity). There are different types of available batteries: NiCd, NiMH, Sealed Lead Acid (SLA), and Lithium-based batteries. The trade-off is that of cost, dimensions, and durability. For one, SLA batteries are the most economical (in terms of cost-to-charge ratio), widely available when it comes time to replace them, and robust, as they are easy to maintain and long lasting. However, SLA batteries have low charge-to-weight as well as charge-to-volume ratios compared to, in particular, the Lithium-based technologies. Lithium batteries are lighter and more compact for a comparable amount of charge [117]. However, these types of batteries are much more expensive and fragile than the very rugged SLAs. If Lithium batteries are not handled carefully, for example by not using protective circuits, they can explode. Note that, although battery weight is not as much of an issue for the size of robot we are considering, volume still matters in terms of placement. It is important to have easy access to the battery compartment so that we do not have to unscrew or disassemble components in order to charge the batteries. If the batteries can be charged rapidly, within an hour or two, an even better option would be to be able to do so without having to remove them from the robot, instead using a docking station or a wall plug-in outlet. If rapid recharge is not available, a feature to hot-swap the batteries, like in [189], to avoid shutting down computers in the switching process would be convenient. For a platform to be applicable to a wide range of robotics and vision research, most commercial robots are furnished with a variety of sensors, manipulation tools such as a robot arm, and avenues for future expansion. When selecting a sensor, we look for compact, light, low-power devices that exhibit high accuracy and high update rates. Popular sensors such as Laser Range Finders, sonar rings, cameras, Inertial Measurement Units (IMU), Global Positioning Systems (GPS), and compasses should be considered as potential accessories. For a camera in particular, negligible latency is a must. After a number of experiments, we found that Firewire (IEEE-1394) cameras were the best at minimizing delays, more so than Internet Protocol (IP) or USB cameras. In addition, related features such as pan-tilt-zoom, auto-focus, image stabilization, low-light capability, and wide-angle or omni-directional viewing setups are also made available by various companies.
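As a back-of-the-envelope illustration of the battery sizing trade-off just described, the short program below divides battery capacity by the summed current draw of the major loads. The current figures are the ones measured on Beobot 2.0 and reported later in section 2.5; only the code structure itself is illustrative.

```cpp
// Battery run-time estimate from capacity and total current draw (figures from section 2.5).
#include <cstdio>

int main()
{
  const double capacityAh = 35.0;      // 24V bank built from two 12V 35Ah SLA batteries in series
  const double computersA = 8 * 0.7;   // eight computing modules under heavy use
  const double coolingA   = 1.8;       // liquid-cooling pump and fans
  const double sensorsA   = 0.5;       // Gigabit switch and other sensors
  const double motorsA    = 2.0;       // driving at roughly 1 mile/hour

  const double totalA = computersA + coolingA + sensorsA + motorsA;
  std::printf("total load: %.1f A, estimated run time: %.1f hours\n",
              totalA, capacityAh / totalA);
  return 0;
}
```

With these loads the estimate comes out to roughly 3.5 hours, matching the expected peak-computation running time reported in section 2.5.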
15 As for sensor expansion, aside from anticipating the future extra payload, it is also important to have many accessible USB ports placed throughout the body of the robot, USB-to-serial converters for serial devices, as well as several micro-controllers that can preprocess slower input signals. Another important feature is having multiple types of user interface. For example, USB inputs are useful to connect a keyboard and monitor to the computers in the robot to allow for hardware and operating system reconfiguration. In addition, many robots have reasonably sized (7 to 10 inches) full RGB LCD monitor for visualization of the robot’s state during test runs. Furthermore, wireless network connections for remote SSH logins give us additional flexibility to allow for safe and faster algorithm on-site debugging. Atthesametime, therobotcanusetheexternalconnectiontoaccessoutside network or Internet resources, which can be useful in some scenarios. A related feature in this category is a standard radio frequency (RF) remote controller for stopping the robot whenever autonomous driving starts to fail. Furthermore, most robots [105, 19] are equipped with large kill-switches to stop the flow of power to the motors. In addition to the hardware related aspects, robotic companies also provide software libraries to conveniently access all the included devices and monitor low-level state such battery charge, temperature, and so on. Some companies [105] provide further value additions such as mapping and navigation tools and even a full-blown simulation envi- ronment. We list the common software offerings in section 2.4, where we describe our freely available toolkit [64]. The rest of the paper is organized as follows: first, we describe the electrical system in section 2.2 and then the mechanical system in section 2.3. Section 2.4 goes into the 16 details of our software library, highlighting the advantage of implementing a computing cluster in robotics research. In section 2.5 we examine the robot on various important operational aspects, most important of which is computational speed/throughput to demonstrate how one could benefit from such a complex computing cluster architecture. We test Beobot 2.0 using three benchmark algorithms. One is the popular SIFT [90] object recognition. The second one is a distributed saliency algorithm [67], which models the visual attention system of primates. The algorithm operates on avery large image of 4000 by 4000 pixels and returns the most salient parts of the image. The last one is a vision localization system by [155] which requires the system to compare a detected salient landmark input with a large landmark database obtained from previous visits. All of these algorithms arepart of theVision Toolkit, available freely online [64], whichalso houses Beobot 2.0’s software control architecture, including obstacle avoidance [104] and lane following [3]. We then summarize our findings (in section 2.6) and what we have learned through the process of building this robot. 2.2 Electrical System Design and Implementation Figure 2.1 presents an overview of the electrical system. On the right hand side of the figure, there are two (green) baseboards, each housing four COM Express modules (ex- plained in depth in section 2.2.1), and implementing signals such as Gigabit Ethernet for the backplane inter-module communication, as well as others such as SATA (two per module), PCI Express, USB, and VGA. 
Beobot 2.0 uses a PCI Express 1394-Firewire 17 Table 2.1: Sensors Provided in Beobot 2.0 Item Company name Reference Laser range finder Hokuyo [56] Inertial measurement unit (IMU) MicroStrain [99] Compass PNI [122] Sonars (7 units) SensComp [146] Global Positioning System (GPS) US Global Sat [181] card for a low latency camera connection. One of the SATA ports was used for the pri- mary harddrive and the other for external drives such as CD-ROM (useful for installing operating systems, for example). Giving each module its own hard drive obviates the need to pass around copies of stored data, such as large knowledge databases obtained during training. There are six USB signal implementations per computer for a total of 48. Some of these are being used for sensors listed in table 2.1 below. Several of the USB connectors are panel-mounted outside the robot for ease of connecting external devices using dust- and water-proof connectors (figure 1.2). In addition, there are also USB connectors inside, on the baseboards (see figure 2.2). Furthermore, there is an on-board Keyboard-Video-Mouse (KVM) switch to toggle between each of the eight computers. The KVM is an 8-to-2 switch, 8 computers to 2 displayoutputs. Thedisplaysignaloutputscaneitherbearegular-sizedexternalmonitor or to an on-board 8-inch touchscreen LCD with a full-color VGA interface (figure 1.2). Note that in practice we operate all computers from a single node using an SSH login session to conveniently run and monitor multiple programs simultaneously. The use of wired interface to the individual computers is usually limited to hardware, BIOS, and boot troubleshooting. 18 2.2.1 Computing Modules The objectives for selecting a computing platform appropriate for the robot are high computing power, compactness, and low energy consumption. To have close to max- imum achievable speed, we concentrate on the X86 architectures rather than far less powerful CPU types such as ARM or Xscale. Within the X86 family, we select the mobile processor version rather than its desktop counterpart for energy efficiency, while still being competitive in computing power. By the same token, in using the mobile CPUversion, thecorrespondingembeddedsystemsoption can beselected forthemobile platform (regular-sized motherboards do not usually use mobile CPUs), which resolves the size issue. Inthefamilyofembeddedsystems, therearetwotypesofimplementations. Thefirst family of systems have the interfaces already implemented, ready to use. An example is the ITX form-factor family (pico-ITX, nano-ITX, mini-ITX) [185]. The drawback is that theprovidedinterfaces arefixed. Theymay notbethespecificones that areneeded and unusedconnections can bea waste of size as we cannot customize their location and orientation. In addition, by using off-the-shelf motherboards, their dimensions have to be accommodated in the design specifications, which may also limit the options for the locomotion platform. Incontrast, thesecond typeof embeddedsystems, theComputerOnModule (COM) concept, only provides specifications for all the interfaces through a set of high density connectors. These specifications are usually defined by an industry consortium such as the XTX-standard [196]. The actual breakout of the individual signals (such as Gigabit 19 Ethernet,USB,PCIExpress)fromthemoduleconnectorstotheoutsidedevices(ahard- drive, for example) has to be done on a custom-made carrier board. 
By building custom baseboards, the overall size of the electronics can be controlled by implementing only those signals that we actually need. In addition, connector placement (as well as type) can be specified so as to minimize the amount of cabling in the system. In the end, we found that a COM module solution best met our requirements, which we stated at the start of this section. Within this class, there are three options: ETX [35], XTX [196], or COM Express[27]. Thesemodulesuse themost powerfulprocessors, as opposed to the smaller but less powerful systems such as PC104-based Qseven [130]. We chose COM Express because it has an onboard Gigabit Ethernet interface on the module, and it is only slightly larger (12.5cm length by 9.5 cm width) than the XTX and ETX module (11.5cm length by 9.5cm width). Gigabit Ethernet is critical because in a cluster architecture, inter-computer communication can be just as important as the computing power of individual nodes. If the communication procedure cannot provide data fast enough, the individual processors will simply idle most of the time, waiting for data to arrive. This is especially true in our case because Beobot 2.0 is designed to performheavy-duty, real-time vision computation wherein thereal-time videostreaming is much more demanding than sending intermediate results. We implemented two carrier/base-boards (refer to figure 2.2), each accommodating four COM Express modules. A total of eight modules is chosen because the computing system fits within the mobile platform and because this number is expected to suffice for our research needs based on the findings presented in section 2.5. 20 WeusedtheKontronCOMExpressdesignguide[77](fortheKontronETX-Express- MC 2.2GHz (T7500) COM Express module [78]) to help properly design the electronic circuits as well as lay out the components in the board. We used the Electronics Com- puter Aided Design (ECAD) layout software Altium [5] to plan the physical placement of all the desired devices and connectors with as little cabling as possible for a system of eight computers. Altium’s 3D visualization proved to be an invaluable feature as it allowed us to verify that boards and devices packed close together within the robot’s insides would not collide or otherwise interfere with any other components. Themostcritical partinsuccessfullyimplementingthebaseboardswasbeingableto take care of the high speeddifferential-pair signal requirements such as matching length, spacing, as well as minimizing the number of vias in the baseboard. Altium allows its users to specify rules for each trace on the board, which tremendously eases the process of identifying unsatisfied constraints. We found that the signals are quite robust as long as the stated requirements are adhered to. In addition, most of these signals need very fewsupportingcircuits. Thehighest numberof componentsrequiredby asignal iseight, for the USB current limiter (500mA) circuit. The VGA signal actually specifies that it needsafilteringcircuit withmanycomponents, buttheKVMchipfurnishesthisfeature. To provide a clean and fail-safe power given a supply from the available batteries, a Pico-ATX [68] module(seefigure2.3(a)) isusedtoregulate power toeach COMExpress for a total of eight. There is also one extra Pico-ATX powering all the peripheral boards and sensors. There are three peripheral boards: one to control the cooling system, one to control access to the motors, and a sensor board that houses all the various built-in sensors. 
The power to the drive-train motors does not need to be filtered and, thus, is 21 directly connectedfromthebatteries. Becausepowersupplieshaveahighrateoffailure, goingwithmultiplepowermodulesprovidesbettergranularityinthatifonemodulefails, the remaining seven computers can still run. Additionally, the lower individual supply requirement allows for a wider range of products to choose from than would have been available in a single module solution. Table 2.2 summarizes the important electrical features: inter-processor communica- tions, input/output interfaces, KVM interface, sensors, and power management. These features are shown to be critical while compiling the list of available commercial robots as well as from our experience conducting robotics research. 2.3 Mechanical System Design and Implementation The mechanical design of the robot is divided into two parts: the locomotion platform, which is the black colored robot base in figure 2.4 and described below in section 2.3.1, and the computing cluster housing, which is the cardinal colored structure described in section 2.3.2. Again, note that we created a wiki page [149] to detail the execution matters such as actual part drawings [160], part manufacturing through a machine shop, or finding suppliers for the needed devices listed in the bill of material. 2.3.1 Locomotion System For the locomotion platform, we selected a Liberty 312 electrical wheelchair [94] instead of building one from scratch. Often priced at thousands of dollars, these types of units are easily acquired second-hand through channels such as Craigslist or eBay (ours cost 22 Table 2.2: Beobot 2.0 Electrical System Features Features Our Requirements Solution Chosen Alternative Considered Positives of chosen solution Negatives of chosen solution Remarks Inter-module communication Large enough bandwidth to stream real-time video. Gigabit Ethernet Symmetric Multiprocessing (SMP) architecture, same memory module, shared throughout the system bus Simpler to build larger latency Gigabit Ethernet network also connected to wireless internet connection for remote logins. Input/output interfaces Connection to Gigabit Ethernet, PCI Express, SATA, USB, VGA signals from the COM Express modules. Custom baseboards and peripheral boards connected to easily accessible panel-mount dust and water-proof connectors. Cables for access from COM Express modules to the connectors. Minimal cabling allow for easier de- bugging/repair. Complex design, especially in implementing high speed signals with differential pair requirements. Found that the signals are quite robust as long as the stated requirements (in the design guide [77]) are adhered to. KVM computer interfaces Provide easy visualization and access to each computer. Integrated KVM system Individual VGA, Mouse, Keyboard cables Do not need to cram cables to eight computers into a small area. Complex design, integrated in the baseboard and peripheral boards. 8 computers to 2 display KVM switches. One for regular-sized external monitor (if desired), another on-board 7in full color VGA touchscreen LCD. Note that software remote access solution (using ssh, for example) is also available. Sensors Need sensors for a wide range of robotics and vision research. Integrate sensors to the design. Add sensors later Less likely to accommodate space for sensors. Upfront added cost weight and design complexity. We include versatile sensors that are widely used. 
Also has an abundance of accessible USB ports throughout robot body and micro-controllers for new sensors. Power Management Clean and fail-safe power (12V, 5V, and 3.3V) as per COM Express Design Guide [77]. These modules primarily need large supply of 12V, up to 5A maximum per module, making for a total of 40A (480W) for all eight modules. Off-the-shelf system: Pico-ATX [68] modules to power each COM Express boards (plus 1 for all the peripherals and sensors) Built custom power system and profile supplying high current on the voltages needed the most. Fully engineered and tested, with protection from over- discharging and over-peaking. Shorter prototyping cycles. End up with about the same cost. Need to accommodate shape and sizes, waste of energy for unneeded supplies(-5 and -12VCC), and lack of availability of a 12V supply that goes over 12A. Modified the compact Pico-ATX and routed the power traces appropriately in the baseboard to minimize cable length. They fit in the available space without requiring any alterations to the mechanical design. 23 U.S. $200). The wheelchair is a robustly engineered, stable, safe, and low maintenance platform. Most importantly, adhering to the wheelchair form-factor allows the robot to traverse most terrain types encountered in modern urban environments, both indoors and out. This platform can also carry heavy payloads (250–300 pounds), which means the ability to add many more devices to the robot’s computing cluster. An important factor to consider is the ability to control the motors over a wide range of speeds (0–10 mph) with good resolution in between. The wheelchair platform has this characteristic as it is designed for fine-grained control, as opposed to a remote control (RC) car used by the original Beobot [26], which were able to be driven only at either top speed or a complete stop. Another benefit of the wheelchair is that it places the computing cluster ontop,relatively highabovetheground(about50cm)andaway fromthethickdustand mudthatcan accumulateonthestreets. Note thattherobot’sdrivingdynamicsistaken care of because the wheelchair is designed to have a person on top, where the computing system now is placed. This is accomplished by the wide spacing configuration of the wheels, enveloping the payload, to allow for the overall balance of the system while it is moving reasonably fast. In addition, the heavy SLA batteries are placed on the bottom to lower the center of mass. To control the wheelchair, we designed a motor board to connect the battery and motors to inputs from the computer for autonomous control as well as to a 2.4GHz Remote Controller (RC) for manual driving or overriding. A dual-output motor driver name Sabertooth [33] is used to provide up to 25A to each motor. In addition, because the motor driver has a built-in electrical brake system, the mechanical brakes that stop the motors by pinching the back-shafts is taken off. This then allows the back-shafts 24 to be coupled to a pair of encoders to provide odometry data. As a safety precaution, Beobot 2.0 is furnished with four kill-switches (figure 1.2), one on each corner for the user to immediately stop the robot in the event of an emergency. The wheelchair comes with a pair of 12V 35Ah Sealed Lead Acid (SLA) batteries, connected in series to provide 24V supply. They have a form-factor space of 19.5cm length by 13.2 cm width by 15.5 cm height for each battery. 
An attractive feature of the wheelchair is the built-in wall-outlet easy-plug-in battery recharging system, shown in figure 2.3(c). With this, the batteries can be conveniently recharged without having to take them inandout of therobot, although, therecharging processdoestake anaverage of 10 hours. 2.3.2 Computing Cluster Case The structure surrounding the computing clusters, as shown in figure 2.4, shields the computing cluster from unwanted environmental interference such as dust and mud. The structure is divided to two isolated chambers as illustrated in the figure. The back chamber is the water-tight area where the cluster is placed. The front chamber is an open area, reserved for a liquid-cooling system (further elaborated in section 2.3.2.2) which includes a radiator to allow for maximum air flow. These two cooling sub-systems are connected through Tygon tubing for liquid flow and are physically held together by apair of aluminumholders. Thecomputing cluster, along with thecooling system, itself is mounted on shock-absorbing standoffs (section 2.3.2.1) to withstand violent collision in the rare event the robot hits an obstacle. 25 2.3.2.1 Vibration Attenuation and Shock Absorption System As illustrated in figure 2.4, the only connections between the computing system and the robot base are the shock-and-vibration damping standoffs. This makes it easier to properly evaluate the necessary damping requirements. When considering a damping solution, one needs to take into account the basic relationship between shock and vi- bration. That is, the solution has to be rigid enough to not cause too much vibration on the load, but flexible enough to absorb shocks. Here, the focus is more on shock because, likeregular laptops, thecomputersshouldbeableto workdespite thevibration that comes from reasonably rough terrains. In addition, the system uses a solid-state hard-drives (SSD), which have no moving parts and can withstand far more shock than their mechanical counterparts. Thenaturalrubbercylindricalmountsisselectedoverotheroptionssuchaswire-rope isolators, rubber or silicone pads, and suspension springs because of their compactness. In addition, the height of the standoffs is easily adjustable by screwing together addi- tionalabsorbersaccordingtoneeds. Furthermore, onecanchange theirshockabsorption property by adding washers in between two mounts if need be. 2.3.2.2 Cooling System SinceBeobot2.0ismeanttobeusedbothindoorsandoutdoors,wedecidedagainstanair coolingsystemduetothepossibilityofthefanspushingdustintotheexposedelectronics inside, although air filters could have kept the dust out. However, the electronics would have to be placed in an area where air flow is well controlled, i.e., air must be drawn in 26 and exhausted out only through the fans. This would have entailed a push-and-pull fan system and significant prototyping and rework of the mechanical system. Therefore, we settled on a liquid cooling solution. Moreover, as water has 30 times the amount of thermal conductivity and 4 times the amount of heat capacity as air [18], a liquid cooling system is more effective in addition to being cleaner. The liquid-cooling system, as shown in figure 2.4, consists of the following compo- nents: cooling block, tubes, nozzles, radiator, two fans, liquid pump, reservoir, cooling liquid, a flow-meter, and a temperature sensor to monitor the system. 
Note that the system uses a cooling control board to provide power for the fans and the pump, as well as to take data from the flow-meter and temperature sensor. The heat dissipated by the COM Express modules is first transferred to the liquid coolant through the processors’ heat-spreaders that are firmly pressed up against the top and bottom of the cooling block, which contains the coolant. We recommend using a high performing, low-conducting, non-corrosive coolant for a maintenance-free system. The heat-carrying coolant first goes through the radiator, which has two fans pulling air through the radiator surface. These fans are the devices that actively take the heat out of the system. Note that the radiator (and the fans) can be placed as far away from the processors as necessary. The liquid pump is connected to the system to ensure the flow of the liquid. Finally, a reservoir is included to add the coolant into the system and to take the air (bubbles) out of it. 27 2.4 Software Design Our ultimate goal is to implement a fully autonomous embodied system with complete visual scene understanding. To do so, we lay the groundwork for a Robot Development Environment [79] that especially maximizes the multiple-processor hardware architec- ture. In addition it also fulfills the primary objective in designing the software, viz., to be able to integrate and run computationally heavy algorithms as efficiently as possible. The advantage of using COM Express modules as a platform is that they can be treated as regular desktops. This allows the use of Linux operating system in conjunction with C++ rather than some special-purpose environment. Note that, this way, the user can install any kindof Linux-compatible software tools that he/she prefers, not just theones that we suggest below. Also, although this is not a true real-time system, it is quite adequate for our needs with the control programs runningreasonably fast and the robot responding in real-time. In the case a user would like to go with real-time operating system, several Linux-based options and extensions are available [195, 123, 129, 191]. In order to speed up the development of the complex algorithms mentioned above, we use thefreely available iLab NeuromorphicVision C++Toolkit [64]. The motivation for the toolkit is to facilitate the recent emergence of a new discipline, neuromorphic engineering, which challenges classical approaches to engineering and computer vision research. These new research efforts are based on algorithms and techniques inspired from and closely replicating the principles of information processing in biological ner- vous systems. Their applicability to engineering challenges is widespread, and includes 28 smart sensors, implanted electronic devices, autonomous visually-guided robotics sys- tems, prosthesis systems, and robust human-computer interfaces. Thus, the develop- ment of a Neuromorphic Vision Toolkit helps provide a set of basic tools which can assist newcomers in the field with the development of new models and systems. Because of its truly interdisciplinary nature, the toolkit is developed by researchers in psychology, experimental and computational neuroscience, artificial intelligence, elec- trical engineering, control theory, and signal and image processing. In addition, it also aids in integration with other powerful, freely available software libraries such as Boost and OpenCV. 
The project aims to develop next generation general vision algorithms rather than being tied to specific environmental conditions or tasks. To this end, it provides a software foundation that can beusedfor thedevelopment of many neuromorphicmodels and systems in the form of a C++ library that includes classes for image acquisition, pre-processing, visual scene understanding, and embodied system control. These systems can be deployed in a single machine or a distributed computing plat- form. We use the lightweight middleware ICE (Internet Communication Engine) via its C++librarybindingstofacilitate inter-computercommunication withahigh-level inter- face that abstracts out low-level matters such as marshaling data, opening sockets, and so on. Sensors/devices, which are connected to a computer in the distributed system, are encapsulated as independent services which publish their data. Different systems can grab just the sensor outputs that they need by subscribing to that particular ser- vice. In addition, such a distributed system is fault-tolerant as non-functional services do not bring down the whole system. We are also working on adding functionality to 29 quickly detect non-respondinghardwareand recover from failures by performingan ICE reconnection protocol, for example. Another aspect that to pay close attention to is the need for robust debugging tools for distributed systems that are provided by the toolkit as well as future applications. Thatis, wewouldliketoknowwhichmodulesinthesystemthat takes thelongest times, which ones send the most amount of data, and how do all these factors affect the overall system efficiency. Currently, the system has logging facilities for analysis after a testing run has taken place. What would be ideal is an on-line monitoring system. In terms of hardware support, the Toolkit has extensive source code available for interfacing sensors through different avenues. For example, Beobot 2.0 currently can connect to different types of cameras: USB, Firewire or Internet Protocol (IP). Other devices that use a serial protocol should also be easily accommodated. In addition, it is important to note that the separation of hardware related and algorithmic related code comes naturally. This allows the user to test most of the software in both the robot and our custom simulator (provided in the toolkit) without too many changes. Furthermore, the same robot cluster computing design is used for our robot underwater and aerial vehicle. We find that porting the algorithms to the other robots is done quite easily. Table 2.3 below lists all the vision and robotic-related software capabilities provided by the toolkit. 30 Table 2.3: The Vision Toolkit Features Features Description Available Options Devices Interface code for various devices. Embedded systems/microcontrollers, joystick, keyboard, Gyroscope, wii-mote, GPS, IMU (HMR3300, MicroStrain 3DM GX2), LRF (Hokuyo). Robots Control code for various robots. Scorbot Robot Arm, Evolution Robotics Rovio, Irobot Roomba, Beobot, Beobot 2.0, BeoSub Submarine, BeoHawk Quadrotor Aerial Robots, Gumbot for Undergraduate Introduction to Robotics. Robotics Algorithm Modular Mobile Robotics Algorithm. Localization, Laser and Vision Navigation (lane following, obstacle avoidance), Simultaneous Localization and Mapping (SLAM). Distributed Programming Tools Allows programs to communicate between computers. CORBA, Beowulf, ICE. Neuromorphic Vision Algorithms Biologically plausible vision algorithms. 
Center surround feature maps, attention/saliency (multi-threaded, fixed point/integer), gist, perceptual grouping, contour integration, border ownership, focus of expansion, motion. Media Access to various input media. mpeg, jpeg, cameras (USB, IP, IEEE1394 Firewire), audio-visual. Image Processing Various tools to manipulate images. Drawing, cut/paste, color operations(HSV,RGB,etc), statistical operations, shape transformation, convolutions, Fourier Transform, pyramid builder, linear algebra/ matrix operation. Machine Learning. Tools for pattern recognition training. K-Nearest-Neighbor, Back-propagation Neural Networks, Support Vector Machine, Genetic Algorithm Object Recognition Visual Object Recognition modules SIFT, HMAX 31 2.5 Testing and Results We examine a few aspects of Beobot 2.0. The first is basic functionality such as power consumption, cooling system, and mobility as it pertains to shock absorption. The power consumption testing shows the typical length of operation given the amount of capacity of the batteries, and the weight which the motors have to move and the eight computers that the batteries have to power. Beobot 2.0 has a power supply of 35Ah x 24V capacity from two 12V SLA batteries in series. The robot is run with full-load computingbyrunningavisionlocalization system[155],explainedinsection2.5.2below, while running the robot around. In the testing, the cooling system is shown to drain about1.8A ofthe24Vsupply,whiletheGigabit switchandothersensorsconsumeabout .5A. Each of the 8 computers pulls up to .7A during the heavy use, while the motors pull 2A when the robot is moving at about 1 mile/hour. The total comes up to 9.9A in regular use and which corresponds to about 3.5 hours of expected peak computation running time. The good news is that Beobot 2.0 has two accessible power jacks located on its back, in the KVM Board, as shown in figure 2.3(b). By plugging in an auxilliary power source that stops its current flow when it detects another supply in the system, we can perform hot-swapping to temporarily replace the SLA batteries. This prolongs the running time considerably, given that on-site system debugging occurs quite often. Consequently, the running time becomes actual testing time, without debugging time. And this, for the most part, allows users to do research on site for the whole day and charge all night. Table 2.4 below summarizes the results. 32 Table 2.4: Beobot2.0 Sub-System Testings Sub-System Tests Results Remarks Liquid Cooling CPU’s at full load in room temperature of 22 degrees Celsius. Average CPU temperature of 41 degrees Celsius. System is virtually maintenance free, although it consumes 1.8A for the liquid pump. CPUs reach critical over-temperature within 15 minutes if liquid cooling system is turned off. Mobility and Shock Absorption Running the robot throughout the campus using RF controller at 2 meters/second. Computers run smoothly without disconnection through several bumps and abrupt stops whenever it is too close to nearby pedestrians. We modify the motor controller code to properly ramp down when going to a complete stop. Battery Consumption Running the robot using the remote controller with all computers running full computations Robot runs for 2.25 hours before one CPU shuts down. Can prolong the testing time considerably by hot-swapping batteries (there is a jack at the back of robot) during on-site debugging. 
Consequently, the running time becomes actual testing time, without debugging time and allows us to do research on site for the whole day. Wethengointotheusabilityofthesystembyreportingourexperienceimplementing the Nearness Diagram (ND) navigation system [104] in section 2.5.1. Note that this sec- tion is included to show that the robot can move about an environment and is ready for use. We do not try to optimize the implementation to improve the performance. On the other hand, in section 2.5.2, we describe our experiment performing three computation- allyintensivealgorithms: SIFT[90]objectrecognitionsystem,distributedvisualsaliency [67] and robot vision localization system [155]. These computational speed/throughput experiments test the most critical aspect of the project’s objectives. Given the complex- ity of having to implement a cluster of processors, we would like to see a good payoff for all our hard work. 33 2.5.1 Navigation Algorithm Implementation In this section, we test the first algorithm to successfully run on Beobot 2.0, viz., the Nearness Diagram Navigation algorithm [104], which uses aLaser Range Finderto build a proximity map around the robot and then searches this map for the navigable region closest to a goal location. A navigable region is a space or area that is at least as wide as the robot, thus, enabling it to pass through. For example, the system’s GUI display in figure 2.5 shows the robot’s surroundings, divided into nine distinct regions. The robot follows a series of binary decision rules that classify all situations into five mutually exclusive cases, which are summarized in table 2.5. Each case is associated with a corresponding movement action. First, we define a security zone around the robot that is an area twice the robot’s radius. In the GUI display (figure 2.5), this zone is denoted as the yellow circle. If there are obstacles within the security zone (red dots within the circle in the figure), there are two cases to consider: whether there are obstacles on both sides of the robot or only on one side. In the former case, the robot tries to bisect this opening; in the latter case, it can move more freely to the open side. Note that the system only considers obstacles that are within 60 degrees (between the 2 red lines in figure 2.5) of the robot’s direction of motion (blue line in the figure). Whentherearenoobstaclesinthesecurityzone,itconsidersthreepossiblesituations. If the goal location is in the navigable region, just go to it. If the goal location is not in the navigable region but the region is wide (only 1 obstacle on one of the sides), maneuver through the wide region, in hope there is a way to go to the goal region in 34 Table 2.5: Nearness Diagram Rules Number Situation Description Action 1 Low Safety 1 Only one side of obstacles in the security zone Turn to the other side while maintaining the angle to the goal location. 2 Low Safety 2 Both side of obstacles in the security zone Try to centering in between both side of obstacles and maintaining the angle to the goal location 3 High Safety Goal in Region All obstacles are far from the security zone and goal Directly drive towards the goal 4 High Safety Wide in Region All obstacles are far from the security zone but goal is not in this region Turn half max angle away from closest obstacles. 5 High Safety Narrow in Region All obstacles are far from the security zone and narrower region in the goal location. Centering both side of closest obstacles. the following time step. 
If the goal location is not in the navigable region and the region is narrow (in between two obstacles, one on each side), carefully move forward in the middle of the region. The overall resulting behavior is that the robot should continuously center itself between obstacles while going to the goal. To test this algorithm, only two of the available eight computers are needed. The laser range finder is plugged into one computer and the motor control board into the other. Additionally, a remote control (RC) setup allows the user to change from autonomous to manual mode at the flick of a switch in case the robot is about to hit something or has been stuck in a corner for some time. During implementation and debugging, there are a few notable features that speed up the process. Firstly, the 8-inch LCD screen allowed users to observe the system states and action decisions as the robot is moving. Secondly, the use of a wireless USB keyboard and touch pad made it fairly easy to issue new commands while the robot was working. Last, but not least, taking the time to set up an intuitive GUI paid dividends very quickly, as it made it much easier to understand what was going on and how to fix the problems encountered. The system is tested indoors, in a 20x24 ft empty area. We then occupy some of the regions with obstacles and test Beobot 2.0 to see if it can navigate from one side of the environment and back. Figure 2.6 shows a snapshot of the environment setup for the experiment. In addition, some of the obstacle configurations are shown below in figure 2.5.1 with an example odometry trace overlaid on top. There are nine different obstacle configurations and robot starting positions in the testing protocol. Each test was performed ten times, with the robot's speed being the only variable parameter. We vary the speed between approximately .3 m/s and 2.5 m/s. Table 2.6 below summarizes the results of each trial. For the most part, the navigation system performs very well, with a 72% success rate. Here success is defined as the robot moving from its starting side of the environment to the other and back without touching any of the entities surrounding it. Out of the total of 90 trials, 25 resulted in failures of some sort. While this might seem excessive, it should be pointed out that the majority of these collisions were of the type where Beobot 2.0 only scraped an obstacle. This occurs whenever the robot has to turn sharply to avoid an obstacle, which causes its rear to scrape the obstacle. This is a minor problem that can be easily rectified by a simple control fix; e.g., when a turn is judged to be sharp, first back up a little. There were two occasions when Beobot actually hit an obstacle head-on. This happened when the robot was running at its maximum speed. Under this circumstance, the latency of the system (the Laser Range Finder throughput is 25ms, or 40Hz, and processing takes approximately the same duration) is simply too large to allow a timely reaction.
Table 2.6: Beobot 2.0 Nearness-Diagram (ND) Navigation Testing
Success: 65 occurrences (72.22%)
Scraping the obstacles: 16 (17.78%)
Stuck in corner or circles: 7 (7.78%)
Squarely hit an obstacle: 2 (2.22%)
Finally, there were seven occasions when the robot became stuck in a corner or kept spinning in place because it kept alternating between left and right. The solution to this problem requires going beyond the simple reactive nature of the navigation system and figuring out what is globally optimal by integrating knowledge from localization or SLAM algorithms.
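For readers who prefer code to prose, the case selection of Table 2.5 can be summarized as below. This is a simplified sketch: the obstacle representation, thresholds, and region flags are placeholders, and the actual implementation of [104] in our toolkit also computes a steering command for each case.

```cpp
// Simplified sketch of the Nearness-Diagram case selection of Table 2.5.
// Types, thresholds, and the navigable-region flags are illustrative placeholders.
#include <cmath>
#include <vector>

constexpr double kPi = 3.14159265358979323846;

struct Obstacle { double angle, range; };  // angle relative to the heading (radians), distance (meters)

enum class NDCase { LowSafety1, LowSafety2, GoalInRegion, WideRegion, NarrowRegion };

NDCase classify(const std::vector<Obstacle>& obstacles,
                bool goalInNavigableRegion, bool regionIsWide, double robotRadius)
{
  const double securityRadius  = 2.0 * robotRadius;    // security zone: twice the robot radius
  const double fieldOfInterest = 60.0 * kPi / 180.0;   // only +/- 60 degrees around the motion direction

  bool leftIntrusion = false, rightIntrusion = false;
  for (const Obstacle& o : obstacles) {
    if (o.range > securityRadius || std::fabs(o.angle) > fieldOfInterest) continue;
    (o.angle < 0.0 ? rightIntrusion : leftIntrusion) = true;
  }

  if (leftIntrusion && rightIntrusion) return NDCase::LowSafety2;   // center between both sides
  if (leftIntrusion || rightIntrusion) return NDCase::LowSafety1;   // turn toward the free side
  if (goalInNavigableRegion)           return NDCase::GoalInRegion; // drive directly to the goal
  return regionIsWide ? NDCase::WideRegion : NDCase::NarrowRegion;
}
```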
2.5.2 Computational Capabilities
In this section, we characterize the computing platform by running three computationally intensive vision algorithms: the SIFT [90] object recognition system, the distributed visual saliency [67], and the biologically-inspired robot vision localization algorithm [155]. These algorithms have a common characteristic in that their most time-consuming portions can be parallelized, whether it be distributing the feature extraction process (section 2.5.2.2) or comparing those features to a large database (sections 2.5.2.1 and 2.5.2.3). These parallel computations are then assigned to worker processes allocated at different computers in Beobot 2.0's cluster. Thus, we can fully test the computation and communication capabilities of the system.
2.5.2.1 SIFT Object Recognition System Test
As a first step in demonstrating the utility of our system in performing computationally intensive vision tasks, we implemented a simple keypoint matching system, which is a very common component and performance bottleneck in many vision-based robotic systems [143, 182]. This task consists of detecting various interest points in an input image, computing a feature descriptor to represent each such point, and then searching a database of previously computed descriptors to determine if any of the interest points in the current image have been previously observed. Once matches between newly observed and stored keypoints are found, the robot has a rough estimate of its surroundings, and further processing such as recognizing its location or manipulating an object in front of it can commence. Generally, the main speed bottleneck in this type of system is the matching of newly observed keypoints against the potentially very large database of previously observed keypoints. In the naive case this operation is O(MN), where M is the number of newly observed keypoints and N is the number of keypoints in the database. However, this time can be cut to O(M log N) if the database of keypoints is stored as a KD-Tree. To test the efficacy of our cluster in speeding up such a task, we built a matching system in which a master node (called SIFT Master) computes SIFT keypoints [90] on an input image and then distributes those keypoints to a number of worker nodes (SIFT Worker) for matching. Each of these workers is a separate process on the cluster, located on either the same or a different machine as the master. Each worker contains a full copy of the database, stored as a KD-Tree. Upon receiving a set of keypoints (different sets for each worker) from the master, a worker node compares each of them against its database and returns a set of unique IDs to the master representing the closest database match for each keypoint. Table 2.7 describes the different types of modules in the system and figure 2.8 illustrates the flow of operation.
Table 2.7: Beobot 2.0 SIFT Object Recognition Time Breakdown
Input SIFT Keypoint Extraction: extract SIFT keypoints from the input image (done by the master process); about 18 seconds.
SIFT Database Matching: match the input keypoints with the database keypoints and return the results (done by the worker processes); 16.25 to 353.22 seconds, depending on the number of workers utilized.
The database used for the experiment is comprised of 905,968 SIFT keypoints obtained from HD footage (1920 by 1080 pixels) taken from an outdoor environment traversed by our robot. Each keypoint has 128 dimensions and 8 bits per dimension.
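The data flow of the SIFT Master and SIFT Worker processes can be sketched as follows. For brevity, threads stand in for the ICE-connected cluster nodes and a brute-force O(MN) scan stands in for each worker's KD-Tree, so this is an illustration of the partitioning rather than the Beobot 2.0 implementation itself.

```cpp
// Illustrative master/worker keypoint matching. Threads stand in for cluster
// nodes, and a brute-force scan stands in for the per-worker KD-Tree.
#include <algorithm>
#include <array>
#include <cstdint>
#include <thread>
#include <vector>

using Descriptor = std::array<uint8_t, 128>;   // one SIFT descriptor, 128 dimensions x 8 bits

static long sqDist(const Descriptor& a, const Descriptor& b)
{
  long d = 0;
  for (size_t i = 0; i < a.size(); ++i) {
    const long diff = long(a[i]) - long(b[i]);
    d += diff * diff;
  }
  return d;
}

// Each worker matches a contiguous slice of the query keypoints against the
// full database and records the index of its best match for each query.
std::vector<int> matchKeypoints(const std::vector<Descriptor>& queries,
                                const std::vector<Descriptor>& database,
                                unsigned numWorkers)
{
  std::vector<int> bestMatch(queries.size(), -1);
  std::vector<std::thread> workers;
  const size_t chunk = (queries.size() + numWorkers - 1) / numWorkers;

  for (unsigned w = 0; w < numWorkers; ++w) {
    const size_t begin = w * chunk;
    const size_t end   = std::min(queries.size(), begin + chunk);
    workers.emplace_back([&, begin, end] {
      for (size_t q = begin; q < end; ++q) {
        long best = -1;
        for (size_t d = 0; d < database.size(); ++d) {
          const long dist = sqDist(queries[q], database[d]);
          if (best < 0 || dist < best) { best = dist; bestMatch[q] = int(d); }
        }
      }
    });
  }
  for (auto& t : workers) t.join();
  return bestMatch;   // unique IDs (here, database indices) of the closest matches
}
```

Swapping the inner loop for a KD-Tree query is what brings the per-worker cost down to O(M log N), and in the real system each worker's slice of keypoints travels over Gigabit Ethernet via ICE rather than through shared memory.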
We vary the number of workers used to perform the keypoint matching from one to a maximum of fifteen (because there are a total of sixteen cores available). Figure 2.9 illustrates the allocations of the modules. Table 2.8 and figure 2.10 record the time required to process each frame, plotted against the number of workers. It shows a total decrease of 21.73 times (from 353.22 to 16.25 seconds) in per frame processing time between 1 and 15 workers. Here, the improvement goes beyond fifteen- folds is, we believe, because of memory paging issues that arise when dealing with the large messages necessary when using a small number of nodes. This figure also shows that while diminishing returns are achieved after 11 nodes, there is still a significant performance improvement by the utilization of parallel work- ers in the cluster. While not all algorithms are as easily parallelized as this example, 39 Table 2.8: Beobot 2.0 SIFT Database Matching Algorithm Testing Results Num. of Workers Processing Time Standard Deviation (sec./frame) (ms./frame) 1 353.2194 57.8369 2 193.6876 58.1703 3 130.8932 39.6815 4 95.5375 27.8618 5 95.3156 34.4357 6 57.5809 12.2352 7 47.1917 10.6182 8 40.2749 9.9586 9 37.2474 8.3234 10 29.8436 6.6259 11 37.7632 15.9283 12 18.9525 6.2032 13 20.9672 6.1088 14 25.3063 6.8077 15 16.2541 5.7455 this experiment shows that a very common visual localization front-end can indeed be parallelized, and the benefits for doing so are significant. 2.5.2.2 Distributed Visual Saliency Algorithm Test One of the capabilities that is important in a robot is object collection. Here, a key task toperformisobjectrecognition, usuallyfromanimage. Therearetimeswheretheobject may be small, or placed in a cluttered environment. This is where an algorithm like the saliency model [67] can be quite useful. The term saliency is defined as a measure of conspicuityinanimage,andbyestimatingthischaracteristicforeverypixelintheimage, parts of it that readily attract the viewer’s attention can be detected. Thus, instead of blindly performingan exhaustive search throughout the input image, thesaliency model can direct the robot to the most promising regions first. We can then equip the robot withahighresolutioncameratocaptureallthedetailsofitssurroundings. Furthermore, 40 because ofBeobot 2.0’s powerfulcomputingplatform, saliency processinginsuch alarge image in a timely manner becomes feasible. To compute the salience of an image, the algorithm [67] first computes various raw visual cortex features that depict visual cues such as color, intensity, orientation (edges/corners), and flicker (temporal change in intensity). Here we have multiple sub- channels for each domain: two color opponencies (red-green and blue-yellow center- surroundcomputation), oneintensityopponency(dark-bright), twelveorientation angles (increments of 15 degrees), and one for flicker. That is a total of sixteen sub-channels, each producing a conspicuity map which are then combined to create a single saliency map. Because the computations in each sub-channel are independent, they can be easily distributed. And so, we use the algorithm to show how having many cores in a robot can alleviate such a large computational demand. For the experiment, we setup one master process and 1 to 15 worker processes to calculate the saliency of images of 4000 by 4000 pixel in size, for 100 frames. The master process takes approximately 100ms to pre-process the input image before sendingthe jobs to the workers. 
The jobs themselves takes up to 100ms to finish for the color, intensity, and flicker sub-channels, and up to 300ms for the twelve orientation sub-channels. And finally, the conspicuity map recombination takes less than 10ms. The table 2.9 summarizes the running time of individual parts of the system and figure 2.11 illustrates the flow of the algorithm. In addition, figure 2.12 shows the actual allocations of all the processes, at which computer the modules are run. 41 Table 2.9: Beobot 2.0 Distributed Saliency Algorithm Time Breakdown Module Description Computation Time Input Image Pre-processing Computes luminance, Red-Green and Blue-Yellow color opponency maps to be sent to the worker processes. Done by the master process. 100ms. Conspicuity Map Generation Perform center surround operations in multiple scales to produce a conspicuity map for each sub-channel. 300 - 3900ms. Depends on the number of workers utilized. Saliency Map Generation Combines all the conspicuity maps returned by the workers to a single saliency map. Done by the master process. 10ms. Table 2.10: Beobot 2.0 Visual Saliency Algorithm Testing Results Num. of Workers Processing Time Standard Deviation (ms/frame) (ms./frame) 1 3456.50 108.904 2 1787.69 491.056 3 1249.68 528.787 4 979.14 374.978 5 733.85 359.345 6 629.24 387.554 7 571.79 429.028 8 526.77 288.188 9 498.49 441.441 10 478.33 482.481 11 452.13 290.235 12 436.75 253.642 13 375.59 299.921 14 362.72 195.126 15 352.26 332.128 42 The results that we obtained from this experiment can be viewed in table 2.10 and graphed in figure 2.13. As we can see, the processing time drops as we continue to add the number of workers to the system. Quantitatively, the processing time reduction comes reasonablyclosetotheexpected value, at least earlyon. For example, ifusingone worker, the processing time is 3456.50ms, then using two workers should take half the time, 1728.25ms, which is comparable to the actual time of 1787.69ms. This is usually the case for a straight-forward distributed processing where there are no dependencies between the processes. Another point of comparison is that we would like to gauge the improvement of using the full cluster with what would be equivalent to a standard quad core system. Thus, we compare the usage of 3 worker nodes against all 15 nodes. We see a slight drop in improvement to 3.55 (from 1249.68 to 352.26ms). The reason for this slowdown is primarily attributed to network congestion, as we are shuffling large images around. Furthermore, if we compare the running time of one worker (3456.5ms) with ten times the running time of ten workers (478.33ms ∗ 10 = 4783.3ms), there seems to be a lot of added time. And so, as we add more and more workers, we expect to eventually hit a point of diminishing returns. A lesson to be taken here is that we should consider not only how to divide the task and properly balance job allocation, but also how large the data (or the total communication cost) is that needs to be distributed for each assigned job. 43 2.5.2.3 Biologically-Inspired Robot Vision Localization Algorithm Test For the third computational test, we utilized the vision localization algorithm by [155]. It relies onmatching localization cues fromaninputimage withalargesalient landmark database obtained from previous training runs to capture the scenes from the target environment under different lighting conditions. Thealgorithm firstcomputes thesame raw visual cortex features that areutilized by the saliency algorithm [67]. 
It then uses these raw features to extract gist information [152],whichapproximatesholisticaspectsandthegenerallayoutofanimage, tocoarsely locate the robot in the general vicinity. In the next step, the system then uses the same raw features to isolate the most salient regions in the image and compare them with the salient regions stored in the landmark database to refine its whereabouts to a metric accuracy. The actual matching between two regions itself is done using SIFT features [90]. Of the different parts of the algorithm, the salient region recognition process takes the longest time. However, the computations performed by this module are parallelizeable by dispatching workers to compare particular parts of the database. Aside from the parallel searches, there are two other processes whose jobs are to extract gist and saliency features from the input image, and a master process that assigns jobs and collects results from all landmark database search worker processes. The gist and saliency extraction process, which operates on 160 by 120 size images, take 30–40ms to complete per frame, and have to be run first. They are placed in the computer that is connected to the camera. For this experiment, however, we are runningoff of previously collected data, without runningthe motors. Note that because 44 the information being passed around consists of a few small salient regions (about five regions of 40 by 30 pixels, on average), there is only a small amount of time (4–5ms) spent on data transfer (using the ICE protocol) through the Gigabit Ethernet network. We then run the master search process, which takes about 50–150ms (depending on the number of landmarks in the database) to create a priority queue for ordering landmark comparisons from most to least likely using saliency, gist and temporal cues. For example, in the gist-based prioritization, if the gist features of the input image suggest that the robot location is more likely to be in a certain vicinity, we compare the incomingsalient region withthestoredregionsfoundnearthat place. Thisprioritization improves the system speed because we are only trying to find a first match, which halts the search once it is found, and not the best match, which requires an exhaustive search through the entire database. After these two processes are done, we can then dispatch the landmark search pro- cesses in parallel. For testing purposes, we use 1, 2, 4, and 8 computers to examine the increase in overall system speed. Noting that there are two cores in each computer, we dispatch 2, 4, 8, and 16 landmark database worker processes, respectively. The Local- ization Master then collects the match results to deduce the robot’s most likely location given the visual evidence. This final step takes less than 5ms. Table 2.11 summarizes the various processes information, figure 2.15 shows the pro- gram allocation while figure 2.14 illustrates the flow of the algorithm. We test the system on the same dataset used in [155], which depicts a variety of visually challenging outdoor environments from a building complex (ACB) to parks full of trees (AnFpark) to an open area (FDFpark). The database information for each site, 45 Table 2.11: Beobot 2.0 Localization System Time Breakdown Module Description Computation Time Gist and Saliency Computes various raw visual cortex features (color, intensity, and orientation) from the input image for gist and saliency feature extraction. 30 - 40ms. 
Localization Master: Creates a priority job queue (to be sent to the workers) for ordering landmark database comparisons from most to least likely using saliency, gist, and temporal cues; it also collects the search results from all search workers. 50 - 150ms, depending on the size of the database. Localization Worker: Compares the incoming salient region with the stored regions in prioritization order; the search halts once the first positive match is found. 300 - 3000ms, depending on the size of the database and the number of workers utilized.

Table 2.12: Beobot 2.0 Vision Localization System Testing Environments

Site | Num. Training Sessions | Num. Testing Frames | Num. Lmk | Num. S. Reg/Lmk | Num. S. Regs
ACB | 9 | 3583 | 1501 | 19.79 | 29710
AnFpark | 10 | 6006 | 4664 | 17.69 | 82502
FDFpark | 11 | 8823 | 4808 | 18.86 | 90660

in their respective rows, can be found in table 2.12, and the images can be viewed in figure 2.16. The table shows the number of training sessions; each of these depicts a different lighting condition in the outdoor environments, which is one of the reasons why the database is so large. The table also introduces the term salient region [155], denoted as S. Regs, which is different from a landmark. A landmark is a real entity that can be used as a localization cue, whereas a salient region is a piece of evidence obtained at a specific time. Thus, there are, on average, about twenty salient regions depicting each landmark, to cover different environmental conditions.

Table 2.13: Beobot 2.0 Vision Localization System Testing Results

Num. of Computers | ACB: Time (ms), Err (ft), S. Reg Found/fr | AnF: Time (ms), Err (ft), S. Reg Found/fr | FDF: Time (ms), Err (ft), S. Reg Found/fr
1 | 1181.60, 7.24, 2.34/4.89 | 2387.87, 7.45, 2.73/4.98 | 3164.96, 14.11, 2.51/4.78
2 |  711.93, 5.57, 2.38/4.89 | 1495.23, 7.58, 2.76/4.98 | 1909.01, 14.32, 2.51/4.78
4 |  499.18, 3.72, 2.48/4.89 | 1000.66, 7.73, 2.81/4.98 | 1201.90, 13.25, 2.55/4.78
8 |  421.45, 4.12, 2.57/4.89 |  794.31, 7.81, 2.94/4.98 |  884.74, 13.40, 2.60/4.78

The results are shown in table 2.13. Here, we examine the processing time per frame, the localization error, and the number of salient regions found per frame. As we can see from the table, for each site there is always a decrease in processing time per frame as we increase the number of computers. At the same time, accuracy generally increases in two out of the three sites as the number of computers grows. The reason is that the localization algorithm itself behaves differently as more and more resources are provided, in that it tries to balance the speed of computation against the accuracy of the results. Consequently, the running time analysis is not as straightforward. That is, we cannot simply observe the non-linear relationship between the number of computers and processing time (doubling the number of computers does not halve the processing time) and conclude that the algorithm does not exploit the available computing efficiently.

As we explained earlier, the [155] localization system orders the database landmarks from the most likely to be matched to the least. This is done using other contextual cues (such as gist features, salient feature vectors, and temporal information) that can be computed much more quickly than the actual database matching process. The effect of this step is that it gives robot systems with limited computing resources the best possible chance to match the incoming salient regions. In addition, there is also an early-exit strategy that halts the search if one of the following conditions is met:

• 3 regions are matched.
• 2 regions are matched and 5% of the queue has been processed since the last match.
• 1 region is matched and 10% of the queue has been processed since the last match.
• no regions are matched and 30% of the queue has been processed.

This policy is designed to minimize the amount of unnecessary work when it is obvious that a subsequent match is very unlikely to be found. Together with the increase in the number of workers, however, this policy produces a slightly different behavior.

First of all, there is a difference between the number of jobs processed by a one-worker setup and by multiple workers. In the former, the localizer master process assigns a job, waits until the worker is done, then checks whether any of the early-exit conditions are met before assigning another job. With multiple workers, the master assigns many jobs at the same time and much more frequently. This increases the possibility of a match, as demonstrated by the increase in the number of salient regions found in table 2.13. In turn, it slows down the running time by prolonging the search process by 5, 10, or even 15 percent (in a compound case) of the queue, depending on which early-exit conditions are invalidated.

On the other hand, the higher number of matches found can also increase the accuracy of the system, but not always. As we can see in table 2.13, there is a small but visible adverse effect of letting the search go on too long (most clearly in the AnF site). The longer the search process, the more likely a false positive is discovered, because the jobs lower in the priority queue are in lesser agreement with the other contextual information. This is also reflected by the fact that most of the salient regions are found early in the search: the numbers do not increase significantly as we add more computers. For example, in the ACB site, compare the salient regions found using one computer (2.34) with eight (2.57). From the table, we estimate that, for these environments, four computers appears to be the optimum number.

Note that the localization system does not have to be real-time, but being able to come up with a solution within seconds, as opposed to a minute, is essential because longer durations would require the robot to stop its motors and stay in place. This is what we are able to do with Beobot 2.0. In the full setup, the localization system is run in conjunction with a salient region tracking mechanism, which keeps track of the regions while they are being compared with the database, allowing the robot to move freely as long as the regions remain in the field of view. If we use just four of the computers for localization, the others can be used for navigational tasks such as lane finding, obstacle avoidance, and intersection recognition, making the overall mobile robotic system real time. Currently, we have a preliminary result of a system that localizes and navigates autonomously in both indoor and outdoor environments, reported in [20].

2.6 Discussions And Conclusions

In this chapter, we have described the design and implementation of an affordable research-level mobile robot platform equipped with a computing cluster containing eight dual-core processors for a total of 16 2.2GHz CPUs. With such a powerful platform, we can create highly capable robotic applications that integrate many complex algorithms, use many different advanced libraries, and utilize large databases to recall information.
In addition, by using the Computer On Module (COM) Express form factor industry standard, the robot is able to stave off obsolescence for a longer period, thanks to the ability to swap COM modules and upgrade to the latest processor and computer technology. Furthermore, by implementing our own robot, we have demonstrated a cost-effective way to build such a computationally powerful platform. For a detailed cost breakdown, please refer to [149].

The trade-off, of course, is in development time and effort. In our lab, we have had two people working full time on this project, with a few others helping out here and there. The total design and implementation time has been 18 months from conception to realization. We have had to think about many issues, no matter how small or seemingly trivial, to ensure that no major flaws were introduced into the design that could become show-stoppers down the road. However, given that we now have the final design [149], the implementation of a second robot ought to be relatively straightforward and much quicker, on the order of 2 to 3 months.

One might wonder why we would go to such extraordinary effort to build such a complex hardware system when there may well be an easier alternative. For example, why not simply stack eight laptops on a mobile platform connected with Ethernet cables? We believe that with such an approach, it would be hard to isolate the computers from the elements. Cooling, for instance, would have to be done in two steps. First, the internal laptop fans would blow hot air out to a water-proof inner compartment of the robot body. Then, we would need another set of fans, fitted with filters, to drive the air in and out of the robot body. Furthermore, we would still have to create space for all the other devices (for example, sensors and the motor driver), along with power connections that also have to supply the main computers. The resulting sheer amount of cables would easily make the system unwieldy and unappealing.

In our custom design, however, wires and cabling, which are often a source of connection failures, have been kept to a minimum thanks to the Printed Circuit Board (PCB) design directives. Additionally, the liquid cooling system is well sealed and runs smoothly every time we run the robot. In terms of maintenance, we use the robot every day and have found it to be quite trouble-free. The wall plug-in battery charging system, for one, makes it convenient to charge the robot at night before going home. Finally, we would like to add that, although in this paper we are presenting a terrestrial system, the same technology has been applied to an underwater robot [180], where dimensions and weights become critical factors and modifying COTS (Commercial Off The Shelf) systems may not be feasible.

Nonetheless, with the benefit of hindsight, there are some things we would have liked to improve upon. One is easier access to the various electronic components inside the robot body. For example, four of the COM Express modules are placed underneath the cooling block structure, and taking them out for repair can be somewhat difficult. This is the price we pay for designing such a highly integrated system. Another problem is managing many computers. In an application that requires all eight of Beobot 2.0's computers, we have to compile and run many programs in parallel with certain ordering constraints. In addition, we also have to properly allocate each program to a computer, so that no computers are idling while others are overloaded.
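To illustrate the kind of start-up automation this calls for, the sketch below launches each module on its assigned node over SSH in dependency order. The host names, binary paths, and dependency table are hypothetical placeholders, not the actual Beobot 2.0 configuration; it is only meant to show one simple way the ordering constraints could be scripted.

import subprocess

# Hypothetical allocation: module -> (host, command, modules it depends on).
ALLOCATION = {
    "GistSal":    ("com_e1", "./bin/gistsal",    []),
    "LocMaster":  ("com_e1", "./bin/loc_master", ["GistSal"]),
    "LocWorker1": ("com_e2", "./bin/loc_worker", ["LocMaster"]),
    "LocWorker2": ("com_e3", "./bin/loc_worker", ["LocMaster"]),
}

def launch(module, started):
    """Start a module after its dependencies, at most once."""
    if module in started:
        return
    host, cmd, deps = ALLOCATION[module]
    for dep in deps:
        launch(dep, started)
    # Run the remote command in the background so we can keep launching.
    subprocess.Popen(["ssh", host, cmd])
    started.add(module)

started = set()
for m in ALLOCATION:
    launch(m, started)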
While these issues cannot always be avoided, some forethought and automation via appropriate scripting can help. Frameworks such as MOSIX or the 51 Scyld beowulfsystem areavailable toaid this process, which is to betested in thefuture on our robot. In the end, we believe that our primary contribution is that Beobot 2.0 allows for a class of computationally intensive algorithms that may need to access large-sized knowl- edgedatabases,operatinginlarge-scaleoutdoorenvironments,somethingthatpreviously may not have beenfeasible on commercially available robots. In addition, it also enables researchers to create systems that run several of these complex modules simultaneously, which is exactly what we are currently working on in our lab. That is, we would like to run the localization system [155], vision-based obstacle avoidance, and lane follow- ing [3] together. We also are planning to add more components such as SLAM and human/robot interaction. The long term goal is to make available plenty of predefined robotic components that can be reused to speed up future project developments. Subsequently, the problem that we foresee is managing these diverse capabilities. We have to make sure that each has enough resources to work with and give priority to the most important and reliable sub-systems in solving the task at hand as well as identifying dangers that threaten the livelihood of the robot. We hope that through our contribution of implementing an economical but powerful robot platform, we can start to see more of the type of complete systems that are needed to thrive in a real world environment. 52 Figure 2.1: Beobot 2.0 electrical system. On the right side of the diagram, there are 2 baseboards, each houses 4 COM Express modules and each module has its own SATA hard-drive. The backbone inter-computer communication is Gigabit Ethernet that is connected through a switch. For visual interface to individual computers, a KVM (Key- board, Video, and Mouse) is used to connect to either a 8-inch LCD Touchscreen or an external monitor, mouse, and keyboard. In addition, a PCI Express - Firewire interface card is used to connect to a low latency camera. The other sensors are connected via the many USB connectors that are panel mounted on top of the robot as well as on the baseboard. The whole system is powered by a 24V battery circuit supply (with kill- switchesforsafetypurposes)andisregulatedthroughasetofdedicatedPico-ATXpower modules. The same battery circuit also powers the motors as well as the liquid-cooling system. 53 Figure2.2: Baseboard. Theimageontheleftisafullyassembledbaseboardwith4COM Express modules. The black plates are the heat-spreaders attached to the processors. There is also an Ethernet and two USB jacks placed on the right side of the board. The layout on the right is the circuit done in Altium [5] Printed Circuit Board (PCB) design software. (a) Pico-ATX (b) KVM Board (c) Liberty312 Battery and Charger Figure 2.3: Various Power related components. 54 Figure 2.4: A SolidWorks [160] model of the robot shown from its side and cut through its center. The bottom of the image displays the Liberty312 wheelchair base, and above it is the robot body in cardinal color. The robot body is divided into two chambers by the dust-proof firewall. The back-chamber completely seals the computers within from the elements. The front of the robot, which houses part of the cooling system, is open to allow for heat dissipation. 
The heat from the computers is transferred by the liquid in the cooling block, which is attached to the heat-spreaders on each module. The liquid then moves through the radiator, which is cooled by the fans, before going back to the cooling block. In addition, the computing block is shock-mounted on four cylindrical mounts that absorb shocks and vibration.

Figure 2.5: Graphical User Interface display of the ND Navigation system. The system identifies 9 different regions (indexed from 0 to 8; note that 4 and 7 are cut off as they are drawn outside the frame of the GUI). The robot's next direction of motion is denoted by the blue line. The yellow lines delineate the boundaries of the navigable region. The red lines indicate the directions 60 degrees to the left and right of the robot's next direction. We also display the robot's translational and rotational motor commands. Both of these numbers range from -1.0 to 1.0 (negative values indicate moving backwards and counter-clockwise rotation, respectively).

Figure 2.6: Snapshot of the constructed environment for ND navigation testing.

Figure 2.7: Various environments (Env1, Env2, Env3) for ND navigation testing, with an example path taken by Beobot 2.0 using the ND Navigation algorithm.

Figure 2.8: Flow of the distributed SIFT database matching algorithm, denoted in increasing alphabetical order in the figure and referred to below in parentheses. First, the camera passes the High Definition (1920 by 1080 pixel) frame to the SIFT Master module (A). This module takes about 18 seconds to extract the SIFT keypoints from the input image before sending them to the SIFT Worker processes (denoted as B_i, i being the total number of workers). Depending on the number of workers, each takes between 16.25 and 353.22 seconds to return a match to the SIFT Master (C_i).

Figure 2.9: Allocation of the different programs of the distributed SIFT database matching algorithm in Beobot 2.0. The SIFT Master module is run on one of the cores in computer COM_E1, while the various SIFT Worker modules are allocated throughout the cluster.

Figure 2.10: Results for SIFT database matching testing on Beobot 2.0 (processing time in seconds per frame versus the number of workers, from 1 to 15).

Figure 2.11: Flow of the distributed saliency algorithm, denoted in increasing alphabetical order in the figure and referred to below in parentheses. First, the camera passes the high-resolution 4000 by 4000 pixel image to the SalMaster module (A). SalMaster pre-processes the image, which takes 100ms, before sending it out to the various sub-channel SaliencyWorker processes (denoted as B_i, i being the total number of workers). The color, intensity, and flicker sub-channels take up to 100ms, while the orientation sub-channels take up to 300ms. These results are then re-combined by SalMaster (C_i), which takes less than 10ms.

Figure 2.12: Allocation of the different programs of the distributed saliency algorithm in Beobot 2.0. The SaliencyMaster module is run on one of the cores in computer COM_E1, while the various SaliencyWorker modules are allocated throughout the cluster.
Figure 2.13: Results for the saliency algorithm testing on Beobot 2.0 (processing time in ms per frame versus the number of workers, from 1 to 15).

Figure 2.14: Flow of the localization system, denoted in increasing alphabetical order in the figure and referred to below in parentheses. First, the camera passes a 160 by 120 pixel image to the Gist and Saliency Extraction module (A), which takes 30 - 40ms, before sending its output to the Localization Master module (B). This module then allocates search commands/jobs, in the form of a priority queue, to a number of Localization Workers (C_i, i being the total number of workers) to perform the landmark database matching. A search command/job specifies which input salient region is to be compared with which database entry. This takes 50 - 150ms, depending on the size of the database. The results are then sent back to the Localization Master (D_i), which makes the determination of the robot location given the visual matches. This last step takes less than 10ms.

Figure 2.15: Allocation of the different programs of the localization system in Beobot 2.0. The Gist and Saliency Extraction (GistSal) and Localization Master modules are allocated to computer COM_E1, while the various Localization Worker modules are assigned to cores throughout the cluster. Note that there are also 2 worker modules in COM_E1. This is because they only run when the GistSal and Localization Master modules do not, and vice versa.

Figure 2.16: Examples of images in the ACB (first row), AnFpark (second row), and FDFpark (third row) environments.

Chapter 3

Gist and Saliency Based Navigation

3.1 Introduction

The ability to go to specific locations in one's environment in a timely manner is fundamental to creating a fully autonomous mobile robotic system. To do so, a robot not only has to be able to move about its environment, but it also has to identify its location. To solve this problem, available approaches use proximity sensors such as Laser Range Finders (LRF), cameras, encoders, and the Global Positioning System (GPS). While robot navigation and localization have made strides indoors (using LRF [171, 111]) and on the road (as in the DARPA Grand Challenge, relying on a combination of LRF, GPS, and visual road recognition [173, 112]), systems that can perform well in in-between settings such as a college campus, where a robot needs to travel both inside buildings (through hallways) and outside (between large structures), are still lagging behind. For these scenarios vision, the primary human perception, is ideal because of its versatility in recognizing a variety of environments. However, because of the complexity of a camera image, vision also comes with many challenges such as lighting conditions, perspective, and occlusions. Observe figure 3.1 for an example of a cluttered outdoor environment in which our robot Beobot 2.0 is operating.

Figure 3.1: Beobot 2.0 navigating in a cluttered environment.

In the current state of robotic research, vision-based navigation and localization are usually dealt with separately. For the most part, the available literature concentrates more on one but not the other. Researchers who work on vision localization manually control their robots to move about the environment. On the other hand, visual navigation is usually performed for a single predefined route, where robots do not have to decide where to turn when encountering an intersection.
This split in effort, is done for good reasons, in that each individual problem is difficult by itself. Although robust vision localization that can operate in all environment are not yet available, state of the art systems can now localize in large areas such as city streets [29, 141], across multiple seasons [182], and in many different indoor [127, 133, 45] and outdoor environments [155, 118, 7]. However, most localization systems are trained and tested while the robot’s movement is controlled by a user. The problem with manual driving is that the robot’s point of view comes from a perfect control. When one wants 63 to integrate the localization module with the rest of the system, including navigation, the robot may need to recognize views it has not seen. For example, the views from the same location with different robot headings, can be strikingly dissimilar. One way to alleviate this issue is through the use of omni-directional cameras (as opposed to a regular straight-forward cameras) [114] as the images produced can be realigned. On the navigation side, there are two basic types of systems. One uses a ”teach- and-replay” paradigm, and the other tries to recognize the direction of the road. In the former [23, 144, 136], the robot first is manually driven through a pre-specified path, while recording all the pertinent information necessary to replay the route being taught. Then, in the replay phase, the robot compares the stored visual features with what is currently seen to compute the positional difference, which is used as a control feedback to stay on the path. This type of system, which utilizes features such as KLT [148] or SIFT [90], needs to recognize surrounding landmarks to navigate. In other words, localization within the path (not throughout the environment) is required. This is done by retracing a sequence of keyframes, keeping track of which frame the robot currently is at. When matching between the current scene and keyframe fails, the system has to re-synchronize itself. This can take extra time as rematching from the start of the series is required. Ontheotherhand,roadrecognition systems, insteadofidentifyinglandmarks,tryto recognize theroad’s appearance as well as boundaries. Somesystems rely onrecognizing roadedges(basedontheintensityofimages, forexample)thatseparatetheroadfromits surroundings [44]. Others do so by performing color and texture segmentation [135, 14]. In addition, there are systems that combine both techniques [48, 50, 91] to extract the 64 road region in the image. However, for these type of systems, the robot merely follows the road and keeps itself within its boundaries, not knowing where it will ultimately end up. If a system has both localization and navigation, we can command it to go to any goal location in the environment by using the former to compute a route and the latter to execute movements. In addition, such an integrated system can internally share resources (visual features, for one) to make the overall system run more effectively. However, this also poses a challenging problem of integrating complex sub-systems. In spite of the difficulties, many vision localization [114] and navigation [24] systems have discussed that it is advantageous to combine localization and navigation. Ours is such an implementation. We present a mobile robot vision system that both localizes and navigates by ex- tending our previous work of Gist-Saliency localization [155]. 
Instead of just navigating through a pre-specified path (like teach-and-replay), the robot can execute movements through any desired routes in the environment (e.g., routes generated by a planner). On the other hand, unlike teach-and-replay, a stand-alone localization system allows the navigation system to not be overly sensitive to the temporal order of input images. We tested the system (reported in section 5.3) in both indoor and outdoor environments, each recorded multiple times to validate our approach. 65 9 8 6 5 3 4 1 2 Localization Navigation Motion Planner Path Planner Identified Landmark Tracker Input Goal Location Input Landmark Tracker Lateral Difference Estimator 1a 1b Figure 3.2: Diagram of the overall vision navigation and localization system. From an input image the system extracts low-level features consisting of center-surround color, intensity, and orientation that are computed in separate channels. They are then fur- ther processed to produce gist features and salient regions. We then compare them with previouslyobtainedenvironmentvisualinformationtoestimatetherobotslocation. Fur- thermore theidentified salient regions areused todirect therobot’s heading, to navigate itself to its destination. 3.2 Design and Implementations Our presented mobile robot vision navigation and localization system (illustrated in figure 3.2) utilizes biologically inspired features: gist [152] and salient regions [155] of an image, obtained from a single straight-forward view camera (as opposed to omni- directional). Both features are computed in parallel, utilizing shared raw Visual Cortex visual features from the color, intensity and orientation domain [67, 63]. Here we define gist as a holistic and concise feature vector representation of a scene, acquired over very short amount of time. Salient regions, on the other hand, are obtained using the saliency model [67, 63] which identifies parts of an image that readily capture the 66 viewer’s attention. As demonstrated by previous testings [155], the likelihood that the same regions will pop out again when the robot revisits particular locations is quite high. Theactual matching isdone throughSIFTkeypoints extracted withineach salient region. It is important to note that matching keypoints within small region windows instead of a whole image makes the process much more efficient. Our localization system [155, 154] estimates the robot’s location in an environment represented by a topological map. The map denotes paths as edges, and intersections as nodes. We also define the term segment as an ordered list of edges with one edge connected to the next to form a continuous path (an example is the selected three-edge segment, highlighted as a thicker green line, in the map in figure 3.2). Geographically speaking, a segment is usually a portion of a hallway, path, or road interrupted by a crossing or a physical barrier at both ends for a natural delineation. This grouping is motivated by the fact that views/layouts in one path segment are coarsely similar. Because of this, the system can use the gist vector to classify an input image to a segment. After the system estimates the robot’s segment location, we then try to refine the coarse localization estimation by matching the salient regions found in the input image with the stored regions from our landmark database (obtained during training). Note that this processis much slower than segment estimation butproducesahigher localiza- tion resolution. 
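To make the coarse, gist-based segment classification described above concrete, a minimal nearest-neighbor sketch over stored gist vectors is given below. The function and variable names are ours, and the actual classifier used in the system may differ; this is only a simplified stand-in.

import numpy as np

def classify_segment(gist_vector, segment_prototypes):
    """Coarse localization: pick the segment whose stored gist vectors
    are closest to the input gist vector.

    segment_prototypes: dict segment_id -> array (num_training_frames, gist_dim)
    Returns (best_segment_id, per-segment likelihoods summing to 1).
    """
    g = np.asarray(gist_vector, dtype=float)
    dists = {}
    for seg_id, protos in segment_prototypes.items():
        d = np.linalg.norm(protos - g, axis=1)   # distance to each training frame
        dists[seg_id] = d.min()                  # closest stored view in the segment
    # Convert distances to a simple likelihood over segments.
    seg_ids = sorted(dists)
    scores = np.array([1.0 / (1e-6 + dists[s]) for s in seg_ids])
    likelihood = scores / scores.sum()
    best = min(dists, key=dists.get)
    return best, dict(zip(seg_ids, likelihood))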
At the end, both gist and salient region observations are incorporated into a back-end probabilistic Monte Carlo Localization to produce the most likely current location.

In our training procedure, we run the robot through the environment once. There is no camera calibration nor manual selection of the landmarks. In addition, the training procedure [154] automatically clusters salient regions that portray the same landmark. A landmark, a real entity in the environment, is thus established by a set of salient regions; in a sense, the regions become evidence that the landmark exists. Furthermore, these regions are stored in the order of discovery. That is, we keep a temporal order of the regions as seen when the robot traverses the environment. This becomes important when we try to navigate using these landmarks.

As a result of the compact topological map representation (figure 3.2), we do not localize the robot laterally on the lane, nor do we have accurate knowledge of the robot's heading with respect to the road. However, for each database salient region, we not only record its geographical location, but also register its image coordinates to perform navigation. That is, we minimize the difference in image coordinates between the input region and its corresponding database region to bring and keep the robot at the "correct" position on the lane (the same position as during training). In the following sub-section 3.2.1, we describe the main contribution of this chapter, the navigation portion of the system and its interaction with the localization component.

3.2.1 Navigation System

As shown in figure 3.2, localization and navigation run in parallel in the system. When a new set of input salient regions arrives, the localization module starts its landmark database search (denoted as 1a in the figure). At the same time, the navigation module also tracks these regions (1b). By tracking them, when the regions are positively identified at a later time (the database search usually finishes on the order of seconds), even if the robot has moved during the search process, we can still locate the salient regions in the image. When a tracked region is lost or moves away from the robot's field of view, we simply tell the localizer to stop searching for a match. Note that, while the search is in progress, the gist-based segment estimation result, which is available instantaneously on every frame, is used to maintain the location belief.

When the database search concludes, if there is at least one matched region, the localization system informs the navigation system that these regions can be used for navigation. The navigation system first transfers the regions tracked by the input landmark tracker to the identified landmark tracker. A second tracker is needed to free up the first one for newer, yet-to-be-identified landmarks from the current frame. The identified landmark tracker keeps track of the transferred regions so that we can utilize them as visual navigation cues for a period of time (until newer landmarks are recognized).

The navigation system calculates the image coordinate difference (we call this the lateral difference) between the tracked region and the matched database region (section 3.2.1.2) as feedback to the motor controller. In addition, we also bias the salient region matching (explained in section 3.2.1.3) when the robot is at an intersection so that it can correctly direct itself to the goal location. We start by explaining in detail the salient region tracking process in the following sub-section 3.2.1.1.
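Before turning to the tracker itself, here is a minimal sketch of the two-tracker bookkeeping described above. The class and method names are ours and only illustrate the hand-off between the input landmark tracker and the identified landmark tracker, not the actual implementation.

class NavigationTrackers:
    """Keeps unidentified regions separate from database-confirmed ones."""

    def __init__(self):
        self.input_tracker = {}       # region_id -> tracked region (unconfirmed)
        self.identified_tracker = {}  # region_id -> tracked region (confirmed)

    def on_new_regions(self, regions):
        # New salient regions start in the input tracker while the
        # (slow) landmark database search runs in the background.
        for rid, region in regions.items():
            self.input_tracker[rid] = region

    def on_region_lost(self, rid):
        # Tell the localizer to abandon the search for a region we lost.
        self.input_tracker.pop(rid, None)
        self.identified_tracker.pop(rid, None)

    def on_search_done(self, matched_ids):
        # Confirmed regions move to the identified tracker, freeing the
        # input tracker for newer, yet-to-be-identified regions.
        for rid in matched_ids:
            if rid in self.input_tracker:
                self.identified_tracker[rid] = self.input_tracker.pop(rid)

    def navigation_regions(self):
        return list(self.identified_tracker.values())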
3.2.1.1 Salient Region Tracker

Figure 3.3 illustrates the salient region tracking procedure. The tracker utilizes a template matching algorithm (the OpenCV implementation with squared-error distance) on conspicuity maps [67, 63] from the 7 Visual Cortex sub-channels used in saliency and gist computation (see figure 3.2; the sub-channels correspond to 4 edge orientations, 2 color contrasts, and luminance contrast). For each region, we perform template matching on each map (7 associated templates) before summing the resulting distance maps. The templates are initialized using 7x7 windows from each conspicuity map, centered around the salient point of the region. The conspicuity maps are 40x30 pixels in size, downsampled from the original 160x120 input image size, which is acceptable given that we are mostly tracking large salient objects. We then weigh the summation based on the distance from the previous time step for temporal filtering. The coordinate of the minimum value is the predicted new location.

Figure 3.3: Diagram for salient region tracking. The system uses conspicuity maps from 7 sub-channels to perform template matching tracking. Before selecting the predicted new location, the system weighs the result with respect to the proximity of the point in the previous time step. In addition, we also adapt the templates over time for tracking robustness.

In addition, in each frame we update the templates for robustness (to lighting changes, etc.) using the following adaptive equation:

T_{t,i} = 0.9 * T_{t-1,i} + 0.1 * N_{t-1,i}    (3.1)

Here, T_{t-1,i} is a region's template for sub-channel i at time t-1, while N_{t-1,i} is the new region template extracted around the predicted location. Before the actual update, we apply a threshold to check whether the resulting template would change drastically, a sign that the tracking may be failing. If this is the case, we do not update, hoping that the error is just coincidental. However, if this occurs three times in a row, we report that the tracking is lost.

This process has proven to be quite robust because, aside from following regions that are already salient, the variety of domains means that for a failure to occur, noise has to corrupt many sub-channels. In addition, the noise is also minimized because the saliency algorithm uses a biologically inspired spatial competition for salience (winner-take-all) [67, 63], where noisy sub-channels with too many peaks are subdued. Also, an advantage of re-using the conspicuity maps is that the process is faster. But, more importantly, tracking requires no recall of stored information, which allows the procedure to be scalable in the size of the environment. In the current implementation, the system is usually able to track a salient region for up to 100 frames, and the process takes only 2ms per frame for each salient region. On each frame, we track up to 10 regions: 5 that are already recognized (in the identified landmark tracker), and 5 more that are currently being matched (in the input landmark tracker).

Before we calculate the lateral difference from the database match result, we perform a forward projection to the current frame to take into account the duration of the matching process. The search outputs a match between an input region from many time steps ago and the region in the database which most closely resembles it. However, since we started the search, that region has undergone transformation as the robot has moved and changed its pose. What we now actually want is a database region that most closely resembles the input region as it appears in the current time step.
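A compact sketch of the per-frame tracking and template-adaptation steps described in this sub-section is given below, before we return to the forward projection. It assumes 40x30 conspicuity maps and 7x7 templates and uses OpenCV's matchTemplate with the squared-difference measure; the function names, the distance-weighting constant, and the assumption that regions stay away from the map borders are ours.

import numpy as np
import cv2

ALPHA = 0.9  # template adaptation rate from equation 3.1

def track_region(conspicuity_maps, templates, prev_xy, dist_weight=0.05):
    """conspicuity_maps, templates: lists of 7 float32 arrays (40x30 maps, 7x7 templates)."""
    h, w = conspicuity_maps[0].shape
    total = np.zeros((h - 6, w - 6), dtype=np.float32)
    for cmap, tmpl in zip(conspicuity_maps, templates):
        # Squared-error distance between the template and every 7x7 window.
        total += cv2.matchTemplate(cmap, tmpl, cv2.TM_SQDIFF)
    # Temporal filtering: penalize locations far from the previous position.
    ys, xs = np.mgrid[0:total.shape[0], 0:total.shape[1]]
    dist = np.hypot(xs + 3 - prev_xy[0], ys + 3 - prev_xy[1])
    total += dist_weight * dist
    y, x = np.unravel_index(np.argmin(total), total.shape)
    return (x + 3, y + 3)  # center of the best-matching 7x7 window

def adapt_templates(conspicuity_maps, templates, xy):
    """Blend each template with the newly observed patch (equation 3.1).
    Assumes the region center stays at least 3 pixels from the map borders."""
    x, y = xy
    new_templates = []
    for cmap, tmpl in zip(conspicuity_maps, templates):
        patch = cmap[y - 3:y + 4, x - 3:x + 4]
        new_templates.append(ALPHA * tmpl + (1.0 - ALPHA) * patch)
    return new_templates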
Fortunately, inthedatabase,eachlandmarkisrepresentedbyasetofviews(regions), sorted temporally duringtrainingas the robot moves througha path. Thus, theforward projection process is basically a re-matching between the tracked region at the current time frame and a region from the matched database landmark that is usually only a few indexes away from the original match. This procedure takes about 10ms and is illustrated in figure 3.4. 3.2.1.2 Lateral Difference Estimation To estimate the lateral difference between image coordinates of the database and input region we utilize the SIFT keypoints match-pairs from all forward-projected matched salient regions. Here we adapt an algorithm by [24] onto the salient regions. That is, instead of using a single whole image from a training run, we use matched database regions that may come from different time steps. Each keypoint pair (from each region) votesforcandidateheadingdirections. Wediscardthevertical coordinatedifference, and are only concerned with the horizontal component as we assume that the robot stays on the ground at all times. 72 Figure 3.4: Forward projection procedure. A detected region R i from time T =t i (top left of the figure) is successfully matched to salient region L j,k , p frames later. This is the region that belongs to Landmark L j with index k (denoted in solid red at the right side of the figure), wherek is one of the multiple views of the landmark recorded during training. At this current timeT =t i+p , the regionR i has changed its appearance as the robot moves (now denoted as R i+p on the bottom left side of the figure). We project forward the matched region R i by finding a corresponding match for R i+p in landmark L j , in this case with region L j,k+l (denoted in solid green at the right side of the green) , which was found further in the path during training. The algorithm is stated below and is illustrated in figure 3.5. We consider three scenariosforthreeresultingheadings(left,straight,right). WedefineX t asthehorizontal coordinate of the SIFT keypoint in a matched input region andX d as the corresponding point of the landmark database region. In addition, we also set the origin to be at the centeroftheimage, denotedasbrokenlinesinthemiddleofeachcaseinthefigure(there are two cases for each of the three scenarios). This way, X d < 0 means the keypoint is to the left of the midline: • if(X t >X d and X t >0), then turn right • else if(X t <X d and X t <0), then turn left • else, go straight 73 Case1 Case1 Case1 Case2 Case2 Case2 Left Straight R ight Figure 3.5: Diagram for the voting scheme of navigation algorithm. There are three scenarios where the robot either moves left, straight, or right. For each scenario, there are two cases depending on the sign of X d , the corresponding point of the landmark database region. IfX d is to the left of the center of the image (the dashed line that goes through the origin of the image coordinate system), then X d <0. The same applies for X t , the horizontal coordinate of the keypoint in a matched inputregion. IfX t <X d and X t < 0, then turn left, else if X t >X d and X t > 0, then turn right. If neither, then go straight. Our robot, Beobot 2.0 [150], uses an electric wheelchair base and has a two-wheel differential drive. It is controlled by commanding a translational and rotational speed between -1.0 and1.0 (full speed backward to full speedforward, and full speedclockwise turn to full speed counter-clockwise turn, respectively). 
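A minimal sketch of this voting scheme, and of turning the vote into the translational and rotational command range just described, is shown below. The proportional gain, cruise speed, and the sign convention for the rotation command are assumptions for illustration, not the values used on Beobot 2.0.

def vote_direction(keypoint_pairs):
    """keypoint_pairs: list of (x_t, x_d) horizontal keypoint coordinates, with the
    origin at the image center, for the input and database regions respectively."""
    votes = {"left": 0, "right": 0, "straight": 0}
    diffs = []
    for x_t, x_d in keypoint_pairs:
        if x_t > x_d and x_t > 0:
            votes["right"] += 1
        elif x_t < x_d and x_t < 0:
            votes["left"] += 1
        else:
            votes["straight"] += 1
        diffs.append(abs(x_t - x_d))
    direction = max(votes, key=votes.get)
    avg_diff = sum(diffs) / max(len(diffs), 1)
    return direction, avg_diff

def motor_command(direction, avg_diff, gain=0.01, cruise_speed=0.4):
    """Map the vote to (translation, rotation), both in [-1, 1]."""
    rot = min(1.0, gain * avg_diff)   # turn speed proportional to pixel difference
    if direction == "left":
        return cruise_speed, rot      # sign convention assumed: positive = left turn
    if direction == "right":
        return cruise_speed, -rot
    return cruise_speed, 0.0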
We keep the translation at constant speed, while the turn speed is proportional to the average horizontal pixel difference. Note that the direction of turn is obtained through the voting process above. We find that our well-isolated salient regions make for clean correspondences that, for themost part, overwhelmingly agree withone of thethreecandidate directions. This is because, by using a compact region (as opposed to entire image like the original [24] algorithm), we avoid matching other distracting portions of the image, which may be occupied by dynamic obstacles, such as people. 74 3.2.1.3 Path Planning Biasing What we have implemented so far is to a policy that allows the robot to follow a con- tinuous portion of the route, along the edges of the topological map. However, when the robot arrives at an intersection, there may be multiple paths that stem out from it. Thesystemmayencounterasituationwhereitsimultaneouslyrecognize regionsthat are going to lead the robot to two or more different directions. For example, regions from the segment the robot is currently on may suggest it to continue to go straight, while regions on an incident segment are going to suggest it to make a turn. Here, we consider the task, the assigned goal location, by consulting with the path planner. The system first checks with the localization sub-system to see whether it is ap- proaching an intersection. If so, the path planner then bias the localization system to first compare input regions with database regions along the path to the goal, before ex- panding the search. In addition, for subsequent matches, we also prioritize comparison withdatabase regions that arefurtheralong theintersection turn. Thisway, therobotis goingtofollow thesame sequenceofregion matches that areestablishedduringtraining, which is critical for successful turning. At the end, for failure recovery, whenever the robot cannot recognize any salient region for a period of time (we set for five seconds), it stops and spins in place to scan its surrounding. Note that the path planner re-plans every time the localizer sends out a new current location to make correct adjustments even when the robot loses its way. In summary, the procedure goes as follows: after the go-to-goal command is received from a user interface, the system starts by recognizing at least one salient region to 75 localize. Once the robot is successful in doing so, it plans its path to the goal location. Thenavigation systemthenusestheinitially recognized salient regions toproperlymove closer to its goal. During this movement, the robot continuously try to identify newer regions to follow, to advance its mission. This procedure repeats until the robot arrives at its destination. A snapshot of how the overall system generates robot motion from a frame can be observed in figure 3.6. 3.3 Testing and Results We test the system using our robot, Beobot 2.0 [150] (observe figure 3.1), which is 57cm in length and 60cm in width. The camera itself is about 117cm above the ground. Beobot 2.0 has a computing cluster of eight 2.2GHz Core2 Duo processors. We use four computers for localization, one for navigation, three for future use. We select two different sites, anindoorandanoutdoor environment, fortesting. The first is the hallways of 27.13x27.13m HNB building, with a corridor width of 1.83m. The path forms a square with 90 degree angles when turning from one segment to another (observethemapinfigure3.6). 
The second one is the 69.49 x 18.29m outdoor engineering quad (Equad), where there are buildings as well as trees, with a pathway width of 3m. Snapshots of each environment are shown in figure 3.7.

Despite the fact that, in our testing environments, we only map one path at each junction, the path planner still performs its job by selecting that option. Note that picking the correct path to proceed to the goal out of a small number of available paths is trivial in the graph-based topological map. On the other hand, the actual turning execution on the chosen path is much more difficult.

Table 3.1: Navigation and Localization System Performance Results

Site | Site Dimensions | Traversal Length | Nav. Error | Loc. Error
HNB | 27.13 x 27.13m | 36.67m | 3.68cm | 1.15m
Equad | 69.49 x 18.29m | 138.27m | 8.78cm | 5.31m

For each environment, we run the robot through all paths once for training. We then run the robot five times indoors and three times outdoors for the testing phase. During testing, we record the trajectory of the robot odometry and its localization belief. We set the ground truth of the robot location using manually calibrated odometry readings. The baseline for ideal navigation itself is the center of the path.

The results for both sites are shown in table 3.1, which specifies the sites' dimensions and length of traversal. In addition, nav. error and loc. error refer to the navigational error (deviation from the center of the path) and localization error (global robot location discrepancy), respectively.

For the first site, HNB, the result is graphed in figure 3.8. For both the top and bottom plots, the x-axis denotes the ground truth location of the robot, 0 being the starting location. Here, the robot runs for 36.67m. The y-axis of the top plot indicates the localization error (in meters, longitudinally along the path) as the robot moves forward, while the y-axis of the bottom plot reports the lateral deviation from the center of the hallway (also in meters, transversally to the path). We divide (using vertical lines)
For outdoor navigation, thelateral deviation errors obtained here are acceptable as the lanes (3m wide) in the environment are about two times wider. The lateral deviation error increases mainly because the localization module cannot findsalient regions for a longer period of time, which makes the robot lose its navigation 78 cues. This usually occurred when localization was failing (note the spikes in the local- ization error graph). We attribute thisto theaddeddifficulties inrecognizing landmarks outdoors, where lighting conditions are not constant. Note that we train and test the system on separate days. On the other hand, our system has a very low likelihood of false positive identification as it has a high threshold for matching the landmarks. That is, in the event where the robot’s location is misplaced, the navigation system will not produce false cues to follow. This allows the system to stop, recover, and proceed for- ward. There are, however, times where we have to stop the robot because it has swerved off the pathway considerably and gets dangerously close to a ditch or about to hit a pole or a bench. 3.4 Discussion and Conclusions In this chapter we present a robot vision navigation and localization system. It uses salient regions to both localize the robot as well as to direct its heading. By having both mobile capabilities, our system can perform a user-specified command to go to a goal location. This is an extension to most available systems, which are either just a vision localization system (with the movement controlled by a user) or a vision naviga- tion system (which can only perform navigation from pre-specified starting and ending locations). For the most part, the robot is able to navigate using landmarks that are identified by the localizer. However, this is also a source of problem in that if the robot fails to recognize alandmark,itcannotnavigate. Thenextsteptoimprovethesystemistohave 79 robot visually navigate itself without localization during those times. We suggest two different vision techniques. One is road recognition to identify the direction of the path, or whether there is an intersection. The second is a form of 3D structure understanding (using structure from motion [159], or single image [140, 55]). This is especially useful to detect walls and obstacles. 80 S2 S3 S4 SEG1 SEG3 SEG4 SEG2 S1 Figure 3.6: A snapshot of the localization and navigation system. The input (top left) image contains the salient region windows. Green windows mean a database match, while reds are not found. Below the input image, to the left, is the segment estimation computed from gist features. There are four segments in the environment. Here the actual segment 4 is classified as second most likely (to segment 1) by the system. To the right of the segment estimation result are the matched salient regions drawn at the original detected image coordinate. Next to the input image is the robot state projected onto the map: cyan points are the location particles (hypotheses) generated by the Monte Carlo Localization algorithm, the yellow points are the locations of the matched database salient region, the green circle (the center of the blue circle) is the most likely location. The radius of the blue circle is equivalent to 3m. The robot believes that it is 4 meters from the intersection, which is correct within 20cm. On the navigation side (underneath the map), the robot is turning to the left as decided by the keypoint votes. 
Figure 3.7: Scene examples of the HNB and Equad environments.

Figure 3.8: Results for the indoor HNB testing: localization error and navigation error (in meters) plotted against the distance traveled along the route, with the segment boundaries marked. Graph labels are described in the text.

Figure 3.9: Results for the outdoor Equad environment, in the same format as figure 3.8. Graph labels are described in the text.

Chapter 4

Road Region and Boundary Estimation based Navigation

4.1 Introduction

The ability to navigate in one's environment is important for a fully autonomous mobile robotic system. One critical task in navigation is to be able to recognize and stay on the road. Many proximity sensors have been used to detect the road, such as the laser range finder (LRF) [58, 193], stereo cameras [57], and the Kinect [30]. However, these sensors have limitations. For one, the Kinect, which utilizes infra-red technology, does not work outdoors due to the presence of sunlight. Furthermore, although proximity information can be used to estimate the ground plane, it cannot directly recognize the shape and appearance of the road without the presence of bounding structures such as surrounding walls.

Monocular cues from a camera, on the other hand, have the versatility of being applicable in most environments, while readily encoding this road information, which is critical for autonomous navigation. As such, it is important to develop robust monocular road recognition techniques to enhance the available proximity-based navigation systems.

Figure 4.1: Beobot 2.0 performing autonomous navigation in an unknown environment. There are two characteristics of the road that are exploited by our system. One is the appearance similarity within the road (the color of the concrete), which can be compromised by shadows. The other cue is the vanishing point at the end of the road, marked by the yellow dot. The difficulty with vanishing point-based systems is their instability, as a small location shift can result in large shape changes of the road area closest to the robot, making it less suitable for navigation tasks.

There are a few main issues in using monocular vision for road recognition. For one, road appearance varies from place to place, which makes it difficult to define what a road is. Consequently, many navigational approaches limit their working environments to previously visited places through the teach-and-replay paradigm [23, 20]. These systems require the robot to traverse a specific environment during the training process before it can navigate on the same road during testing. In addition, they often also need to be trained under many lighting conditions to be illumination invariant. This is because these systems use local corner/keypoint features such as SIFT [90] or KLT [174], which are illumination specific.

There is, however, a set of approaches that do not require specific training for a particular environment. These systems try to recognize a road by exploiting its visual characteristics.
For example, a system by Rasmussen et al. [135] estimates the road as a triangular shape, and performs histogram differencing between the road region and its two flanking areas. The histogram is composed of a small number of bins obtained through k-means LAB color clustering on the image. Another system [82] uses both color and texture features to cluster the road area, and the subsequent classification confidence values to estimate the extent of the road. [138] relies on the salient nature of the road with respect to its surrounding regions, and grows the road region from a saliency map. One problem that region-growing based systems face is that they are less robust to shadows on the road. Observe figure 4.1 for an illustration of these challenges.

Figure 4.2: Overall visual road recognition system. The algorithm starts by inputting the camera image to both the road region and road boundary estimation modules. The road region estimator utilizes a graph-based segmenter, while the road boundary estimator relies on real-time vanishing point detection. Both modules output a road color average and the road center location, which are fused together by a Kalman Filter.

In addition, there is also a group of techniques [75, 108, 119] that try to recognize the road by using the road vanishing point. In general, these techniques rely on detecting converging edge directions to vote for the most likely vanishing point. The biggest problem we found was that the output location can be unstable when there are several equally likely vanishing points. Furthermore, because the vanishing point is far from the robot, a slight deviation can result in a large change in heading direction, making it unsuitable for navigation. It is also important to note that most vanishing point-based approaches are not real time and, thus, have not been tested on a robot for an autonomous navigational task.
In section 4.2.3, using a Kalman 86 Filter, wethencombine thesetwo algorithms toproducearobustestimation oftheroad, which is used to create control policy for autonomous navigation. 4.2.1 Road Region Estimation The input image size of 320x240 image is first downsampled to 80x60 to speed up the process and to filter out spurious noise input. The road region estimator then segments theimagetolargecontiguousregionsusing[38]andsearchforthesegmentedroadregion. Thegraph-basedalgorithmrecursivelymergeimageregionsiftheintra-regiondifferences is lower than the inter-region difference plus a constant k (we set k =400) Initially, the road center is selected from the center of the image, assuming the robot starts out on the road, facing the direction of the road heading. This is done by creating a small search window W fixed on the bottom center of the image (observe figure 4.3). The region that covers the largest area in W fixed , by the number of pixels, is chosen to be the road region. The system then computes the road color by averaging the pixel values in the region, which will be used to update an appearance prior estimated the Kalman Filter. In addition, the estimator also computes the road region center because it indicates the direction of the road. This is done by estimating a slope of a line that bisects the bottom portion of the road region (observe the yellow trace in figure 4.3). The reason of estimating only using the bottom part is because it is in the immediate front of the robot, which is much more critical for determining the current motion command. The far end of the region will be considered in future time steps, when the computation is repeated. 87 Road region selection Swerving Case Kalman Filter Road Color Road Center Normal Driving Case Fixed Center Window (Wfixed) Float Window (Wfloat) if color is not similar enough if both windows overlap by 60% Figure 4.3: Road region selection process. The estimator searches for the road region in the segmented input image using the center W fixed window if the robot is moving steadily in themiddle of the road. However, if the robot starts to swerve from thecenter of theroad significantly, it then switches toW float . The system switches back toW fixed , when the robot successfully re-centers itself. Aftereach frame, theestimator checks iftheoutputroadregionissimilar(innumber of pixel, coordinate location, and color) to the one produced in the previous frame. In occurrences where the robot swerves too far away from the center of the road, there is usuallyalargedeviation. Insuchcases, itswitchestoanidenticallysizedwindowW float , located at the previous road center estimation. The switch allows the system to react quickly to a dramatic robot movement. As therobot re-centers itself on the road,W float keeps track of the movement, and the estimator only switches back to W fixed when it sufficiently overlaps with W float . By normally using W fixed , the estimator provides a subtle bias to the select regions that are near the center of the image. 4.2.2 Road Boundary Estimation Figure4.4illustratestheprocessoffindingtheroadusingvanishingpointdetection. This module is inspired by [75], which does not readily run real-time. We, on the other hand, optimize the implementation and significantly modify many parts of algorithm to make 88 Find Vanishing Point Build Gabor Pyramids Vanishing Point Voting V p radius=0.35*Image diagonal Find Robust Edges Non-uniform Sampling γ Op θp Figure 4.4: Vanishing point detection module. 
The module first computes Gabor pyramids in 4 orientations and 3 scales. It then computes the edge direction confidence scores to select only the most robust edges. The module then votes for the vanishing point location using robust edges within 0.35*image diagonal below each respective vertical coordinate [75]. To speed up the process, the module performs a non-uniform sampling of the candidate locations. That is, it densely samples the locations around the vanishing points from the last 10 frames, and sparsely samples the rest of the locations.

it run at 100 ms/frame on our robot. One notable modification is to go from 36 orientations to a more realistic 4, which requires more robust processing throughout the algorithm for a comparable performance. In addition, instead of only using static information, we utilize a temporal filtering process to increase robustness.

Using the input image, the estimator computes dyadic Gabor pyramids G_{\alpha,s} for N_{angle} = 4 orientations \alpha (equally spaced between 0 and 180 degrees), at N_{scale} = 3 scales s. For each orientation, the estimator averages the responses from all scales (equation 4.1) to create a single map at the largest resolution:

\bar{G}_\alpha(i,j) = \frac{1}{N_{scale}} \sum_{s=0}^{N_{scale}-1} G_{\alpha,s}\left(\frac{i}{2^s}, \frac{j}{2^s}\right)    (4.1)

It then calculates the confidence value of the edge direction Gabor responses at each location by comparing how dominant the largest response SA(1) is with respect to the second largest SA(2):

Conf(i,j) = 1 - \frac{\bar{G}_{SA(2)}(i,j)}{\bar{G}_{SA(1)}(i,j)}    (4.2)

The pixels with high confidence values are selected to vote for the vanishing point location.

For each location candidate V, the estimator computes the likelihood by accumulating scores from a local neighborhood of pixels in a half-disk region P. The contributing score of each voting pixel p \in P is computed according to:

score(V,p) = \begin{cases} \frac{1}{1 + \gamma \cdot dist(V,p)} & \gamma \le 8^\circ \\ 0 & \text{otherwise} \end{cases}    (4.3)

with \gamma = \angle(Vp, O_p) denoting the angle between the line Vp and the dominant angle vector at p, illustrated in the Vanishing Point Voting sub-module in figure 4.4.

To speed up the process, we introduce a non-uniform sampling over the vanishing point locations. We densely sample the locations around (within 1/10 of the image dimensions) the estimated vanishing points from the last 10 frames, and sparsely sample the rest of the image. Observe the Non-uniform Sampling sub-module in figure 4.4. On average, the system evaluates only around 3 percent (a 30-times speedup) of all possible locations.

The estimator then selects the highest scoring vanishing point location and finds its left/right road boundaries by computing a likelihood score for each candidate ray in five-degree spacing. We propose two improved and more efficient scoring methods: the Gabor Response Variance Score (GRVS) and the color difference (CD). Observe figure 4.5 for an illustration.

Figure 4.5: Finding road boundaries. From the vanishing point, we extend rays at 5-degree spacing (29 rays spanning 140 degrees) and evaluate how likely each ray is a road boundary, using the color difference between the two 20-degree flanking regions R1 and R2 and the GRVS along the ray. Refer to the text for details.
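To make the edge-confidence test and the voting rule of equations 4.1-4.3 concrete, below is a minimal NumPy sketch. It assumes the per-orientation, scale-averaged Gabor magnitudes have already been stacked into a single array; the handling of the 180-degree orientation ambiguity and the helper names are our own assumptions, not the implementation used on Beobot 2.0.

```python
import numpy as np

def edge_confidence(gabor_stack):
    """Eq. 4.2: per-pixel edge-direction confidence from a (N_angle, H, W)
    stack of scale-averaged Gabor magnitudes (the averaging of eq. 4.1 is
    assumed to have been done beforehand)."""
    sorted_resp = np.sort(gabor_stack, axis=0)        # ascending along orientations
    first, second = sorted_resp[-1], sorted_resp[-2]  # SA(1) and SA(2)
    return 1.0 - second / np.maximum(first, 1e-6)     # guard against divide-by-zero

def vote_score(vp, p, dominant_angle_rad, gamma_limit_deg=8.0):
    """Eq. 4.3: contribution of a voting pixel p to a candidate vanishing
    point vp, given the dominant Gabor orientation (radians) at p.
    Points are (x, y) in pixel coordinates."""
    dx, dy = vp[0] - p[0], vp[1] - p[1]
    dist = np.hypot(dx, dy)
    if dist == 0.0:
        return 0.0
    gamma = abs(np.rad2deg(np.arctan2(dy, dx) - dominant_angle_rad)) % 180.0
    gamma = min(gamma, 180.0 - gamma)                 # fold orientation ambiguity (assumption)
    if gamma > gamma_limit_deg:
        return 0.0
    return 1.0 / (1.0 + gamma * dist)
```

Only pixels passing a confidence threshold would call vote_score for each candidate location, with the non-uniform sampling deciding which candidates are evaluated at all.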
First, we introduce a novel GRVS method, which measures the consistency of a line l using only N_{angle} = 4 Gabor orientations. The technique creates N_{angle} orientation bins B_\alpha that store the Gabor responses of the pixels on line l, each pixel assigned to the bin of its highest-response angle. The GRVS is then computed based on the bin with the highest count, B_{max}:

GRVS(l) = \frac{(|B_{max}| / |l|)^2}{std(B_{max})}    (4.4)

Here |l| refers to the total number of pixels on line l, while |B_{max}| and std(B_{max}) are the number of pixels and the Gabor magnitude standard deviation in bin B_{max}, respectively.

The advantage of GRVS is that it does not require many Gabor orientation bins to establish line consistency. The assumption is that if there exists a line, all the pixels on the line ought to have similar Gabor responses (indicated by a low standard deviation) and orientation (indicated by a clear majority in the bin counts), even if the dominant angle is not aligned with the vanishing point ray. Figure 4.6 illustrates the process.

Figure 4.6: Gabor Response Variance Score (GRVS). The Gabor responses of the points along a candidate road boundary line l are indicated by cyan bars. The table to the right tabulates the total count of dominant angles for each orientation. In this example the 45° bin wins out with |B_{max}| = 3 of the |l| = 5 pixels, and the result is used to calculate GRVS(l).

The color difference (CD) method calculates the RGB color average of the pixels within 20 degrees on each side of the ray, with the inner road region denoted as R_1 and the outer flanking area as R_2. We also tested the system using the HSV and LAB color spaces, and found no significant performance difference. Observe the Color Difference sub-module in figure 4.4 for an illustration. CD is computed using:

CD(R_1, R_2) = diff(R_1, R_2) - diff(R_{Road}, R_1)    (4.5)

Figure 4.7: Example images from each testing environment. We display 4 example figures for sites HNB, AnF, Equad, and SSL. Below each figure, we display the system's road segmentation estimate, denoted by the red shading. In addition, we also add the vanishing point estimate, denoted by the blue dot, with the extended green lines being the road boundaries. Also, we add a yellow line at the bottom of each image that points in the direction of the robot heading. As we can see, the system works on various road appearances and widths, and under many lighting conditions. Furthermore, it is also able to overcome various challenges, such as pedestrians and sporadic shadows.

The first term maximizes the contrast between the road and flanking areas R_1 and R_2, as measured by Euclidean distance in color space. The second term, on the other hand, minimizes the color difference between R_1 and the current Kalman Filter estimate of the road color R_{Road}. The GRVS and CD scores are multiplied to obtain the final likelihood score. We found that the combination of these two factors is more robust than either one alone in handling the variety of road appearances.

The process is repeated to find the second road boundary, before the module computes the average road color and the road middle point, located at the middle bottom of the area enclosed by the two boundaries.

4.2.3 Combining Road Region and Boundary Estimation

Both the road region and boundary estimation modules output the road color average and the road middle point as the state vector

X_k = \begin{bmatrix} R_{\{r,g,b\},k} \\ x_k \end{bmatrix}    (4.6)

where R_{\{r,g,b\},k} is the RGB color and x_k is the horizontal location of the middle point. The Kalman filter integrates these state vectors using a zero-acceleration prediction model, because the robot tends to stay on the road center while moving forward and the road color appearance remains similar to the previous frame.
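The following is a minimal sketch of such a fusion filter. The state layout follows equation 4.6 ([R, G, B, x_center]); the identity transition encodes the zero-acceleration assumption, while the noise magnitudes and the per-module measurement covariances are illustrative guesses rather than the tuning used in the actual system.

```python
import numpy as np

class RoadStateKF:
    """Minimal Kalman filter over the road state of eq. 4.6:
    X = [R, G, B, x_center].  Noise values are placeholders."""

    def __init__(self, init_state, q=1.0, r_region=4.0, r_boundary=9.0):
        self.x = np.asarray(init_state, dtype=float)   # [R, G, B, x]
        self.P = np.eye(4) * 10.0                      # state covariance
        self.Q = np.eye(4) * q                         # process noise
        self.R = {"region": np.eye(4) * r_region,      # per-module measurement noise
                  "boundary": np.eye(4) * r_boundary}

    def predict(self):
        # Identity transition: road color and center are expected to stay
        # roughly constant between consecutive frames (zero acceleration).
        self.P = self.P + self.Q

    def update(self, z, source):
        # Standard Kalman update with H = I (each module observes the full state).
        z = np.asarray(z, dtype=float)
        S = self.P + self.R[source]
        K = self.P @ np.linalg.inv(S)
        self.x = self.x + K @ (z - self.x)
        self.P = (np.eye(4) - K) @ self.P
        return self.x                                  # fed back to both modules
```

In use, the filter would be predicted once per frame and then updated with the observation from each module, e.g. kf.predict(); kf.update(region_obs, "region"); kf.update(boundary_obs, "boundary"), with the fused state serving as the road color prior R_Road and the road center estimate.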
The Kalman Filter estimation is then fed back to each respective module, thus indi- rectly incorporating information from one module to the other. Finally, the overall system computes the distance between the image center the road center, and use it as a differential input for a PID robot heading control to align the robot to the center of the road. 4.3 Testing and Results We test the system accuracy in recognizing the center of the road in one indoor (HNB) andthree outdoor(AnF,Equad, andSSL)environments. Observefigure4.7forexample scenesandtable5.1forpertinentinformationabouteachsite. Wecarryouttwodifferent testing procedures: one to assess the accuracy of the system in recognizing the center of the road (section 4.3.1), done offline and compared against ground-truth, while the second is for autonomous robot navigation (section 4.3.2). 94 4.3.1 Visual Road Recognition We manually drive therobot through the environments to collect testing videos. In each frame,wethenannotatetheleftandrightboundaries,andextendingthemtomeetatthe vanishing point. We then calculate the road center point using the technique explained at the end of section 4.2.2. Onsomeframes, itishardertoclearlylabelthetwosidesoftheroad, incasessuchas an intersection. For this experiment, we skip these frames altogether because they have more than one heading option. In autonomous navigation, whenever the robot has to makeadecisionnearanintersection, itselectsthenearestroadtotheestimated heading. The error is defined as the deviation between the human annotated and system estimated road center point. In our robot camera setup, one pixel at the bottom of the image is equivalent to one centimeter. We report the experiment results in figure 4.8. We include the results of our road region and boundary estimation, as well as the combined performance. In addition, we also compare our system with two other road recognition systems: [135], which primarily uses region color appearance to find the road, and [75], which detects the vanishing point in the image. Note that our system runs real time, 100 ms per frame, which is critical for autonomous navigation tasks. As a comparison, our implementation of [135] runs 780ms per frame, while [75] runs 43 seconds per frame. All experiments for this section is run on a 2.2GHz Intel Core2 Duo processor. Overall, our full system is able to estimate the road center reliably in all testing environments as its average error is under 5 percent when compared to the road width. 95 Table 4.1: Experiment Sites Information HNB AnF Equad SSL Traversal Length 36.67m 105.75m 138.27m 326.00m Total Frames 3919 7217 5371 8569 Road Width 1.5m 2.12m 3.1m 4.47m For example, in AnF, the average accuracy is 9.41cm, while the road width is 2.12m. Furthermore, our full system produces the best overall result, when compared to the individual algorithms within the system. Although the full system may not always produce the best result for every environment, it is the most stable when compared to the individual algorithms. The road region estimator, which gives the best result in HNB and AnF, tends to perform optimally in environments with uniform road appearance. However, it is less effectiveonroadswheretherearemoreinconsistenciesbecauseoffactorssuchasshadows. This case is illustrated by the example in the fifth column of figure 4.9. 
On the other hand, the road boundary estimator has higher tolerance to many road appearances, but often with lower accuracy, such as the slightly off right boundary detection in the forth column of figure 4.9. The overall system, however, balances these characteristics as it approachtheaccuracyofroadregionestimatorinuniformroads,butwiththerobustness of road boundaryestimator to provideusable navigation results even in very challenging conditions. In addition we also compare our individualalgorithms with similar alternative meth- ods. Much like our road region estimator, the system by [135] also primarily utilizes color, rather than edge information, to detect the road. However, their system uses a fixed set of triangular templates as shape prior. Ours, on the other hand, segments the 96 10 20 30 HNB AnF Equad SSL Total Region Boundary Full System Rasmussen, etal. [9] Kong, etal. [12] Error (cm) Figure 4.8: Road Recognition Results. The bar graphs illustrate the performance of the different algorithms in recognizing road in different environments. Performance is measured by the difference between the ground truth and estimated road center in cm. Wereporttheresultsbytheindividualalgorithmsinourapproach, roadregion androad boundary estimation, and the overall system. In addition, we also include implementa- tions [135] and [75]. Note: shorter bars are better, the shortest in each group denoted by a star. road region without a pre-conceived notion of road shape, and directly extract the road middle point for navigation. The system [135] is accurate if the triangular road shape is sufficiently visible with similar flanking areas. However, in urban environments, the flanking areas are more complex. For example, one side can be similar to the road, such as in column three of figure 4.9. Our road region estimator, on the other hand, is not constrained with a particular shape and can adapt to this difficulty as can be seen in the example. Its drawback, however, is that it fails when a significant part of the road is covered by a shadow, such as in th fifth column of figure 4.9. Here, the estimator splits the road into multiple segments and selects the one that appears most similar to the color prior. In this case, a shape prior would be helpful, which in our overall system comes from the road boundary estimator. We also compare our road boundary estimator with [75]. These two vanishing point- based algorithms differ in that the former utilizes fewer number of edge orientations 97 Road Region Estimator Road Boundary Estimator Full System Rasmussen, etal. [9] Kong, etal. [12] Figure 4.9: System Comparison Examples. The different systems outputs are organized todifferentrowswiththefirstthreerowsoccupied,inorder,bytheindividualalgorithms in our approach, road color region detection and road boundary estimation, and then the overall system. In addition, we also include an implementation of systems by [135] (forth row) as well as [75] (fifth row). than the latter, 4 and 36, respectively. In our experiments, we find that even with a small numberof orientations, our system maintains the same level of performance, while operating in real-time. This is because a lot of the edge information is repeated between neighboring angles. We also found that the two diagonal Gabor angles (α = 45 ◦ and 135 ◦ ) contribute the most in the vanishing point voting. 
Another important factor that we added is temporal information, which we found to be critical because static edge information can be noisy and often times makes it hard for a system to distinguish the true vanishing point from the false ones. By adding temporal continuities our estimator is able to discard sporadic false positives and improve its accuracy. Both algorithms, however, have a difficulty in specifying the road shape when only oneboundaryis viewed in the image (sixth column of figure4.9) whentherobot swerves 98 Table 4.2: Autonomous Navigation Performance Results HNB AnF Equad SSL Total Path Length 36.67m 59.09m 64.79m 155.24m 316.60m Auto. Drive 36.67m 57.50m 62.12m 153.45m 309.74m Manual Drive 0.0m 1.58m 2.38m 1.78m 5.74m Performance 100% 97.33% 95.87% 98.84% 98.19% slightly off course. What usually occurred is that the true road vanishing point does not receive enough support. With the help of the road region estimator the overall system can alleviate the error and still outputs a road center much closer to the ground truth center. Given these results we proceed to test system for autonomous navigation. In such scenario, any erroneous deviation produced in one frame will be amplified in subsequent frames if the robot does not make the proper correction. 4.3.2 Autonomous Navigation We test the system using our robot, Beobot 2.0 [150], which is 57cm in length and 60cm in width. The camera itself has a regular 60 degree field-of-view and is mounted 117cm abovetheground,atthecenteroftherobot. Wemeasureautonomousnavigationsuccess by calculating the percentage of distance in which the robot is driving autonomously. Observe figure 4.10 for an example trajectory. We run the robots five times on each site and average its performance. We selected the longest continuous path within each site so that the robot does not have to make decisions at the intersection. Table 4.2 reports the length of the route as well as the percentage of trajectory in which Beobot 2.0 is driving by itself. 99 Start End Total Length 60.54m Figure 4.10: Example trajectory of autonomous navigation using the presented visual road recognition system. Here Beobot 2.0 has to traverse through a 60.54m route. The autonomousnavigationisindicatedbythegreentrail,whiletheremotecontrol(torescue the robot from hitting a wall) is colored in red. The reason why the robot moved to the right is because it decided to take the right turn on a road that suddenly decreases its width by a half. In addition, we also the added a few example images to show the different parts of the route, where we see that the robot has to manage its path around people and other vehicles. From the table, we can see that the system rarely fails, averaging 98.19 percent of autonomous driving. One reason for the failure was that the robot was stuck in a small opening, while going through a door. This occurred because the door frame protrusion was thin, and it quickly went out of the camera field of view. When the robot saw the frame, it tried to avoid it for a short period, until the door was out of the field of view. But then it saw what was beyond the frame, and tried to re-center itself using that portion of the road, not knowing the door frame was in its blind spot. We believe, this case can be improved by better modeling of the environment, using techniques such as SLAM (Simultaneous Localization and Mapping). Another problem that we encountered was that the road concrete some times ex- truded out to a bench seating or a bike parking area. 
Most times, the robot was able to stay on the road and move beyond the extruded area. However, once or twice, the robot would move toward the extruded area such as in figure 4.7, specifically just prior 100 to the red section of the trajectory. As the robot traversed through the extruded area, its lateral location tend to be biased toward the extrusion to center itself on the road. At some point, the left boundary of the road would be move out of the field of view, while the extrusion itself projected to a triangular shape. The system then fit a road to thecorner of theextrusion, making therobot deviate from theroad. Inthis case a wider angle lens or an omni-directional camera would allow the system to maintain a view of the true road, and prevent such situation. 4.4 Discussion and Conclusions In this chapter we present a novel monocular image-based road recognition system that fuses road region and boundary estimation. By using two complementary ways to char- acterize the road, we are able to produce a system that is robust in various indoor and outdoor environments. In the road region estimation, we introduce an straight-forward segmentation-based technique to isolate the road without shape constraints. On the other hand, in the road boundary estimation, we propose an efficient vanishing point detection, to allow our system to run in real-time on a robot. Forexample,weaddanon-uniformsamplingmechanisminthevanishingpointvoting step to reduce computation time. In addition, we demonstrate that even with a reduced number of orientations, our system still maintain its accuracy because of the use of a new Gabor Response Variance Score (GRVS) method. Furthermore, we also compare our algorithm with other approaches, and show that it performs better than others in our dataset. 101 Finally, our contribution also includes a series of field tests to show that our system is able to autonomous navigate in both indoor and outdoor environments by reliably detecting the navigable road area. 102 Chapter 5 Hierarchical Map Representation for Heterogeneous Sensor Fusion 5.1 Introduction In this chapter, we present a fully autonomous localization and navigation system for a human-sizedservicerobotoperatinginunconstrainedpedestrianenvironments,primarily using monocular vision and complemented by LRF-based obstacle avoidance. Our main contributions are four-fold: • Using vision algorithms (visual attention, landmark recognition, gist classification, localization, road finding) to compute localization and navigation cues that are robust in busy outdoor pedestrian areas; • Developing a framework for the integration of these cues using a proposed hierar- chical hybrid topological/grid occupancy map representation. 103 • Exploiting the hierarchical map, along with applying techniques such as forward projection and tracking, to resolve differences in algorithmic complexity, latency, and throughput, to create an overall real-time robotics system. • Testing over more than 10km total traveled distance, with routes 400m or longer, to demonstrate that the approach indeed is viable and robust even in crowded environmentslikeauniversitycampus. Asfarasweknow,thisisthemostextensive evaluation of a service robot localization and navigation system to date. The system is described in section 5.2 and tested in section 5.3. We then discuss our findings in section 5.4. First, we describe related vision localization and navigation research in the next section. 
5.2 Design and Implementation The flow of our overall system is illustrated in figure 5.1. The left side of the image depicts our mobile platform, Beobot 2.0, with the utilized sensors – a forward-pointing camera, a Laser Range Finder (LRF), wheel encoders, an IMU – as well as motors to move the robot. Our IMU [99] is also equipped with a compass that makes magnetic measurements. It is placed high above the computers and motors to minimize magnetic distortion induced by the robot. Also note that ferrous obstacles in the environment may affect the IMU. The two major sub-systems, localization and navigation, are displayed in the middle of the figure. The localization system (the top of the column) is described in section 5.2.1, while the navigation system (bottom) is in section 5.2.2. Because earlier versions 104 Local Navigation Map BeoLRF BeoIMU Encoders Differential Drive Wheel Navigation Sensors Fused Odometry Estimator + Visual Road Recognition Perception Actuation BeoCamera Occupancy Grid Mapping Motion Control Beobot 2.0 Global Localization Map 5 4 7 10 11 12 8 9 6 2 3 1 Gist Extraction Salient Landmark Matching Segment Classification Saliency odometry Odometry grid map road heading segment estimation landmark matches robot location desired linear and angular velocities motor command Figure 5.1: Diagram for the overall localization and navigation system. We run our sys- tem on a wheelchair-based robot called the Beobot 2.0. As shown on the sensor column, the system utilizes an IMU, camera, Laser Range Finder (LRF), and wheel encoders. Weusethecameratoperformvisionlocalization byintegratingresultsfromsalient land- mark recognition and place recognition. For navigation we use visual road recognition for estimating the road heading, and LRF-based grid-occupancy mapping for obstacle avoidance. Thesystemutilizesatopological mapforgloballocationrepresentationanda grid-occupancy map for local surroundingmapping. The grid map, which is represented by the blue rectangle on the topological map, is connected with the topological map by the intermediate goal (green dot). ofthesesub-systemshavebeendescribedpreviously[155,151],hereweprovidesummary descriptions of their technical details, instead focusing on information flow, timing, and interactions between sub-systems, which arekey to designingacomplete workingmobile robot system. Note that both sub-systems utilize the fused odometry. The rightmost column in the figure is the hierarchical spatial representation of the robot’senvironment. Thefirstlevelisaglobaltopologicalmapthatcompactlyrepresents all the needed metric information for global path planning. The next one is a local grid- occupancy map, which details the robot surroundings. Mechanically, the local map is 105 connected to the global map through an intermediate goal location that is registered in both maps (observe figure 5.1). The task of the navigation system is simply to safely reach the intermediate goal. The local map, which is robot-centric, is then advanced along the current graph edge in the global map, to always keep the robot at a fixed location in the local map. For each sub-system, we describe not only the algorithm and its pertinent compo- nents, but also the timescale of the implementation. By analyzing the timescale, we demonstrate the advantages afforded by the hierarchical map construction, as well as the mechanisms that are utilized to overcome long latencies of some components. 
We then continue to the overall system and examine how it accommodates the different component latencies to ensure a timely system execution. We start the detailed description with the vision localization system. 5.2.1 Localization System We summarize the vision localization system, which is inspired by previously published work [155]. We focus on the mechanisms that allow it to run in real time: multi-level hierarchical and multi-perception localization. As illustrated in figure 5.2, the system is divided into three stages. The first stage takes an input image and extracts low-level VisualCortex-likefeatures,consistingofcenter-surroundcolor, intensity, andorientation channels. These features are used by two parallel processes to extract the “gist” and salient points of the scene, two extensively studied human visual capabilities. We define the gist features as a low-dimensional vector (compared to raw image pixel array) that represents a scene and can be acquired over very short time frames [152]. Saliency, on 106 the other hand, is computed as a measure of interest at every image pixel, to efficiently directthetime-consuminglandmarkmatchingprocesstowards themost likely candidate locations in the image. The gist features and salient regions, along with fused odometry, are input to the back-end Monte-Carlo localization [171, 43] algorithm to estimate the robot’s position. 9 8 6 5 3 4 1 2 Fused Odometry Estimator + Salient Landmark Matching Segment Classification Odometry Tracking Monte-Carlo Localization Topological Map 7 Figure 5.2: Diagram of the vision localization system. The system takes an input image and extracts low-level features in the color, intensity, and orientation domains. They are further processed to produce gist features and salient regions. The localization system thenusesthese features forsegment classification andsalient landmarkmatching (respectively),alongwithfusedodometry,toestimatetherobot’slocationusingaMonte- Carlo framework. The topological map, which currently is manually constructed, is represented by a graph G ={V,E}. The nodes v ∈V are augmented with metric information specifying theircoordinatelocations. Accordingly, theedgecostW e foreachedgee∈E issettothe distancebetweenitsrespectiveend-nodes. Thesystemalsodefinessegmentss⊆E, each an ordered list of edges, with one edge connected to the next to form a continuous path. This grouping is motivated by the fact that the views/layouts in a segment are similar 107 and can be classified using gist features. An example of a segment is the highlighted three-edge segment 2 (thicker and in green) in the map in figure 5.2. The term salient region refers to a conspicuous area in an input image depicting an easily detectedpartoftheenvironment. Anidealsalient regionisonethatispersistently observed from different points of view and at different times of the day. In addition, the set of salient regions that portray the same point of interest is considered jointly, and thatsetiscalledalandmark. Torecognizealandmarkandtoassociateitwithalocation, the system employs a training regimen by manually driving the robot through all the paths in the environment. The produced database is quite large, about one landmark per meter of path. As a result, unlike the instantaneous segment classification, the landmark recognition process can take seconds. However, as opposed to just localizing to the segments, salient landmarks produce coordinate locations. 
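To make the map representation concrete, here is a minimal, hypothetical container for the augmented graph described above: nodes carry metric coordinates, edge costs equal the metric distance between their end-nodes, and a segment is an ordered list of connected edges. It is a sketch of the data layout, not the actual map code.

```python
from dataclasses import dataclass, field
from math import hypot

@dataclass
class Node:
    x: float      # metric coordinates of the node (meters)
    y: float

@dataclass
class Edge:
    u: int        # indices of the two end-nodes
    v: int
    cost: float   # edge weight W_e = distance between the end-nodes

@dataclass
class TopologicalMap:
    nodes: list = field(default_factory=list)      # list of Node
    edges: list = field(default_factory=list)      # list of Edge
    segments: list = field(default_factory=list)   # each: ordered list of edge indices

    def add_edge(self, u, v):
        # Edge cost is the straight-line distance between the end-nodes.
        a, b = self.nodes[u], self.nodes[v]
        self.edges.append(Edge(u, v, hypot(a.x - b.x, a.y - b.y)))
        return len(self.edges) - 1

    def segment_length(self, s):
        return sum(self.edges[e].cost for e in self.segments[s])
```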
Thelocalization systemtakes advantage ofthecontrastingsegment classification and landmark matching by coarsely but quickly estimating the robot’s segment location in everyvideoframe. Itthenrefinestheestimationtoamoreaccuratemetriclocationwhen the salient region matching results are available. In addition, the system also speeds up the database matching process by using contextual information such as current location belief and segment classification output. It orders the database items to be compared from the most likely to be positively matched to the least. This helps because the search ends, for efficiency reasons, after the first match is found. Also, to keep the database match results relevant despite the processing latency, the system stores the accumulated odometry during the landmark matching process. When the results are 108 ready, the landmark recognition module adds the longitudinal movement offset before sending the resulting location to the back-end Monte-Carlo localization. Within the Monte-Carlo framework, we formulate the location state S t as a set of weighted particles S t = {x t,i , w t,i }, i = 1 ... N at time t, with N being the number of particles. Each particle or possible robot location x t,i = {k t,i ,l t,i }, with k being the segment number and l the length traveled along the segment edges normalized by the segment length (from 0.0 at the start of the segment to 1.0 at the end). Because the localization formulation is over the graph edges but with exact metric location specifi- cation, the search space is effectively continuous one dimensional. That is, even though the locus of the considered locations describea two dimensional geographical space, it is constrainedtowithinthepathedges. Thus,asmallandcomputationally lessdemanding N = 100 suffices, even in kidnapped robot instances. Also, with this formulation, we specify a path as the exact location of the end points and a list of edges that connects them. ThesystemestimatesthelocationbeliefBel(S t )byrecursivelyupdatingtheposterior p(S t |z t ,u t ) using segment estimation z ′ t , landmark matching z ′′ t , and motion measure- ments u t . In addition, the system invokes importance resampling whenever more than 75% of the particles collapse to within the robot width of 60cm. It chooses 95 percent of the total N particles from the existing population by taking into account their weights w t,i , while randomly generates the remaining 5 percent. The latter is done by randomly choosing an integer among the segment numbers for k t,i and decimal value between 0.0 and 1.0 for l t,i . 109 To robustly measure robot motion, we implement a Kalman filter [71], which fuses the wheel encoders and IMU. The filter’s state space is the current absolute heading θ t and distance traveled d t . Using these estimated values, we compute the longitudinal distance traveled u t = d t ∗cos(θ t −θ ψt ), with θ ψt being the absolute road direction estimate, discussed in section 5.2.2.1 below. Thesystemthencomputesthemotionmodelp(S t |S t−1 ,u t )byadvancingtheparticles by u t and adding noise to account for uncertainties such as wheel slippage that can produce large odometry errors. This is done by drawing a random location from a Gaussian probability density p(x t,i |u t ,x t−1,i ), where the mean is the previous location x t−1,i plus the odometryu t , and the standard deviation is 3cm (about 1/6th of a typical single step). 
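A minimal sketch of this prediction step is shown below. The (segment, normalized length) particle state, the longitudinal odometry u_t, and the 3 cm noise follow the text; the clamping at segment ends is a simplification of the hand-off rule described next.

```python
import random

def predict_particles(particles, u_t, seg_lengths, sigma=0.03):
    """Motion update for (k, l) particles: each particle advances by the
    longitudinal odometry u_t (meters) plus Gaussian noise with a 3 cm
    standard deviation, normalized by its segment length.  Segment hand-off
    at the ends is simplified here (clamped); the full system re-assigns
    the segment as described in the text that follows."""
    new_particles = []
    for k, l in particles:
        noisy = random.gauss(u_t, sigma)          # draw from N(u_t, sigma^2)
        l_new = l + noisy / seg_lengths[k]        # normalize by segment length
        l_new = min(max(l_new, 0.0), 1.0)         # clamp: hand-off handled elsewhere
        new_particles.append((k, l_new))
    return new_particles
```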
If, in the process, a particle moves past either end of a segment, we change the segment number appropriately and normalize the excess distance from the end of the original segment. If the original segment ends at an intersection with multiple continuing segments, we randomly select one. If no other segment extends the path, we leave the particle at the end.

For the visual observations, the segment estimator [152] uses the gist features to classify the input scene and outputs the likelihood z'_t = {\phi_{t,j}}, j = 1 ... N_{segment}, with \phi_{t,j} being the likelihood value for segment j at time t. The system then updates the weight w_{t,i} of each particle x_{t,i} to:

p(z'_t | x_{t,i}) = \frac{\phi_{t,k_{t,i}}}{\sum_{j=1}^{N_{segment}} \phi_{t,j}} \cdot \phi_{t,k_{t,i}}    (5.1)

Here, the likelihood that particle x_{t,i} observes z'_t is proportional to its segment location estimation value \phi_{t,k_{t,i}} over the total sum (first term) times the value itself (second term). The first term accounts for the segment's relative strength, while the second preserves the ratio with respect to the absolute 1.0 maximum value.

The landmark matching process, which is illustrated in figure 5.3, compares incoming salient regions to the landmark database regions using two sets of signatures: SIFT keypoints [90] within the region bounding box, and a salient feature vector. The former denotes the shape of the region, while the latter quantifies its saliency characteristics. We employ a straightforward SIFT-based recognition system [90] using the suggested parameters, while the salient feature vector matching is described here.

Figure 5.3: Matching process of two salient regions using SIFT keypoints (drawn as red disks) and the salient feature vector, which is a set of feature map values taken at the salient point location (yellow disk). The lines indicate the keypoint correspondences. The fused image is added to show alignment.

A salient feature vector is a set of feature map values [67, 63] from the color, intensity, and orientation channels. These values are taken from a 5-by-5 window centered at the salient point location of a region. In total, there are 1050 features: 7 sub-channels times 6 feature maps times a 5×5 array. To compare the salient feature vectors of two regions \rho_1 and \rho_2, we factor in the feature similarity sfSim (equation 5.2) and the salient point location proximity sfProx (equation 5.3) after aligning the regions (observe the fused image in figure 5.3). The former is based on Euclidean distance in feature space:

sfSim(\rho_1, \rho_2) = 1 - \frac{\sqrt{\sum_{i=1}^{N_{sf}} (\rho_{1,i} - \rho_{2,i})^2}}{N_{sf}}    (5.2)

with N_{sf} being the feature dimension. For a match to be confirmed, the feature similarity has to be above 0.75 out of the maximal 1.0. The location proximity sfProx, on the other hand, is the Euclidean distance in pixel units (denoted by the function dist), normalized by the image diagonal length \eta:

sfProx(\rho_1, \rho_2) = 1 - \frac{dist(\rho_1, \rho_2)}{\eta}    (5.3)

The matching threshold for sfProx is 95%, or within 5% of \eta.

The resulting matches are denoted as z''_t = {o_{t,j}}, j = 1 ... M_t, with o_{t,j} being the j-th of the total M_t matched database regions at time t. The system computes the likelihood of simultaneously observing the matched regions at a given particle location x_{t,i} by updating the weight w_{t,i} = p(z''_t | x_{t,i}):

p(z''_t | x_{t,i}) = \prod_{j=1}^{M_t} p(o_{t,j} | x_{t,i})    (5.4)

Because each salient region match observation is independent, the system can multiply the individual likelihoods to calculate the joint probability.
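As a concrete reading of equations 5.2 and 5.3, the sketch below implements the two similarity tests and the combined acceptance rule with the thresholds quoted above (0.75 for sfSim, 0.95 for sfProx). The function names are ours, and the SIFT keypoint matching that runs alongside this check is not shown.

```python
import math

def sf_sim(v1, v2):
    """Eq. 5.2: feature similarity between two salient feature vectors
    (1050-dimensional in the text); 1.0 means identical."""
    n = len(v1)
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))
    return 1.0 - dist / n

def sf_prox(p1, p2, image_diagonal):
    """Eq. 5.3: salient point proximity after region alignment,
    normalized by the image diagonal length eta."""
    d = math.hypot(p1[0] - p2[0], p1[1] - p2[1])
    return 1.0 - d / image_diagonal

def regions_match(v1, v2, p1, p2, image_diagonal,
                  sim_thresh=0.75, prox_thresh=0.95):
    """A region pair is accepted when both thresholds from the text are met."""
    return (sf_sim(v1, v2) >= sim_thresh and
            sf_prox(p1, p2, image_diagonal) >= prox_thresh)
```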
Each p(o t,j |x t,i ) is modeled by a Gaussian where the likelihood value is set to the probability of drawing a length longer than the distance between the particle and the location where the matched database 112 region is acquired. The standard deviation of the Gaussian is set to 5% of the map diagonal to reflect increased level of uncertainty for larger environments. The system then estimates the robot current location using the highest weighted particle. We illustrate the temporal integration of the localization components with a time- line in figure 5.4. The localization system, along with other sub-systems, uses a sub- scribe/publish architecture, where each component is an independent process. The out- putting localization module, which is highlighted at the last row, processes observations from the components as soon as they are available. This allows the system to not be explicitly tied to a rigid time step synchrony. However, because odometry and segment estimation come in regular intervals, the system appears to produce output regularly from the outside perspective. Camera (30Hz) 100 33.33 66.67 Time (ms) 0 Fused Odometry (15Hz) Raw features, gist, saliency (30Hz) Segment estimation (30Hz) Landmark matching (.33Hz) Odometry tracking (.33Hz) Localization (30Hz) . . . . . . 3000 Figure 5.4: Diagram of the localization sub-system timeline. The figure lists all the localization sub-system components: two sensors, four perceptual processing steps, and a back-end localization estimation. Please see the text for explanation. The top section of the figure (the first two rows) lists the utilized sensors: a camera, which runs 30Hz, and the 15Hz fused odometry. Note that the fused odometry uses a Kalman Filter tocombine information from the80,000Hz encoders and50Hz IMU. Even 113 though odometry is reported less frequently, the information latency is closer to that of the encoders. Furthermore, if we consider the robot speed of under 1m/s, a 15Hz or 66.67ms period latency only amounts to 7cm localization error. Below the sensors are the perceptual components. The raw feature extraction, gist, andsaliency computations start afterbeingtriggered bythecameraoutput, asindicated by the arrow from the first row. The module, which is implemented in parallel over multiple processing cores, takes about 25ms (noted by the length of the solid maroon block). The process then idles until the next image arrives. For each module, we state its throughput in parentheses, with the output indicated by an arrow that drops down from the right end of the processing block. Note that there are also a number of arrows that seem to do the same from the left side of a block. They are actually pass through arrows from their respective original processes. The first arrow from Fused Odometry, forexample, isprocessedbybothOdometryTrackingandLocalization. However, itmay look as if Localization takes an input from Odometry Tracking. Segment classification, which takes less than 1ms to compute, passes its result to landmark matching for search primingand localization module for location update. The landmark matching process, on the other hand, takes three seconds on average (.33Hz), despite a multi-threaded landmark search implementation. However, because odometry tracking keeps track of where the original image is being triggered, the long latency is accounted for. Thefinallocalization moduleitselfoutputsresultsat30Hz withtheinput image and, subsequently, segment estimation being the trigger. 
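One way to picture the latency handling described above is a small odometry tracker that accumulates motion while a slow module (here, the roughly three-second landmark matching) works on an old frame, then shifts the delayed result forward before it reaches the localization back-end. The class below is a hypothetical sketch of that bookkeeping, not the actual interface.

```python
class OdometryTracker:
    """Hypothetical latency-compensation bookkeeping: record the longitudinal
    odometry accumulated since a frame was handed to a slow process, and
    advance the delayed result by that amount when it becomes available."""

    def __init__(self):
        self.pending = {}          # frame_id -> accumulated odometry (meters)

    def start(self, frame_id):
        self.pending[frame_id] = 0.0

    def add_odometry(self, u_t):
        # Called at the fused-odometry rate (15 Hz in the text).
        for frame_id in self.pending:
            self.pending[frame_id] += u_t

    def finish(self, frame_id, matched_location_l, segment_length):
        """Return the matched location advanced by the motion that occurred
        while the match was being computed (in normalized segment units)."""
        offset = self.pending.pop(frame_id)
        return matched_location_l + offset / segment_length
```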
A critical point that enables localization to effectively operate in real time is that each perceptual inputalready accounts for its own internal latency. Inaddition, because 114 the topological map formulation is one dimensional along the graph edges and needs only 100 particles to localize, the localization computation itself takes less than 1ms. Thus the system is not burdened by the frequent observation updates, which allows the location belief to be accurate to the most recent input. Furthermore, the system can still be expanded with new observations and modalities of any timescale as long as their latencies are accounted for. Most importantly, the localization system timely mode of operation allows the full system to integrate it with real-time navigation. 5.2.2 Navigation System Thecomponentsofthenavigationsystemarethevisualroadrecognition[151]toestimate the road direction (described in section 5.2.2.1), grid map construction (section 5.2.2.2), and path planning to generate motor commands (section 5.2.2.3). In the last part of section, we put the whole system together, integrating both localization and navigation. Thelocalmaprepresentstherobot’ssurroundingsbyusinga64by64gridoccupancy map [113], with each grid cell spanning 15cm by 15cm spatial area for a 9.6m by 9.6m local map extent. We select 15cm or 1/4 the robot size (our robot is 60cm in length and width) because it is a balance between sufficiently detailed depiction and being manageable for real-time update. As shown in figure 5.1, the robot is displayed as a red rectangle, and is placed at the local maphorizontal center andthreequartersdownthevertical lengthtoincrease front- of-robot coverage. In addition, there is a layer of grid surrounding the map to represent goal locations that are outside the map. The figure shows the goal to be straight ahead, 115 which, byconvention, isthedirectionoftheroad. Thusalargeportionofmotion control is trying to minimize the deviation between the robot and road heading. Given the robot autonomous speed of 0.66m/s average and 0.89m/s maximum, the 7.2m (0.75 x 64 x 15cm) mapping of the robot front area affords at least 8.1 seconds of forward look-ahead to properly react to potential static obstacles. On the other hand, dynamic obstacles such as people have to be evaluated with respect to the local map updaterateof20Hz. Here, usingtheaveragehumanwalkingspeedof1.4m/s, thesystem latency of as much as 100ms translates to 0.14m movement or within one grid size. 5.2.2.1 Visual Road Recognition Our visual road recognition system [151], which is illustrated in figure 5.5, is a novel vanishing point (VP) -based algorithm that uses long and robust contour segments, instead of the smaller and often noisier edgels. In addition, much like the localization system, it also implements atracking mechanism toensurethat it runsinreal time. The output of the system is the absolute road directionθ road . Note that the road recognition system does not make use of odometry information. The system takes the input image and performs Canny edge detection to create an edge map. From here it has two ways to recognize theroad. One (thetop pipelinein the image) is through a full but slower recognition process, which detects segments in the edge map, votes for the most likely VP, and extracts the road lines that are extended from the VP. On the other hand, the bottom pipeline runs in real-time because it only tracks the already available road lines to update the VP location. 
Because the system has both mechanisms, it balances accuracy and timeliness.

Figure 5.5: Overall visual road recognition system. The algorithm starts by computing a Canny edgemap from the input image. The system has two ways to estimate the road. One is the slower full recognition step (the top pipeline), where it performs road segment detection. Here, the segments are used to vote for the vanishing point (VP) as well as to extract road lines. The second (bottom) is by tracking the previously discovered road lines to update the VP location as well as the robot lateral position. The tracker is also used to project forward output road lines from the top pipeline, which may fall behind the camera frame rate. The road recognition system then outputs the VP-based road heading direction as well as the lateral position to the navigation system to generate motor commands.

The recognition pipeline uses the Hough transform to find straight segments (OpenCV [16] implementation) in the edgemap. The process then filters out segments that are above the manually calibrated horizon line, near horizontal (they usually hover around the horizon line), or near vertical (usually part of close-by objects or buildings). The remaining segments vote for candidate VPs on the horizon line, regularly spaced 1/16th of the image width apart. To cast a vote, each segment is extended along a straight line to intersect the horizon line (observe figure 5.6). The voting score of segment s for a vanishing point p is the product of the segment length |s| and the inverse proportion of the proximity of p to the intersection point of the extended line and the horizon line, denoted by hintercept(s) in the equation below:

score(s,p) = \left(1.0 - \frac{|hintercept(s), p|}{\mu}\right) \cdot |s|    (5.5)

with \mu set to 1/8th of the image width. \mu is also the voting influence limit: any segment farther away is not considered. Note that the exact values of the constants in the algorithm do not significantly affect overall system performance.

Figure 5.6: Vanishing point (VP) voting. The VP candidates are illustrated by disks on the calibrated horizon line, with radii proportional to their respective accumulated votes from the detected segments. For clarity, the figure only displays segments that support the winning VP. A segment s contributes to a vanishing point p in proportion to its length and inversely to the distance between p and the intersection point of the horizon line and a line extended from s, labeled as hintercept(s).

To improve the robustness of the current VP estimate (denoted as \psi_t), the system multiplies the accumulated scores with the inverse proportion of the proximity to the previous VP location \psi_{t-1}:

\psi_t = \arg\max_p \sum_s score(s,p) \cdot \left(1.0 - \frac{|p, \psi_{t-1}|}{\mu}\right)    (5.6)

We then use the segments that support the winning VP, indicated in red in figure 5.6, to extract lines for fast road tracking. The system first sorts the supporting segments based on their lengths. It then fits a line equation through the longest segment using least-squares, and iteratively adds the remaining nearby segments (if all edgels in the segment are within five pixels) to the line, always re-estimating the equation after each addition. Once all segments within close proximity are incorporated, the step is repeated to create more lines using the unclaimed segments, processed in length order.
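A compact sketch of this segment-based voting (equations 5.5 and 5.6) is given below. The data layout and the clipping of the temporal factor to zero for candidates farther than \mu from the previous winner are our assumptions.

```python
import numpy as np

def vote_vanishing_point(segments, vp_candidates_x, horizon_y,
                         prev_vp_x, image_width):
    """Eqs. 5.5-5.6 sketch.  Each segment is ((x1, y1), (x2, y2)) below the
    horizon; candidates lie on the horizon line.  mu (1/8 image width) is
    both the vote falloff and the temporal bias radius."""
    mu = image_width / 8.0
    scores = np.zeros(len(vp_candidates_x))
    for (x1, y1), (x2, y2) in segments:
        if y1 == y2:
            continue                                   # near-horizontal: already filtered
        # x where the segment's supporting line crosses the horizon line
        t = (horizon_y - y1) / (y2 - y1)
        hintercept = x1 + t * (x2 - x1)
        length = np.hypot(x2 - x1, y2 - y1)
        for i, px in enumerate(vp_candidates_x):
            d = abs(px - hintercept)
            if d < mu:                                 # voting influence limit
                scores[i] += (1.0 - d / mu) * length   # eq. 5.5
    # Eq. 5.6: bias the accumulated scores toward the previous VP location.
    temporal = 1.0 - np.abs(np.asarray(vp_candidates_x, float) - prev_vp_x) / mu
    scores *= np.clip(temporal, 0.0, None)
    return vp_candidates_x[int(np.argmax(scores))]
```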
To discard weakly supported lines, the system throws away lines that are represented by less than 50 edgels in the map. Given a line equation from the previous frame and the current edge map, the system calculates a new equation by perturbing the line’s end-points in the image. The first is the road bottom point, which is the onscreen intersection point with either the bottom or the side of the image. The second is the horizon support point, which is the line’s intersection point with aline20pixels below thehorizon line(called thehorizon support line), as illustrated in figure 5.7. horizon support line horizon line Figure 5.7: Line tracking process. The system tracks a line equation from the previous frame(denotedinyellow)inthecurrentedgemapbyfindinganoptimallyfitline(orange) among a set of lines obtained by shifting the horizontal coordinate of the previous line’s bottompoint(bottomoftheimage) andhorizonsupportpoint(intersection pointofline andthehorizon supportline) by+/- 10pixelswith2pixel spacing. Theset ofcandidate shifted points is shown in red on the bottom and on the horizon support line. 119 The system tries to fit a new line by shifting the horizontal coordinate of both end- points by +/- 10 pixels with 2 pixel spacing. The reason for using the horizon support line is because we want each new candidate line, whenextended, to intersect the horizon line on the same side of where it came from. That is, if the candidate line intersects the bottom of the image on the left side of the VP, it should intersect the horizon line on the left as well. We find that true road lines almost never do otherwise. The top pipeline also uses the same tracking mechanism for other purposes. One is to refine the VP estimate because the voting scheme only selects a location among sparsely spaced hypotheses at the calibrated horizon line. By using the intersection of road lines instead, the estimate is much more accurate. The second use of tracking is to help the recognition pipeline catch up to the current frame by projecting the newly discovered road lines through unprocessed frames accumulated during the recognition computation. At the end of both pipelines, the system converts the VP pixel location to angle deviation from the road direction by linear approximation, from 0 ◦ at the center of the imageto27 ◦ attheedge(or.16875 ◦ perpixel). Hereweassumethecameraisfittedwith a standard 60 ◦ field-of-view lens. The system then calculates the absolute road heading by summingtheangular deviation with the current IMU reading. By approximating the road direction in absolute heading, the system does not directly couple the vision-based road recognition with motion generation. Instead it can use IMU readings to maintain robot heading. This, at times, is essential when the road is momentarily undetected visually, e.g., when going under a bridge. 120 To improve the heading estimation robustness, the system, which runs at 30Hz, considers the last 100 absolute headings using a histogram of 72 bins, each covering 5 ◦ . It only updates the road heading estimate if 75% of the values reside within 3 bins, updating it with the average value within those bins. In addition, the system also discards heading inputs that are farther than 30 ◦ from the current estimate, which usually occurs during obstacle avoidance. The resulting filtered VP-based absolute road direction is called θ ψt . This angle by itself is actually not accurate enough to properly drive a robot. 
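The pixel-to-heading conversion described above is simple enough to state directly. The sketch below assumes the quoted 0.16875 degrees per pixel, a compass-style IMU heading in degrees, and a sign convention of our choosing.

```python
DEG_PER_PIXEL = 0.16875   # 27 degrees over the 160 pixels from image center to edge

def absolute_road_heading(vp_x, image_width, imu_heading_deg):
    """Convert the vanishing point pixel column into an absolute road heading:
    the angular deviation grows linearly from 0 degrees at the image center to
    27 degrees at the edge (60-degree lens), and is added to the current IMU
    compass heading.  Wrapping to [0, 360) is our convention."""
    deviation_deg = (vp_x - image_width / 2.0) * DEG_PER_PIXEL
    return (imu_heading_deg + deviation_deg) % 360.0
```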
Even though the general heading is correct, a single degree of bias can slowly drift a robot to one side of the road. To rectify this, the system locks the robot to its initial lateral position by storing the locations where the road lines intersect with the bottom of the image and trying to maintain these coordinates. The system converts the unit-pixel lateral deviation to metric length dev using an empiricallycalibrateddistanceof9.525mmperpixel. Thesystemthencorrectsforlateral deviation by adding a small proportional bias to θ ψt . θ road = θ ψt +atan(δ/long(g)) (5.7) withθ road beingthefinalabsolute roaddirection to besent tothenavigation system. The second term, which is the lateral deviation correction, is the angle created by the observed deviation δ and the current longitudinal distance from the robot to the goal location long(g), computed using the atan function. Observe figure 5.8 for the illustra- tion. The small correction is sufficient because the robot is already going in the right 121 direction and lateral correction only needs to guard against long-term drift, in the scale of 1m for 30m traversal. Heading Deviation: θ Lateral Position Deviation + = Robot Heading Road Heading 0° 27° 13.5° Angle: Robot Lat. Pos. Road Lat. Pos. 0° 11.95° 1.73° Angle: + Final Deviation: θ ψ road t Figure 5.8: The conversion of the vanishing point (VP) shift from the center of the image and of the lateral deviation from the original location to robot lateral position. In the VP shift, each pixel (in this case 120 pixels) away from the center is multiplied by .16875 ◦ , while on lateral deviation, every pixel (in this case 22.8 pixels) away from the original position equals to 9.525mm. Since the lateral deviation δ is of low priority, it only needs to be corrected by the time the robot arrives at the top of the local map or in the goal longitudinal distance long(g), which is 7.2m. The resulting angle computes to atan(δ/long(g)) =atan(22.8∗.009525/7.2) =1.73 ◦ . Figure 5.9 displays the road recognition system timeline. The canny edge map com- putations run in 3ms and precede both the full recognition (third row) and road line tracking pipelines(fourth). The formerrunsclose to 40ms, with Hough segments lasting for 20ms, VP voting for 3ms, and road line extractions for 15ms. The line tracking pro- cess runs in 5ms, depending on the number of lines being tracked. Usually the system tracks no more than six road lines at a time. ThefinalroadheadingestimationitselfistriggeredbytheIMUoutput,whichreports the current robot absolute heading. This allows the system to quickly correct deviation between the robot and road heading, for both straight and curved roads. [151] has shown that the system is able to follow curved roads. In the chapter, we experimentally tested the system on four paths, one of which (HWY path) contained curved segments. On average, the system performed only slightly worse (10cm lower) than its average 122 performance. The key to following this type of road is being able to extract as much of the straight image segments on the curved path as possible. In this case, these segments point toward a moving vanishing point, which coincides with the changing direction of the road. Camera (30Hz) 100 33.33 66.67 Time (ms) 0 IMU (50Hz) Full road recognition (15Hz) Road line tracker (30Hz) Road heading estimation (50Hz) Edgemap Computation (30Hz) Figure 5.9: Diagram of road recognition timeline. 
The figure lists all the components in the system: the camera and IMU, the edgemap computation, the two visual recognition pipelines, and the final road heading estimation.

5.2.2.2 Constructing the Grid Occupancy Map

The system updates the grid occupancy values using LRF proximity distances that are already appropriately aligned with respect to the road heading. The grid values are initially set to -1.0 to indicate unknown occupancy. The grid cells are updated only if they are hit by or in line with the LRF rays. For each ray, the system draws a line from the robot location to the ray endpoint. If the laser line goes through a grid cell, indicating that the cell is free, the occupancy value is set to 0.0. If the laser line ends on the grid cell, it is set to 1.0. This discrete-valued map D is stored and updated at each time step.

To generate a path, the system first blurs map D using the equation below. The resulting map is called the blurred map B.

B(i,j) = \max_{-n_{size} \le d_i, d_j \le n_{size}} D(i+d_i, j+d_j) \cdot \left(1 - \frac{\sqrt{d_i^2 + d_j^2}}{\sqrt{(n_{size}+1)^2 + (n_{size}+1)^2}}\right)    (5.8)

Here the value B(i,j) is the maximum of the weighted occupancy values within an n_{size} = 4 neighborhood, where each neighbor's discrete value D(i+d_i, j+d_j) is inversely weighted by its distance to (i,j). By setting the influence neighborhood to 4, or a full robot width, the repelling force of an obstacle starts to influence the robot half of its length away from the obstacle surface. Note that in the calculation of B(i,j), D(i,j) itself is included in the neighborhood with a distance weight of 1.0, while the weight terms for the farthest neighbors evaluate to 0.2. In addition, the blurring is not applied to cells with unknown occupancy status, to keep their influence to a minimum. After blurring, these cells are given a conservative value of 0.99 to allow the path planning algorithm to go through them if all other options have been exhausted.

As the robot moves longitudinally toward the goal, the grid values are shifted down one row at a time, keeping the robot location as is. This means that the local map is robot-centric in denoting the location, but allocentric in heading direction, aligning with the road. The robot movement estimate, which is computed using the fused odometry, is sufficiently accurate for grid shifting purposes. Here perfect occupancy alignment is not critical because the robot quickly passes through the local environment and discards any briefly accumulated errors. Unlike in Simultaneous Localization and Mapping (SLAM) formulations, the system does not intend to create a complete global map, which would require reasoning about motion uncertainties. Even when the system encounters an occasional large odometry error, it can quickly recover from grid map misalignment because the occupancy value of every grid cell visible to the LRF is updated in each frame. This way the misalignment only affects the grid cells that are in the local map and in the robot's current blind spot. This is fine because the robot always turns toward where it is moving, thus exposing as well as correcting the erroneous blind spots.

Observe figure 5.10 for an illustration of the grid occupancy process. It also includes a path biasing step, described in the next section.

Figure 5.10: Three stages of the grid occupancy map. The first, leftmost, image is the discrete map D of all the discovered occupancy values, with white being unoccupied, black fully occupied, and cyan unknown. The second map is the blurred map B, while the last occupancy map F is biased by the most recently produced path to encourage temporal path consistency. Also notice that the unknown grid cells, such as the ones below the robot, do not extend their influence out in either the blurred or the biased map.
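A direct, unoptimized rendering of equation 5.8 is sketched below. The occupancy encoding (-1 unknown, 0 free, 1 occupied) follows the text, while the brute-force loops are purely illustrative.

```python
import numpy as np

def blur_occupancy(D, n_size=4):
    """Eq. 5.8 sketch: B(i,j) is the maximum distance-weighted occupancy in a
    (2*n_size+1)^2 neighborhood.  Unknown cells (D == -1) are not blurred and
    are assigned the conservative value 0.99, as in the text."""
    H, W = D.shape
    B = np.zeros_like(D, dtype=float)
    norm = np.hypot(n_size + 1, n_size + 1)
    for i in range(H):
        for j in range(W):
            if D[i, j] < 0:                       # unknown: keep influence minimal
                B[i, j] = 0.99
                continue
            best = 0.0
            for di in range(-n_size, n_size + 1):
                for dj in range(-n_size, n_size + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < H and 0 <= jj < W and D[ii, jj] > 0:
                        w = 1.0 - np.hypot(di, dj) / norm   # 1.0 at center, ~0.2 at corners
                        best = max(best, D[ii, jj] * w)
            B[i, j] = best
    return B
```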
The grid construction timeline is shown in figure 5.11. Because the map uses a relatively coarse 15cm by 15cm resolution, the process has a low computational requirement of less than 1ms, and outputs results at 20Hz, after receiving each LRF reading. The path planning step, on the other hand, takes substantially longer.

Figure 5.11: Diagram of the grid map construction timeline. The figure lists all the components: the three utilized sensors (LRF at 20Hz, IMU at 50Hz, and fused odometry at 15Hz), the road heading estimation (50Hz), and the grid map construction process (20Hz).

5.2.2.3 Path Generation

The system first applies A* search to the local map to create a coarse path trajectory to the goal. For each grid cell, the search process creates eight directed edges to the surrounding nodes. The edge weight is computed by adding the edge length and the F(i,j) value of the destination cell, which is the sum of the destination's B(i,j) and a bias term to encourage path consistency. The bias term is calculated by adding the proximity of the destination cell to the corresponding closest point in the previous A* path, divided by the local map unit pixel diagonal length \lambda.

F(i,j) = B(i,j) + \frac{\min_{k=1 \dots n_{path}} \sqrt{(P_{k,i} - i)^2 + (P_{k,j} - j)^2}}{\lambda}    (5.9)

Here the previous path P is defined as n_{path} ordered and connected points P_k, with P_{k,i} and P_{k,j} referring to the i and j coordinate locations of the path point. To save time, the algorithm does not add edges going into fully-occupied nodes D(i,j) = 1.0. Consequently, to include all non-fully-occupied cells, all F(i,j) values are capped at 0.99 for nodes D(i,j) ≠ 1.0. In addition, the system zeroes out the bias if the distance to the closest obstacle is less than 2 unit pixels, to allow the flexibility of finding an optimal nearby path.

We find that adding a bias and generating a path on every frame, which is computed efficiently because of the coarse 15cm grid size, is more practical for encouraging temporal smoothness. The more burdensome alternative would be to advance and modify an old path whenever there are changes to the local map. Here, generating new paths allows for the fastest possible reaction to new changes, without a need to explicitly detect them.

There are times when A* cannot return a path to the goal, usually because the robot is moving into a dead end and the goal is on the other side of a wall. In this case the system creates a straight forward path, to keep the robot moving ahead until it is directly in front of a wall, within two robot lengths. This ensures that there truly is no way to the goal, and that the blockage is not due to inaccuracies in the local map construction, especially when using far-away LRF values. In cases where the robot is truly stuck, the system invokes an exploratory procedure by spinning the robot 360°. In the future, we plan to implement a more robust procedure, such as moving backwards as well as considering pertinent contextual information.
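The cost-map construction of equation 5.9, with the cap and the near-obstacle exception described above, could look as follows. This is a sketch: the obstacle_dist helper and the use of infinity to exclude fully occupied cells from the search are our modeling choices.

```python
import numpy as np

def biased_cost_map(B, D, prev_path, lam, obstacle_dist, bias_cutoff=2.0):
    """Eq. 5.9 sketch: F(i,j) = B(i,j) + (distance to the closest point of the
    previous A* path) / lambda.  The bias is zeroed near obstacles and F is
    capped at 0.99 for any cell that is not fully occupied, as in the text.
    prev_path is an array of (i, j) points; obstacle_dist(i, j) is assumed to
    return the distance (in grid units) to the nearest obstacle."""
    H, W = B.shape
    F = np.empty_like(B, dtype=float)
    path = np.asarray(prev_path, dtype=float)
    for i in range(H):
        for j in range(W):
            if D[i, j] == 1.0:
                F[i, j] = np.inf          # fully occupied: no A* edge leads here
                continue
            bias = np.min(np.hypot(path[:, 0] - i, path[:, 1] - j)) / lam
            if obstacle_dist(i, j) < bias_cutoff:
                bias = 0.0                # allow deviation when close to obstacles
            F[i, j] = min(B[i, j] + bias, 0.99)
    return F
```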
There are times when A* cannot return a path to the goal, usually because the robot is moving into a dead end and the goal is on the other side of a wall. In this case the system creates a straight forward path, to keep the robot moving ahead until it is directly in front of a wall, within two robot lengths. This ensures that there truly is no way to the goal, and that the failure is not due to inaccuracies in the local map construction, especially when using far away LRF values. In cases where the robot is truly stuck, the system invokes an exploratory procedure by spinning the robot 360°. In the future, we plan to implement a more robust procedure, such as moving backwards as well as considering pertinent contextual information.

To allow for maximum obstacle avoidance and to smooth out the discrete path, the system applies a modified elastic band algorithm [132] using the current LRF laser scans as individual obstacles. Each path point undergoes an iterative process for t_max = 20 steps, where at each time step the point is moved by an attractive force A to straighten the path, and a repulsive force R to repel it perpendicular to the closest obstacle. The attractive force is defined as:

A_{t+1,k} = \frac{P_{t,k+1} + P_{t,k-1}}{2} - P_{t,k}    (5.10)

The repulsive force R, on the other hand, uses a vector C_{t,k} = P_{t,k} - closestObs(P_{t,k}), which extends from the closest obstacle closestObs(P_{t,k}) of path point P_{t,k} to that point. The force can then be defined as:

R_{t+1,k} = \frac{C_{t,k}}{|C_{t,k}|} \cdot \frac{m - |C_{t,k}|}{|C_{t,k}|}    (5.11)

Here the first term normalizes the vector C_{t,k}, while the second term creates a magnitude weight that grows sharply as the obstacle gets closer, so the closer the obstacle the larger the repulsive force. Note that, when identifying the closest obstacle points, the system models the robot body more accurately as a square. The maximum obstacle distance m = 5 limits the obstacle consideration to within 5 grid units of the robot; otherwise R_{t+1,k} = 0. Equation 5.12 below combines the two factors:

P_{t+1,k} = P_{t,k} + W_t \left( \frac{1}{3} A_{t+1,k} + \frac{2}{3} R_{t+1,k} \right)    (5.12)

The weight W_t decays over time, becoming linearly weaker by 5% of the initial value at each step, from 0.2 down to 0.01, or W_t = 0.2 (1.0 - t/t_max).
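A compact sketch of this smoothing loop, directly following Eqs. 5.10 to 5.12, is shown below. Here path is an (n, 2) array of grid points and obstacles an (m, 2) array built from the current LRF scan; the names are illustrative, and the sketch omits the square robot-body model used by the actual implementation when picking the closest obstacle.

import numpy as np

def elastic_band(path, obstacles, t_max=20, m=5.0):
    # Iteratively deform the A* path with an attractive force (Eq. 5.10)
    # and a repulsive force away from the closest obstacle (Eq. 5.11),
    # combined with a decaying step weight (Eq. 5.12).
    P = np.asarray(path, dtype=float)
    obs = np.asarray(obstacles, dtype=float)
    for t in range(t_max):
        W = 0.2 * (1.0 - t / float(t_max))          # decays from 0.2 toward 0.01
        new_P = P.copy()
        for k in range(1, len(P) - 1):               # endpoints stay fixed
            A = (P[k + 1] + P[k - 1]) / 2.0 - P[k]   # attractive force
            C = P[k] - obs                           # vectors from each obstacle
            dists = np.hypot(C[:, 0], C[:, 1])
            idx = int(np.argmin(dists))
            d = dists[idx]
            if 1e-6 < d < m:                         # repulsive force within m cells
                R = (C[idx] / d) * ((m - d) / d)
            else:
                R = np.zeros(2)
            new_P[k] = P[k] + W * (A / 3.0 + 2.0 * R / 3.0)
        P = new_P
    return P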
To compute the motor command using the smoothed path, the system takes the path's first step and applies the Dynamic Window Approach (DWA) [42]. It not only calculates the deviation between the current robot heading and the heading of the initial path step, but also takes into account the robot dynamics. This is done by only considering commands within the allowable velocity space based on the estimated velocity calculated from the previous command, its execution delay, and the motor and body dynamics. Furthermore, the approach then accurately simulates each resulting arc trajectory of the allowable commands (translational and rotational velocity), modeling the robot shape to test whether any part of the robot would hit an obstacle. In the end, the system selects the motor command that is expected to best achieve the target heading and velocity, while staying clear of the obstacles. Figure 5.12 illustrates the three stages of the motion generation.

Figure 5.12: The three stages of the path trajectory and motion command generation. The figure displays both the full local map on the left and a zoomed part within the red rectangle on the right for a clearer picture. The system first runs A* to generate a rigid path, shown in green in both images. The system then deforms the rigid path (in blue) to optimally and smoothly avoid obstacles. Finally, the system uses the Dynamic Window Approach (DWA) to generate the actual command, one that accounts for the current robot dynamics. That resulting command is modeled accurately as a curve in orange. In addition, the figure also displays the predicted location of the robot, noted by the red disk.

Note that even though the displayed DWA trajectory deviates from the smoothed path, this is only for one step. When the next time step is executed, the front of the robot is expected to be in the dotted location. At that point, the system recomputes the path again. After all the steps are computed and executed, the actual robot trajectory usually ends up similar to the smoothed path.

The path planning timeline, which integrates the sub-systems in our presented system, can be viewed in figure 5.13. The path planning sub-system creates the occupancy grid map and computes A* in 5ms, the path smoothing step (elastic band algorithm) in 1.8ms, and the DWA step in 5ms, which totals to about 12ms. The output path itself is triggered by the grid-occupancy map output, which in turn is triggered by the LRF output.

Figure 5.13: Diagram of the full system timeline. The final motor command to be executed by the robot (bottom row, 20Hz) is generated by the path planning sub-system (20Hz). It utilizes all the previously described sub-systems (road heading estimation at 50Hz, grid map construction at 20Hz, and localization at 30Hz), with grid map construction being the trigger.

While other modules influence the robot motion in almost every time step, localization updates the intermediate goal in the local map quite infrequently. For the most part, the goal location is placed in the canonical straight-ahead position (observe figure 5.12). There are, however, a few occurrences where the goal is moved, such as during a turn procedure, or when the goal is within the local map dimensions. For turns, the localization system sends a command to the local map to initiate a turn. It specifies the turn angle as the angle created by the involved incoming and outgoing edges in the map. The navigation system then moves the goal to the appropriate location, triggering a path generation that forces the robot to turn immediately. The system then lets the robot move until its angle is aligned to the goal direction, usually in less than two seconds. When the turn is completed, the local map is rotated so that the goal location is at the canonical straight-forward position, and the robot can proceed in a normal fashion.

The framework has the flexibility to allow the robot to still perform robustly despite not using some of the sub-systems. In the absence of the localization system, the robot can keep itself on the road. If the road recognition is also absent, the robot then simply tries to move in the direction of its heading when the system is turned on, while also avoiding obstacles. In addition, in the case where any of the sensors (camera, LRF, and IMU) is broken, it is easy to set up the corresponding sub-systems (localization, road heading estimation, and grid map construction, respectively) to trigger on other events, such as fused odometry.

5.3 Testing and Results

We rigorously test the system using our robot, Beobot 2.0 [150], pictured in figure 5.1, at the University of Southern California (USC) campus in Los Angeles. We construct three routes of 511.25m, 615.48m, and 401.16m in length (paths shown in figure 5.14 in red, blue, and green, respectively) through many parts of the campus. The goal is to cross the finish line at the end of the route, wherever the robot is laterally on the road. In the figure, we also include paths in black, which are roads that have been traversed during system development but are not used in the documented experiment. They show the extent of the network of roads that we have used during testing. In addition, some of these paths are currently too dangerous for the robot to traverse. One is the leftmost portion of the network, a narrow sidewalk with a steep boundary cliff and people walking behind and toward the robot. Currently, the situation forces the robot either to quickly get off the sidewalk or be stuck thinking that the road is a dead end.
Beobot 2.0 logged over 3 hours 17 minutes, or 10.11km, of recorded experiments conducted within a three-month winter period (December - February). The average robot speed during autonomous driving is 0.66m/s (about half the average walking speed of people), with a maximum of 0.89m/s. The difference is caused by a dropping battery charge as the experiment wears on. A video summary can be found in [149], while a snapshot is displayed in figure 5.15.

For the experiments, as seen in figure 5.1, we use a subset of Beobot 2.0's sensors: a front-facing monocular camera and LRF (117cm and 60cm above the ground, respectively), as well as wheel encoders and IMU. The camera outputs 320×240-pixel images, which are kept at the same dimension by the road recognition module, but scaled to 160×120 for vision localization. We did not apply camera calibration or lens distortion correction, which may help in salient region matching.

Beobot 2.0 has a sixteen-core 2.6GHz distributed computing platform, which we take full advantage of by employing a subscribe/publish architecture to manage all functionalities, from sensor data acquisition, localization, and navigation to actuation. In addition, the setup simplifies the expansion process if we want to add capabilities such as object recognition and human robot interaction. All the code is available on the INVT toolkit page at [64].

We selected routes that make up a fairly complete representation of the different environments at USC. They are also picked to vary the localization and navigation challenges and conditions, which are listed in table 5.1. Each route consists of four segments. Between them are intersections that force the system to make decisions about which road to take. Scenes from each route are displayed on separate rows in figure 5.16.

Figure 5.14: Map of the experiment routes at the University of Southern California. They are the libraries (LIBR) route, indicated in red, the professional schools (PROF) route in blue, and the athletic complex (ATHL) route in green. The remaining routes in black are not used in the documented experiment but have been traversed during system development.

The first route is called the libraries route (LIBR) because the path goes through a cluster of libraries at the heart of the campus. This route includes roads with some of the densest crowds. The second route, at the south part of USC, goes through the professional schools (PROF) neighborhood, where there are more trees. The third route, going towards the north part of campus, is called the athletic (ATHL) route, which rounds out the variety of scenes in the experiments.

Figure 5.15: A snapshot of the system test-run. The top-left (main) image displays the robot moving towards the goal. Below is the road recognition system, which identifies the road heading (indicated by the green disk) and the lateral position of the robot (blue stub on the bottom of the image). To the right are three salient landmark matches as well as the segment estimation histogram below. The top-right of the figure shows the global topological map of the campus, with the current robot location indicated by a red square, and the goal by a blue star. It also draws the current segment in pink, matching the segment estimation output underneath the landmark matching. Finally, the bottom right of the figure displays the local grid occupancy map, which shows a robot path to avoid an approaching pedestrian.

There are parts of the full system that have been previously published.
[155] shows that the vision localization system can maintain comparable results under a number of lighting conditions at different times of the day, from the brightest (noon) to the darkest (early evening), and in real time [150]. [151] has also demonstrated that the visual road recognition system performs better than the state of the art [135, 75, 21] on a standard (offline) dataset [21].

Figure 5.16: Examples of scenes from the libraries (LIBR), professional schools (PROF), and athletic complex (ATHL) routes, each displayed in its respective row. The figure also depicts various challenges that the robot encountered, such as pedestrians, service vehicles, and shadows. These images are adjusted to remove the bluish tint in the original images for better display.

Table 5.1: Experiment Sites Information

Site          Route       Pedestrian   Weather/lighting      Speed   Total     Total     Total     Obstacle
              Length (m)  Density      Condition             (m/s)   Training  Testing   Time (s)  Avoidance
                          (Person/m)                                 Frames    Frames              Time (s)
Library       511.25      .19          4 - 6pm, partial sun  .61     16621     13825     838.35    29.99
                          .97          12 - 2pm, sunny       .44     16688     34979     1165.88   73.13
Professional  615.48      .06          4 - 6pm, cloudy       .73     21488     20000     840.79    0.00
                          .36          12 - 2pm, sunny       .89     9702      20773     692.37    46.10
Athletic      401.16      .04          4 - 6pm, clear sky    .77     19022     11017     522.21    13.67
                          .43          11am - 6pm, sunny     .68     9102      17499     583.24    27.10

In this chapter, because the full system connects the navigation and localization modules, we examine the effect of one module upon the other. In addition, since both modules have been shown to be robust in the presence of various lighting conditions, we focus on the effect of crowding on our system. That is, we test the system under two conditions: low and high crowd; the former during off-peak hours, the latter during peak hours, with many pedestrians on the road. [151] has shown that the road recognition system is affected by crowding more so than by lighting. Here, we complete the analysis by rigorously testing the whole system.

We carry out a number of testing procedures, measuring different metrics. In section 5.3.1, we focus on the localization system, while section 5.3.2 is on navigation. In the former, we isolate the localization problem by comparing its performance when the robot is manually driven versus fully autonomous. In addition, we systematically drive the robot manually in a straight line at different offsets from the center of the road to gauge how navigation error affects localization results. We also isolate localization results at the junctions. A junction is a critical part of a route because it is where the robot has to make a turn. In the navigation section, we evaluate how well the navigation system maintains the robot's lateral position throughout traversal, and how it is affected by obstacle avoidance.

Altogether, we evaluate the system using twelve full path traversals, four for each route, covering all combinations of conditions: autonomous mode in low crowd, manual driving in low crowd, autonomous mode in high crowd, and manual driving in high crowd. In addition, we also add four 180m manual runs to test the effect of lateral shift on localization results. Although there are many more environments that we could test the robot in, these selected sites encompass a large part of the challenges that outdoor autonomous mobile systems face today. We believe there are many lessons that can be gained from testing the robot at these challenging sites and under the proposed conditions.
5.3.1 Localization Testing

We measure the localization system accuracy by comparing its output with manually obtained ground truth. We decided to use this metric because the system never completely failed to localize, and evaluating displacement error provides a better understanding of how it performed. The error measured here is in the longitudinal direction, along the graph edges of the topological map. We record the ground truth by sending signals to the system, indicating the true current robot location. The signals are sent at specific parts of the path and are spaced appropriately (as small as 3 meters) to accurately record the robot's location throughout the experiment. Assuming the robot is running at approximately constant speed, we use interpolation to fill in the ground truth between signal points. Note that these signals are only processed after testing and are not used during run-time. For our experiments, we manually initialized the system with the starting location. In [155] we have shown that the system can re-localize from a kidnapped-robot situation as well as from random initialization. Here we decided not to do so because the system currently does not converge to the true location in a reasonable amount of time with 100% certainty.
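As an illustration of the interpolation step, the sketch below fills in a per-frame ground-truth position from the sparse signal points, assuming each signal is recorded as a (frame index, distance along the route) pair and that the speed between consecutive signals is roughly constant; the names are hypothetical.

import numpy as np

def interpolate_ground_truth(signals, num_frames):
    # signals: list of (frame_index, distance_along_route_m) pairs,
    # in increasing frame order; returns one distance per frame.
    frames = np.array([f for f, _ in signals], dtype=float)
    dists = np.array([d for _, d in signals], dtype=float)
    return np.interp(np.arange(num_frames, dtype=float), frames, dists)

# Example: markers 3 m apart passed at frames 0, 90, and 180
# gt = interpolate_ground_truth([(0, 0.0), (90, 3.0), (180, 6.0)], 181)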
To create a landmark database and a segment classifier, we run the robot manually through the routes to take images for our training procedure [154]. We have systematically shown in [155] for the localization system and [152] for the segment classifier that training at different times, at three-hour intervals, endows our system with lighting invariance. In those papers, we test the system from 9am to late afternoon, after performing the multiple prescribed training runs. We found that the segment classification errors are within 5 percent of the average, while the localization errors are within 60cm, regardless of lighting condition. Thus, here, we test the system at different times of day, but only with the training and testing taken in similar lighting for each site. Table 5.1 lists the number of training and testing images involved.

The localization boxplots can be viewed in figure 5.17 for all three sites as well as the total, in both crowding conditions and driving modes. We also show in figure 5.18 an example run from the PROF route under the low crowd condition for more detailed observation. To reduce the effect of dependence in the samples, we average the errors every 30 samples, or within 1 second (the localization output is 30Hz), before creating the box plots. In addition, because our result distributions are not Gaussian, we use non-parametric statistical significance tests. Because of this, we do not test for interaction between effects, e.g., whether crowd level causes greater error for autonomous navigation than for manual driving.

Figure 5.17: Box plot of metric localization error from the different routes, for both manual and autonomous navigation, and in both low and high crowd conditions. In addition, the figure also reports the total performance accuracies for each condition. The figure shows that the overall autonomous mode error median is 0.97m, while the medians of the various conditions remain low, from 0.75 to 1.5. Each box plot notes the median (the thicker middle line), the first and third quartile, and the end whiskers, which are either the minima/maxima or 1.5 * IQR (inter-quartile range). The stars above the lines joining various box plots note the statistical significance, with one star indicating p < 0.05, two stars p < 0.01, and three stars p < 0.005.

Figure 5.18: Detailed diagram of localization error while the robot autonomously navigates the professional schools (PROF) route during the low crowd condition. The error is indicated by a gradation of red color, with the darker hue being higher error.

The results show that the system is able to satisfactorily localize within 0.97 meters (median) of the true location while driving autonomously and reaching goals more than four hundred meters away. The farthest localization disparity is 7.15m, which happens in the middle of a long segment, allowing the system to recover. The main source of error, which is prevalent throughout all conditions, can be attributed to the less exact nature of the salient region matching step. To speed up the process, the system uses small salient regions of about one third of the downscaled 160 by 120 input image. For landmarks that are near the robot, the small regions are sufficient to pinpoint locations with high accuracy. In fact, whenever the system observed nearby landmarks, the localization error tended to decrease. However, when the landmarks are far away from the robot, they can appear identical from a large range of viewing distances. Because of the low discrimination power, oftentimes the system cannot quickly correct a sizeable steady-state error.

The main problem is that there is no way of knowing which landmarks are far from the robot, as the system only records the robot location when viewing them. A possible labor-intensive solution, without incorporating long range 3D sensors, would be to annotate the estimated distances of the landmarks to the robot. This way, the system would know that landmarks in the form of the tops of buildings are hundreds of meters away and would not be as discriminating as signs on top of a nearby door.

In terms of the effect of routes on system performance, the data shows that it is statistically significant (Friedman test, p < 0.005). As shown in the figure, the athletic (ATHL) route generally produces lower error distributions than those of the libraries (LIBR) and professional schools (PROF) routes. We found that the disparity comes from a sustained spike in error, which often occurs when the robot is driving on a long segment and the system continually utilizes far away silhouettes at the end of the road as a landmark. An example is taken from the last segment of the PROF route, where the system sustains errors above 3m for more than 50m (observe figure 5.18). As the robot is moving through a corridor (shown in figure 5.19), the light at the end makes that part of the image very salient, rendering other parts less conspicuous by comparison.

Figure 5.19: Landmark recognition in a building corridor at the last segment of the PROF route. The opening at the end is the only salient landmark that is consistently recognized at different distances, with minimally observed changes throughout the traversal.

As for the difference between low and high crowd conditions, the difference is statistically significant (Friedman test, p < 0.005) but with near equal medians (0.98m low crowd vs. 0.96m high crowd).
Most times, the localization system is able to identify enough landmarks among the pedestrians, just as well as when the crowd is much sparser, because our training procedure rejects moving objects as possible landmarks during training [154]. However, there are times when pedestrians do affect the system, particularly when they occlude the landmarks that are at ground level. These landmarks tend to be the ones that are closer to the robot. When this happens, the robot has to rely on landmarks that are higher, above the pedestrians. These are likely to be farther away, thus reducing the accuracy of the localization system.

In addition, we also observe a statistically significant localization performance difference between autonomous and manual-drive navigation (Friedman test, p < 0.005), with the autonomous drive median of 0.96m being lower than that of manual driving (1.00m). This is quite surprising given that in autonomous drive the robot can veer in and out of the middle of the road due to the granularity of the road heading perception. That is, it takes a substantial deviation from the road center for the system to react to the error. Furthermore, it can also be attributed to robot response delay, uneven ground surfaces, and motors not having equal efficiency, biasing the robot to drift to one side. In manual drive, on the other hand, the robot moves steadily in the middle of the road, which is the same vantage point as during training.

When the robot is not moving at the center of the road, there are fewer landmarks that are matched with the ones stored in the database, which lowers the system accuracy. We demonstrate this point by manually driving the robot at different lateral positions, 0m, 1.33m, 2.33m, and 3.33m away from the road center, on the first segment of the LIBR route. We selected this segment because it has sections with many nearby salient landmarks as well as sections that primarily use a far away silhouette at the end of the segment.

Figure 5.20: The localization error box plot for lateral deviations of 0m, 1.33m, 2.33m, and 3.33m from the center of the road. All the error differences compared to that of driving in the middle of the road are statistically significant based on the Wilcoxon rank sum test (p = 1.093 × 10^-7 vs. 1.33m, p = 0.015 vs. 2.33m, and p = 2.476 × 10^-5 vs. 3.33m).

As we can see in figure 5.20, there are statistically significant differences in performance between the 0m lateral position and the others, with the system becoming increasingly less accurate as it moves away from the center. The good news is that, even at a deviation of 3.33m, the localization system still does not completely misplace the robot. This leads us to believe that, as long as the robot does not permanently travel far away from the center of the road, instead moving in and out of the center, the system can match the localization performance of manual driving.

Another aspect of the testing that we observe closely is obstacle avoidance. In total, in our data, there are 25 trajectory-changing obstacles that affect the robot's heading. This definition differentiates a select few from many other nearby pedestrians seen in the camera but not close enough to warrant avoidance. A majority of them only result in slight deflections in the robot's trajectory, and not a clearly visible maneuver. Refer to table 5.1 for the accumulated times that the robot spent avoiding obstacles, listed in the last column.
Figure 5.21: Box plot of localization errors during obstacle avoidance on the different routes, and in both low and high crowd conditions. Note that the result for the PROF route during low crowd is missing because the robot did not execute a single obstacle avoidance maneuver on that run.

As shown in figure 5.21, we observe statistically significant performance differences between low and high crowd conditions during obstacle avoidance maneuvers (Friedman test, p < 0.005), between low and high crowd conditions during non-obstacle-avoidance traversals (Friedman test, p < 0.005), as well as between obstacle and non-obstacle avoidance traversals for both crowd levels combined (Friedman test, p < 0.005). It should be pointed out that the median difference for the last comparison is only 1cm: 0.95m during obstacle avoidance versus 0.96m during non-obstacle avoidance. This stability, which is also shown by the respective distribution shapes, demonstrates that the localization system can endure these significant but short perturbations. At times we see that an avoidance maneuver actually allows the robot to go back to the center of the road, improving the localization accuracy. In fact, on average, the localization error decreases from 1.39m before avoidance to 1.09m after.

For the most part, the amount of localization error carried by the system is manageable, especially in the middle of segments, where the task of the robot is mainly to proceed forward. However, it becomes critical when the robot is trying to turn to the next segment at an intersection. In figure 5.22, we plot the turn locations with respect to the center of the intersection. Fortunately, considering that the robot is traveling on campus roads 6 to 8 meters in width, it never misses the road completely.

Even when the robot needs to turn onto narrower roads, it manages to do so because these roads tend to be in between buildings. An example is when the robot is entering the corridor at the last segment of the PROF route. In this case, the obstacle avoidance, or wall following, behavior kicks in and funnels the robot to the segment. We believe that in order for the robot to properly turn at the center of an intersection, both localization and navigation must work in concert. That is, localization can signal that it is nearly time to turn, while navigation should then estimate the configuration of the junction using the involved road boundaries to fine-tune the turn location.

Figure 5.22: Plot of turning locations (lateral vs. longitudinal deviation, in meters) throughout the experiment. The black partially-blocked arrow coming from the bottom of the figure indicates the direction the robot is coming from. It points toward the gray and black circle representing the center of the intersection, which is the ideal turn location; the robot width is 0.6m. The color of the dots, which corresponds to the color of the arrows, indicates where the robot is trying to go: red is for turning left, green for right, and yellow for straight. Each red, yellow, and green dot represents where the robot was when turning (the robot turns in place).
5.3.2 Navigation Testing

We measure the accuracy of the system in navigating and maintaining its position in the middle of the road on many different road types and challenges. These range from clean roads with clear borders, to roads with complex markings, to roads without clear boundaries. Navigation error is defined as the physical lateral distance of the robot center from the center of the road. We measure the ground truth by manually annotating the robot's lateral position when crossing certain points of the path. We then interpolate the points in between using odometry.

Figure 5.23 reports the result, which is divided by site, and then by crowding condition. Overall, the system is able to stay within a 1.37m median of the center, which is acceptable given the challenges that the robot faced. However, clearly, there are a number of aspects that need improving.

Figure 5.23: Results of the navigation system at each site and under low and high crowding conditions. We measure the error as the lateral deviation of the robot from the road center. The far right bars display the distributions of low vs. high crowd performance, which is statistically significant (Friedman test, p < 0.005).

The data shows that there is a statistically significant difference between the low and high crowd conditions (Friedman test, p < 0.005), with a median of 1.23m for low crowd vs. 1.55m for high. As mentioned in the previous section, excluding numerous slight changes of direction, there is only a low number (25 in total) of obvious obstacle avoidance situations. One reason for this is that the 7.2m front extent of the local map allows the robot to register obstacles early enough to move away gracefully. In addition, the robot only encounters a small number of static obstacles because it is moving in the middle of the street. Also, because of the robot's appearance and relatively slower moving speed, pedestrians are aware of it approaching them. In some cases, however, the robot does have to break from its path to avoid collision. And, sometimes, the navigation performance suffers because of it.

Figure 5.24: Trajectory of the robot on the PROF route trial run under the low crowd condition. Here the deviation from the ground truth center of the road is indicated by the gradation of the red color, with the darker hue being higher error. In addition, the green bar behind the trajectory indicates that the error is within 1.5m of the center of the road, while the wider black bar is for 5m.

It is important to note that an avoidance maneuver's temporal window starts when the robot first rapidly changes its direction, and ends when it stabilizes its lateral position. For some situations, however, the system error remains long after the maneuver is deemed over. For example, during the LIBR test run in the low crowd condition, the robot has to avoid two people, street signs, and a long bike rack in succession (observe figure 5.25). Because the compound maneuver keeps the robot on the left flank of the road (3m from the center) for so long, the system resets its current lateral position, thinking that the robot is at a new center of the road. Had the robot moved back to the center, as opposed to staying out on the flank for more than 70 meters, the overall median would have been 0.87m, lower than the actual 1.09m error.
We decided to add a reset option because the current road recognition system does not have the capability to recognize the true center robustly. Resetting the center whenever a new stable lateral position is established allows the system to react to changing boundary positions. If the system were able to estimate the road center, it would help not only with obstacle avoidance recovery, but also in the case where the localization system turns at the wrong point of the intersection at the start of a segment. Currently, the navigation system assumes that the original lateral position is the center of the road, and simply tries to maintain that spot.

Figure 5.25: Successive avoidance maneuvers by the robot. It has to avoid two people (first image), street signs (second), and a long bike rack (third).

We also analyze the navigation error during obstacle avoidance, with the results shown in figure 5.26. Again, the figure shows statistical significance in all cases: navigation error during obstacle avoidance for low vs. high crowd conditions (Friedman test, p < 0.005), during non-obstacle-avoidance traversals for low vs. high crowd conditions (Friedman test, p < 0.005), as well as between obstacle and non-obstacle avoidance traversals for both crowd conditions combined (Friedman test, p < 0.005). Here the difference for the latter comparison is more pronounced, 1.58m vs. 1.36m medians, respectively. In addition, the disparity is also supported by the average error increase of 32cm between the start and end of an obstacle maneuver.

Figure 5.26: Navigation errors during obstacle avoidance on the different routes, and in both low and high crowd conditions. Note that the result for the PROF route during low crowd is missing because the robot did not execute a single obstacle avoidance maneuver during that time.

Even without obstacle avoidance, in the case of the PROF route during the low crowd condition (indicated by the missing bar in figure 5.26), there is still error in the system. A major reason for this is that, in some areas, visual cues for road direction are very sparse. An example is the water fountain area on the PROF route, where the road changes seamlessly into an open space before continuing to another road (observe figure 5.27). In the absence of these visual cues, and thus of a vanishing point, the robot simply maintains the last road IMU heading estimate while avoiding obstacles whenever needed. Despite this difficult stretch, however, the robot was still able to arrive at the target location, about 200m further.

On a related topic, open areas present a challenge in terms of constructing a topological representation because the robot can move about in any direction. In a truly open area devoid of obstacles, there can be any number of maps that describe the space, but in the end these do not matter since the space is open and hence can be traversed in any way. Our strategy for mapping open spaces thus incorporates the robot behavior of going straight while avoiding obstacles. So, for example, in the case of a large plaza with 4 roads connecting to it from the north, east, south, and west, we just reduce the plaza to its center point and extend the 4 roads to meet at that point. If the open space does contain fixed obstacles, these will be represented as additional nodes in our topological map.
Figure 5.27: Views of the environment as the robot is moving through a water fountain area.

Throughout the experiment, there were only a few times when the robot failed to detect obstacles in its path. On these occasions, we stepped in front of the robot to assert the presence of the obstacles, and let the system adjust by itself. One cause of the mistakes is that the planar LRF is placed too high above the ground to detect shorter or protruding object parts, such as the seat of a bench. However, we find that there is no correct height because, if it is placed low, the robot may attempt to drive underneath an obstacle. Other reasons to note are the inability of the LRF to consistently detect thin objects (e.g., poles, fences), and objects with holes, such as a bicycle approached at certain angles. Furthermore, the navigation system is currently unable to detect negative obstacles such as a ditch or a hole, making it unsafe to navigate on roads with hazardous boundaries such as sidewalks, ramps, and cliffs. In section 5.4.3, we describe our plan to rectify these problems.

5.4 Discussion and Conclusion

In this chapter we present a real-time visually-guided autonomous localization and navigation system. We demonstrate the system's efficacy using a wheeled robot and driving it through long-range and challenging routes of at least 400m at the University of Southern California (USC) campus, for a total of 10.11km of documented testing. In each route, the robot has to show its robustness in navigating among crowds, turning at several intersections, and stopping at the assigned destination. In section 5.4.1, we discuss why the system is able to perform as well as it did. We then compare its performance with the available state-of-the-art systems and describe how our work relates to them in section 5.4.2. We finish by listing the current system failures as well as future improvements in section 5.4.3.

For our setup, we have clearly framed the objective of our system within a specific type of environment and application and, at this point, we do not know whether this approach generalizes to a completely different scenario (e.g., autonomous cars). However, the service robot task in unconstrained pedestrian environments has quite a broad potential impact on society. We believe our extensive work will be an inspiration for many others in continuing research in autonomous mobile robotics.

5.4.1 Performance Evaluation

A key factor for our system's success is the hierarchical representation, which effectively separates the localization and navigation tasks. Here, navigation is dealt with at the local map level, while localization is handled at the global map level. The separation enables the navigation module to independently control the robot, affording the localization module a longer period to perform its computations. This, in fact, is a necessity because it takes seconds for the localization module to compare input landmarks with its large stored database. However, unlike obstacle avoidance and road heading correction, localization-related tasks such as turning at an intersection or stopping at the goal are infrequently invoked and can be updated with a substantial delay. In addition, the infrequent goal location updates do not affect the navigation system on a frame-to-frame basis because it is assumed that the robot wants to go in the direction of the road.
On the contrary, by having a mechanism that keeps the robot’s viewing perspective fromornearthecenter oftheroad, wedonothavetotrainthevisionlocalization system onlandmarksfromotherperspectives. Suchrequirementswouldalmostcertainlyputour recognition system past its limits because we use a regular camera as opposed to omni- directional camera [114, 182] or creating 3D landmark representation [177]. In addition, staying in the middle of the road also allows the robot to better see other semantically importantobjects,suchasstoporexitsigns,buildingentrances, anddoors. Inthefuture 152 we would like to have the system beable to navigate through open areas where the road lines are not readily visible, while still be able to optimally view these important cues. Another important aspect for the success of the system is the implementation of manytechniquestoaccount fordifferentprocessingdelays ofthevisionmodules. For the long latency of the localization module, the system implements a multilevel recognition with gist for segment recognition, and landmark recognition to refine the localization hypothesis. In addition, the system also projects forward the results by adding an odometry difference to the localization estimation between the start and end of the landmark recognition process. Also, in the road recognition system, the module runs two processing pipelines: one for the full visual road recognition, while another tracks previously detected road lines to produce updated estimates while the full recognition pipeline looks for new road lines. Interms of implementation, theoverall system usesa numberof parameters. For the most part, we have chosen them experimentally during development. We tried several values for each parameter and used the one that produce the best performance. For example, we tried to incorporate the last 100 and 500 absolute headings in the road direction filteringstep. We foundthat the former, which wedecided on usingintheend, allows for robust filtering while also being responsive enough to detect a change in road direction. On theother hand, thelatter seems to adapt to road changes noticeably more slowly. Furthermore, when possible, we and others have previously conducted extensive parametric testing of system components: for attention [66], gist [152], SIFT [90], road recognition [151]. 153 Toobtain atrulyoptimal set ofparameters, wehavetotest allpossiblecombinations of all possible values. Unfortunately, the presented system is tightly coupled, where the number of variants to consider when adjusting possible parameters is subject to combinatorial explosion. Nevertheless, we have shown that the system still performs quite robustly with the parameters chosen. We believe that most of these parameters are not extremely sensitive for performance, and there is flexibility in selecting these numbers. In this chapter we tried to give an intuition for what the factors are that should be considered, and we provide these values to gauge what is the expected ideal range. We do not claim that these are the best or only possible values. It is possible that better values will be found in future research. However, we report those numbers to allow the readers to reproduce our exact system if they want to. To that end, we also provide the software source code at [64] to port to other robotic systems, or as a guide to adapt our algorithms for specific usage. 
5.4.2 System Comparison Our evaluation, in terms of scope, is most similar to the Tsukuba challenge [200], which takes its human-sized service robot participants through 1.1km route of pedestrian road in the city of Tsukuba. This is unlike the Darpa grand challenge [110] and European Land-Robot Trial (ELROB) [142], which are for autonomous cars, or AUVSI’s Intelli- gent GroundVehicleCompetition(IGVC)[167]andNASA’sCentennialChallenge[116], which are conducted on open grass fields. In terms of goals, our system relates to a number of fully autonomous systems that areoperatinginoutdoorpedestrianenvironments. TheACEproject robot [88] performs 154 outdoornavigationbutreliesonpassersbytoguideittothefinaldestination1.5kmaway. The EUROPA project robot [163], on the other hand, is able to navigate and localize in unconstrained urban settings using LRF sensors. To the best of our knowledge, we did not find a fully autonomous visually-guided localization and navigation system that is tested in the pedestrian environment to compare with. One that comes closest is RatSLAM [102], which takes inspiration from research done in how rat brains perform navigation tasks. It has a fully autonomous version [103], where it simultaneously localizes and maps (SLAM) the environment usingvision, and navigates using a planar LRF indoors. Our system, despite not having SLAM yet, performs both navigation and localization outdoors. Another aspect of RatSLAM to note is its visual features. It uses a column average intensity profile taken at specific windows for three different purposes: scene recognition, rotation detection, and speed estimation. This is a stark difference from what most state-of-the-art vision localization systems,includingours,used: computationallycomplexkeypoint-basedrecognition. Our perceptual system, which relies on SIFT-based salient landmark database matching, is more inline with that of thecurrent SLAM systems: FAB MAP [29] or CAT-SLAM [92]. In terms of results, the best way to benchmark our system is by directly comparing ourresultswithothersinthesamesettings. However, thiswouldbealargeundertaking. Tohaveafaircomparison, thesystemswouldhavetoberunonthesamerobotthatthey are originally published in, or at least the same configuration of sensors. In addition, there is also an issue of code availability as well as the care needed to port and have it perform optimally. A way to circumvent this is by requesting the authors to run their systems on a recorded data. However, we are dealing with active systems, where 155 different decisions at each step can lead to very different outcomes. Similarly, simulation exerciseswouldalsofallshortofreal-lifetestingbecausetherobotoperatesinadynamic, unpredictable, andnot fullyaccessible (in theA.I. sense) world, whichwebelieve cannot be well captured by the current world models. Given these difficulties, we decided to compare our results with the results of other systems as reported in their respective papers. Our localization median error of 0.97m is quite high compared to that of outdoor LRF-based systems such as [109], which reports error near 10 cm. This is because proximity data is much more discriminating when there is sufficient amount of valid laser readings, which is not always the case outdoors. Visual features, on the other hand, observe the appearance of the environment. They help the system to rely less on nearby surfaces by extending its perceptual range. 
Because of this we believe our vision-based navigation and localization system will complement laser-based systems. Theuseofgeographical distance fromgroundtruthaslocalization metric inourtest- ing is actually different than how vision-based SLAM systems are currently evaluated. They use precision-recall curve, which focuses more on the recognition step. Here preci- sion isthenumberof correct landmarkmatches over thenumberof matches, while recall is the number of correct matches over the numberof possible matches. Ourrecall rate is close to one landmark per five possible matches in an image (20%). Our precision rate, on the other hand, is close to 100% because of the high threshold on positive matches, at least 3 keypoints in a proper affine configuration. Note that almost perfect precision rate is the norm for keypoint-based systems. Thus, precision is usually set at 99% to 156 100% beforeevaluating therecall rate. From [92], wegaugeourrecall rate iscompetitive as anything above 10% is respectable. Recognition rate is also the metric of choice for visual place recognition systems. As [153] and a number of others [161, 37] have shown, the performance of our gist-based system, which is about 80% for the standard USC dataset, is only 10% lower than some of the more recent systems. In a sense, place recognition is a different classification problem than that of visual SLAM systems. The former has to discriminate a smaller number of classes, but with more in-class variance, while the latter has to discriminate many more classes, with fewer number of views within class. We believe that, to build a vision localization system that can perform robustly in many environments, a system has to be able to distinguish places at different levels, from the general place names to the specific locations. For navigation, there are a number of metrics that are currently used for evalua- tion. One is the success rate in reaching an assigned goal. Although localization is also involved, it is clear to us that the main difficulty lies in path creation and execution. Even if the path created is perfect, if the robot is not able to properly execute the com- mands, it can still hit an obstacle. Thus, systems are also evaluated on the severity of the failure, whether it is a square hit or just a graze [95]. Our system, aside from a numberof isolated incidents, usuallyprovidesabout30cm of space between its bodyand the obstacles. In addition, we can also consider execution speed, which is currently less emphasized. The goal for service robots, which are meant to help people, such as ours is to at least match that of an average person (about 1.4m/s). Our average velocity is currently half (0.66m/s) that and is competitive with most currently available systems. 157 It should be noted that robot speed also depends on computational platform and can be increased by simply upgrading them. Thesuccessofasystemisalsoinfluencedbythetravel distance, thelongerthehigher the likelihood of failure. For shorter distances of 40 to 50 m, it is not unusual to achieve near 100% success rate [177]. Our system is asked to reach a goal of 400m or more in busy outdoor environments. Out of six fully autonomous runs, we have to intervene in two of them, all of which are more than 200m from the start. As a comparison, there are a numberof indoor systems that have achieved much longer autonomy: lasting a full 26.2 mile marathon [95] and 40km over 37 hours [103]. 
Because a large part of failure to avoid an obstacle originates in the detection step, another way of evaluating is based on the produced obstacle avoidance map. This way is often seen in systems [97] that try to detect negative obstacles (e.g., a dip next to a sidewalk). Given the large area which our system is operating in, it would be time- consuming to create a complete ground truth occupancy map. We are, however, able to qualitatively assess our system and produce a list of events in which it fails, in section 5.4.3. Ourlateral deviationnavigation metricdelvesdeeperintothequalityofrobottraver- salbecausewebelievethemiddleoftheroadallowstherobottobeinpositiontoperform more tasks, if called upon, along the way. Our road recognition system has been shown to be statistically superior in detecting road centers in images with a number of other state-of-the-art systems [75, 135, 21] with an accuracy level of near 40cm. However, the system performs worse in our robot long-range testing, with a median of 1.37m. Thus, it is hard to predict how this, or other image-based metrics [135, 101] that measure road 158 segmentation overlap, would translate to actual autonomous navigation. As such, it is important to create a testing regimen where all the systems are tested on a robot, under the same conditions. 5.4.3 Failure Cases and Future Improvements During testing we observe a number of aspects that we would like to improve upon. Automatic localization initialization. The current localization system has to be given an initial location to guarantee correct convergence. It usesthe current location to orderthedatabase landmarkswhencomparingthemwiththeinputregions. Inaddition, the system also employs a search cutoff in the comparison process for efficiency reasons. Thus, instead of going through the whole database, there are times when the matching landmarkmaynotevenget tobecomparedwiththeinputregionbecauseofanincorrect prior. Another related issue is the need to fine tune the matching criteria to maximize matches while keeping false positives to a minimum. In addition, when the robot is lost, it should be more active in searching for better views, where the landmarks are more likely to be matched. Simultaneous Localization & Mapping. Currently, the localization map is still manually constructed. We always understood that this is inconvenient, although, for many application scenarios (see figure 1.1), we believe this step is well worth the ef- fort because the robot will be used many times in the same environment. In the next iteration, we plan to create the metric topological map automatically using a Simul- taneous Localization and Mapping (SLAM) algorithm. There are a number of SLAM algorithms that aim to create such hierarchical map. A system by [96] connects local 159 overlapping LRF-based occupancy maps to a single global topological map. Similarly [125] constructs a topological map by associating stereo-based local submaps. Note that these systems have only been tested indoors, and it may not be trivial to adapt them to outdoor environments. For our system, we are particularly interested in creating a map with meaningful nodes that represent important geographical points, similar to [134], whereit identifies uniqueLRFsignatures producedby junctionsinindoorenvironments. The key is to robustly do the same in unconstrained pedestrian outdoor environments, where the presence of surrounding walls cannot be assumed. 
One way is by using an online gist feature grouping technique [182], which detects a large change in the feature space, correspondingto moving into a new segment. In addition, related to metric topo- logical SLAM, there is also a research problem [12, 83] of how to topologically represent more complex path or space such as winding roads or open area. Proper Turning at an Intersection. Although the localization system did not completely fail or misplace the robot during testing, at times, it converged a few meters awayfromthetruelocation. Thisbecomesanissuewhentherobotisnearanintersection and needs to turn. If the turn occurs far away from the center of the intersection, the robot may end up on the shoulder instead of near the center of the next road. One way to solve this problem is by explicitly recognizing the junction or intersection. Given a prior that the robot is near an intersection, the system can look for road lines perpendicularto the lines that are pointing toward the current robot direction. Another way to avoid traveling on the side of the road is by recovering the true road center establishedduringtraining. Thevisuallandmarksinthestoreddatabasealsocarryimage coordinate locations andcanbeusedtoadjustthetarget lateral position. Thisapproach 160 is similar to teach-and-replay [20], but different in that the result is not directly coupled with motor command generation. It merely corrects for the road center estimation wheneverthematches areavailable, preventingthesystemfromhavingtostoptherobot when recognition failed. This technique is also useful for navigating when the vanishing point is undetected, as well as on more extreme curved and winding roads, improving upon the system’s current ability to follow the more gradually changing roads [151]. Robust Obstacle Avoidance Navigation. Intermsofnavigational shortcomings, there are a few times when the robot failed to detect obstacles in its path, particularly, thin or short objects, as well as objects with holes. In addition, there are also a number of scenarios that we did not include in our test, but are expected to currently lead to systemfailuresandneedtobehandledinthefuture: curbs,ramps,off-roads. Toimprove the system competencies in these situations, we are adding an up-down sweeping LRF, as well as applying techniques by [72, 97]. We believe a moving LRF is the best way to maximize obstacle visibility, while the algorithms enable our system to handle those situations more robustly. In addition, we also plan to incorporate the evidence from the sweeping LRF with our visual road recognition module by identifying which road lines are solid boundariesandshould never becrossed. Here, theroad boundarycan benoted by visual appearance change, change in proximity profile, or asharp change in elevation. Once we incorporate these improvements, we believe the resulting system will be robust in many different outdoor scenarios, sufficient to travel for even longer distances. Better Road Detection. The system also needs to improve on detecting roads in more difficult conditions, such as bright sunlight. This can be done in both hardware and software. For the former, a camera with wider dynamic range would be a welcome 161 addition. In addition, having a lens with wider field-of-view is also helpful, especially for the wider uniform roads with no direction cues between the boundaries. For software improvements, a better low-level edge and line segment detector (with local contrast adjustments)cansignificantlyboostperformance. 
Adifferentwaytoimprovethesystem isbyutilizingvisualcuesotherthanroadlines. Weareplanningtoincorporatetheuseof histogramdifferencingtosegmentouttheroadfromitsflankingareas,muchlike[21]. By creating a system that runs a number of contrasting approaches in parallel, we hope to enable the system to robustly detect more road types, even in more naturalistic off-road environments. Related to this, the localization system, particularly our biologically- inspired gist classification and saliency, have been shown to perform well on non-urban environments [155, 152, 65]. Improved Navigation in Open Areas. In situations where there is no road in front of the robot, the system simply maintains its current forward heading set by the last seen road, while trying to avoid obstacles. This goes on until there is a continuing road or the localization module instructs to make a turn. In these situations, the system performs navigation exclusively using LRF, producing behaviors such as wall-following or centering along a hallway. Our robot in a sense follows a subsumption architecture [17], i.e., whenaroadis detected, follow it, otherwisedefault todrivingstraight forward. Thus the navigation system is not dependent on the presence of the vanishing point. On the contrary, it only uses visual road recognition when it is advantageous. This way the road recognition system enhances the standard laser-based navigation system largely utilized today. In pedestrian environments, however, there are open areas such as a square or a large hall, where there are no nearby wall surfaces to guide the robot. 162 In the experiment, the system was able to navigate through an open water fountain area (observe figure 5.27). However, this success, in large part, is because the continuing road is aligned with the previous road. In less ideal configurations, the system can be confused and drive to the wrong road. To rectify this problem, the system must have a richer understanding of the space, knowing all the roads that come in and out of the area. Navigation in Crowded Areas. As for the effect of crowding to our system, we are encouraged by the results that show that it is able to localize and navigate through those difficult conditions. In the future we would like improve on Beobot’s navigation skills in moving more naturally among people[176], for example using a fast GPU-based pedestrian recognition system [13] to properly anticipate where they are headed. We can also try to characterize how pedestrian behave as a group [53], which would bemore feasible for denser crowds. 163 Chapter 6 DeepVP: Deep Learning for Vanishing Point Detection on 1 Million Street View Images 6.1 Introduction The capability of an autonomous driving vehicle to detect road boundary is crucial to determiningtheheadingdirection andstaying ontheroad(Fig. 6.1). Typical navigation tasks relie on proximity sensors such as Laser Range Finders (LRF) [58, 193] to capture the surrounding geometry information. Current LRFs such as from Velodyne[47] are able to measure depth in 360 degrees. They are the primary sensors that have allowed Google’s Car [168] to travel autonomously over 1 million miles. However, these sensors have challenges to recognizing lane markings on the road without physical geometry differences. Road recognition methods can be categorized into two types. 
The first is the color-based approach, which recovers road boundaries by recognizing specific features such as color histograms [135, 82], color contrast between the road and its flanking areas [138], and road region candidates from color-based segmentation [51, 25, 38].
Figure 6.1: Autonomous car navigation in a highway environment. The system needs to estimate the road heading as well as the road boundary.
The second type of road recognition technique finds the road's vanishing point first to determine heading. The vanishing point (VP) is the point where parallel road boundaries converge. By identifying the vanishing point location in the image, the system can further recover both road boundaries. Traditional vanishing point detection usually relies on hand-crafted, edge-sensitive features such as Gabor filters [75, 21, 198], Gaussian filters [76], or Hough transform lines [187, 151, 87, 11, 199]. In addition, typical filter-based VP detectors tend to be slow because they use a voting scheme that may have up to O(n^2) time complexity. As a remedy, [204] use sky segmentation to first estimate the horizon line, reducing the voting area to the non-sky region. Moreover, [21, 107] use only four Gabor orientation channels {0°, 45°, 90°, 135°} and selective voting to speed up the voting process. Although traditional VP detection methods are able to estimate the road with a certain accuracy, hand-crafted features are usually sensitive to spurious edges such as shadow boundaries.
Recently, deep learning [86] has dramatically improved the state-of-the-art results in many machine learning domains. Here we briefly review applications of deep architecture models, with a focus on convolutional neural networks (CNNs), in the domains of computer vision, speech recognition, and natural language processing. Convolutional neural networks are composed of multiple processing layers, trained end-to-end to automatically learn representations of data at multiple levels of abstraction. In computer vision, CNNs have achieved great success in image classification [80, 158, 165], object detection [46, 147], image segmentation [22, 89], activity recognition from videos [69, 157] and many other tasks. Two sensational advances stand out: in 2012, Krizhevsky et al. [80] trained a 7-layer convolutional neural network on the large-scale ImageNet [31] dataset for image classification, and achieved a top-5 test error rate of 15.3%, almost half the error rate of the second-best algorithm at the time. In 2014, Karpathy et al. [69] collected the sports-1M video dataset with 487 classes of sports and used a CNN to learn spatio-temporal information from video clips to classify videos into different sports categories, again beating the state of the art by a large margin.
In speech recognition, deep architectures have boosted results significantly [54, 2]. Similar to training CNNs in vision, researchers [2] used large voice datasets to train their CNNs. Deep architectures have also been used to solve various natural language processing tasks: Kim [70] proposed using a CNN to classify sentences, and Zhang and LeCun [202] trained CNNs on various large-scale datasets to understand text from scratch.
The success of deep neural models (e.g., CNNs) is partially attributed to the availability of big datasets. In vision, the ImageNet database [31] has 14,197,122 images and sports-1M [69] has 1M sports videos; in speech, researchers have likewise used hours of transcribed voice data to train DNNs/CNNs; in language, the training data is even larger, as freely available online articles and documents can be used. A sufficiently large amount of training data with sufficient diversity is necessary to effectively train complex deep architectures, since these have very large numbers of degrees of freedom.
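To make the cost of the traditional pipeline concrete before turning to a learned detector, a bare-bones voting-style VP detector can be sketched as follows. Plain image gradients stand in for the Gabor filter bank, so this is only an illustration of the ray-voting idea and its cost, not a faithful reimplementation of [75] or [21].

```python
# Simplified sketch of a voting-style vanishing point detector. Each strongly
# textured pixel casts votes along its texture direction; the most-voted pixel
# is taken as the VP. The ray walk per pixel is what makes voting expensive.
import numpy as np

def vanishing_point_by_voting(gray, stride=4, min_mag=10.0):
    gy, gx = np.gradient(gray.astype(float))
    mag = np.hypot(gx, gy)
    # Texture orientation is perpendicular to the gradient direction.
    theta = np.arctan2(gy, gx) + np.pi / 2.0
    h, w = gray.shape
    votes = np.zeros((h, w))
    ys, xs = np.nonzero(mag > min_mag)
    for y, x in zip(ys[::stride], xs[::stride]):
        dx, dy = np.cos(theta[y, x]), np.sin(theta[y, x])
        if dy > 0:                      # cast votes upward, toward the horizon
            dx, dy = -dx, -dy
        py, px = float(y), float(x)
        while 0 <= py < h and 0 <= px < w:
            votes[int(py), int(px)] += 1
            px, py = px + dx, py + dy
    vy, vx = np.unravel_index(np.argmax(votes), votes.shape)
    return int(vx), int(vy)
```

Every edge pixel walks a ray across the image, which is the quadratic-style cost that the voting-based methods above trim with selective voting and sky removal; the learned detector proposed next replaces all of it with a single feed-forward pass.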
In light of the limitations of traditional detectors and of the recent successes of deep neural networks, we propose to leverage the power of deep learning models to automatically learn a VP detector. Our main contribution is a completely new approach to vanishing point estimation based on deep learning. We train a deep model end-to-end, which takes an image as input and outputs its vanishing point location. Compared to traditional algorithmic methods, which go through several sequential steps to predict the vanishing point location, our deep vanishing point algorithm is a fast, feed-forward neural network evaluation that directly returns the VP. However, training deep network models requires large-scale labeled datasets. To the best of our knowledge, the available labeled vanishing point datasets are too small to properly train a deep network (Table 6.1). This motivates us to collect our own dataset of 1 million Google Street View images with labeled vanishing point locations. The dataset covers a worldwide range of road appearances under various lighting conditions, and it is the only one that contains multiple camera angles at each scene location. We also present a method to automatically collect and label VP images using the Google Street View API, which enables future dataset expansion. We make the dataset and source code publicly available at http://ilab.usc.edu/kai/deepvp. We trained a convolutional neural network on this largest VP dataset available to date and, as detailed below, achieved by far the best results reported so far for vanishing point detection.
In the following section 6.2, we describe our new VP dataset. We present the network architecture in section 6.3, report the testing results in section 6.4, and discuss the findings in section 6.5.
6.2 Vanishing Point Dataset
Figure 6.2: The DeepVP dataset, collected from routes worldwide.
6.2.1 New dataset for deep learning training
The image datasets currently available to assist vanishing point algorithm development and testing are shown in Table 6.1. Most are collected and hand-labeled in urban areas. The most commonly used one [75] contains 1,003 images from web image searches and desert road scenes.
Deep learning approaches typically require a much larger number of training examples. Existing VP datasets fall short in the total number of images; even the largest existing dataset [21] contains only 25,076 images from four continuous video routes. Moreover, most of them were collected in a small area, where the scenes are too similar and lack appearance variation. In addition, the collected VP images are usually recorded from a front-facing camera mounted on a moving vehicle. This causes most of the VP locations to be near the center of the image. Such unbalanced training labels may cause a strong center bias and may decrease accuracy at other image locations.
Here we collected a freely-distributed dataset, the DeepVP-1M dataset, designed for large-scale machine learning purposes. It contains 1,053,425 images at 300x300 resolution from 23 routes across 21 countries. The total length of the routes is 125,165 kilometers, which ensures wide coverage of road appearance.
6.2.2 Data collection method
Figure 6.3: Discretized VP labels, for a total of 15x15 = 225 labels.
We propose a novel way to collect labeled vanishing point images from Google Street View. Street View was chosen because 1) it covers a wide range of road appearances, 2) its panorama capability can generate multiple VP views from each individual location, 3) it provides camera parameters such as pitch information to estimate the horizon line, and 4) when using the image capture vehicle's navigation direction, we can align the VP center to the road heading direction based on the navigation route. Furthermore, Google Street View provides an API to fetch an image given GPS coordinates (longitude and latitude) and camera parameters (pitch, heading):

img = StreetViewAPI(lon, lat, pitch, heading)    (6.1)

6.2.3 Image collection from a single GPS location
To predict VP coordinates in an image, we collect at every location a set of images that covers different VPs from top left to bottom right. We pan and tilt the camera view in 5° steps from -35° to 35° within the panorama scene, resulting in a total of 15 x 15 = 225 images from a single GPS location (shown in Figure 6.3).
6.2.4 Align road heading with image heading
While camera pitch, which is compensated for road inclination, can be directly translated into the VP's Y-axis coordinate (assuming that the road's up or down curvature is small, as is the case on most paved roads), estimating the VP's X-coordinate requires us to find the angle difference θ between the camera heading and the road direction heading, as illustrated in Fig. 6.4.
Figure 6.4: Computing the angle difference between camera heading and road heading.
When we request a Google navigation direction from point A to B, Google Maps returns a series of polylines representing each road segment. Each polyline can be further broken down into a group of waypoints that indicate the GPS locations along the same road segment. Given two waypoint coordinates P1(lon, lat) and P2(lon, lat), we compute the latitude- and longitude-based differences (Φ, Ψ) to approximate the angle of the road with respect to the camera heading. This angle θ is the offset of the road heading:

Φ = log( tan(P2_lat/2 + π/4) / tan(P1_lat/2 + π/4) )    (6.2)
Ψ = P2_lon − P1_lon    (6.3)
θ = atan2(Ψ, Φ)    (6.4)

Because the VP is likely not well defined at intersections where the Google vehicle took a sharp turn, we only use waypoints well within the route segments, as the segment boundaries often coincide with such turns.
6.2.5 Mapping camera angle to pixel coordinate
To overcome camera distortion and non-linearity, we manually create a correspondence map from camera pitch angle to the image Y-axis, using a third-order polynomial whose parameters were derived from a small subset of manually-annotated data:

y = aθ^3 + bθ^2 + cθ + d    (6.5)

Similarly, we can apply the same principle to find the mapping from road heading to the X-axis.
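The geometry of Sections 6.2.3-6.2.5 can be summarized in a short sketch. The Street View API itself is not called here; road_heading implements the rhumb-line bearing of Eqs. 6.2-6.4, camera_grid enumerates the 15 x 15 pan/tilt sweep, and the polynomial coefficients in pitch_to_row are made-up placeholders (the real ones were fit to manually-annotated data).

```python
import math

def road_heading(p1, p2):
    """Bearing (degrees) of the road segment from waypoint p1 to p2, each (lat, lon)."""
    lat1, lon1 = map(math.radians, p1)
    lat2, lon2 = map(math.radians, p2)
    phi = math.log(math.tan(lat2 / 2 + math.pi / 4) /
                   math.tan(lat1 / 2 + math.pi / 4))        # Eq. 6.2
    psi = lon2 - lon1                                        # Eq. 6.3
    return math.degrees(math.atan2(psi, phi))                # Eq. 6.4

def camera_grid(road_bearing, step=5, extent=35):
    """(pitch, heading) pairs sweeping -extent..+extent degrees around the road
    direction: 15 x 15 = 225 views per GPS location for step=5, extent=35."""
    offsets = range(-extent, extent + 1, step)
    return [(pitch, (road_bearing + pan) % 360) for pitch in offsets for pan in offsets]

def pitch_to_row(pitch_deg, coeffs=(0.0, 0.0, -4.0, 150.0)):
    """Eq. 6.5: third-order polynomial mapping camera pitch to an image row.
    The coefficients here are illustrative placeholders only."""
    a, b, c, d = coeffs
    return a * pitch_deg**3 + b * pitch_deg**2 + c * pitch_deg + d

# Example: two consecutive waypoints on the same road segment.
bearing = road_heading((34.0205, -118.2856), (34.0212, -118.2851))
views = camera_grid(bearing)
print(round(bearing, 1), len(views))   # 225 views, with a known VP offset for each
```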
Figure 6.5: The CNN architecture used for vanishing point detection. We closely follow the architecture of AlexNet [80], only changing the number of filters in the last two fully connected layers to 1024. The architecture in our notation is: C96-P-C256-P-C384-C384-C256-P-F1024-F1024-O. Refer to the text (Section 6.3) for a description.
6.3 Network architecture and its adaptation to vanishing point detection
In our scenario, we formulate vanishing point detection as a CNN classification problem. First, we discretize the numerical vanishing point coordinates into discrete labels. In this way, vanishing point detection is transformed into a classification problem: given an image, predicting the location of the vanishing point is equivalent to predicting the discrete label of its vanishing point. The predicted label is then remapped to numerical coordinates in the image, yielding the vanishing point location.
6.3.1 Network architecture
There are several prevalent CNN architectures, including AlexNet, VGG and GoogLeNet. Here, without loss of generality, we use AlexNet [80]. AlexNet has achieved astonishing performance in object and scene classification [80, 203]. It is a linear-chain feed-forward architecture with images as input and class labels as output.
We use the AlexNet implementation in [184], which consists of the input layer, 5 convolutional layers (conv), 2 fully connected layers (fc) and the top label layer. Concretely, the first two layers are divided into 4 sub-layers: convolution, local response normalization (LRN), rectified linear units (ReLU) and max-pooling. Layers 3 and 4 are composed of convolution and ReLU. Layer 5 consists of convolution, followed by ReLU and max-pooling. There are two fully connected layers, fc6 and fc7, stacked on top of pool5, and each of them is followed by a ReLU non-linearity. The last label layer is a fully connected layer with the number of nodes equal to the number of discrete classes.
To be concise, we use the abbreviations Ck, ReLU, Fk, P, D, O to represent a convolutional layer with k filters, a ReLU non-linearity layer, a fully connected layer with k filters, a pooling layer, a dropout layer, and an output layer. Since a ReLU non-linearity layer follows every convolutional/fully-connected layer, we omit it for a simpler and cleaner architecture representation. In this notation, AlexNet is represented as: C96-P-C256-P-C384-C384-C256-P-F4096-F4096-O. In the following text, we use this notation for architecture representations.
Figure 6.6: Route-wise vanishing point detection accuracy of the two algorithms (mean L2 distance error, in pixels, for each of the 23 routes). DeepVP consistently outperforms the Gabor VP method on all routes, with significant accuracy improvements, a clear demonstration of the superiority of DeepVP over traditional algorithmic methods.
6.3.2 Vanishing point detection
We closely follow the original AlexNet architecture, only changing the number of filters in the last two fully connected layers to 1024. Specifically, we use this architecture for vanishing point detection: C96-P-C256-P-C384-C384-C256-P-F1024-F1024-O. By reducing the number of filters, the capacity of the CNN is reduced accordingly, which could result in a decrease in prediction accuracy. However, in our case we only have 225 discrete vanishing point labels, and experimental results show that this capacity is sufficient to maintain the prediction accuracy, with many fewer parameters to learn. In total, we have 50M parameters, compared with 220M in the original AlexNet.
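The label/coordinate round trip described above can be sketched as follows. Note that the uniform 15x15 tiling of the 300x300 image and the cell-center remapping are simplifying assumptions made for illustration (in the dataset, labels come from the camera pan/tilt grid), not the exact mapping used here.

```python
# Minimal sketch: map a VP pixel coordinate in a 300x300 image onto a 15x15 grid
# of 225 class labels, and map a predicted label back to the cell-center pixel.

GRID, IMG = 15, 300
CELL = IMG / GRID                      # 20 px per cell

def vp_to_label(x, y):
    """Pixel VP (x, y) -> class label in [0, 224]."""
    col = min(int(x // CELL), GRID - 1)
    row = min(int(y // CELL), GRID - 1)
    return row * GRID + col

def label_to_vp(label):
    """Class label -> pixel coordinates of the cell center."""
    row, col = divmod(label, GRID)
    return (col + 0.5) * CELL, (row + 0.5) * CELL

lbl = vp_to_label(152, 148)            # a VP near the image center
print(lbl, label_to_vp(lbl))           # 112 (150.0, 150.0)
```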
6.4 Experiments
6.4.1 Dataset setup
We collected Google Street View images from 23 different routes. Along each route, we use the images collected along the first 3/4 of the route as training data, and the remaining 1/4 as test data. Under this partition, test images are never seen during training, minimizing the overlap between training and testing. We end up with 790,069 training images and 263,356 test images. Each image is associated with 1 (out of 225) discretized vanishing point label.
6.4.2 CNN setup
We train AlexNet to predict the discretized vanishing point after initializing the parameters with random Gaussian weights. The dropout layers following the two fully connected layers use a dropout rate of 0.5. We do not perform data augmentation during training, since our dataset is already sufficiently large to allow us to achieve the best known results. For optimization, we use the typical softmax loss function as the objective and run stochastic gradient descent (SGD) to minimize it, with the batch size set to 128. We set the starting learning rate to 0.01. The network is trained for 20 epochs, approximately 129K iterations. The learning rate is reduced by a factor of 10 after every 10,000 iterations. We train AlexNet using the publicly available MatConvNet [184] toolkit on an Intel 6-core i7-5930K 3.5 GHz computer with an Nvidia Tesla K40 GPU.
6.4.3 Quantitative Comparison
We evaluate our new DeepVP method on a total of 263,356 test images, and compare it to Chang's method [21], implemented in C++, as a representative Gabor VP method that has already been shown to perform better than [75] and [135].
Table 6.2 shows that DeepVP has a clearly improved overall accuracy (92.09%) compared to Gabor VP (47.99%). DeepVP is extremely accurate (99.25%) in estimating the horizon line (X-axis accuracy in Table 6.2). We believe this is because the X-axis ground truth labels are directly mapped from camera pitch, which has minimal uncertainty. In comparison, the Y-axis accuracy (92.38%), which estimates the road heading, requires an extra angle offset estimation (θ), which may decrease precision.
In terms of computation time, DeepVP is 28 times faster than Gabor VP on a conventional CPU. This is because DeepVP does not require a complex voting scheme like Gabor VP. Furthermore, DeepVP can efficiently utilize a GPU to compute its result, further increasing its speed by roughly 63x.
6.4.4 Independent Routes Comparison
In Figure 6.6, we compute the Euclidean distance between the labeled ground truth and each method's prediction for every route. The results show that the DeepVP method is more accurate than Gabor VP on all routes. The total L2 distance error of DeepVP is 9.83 pixels, compared to Gabor VP's 37.67 pixels.
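For concreteness, the bookkeeping behind the comparisons in Sections 6.4.3 and 6.4.4 can be sketched as below. The fixed pixel tolerance is only a stand-in for the ±5° criterion, whose exact angle-to-pixel conversion is not reproduced here.

```python
# Hedged sketch: mean L2 pixel error per route and fraction of predictions
# within a tolerance, given (route, predicted VP, ground-truth VP) records.
import math
from collections import defaultdict

def evaluate(records, tol_px=20.0):
    """records: iterable of (route, (x_pred, y_pred), (x_true, y_true))."""
    per_route = defaultdict(list)
    for route, pred, true in records:
        err = math.hypot(pred[0] - true[0], pred[1] - true[1])
        per_route[route].append(err)
    summary = {}
    for route, errs in per_route.items():
        summary[route] = {
            "mean_l2_px": sum(errs) / len(errs),
            "accuracy": sum(e <= tol_px for e in errs) / len(errs),
        }
    return summary

demo = [("boston", (150, 150), (150, 160)),
        ("boston", (90, 210), (150, 150)),
        ("paris",  (148, 152), (150, 150))]
print(evaluate(demo))
```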
6.4.5 VP Coordinate Comparison
Figure 6.7: The location-dependent accuracy of vanishing point estimation by the two algorithms. The Gabor VP algorithm is biased: it achieves better prediction accuracy when the ground-truth VP is near the image center, while the prediction accuracy of DeepVP is relatively location-insensitive.
Figure 6.8: Effect of network capacity on VP detection accuracy (top-1 through top-5 accuracies as the last two fully connected layers shrink from 1024 to 16 nodes). Empirically, when the number of neurons in the last two fully connected layers is at least as large as the number of output labels, the CNN performs as well as a larger-capacity network.
Fig. 6.7 evaluates the L2 distance error at each vanishing point coordinate. It shows that the ground truth location affects the traditional Gabor VP method significantly. Gabor VP has better accuracy when the vanishing point is near the middle region, slightly below the image center, indicated in blue in Fig. 6.7. As we examine each prediction result, we find that if the ground truth location is too high in the image, the Gabor VP method is affected by the shadows and markings on the road. If the ground truth location is too low, there is not enough road region in the image for Gabor VP to operate reliably. This increases the prediction error in Fig. 6.7, indicated in red. DeepVP, on the other hand, is more stable in all regions. This is because our dataset has training labels equally distributed among all locations; the system does not rely solely on edge-sensitive features but also benefits from the more balanced dataset.
Figure 6.9: Visualization of the receptive fields of filters at different layers. Top: the top 25 images with the strongest activations for filters at a few layers (pool1, pool2, conv3, and pool5); Bottom: the 25 image patches from the corresponding images in the first row that excite each filter the most.
All the results above were obtained with the architecture C96-P-C256-P-C384-C384-C256-P-F1024-F1024-O, which has 1024 neurons in each of the last two fully connected layers. Here we further investigate how network capacity affects VP prediction accuracy. We gradually shrink the number of neurons in the last two fully connected layers by factors of 2, train the network on the same training data, and plot the prediction accuracies in Fig. 6.8. As long as the number of neurons approaches or exceeds the number of output labels, 225 in our case, the prediction accuracies plateau.
Figure 6.10: Example images from different sites.
6.4.6 Visualization
We use all test images as input, compute their activation responses for each filter on each layer, and show the top 25 images (Fig. 6.9, top) with the strongest activations for each filter, along with the corresponding image patch (Fig. 6.9, bottom) within each image that excites the filter the most. Fig. 6.9 shows that (1) the receptive field size increases as the layer goes deeper, and (2) meaningful sub-structures of scenes are learned automatically, e.g., Gabor-like edges in pool1 and horizon lines in pool2.
6.5 Conclusions
In this chapter, we present an innovative vanishing point estimation method based on deep convolutional neural networks. Our results show that DeepVP outperforms the state-of-the-art algorithmic vanishing point methods. DeepVP is capable of predicting the vanishing point location in a wide range of environments with highly efficient performance. We also collected the largest vanishing point dataset to date, with over 1 million images, to provide more comprehensive training, testing, and development for future vanishing point algorithms. Using the same novel way of automatically collecting the dataset and ground truth labels, more domain-specific datasets can be added for different navigation applications.
Table 6.1: Overview of most available vanishing point datasets.
Year   Dataset                   Ref    Domain    Scene Class   Scene Coverage   Total Images
2003   York Urban Lane Segment   [32]   urban     2             small            102
2009   Kong'09                   [75]   outdoor   3             medium           1,003
2011   Eurasian Cities           [9]    urban     1             small            103
2012   PKU Campus                [87]   urban     1             small            200
2012   Chang'12                  [21]   urban     2             small            25,076
2014   Le'14                     [85]   urban     1             small            1,600
2015   TVPD                      [8]    urban     2             small            102
2016   DeepVP-1M                 -      outdoor   23            large            1,053,425

Table 6.2: Comparison of results with hand-crafted feature approaches.

                           Deep VP    Gabor VP [21]
X-axis Accuracy (±5°)      99.25%     59.24%
Y-axis Accuracy (±5°)      92.38%     72.78%
Overall Accuracy (±5°)     92.09%     47.99%
CPU time (s)               0.56       15.73
GPU time (s)               0.0089     -

Chapter 7
Conclusion and Future Works
In this thesis we present a real-time, visually-guided autonomous localization and navigation system. We demonstrate the system's efficacy using a wheeled robot, driving it through long-range and challenging routes of at least 400 m at the University of Southern California (USC) campus, for a total of 10.11 km of documented testing. On each route, the robot has to show its robustness in navigating among crowds, turning at several intersections, and stopping at the assigned destination.
For our setup, we have clearly framed the objective of our system to lie within a specific type of environment and application and, at this point, we do not know whether this approach generalizes to a completely different scenario (e.g., autonomous cars). However, the service robot task in unconstrained pedestrian environments has quite a broad potential impact on society. We believe our extensive work will be an inspiration for many others continuing research in autonomous mobile robotics.
References
[1] AAI Canada, Inc. AAI Canada, Inc. - Intelligent Robots - Khepera II. http://www.aai.ca/robots/khep_2.html, October 2009. Accessed: October 15, 2009.
[2] Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, and Dong Yu. Convolutional neural networks for speech recognition. Audio, Speech, and Language Processing, IEEE/ACM Transactions on, 22(10):1533–1545, 2014.
[3] C. Ackerman and L. Itti. Robot steering with spectral image information. IEEE Transactions on Robotics, 21(2):247–251, Apr 2005.
[4] M. Agrawal and K. G. Konolige. Real-time localization in outdoor environments using stereo vision and inexpensive GPS. In Proc. International Conference on Pattern Recognition (ICPR), volume 3, pages 1063–1068, 2006.
[5] Altium Limited. Altium - Next generation electronics design. http://www.altium.com, July 2009. Accessed: July 15, 2009.
[6] American Honda Motor Co. Inc. Asimo - The World's Most Advanced Humanoid Robot. http://asimo.honda.com/, July 2009. Accessed: July 15, 2009.
[7] A. Angeli, D. Filliat, S. Doncieux, and J.-A. Meyer. A fast and incremental method for loop-closure detection using bags of visual words. IEEE Transactions on Robotics, Special Issue on Visual SLAM, 24(5):1027–1037, 2008.
[8] Vincent Angladon, Simone Gasparini, and Vincent Charvillat. The Toulouse vanishing points dataset. In Proceedings of the 6th ACM Multimedia Systems Conference (MMSys '15), Portland, OR, United States, 2015.
[9] Olga Barinova, Victor Lempitsky, Elena Tretiak, and Pushmeet Kohli. Geometric image parsing in man-made environments. In Proceedings of the 11th European Conference on Computer Vision: Part II, ECCV'10, pages 57–70, Berlin, Heidelberg, 2010. Springer-Verlag.
[10] H. Bay, T. Tuytelaars, and L. V. Gool. SURF: Speeded up robust features. In Proc. European Conference on Computer Vision (ECCV), pages 404–417, 2006.
[11] Jean-Charles Bazin and Marc Pollefeys. 3-line ransac for orthogonal vanishing point detection. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ In- ternational Conference on, pages 4282–4287. IEEE, 2012. 183 [12] Patrick Beeson, Joseph Modayil, and Benjamin Kuipers. Factoring the mapping problem: Mobile robotmap-buildingin thehybridspatial semantic hierarchy. The International Journal of Robotics Research (IJRR), 29:428–459, April 2010. [13] R. Benenson, M. Mathias, R. Timofte, and L. Van Gool. Pedestrian detection at 100 frames per second. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 290–2910, Providence, RI, USA, 2012. IEEE Computer Society. [14] M.R. Blas, M. Agrawal, A. Sundaresan, and K. Konolige. Fast color/texture seg- mentationforoutdoorrobots. InIntelligentRobots andSystems, 2008. IROS2008. IEEE/RSJ International Conference on, pages 4078–4085, Sept 2008. [15] J. Borenstein and Y. Koren. The vector field histogram- fast obstacle avoidance for mobile robots. IEEE Journal of Robotics and Automation, 7(3):278–288, June 1991. [16] G Bradski. Open source computer vision library, 2001. [17] R. A. Brooks. A robust layered control system for a mobile robot. IEEE Trans. Robotics and Automation, 2(1):14–23, 1986. [18] W. Callister. Materials Science and Engineering - An Introduction. John Wiley & Sons, INC, 2003. [19] Carnegie Mellon University Robotics Institute. LAGR Robot Platform. http:// www.rec.ri.cmu.edu/projects/lagr/description/index.htm, July 2009. Ac- cessed: July 15, 2009. [20] C.-K.Chang,C.Siagian, andL.Itti. Mobilerobotvisionnavigation &localization using gist and saliency. In Proc. IEEE/RSJ International Conference on Intelli- gent Robots and Systems (IROS), pages 4147–4154, Oct 2010. Both first authors contributed equally. [21] C.-K. Chang, C. Siagian, and L. Itti. Mobile robot monocular vision navigation based on road region and boundary estimation. In Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages1043–1050, Oct2012. [22] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv preprint arXiv:1412.7062, 2014. [23] Z. Chen and S. Birchfield. Quantitative vision-based mobile robot navigation. In Proc. IEEE International Conference on Robotics and Automation (ICRA), pages 2686 – 2692, May 2006. [24] Zhichao Chen and S.T. Birchfield. Qualitative vision-based path following. Robotics, IEEE Transactions on, 25(3):749–754, June 2009. 184 [25] Kuo-Yu Chiu and Sheng-Fuu Lin. Lane detection using color-based segmentation. InIntelligent Vehicles Symposium, 2005. Proceedings. IEEE,pages706–711. IEEE, 2005. [26] D. Chung, R. Hirata, T. N. Mundhenk, J. Ng, R. J. Peters, E. Pichon, A. Tsui, T. Ventrice, D. Walther, P. Williams, and L. Itti. A new robotics platform for neuromorphic vision: Beobots. In Lecture Notes in Computer Science, volume 2525, pages 558–566, Nov 2002. [27] COM Express Extension. COM Express Extension - Consortium. http://www. comexpress-extension.org/specs/specs.php, October 2009. Accessed: Octo- ber 15, 2009. [28] D. Coombs, M. Herman, T.Hong, andM. Nashman. Real-time obstacle avoidance using central flow divergence and peripheral flow. In Proc. International Confer- ence on Computer Vision (ICCV), pages 276–283, june 1995. [29] M.CumminsandP.Newman. Fab-map: Probabilisticlocalization andmappingin the space of appearance. 
The International Journal of Robotics Research (IJRR), 27(6):647–665, 2008. [30] J. Cunha, E. Pedrosa, C. Cruz, A. Neves, and N.Lau. Using a depth camera for indoor robot localization and navigation. In RGB-D Robotics: Science and Systems (RSS) Workshop, Los Angeles, California, 2011. [31] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical imagedatabase. In Computer Vision and Pattern Recog- nition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009. [32] Patrick Denis, James H. Elder, and Francisco J. Estrada. Efficient edge-based methods for estimating manhattan frames in urban imagery. Computer Vision – ECCV 2008: 10th European Conference on Computer Vision, Marseille, France, October 12-18, 2008, Proceedings, Part II, pages 197–210, 2008. [33] Dimension Engineering LLC. Sabertooth 2X25 regenerative dual motor driver. http://www.dimensionengineering.com/Sabertooth2X25.htm, July 2009. Ac- cessed: July 15, 2009. [34] Ethan Eade and Tom Drummond. Unified loop closing and recovery for real time monocular slam. In Proc. British Machine Vision Conference (BMVC), pages 6.1 – 6.10, September 2008. [35] ETX Industrial Group. ETX Industrial Group - Consortium. http://www. etx-ig.org/consortium/consortium.php,July 2009. Accessed: July 15, 2009. [36] Evolution Robotics, Inc. Evolution Robotics development platform and OEM solutions for robot software navigation technology, object recognition. http:// www.evolution.com, July 2009. Accessed: July 15, 2009. 185 [37] E. Fazl-Ersi and J. Tsotsos. Histogram of oriented uniform patterns for robust place recognition and categorization. The International Journal of Robotics Re- search (IJRR), 31(2):468–483, feb 2012. [38] P. Felzenszwalb and D. Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, 59:167–181, 2004. [39] B. Finio, B. Eum, C. Oland, and R. J. Wood. Asymmetric flapping for a robotic fly using a hybrid power control actuator. In IROS, St. Louis, Missouri, USA, October 2009. [40] P. Fiorini and Z. Shiller. Motion planning in dynamic environments using velocity obstacles. The International Journal of Robotics Research (IJRR), 17(7):760–772, 1998. [41] D. Fox, W. Burgard, H. Kruppa, and S. Thrun. A probabilistic approach to collaborative multi-robot localization. Autonomous Robots, 8(3):325–344, June 2000. [42] D. Fox, W. Burgard, and S. Thrun. The dynamic window approach to collision avoidance. IEEE Robotics & Automation Magazine, 4(1):23 – 33, 1997. [43] Dieter Fox, Wolfram Burgard, Frank Dellaert, and Sebastian Thrun. Monte carlo localization: Efficient position estimation for mobile robots. In Proc. of Sixteenth National Conference on Artificial Intelligence (AAAI’99)., July 1999. [44] U Franke, D Gavrila, S Goerzig, F Lindner, F Paetzold, and C Woehler. Au- tonomous driving goes downtown. IEEE Intelligent Systems, pages 32–40, 1998. [45] S. Frintrop, P. Jensfelt, and H. Christensen. Attention landmark selection for visual slam. In IROS, Beijing, October 2006. [46] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hi- erarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014. [47] C.GlennieandD.D.Lichti. Staticcalibrationandanalysisofthevelodynehdl-64e s2 for high accuracy mobile scanning. Remote Sensing, 2(6):1610–1624, 2010. [48] Chunzhao Guo, S. Mita, and D. McAllester. 
Stereovision-based road boundary detection for intelligent vehicles in challenging scenarios. In Intelligent Robots and Systems, 2009. IROS 2009. IEEE/RSJ International Conference on, pages 1723– 1728, Oct 2009. [49] R. He, S. Prentice, and N. Roy. Planning in information space for a quadrotor helicopter in gps-denied environments. In ICRA, Pasadena, California, USA, May 2008. 186 [50] YinghuaHe,HongWang, andBoZhang. Color-basedroaddetectioninurbantraf- fic scenes. Intelligent Transportation Systems, IEEE Transactions on, 5(4):309– 318, Dec 2004. [51] YinghuaHe,HongWang, andBoZhang. Color-basedroaddetectioninurbantraf- fic scenes. Intelligent Transportation Systems, IEEE Transactions on, 5(4):309– 318, 2004. [52] G. Heitz, S. Gould, A. Saxena, and D. Koller. Cascaded classification models: Combining models for holistic scene understanding. In Advances in Neural Infor- mation Processing Systems (NIPS 2008), 2008. [53] P.Henry, C.Vollmer, B.Ferris, andD.Fox. Learningtonavigate throughcrowded environments. In Proc. IEEE International Conference on Robotics and Automa- tion (ICRA), pages 981–986, 2010. [54] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6):82–97, 2012. [55] Derek Hoiem, Andrew N Stein, Alexei A Efros, and Martial Hebert. Recovering occlusion boundaries from a single image. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8. IEEE, 2007. [56] Hokuyo Automatic CO. LTD. Photo sensor— PRODUCTS. http://www. hokuyo-aut.jp/02sensor/index.html#scanner, July 2009. Accessed: July 15, 2009. [57] S. Hrabar and G. Sukhatme. Vision-based navigation through urban canyons. Journal of Field Robotics, 26(5):431 – 452, 2009. [58] Z. Hu, H. Wang, L. Zhang, and H. Yu. Laser based sensor boundary recognition of mobile robot. In International Conference on Networking, Sensing and Control, Okayama, Japan, March 2009. [59] Sergi. B. i Badia, Pawel Pyk, and Paul F. M. J. Verschure. International journal of robotics research. A Fly-Locust Based Neuronal Control System Applied to an Unmanned Aerial Vehicle: The Invertebrate Neuronal Principles for Course Stabilization, Altitude Control and Collision Avoidance, 26(7):759 772, July 2007. [60] iRobot Corporation. PackBot. http://www.irobot.com/sp.cfm?pageid=171, July 2009. Accessed: July 15, 2009. [61] iRobot Corporation. Roomba Vacuum Cleaning Robot. http://store.irobot. com/category/index.jsp?categoryId=3334619&cp=2804605, July 2009. Ac- cessed: July 15, 2009. 187 [62] iRobot Corporation. SeaGlider. http://www.irobot.com/sp.cfm?pageid=393, January 2010. Accessed: January 15, 2010. [63] L. Itti. Models of Bottom-Up and Top-Down Visual Attention. bu;td;mod;psy;cv, Pasadena, California, Jan 2000. [64] L. Itti. iLab Neuromorphic Vision C++ Toolkit (iNVT). http://ilab.usc.edu/ toolkit/, December 2012. Accessed: December 15, 2012. [65] L. Itti and P. F. Baldi. Bayesian surprise attracts human attention. Vision Re- search,49(10):1295–1306, May2009. Topcitedarticle2008-2010awardfromVision Research. [66] L. Itti and C. Koch. Feature combination strategies for saliency-based visual at- tention systems. Journal of Electronic Imaging, 10(1):161–169, Jan 2001. [67] L.Itti,C.Koch,andE.Niebur. Amodelofsaliency-basedvisualattentionforrapid scene analysis. 
IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254–1259, Nov 1998. [68] Ituner Networks Corp. PicoPSU-80-WI-32V Power Supply. http://www. mini-box.com/PicoPSU-80-WI-32V?sc=8&category=981, July 2009. Accessed: July 15, 2009. [69] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Suk- thankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014. [70] Yoon Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014. [71] P. H. King. A Low Cost Localization Solution Using a Kalman Filter for Data Fu- sion. Master’s thesis, Virginia Polytechnic Institute and State University, Blacks- burg, VA, USA, 2008. [72] KlaasKlasing,GeorgiosLidoris,AndreaBauer,FlorianRohrmuller,DirkWollherr, and Martin Buss. The autonomous city explorer: Towards semantic navigation in urban environments. In Proc. International Workshop on Cognition For Technical Systems (COTESYS), 2008. [73] S.KoenigandM.Likhachev. D*lite. InAAAIConference of ArtificialIntelligence (AAAI), pages 476–483, 2002. [74] S. Koenig, M. Likhachev, Y. Liu, and D. Furcy. Incremental heuristic search in artificial intelligence. Artificial Intelligence Magazine, 25(2):99 – 112, 2004. [75] H. Kong, J.-Y. Audibert, and J. Ponce. General road detection from a single image. IEEE Transactions on Image Processing, 19(8):2211 – 2220, August 2010. 188 [76] Hui Kong, Hatice Cinar Akakin, and Sanjay E Sarma. A generalized laplacian of gaussian filter for blob detection and its applications. Cybernetics, IEEE Trans- actions on, 43(6):1719–1733, 2013. [77] Kontron. Design Guide for ETXexpress Carrier Boards, May 2007. [78] Kontron. ETXexpress-MC. http://us.kontron.com/products/ computeronmodules/com+express/etxexpress/etxexpressmc.html, July 2009. Accessed: July 15, 2009. [79] J. Kramer and M. Scheutz. Development environments for autonomous mobile robots: A survey. Autonomous Robots, 22(2):101–132, 2002. [80] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information pro- cessing systems, pages 1097–1105, 2012. [81] J. J. Kuffner and S. M. LaValle. Rrt-connect: An efficient approach to single- query path planning. In Proc. IEEE International Conference on Robotics and Automation (ICRA), pages 995–1001, April 2000. [82] T. Kuhnl, F. Kummert, and J. Fritsch. Monocular road segmentation using slow feature analysis. In IEEE Intelligent Vehicles Symposium (IV), pages 800 – 806, june 2011. [83] Benjamin Kuipers, Joseph Modayil, Patrick Beeson, Matt Macmahon, and Francesco Savelli. Local metrical and global topological maps in the hybrid spa- tial semantic hierarchy. In Proc. IEEE International Conference on Robotics and Automation (ICRA), pages 4845–4851, 2004. [84] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matchingforrecognizingnaturalscenecategories. InProc. IEEEComputerSociety Conference on Computer Vision and Pattern Recognition (CVPR), pages 2169– 2178, Washington, DC, USA, 2006. IEEE Computer Society. [85] Manh Cuong Le, Son Lam Phung, and Abdesselam Bouzerdoum. Lane detection in unstructured environments for autonomous navigation systems. In Computer Vision–ACCV 2014, pages 414–429. Springer, 2014. [86] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015. 
[87] Bo Li, Kun Peng, Xianghua Ying, and Hongbin Zha. Vanishing point detection using cascaded 1d hough transform from single images. Pattern Recogn. Lett., 33(1):1–8, January 2012. [88] Georgios Lidoris, Florian Rohrmullera, Dirk Wollherr, and Martin Buss. he au- tonomous city explorer (ace) project mobile robot navigation in highly populated urban environments. Proc. IEEE International Conference on Robotics and Au- tomation (ICRA), pages 1416–1422, May 2009. 189 [89] JonathanLong,EvanShelhamer,andTrevorDarrell. Fullyconvolutionalnetworks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015. [90] D.G.Lowe. Distinctiveimagefeaturesfromscale-invariantkeypoints. Intl. Journal of Computer Vision, 60(2):91–110, 2004. [91] M. Lutzeler and E.D. Dickmanns. Ems-vision: recognition of intersections on unmarked road networks. In Intelligent Vehicles Symposium, 2000. IV 2000. Pro- ceedings of the IEEE, pages 302–307, 2000. [92] Will Maddern, Michael Milford, and Gordon Wyeth. Cat-slam: probabilistic lo- calisation and mapping using a continuous appearance-based trajectory. The In- ternational Journal of Robotics Research (IJRR), 31:429–451, April 2012. [93] P. Maes and R. A. Brooks. Learning to coordinate behaviors. In AAAI, pages 796 – 802, 1990. [94] Major’s Mobisist. Major’s Mobisist :: Power Wheelchairs :: Liberty 312. http: //www.movingwithdignity.com/product.php?productid=16133,July2009. Ac- cessed: July 15, 2009. [95] E. Marder-Eppstein, E. Berger, T. Foote, B. Gerkey, and K. Konolige. The office marathon: Robust navigation in an indoor office environment. In Proc. IEEE International Conference on Robotics and Automation (ICRA), pages 300 – 307, May 2010. [96] E. Marder-Eppstein, E. Berger, T. Foote, B. Gerkey, and K. Konolige. Kurt konolige, eitan marder-eppstein, bhaskara marthi. In Proc. IEEE International Conference on Robotics and Automation (ICRA), pages 3041 – 3047, May 2011. [97] Jrme Maye, Ralf Kaestner, and Roland Siegwart. Curb detection for a pedestrian robotinurbanenvironments. InProc. IEEE International Conference on Robotics and Automation (ICRA), pages 367–373, 2012. [98] J. Michels, A. Saxena, and A. Y. Ng. High speed obstacle avoidance using monoc- ular vision and reinforcement learning. In International Conference on Machine Learning, 2005. [99] MicroStrain, Inc. 3DM-GX2:: MicroStrain, AHRS Orientation Sensor. http: //www.microstrain.com/3dm-gx2.aspx,July 2009. Accessed: July 15, 2009. [100] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. IEEE Trans. Pattern Anal. Mach. Intell., 27(10):1615–1630, 2005. [101] Ondrej Miksik. Rapid vanishing point estimation for general road detection. In Proc. IEEE International Conference on Robotics and Automation (ICRA), May 2012. 190 [102] Michael MilfordandGordonWyeth. Mappingasuburbwithasinglecamerausing a biologically inspired slam system. IEEE Transactions on Robotics, 24(5):1038– 1053, 2008. [103] Michael Milford and Gordon Wyeth. Persistent navigation and mapping using a biologically inspired slam system. The International Journal of Robotics Research (IJRR), 29:1131–1153, August 2010. [104] J. Minguez and L. Montano. Nearness diagram (nd) navigation: Collision avoid- ance in troublesome scenario. IEEE Transactions on Robotics and Automation, 20(1):45 – 59, 2004. [105] MobileRobots, Inc. Seekur Unmanned Ground Vehicle. http://www. mobilerobots.com/Seekur.html,July 2009. Accessed: July 15, 2009. [106] MobileRobots Inc. 
The High Performance All-Terrain Robot. http://www. activrobots.com/ROBOTS/p2at.html,July 2009. Accessed: July 15, 2009. [107] Peyman Moghadam, Janusz A Starzyk, and W Sardha Wijesoma. Fast vanishing- point detection in unstructured environments. Image Processing, IEEE Transac- tions on, 21(1):425–430, 2012. [108] P. Moghamadam, J. A. Starzyk, and W. S. Wijesoma. Fast vanishing-point de- tection in unstructured environments. IEEE Transactions on Image Processing, 21(1):425 – 430, January 2012. [109] C.Montella, T.Perkins, J.Spletzer, andM.Sands. Tothebookstore! autonomous wheelchair navigation inan urbanenvironment. In Proc. International Conference on Field and Service Robotics (FSR), Matsushima, Japan, July 2012. [110] M. Montemerlo, J. Becker, S. Bhat, H. Dahlkamp, D. Dolgov, S. Ettinger, D. Haehnel, T. Hilden, G. Hoffmann, B. Huhnke, D. Johnston, S. Klumpp, D. Langer, , A. Levandowski, J. Levinson, J. Marcil, D. Orenstein, J. Paefgen, I. Penny, A. Petrovskaya, M. Pflueger, G. Stanek, D. Stavens, A. Vogt, and S. Thrun. Junior: The stanford entry in the urban challenge. Journal Field Robotics, 25(9):569–597, September 2008. [111] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit. FastSLAM 2.0: An im- proved particle filtering algorithm for simultaneous localization and mapping that provably converges. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI), Acapulco, Mexico, 2003. IJCAI. [112] Michael Montemerlo, Sebastian Thrun, Hendrik Dahlkamp, and David Stavens. Winning the darpa grand challenge with an ai robot. In In Proceedings of the AAAI National Conference on Artificial Intelligence, pages 17–20, 2006. [113] H. Moravec and A. Elfes. High resolution maps from wide angle sonar. In Proc. IEEE International Conference on Robotics and Automation (ICRA), volume 2, pages 116–121, March 1985. 191 [114] A.C. Murillo, J.J. Guerrero, and C. Sagues. Surf features for efficient robot local- izationwithomnidirectionalimages. InProc. International Conference on Robotics and Automation (ICRA), pages 3901–3907, April 2007. [115] T. Nakamura and M. Asada. Motion sketch: Acquisition of visual motion guided behaviors. In International Joint Conference on Artificial Intelligence (IJCAI), pages 126 – 132, 1995. [116] NASA. Centennialchallenges. http://www.nasa.gov/offices/oct/stp/centennial challenges/index.html, January 2013. Accessed: January 15, 2013. [117] National Institute of Standards and Technology. Why are there no volume li-ion battery manufacturers in the united states? http://http://www.atp.nist.gov/ eao/wp05-01/append-4.htm,July 2009. Accessed: July 15, 2009. [118] Paul Newman, Gabe Sibley, Mike Smith, Mark Cummins, Alastair Harrison, Chris Mei, Ingmar Posner, Robbie Shade, Derik Schroeter, Liz Murphy, Winston Churchill, Dave Cole, and Ian Reid. Navigating, recognizing and describing ur- ban spaces with vision and lasers. The International Journal of Robotics Research (IJRR), 28(11 - 12):1406–1433, November/December 2009. [119] M. Nieto and L. Salgado. Real-time vanishing point estimation in road sequences using adaptive steerable filter banks. Advanced Concepts for Intelligence Vision Systems Lecture Notes in Computer Science, pages 840 – 848, 2007. [120] David Nist´ er, Oleg Naroditsky, and James Bergen. Visual odometry for ground vehicle applications. Journal of Field Robotics, 23(1):3 – 20, 2006. [121] N. Ouerhani, A. Bur, and H. Hugli. Visual attention-based robot self-localization. In Proc. 
European Conference on Mobile Robotics (ECMR), pages 8–13, Ancona, Italy, September 2005. [122] PNISensorCorporation. PNISensorCorporation-SensorsModules-AllProducts -MicroMag3: 3-AxisMagnetometer. http://www.pnicorp.com/products/all/ micromag-3, July 2009. Accessed: July 15, 2009. [123] Politecnico di Milano. RTAI - the RealTime Application Interface for Linux from DIAPM. https://www.rtai.org/,July 2010. Accessed: July 15, 2010. [124] D. Pomerleau. Knowledge-based training of artificial neural networks for au- tonomous robot driving. Robot Learning, pages 19 – 43, 1993. [125] Vivek Pradeep, GerardMedioni, andJames Weiland. Robotvisionforthevisually impaired. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 15 – 22. IEEE Computer Society, 2010. [126] A. Pronobis, B. Caputo, P. Jensfelt, and H.I. Christensen. A discriminative ap- proach to robust visual place recognition. In Proc. IEEE International Conference on Intelligent Robots and Systems (IROS), pages 3829–3836, 2006. 192 [127] Andrzej Pronobis, Barbara Caputo, Patric Jensfelt, and Henrik I. Christensen. A discriminative approach to robust visual place recognition. In Proceedings of the 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS’06), Beijing, China, oct 2006. [128] Andrzej Pronobis, Oscar M. Mozos, Barbara Caputo, and Patric Jensfelt. Multi- modal semantic place classification. The International Journal of Robotics Re- search (IJRR), Special Issue on Robotic Vision, 29(2-3):298–320, feb 2010. [129] QNX Software Systems. QNX realtime RTOS - Middleware, development tools, realtime operating system software and services for superior embedded design. http://www.qnx.com/, July 2010. Accessed: July 15, 2010. [130] Qseven Standard. Qseven: About Qseven. http://www.qseven-standard.org/, October 2009. Accessed: October 15, 2009. [131] M. Quigley and A. Ng. Stair: Hardware and software architecture. In AAAI 2007 Robotics Workshop, Vancouver, British Columbia, Canada, 2007. [132] S. Quinlan and O. Khatib. Elastic bands: Connecting path planning and control. In Proc. IEEE International Conference on Robotics and Automation (ICRA), pages 802–807, 1993. [133] A. Ramisa, A. Tapus, R. Lopez de Mantaras, and R. Toledo. Mobile robot local- ization using panoramic vision and combination of local feature region detectors. In ICRA, pages 538–543, Pasadena, CA, May 2008. [134] A. Ranganathan and F. Dellaert. Online probabilistic topological mapping. In The International Journal of Robotics Research (IJRR), volume 30, pages 755 – 771, May 2011. [135] C. Rasmussen, Y. Lu, and M. Kocamaz. Appearance contrast for fast, robust trail-following. In Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3505 – 3512, October 2009. [136] Eric Royer, Maxime Lhuillier, Michel Dhome, and Jean-Marc Lavest. Monocular vision for mobile robot localization and autonomous navigation. International Journal of Computer Vision, 74(3):237 – 260, September 2007. [137] M. A. Salichs, R. Barber, A. M. Khamis, M. Malfaz, J. F. Gorostiza, R. Pacheco, R. Rivas, A. Corrales, E. Delgado, and D. Garca. Maggie: A robotic platform for human-robotsocial interaction. InInternational Conference on Robotics, Automa- tion, and Mechatronics (RAM), Bangkok Thailand, 2006. [138] P. Santana, N. Alves, L. Correia, and J. Barata. Swarm-based visual saliency for trail detection. In Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 759 –765, oct. 
2010. 193 [139] J. Santos-Victor, G. Sandini, F. Curotto, and S. Garibaldi. Divergent stereo in autonomous navigation: Learning from bees. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR), pages 434 –439, 1993. [140] Ashutosh Saxena, Min Sun, and Andrew Y Ng. Make3d: Depth perception from a single still image. In AAAI, pages 1571–1576, 2008. [141] Grant Schindler, Matthew Brown, and Richard Szeliski. City-scale location recog- nition. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–7, Los Alamitos, CA, USA, 2007. IEEE Computer Society. [142] Frank E. Schneider and Dennis Wildermuth. Results of the european land robot trial and their usability for benchmarking outdoor robot systems. In Toward Au- tonomous Robotic Systems, volume 6856 of Lecture Notes in Computer Science, pages 408–409. Springer, 2011. [143] S. Se, D. G. Lowe, and J. J. Little. Vision-based global localization and mapping for mobile robots. IEEE Transactions on Robotics, 21(3):364–375, 2005. [144] Sinisa Segvic, Anthony Remazeilles, Albert Diosi, and Francois Chaumette. A mapping and localization framework for scalable appearance-based navigation. Computer Vision and Image Understanding, 113(2):172 – 187, February 2009. [145] Segway Inc. Segway - Business - Products & Solutions - Robotic Mobil- ity Platform (RMP). http://www.segway.com/business/products-solutions/ robotic-mobility-platform.php,July 2009. Accessed: July 15, 2009. [146] SensComp,Inc. MiniSensors. http://www.senscomp.com/minis.htm,July2009. Accessed: July 15, 2009. [147] Pierre Sermanet, David Eigen, Xiang Zhang, Micha¨ el Mathieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013. [148] J. Shi and C. Tomasi. Good features to track. In IEEE CVPR, pages 593–600, 1994. [149] C.Siagian,C.K.Chang,andL.Itti. Beobot2.0. http://ilab.usc.edu/beobot2, December 2013. Accessed: December 15, 2012. [150] C.Siagian, C.-K.Chang,R.Voorhies, andL.Itti. Beobot2.0: Clusterarchitecture for mobile robotics. Journal of Field Robotics, 28(2):278–302, March/April 2011. [151] C. Siagian, C.K. Chang, and L. Itti. Mobile robot navigation system in outdoor pedestrian environment using vision-based road recognition. In Proc. IEEE Inter- national Conference on Robotics and Automation (ICRA), May 2013. Both first authors contributed equally. 194 [152] C.SiagianandL.Itti. Rapidbiologically-inspiredsceneclassificationusingfeatures sharedwithvisualattention. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(2):300–312, Feb 2007. [153] C. Siagian and L. Itti. Comparison of gist models in rapid scene categorization tasks. In Proc. Vision Science Society Annual Meeting (VSS08), May 2008. [154] C. Siagian and L. Itti. Storing and recalling information for vision localization. In IEEE International Conference on Robotics and Automation (ICRA), Pasadena, California, pages 1848–1855, May 2008. [155] C. Siagian andL. Itti. Biologically inspiredmobile robotvision localization. IEEE Transactions on Robotics, 25(4):861–873, July 2009. [156] Reid Simmons. The curvature-velocity method for local obstacle avoidance. In Proc. IEEE International Conference on Robotics and Automation (ICRA), pages 3375–3382, 1996. [157] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Sys- tems, pages 568–576, 2014. 
Abstract
Unmanned ground vehicles (UGVs) are highly versatile carriers for transportation, surveillance, search, and rescue tasks. A service mobile robot that must travel through both indoor and outdoor environments faces challenges quite different from those of street vehicles. The urban pedestrian environment is typically GPS-denied, demanding a tightly integrated approach to perception, estimation, planning, and motion control. In this thesis, we present the design and implementation of Beobot 2.0, an autonomous mobile robot that operates in unconstrained urban environments. We developed a distributed architecture to satisfy the requirements of computationally intensive algorithms. Furthermore, we propose several biologically inspired visual recognition methods for indoor and outdoor navigation. We describe novel vision algorithms based on saliency, gist, image contours, and region segmentation to construct perception modules such as place recognition, landmark recognition, and road lane detection. To cope with the differing latencies and update frequencies of these perception modules during real-time navigation, we further investigate a hierarchical map representation that fuses the quick but limited state information while the slower, more discriminative data is still being processed. We validated our system on a ground vehicle that autonomously traversed multiple outdoor routes, each 400m or longer, several times on a university campus. The routes featured different road types, environmental hazards, moving pedestrians, and service vehicles. In total, the robot logged over 10km of successfully recorded experiments, driving within a median of 1.37m laterally of the center of the road and localizing within a median of 0.97m longitudinally of its true location along the route.
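The latency-fusion idea described above can be made concrete with a small sketch. The following is a minimal, hypothetical illustration (not the dissertation's actual implementation): a fast odometry stream keeps a pose estimate current, while a slow localization module, whose result refers to an image captured some time earlier, corrects the estimate retroactively by replaying the odometry recorded since the capture time. All class and method names (LatencyAwarePoseFuser, apply_odometry, apply_delayed_fix) are invented for illustration.

from collections import deque

class LatencyAwarePoseFuser:
    def __init__(self, history_len=200):
        self.pose = 0.0                           # 1-D pose along the route, in meters
        self.history = deque(maxlen=history_len)  # recent (timestamp, odometry delta) pairs

    def apply_odometry(self, timestamp, delta):
        # Fast path: integrate every odometry update as soon as it arrives.
        self.pose += delta
        self.history.append((timestamp, delta))

    def apply_delayed_fix(self, capture_time, measured_pose):
        # Slow path: a localization result computed from an image captured at
        # capture_time arrives only now; reset to it, then replay the odometry
        # recorded since that image was taken.
        self.pose = measured_pose
        for t, delta in self.history:
            if t > capture_time:
                self.pose += delta

# Usage: high-rate odometry keeps the estimate current, while a place-recognition
# fix that took half a second to compute still corrects the right moment in time.
fuser = LatencyAwarePoseFuser()
fuser.apply_odometry(0.01, 0.02)
fuser.apply_odometry(0.02, 0.02)
fuser.apply_delayed_fix(capture_time=0.015, measured_pose=1.00)
print(fuser.pose)  # 1.02 = delayed fix plus the odometry logged after the capture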