HYBRID METHODS FOR ROBUST IMAGE MATCHING AND ITS APPLICATION IN AUGMENTED REALITY

by

Wei Guan

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

May 2014

Copyright 2014 Wei Guan

Contents

Abstract

Chapter 1  Introduction

Chapter 2  Related Work

Chapter 3  Fast Matching with Grid Method
  3.1  Review of Image Matching with Local Features
  3.2  Image Partitioning and Grid Retrieval
  3.3  Matching with Hierarchical Grids
       3.3.1  Scale Detection by Retrieval
  3.4  Matching Propagation for Stable Pose
  3.5  Performance Evaluations and Discussions
  3.6  Summary

Chapter 4  Matching under Lighting Variations
  4.1  Review of Illumination-Robust Features
  4.2  Image Matching through Relighting
       4.2.1  Geometry Estimation and Relighting
       4.2.2  Relighting Results and Discussions
  4.3  Matching with Line Context Features
       4.3.1  Motivation
       4.3.2  Line Context Detector
       4.3.3  Line Context Descriptor
       4.3.4  Image Matching and Indexing
  4.4  Line Context Evaluation and Discussions
  4.5  Summary

Chapter 5  Image Alignment with Point Clouds
  5.1  Review of Image Registration with Point Clouds
  5.2  Initial Pose Estimation with Point-based Features
       5.2.1  Synthetic Views of LiDAR Data
       5.2.2  Generation of 3D Feature Point Cloud
  5.3  Iterative Pose Refinement
       5.3.1  Refinement by More Point Correspondences
       5.3.2  Geometric Structures and Alignment
       5.3.3  Refinement by Minimizing Error Function
  5.4  Results and Discussions
  5.5  Summary

Chapter 6  6-DOF Tracking with Point Clouds
  6.1  Review of Tracking with Point Clouds
  6.2  Preprocessing of Point Clouds
       6.2.1  Synthetic Views and Feature Point Cloud
       6.2.2  Geometric Modeling with Graphical Models
  6.3  Tracking with 3D Planes
       6.3.1  Feature Detection on Planes
       6.3.2  Feature Tracking
  6.4  Pose Recovery and Smoothing
       6.4.1  Pose Recovery Scheme
       6.4.2  Pose Smoothing Process
  6.5  Performance and Discussions
  6.6  Summary

Chapter 7  Building an Outdoor Augmented Reality System
  7.1  Overview of Outdoor Augmented Reality System
  7.2  The Similarities and Differences with SLAM
  7.3  Data Collection
       7.3.1  Collection of Geotagged Images
       7.3.2  Collection of Range Data
  7.4  Data Processing
       7.4.1  Overlapped Clustering of Images
       7.4.2  Range Data Modeling
  7.5  Outdoor Tracking with Robust Pose Recovery
       7.5.1  Situations of Pose Loss and Recovery
       7.5.2  Continuous Tracking for Long Distance
  7.6  System Performance and Augmented Reality
  7.7  Miscellaneous I: GPS-aided Tracking with Handoff Process
  7.8  Miscellaneous II: Tracking-based Virtual World Navigation
  7.9  Summary

Chapter 8  Conclusion and Future Work
  8.1  Conclusion
  8.2  Future Work

Reference List
Abstract

This thesis presents new matching algorithms that work robustly in challenging situations. Image matching is a fundamental and challenging problem in the vision community due to varied sensing techniques and imaging conditions. While it is almost impossible to find a general method that is optimized for all uses, we focus on those matching problems that are related to augmented reality (AR). Many AR applications have been developed on portable devices, but most are limited to indoor environments within a small workspace because their matching algorithms are not robust outside controlled conditions.

The first part of the thesis describes 2D to 2D image matching problems. Existing robust features are not suited for AR applications due to their computational cost. A fast matching scheme is applied to such features to increase matching speed by up to 10 times without sacrificing their robustness. Lighting variations can often cause matching failures in outdoor environments. This is a challenging problem because any change in illumination causes unpredictable changes in image intensities. Some features have been specially designed to be lighting invariant. While these features handle linear or monotonic changes, they are not robust to more complex changes. This thesis presents a line-based feature that is robust to complex and large illumination variations. Both the feature detector and the descriptor are described in detail.

The second part of the thesis describes image sequence matching with 3D point clouds. Feature-based matching becomes more challenging due to the different structures of 2D and 3D data. The features extracted from one type of data are usually not repeatable in the other. An ICP-like method that iteratively aligns an image with a 3D point cloud is presented. While this method can be used to calculate the pose for a single frame, it is not efficient to apply it to all frames in the sequence. Once the first frame pose is obtained, the poses for subsequent frames can be tracked from 2D to 3D point correspondences. It is observed that not all points on LiDAR are suitable for tracking. A simple and efficient method is used to remove unstable LiDAR points and identify features on frames that are robust in the tracking process. With the above methods, the poses can be calculated more stably for the whole sequence.

With the provided solutions to the above challenging problems, we have applied our methods in an AR system. We describe each step in building up such a system, from data collection and preprocessing to pose calculation and tracking. The presented system is shown to be robust and promising for most AR-based applications.

Chapter 1  Introduction

Image matching is a fundamental problem in the vision community.
While it is a broad topic and there is no general method that is optimized for all uses, we focus on those matching problems that are related to augmented reality (AR). With the popularity of smart phones, many applications have been developed on various portable devices. However, most of these applications are limited to indoor environments within small workspaces because their matching algorithms are not robust under uncontrolled conditions. To overcome such limitations, the following challenges need to be handled:

• Efficient pose recovery: Camera tracking can be easily lost due to occlusions and motions. Fast and robust pose recovery schemes are necessary for any tracking system. Without efficient pose recovery, the user experience of an AR system will be largely affected.

• Robustness to lighting variations: Image matching often fails due to lighting changes in outdoor environments. A robust AR system should be able to conduct robust matching under various lighting conditions.

• Real-time performance: One requirement in AR is that the pose for each frame has to be calculated in real time, or at least at interactive speed. While many features are designed to be robust, these features are not fast enough to maintain the necessary speed.

• 2D to 3D matching: Laser range scans can be used as 3D references for large scale environments. However, the registration between 2D images and 3D point clouds is a challenging problem.

• Tracking in large scale areas: Frame to frame tracking usually contains small errors due to noise and intensity variations. While the errors are tolerable in small workspaces, such errors accumulate as the tracking distance increases. Error accumulation needs to be limited in the AR system.

In this thesis, various matching problems are discussed and new methods are presented. The first part of the thesis (Chapter 3 and Chapter 4) presents the 2D to 2D image matching problem. The second part of the thesis (Chapter 5 to Chapter 7) discusses image matching with 3D point clouds. The benefit of using point clouds is that we can obtain accurate 3D information of large scale areas with laser technologies in an efficient way. Based on the calculated camera pose, AR objects can be applied onto the image. Figure 1.1 shows an example of an AR application that returns the user location and building information.

Figure 1.1: An example of an AR system. (a) An image taken by a user. (b) The estimated user location is indicated by the red dot. (c) The image with augmented information. A 2D label and a 3D sign are placed in the image.

In the tracking process, the camera pose can be lost in many cases. Whenever the pose is lost, the system should recover the pose immediately. We present a speeded-up matching algorithm that can recover the pose in real time. Not only is the matching speed increased, but the robustness is also increased for some occlusion cases.

Figure 1.2: Matching two images under different lighting environments. (a) with SURF features (b) with Line Context features.

In outdoor environments, lighting variations often cause problems in matching. Matching images with different illuminations is challenging because the intensities of an image do not depend only on illumination but also on other information such as geometry. Any change in illumination will cause unpredictable changes in intensities, which make many features fail in the matching process. Figure 1.2-(a) shows the matching results with SURF features. In the figure, the lighting conditions for the two images are different because of sunshine effects. SURF can only return a few correspondences.
Some features have been proposed to handle linear or monotonic lighting changes, but they cannot handle more complex changes. In this thesis, a novel feature is presented to handle this problem. Both the feature detector and descriptor are designed to be robust to illumination variations. Figure 1.2-(b) shows the matching results of our new feature. The details of the feature are discussed in Chapter 4.

Figure 1.3: The colored range scans for part of the USC campus.

With the advances in laser technologies, it is more convenient to obtain the 3D information of large scale areas. Figure 1.3 shows an example of the scanned point cloud. The accuracy of the obtained depth information is as good as (if not higher than) that of point clouds reconstructed by state-of-the-art structure from stereo techniques. However, it is not straightforward to match 2D images with 3D point clouds. In the second part of this thesis, an image to point cloud alignment method is presented. An initial estimated pose is required as the input. The algorithm refines the pose iteratively and a more accurate pose can be obtained.

While the above method can be used to calculate the pose for each individual frame, it is not efficient to do so. A frame to frame tracking method can be used to maintain the correspondences with point clouds. A preprocessing step is conducted to extract those points that are reliable for the tracking purpose.

The remainder of this thesis is organized as follows: Chapter 2 summarizes related literature on matching and tracking problems. Chapter 3 presents a fast matching method that can recover the pose in real time. In Chapter 4, the lighting problem is discussed and a novel lighting-robust feature is presented. When 3D point cloud data is available, it can be used to calculate the camera pose. An iterative pose alignment method is presented in Chapter 5. In Chapter 6, feature tracking methods are applied to match a sequence of images with point clouds. In Chapter 7, the pipeline of an AR system is presented. Finally, the thesis is concluded in Chapter 8.

Chapter 2  Related Work

This chapter briefly reviews the related work in the scope of camera tracking systems. Previous research related to each specific topic is reviewed individually in Section 3.1 for speeded-up image matching, Section 4.1 for robustness to illumination variations, Section 5.1 for image alignment with point clouds, and Section 6.1 for camera tracking with point clouds or 3D models.

Depending on whether 3D information of the scene is available, research on camera tracking methods can be divided into two groups, i.e. SLAM-based and matching-based. A considerable amount of research has been conducted on vision-based SLAM. Some pioneering visual SLAM implementations are [24,25,29,35,48,52,70,75]. Due to drifting and lighting problems, most single camera SLAM implementations target indoor environments in a relatively small workspace.

Since 3D point clouds are easier to construct or collect nowadays, a full SLAM solution is usually not necessary for most camera tracking-based applications. Tracking with well established 3D data is considered a promising alternative. Many methods have been proposed for matching images with point clouds. These methods can be further categorized into keypoint-based matching [2,8,27,78,98] and structure-based matching [59,60,91,92]. For keypoint-based matching, feature points such as SIFT [61], SURF [5] and other features [11,18,20,47,56,58,85] are usually used. In structure-based matching, lines are utilized in many studies. Both types of matching have advantages and drawbacks.
For example, while keypoint-based matching is not robust to illumination variations, line-based matching is not scalable or fast enough.

In our tracking system, we employ both keypoint-based matching and structure-based matching schemes. The keypoints are used for pose initialization, while structural features are used for further pose refinement. For keypoint-based matching, the lighting issue is a challenging problem, especially in outdoor environments. In the past few years, some features have been specially designed to cope with such problems [30,40,41,95,103,106]. One limitation of these features is that they can only handle monotonic illumination changes. For more complex illumination changes, their robustness is not sufficient to recover the pose.

Besides lighting robustness, some features are specially designed to increase the matching speed [4,44,99]. While these methods can accelerate the matching process, feature robustness is usually sacrificed. These techniques work effectively for their intended applications, but they are not suitable or sufficient for more critical systems that require both robustness and real-time speed.

Though the focus of this research is not 3D modeling, we do have a modeling process in the data preprocessing step. Further studies and comprehensive surveys of urban modeling can be found in [77,80,81]. For our tracking system, we present a simple but rapid modeling method that serves our own purpose. Our modeling method is more suitable and convenient for robust tracking in urban outdoor areas.

Finally, this thesis is the first to present a complete pipeline of a camera tracking system for large scale outdoor environments. Many challenging problems have been studied and investigated. Theories and algorithms are developed based on the tracking system. The results in this thesis are competitive in terms of both tracking robustness and AR user experience.

Chapter 3  Fast Matching with Grid Method

One of the most challenging problems in both the vision and AR communities is to conduct image matching efficiently with sufficient robustness. While many robust features have been proposed in the past decade, these features are usually not computationally efficient enough for real-time applications. In the meantime, most fast features sacrifice robustness to meet real-time requirements. It seems difficult to improve either robustness or speed without sacrificing the other. In this chapter, we cope with this challenging problem.

SIFT [61] and SURF [5] are two of the most popular features that are robust to scale, rotation and viewpoint changes. Both features first detect keypoints and then calculate the descriptors. Of the two phases, detecting keypoints consumes much more time than forming descriptors. An important cause of the computational cost is that the detectors conduct scale space analysis and detect keypoints across all scales, which is quite time consuming. Moreover, feature matching also consumes a lot of time when there are a large number of features. The hierarchical grid partitioning method can significantly increase the matching speed while maintaining robustness. Besides SIFT and SURF, it can be applied to all types of point-based features.

We discuss the details of our grid method in the remainder of this chapter. We provide a review of existing features that try to increase matching speed in Section 3.1. We discuss image partitioning and grid retrieval in Section 3.2. In Sections 3.3 and 3.4, we show how the grids are used to speed up the matching process. We evaluate the performance of the method in Section 3.5. Finally, we summarize this chapter in the last section.
3.1  Review of Image Matching with Local Features

There are some existing methods [4,44,99] that try to match images with less computational cost in order to be applicable on portable devices. Wagner et al. [99] present a modified SIFT that is created from a fixed patch size of 15x15 pixels and forms a descriptor with only 36 dimensions. The modified feature is used for efficient natural feature tracking at interactive speed on current-generation phones. Henze et al. [44] combine the simplified SIFT with a scalable vocabulary tree to achieve interactive performance for object recognition. The simplified features consume less computational cost, which is necessary for mobile applications. Azad et al. [4] present a combination of the Harris corner detector and the SIFT descriptor, which computes features with high repeatability and good matching properties. By replacing the SIFT keypoint detector with an extended Harris corner detector, the algorithm can generate features in real time. Some other similar methods are [22,36,46,49,93].

Binary features have gained considerable attention since the BRIEF feature was proposed by [14]. The big advantage of BRIEF over SURF is its faster extraction time and lower memory requirement. A series of other binary feature descriptors have been developed based on BRIEF. ORB [87] adds rotation invariance to the binary descriptor. The rotation is estimated using the intensity centroid method [86]. BRISK [55] is also a binary feature descriptor. In contrast to BRIEF and ORB, the image sampling positions are no longer drawn randomly. Besides taking the rotation of a feature point into account, BRISK utilizes scale space theory to adapt the sampling pattern to the maximum in scale space. Thus, BRISK is rotation and scale invariant.

The techniques described above use either simplified versions of robust features or fast binary properties to achieve higher speed. While these techniques work effectively for their applications, they are not suitable for systems that require more robustness. In the following sections, we present our proposed matching scheme that satisfies these requirements.

3.2  Image Partitioning and Grid Retrieval

Figure 3.1: Motivation of image partitioning. (a) only parts of the query image can find their matches. (b) the whole image matches part of a database image. (c) the matched image in the database.

In most cases, it is part of the query image rather than the whole image that has corresponding matches in the database (as in Fig. 3.1-(a) and (c)). It is possible that querying with the whole image using the bag of features method would also give us the correct results. However, the unmatched parts will lower the ranking or similarity score and reduce retrieval accuracy.

To better refine the retrieval results, we divide the query image into 4×4 or 8×8 grids. Note that the image is partitioned in a hierarchical way. As shown in Fig. 3.2, there are sixteen 1×1 grids, nine 2×2 grids, four 3×3 grids and one 4×4 grid (the image itself). Not all of these hierarchical grids are used to query their corresponding matches in the database. For example, in Fig. 3.2-(a), the grid with the red rectangle is used as the query grid. We discuss the detailed process in the next section. When 8×8 partitioning is used, as shown in Figure 3.3, we need an additional level of querying.

Figure 3.2: An image is hierarchically partitioned into 4 by 4 grids for both the query image and the database image. Four different grid sizes are used: 1×1, 2×2, 3×3 and 4×4 (the whole image).

Figure 3.3: An example of 8 by 8 partitioning. Four different grid sizes are used: 1×1, 2×2, 4×4 and 8×8 (the whole image).
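The hierarchical grids above are fixed, axis-aligned windows over a 4×4 cell partition, so enumerating them and counting the features that fall into each one is simple bookkeeping. The sketch below is a minimal illustration of that step, assuming keypoints are given as (x, y) pixel coordinates; the function name and data layout are ours, not part of the thesis.

    import numpy as np

    def grid_feature_counts(keypoints, width, height, cells=4):
        """Count keypoints in every hierarchical grid of a cells x cells partition.

        keypoints: array of shape (N, 2) with (x, y) pixel coordinates.
        Returns a dict mapping (col, row, size) -> keypoint count, where (col, row)
        is the top-left cell of the grid and size is its width in cells.
        """
        kp = np.asarray(keypoints, dtype=float)
        # Per-cell histogram of keypoints.
        cx = np.clip((kp[:, 0] * cells / width).astype(int), 0, cells - 1)
        cy = np.clip((kp[:, 1] * cells / height).astype(int), 0, cells - 1)
        cell_hist = np.zeros((cells, cells), dtype=int)
        np.add.at(cell_hist, (cy, cx), 1)

        # 2D prefix sums let any rectangular grid be counted in O(1).
        padded = np.zeros((cells + 1, cells + 1), dtype=int)
        padded[1:, 1:] = cell_hist.cumsum(axis=0).cumsum(axis=1)

        counts = {}
        for size in range(1, cells + 1):            # 1x1, 2x2, 3x3, 4x4 grids
            for row in range(cells - size + 1):
                for col in range(cells - size + 1):
                    r2, c2 = row + size, col + size
                    counts[(col, row, size)] = (padded[r2, c2] - padded[row, c2]
                                                - padded[r2, col] + padded[row, col])
        return counts

For a 4×4 partition this enumeration yields the 16 + 9 + 4 + 1 = 30 grids discussed below; only grids whose count exceeds N0 would be kept in the database.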
On the other hand, it is also true that in many cases the query image is only part of a whole image in the database, as in the examples in Fig. 3.1-(b) and (c). So it is also necessary to apply hierarchical partitioning to the images in the retrieval database. There are in total 16+9+4+1=30 hierarchical grids for one image, and any grid with a number of features greater than N0 is added to the database. This is to make sure that every image (or grid) in the retrieval database has a sufficient number of features to be identified with less ambiguity. In our experiments, we set N0 = 100. As the image scale level goes up in its pyramid, the number of features decreases geometrically. Therefore, for most images, there are only 40 to 50 hierarchical grids across all 9 scales.

Besides increasing the retrieval accuracy, there are two more benefits of hierarchical partitioning, fast matching and occlusion handling, which will be discussed later.

3.3  Matching with Hierarchical Grids

As mentioned previously, in contrast to the construction of the image database, not all grids of the query image are used in the querying process. This is because many grids have overlapping parts, and it is inefficient to query the database with duplicate parts many times. Therefore, we design our query strategy in the following three steps.

Figure 3.4: The top 5 grids are selected according to the number of features. The selected grids are shown in red rectangles.

Step one. We select the top three 1×1 grids (top 5 grids for 8×8 partitioning, as in Fig. 3.4) with the largest number of features. The number must be greater than N0 for a grid to be a valid candidate. In decreasing order, we query each grid (if valid) to retrieve the most promising hierarchical grid with the highest rank score in the database. Pairwise matching is conducted to validate the query results. A query is considered successful if the number of inlier matches is greater than N0'. If all three queries fail, go to the next step.

Step two. The same process as the first step, but one 2×2 grid and one 3×3 grid are used to query the database. We select the grids that have the largest number of features among their sizes. If all queries fail, go to step three.

Step three. Use the 4×4 grid (the whole image) to query.

Most of the time, step three is not reached. However, while querying with hierarchical grids provides better performance in most cases, with step three the query results are, in the worst case, as good as querying without partitions. Additionally, to efficiently compute the number of features for each hierarchical grid, a 4×4 integral image is used.

Table 3.1: The accuracy rate up to each step. In most cases, step 1 returns the correct query results.

    Up to Step 1:   G^(1)_1x1: 64.4%   G^(2)_1x1: 83.4%   G^(3)_1x1: 89.8%
    Up to Step 2:   G_2x2: 93.2%       G_3x3: 96.6%
    Up to Step 3:   G_4x4: 98.3%

Table 3.1 shows the accuracy rate of image retrieval within a database of over 200 different images. With about 40 hierarchical grids per image, there are around 8,000 grids in the database. However, for each grid query there are multiple correct matches in the database, since many grids have overlapping parts. As the table shows, the first step is sufficient to get a correct match in most cases. On average, 1.6 queries are needed to achieve an accuracy of up to 98.3%.
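The three-step strategy above is a simple cascade over grid sizes. A compact sketch of the control flow is given below; query_database and count_inlier_matches are hypothetical stand-ins for the bag-of-features retrieval and the pairwise geometric validation, and the inlier threshold is an arbitrary placeholder, not a value from the thesis.

    def hierarchical_query(grid_counts, query_database, count_inlier_matches,
                           N0=100, min_inliers=12):
        """Cascade: top 1x1 grids, then the best 2x2 and 3x3 grids, then the whole image.

        grid_counts: dict (col, row, size) -> feature count, built offline.
        Returns the matched database grid, or None if every step fails.
        """
        def try_grid(grid):
            candidate = query_database(grid)          # highest-ranked database grid
            if candidate is not None and count_inlier_matches(grid, candidate) > min_inliers:
                return candidate
            return None

        # Step one: top three 1x1 grids with enough features, largest first.
        ones = sorted((g for g in grid_counts if g[2] == 1 and grid_counts[g] > N0),
                      key=lambda g: grid_counts[g], reverse=True)[:3]
        for grid in ones:
            hit = try_grid(grid)
            if hit is not None:
                return hit

        # Step two: the best 2x2 grid and the best 3x3 grid.
        for size in (2, 3):
            sized = [g for g in grid_counts if g[2] == size]
            if sized:
                hit = try_grid(max(sized, key=lambda g: grid_counts[g]))
                if hit is not None:
                    return hit

        # Step three: fall back to the whole image (the single 4x4 grid).
        return try_grid((0, 0, 4))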
3.3.1  Scale Detection by Retrieval

Scale space theory is a framework for multi-scale signal representation. It is usually used to handle image structures at different scales by representing an image as a family of image pyramids. Most popular scale-invariant features such as SIFT and SURF make use of scale space analysis to match images at different scale levels. However, detecting keypoints across scales is the most costly part of the matching process, so it is computationally inefficient, especially for portable devices with less computing power.

Figure 3.5: Images taken from different viewpoints are stored in the database.

The workload can be largely reduced if we generate images taken from different viewpoints (as in Fig. 3.5) in the offline process, and detect scales by image retrieval (as in Fig. 3.6). This can be represented by the following inequality,

    T_{one-scale} + T_{retrieval} \ll T_{multi-scale},    (3.1)

where T_{one-scale} and T_{multi-scale} are the detection times for a single scale and for multiple scales respectively, and T_{retrieval} denotes the time used for retrieval.

Taking a 1.5GHz processor as an example, scale-invariant detection on 640 by 480 images usually takes about 900ms for SIFT detectors and 300ms for SURF detectors, while single scale detection takes only 90ms and 25ms respectively, and the retrieval process costs about 5ms to 10ms. In this way, the whole detection process can be significantly sped up.

Figure 3.6: Scale detection by retrievals. The image with the closest viewpoint will be retrieved (thickest arrow).

To generate keypoints at different scales, the Fast-Hessian detector as in SURF is used. Unlike the SURF detector, instead of changing the box filter size, we keep the filter size constant and apply it to image pyramids of up to 9 scales with step 1.2. The image at each scale is built into the database for retrieval. Note that in this case, the concept of scale applies to all keypoints on the image as a whole rather than to a single keypoint.

3.4  Matching Propagation for Stable Pose

We query the database with a selected grid on the query image. This process returns the most similar hierarchical grid in the database. In order to calculate the camera pose for the query image, we need to match the features on the two corresponding grids. The proposed efficient descriptors are used in the matching process.

It takes much less time to match two grids, since there are fewer features compared to the whole image. However, it is better to match all the feature points on the image to obtain a more accurate and stable camera pose. With the proposed matching propagation method, we can match all these features in an efficient way.

Figure 3.7: The matching propagation process. Two propagations are shown in the figure. H_1 is calculated from the matched points in the red area (x_1 and x'_1 are one matching pair). The corresponding point for x_2 can be estimated by x''_2 = H_1 · x_2. The real corresponding point x'_2 can be found near x''_2. Then H_2 is calculated from all matched points within the blue rectangle. Similarly, x'_3 can be found in the neighborhood of the estimated location x''_3.

Fig. 3.7 shows the detailed matching propagation process. The basic idea is that, for a nearby unmatched feature point on the query image, the location of its matching feature on the database image can be estimated by applying the homographic transformation calculated from the existing matched features. So instead of searching the whole image space, we restrict the search range to a much smaller area. Since we apply image matching to outdoor buildings in our application, the building surfaces are almost planar compared to the viewpoint distance. Therefore, the homographic transformation is sufficient for estimating the locations of most features on the image. An example of 3-step matching propagation on a real image is shown in Figure 3.8.

Figure 3.8: The 3 propagations in the image matching process. Different colors are used for the correspondence lines in different propagations. The propagation steps are 1 grid, 2 grids and 4 grids for the 1st, 2nd and 3rd propagation.
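A minimal sketch of one propagation step is given below: the homography estimated from the currently matched pairs predicts where each unmatched query feature should land in the database image, and the true match is then searched for only within a small window around that prediction. Plain L2 distance between descriptors is used as a placeholder, and the radius and distance thresholds are arbitrary; this is an illustration of the idea, not the thesis implementation.

    import numpy as np

    def propagate_matches(H, query_pts, query_desc, db_pts, db_desc,
                          radius=20.0, max_desc_dist=0.3):
        """One propagation step: predict x'' = H * x and search a small neighborhood.

        H: 3x3 homography estimated from the already-matched region.
        query_pts, db_pts: (N, 2) and (M, 2) arrays of keypoint coordinates.
        query_desc, db_desc: corresponding descriptor arrays.
        Returns a list of (query_index, db_index) matches.
        """
        pts_h = np.hstack([query_pts, np.ones((len(query_pts), 1))])
        proj = (H @ pts_h.T).T
        predicted = proj[:, :2] / proj[:, 2:3]        # x'' for every query feature

        matches = []
        for i, x_pred in enumerate(predicted):
            # Candidate database features inside the search window around x''.
            near = np.where(np.linalg.norm(db_pts - x_pred, axis=1) < radius)[0]
            if near.size == 0:
                continue
            dists = np.linalg.norm(db_desc[near] - query_desc[i], axis=1)
            best = int(np.argmin(dists))
            if dists[best] < max_desc_dist:
                matches.append((i, int(near[best])))
        return matches

    # After each step the homography would be re-estimated from all matches gathered
    # so far (e.g. with a RANSAC estimator such as cv2.findHomography), and the search
    # region grown by 1, 2 and then 4 grid widths as described above.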
The matching time with 2 propagation steps is about 10-25ms and with 3 propagation steps about 20-30ms, compared to 100ms-150ms for directly matching the whole images. Since the smallest 1×1 grid is selected in nearly 90% of cases, on average the propagation method saves a large portion of the computational time.

3.5  Performance Evaluations and Discussions

Computational Performance

We have tested the proposed algorithms on a portable device with a 1.5GHz processor. Fig. 3.9 shows our comparison of four different matching schemes on 500 frames. As we can see from the comparison, by shifting the scale space analysis from the detection process to the offline process and replacing scale detection with retrieval techniques, we can almost double the frame rate. Moreover, with the hierarchical partitioning method, the matching time can be reduced further, increasing the frame rate by 4 to 6 times.

Figure 3.9: Comparisons of four different matching schemes on 500 frames (processing time per frame in ms for SIFT feature matching, SURF feature matching, retrieval in scale space, and retrieval + partitioning). The proposed matching scheme improves the frame rate significantly with the same robustness as SURF features.

Retrieval Accuracy for Hierarchical Grids

To match two images, we have to handle occlusion cases, since the images could be taken under different situations. For example, there may be moving vehicles or pedestrians in an outdoor image. With the hierarchical partitioning grids, we can effectively handle these occlusions without affecting the performance. Fig. 3.10 and Fig. 3.11 show some examples of grid retrievals and their performance. All the retrievals are conducted with occluded query images in a database of over 200 different images.

Figure 3.10: Retrieval results for an image with large occlusions. When retrieving with the entire image, the correct image has lower scores than some unexpected images due to occlusions in the image.

Pose Refinement with Propagations

Fig. 3.12 shows the reprojection errors of all the matching points. We can see that the errors after the third propagation are the smallest. Without any propagations, the reprojection errors are large for the features that are far from the grid. For example, the errors in the third propagation area are much larger than the errors in the second propagation area, and the errors inside the grid area are close to 0. Figure 3.13 shows similar results, but with only the reprojection errors of the inliers displayed.

Figure 3.11: Precision-recall curves for retrievals with grid and whole image. (a) retrieval with non-occluded images (b) retrieval with occluded images. The grid method performs better for occluded images since the bag of features for the entire image is largely affected by the occluding parts.

Figure 3.12: Reprojection errors of the calculated pose after each propagation for Fig. 3.8. All the matched features are displayed, ordered by propagation. Those with very large errors are outliers.
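The reprojection errors reported in Figures 3.12 and 3.13 are the pixel distances between the observed feature locations and the locations predicted by the estimated transformation. A minimal sketch of that measurement, here using a planar homography as the transformation, is shown below; the inlier threshold is an arbitrary placeholder.

    import numpy as np

    def reprojection_errors(H, query_pts, db_pts, inlier_thresh=3.0):
        """Pixel error of each match (query -> database) under homography H.

        query_pts, db_pts: (N, 2) arrays of corresponding points.
        Returns (errors, inlier_mask).
        """
        q = np.hstack([query_pts, np.ones((len(query_pts), 1))])   # homogeneous coordinates
        proj = (H @ q.T).T
        proj = proj[:, :2] / proj[:, 2:3]                          # perspective divide
        errors = np.linalg.norm(proj - db_pts, axis=1)
        return errors, errors < inlier_thresh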
Localization and Augmented Reality

We conduct many outdoor experiments with our localization system. Some of the results are shown in Fig. 3.14 and Fig. 3.15. As we can see, the system is demonstrated to be robust in many different cases, such as images captured with rotations, at different distances and with large occlusions. The user locations are displayed by red dots on the map. The building names and locations are also displayed with a 2D label and a 3D sign in the captured images.

Figure 3.13: Reprojection errors of the calculated pose after each propagation for Fig. 3.8. Only the inlier features are displayed, ordered by propagation.

3.6  Summary

We present an efficient matching scheme with fast and robust performance. We use approximate nearest neighbor search to select the best matching grid and match the features inside the two grids. Propagations are applied to obtain more matches on the image. Experiments show that the proposed matching framework increases the matching speed significantly with little robustness sacrificed. Besides, with the grid method, the pose can be recovered in real time whenever it is lost due to occlusions or fast motions.

Figure 3.14: The localization system and augmented reality. (a) the captured image is rotated. (b) the image is captured at a further distance. (a) and (b) are within the same cluster area. (c) and (d) are two examples captured in other cluster areas.

Figure 3.15: Occlusion handling of the localization system. Due to the grid method, the proposed system is effective in handling occlusions.

Chapter 4  Matching under Lighting Variations

Matching images with different illuminations is a fundamental and challenging problem. The matching is difficult because the intensities of an image are not linearly dependent on its illumination. Any change in illumination causes unpredictable changes in image intensities, which may make point-based features fail in the matching process. While many proposed features are invariant to scale and rotation and robust to local deformations, they usually fail when large illumination variations exist in the images, as in the examples shown in Figure 4.1. In this chapter, we try to cope with this problem by proposing a new type of local feature, which describes the context information of neighboring line segments.

There are four challenges in constructing robust features that are based on line segments. The first challenge is to ensure that the interest points are repeatable with stable scales under large illumination changes. In our approach, we utilize edge pixels that have sufficiently large gradient values to make sure these pixels will likely appear in all images. Moreover, edge grouping is enforced so that the same scales can be obtained on different images. The second challenge is to design a descriptor which is less sensitive to illumination variations but sufficiently distinctive. To form such a descriptor, the layout of edge pixels and the signs of gradients, rather than the gradient values (as in SIFT), are used. It is possible that the gradient signs change in some rare cases. However, in most cases they remain the same, because the signs are mainly determined by the reflectance characteristics, which are invariant to illumination. The third challenge is to maintain feature robustness under unstable edge or segment detections. The instability can be caused by occlusions, viewpoint changes or different sensors. We handle this by using a multiple-point representation of line segments. With multiple sample points, any part of a segment makes a partial contribution in computing the descriptors. The fourth challenge concerns fast matching and scalability to large datasets. There are algorithms that exploit complicated matching schemes to match two stereo images.
However, these algorithms are designed for pairwise matching and are not scalable to large databases. Since image matching usually plays an important role in retrieval techniques, it is more promising if the features can be applied in retrieval processes. To achieve scalable matching, we need the feature to have a simple representation form so that tree structures can be used to reduce time complexity.

Figure 4.1: The challenging cases for image matching in different illuminating environments. The photos on top are taken in (a) the sunlight (b) a rainy day (c) nighttime. The photos at the bottom are taken in normal daytime.

Given that our method is built on some previous work, we explicitly state our original contributions as follows:

1. We formally define a novel type of feature that is robust to challenging images whose difficulty is mainly caused by large illumination variations. The feature is based on the context information of line segments. The detailed keypoint detection and descriptor formation processes are presented in later sections.

2. We propose a scalable matching method based on the proposed features. The method can be applied to the retrieval process so that images of the same scenes but under different illuminating environments can be successfully retrieved.

3. We have verified our proposed method on a self-constructed database which contains a large number of image pairs under different illuminations.

4.1  Review of Illumination-Robust Features

In the past decade, a large number of local features have been proposed in the computer vision community. These features are useful in many applications such as object recognition, image integration and image retrieval. They need to be distinctive, easy to extract, and robust to local deformations. According to the level of pixel grouping adopted in their descriptor construction process, these features can be classified into four categories. The first category is intensity based features. While we cannot list all of them, popular ones are SIFT [61], SURF [7] and ASIFT [73], etc. These features make good use of normalized gradients and histograms of gradient directions to form descriptors that are robust to local deformations and small illumination changes. Instead of using the gradient values, features in the second category make use of edge pixels or edgels that are generated by thresholding these gradients in a sophisticated way. There is a rich literature covering the concept of edge detection and we will not review it thoroughly here. Some popular edge-based features are [16,65,69,88,111]. The drawback of edgel based features is that they rely heavily on edge detection and suffer from image clutter.

In recent years, more line based features have been proposed [6,23,31,101,104]. While curve extraction is still challenging, lines are much simpler to detect and represent. A line is a higher-level grouping of edge pixels, and lines are less sensitive to edge detection, clutter and illumination changes. However, most features in this category are designed for pairwise matching and are not scalable to large datasets. The region based features in the fourth category are quite different from each other. We still classify them into the same group because they all use within-region information in certain ways to form their descriptors. Some representative features and comparison studies are [17,64,68,89].

Some features are specially designed to cope with complex illumination variations [30,40,41,95,103,106]. These features belong to the first category and are based on intensity orders rather than values, so that their descriptors are invariant to non-linear intensity changes.
However, one limitation of such features is that they are only able to handle monotonic illumination changes. The feature proposed in this chapter does not have such a limitation.

4.2  Image Matching through Relighting

In this section, we focus on a slightly easier problem, i.e. image matching with known prior illumination information. The problem is still challenging due to the complicated relationship between illumination and image intensities. The proposed method can match two images under very different illumination environments through image decomposition and relighting techniques. An image can be decomposed into two types of intrinsic images, a reflectance image and a shading image. The reflectance of a scene describes how each point reflects light, while the shading of the scene is the interaction of the surfaces in the scene with the environmental lights. So a reflectance image can be considered as the same image after removing the illumination effects. However, trying to match reflectance images does not work: if only the reflectance component exists, the number of interest points is largely limited due to the constancy of pixel intensities.

To handle this problem, we propose a method that relights one of the two images so that they have similar illumination conditions. The prior illumination information is needed and can be obtained in several ways. For example, if the environment is known to have invariant lighting, we can obtain the illumination information by modeling the light sources. There are also methods [50,72] that try to recover the lighting information from images. These methods work only if a single or very limited number of light sources exist. In our work, much more flexible lighting is allowed. We use an environment map to obtain the knowledge of illumination. Compared to other ways, our method is relatively easy to convey and more complex illuminating environments can be represented.

A scene can be relit given the environmental lighting and surface normals. The surface normals can be estimated from the shading component, which is obtained through decomposition. An image is usually decomposed into a reflectance image and a shading image under certain assumptions [10,94,96]. In [10], it is assumed that the reflectance is piecewise constant and the illumination component is spatially smooth, so the illumination image can be derived by removing large derivatives from the input image. Instead of relying on illumination smoothness, Tappen et al. [96] propose a learning-based method to separate reflectance edges and illumination edges in a derivative image. Tan et al. [94] propose a method that is able to separate the diffuse reflectance and specular reflectance. By comparing the logarithmic differentiation of specular-free images, their method can successfully remove the highlights from input images.

Figure 4.2: The framework of the relighting process. I = original image, F = prior illumination information, R = reflectance, S = shading, C = specular, G = geometry image, S' = modified shading, I' = reconstructed image.
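As a concrete reading of the framework in Figure 4.2, the sketch below walks through the pipeline at the pixel level, assuming the standard multiplicative intrinsic image model in which the image is the product of reflectance and shading. The decomposition, geometry estimation and shading steps are only passed in as callables here (they are developed in Section 4.2.1); the function names are ours, not the thesis implementation.

    import numpy as np

    def relight(reference_img, decompose, estimate_geometry, shade):
        """Relight a reference image under a new illumination (minimal sketch).

        decompose(I)         -> (R, S): reflectance and shading, assuming I = R * S
        estimate_geometry(S) -> per-pixel surface normals, using the prior illumination
        shade(normals)       -> new shading S' under the target light constellation
        """
        R, S = decompose(reference_img)       # intrinsic decomposition (e.g. [96], highlights removed by [94])
        normals = estimate_geometry(S)        # Section 4.2.1: shading + known lights -> geometry
        S_new = shade(normals)                # Lambertian shading under the target lights
        return np.clip(R * S_new, 0.0, 1.0)   # reconstructed (relit) image I' = R * S'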
Obtaining geometry information from a shading image is an ill-posed problem and does not have a unique solution, i.e. several surface geometries can yield the same shading image [12]. Much work was done in this area in the 1990s. In [9], Belhumeur et al. prove that when the lighting direction and the albedo of the surface are unknown, the solutions are a continuous family of surfaces. Dupuis and Oliensis [28] prove the uniqueness of constrained C^2 solutions and also characterize some C^1 solutions. Tsai et al. [97] propose a simple but efficient SFS algorithm that employs a discrete linear approximation of the reflection function, combining the Lambertian model and the Torrance-Sparrow model. In our work, we also assume the Lambertian model and a specular reflectance model, as most work does. In addition, the illumination information is employed in the geometry estimation process.

Overview of Relighting Framework

To match two images with different illuminations, our method relights one image with the illumination of the other. Suppose we have two images, test image I_1 and reference image I_2, and their illumination information F_1 and F_2. Fig. 4.2 shows the framework of the proposed relighting process.

Lighting Information from Environment Map

As we have mentioned before, there are several ways to obtain illumination information. In this work, we make use of environment maps, which are obtained through the light probe method [26,51]. Six low dynamic range images are captured in front of a mirrored ball with different exposures (shutter speeds of 1, 1/4, 1/15, 1/60, 1/250, and 1/1000 seconds) and combined into one HDR image. Then a spherical transformation is applied to generate the environment map.

The lighting information is modeled by a constellation of light sources, which can be extracted from the environment map by using the importance sampling method [66]. Each sample represents a distant light source in the environment. The cumulative distribution function (CDF) is constructed so that sampling points are distributed according to the light energies on the map. Using too many samples is computationally inefficient, while using too few samples cannot accurately model the lighting information. In our experiments, we use 256 samples due to the good tradeoff between efficiency and accuracy. Fig. 4.3 shows the sampled light sources.

Figure 4.3: Environment map for the reference image in Fig. 4.4-(a) and the constellation of light sources (green dots) generated by the importance sampling method [66]. The HDR environment map is tone-mapped to a low dynamic range image for display.
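A minimal sketch of the CDF-based sampling step is given below: pixels of a latitude-longitude environment map are drawn with probability proportional to their energy, weighted by the solid angle each row covers, and each draw becomes one distant light with a direction and the pixel's color. It assumes an equirectangular map stored as a float array, and it deliberately omits the exact radiance weighting of [66]; it is our simplification, not the thesis implementation.

    import numpy as np

    def sample_environment_lights(env_map, num_samples=256, rng=None):
        """Draw distant light sources from an equirectangular HDR environment map.

        env_map: (H, W, 3) float array of radiance values.
        Returns (directions, radiances): (num_samples, 3) unit vectors and colors.
        """
        rng = np.random.default_rng() if rng is None else rng
        H, W, _ = env_map.shape

        # Energy per pixel, weighted by sin(theta) for the sphere's area element.
        theta = (np.arange(H) + 0.5) / H * np.pi                 # polar angle per row
        energy = env_map.sum(axis=2) * np.sin(theta)[:, None]
        pdf = energy.ravel() / energy.sum()

        # Sample pixel indices through the cumulative distribution function.
        cdf = np.cumsum(pdf)
        idx = np.searchsorted(cdf, rng.random(num_samples))
        rows, cols = np.unravel_index(idx, (H, W))

        # Convert pixel centers to directions on the unit sphere.
        t = theta[rows]
        phi = (cols + 0.5) / W * 2.0 * np.pi
        directions = np.stack([np.sin(t) * np.cos(phi),
                               np.sin(t) * np.sin(phi),
                               np.cos(t)], axis=1)

        # Each sample becomes a distant light in the pixel's direction, carrying its
        # color. (Proper Monte Carlo weighting by 1 / (N * pdf) and per-pixel solid
        # angle is omitted here for brevity.)
        radiances = env_map[rows, cols]
        return directions, radiances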
Geometry from Shading and Illumination

After image decomposition, the shading component can be considered an image with constant surface albedo. For Lambertian surfaces, the reflectance function in terms of the surface gradients p and q is modeled as

S(x,y) = \mathrm{Ref}(p,q) = \frac{1 + p p' + q q'}{\sqrt{1+p^2+q^2}\,\sqrt{1+p'^2+q'^2}} \quad (4.1)

where S(x,y) is the shading value at pixel (x,y), \mathrm{Ref} is the reflectance function, p = \partial Z/\partial x, q = \partial Z/\partial y, p' = \cos t \sin s / \cos s, q' = \sin t \sin s / \cos s, and t and s are the tilt and slant of the incoming light direction at pixel (x,y).

The values of p and q are represented in discrete form as p = Z(x,y) - Z(x-1,y) and q = Z(x,y) - Z(x,y-1). Equation 4.1 can therefore be rewritten as a function f,

0 = f(S(x,y), Z(x,y), Z(x-1,y), Z(x,y-1)) = S(x,y) - \mathrm{Ref}(p,q) \quad (4.2)

To solve equation 4.2, we use a linear approximation of f together with the Jacobi iterative method, which simplifies the equation into the following form,

Z^{n}(x,y) = Z^{n-1}(x,y) + \frac{-f(Z^{n-1}(x,y))}{\frac{d}{dZ(x,y)} f(Z^{n-1}(x,y))} \quad (4.3)

where Z^{n}(x,y) is the nth iterate of the depth map. The second term of the above equation can be expressed in terms of p, q, p' and q', which are computed from the depth map Z^{n-1}(x,y) of the previous iteration. With Z^{0}(x,y) = 0 for all pixels as the initial estimate, the depth map can be iteratively refined using equation 4.3.

Note that in our case p' and q' depend on the values of (t,s), which are not constant over the image since different parts of the object surface may be covered by different light sources. So we need to calculate (t,s) for each pixel. To compute the cumulative lighting at each pixel, we need to know which light sources affect that pixel. Let P_{x,y} be the subset of all light sources A = \{l_i\}, i = 1..256, covering pixel (x,y), and let \vec{D}_{x,y} and L_{x,y} be the averaged lighting direction and the summed intensity for that pixel. Initially, we assume that every pixel is covered by all the light sources, so P^{0}_{x,y} = A for all (x,y). With \vec{N}_{x,y} as the surface normal at pixel (x,y), and r_i and \vec{d}_i the radiance and direction of light source i, we need to maintain \sum_{i:\, l_i \in P_{x,y}} r_i (\vec{d}_i \cdot \vec{N}_{x,y}) = L_{x,y} (\vec{D}_{x,y} \cdot \vec{N}_{x,y}). Then we have

\vec{D}^{t}_{x,y} = \frac{\sum_{i:\, l_i \in P^{t}_{x,y}} r_i \vec{d}_i}{\big\lVert \sum_{i:\, l_i \in P^{t}_{x,y}} r_i \vec{d}_i \big\rVert}, \qquad L^{t}_{x,y} = \Big\lVert \sum_{i:\, l_i \in P^{t}_{x,y}} r_i \vec{d}_i \Big\rVert \quad (4.4)

where P^{t}_{x,y}, \vec{D}^{t}_{x,y} and L^{t}_{x,y} are the values of P_{x,y}, \vec{D}_{x,y} and L_{x,y} at the tth iteration. L^{0}_{x,y} is used as the illumination chromaticity for removing highlights from the original image.

The tilt and slant (t,s) of every pixel at iteration t are calculated from \vec{D}^{t}_{x,y}. With equation 4.3 the depth image Z^{t}(x,y) can then be generated, and the surface normals \vec{N}^{t}_{x,y} are obtained from Z^{t}(x,y). With the surface normals, we update the light sources for each pixel as P^{t}_{x,y} = \{l_i \mid \vec{d}_i \cdot \vec{N}^{t-1}_{x,y} > 0\}. In case P^{t}_{x,y} is empty, we assign it a set containing only the light source l_i with minimum radiance. With the updated P^{t}_{x,y}, we compute \vec{D}^{t}_{x,y} and L^{t}_{x,y} by equation 4.4.

Figure 4.5: The Jacobi iteration process for geometry estimation. The solution can be well obtained in about 5 to 20 iterations.

At the initial step of solving Z^{0}(x,y), we assume that all pixels receive the same incident light radiance. However, after each iteration, every pixel has its own group of light sources. We therefore normalize the shading image S(x,y) so that the shading values depend only on surface normals and lighting directions, regardless of light source intensities. To normalize S(x,y) at the tth iteration, we set S^{t}(x,y) = S(x,y)/L^{t}_{x,y}. In the next iteration, S^{t}(x,y) is used to calculate Z^{t+1}(x,y) and \vec{N}^{t+1}_{x,y}. The Jacobi iteration steps are
The iteration process will converge when the values of P t x,y for all pixels are stable. Usually 5 to 20 iterations are sufficient for the process to converge or almost converge, i.e. only a few pixels have unstable subsets. In the latter case, we also stop iterations since the instability is probably due to noise pixels. With the prior lightinginformation,wecanestimatethetestimagegeometryasshowninFig. 4.6. ModifiedShadingandImageRelighting The surface normal can be calculated from its neighboring pixels on the depth image. ThenewshadingcomponentcanbegeneratedbyapplyingtheLambertianmodel. When we calculate the surface normals, the normals of edge pixels are averaged. Therefore, weneedtheoldshadingcomponenttomaintaintheedgeinformationintherelitimage. Wecombinebothshadingimageswithweightsa asfollows, S ′ (x,y)=aS(x,y)+(1−a)L x,y ( ⃗ D x,y · ⃗ N x,y ) (4.5) where L x,y and ⃗ D x,y arecalculatedfromequation4.4. Thegeneratedshadingimagewitha =0.3andtherelitimageareshowninFig. 4.6. (a) testimage (b) estimatedgeometry (c) relitimage Figure 4.6: (a) The test image under a different illumination environment. (b) The estimatedgeometry(representedasadepthimage). (c)Therelitimage. 33 As shown in Fig. 4.6-(c), the depth image Z(x,y) is obtained through the iteration process. Surfacenormalscanbecalculatedfromthedepthimage,andrelightingisdone withbothsurfacenormalsandnewlightinginformation. 4.2.2 RelightingResultsandDiscussions ExperimentalResults To match two images with very different illumination conditions is a challenging prob- lem. However,withourproposedrelightingmethod,thematchingresultscanbesignif- icantlyimproved. Figure4.7and4.8showsthematchingresultswithSIFTfeatures. As wecansee,whilethereareveryfewmatchingpointsontheoriginaltestimage,thenum- berofmatchesislargelyincreasedontherelitimage. Moreover,withtraditionalcontrast adjustmentmethodssuchascontraststretchingandhistogramequalization,point-based featuressuchasSIFTwillnotmatchwell(Fig. 4.8). Besides pairwise matchings, we also tested our method to match images within a database. We have constructed a database with over 200 images under 15 reference illuminations. The query images are taken under different illuminations from the ref- erences. The bag of SIFT features is used for matching in the data set. Different con- trastenhancementmethodsareusedforcomparisons. Figure4.9showsrecall-precision curveofthematchingresults. SummarywithDiscussions We propose a novel method to match images under significantly different illumination environments. An iterative algorithm is designed for estimating the surface normal for eachpixelintheimagefromtheshadingcomponentandpriorilluminationinformation. 34 (a) matchingonoriginalimage (b) matchingonrelitimage Figure4.7: MatchingresultswithSIFTfeatures. (a)directmatchusingoriginalimages (b)matchingontherelitimage. The image is then relit and point-based features such as SIFT can be applied in the matchingprocess. Thereare tworeasonable assumptionsthat need be satisfiedfor the method to work effectively. Firstly, the lighting sources should be distant from the object in the image, so the incoming light directions can be considered uniform on the object. Secondly, the exhibition of object surfaces should be close to Lambertian reflectance. It may be difficult to apply the approach if the surfaces are mirror-liked surfaces. With the above assumptions satisfied, the proposed relighting method is promising in matching these challengingimages. 35 Figure4.8: (a)and(b)aretwogroupsofmatchingresultsdisplayed. 
Foreachgroup,the figures from top to bottom, left to right are, reference image and its illuminations, test image and its illuminations, SIFT matching results with contrast stretching, histogram equalizationandproposedrelightingmethod. 4.3 MatchingwithLineContextFeatures 4.3.1 Motivation Through extensive experiments and based on existing research, we found that the best grouping level to handle large illumination variations is using curves or lines. Since curves require more sophisticated groupings and can be approximated by multiple line segments, we use line segments as the primitives of our feature descriptor. We call the 36 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1−Precision Recall relighting original stretching hist equal Figure4.9: Recall-precisionforimagematchingamongover200images. SIFTfeatures areusedforallthemethods. proposedfeatureLineContext,sinceitisinspiredbyShapeContext. Wewilltalkabout thedetailsofourproposedfeatureinthefollowingsections. 4.3.2 LineContextDetector InterestPointsDetectionandInitialScale It is well known that edges are present at various scales. To detect edges at different scalesweusemulti-scaleCannyedgedetector[15]withGaussianderivativesatseveral pre-selected scales. Adaptive Canny thresholds are used to ensure sufficient number of edges but our method is not sensitive to such threshold values. The scale step is set to 1.4. For each edge point, Laplacian operator is used to estimate the size of its neighborhood that has robust responses across all scales. Given the edge point, we compute Laplacian responses for several scales. Then we select the scale for which the responseattainsanextremum[61]. OnepropertyofLaplacianoperatoristhatthescale 37 that achieves extremum is equal to the distance to step-edges ( [57]). After extremum detections,eachedgepointisassignedascales. Not all detected edge points are considered keypoints. Two phases are used to remove unstable edges. The first phase is the above-mentioned Laplacian process. Those edges that do not attain a distinctive extremum over scales will be removed at the same time. The second phase is to remove those edges with homogeneous gra- dients, i.e. the points where the underlying curves have zero curvatures. We apply Harris matrix [42] to achieve this purpose. Let A be the Harris matrix calculated on the edge point andl 1 andl 2 be its two eigenvalues. The edge point will be rejected if min(l 1 ,l 2 )<T res , where T res is the response threshold. Moreover, we can also control thenumberofinterestpointsbythresholdonHarriseigenvalues. After the above two steps, the remaining edge points are relatively stable across different images and are considered keypoints. Each keypoint has a scale s, which is calculatedfromtheLaplacianoperator. Duetointensitydifferences, thescalesofsame interestpointmaynotbeconsistentonimageswithverydifferentilluminations. Sowe considers astheinitialscaleoflinecontextfeatures. Thegroupinginformationwillbe enforcedtoobtainmoreaccuratescales,whichwillbeillustratedlater. LineSegmentsintheContext The edge pixels are linked to connected curves at different scales. These curves are fitted by straight line segments with an approximation method similar to [3]. Each line segmentisassignedanorientation. Theorientationisthesameasthelinewithdirection determined by the sign of average gradients on its two sides along the segment. This is illustratedbycurve candsegment 4inFigure4.10-(a). 
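As a rough illustration of the two-phase detection just described, the sketch below runs Canny at a few pre-selected scales, keeps edge points whose scale-normalized Laplacian attains a distinctive extremum, and rejects points with near-zero curvature using the minimum eigenvalue of the local gradient matrix. The thresholds, the extremum test, and the fixed Canny parameters are placeholders (the text above uses adaptive thresholds), not the exact values of this work.

```python
import cv2
import numpy as np

def detect_line_context_keypoints(gray_u8, scales=(1.0, 1.4, 1.96, 2.744),
                                  t_res=1e-3):
    """Two-phase keypoint detection sketch: multi-scale edges, Laplacian scale
    selection, and rejection of zero-curvature edge points (illustrative only)."""
    gray = gray_u8.astype(np.float32) / 255.0

    # Scale-normalized Laplacian responses at the pre-selected scales (step ~1.4).
    lap = np.stack([s * s * cv2.Laplacian(cv2.GaussianBlur(gray, (0, 0), s),
                                          cv2.CV_32F) for s in scales])

    # Minimum eigenvalue of the local gradient matrix (Harris-style curvature test).
    min_eig = cv2.cornerMinEigenVal(gray, 5)

    edges = cv2.Canny(gray_u8, 40, 120)
    keypoints = []
    for y, x in zip(*np.nonzero(edges)):
        r = np.abs(lap[:, y, x])
        k = int(np.argmax(r))
        # Phase 1: keep only edge points with a distinctive interior extremum over scales.
        distinctive = 0 < k < len(scales) - 1 and r[k] > 1.2 * (r.sum() - r[k]) / (len(r) - 1)
        # Phase 2: discard points whose underlying curve has (near) zero curvature.
        if distinctive and min_eig[y, x] > t_res:
            keypoints.append((int(x), int(y), scales[k]))   # (x, y, initial scale s)
    return keypoints
```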
As shown in Figure 4.10-(a), several cases need to be considered for representing curves with line segments. One curve may be fitted by multiple segments like curve a. 38 Twosegmentswithsmallgapinbetweenaremergedintoonelargersegmentnomatter they are on the same scale (curve b and d) or different scales (curve f and g). In the latter case, the merged segment only exists in the lower scale (segment 8). Besides, all thesegmentsinhigherlevelsarealsosegments(segment 1,2and3)orpartofsegments (segment 5and7)inlowerlevels. (a) (b) (c) Figure 4.10: (a) Curves at various scales are fitted by line segments in different situa- tions. Darkbluedenoteshigherscalelevel,andlightbluedenoteslowerscalelevel. (b) Linesegmentswithindistance2s areconsideredcontextsegments. (c)Akeypointand itscontextsegmentsonanedgeimage. For each keypoint, we need to find line segments in its neighborhood, which is also called context of the feature. The line segments lying inside or partially inside the con- textarecalledcontextsegments. Theinitialscales providesanestimatesizeofsearch- ing area. However, since s is equal to the distance to step edges, it is not surprising that the area within distances may not contain sufficient number of segments. By suf- ficiency,wemeanthattheareacontainsenoughsegmentstobedistinctivefromanother area. Through extensive experiments, we found that 2s is the best trade-off between computations and distinctiveness. The technique of feature scale extension to involve more information is widely used and has been discussed in [67]. Let d point−set (v,S) be 39 theshortestdistancefrompoint vtoasetofpoints S. Withkeypoint k andall-segments {seg i },thesegmentsset Linthecontextisdefinedas, L(k)={seg i |d point−set (k,seg i )≤2s} (4.6) Figure 4.10-(b) shows an example of context segments. All the segments in scale level s and lower scales are included in the context as long as part of the segment is within distance 2s. Note that segments with very small lengths are removed because these segments are unstable on different images (segment 9 and 10). The feature descriptor will be formed by the remaining segments. Figure 4.10-(c) shows a keypoint and its contextsegmentsonanimageafteredgedetection. ScaleandRotationInvariance Feature scale. The Line Context feature is designed to be scale and rotation invariant. Toachievethesetwotypesofinvariance,weneedtodefinetheexactscaleandreference orientationforeachfeaturepoint. The scale 2s that is used to search neighboring segments is not suitable to describe the size of keypoint context because many segments are partially outside this range. Therefore,weneedtorecalculatethescaleforeachfeaturepoint. Letm i bethemidpoint of segment seg i in the context of keypoint k, d point−point (a,b) be distance between two points aandb. Thescale sforthefeatureisdefinedas, s(k)= å seg i ∈L(k) d point−point (k,m i ) |L(k)| (4.7) During descriptor calculation process, all the context segments of keypoint k will be normalizedbyafactorofs(k). 40 Feature orientations. Each keypoint descriptor is assigned a canonical orientation so that the descriptor is invariant to rotations. This orientation is determined by the dom- inant orientation of context segments in the lowest edge detection scale. A segment orientation histogram is created to vote for the dominance. The histogram has 36 bins covering 360 degree range of rotations. The vote from each segment is weighted by itslengthsothatlongersegmentshavelargercontributions. Thekeypointorientationis thendeterminedbythepeakinthehistogram. 
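A small sketch of the feature scale of equation 4.7 and the orientation vote described above is given below. Here a context segment is assumed to be a tuple (p0, p1, orientation_deg, scale) already selected by the 2s rule of equation 4.6; the restriction of the vote to the lowest edge-detection scale is omitted for brevity, and only the single histogram peak is returned.

```python
import numpy as np

def context_scale_and_orientation(keypoint, segments):
    """Sketch of Eq. (4.7) and the length-weighted orientation histogram.
    `segments` are context segments as (p0, p1, orientation_deg, scale) tuples;
    the data layout is an illustrative simplification."""
    k = np.asarray(keypoint, dtype=float)

    # Feature scale s(k): mean distance from the keypoint to the segment midpoints.
    midpoints = np.array([(np.asarray(p0) + np.asarray(p1)) / 2.0
                          for p0, p1, _, _ in segments])
    s_k = np.linalg.norm(midpoints - k, axis=1).mean()

    # Dominant orientation: 36 bins of 10 degrees, votes weighted by segment length.
    hist = np.zeros(36)
    for p0, p1, ori, _ in segments:
        length = np.linalg.norm(np.asarray(p1) - np.asarray(p0))
        hist[int(ori // 10) % 36] += length
    dominant_deg = int(np.argmax(hist)) * 10 + 5   # bin center
    return s_k, dominant_deg
```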
Toresolveambiguities,thoseorientations thatareveryclosetothepeakvaluearealsoconsidered. Inourexperiments,wesetthe closeness ratio as 0.8. In other words, if the largest weight of voting isW max , all those orientations with weights greater than 0.8W max are considered new orientations. For each new orientation, a different descriptor is created. Therefore, each keypoint may havemultipledominantorientationsanddescriptors. 4.3.3 LineContextDescriptor Multiple points representation. We know that the edge detection and linking are not always consistent, which means that many segments that are detected in one image may not be detected or partially detected in other images. Therefore, any particular single point on the segment is not sufficient to describe the whole segment. To resolve this problem, we use multiple sampled points as the representation of segments in the context. Tosampleonthesegments,wefirstnormalizethecontextofkeypoint k byitsscale s(k). Afternormalization,theaveragedistancetothemidpointofthesecontextsegments would be unit 1. The lengths of these segments are also normalized accordingly. For each segment, starting from the midpoint along the segment itself in its two opposing directions,wesamplethesegmentwithstep0.1unit. Figure4.11-(a)showsanexample ofcontextsegmentssampling. 41 AsshowninFigure4.11-(a),fourparametersareusedtodescribeeachsamplepoint. Theyare1)thedistancertothekeypoint,2)theanglea∈[0,360)betweenthedirection from keypoint to sample point and reference direction (keypoint dominant orientation), 3) the angleb ∈[−180,180) between reference direction and the orientation of under- lying segment and 4) the underlying segment scale s. After sampling, all the sample pointswillbeusedtoformthekeypointdescriptor. Log-2D-polar voting. Our goal is to produce a compact descriptor for each keypoint by computing a coarse histogram of the relative coordinates of segment sample points. We use a log-polar-like coordinate system to vote for the relative distances. Different fromtraditionallog-polarcoordinates,a2D-polarcoordinateisusedsincewehavetwo angularparametersa andb. Tobeabletohandlelocaldeformations,wechoosetouse 6 equally spaced angle bins for both a and b/2, and 4 equally spaced log-radius bins for logr. We useb/2 as the coordinates rather thanb becauseb/2 is more convenient toberepresentedinthe3DspherecoordinateswhichisshowninFigure4.11-(b). Thevotingfromeachsegmentsamplepointisdoneasfollows. Let pbethesample point, r ′ , a ′ , b ′ and s be the four parameters belonging to p, s 0 be the initial scale of the keypoint, which is also the highest scale for all context segments. Let (a 0 , b 0 /2, r 0 ) be the center point of the bin containing p. Besides the container bin, on each of the three coordinate directions, we also choose the neighbor bins that are closer to the sample point. This is illustrated in Figure 4.11-(c). Therefore, we have four bins (a 0 , b 0 /2, r 0 ), (a 1 , b 0 /2, r 0 ), (a 0 , b 1 /2, r 0 ) and (a 0 , b 0 /2, r 1 ), and each bin will be voted withcertainweights. TheirweightsaredenotedasW 0 ,W a ,W b andW r respectively. Eachsamplepointhasascaleparameters. Thosepointswithhigherscalesaremore reliablethanpointsinlowerscales. Therefore,weightscarriedbysamplepointsareset inproportiontotheirscales. Letsamplepointwiththehighestscales 0 carryingweight 1,soanypointwithscales willcarryaweights/s 0 . 
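The following sketch shows one way to implement the sampling just described: the context is normalized by s(k), each segment is sampled roughly every 0.1 unit from its midpoint outwards, and each sample records (r, alpha, beta, s) relative to the reference orientation. Variable names and the exact angle conventions are illustrative assumptions.

```python
import numpy as np

def sample_context_segments(keypoint, segments, s_k, ref_ori_deg, step=0.1):
    """Multiple-point representation sketch: returns a list of (r, alpha, beta, s)
    tuples for all sample points of all context segments."""
    k = np.asarray(keypoint, dtype=float)
    samples = []
    for p0, p1, seg_ori_deg, seg_scale in segments:
        # Normalize the segment endpoints by the feature scale s(k).
        p0 = (np.asarray(p0, dtype=float) - k) / s_k
        p1 = (np.asarray(p1, dtype=float) - k) / s_k
        mid, half = (p0 + p1) / 2.0, (p1 - p0) / 2.0
        n_steps = int(np.linalg.norm(half) / step)
        ts = np.linspace(-1.0, 1.0, 2 * n_steps + 1) if n_steps else [0.0]
        for t in ts:                                  # walk outwards from the midpoint
            pt = mid + t * half
            r = np.linalg.norm(pt)                    # distance to the keypoint
            alpha = (np.degrees(np.arctan2(pt[1], pt[0])) - ref_ori_deg) % 360.0
            beta = ((seg_ori_deg - ref_ori_deg + 180.0) % 360.0) - 180.0
            samples.append((r, alpha, beta, seg_scale))
    return samples
```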
42 (a) (b) (c) (d) Figure4.11: (a)multiplepointsrepresentationoflinesegments(b)samplepointsin3D sphere coordinates (c) voting to neighboring bins from a sample point (d) logr, a, b histogram(darkergridrepresentslargerweights) The weight voted to a bin by a sample point is determined by the weight carried by the point and the distance between the point and the bin center. If the bin contains the point, it is voted with the full weight of the point. In another word, W 0 (s 0 ) = 1 and W 0 (s) =s/s 0 . For the other three bins, their weights are exponentially decreasing as the distance to their bin centers increase. The unweighted voting scheme could also be used,buttheweightedvotinghasmoderateimprovementsonrobustness. 43 The accumulated weights from all sample points form a 3D descriptor which is showninFigure4.11-(d). Manybinshave0votewhicharesimilartoSIFTdescriptors. Throughexperiments, wefoundthat itis usuallynotnecessary tocoversegmentsinall scales. Thescales 0 andonelevelloweraregoodestimationsformostcases. BinweightscalculationLet(a ′ ,b ′ /2,logr ′ )bethecoordinateofsamplepoint p. The distance to its neighbor bin ina direction is|a ′ −a 1 |. To calculate the weights for the otherthreebins,wefirstdefineanexponentialfunction f asfollows, f(d,l)=exp(− d l ) (4.8) function f(d,l)representstherelativeweightassigningtothebinatdistanced withref- erence distance l. The reference distance is the distance between two neighboring bins ineachcorrespondingdirection. Fordirectionsa andb,thedistancesarecalculatedas, d a i = min(|a ′ −a i |, 360−|a ′ −a i |), i=0,1 d b i = min(|b ′ −b i |, 360−|b ′ −b i |)/2, i=0,1 l a = min(|a 0 −a 1 |, 360−|a 0 −a 1 |) l b = min(|b 0 −b 1 |, 360−|b 0 −b 1 |)/2 Therefore,theweightsvotedtotheneighborbinsina andb directionsare, W a (s) = f(d a 1 ,l a ) f(d a 0 ,l a ) ·W 0 (s)= f(d a 1 −d a 0 ,l a )·W 0 (s) (4.9) W b (s) = f(d b 1 ,l b ) f(d b 0 ,l b ) ·W 0 (s)= f(d b 1 −d b 0 ,l b )·W 0 (s) (4.10) 44 LetB i denotesthespacecoveredbyinnermostbinsandB o denotethespacecovered byoutermostbins. Theweightassignedtotheneighboringbinin r directionis, W r (s)= 0 if p∈B i &|logr 1 −logr ′ |>l logr 0 if p∈B o &|logr 1 −logr ′ |>l logr f(d logr ,l logr )·W 0 (s) otherwise (4.11) where d logr =|logr 1 −logr ′ |−|logr 0 −logr ′ |and l logr =|logr 1 −logr 0 |. 4.3.4 ImageMatchingandIndexing FeatureDistanceandSimilarity Giventwofeatures, theirdistanceisdefinedasthe 1-normdistancebetweenthe unnor- malized descriptors. The feature has been normalized by its scale s(k) before comput- ing its descriptor. Once the descriptor is computed, it should not be normalized again because the descriptor represents the accumulation of weights rather than weight dis- tribution. Figure 4.12-(a) shows an example of such a difference. As the figure shows, whileleftandrightaretwodescriptorswithdifferentweights,theyhavethesameweight distribution,andthedescriptorwillbecomeidenticalafternormalization. Through experiments, we found that using 1-norm distance generates better perfor- mance than 2-norm distance. Let v 1 and v 2 be two descriptors, then their similarity m(v 1 ,v 2 )isdefinedas, m(v 1 ,v 2 )=1− ||v 1 −v 2 || 1 ||v 1 || 1 +||v 2 || 1 (4.12) 45 MatchingandIndexingwithLineContextFeatures To match two images, we find the most similar feature for each feature in the other image. The distance ratio method is used to remove ambiguous matches. In experi- ments, we set the ratio to 0.8. Besides pairwise matching, the proposed feature can be easily extended to a large dataset. 
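As a usage illustration of the similarity in equation 4.12 and the distance-ratio test, here is a brute-force matching sketch over flattened, unnormalized descriptors; an index such as a vocabulary tree would replace the linear scan in practice.

```python
import numpy as np

def similarity(v1, v2):
    """Eq. (4.12): similarity of two unnormalized Line Context descriptors."""
    d = np.abs(v1 - v2).sum()                         # 1-norm distance
    return 1.0 - d / (np.abs(v1).sum() + np.abs(v2).sum() + 1e-12)

def match_descriptors(desc_a, desc_b, ratio=0.8):
    """Nearest-neighbour matching with the distance-ratio test described above.
    desc_a, desc_b are 2D arrays of flattened descriptors (one per row)."""
    matches = []
    for i, v in enumerate(desc_a):
        dists = np.abs(desc_b - v).sum(axis=1)        # 1-norm distances to all candidates
        order = np.argsort(dists)
        # Accept only unambiguous matches: best distance clearly below the second best.
        if len(order) > 1 and dists[order[0]] < ratio * dists[order[1]]:
            matches.append((i, int(order[0])))
    return matches
```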
A vocabulary tree is constructed for the dataset and bag of features are used to retrieval relevant images. Since our feature is invariant to illumination changes, we are able to retrieve those images of the same scene but taken underdifferentenvironments. (a) (b) Figure 4.12: (a) Two features have different layouts of line segments. After normal- ization, their descriptors will be identical. (b) The example pairs in self-constructed dataset. 4.4 LineContextEvaluationandDiscussions In this section, we show experimental results to demonstrate effectiveness of the pro- posed Line Context feature. We will evaluate the performance of both its detector and descriptor. To validate our detector/descriptor, we have collected images under various condi- tionsfromboththeInternetandtakenbyourselves(asinFigure4.12-(b)). Theseimages have large illumination variations as well as scale and viewpoint changes. The ground 46 truth correspondences are obtained by enforcing both homographic (larger error toler- ance due to non-planar parts) as well as epipolar (smaller error tolerance) constraints. We compare our proposed feature with other local features on standard dataset as well asconstructeddataset. Repeatability. Repeatability is the percentage of keypoints that are detected at same positions and scales across two or more images. For our experiments, we compare the repeatability of Line Context detector with other detectors such as Edge Foci [111], Harris/HessianLaplace[68],MSER[64]andDoGdetector[61]. Wetuneeachdetector to generate approximately the same number of interest points per image ranging from 800 to 2000 points. Figure 4.13 shows the repeatability measures for Mikolajczyk’s dataset [67] and constructed dataset. In the figure, (a) and (b) show results of handling scale, rotation and affine transformations, while (c) and (d) are results for illumina- tion variations. As the results show, Line Context detector is comparable with Hessian Laplace for handling scales, rotations and affine transformations, in which cases Edge Foci performs the best. However, for images with illumination variations, our detector iscompetitivewithEdgeFociandslightlybetterincaseswhenthevariationsarelarge. Feature matching. For the same detector, the matching performance varies by using different descriptors. We compare our Line Context descriptor with OSID [95], which is one of representative descriptors that are specially designed to handle illumination variations. Different combinations of detectors (LC - Line Context, EF - Edge Foci, HeAff-HessianAffine)anddescriptors(LineContext,OSID,SIFT)arecompared. Figure 4.14 shows point correspondences obtained by matching Line Context fea- tures (detector+descriptor) for images in Figure 4.1. We define precision as the per- centage of correct matches out of all matches and recall as percentage of keypoints withcorrectcorrespondences. Thecomparisonswithotherdetector/descriptorpairsare 47 2 2.5 3 3.5 4 4.5 5 5.5 6 0 5 10 15 20 25 30 35 40 45 50 Boat − Scale & Rotation 1st vs image no. repeatability Line Context Edge Foci Harris Laplace Hessian Laplace MSER DoG (a) 2 2.5 3 3.5 4 4.5 5 5.5 6 0 10 20 30 40 50 60 Graffiti − Affine 1st vs image no. repeatability Line Context Edge Foci Harris Laplace Hessian Laplace MSER DoG (b) 2 2.5 3 3.5 4 4.5 5 5.5 6 10 20 30 40 50 60 70 80 90 1st vs image no. repeatability Leuven − Lighting Line Context Edge Foci Harris Laplace Hessian Laplace MSER DoG (c) 1 2 3 4 5 6 7 0 10 20 30 40 50 60 70 Fig 1 and Fig 6 image pair no. 
repeatability Line Context Edge Foci Harris Laplace Hessian Laplace MSER DoG (d) Figure 4.13: Repeatability test on both standard dataset (Mikolajczyk’s, (a) - scale and rotation,(b)-affine,(c)-lighting)andcollecteddataset((d)-lighting). shown above each image pair. It can be observed that images in pair (a) have very dif- ferent tones, and images in pair (c) are almost in two opposite illuminating conditions, thecombinationofourdetectoranddescriptorsignificantlyoutperformsothercombina- tions on these challenging cases. For image pair (b), which is taken with blurry effects, ourdetector+descriptorhassimilarperformancetoEdgeFocicombinedwithOSIDand outperformstheothers. We have tested our feature on more than 120 pairs of challenging images that are taken under different circumstances. The challenges are mainly caused by various illu- mination changes, such as different lighting sources, day-or-night illuminating and dif- ferent weather conditions etc. The average precision-recall curve is drawn in Figure 48 (a) 0 0.2 0.4 0.6 0.8 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 1−precision recall LC+LC EF+LC EF+OSID EF+SIFT HeAff+OSID HeAff+SIFT (b) (c) 0 0.2 0.4 0.6 0.8 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 1−precision recall LC+LC EF+LC EF+OSID EF+SIFT HeAff+OSID HeAff+SIFT (d) (e) 0 0.2 0.4 0.6 0.8 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 1−precision recall LC+LC EF+LC EF+OSID EF+SIFT HeAff+OSID HeAff+SIFT (f) Figure 4.14: Point correspondences obtained by matching Line Context features and comparisonresultswithotherlocalfeaturesforimagesinFigure4.1. 4.15-(f). It clearly shows that our feature performs better than the other features on thesechallengingcases. WehavealsotestedourapproachonMikolajczyk’sdatasetwhichisoneofthestan- dard datasets to verify different feature characteristics. In their illumination cases, the images are less challenging compared to ours in two aspects. The viewpoint changes 49 are smaller and the illumination changes are relatively uniform and monotonic. Figure 4.15-(c), (d) and (e) show the comparisons on handling these cases. It is demonstrated thatastheilluminationdifferencesbecomelarger,ourdetectoranddescriptorcombina- tion is more robust compared to others. For “Leuven" 1st and 6th images, our feature performs much better than the others. We also test Line Context on scale/rotation and viewpoint changes, and the results are reported in Figure 4.15-(a) and (b) respectively. The proposed feature is comparable to others in handling scale/rotaion variances. For imageswithviewpointchanges,ourfeatureisnotasgoodassometopperformerssuch as SIFT with Hessian Affine detector. This is because under viewpoint changes, the parameters r, a and b for each sample will be affected at the same time, which cause variancesonthevotingresults. However,theperformanceonviewpointchangesisclose tomanyclassicdescriptorsandisquiteacceptableinmanyapplications. Runningperformance. InTable4.1,weshowrun-timecomparisonswithseveralpop- ular features (detector+descriptor). For MSER detected regions, SIFT descriptors are used. As can be observed, our approach can detect more correct matches than most of these features. The Line Context extraction and matching time are similar to oth- ers. Whileinsome examplesLS candetectmore correctmatches, itsmatchingprocess takesmuch longer due to its complicated matching scheme. For the same reason, LS is not suitable for tree structures without major modifications thus not scalable to a large numberofimages. Matching examples. 
We show more comparisons results of matching for image pairs in 4.12-(b). SIFT, SURF, OSID and Line Context are compared. As the figure shows, thematchingwithourproposedfeaturecangeneratemorecorrectcorrespondences. ItshouldalsobenotedthatonelimitationofLineContextfeatureisthatit doesnot perform well in images with all small or fine-grained textures. In these images, there arenoexplicitcurvesorlines. Thetexturesaremadeupoftinysegmentsinalldifferent 50 4.1-(a) (b) (c) 4.12-(b)-I (b)-II (b)-III (b)-IV Extract 1-Pair nimages SIFT 2/11 6/13 1/17 2/21 2/15 1/14 5/21 0.9-1.3s 0.1-0.2s O(lgn) SURF 0/7 4/9 0/12 0/11 1/8 0/12 4/17 0.3-0.4s 0.1-0.2s O(lgn) SC 3/17 3/11 1/13 2/16 3/18 1/12 3/18 0.7-1.2s 0.2-0.5s O(lgn) LS 29/37 51/55 19/21 23/35 11/15 72/77 52/65 3-4s 12-18s O(n) MSER 0/2 1/7 2/9 2/7 0/5 0/8 1/12 0.2-0.3s 0.2-0.4s O(lgn) LineContext 31/47 36/53 49/60 36/53 32/41 42/58 34/39 0.8-1.5s 0.2-0.3s O(lgn) Table 4.1: Comparison of different features (SIFT, SURF, Shape Context, Line Signa- ture, MSER+SIFT, Line Context) on images in Figure 4.1 and Figure 4.12-(b). The number of correct matches over the number of detected matches is reported. It also comparesfeatureextractiontime,matchingtimeforonepairofimages,andcomplexity formatchingwithnimages. Treestructurescanbeappliedforsomefeaturestoachieve O(lgn)complexity. directions. When the context size is small, the voting is more sensitive to changes or noises. In this case, we cannot form stable context segments around keypoints, and the computeddescriptorsarenotrobust. 4.5 Summary Wehavepresentedanewtypeoffeature,called Line Context. Thefeatureisformedby describingthelayoutoflinesegmentsinthecontextofaninterestpoint. Pixelgrouping is enforced after Laplacian-of-Gaussian to detect a more repeatable scale under large illumination changes. Moreover, to achieve robustness to unstable edge detections and linkings,multiplesamplepointsareusedtorepresenteachsegment. Avotingschemeis thenproposedtogenerateathreedimensionalhistogram. Theunnormalizedweightsin thehistogramformalinecontextdescriptor. Similartomostlocalfeatures,linecontexts arerobusttoocclusionsandviewpointchanges. Significantimprovementsareachieved onmatchingchallengingimagessuchaslargeilluminationvariations. 51 0 0.2 0.4 0.6 0.8 1 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 1−precision recall Boat 1st and 4th LC+LC EF+LC EF+OSID EF+SIFT HeAff+OSID HeAff+SIFT (a) 0 0.2 0.4 0.6 0.8 1 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1−precision recall Graffiti 1st and 4th LC+LC EF+LC EF+OSID EF+SIFT HeAff+OSID HeAff+SIFT (b) 0 0.2 0.4 0.6 0.8 1 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 1−precision recall Leuven 1st and 2nd LC+LC EF+LC EF+OSID EF+SIFT HeAff+OSID HeAff+SIFT (c) 0 0.2 0.4 0.6 0.8 1 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1−precision recall Leuven 1st and 4th LC+LC EF+LC EF+OSID EF+SIFT HeAff+OSID HeAff+SIFT (d) 0 0.2 0.4 0.6 0.8 1 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 1−precision recall Leuven 1st and 6th LC+LC EF+LC EF+OSID EF+SIFT HeAff+OSID HeAff+SIFT (e) 0 0.2 0.4 0.6 0.8 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 1−precision recall constructed database LC+LC EF+LC EF+OSID EF+SIFT HeAff+OSID HeAff+SIFT (f) Figure 4.15: Precision-recall measures on Mikolajczyk’s standard dataset (a to e) and constructed dataset (f). The cases handled are (a) scale and rotation (b) viewpoint changes(c)-(e)illuminationchanges(f)morechallengingilluminationchanges. 52 (a) MatchingwithSIFT (b) MatchingwithSURF (c) MatchingwithOSID (d) Matchingwithproposedfeature Figure4.16: MatchingresultsforExample4.12-(b)-I. 
(a) MatchingwithSIFT (b) MatchingwithSURF (c) MatchingwithOSID (d) Matchingwithproposedfeature Figure4.17: MatchingresultsforExample4.12-(b)-II. 53 (a) MatchingwithSIFT (b) MatchingwithSURF (c) MatchingwithOSID (d) Matchingwithproposedfeature Figure4.18: MatchingresultsforExample4.12-(b)-III. (a) MatchingwithSIFT (b) MatchingwithSURF (c) MatchingwithOSID (d) Matchingwithproposedfeature Figure4.19: MatchingresultsforExample4.12-(b)-IV. 54 Chapter5 ImageAlignmentwithPointClouds Thischapterdealswiththeproblemofautomaticposeestimationofa2Dcameraimage with respect to a 3D point cloud of an urban scene, which is an important problem in computer vision. Its application include urban modeling, robots localization and augmented reality. One way to solve this problem is to extract features on both types of data and find the 2D-to-3D feature correspondences. However, since the structures of this two types of data are so different, the features extracted from one type of data are usually not repeatable in the other (except for very simple features such as lines or corners). Instead of direct extracting in 3D space, features can be extracted on their 2D projectionsand2D-to-2D-to-3Dmatchingschemecanbeused. (a) (b) Figure 5.1: (a) The 3D LiDAR data with color information (sampled by software for fastrendering). (b)The2Dimageofthesamescenetakenatthegroundlevel. Asremotesensingtechnologydevelops,mostrecentLiDARdatahasintensityvalue foreachpointinthecloud. Somedatasetalsocontainscolorinformation. Theintensity information is obtained by measuring the strength of surface reflectance, and the color 55 information is provided by an additional co-located optical sensor that captures visible light. These information is very helpful for matching 3D range scans with 2D camera images. Unlike geometry-only LiDAR data, intensity-based features can be applied in theposeestimationprocess. Figure 5.1 shows the colored point cloud data and a camera image taken on the ground. As we can observe, the projected point cloud looks similar to an image that is taken by an optical camera. The fact is that if the point cloud is dense enough, the projected 2D image can be treated the same way as a normal camera image. However, there are several differences between projected image and an image taken by a camera. First, there are many holes on the projected image due to the missing data. This is usually caused by the non-reflecting surfaces and occlusions in the scene. Second, if the point cloud intensity is measured by reflectance strength, the reflectance property of invisible lights are different from that of visible lights. Even in the case that visible lightsareused toobtain dataintensities, thelighting conditionscould bedifferentfrom thelightingofacameraimage. Inthischapter,wepresentanalgorithmthatcanhandle pointcloudswithbothtypesofintensityinformation. The intensity information of point clouds is useful for camera pose initialization. However, due to intensity differences, occlusions etc., there are not many correspon- dencesavailableandonesmalldisplacementinanyofthesematchingpointswillcause large errors in the computed camera pose. Moreover, in most urban scenes, there exist many repeated patterns, which make many features fail in the matching process. With the initial pose, we can estimate the location of corresponding features and limit the searching range within which repeated patterns does not appear. Therefore, we can generate more correspondences and refine the pose. 
After several iterative matchings, we further refine the camera pose by minimizing the differences between the projected image and camera image. The estimated camera pose is more stable after the above 56 two steps of refinement. The contribution of the work in this chapter is summarized as follows. 1. We present a framework of camera pose estimation with respect to 3D terrestrial LiDARdata that contains intensity values. No prior knowledgeabout the camera positionisrequired. 2. We have designed a novel algorithm that refines the camera pose in two steps. Bothintensityandgeometricinformationareusedintherefinementprocess. 3. We have tested the presented framework on different point clouds. The results showthattheestimatedcameraposeisaccurateandtheframeworkcanbeapplied inmanyapplicationssuchasmixedreality. Theremainderofthischapterpresentsthealgorithminmoredetail. Wefirstdiscuss some related work in Section 2. Section 3 describes the camera pose initialization pro- cess. Following that, Section 4 discusses the algorithm that refines camera pose. We showexperimentalresultsinSection5andconcludethechapterinthelastsection. 5.1 ReviewofImageRegistrationwithPointClouds There has been a considerable amount of research in registering images with point clouds. The registration methods vary from keypoint-based matching [2,8], structure- based matching [59,60,91,92], to mutual information based registration [102]. There are also methods that are specially designed for registering aerial LiDAR and aerial images[27,33,63,98,100]. Whenthepointcloudcontainsintensityvalues,keypoint-basedmatchings[2,8]that are based on similarity between point cloud intensities and image intensities can be 57 applied. Feature points such as SIFT [61] are extracted from both images and a match- ing strategy is used to determine the correspondences thus camera parameters. The drawback of intensity-based matching is that it usually generates very few correspon- dencesandtheestimatedposeisnotaccurateorstable. Najafietal[78]alsocreatedan environment map to represent an object appearance and geometry using SIFT features. Vasile, et al. [98] used LiDAR data to generate a pseudo-intensity image with shadows thatareusedtomatchwithaerialimagery. TheyusedGPSastheinitialposeandapplied exhaustive search to obtain the translation, scale, and lens distortion. Ding et al. [27] registered oblique aerial images based on 2D and 3D corner features in the 2D images and3DLiDARmodelrespectively. Thecorrespondencesbetweenextractedcornersare generated through Hough transform and a generalized M-estimator. The corner corre- spondencesareusedtorefinecameraparameters. Ingeneral,arobustfeatureextraction andmatchingschemeisthekeytoasuccessfulregistrationforthistypeofapproaches. Instead of point-based matchings, structural features such as lines and corners have been utilized in many researches. Stamos and Allen [91] used matching of rectangles from building facades for alignment. Liu et al. [59,60,92] extracted line segments to form “rectangular parallelepiped", which are composed of vertical or horizontal 3D rectangularparallelepipedintheLiDARand2Drectanglesintheimages. Thematching of parallelepiped as well as vanishing points are used to estimate camera parameters. Yang,etal.[105]usedfeaturematchingtoaligngroundimages,buttheyworkedwitha very detailed 3D model. 
Wang and Neumann [100] proposed an automatic registration methodbetweenaerialimagesandaerialLiDARbasedonmatching3CS(“3Connected Segments) in which each linear feature contains 3 connected segments. They used a two-level RANSAC algorithm to refine putative matches and estimated camera pose fromthecorrespondences. 58 Figure 5.2: The virtual cameras are placed around the LiDAR scene. They are placed uniformlyinviewingdirectionsandlogarithmicallyinthedistance. Givenasetof3Dto2Dpointorlinecorrespondences,therearemanyapproachesto solvetheposerecoveryproblem[21,37,43,45,79,83]. Thesameproblemsalsoappear in pose recovery with respect to point cloud which is generated from image sequences [38,39]. In both cases, a probabilistic RANSAC method [32] was also introduced for automaticallycomputingmatching3Dand2Dpointsandremoveoutliers. Inthispaper, we will apply keypoint-based method to estimate initial camera pose, then use iterative methodswithRANSACbyutilizingbothintensityandgeometricinformationtoobtain therefinedpose. 5.2 InitialPoseEstimationwithPoint-basedFeatures 5.2.1 SyntheticViewsofLiDARData To compute the pose for an image taken at an arbitrary viewpoint, we first create “syn- thetic" views that cover a large viewing directions. Z-buffers are used to handle occlu- sions. Ourapplicationistorecoverthecameraposesofimagestakeninurbanenviron- ments, so we can restrict the placement of virtual cameras to the height of eye-level to simplifytheproblem. Generally,theapproachisnotlimitedtosuchimages. 59 We place the cameras around the LiDAR in about 180 degrees. The cameras are placed uniformly in the angle of views and logarithmically in distance, as shown in Figure 5.2. The density of locations depends on the type of feature that we use for matchings. If the feature is able to handle rotation, scale and wide baseline, we need less virtual cameras to cover most cases. In contrast, if the feature is neither rotation- invariant nor scale-invariant, we need to select as many viewpoints as possible, and rotatethecameraateachviewpoint. Furthermore,itshouldbenotedthattheviewpoints cannotbetooclosetopointcloud,otherwisethequalityofprojectedimageisnotgood enough to generate initial feature correspondences. In our work, we use SIFT [61] features which are scale and rotation invariant and robust to moderate viewing angle changes. We select 6 viewing angles uniformly and 3 distance for each viewing angle inalogarithmicway. ThesyntheticviewsareshowninFigure5.3. 5.2.2 Generationof3DFeaturePointsCloud We extract 2D SIFT features for each synthetic view. Once the features are extracted, we project them back onto the point cloud by finding intersection with the first plane that is obtained through plane segmentation with method [91]. It is possible that the same feature is reprojected onto different points through different synthetic views. To handlethisproblem,wepostprocessthesefeaturepointssothatclosepointswithsimilar descriptors are merged into one feature. Note that we can also get the 3D features by triangulationmethod. However,suchmethoddependsonmatchingpairssoitgenerates muchfewerfeaturesforpotentialmatch. Theobtainedpositionsof3Dkeypointsarenot accurateduetoprojectionandreprojectionerrors,butgoodenoughtoprovideaninitial pose. Wewilloptimizetheirpositionsandcameraposeinlaterstage. The generated 3D feature cloud is shown in Figure 5.4. Each point is associated with one descriptor. For a given camera image, we extract SIFT features and match 60 Figure 5.3: The synthetic views of LiDAR data. 
2D features are extracted from each syntheticview. themwiththefeaturecloud. Adirect3Dto2DmatchingandRANSACmethodisused to estimate the pose and remove outliers. When we use RANSAC method, rather than maximizingthenumberofinliersthatareconsensustothehypothesizedpose,wemake modificationsasfollows. Weclustertheinliersaccordingtotheirnormaldirections. Inlierswithclosenormal directionswillbegroupedintothesamecluster. Let N 1 andN 2 bethenumberofinliers for the largest two clusters. Among all the hypothesized poses, we want to maximize thevalueof N 2 ,i.e. [R|T]=argmax [R|T] N 2 . (5.1) 61 Figure 5.4: SIFT features in 3D space. The 3D positions are obtained by reprojecting 2Dfeaturesontothe3DLiDARdata. This is to ensure that not all the inliers lie within the same plane, in which case the calculatedposeisunstableandsensitivetopositionerrors. 5.3 IterativePoseRefinement 5.3.1 RefinementbyMorePointCorrespondences Withtheestimatedinitialpose,wecangeneratemorefeaturecorrespondencesbylimit- ingthesimilaritysearchingspace. Forthefirstiteration,westilluseSIFTfeature. From 2nditerationon,wecanuselessdistinctivefeaturestogeneratemorecorrespondences. Inourwork,weuseHarriscornersasthekeypoints. Foreachcornerpoint,anormalized intensityhistogramwithinan8x8patchiscomputedasthedescriptor. Itscorresponding point will probably lie within the neighborhood of H by H pixels. Initially, H is set to 64 pixels. For each iteration, the size is reduced to half since more accurate pose is obtained. We keep the minimum searching size to 16 pixels. Figure 5.5 shows a few iterationsandmatchingresultswithinreducedsearchingspace. 62 (a) (b) (c) (d) Figure 5.5: (a) The initial camera pose and 3D to 2D matching. (b) Camera pose after 1st iteration. (c) 3D to 2D matching based on refined pose. (d) Camera pose after 2nd iteration. 5.3.2 GeometricStructuresandAlignment The purpose of geometric structure extraction is not to form features to generate cor- respondences. Instead, they are used to align 3D structure with 2D structures in the camera image. In our work, line segments are used to align 3D range scans with 2D images. Therefore,weneedtodefinethedistancebetweentheselinesegments. Therearetwotypesoflinesinthe3DLiDARdata. Thefirsttypeisgeneratedfrom thegeometricstructure,whichcanbecomputedattheintersectionsbetweensegmented planar regions and at the borders of the segmented planar regions [91]. The other type of lines is formed by intensities. These lines can be detected on the projected synthetic imagewithmethod[3]andreprojectedonto3DLiDARtogettheir3Dcoordinates. For eachhypothesispose, these 3Dlines are projectedonto 2D imagesand we measurethe alignmenterrorasfollows. 63 E line = 1 N M å i=1 N å j=1 K(l i ,L j )·max(D(l i1 ,L j ),D(l i2 ,L j )), (5.2) wherel i isthe ith2Dlinesegmentwithl i1 andl i2 asitstwoendpoints, L j is jth3Dline segment. MandNarenumberof2Dsegmentsand3Dsegmentsrespectively. K(l i ,L j )is abinaryfunctiondecidingwhetherthetwolinesegmentshavesimilarslopes. D(l i1 ,L j ) is a function describing the distance from the endpoint l i1 to projected line segment L j . ThefunctionK and Daredefinedasfollows, K(l,L)= 0 for (l,L)< K th 1 for (l,L)≥ K th (5.3) D(l 12 ,L)= 0 ford(l 12 ,L)≥ D th d(l 12 ,L) ford(l 12 ,L)< D th (5.4) where(l,L)representstheangledifferencebetweenthetwolinesegments,andd(l 12 ,L) is the distance from endpoint l 1 or l 2 to projected line segment L. K th and D th are two thresholdsdecidingwhetherthetwosegmentsarepotentialmatches. 
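A sketch of the alignment error of equations 5.2-5.4 is given below, with the two thresholds passed in as parameters. Here K is read as a gate that keeps only segment pairs with similar slopes (angle difference below K_th), consistent with the thresholds deciding potential matches, and the normalization constant is taken as the number of projected 3D segments; function names are illustrative.

```python
import numpy as np

def line_alignment_error(lines2d, lines3d_proj, k_th, d_th):
    """E_line sketch: lines2d[i] = (l1, l2), lines3d_proj[j] = (L1, L2),
    all endpoints already in image coordinates."""
    def angle(a, b):
        v = np.asarray(b, float) - np.asarray(a, float)
        return np.arctan2(v[1], v[0]) % np.pi

    def point_to_segment(p, a, b):
        a, b, p = (np.asarray(x, float) for x in (a, b, p))
        t = np.clip(np.dot(p - a, b - a) / (np.dot(b - a, b - a) + 1e-12), 0.0, 1.0)
        return np.linalg.norm(p - (a + t * (b - a)))

    total = 0.0
    for l1, l2 in lines2d:
        for L1, L2 in lines3d_proj:
            dtheta = abs(angle(l1, l2) - angle(L1, L2))
            dtheta = min(dtheta, np.pi - dtheta)
            if dtheta >= k_th:              # slopes too different: not a potential match
                continue
            d = max(point_to_segment(l1, L1, L2), point_to_segment(l2, L1, L2))
            if d < d_th:                    # D(.) discards distances beyond D_th
                total += d
    return total / max(len(lines3d_proj), 1)
```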
Inourexperiment, weset K th =p/6,D th =W/20,whereW istheimagewidth. 5.3.3 RefinementbyMinimizingErrorFunction Oncewehaveobtainedthecameraposewithiterativerefinements,wecanfurtherrefine the pose by minimizing the differences between LiDAR-projected image and camera image. The differences are represented by an error function, which is composed of two parts, line differences and intensity differences. We have talked about line differences above. Theintensityerrorfunctionisdefinedasfollows, 64 (a) (b) Figure 5.6: (a) The refined pose with more correspondences. (b) Camera pose after minimizingtheerrorfunction. E intensity = 1 |{i}| å i (s·I 3D (i)−I 2D (i)) 2 , (5.5) where I 3D (i) and I 2D (i) are intensity values for the ith pixel on projected image and camera image respectively. |{i}| is the number of projected pixels. s is the scale factor that compensate the reflectance or lighting differences. s will take the value that can minimizetheintensityerrors,sotheaboveerrorfunctionisequivalentto, E intensity = å i (I 2D (i) 2 − I 3D (i) 2 ·I 2D (i) 2 I 3D (i) 2 ), (5.6) Theoverallerrorfunctionisaweightedcombinationoftwoerrorfunctions, E pose =aE line|pose +(1−a)E intensity|pose , (5.7) 65 (a) (b) Figure5.7: Intensitydifferencesbetweenprojectedimageandcameraimage. (a)errors afteriterativerefinements(b)errorsafteroptimization. where pose is determined by the rotation R, translation T or equivalently 3D positions of keypoints P. We seta =0.5 in our experiments. Since intensity errors usually have largerscales,thiswillmakeintensityalargereffectontheoverallerrorfunction. Therelativeposeisrefinedviaminimizationoftheaboveerrorfunction: (R,T,P)=argmmin R,T,P E pose|R,T,P . (5.8) We use similar method with coordinate descent. Instead of gradient decent in each iteration, we use closest-point method to update each parameter as in Figure 5.8. The refinementresultsareshowninFigure5.6and5.7. 5.4 ResultsandDiscussions We have tested more than 10 sets of scenes with camera images taken from different viewpoints. Figure 5.9 shows an example of pose recovery through iterations and opti- mizations. Afteraseriesofrefinementthroughiterativematchingsandoptimization,we can get an accurate view of a given camera image. Figure 5.10 shows an image of the 66 Figure 5.8: The closest corresponding edges. The parameters updates can be obtained from these correspondences. Left - the edges are not well aligned for target image and projectedimage. Right-theclosestedgecorrespondences. samescenebuttakenfromanotherview. Itcanbeeasilyobservedthatthevirtualimage iswellalignedwiththerealimagebyblendingthetwoimagestogether. Wehavealsomeasuredtheprojectionerrorsforeachrefinementprocess. Theresults are shown in Figure 5.11-(a) and Figure 5.11-(b). As is shown in Figure 5.11-(a), the errorsstayconstantafter3rdrefinements. Thisisbecausethatusuallywehaveobtained sufficientcorrespondencesafter3rditerationtogetastablecamerapose. Theerrorsare caused by the errors in calculating the 3D position of keypoints. This can be further improvedbyadjustingtheposetogetevensmallerprojectionerrors,asshowninFigure 5.11-(b). However,duetomovingpassengers,occlusions,lightingconditionsetc.,there arealwayserrorsbetweenaprojectedimageandacameraimage. In Figure 5.12 and Figure 5.13, we show more results for the iterative alignment betweenimagesandLiDARdata. Theblendedimageanddifferenceimagebetweenthe realimageandprojectedLiDARareshowninthefigures. 5.5 Summary We have presented a framework of camera pose estimation with respect to 3D terres- trial LiDAR data. 
The LiDAR data contains intensity information. We first project the 67 (a) (b) (c) (d) (e) (f) Figure 5.9: (a) The camera image (b) The initial view calculated from limited number of matches (c) The refined view by generating more correspondences (d) It has no fur- ther refinement (from more matches) after 2 or 3 iterations (e) The pose is refined by minimizingtheproposederrorfunction(f)Thevirtualbuildingiswellalignedwiththe realimageforthecalculatedview. LiDARontoseveralpre-selectedviewpointsandcalculatetheSIFTfeatures. Thesefea- tures are reprojected back onto the LiDAR data to obtain their positions in 3D space. These 3D features are used to compute the initial pose of the camera pose. In the next stage, weiterativelyrefinethecameraposebygeneratingmorecorrespondences. After that,wefurtherrefinetheposethroughminimizingtheproposedobjectivefunction. The function is composed of two components, errors from intensity differences and errors from geometric structure displacements between projected LiDAR image and camera image. We have tested the proposed framework on different urban settings. The results showthattheestimatedcameraposeisstableandtheframeworkcanbeappliedinmany applicationssuchasaugmentedreality. 68 (a) (b) Figure5.10: TheestimatedcameraposewithrespecttothesamesceneasinFigure5.9 but from a different viewpoint. The right figure shows the mixed reality of both virtual worldandrealworld. 0 2 4 6 8 10 0 10 20 30 40 50 60 70 80 iteration no. errors (a) 0 5 10 15 20 25 30 0 10 20 30 40 50 60 image no. errors after iterative refinements after optimization (b) Figure 5.11: (a) The errors after each iterative refinement. (b) The errors before and afteroptimization. 69 (a) (b) (c) (d) Figure5.12: Oneexampleoftheiterationprocess. Left-projectedlidarimage,middle- projectedimageblendedwithoriginalimage,right-differencebetweenprojectedimage andoriginalimage. 70 (a) (b) (c) (d) Figure5.13: Anotherexampleoftheiterationprocess. Left-projectedlidarimage,mid- dle - projected image blended with original image, right - difference between projected imageandoriginalimage. 71 Chapter6 6-DOFTrackingwithPointClouds Tracking with vision-based sensors is a critical and popular problem in vision com- munity. Optical sensors provide a tremendous amount of information about the user’s environments. Withtheincreasingpopularityofsmartdevices,moreandmoreapplica- tions are based on robust tracking algorithms. However, there are many challenges in buildingarobusttrackingsystem. Firstly, how the scene data is constructed so that pose initialization and re- initialization can be conducted robustly. It is known that camera pose can be easily lost due to occlusions and fast motions. The tracking system should be able to robustly recover the pose under different situations. Secondly, pose estimation with respect to pointcloudsis notstraight forward. Thestructures ofimages andpoint cloudsarevery different, so features extracted from one type of data are usually not repeatable in the other. Thirdly, there are many noises in point clouds and the 3D modeling itself is a challenging research problem. Our tracking system cannot rely on a well-established modeling algorithm. Instead, we need some other ways to ensure tracking efficiency and robustness. In this chapter, we will present a novel tracking system with above issueswellhandled. Theremainderofthischapterpresentsthetrackingmethodsinmoredetail. Wefirst discuss some related work in Section 6.1. Section 6.2 describes the point cloud prepro- cessingprocess. 
Followingthat,Section6.4talksabouttheposerecoveryscheme. The detailed tracking method is discussed in Section 6.3. We show experimental results in Section6.5andconcludethechapterinthelastsection. 72 6.1 ReviewofTrackingwithPointClouds In the last decade, a considerable amount of research has been conducted on vision- basedSLAM.Camerascanproviderichinformationabouttheenvironmentandvarious features can be detected on the images. Some popular visual SLAM implementations that use single cameras are [29,35,48,52,70,75]. There are some viual SLAMs that make use of wide-angle cameras such as [24,25], as they allow tracking in a wider motion ranges. Due to the drifting and lighting problems, most of the single camera SLAMimplementationsareforindoorenvironmentsinarelativelysmallworkspace. AsLiDARdataiseasiertoaccessnowadays,afullSLAMsolutionsareusuallynot necessary for most camera tracking problems. Tracking with well established LiDAR data is considered a promising alternative. While we do not simultaneously construct andoptimizethemap,weneedtoaccuratelyregistertheimageswithLiDARdata. Many methodshavebeenproposedinimageregistrationwithLiDARdata. Thesemethodscan becategorizedintokeypoint-basedmatching[2,8]andstructure-basedmatching[59,60, 91,92]. WhentheLiDARdatacontainsintensityvalues,keypoint-basedmatchings[2,8] thatarebasedonsimilaritybetweenLiDARintensityimageandcameraintensityimage can be applied. Feature points such as SIFT [61] are extracted from both images and a matching strategy is used to determine the correspondences thus camera parameters. Najafi et al [78] also created an environment map to represent an object appearance and geometry using SIFT features. Vasile, et al. [98] used LiDAR data to generate a pseudo-intensity image with shadows that are used to match with aerial imagery. Ding et al. [27] registered oblique aerial images based on 2D and 3D corner features in the 2D images and 3D LiDAR model respectively. The corner correspondences are used to refine camera parameters. Instead of point-based matchings, structural features such as lines have been utilized in many researches. Stamos and Allen [91] used matching 73 of rectangles from building facades for alignment. Liu et al. [59,60,92] extracted line segments to form “rectangular parallelepiped". The matching of parallelepiped as well as vanishing points are used to estimate camera parameters. Yang, et al. [105] used featurematchingtoaligngroundimages,buttheyworkedwithaverydetailed3Dmodel. Wang and Neumann [100] proposed an automatic registration method between aerial images and aerial LiDAR based on matching 3CS (“3 Connected Segments") and used a two-level RANSAC algorithm to refine putative matches and estimated camera pose fromthecorrespondences. One of the main challenges in camera tracking is the error drifting issues. Many methods have been proposed to solve this problem [34,54,84,110]. In [54], Leptit et al. treated tracking as a detection task. They stored patches of textured models from differentviewpointsandmatchedeachframetooneofthekeyframes.[110]proposeda two-stageapproachtosolvethefeaturedriftproblem. First,thetranslationiscalculated from the previous frame to the current frame. Then the affine motion between the first frame and the current frame is estimated. The translation parameters of the new affine motion parameters constitute the final feature position for the current frame. In [34], local optimization is applied for 3D human motion capture. 
Since it searches only for thelocallybestsolution,itusuallycannotrecoverfromerrorsandrequiresaninitializa- tion. Without prior information, the tracking often fails in case of fast movements and occlusions. In [84], they proposed a method that combines static and adaptivetemplate trackingtoreduceerrordrift. However,templateshavelimitedmodelingcapabilitiesas theyrepresentonlyasingleappearanceoftheobject. Inthischapter,wewillproposea trackingsystemthatcanestimatetheinitialposeandrecoverthelostposes. Thesystem can also minimize error drifts in consecutive frames. We will talk more details in the followingsections. 74 6.2 PreprocessingofPointClouds We scan overlapped range data along streets in the urban area. At the same time, color images are also taken. The range scans are merged into larger scans with ICP method in[107]. Themethod[60]isusedtoregisterimageswithrangescansandcolorscanbe mapped onto the scan data. Each taken image is also labeled with geographic informa- tionsuchasUTMcoordinatessothatimagescanbegroupedbytheirlocationstoreduce searchingrange. Wewillusecoloredrangedatatogeneratemoresyntheticviewsinthe preprocessingstep. 6.2.1 SyntheticViewsandFeaturePointCloud Tobeabletoestimatecameraposefrommorepossiblelocations,wecreatemore“syn- thetic" views to cover a large range of viewpoints. Z-buffers are used to handle occlu- sions. Our system is mainly designed for hand-hold cameras, so we can restrict the placement of virtual cameras to the height of eye-level to simplify the problem. Gener- ally,theapproachisnotlimitedtosuchassumptions. We place the virtual camera around each scene and uniformly in the viewing direc- tions and at various distances. The density of camera positions depends on the charac- teristics of the feature we use for matching. In our system we use SURF features for pose initialization. SURF features are scale and rotation invariant, so we can put the virtual cameras in a relatively sparse way. It should also be noted that the viewpoints cannot be too close to the range scans, otherwise the quality of projected image is not goodenoughtogeneraterepeatablefeatures. Examplesofsyntheticviewsareshownin Figure6.1. We extract their SURF features for both real images and synthetic images. The features are back projected to get the SURF points cloud. The points cloud is used 75 Figure 6.1: Real images are mapped onto the range data to register color information. Syntheticviewsaregeneratedtocovermoreviewpoints. for image matching and pose initialization. A vocabulary tree is created for each point cloud. The2D-2D-3Dmatchingschemeisusedforposeestimation. 6.2.2 GeometricModelingwithGraphicalModels There are many planar structures in urban areas such as building facades and grounds. Whilethepointsontheseplanesmaycontainintensityorcolorinformation,theyprovide little geometric information other than planarity. Therefore, it has no loss of geometric informationifwemodelthesepointswithsetsof3Dplanes. Themainadvantageofmodelingthesepointswith3Dplanesisthatwecanfastand easilydetectthosefeaturesthatareontheplanes. Thesefeatureshave3Dpositionsand canbeusedtocalculatethecamerapose. Whenthesefeaturesaredetectedandtracked, the2Dto3Dcorrespondencesaremaintainedsothatcameraposescanbecalculatedin afastandstraightforwardway. Wewillpresentasimpleandfastmodelingmethodthat issuitableforourtrackingsystem. We first apply two-class graph-cut to segment the points on planar structures and non-planar structures. The main criteria we use is normal variance. 
For each point, we use its closest 20 points to estimate its normal direction with a simplified version of the method in [71]. To construct the graph (V, E), we first define a neighborhood distance ε, meaning that all points within distance ε of a point are considered its neighbors. Each point is a node in V, so V is the set of all points in the point cloud. For two nodes whose distance is less than ε, we construct an edge in E. We then optimize the following energy function,

E(c) = \sum_{p \in V} D(p, c_p) + \sum_{(p_1, p_2) \in E} S(c_{p_1}, c_{p_2}),   (6.1)

where c_p is the label of point p with c_p ∈ {planar, non-planar}, and D and S are the data cost function and the smoothness function.

The most effective criterion for classifying a point is its normal direction. Several other criteria have been proposed, such as heights [82], but they are only applicable when the LiDAR points on the roof are accurately obtained, which is often not the case in our data. The data cost in equation (6.1) is composed of two terms, both of which are based on normal directions,

D(p, c_p) = \mathrm{Var}(N_p) + \beta \min(N_p^z, 1 - N_p^z),   (6.2)

where N_p is the normal direction of point p and Var(N_p) is the variance of the normal directions in its neighborhood. N_p^z is the z component of N_p, i.e., the component along the direction perpendicular to the ground. Assuming that the ground is flat and the building facades are vertical, min(N_p^z, 1 - N_p^z) gives the divergence from either the horizontal or the vertical direction, whichever is smaller. The equation can easily be modified when the ground is not flat. To calculate the normal variance, we use

\mathrm{Var}(N_p) = \frac{1}{|V_\epsilon(p)|} \sum_{q \in V_\epsilon(p)} (1 - N_p \cdot N_q),   (6.3)

where V_ε(p) = {q | (p, q) ∈ E}, i.e., all the neighbor points q within distance ε of p.

For the smoothness term, we penalize label transitions between neighboring points whose normal directions do not differ much, so that label changes are only encouraged where the surface orientation changes. For (p_1, p_2) ∈ E, we simply set

S(c_{p_1}, c_{p_2}) = \alpha \cdot \mathbf{1}(c_{p_1} \neq c_{p_2} \ \text{and}\ N_{p_1} \cdot N_{p_2} > 0.5).   (6.4)

The segmentation results are not sensitive to the parameters α and β within a certain range; in our experiments, we set both of their values to 1. The left column of Figure 6.2 shows the segmentation results for planes and non-planes. Grounds and facades are further segmented according to their normal directions.

After segmentation, we remove those points that are not on planar structures. By thresholding the normal directions, we can easily differentiate the grounds from the walls. We use a region growing method [19] to further segment each facade. The right column of Figure 6.2 shows the segmented planes, with different planes represented by different colors.

6.3 Tracking with 3D Planes

6.3.1 Feature Detection on Planes

The reason we use planes as the basic modeling structures is twofold. First, it is easier and faster to identify the features to be tracked by projecting the 3D planes onto the frames. Second, some pose calculation methods are not good at computing poses from coplanar points. With 3D planes, we can easily tell which points are on the same plane and which are not, so suitable pose calculation methods can be applied appropriately.

Figure 6.2: (a) Point cloud is segmented into ground (dark grey), building walls (light grey) and others (green). (b) The points on planar structures are modeled with a set of 3D planes.

6.3.2 Feature Trackings

When we calculate the transformation between two consecutive frames, we first quickly identify the features that lie on different planes. For each group of features on the same plane p_i, we first calculate the homography transformation H_{p_i} and then obtain the rotation R_{p_i} and translation T_{p_i} by homography decomposition.

Figure 6.3: Projection of planes onto the frames. Features on the planes can be easily identified.
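As an illustration of the per-plane step just described, the sketch below estimates a relative pose from one plane's feature tracks via homography fitting and decomposition. It uses generic OpenCV calls rather than the code from this work; the RANSAC threshold and the candidate-selection strategy are assumptions, and the camera matrix K is assumed to come from calibration.

```python
import numpy as np
import cv2

def pose_from_plane_features(pts_prev, pts_curr, K):
    """Relative pose candidates from coplanar feature tracks between two frames.

    pts_prev, pts_curr: (N, 2) float32 pixel coordinates of one plane's features.
    K: (3, 3) camera intrinsic matrix.
    Returns a list of (R, t, n) decompositions; cv2.decomposeHomographyMat gives up to
    four solutions, and t is only known up to the plane-distance scale, so the caller
    must pick the candidate consistent with the other planes (e.g., positive depth).
    """
    H, inlier_mask = cv2.findHomography(pts_prev, pts_curr, cv2.RANSAC, 3.0)
    if H is None:
        return []
    num, rotations, translations, normals = cv2.decomposeHomographyMat(H, K)
    return [(rotations[i], translations[i], normals[i]) for i in range(num)]
```

In a full system the surviving candidate is then checked against the poses from the other planes, as described next, before being averaged into the final estimate.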
For non-planar features, we calculate the essential matrix and then estimate the rotation R_non and translation T_non. Both decomposition methods are described in [62]. To make the pose more stable, we only select point correspondences that generate conforming poses. In other words, we discard a calculated pose [R_{p_i} | T_{p_i}] or [R_non | T_non] if it diverges from the other poses. After removing the non-conforming poses, we combine the remaining poses in the following way,

R = \mathrm{Rod}^{-1}\Big( \sum_{p_i \in P} a_i\, \mathrm{Rod}(R_{p_i}) + a_{non}\, \mathrm{Rod}(R_{non}) \Big),
T = \sum_{p_i \in P} a_i\, T_{p_i} + a_{non}\, T_{non},   (6.5)

where a_i and a_non are the proportions of features on plane p_i and on non-planar structures, respectively, and Rod denotes the Rodrigues representation of a rotation.

6.4 Pose Recovery and Smoothing

6.4.1 Pose Recovery Scheme

There are two situations in which the pose recovery process is required: at the initialization step, and when the tracking is lost. The two cases are handled in slightly different ways and we discuss them separately.

Pose Initialization

We use a retrieval-based method to estimate the initial camera pose. The dataset images are real images as well as synthetic images. SURF features are used to generate visual words. We retrieve an image of the same scene from the dataset that is well registered with the LiDAR data. An iterative pose refinement process is then conducted to obtain a more accurate starting camera position.

We project the LiDAR data according to the calculated pose. Let I'(x, y) be the projected image and I(x, y) be the initial frame. We refine both the extrinsic parameters and the focal length through iterative optimization of the following objective function,

\{R, T, f\} = \arg\min_{\{R, T, f\}} \frac{1}{|V|} \sum_{(x, y) \in V} \big(I'(x, y) - I(x, y)\big)^2,   (6.6)

where V is the set of all projected pixels. We stop the iteration when the error falls below a certain threshold.

Pose Recovery from Lost

There are many situations in which the tracking can be lost. The two most common cases in outdoor environments are 1) occlusions by moving pedestrians or vehicles, and 2) the camera moving so fast that it either blurs the images or makes consecutive frames differ too much. In either case, a robust system should be able to recover the pose immediately once the camera is unoccluded or slowed down again.

Figure 6.4: The camera pose recovery after lost. The frame (in the purple color) is occluded so that the poses of the following frames cannot be calculated from previous frames.

We recover the camera pose by making use of the online-created feature point cloud. When the pose of the initial frame is available, we detect features on the frame and back project these features onto the LiDAR. Since KLT features are mainly used for frame-to-frame tracking, we use KLT keypoints with SURF descriptors in the matching process. Other detector and descriptor combinations could be used; since KLT features need to be detected on every frame anyway, it is most convenient to employ these features in our system, and good performance can be achieved. When the 2D-3D feature correspondences are re-established, the pose is successfully recovered. The recovery process is shown in Figure 6.4.

The above strategy can handle most lost cases. However, if the camera has moved away from the previous scene, we cannot find any matches with the created feature point cloud. There are two ways to solve this problem. The first is to re-initialize the pose, i.e., to match SURF features and estimate the pose again; the matching can be done in a local neighborhood area. The second is to periodically create a new feature point cloud, which is effective especially when the camera moves smoothly from one scene to another. Before creating the point cloud, iterative pose refinement is conducted to avoid error accumulation. When a new frame comes after the pose is lost, we first match its features with the most recently created point cloud. If this fails, a re-initialization process will be necessary.
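To make the recovery step concrete, here is a minimal sketch (under assumed data structures, not the code used in this work) of re-establishing 2D-3D correspondences by matching the descriptors of the current frame's keypoints against the stored feature point cloud and solving a RANSAC PnP. Any detector/descriptor pair with the same interface could replace SURF; the ratio-test and reprojection thresholds are illustrative.

```python
import numpy as np
import cv2

def recover_pose(frame_kp_xy, frame_desc, cloud_xyz, cloud_desc, K, dist_coeffs=None):
    """Recover [R|t] by matching frame descriptors to the 3D feature point cloud.

    frame_kp_xy: (N, 2) keypoint pixel positions detected on the current frame.
    frame_desc:  (N, D) float32 descriptors for those keypoints.
    cloud_xyz:   (M, 3) 3D positions of the feature point cloud (back-projected onto LiDAR).
    cloud_desc:  (M, D) float32 descriptors stored with the cloud.
    K:           (3, 3) camera intrinsic matrix.
    Returns (R, t) or None if the frame cannot be matched.
    """
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn = matcher.knnMatch(frame_desc, cloud_desc, k=2)

    # Lowe-style ratio test keeps only distinctive 2D-3D matches.
    pts2d, pts3d = [], []
    for pair in knn:
        if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
            pts2d.append(frame_kp_xy[pair[0].queryIdx])
            pts3d.append(cloud_xyz[pair[0].trainIdx])
    if len(pts2d) < 6:
        return None

    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.float32(pts3d), np.float32(pts2d), K, dist_coeffs,
        reprojectionError=4.0)
    if not ok or inliers is None or len(inliers) < 6:
        return None
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec
```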
6.4.2 Pose Smoothing Process

As the tracking continues over time, errors accumulate in the pose calculation between consecutive frames. These errors are usually caused by image noise, geometric distortions, intensity variations, etc. Though the errors are very small between two frames, they become quite obvious after a long sequence of frames. We propose a smoothing scheme to handle this error drifting problem.

To avoid error accumulation, we make use of a synthetic image generated from the previous successfully tracked pose. We match the frame with the synthesized image to recalculate the pose. However, we cannot directly update the pose, since doing so would cause a pose jump; we need a scheme that updates the pose in a smooth way. The proposed scheme is shown in Figure 6.5.

To make the pose recovery process smoother, we do not update the current pose immediately. Instead, we update the pose over the next M frames. Let K(t_{i-1}, t_i) be the set of KLT feature correspondences between frame i-1 and frame i, for i = 0, 1, 2, ..., M. The pose of frame t_0, i.e., P^K_0, can be calculated from K(t_{-1}, t_0). Let S(t_s, t_0) be the set of correspondences between frame t_0 and the synthetic image, and let the corresponding calculated pose of frame t_0 be P^S_0. Let T_i(K) be the camera pose transformation (rotation and translation) calculated from K(t_{i-1}, t_i). The estimated camera pose for frame t_0 is then P^K_0 (with drifting) or P^S_0 (without drifting). We therefore smoothly interpolate P_0 as a weighted average of P^K_0 and P^S_0 over the next M frames, so that after M frames the calculated pose becomes P^S_0 \prod_{i=1}^{M} T_i(K).

Figure 6.5: The pose update process. The estimated pose is smoothly updated in M frames. For each frame t_i, an updated pose of frame t_0 is calculated, and the pose of frame t_i is obtained by multiplying a sequence of T_k with k from 1 to i.

Therefore, we calculate T_i and P^i_0 for every new frame t_i. The interpolated pose is set to P^i_0 = (1 - i/M)\,P^K_0 + (i/M)\,P^S_0, so that it moves from the drifted estimate P^K_0 toward the drift-free estimate P^S_0 as i approaches M. The weighted average of transformations is computed as in equation (6.5) in the previous section. For each frame t_i, the pose is then calculated as

P_i = P^i_0 \prod_{k=1}^{i} T_k(K).   (6.7)

In this way, the pose is smoothly refined, and at frame t_M we obtain a well-refined pose.

6.5 Performance and Discussions

For data acquisition, we scanned several scenes in the urban area and registered the range scans with the captured 2D images to obtain the colored range data. We have tested our tracking system on all the locations of interest. A laptop with an Intel Core i7 720 processor and 4GB of DDR3 memory is used. The video sequences are taken by a roughly calibrated camera at a resolution of 640 by 480 and 30 frames per second.

Figure 6.6: Modeling of range data with simple geometries. (a), (c) and (e) are the segmentation results - ground (dark grey), building walls (light grey) and others (green). (b), (d) and (f) are the modeling with a set of 3D planes.

Figure 6.6 shows the results of the geometric modeling of the scanned range data in the preprocessing step. The segmentation process is used to remove most points on trees, cars and other non-planar structures, while the majority of points on grounds and facades are retained. It is noted that there are many holes in the range data. These holes are mainly caused by the non-reflective characteristics of the window glass on the building facades. Since the windows are on the same plane as the walls, it is reasonable to fill these holes in the 3D planes. After modeling, there will be many small-sized planes.
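As a concrete illustration of this modeling step, the sketch below fits a plane to each segmented point group by least squares (SVD) and drops groups whose support is too small. It is a sketch under assumptions: the minimum-support threshold and the data layout are illustrative choices, not values taken from this work.

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane through a point group: returns (unit normal, centroid)."""
    points = np.asarray(points, dtype=float)
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    return vt[-1], centroid              # right-singular vector of the smallest singular value

def build_planes(points, labels, min_support=200):
    """Model each segmented group as a 3D plane, discarding small-sized groups.

    points: (N, 3) planar-structure points kept after the graph-cut segmentation.
    labels: (N,) integer plane id per point, e.g. from region growing.
    """
    points = np.asarray(points, dtype=float)
    planes = []
    for plane_id in np.unique(labels):
        group = points[labels == plane_id]
        if len(group) < min_support:     # small planes give unreliable features for tracking
            continue
        normal, centroid = fit_plane(group)
        planes.append({"id": int(plane_id), "normal": normal,
                       "point": centroid, "support": len(group)})
    return planes
```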
Figure6.6shows theresultsafterthesesmallplanesareremoved. We have tested our system on many taken video sequences and compared with the methodproposedin[13]. Figure6.7showsonetheofvideosequences. Asshowninthe figure, our method can stably estimate the camera pose and outperform the method in [13]especiallywhenthescenecontainsmainlyplanarstructures. InFigure6.8,wealso quantitatively measured the drifted errors in both translations and rotating directions. For each frame in the video sequence, we calculate and optimize the camera pose at offline as the ground truth. As the figure shows, with periodic pose adjustment, we can prevent the camera from drifting away. In addition, by making use of planar structures, wecanfurtherstabilizetheposewithinsmallererrorranges. (a) (b) (c) Figure 6.7: (a) The camera pose at frame 1 (b) camera pose at frame 120 with con- strained optical flow method in [13] (c) camera pose at frame 120 with our proposed method. InFigure6.9,wealsoshowcomparisonresultswiththeSLAMframework[75]and the method proposed in [13]. Whenever the pose is lost in either method, the pose will be re-initialized. We have measured the errors that is defined in equation 5.7. As the 86 (a) (b) Figure6.8: (a)TheerrorsformagnitudeoftranslationvectorT(incentimeters)(b)The errorsforzdirectionafterrotationtransformR(indegrees). results show, with our tracking system, not only we can maintain a robust tracking all thetime,wecanalsolimittheerrorsinmuchsmallerranges. (a) (b) (c) Figure 6.9: The projection errors (eqn. 5.7) for different methods. (a) The SLAM frameworkin[75](b)Themethodproposedin[13](c)Ourtrackingsystem 6.6 Summary Wehavepresentedatrackingsysteminalargescalearea. The3DterrestrialLiDARdata andsetofimagesareusedinthesystem. Apreprocessstepisneededtoregisterallthese imagestotherangescanssothateachpointisassignedacolorvalue. Wegeneratemore syntheticimagestocovermoreareasthatisusedtoestimatetheinitialcamerapose. For 87 the tracking part, once the camera pose is initialized, we use KLT features to calculate theposeforconsecutiveframes. Syntheticviewsaregeneratedperiodicallytoadjustthe pose with respect to the LIDAR data. At the mean time, geometric planes are used to identify coplanar points and different pose recovery methods are used for different sets ofpoints. Wehavetestedthepresentedframeworkondifferentpointclouds. Theresults show that the estimated camera pose is stable and the drifting errors are reduced. The systemcanbeappliedinmanyapplicationssuchasoutdooraugmentedreality. 88 Chapter7 BuildingAnOutdoorAugmented RealitySystem In this chapter, we present an application example that is based on the previously pro- posed tracking system. The data for ourtracking system is a set of 3D range scans togetherwithregistered2Dimages. Thesystemcontainsofflinestageandonlinestage. In the offline stage, synthetic images are generated from virtual viewpoints that are not covered by original images. The LiDAR data is also simplified and modeled by a set of 3D planes. In the online stage, the system tracks the back-projected features on the LiDAR. The benefit of using 3D planes is that features on such planes are more stable and convenient for calculations. When the pose is lost due to occlusions or fast move- ments,theproposedsystemcanrecovertheposeimmediately. Throughperiodicupdate, wecanavoiderroraccumulations. Anotherchallengeofoutdoortrackingproblemisthe lightingvariations. WeapplyLineContextfeaturethatisrobusttoilluminationchanges and scalable to large datasets. 
Based on the calculated camera pose, augmented infor- mation such as building names and locations can be delivered to the user in real time. We will talk about the details of the application system in more details in following sections. 89 7.1 OverviewofOutdoorAugmentedRealitySystem Ourproposedsystemcontainsofflinepreprocessingpartandonlinetrackingpart. Inthe offline part, we first register LiDAR data and 2D images by using methods proposed in[60]. Thecolorsfromimagesaremappedontorangedata. Thensyntheticimagesare generated to cover more viewpoints that will be used to estimate the initial pose. Both realimagesandsyntheticimagesaregroupedbytheirlocationssothatimagesearchcan be conducted in the neighborhood area. For LiDAR data, we extract points that lie on planar structures such as grounds and building facades. These points are modeled with 3Dplanes. Theplanestructureswillbeusedtotrackcameraposeintheonlinepart. The online tracking process is shown in Figure 7.1. As the figure shows, there are four stages for the tracking part, and each stage is composed of several steps. In the initialization stage, the system first uses SURF [5] features to locate the frame. If the matchingfails,whichiscommoninoutdoorenvironmentduetolightingvariations,we will use our proposed feature named Line Context to match the frame. Line Context feature is more robust to lighting changes. The details of the feature will be covered in later sections. Since the proposed feature has more computational cost, we use SURF to do the matching first which is faster and working well in many situations. After the frame is successfully matched, we can estimate the pose from 2D to 3D matching with featurepointscloud,whichisbuiltintheofflinepreprocessingpart. Theestimatedpose is not accurate due to registration errors and unpredicted noises. These small errors couldbeproblemsfortrackingprocess. Therefore,afurtherposerefinementisneeded. Once a well-refined pose is obtained, the second stage will construct an online feature points cloud by back projecting the KLT [90] keypoints on frames. SURF descriptors are used for pose recovery purpose. The KLT feature is used since it is a lightweight feature and suitable for tracking purpose. The third stage is continuous tracking, we 90 Figure7.1: Theonlinepartofourtrackingsystem. track KLT features frame by frame so that correspondences with feature points cloud canbe maintained. When thereare veryfewcorrespondences as the cameramoves, we create a new feature point cloud. Due to occlusions or fast motions, the pose can often be lost, then the system will enters the fourth stage. It recovers the pose by matching KLT keypoints and SURF descriptors with newly created feature points cloud. If the new frame has no match with feature cloud, the system has to go back to the first stage andstartoveragain. Inthiscase,theposecanbereinitializedbymatchingframesinthe localneighborhood. 91 7.2 TheSimilaritiesandDifferenceswithSLAM Manystate-of-the-artARtrackingsystemsuseSimultaneousLocalizationandMapping (SLAM) as their frameworks. While SLAM is a widely studied area, monocular visual SLAM is still an open research problem. For survey of visual-based SLAM, one can referto[76]. Our application goal is different from SLAM in the aspect that we do not simulta- neouslyconstructthemapwhiletracking. Instead, wehavethescenedatabeforehand, andwewanttomaintainrobusttrackingallthetimeinthisenvironment. Ourproblemis becomingmorepracticalasmoreandmoreLiDARdataisavailableduetofastadvance oflasertechnologies. 
Itisquitefeasibletoobtainthe3Dscenedataaccuratelyevenfor alargescalearea. Formostcameratrackingbasedreal-worldapplications,thesolution to our problem is more promising. Therefore, our application is a good alternative to SLAMsystemformanytrackingapplicationsespeciallyinaugmentedrealityarea. 7.3 DataCollection 7.3.1 CollectionofGeotaggedImages Our application is mainly used for outdoor urban areas. For a certain district, we take images that cover the locations of interests such as facades of buildings. At each loca- tion,theimagesaretakenfromdifferentviewsatdifferentdistances. SURFfeatures[5] are used due to its fast speed and robustness. Moreover, the UTM coordinates and building-related information for each location are also recorded. Such information can be obtained from GPS device or manually input. In our system, 120 locations are recorded. For each location, we take 10 images from different viewing angles and dis- tances. Therefore,therearetotally1,200imagesinourdatabase. 92 7.3.2 CollectionofRangeData We scan overlapped range data along streets in the urban area. The range scans are mergedintolargerscanswithICPmethodin[107]. Themethod[60]isusedtoregister imageswithrangescansandcolorscanbemappedontothescandata. Sinceeachtaken image is labeled with geographic information such as UTM coordinates, these images canbegroupedbytheirlocationstoreducesearchingrange. Weusecoloredrangedata togeneratemoresyntheticviewstocovermoreviewpoints. Somerangescansareshown inFigure7.2. (a) (b) Figure7.2: (a)rangescandata(b)therangedataaftercolormapping 7.4 DataProcessing 7.4.1 OverlappedClusteringofImages To reduce the searching range, we cluster the images according to their locations. An image is associated with the cluster whose center is nearest to its location. When an 93 image is associated with a cluster, the subsequent image searches can be conducted within the cluster. To build overlapped K clusters, we first use traditional k-means clustering method to get K non-overlapped clusters. For those locations that are at the boundaries,weassignthemtoallthoseclustersthatarehassimilardistancetotheclos- estone. Thestepsareshownasfollows, 1. Suppose there are N different locations, and the UTM coordinates (easting and northing) for location i are L i = (X i ,Y i ), i = 1..N. Let the maximum error caused by a GPSdevicebe E andthelongestdistancefromwhich auserisallowedtotakepictures foralocationbeD. LetH bethethresholdforhandoff(wewilldiscussinlatersection). WewanttoclusterthelocationsintoK groups. 2. (Traditional K-means) Randomly select K locations. We consider these K loca- tionsasthecentersof K clusters. 3. Foreachlocation,assignittotheclusterwhosecenterisnearesttoit. 4. For each cluster, recalculate the cluster center. Repeat step 3 and step 4 until it converges. Thengotostep5. 5. (Overlapped assignment) For each location i, let d i be the distance to the nearest cluster center. Let centers for the K clusters be C j = (X j ,Y j ), j = 1..K. We assign the location itocluster j ifandonlyif∥L i −C j ∥ 2 ≤ d i +max(E, D, H). Thestep2tostep4aretraditionalk-meansclusteringmethod. Weaddonemorestep for overlapped assignment to handle three problems. Firstly, the instability and inaccu- racy of GPS. Secondly, the position of the user is usually different from the location of interest. Thirdly, the user may go back and forth within certain areas, which can cause unnecessary swaps of databases in the memory. Figure 7.3 shows an example of K=5 clusters. 
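The overlapped assignment in step 5 can be sketched on top of any standard k-means result as follows. The implementation below is illustrative (plain NumPy, brute-force distances) rather than the one used in this work, and the E, D, H values are placeholders.

```python
import numpy as np

def overlapped_clusters(locations, centers, E=10.0, D=30.0, H=20.0):
    """Assign each UTM location to every cluster within d_i + max(E, D, H) of it.

    locations: (N, 2) UTM easting/northing of the image locations.
    centers:   (K, 2) cluster centers from a standard k-means run (steps 2-4).
    E: maximum GPS error, D: longest picture-taking distance, H: handoff threshold,
    all in meters (placeholder values).
    Returns a list of K index arrays, one per cluster; clusters may overlap.
    """
    locations = np.asarray(locations, dtype=float)
    centers = np.asarray(centers, dtype=float)
    dists = np.linalg.norm(locations[:, None, :] - centers[None, :, :], axis=2)  # (N, K)
    d_nearest = dists.min(axis=1)                        # d_i: distance to the nearest center
    slack = max(E, D, H)
    member_mask = dists <= (d_nearest[:, None] + slack)  # overlapped assignment rule of step 5
    return [np.where(member_mask[:, j])[0] for j in range(len(centers))]
```

Because the slack term is the same for every location, a location near a cluster boundary simply ends up in both neighboring databases, which is what allows the handoff scheme described later to avoid unnecessary database swaps.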
94 Figure7.3: Theoverlappedclusteringofcollectedimageswith K=5. 7.4.2 RangeDataModeling There are many planar structures in urban areas such as building facades and grounds. We will model LiDAR data with a set of 3D planes. The main advantage of modeling these points with 3D planes is that we can fast and easily detect those features that are ontheplanes. Thesefeatureshave3Dpositionsandcanbeusedtocalculatethecamera pose. When these features are detected and tracked, the 2D to 3D correspondences are maintainedsothatcameraposescanbecalculatedinafastandstraightforwardway. We apply the method proposed in 6.2 to generate the set of 3D planes. The most effectivecriterionforclassifyingapointisitsnormaldirection. Thoughthereareseveral otherproposedcriteriasuchasheights[82],itisonlyapplicablewhenLIDARpointson theroofareaccuratelyobtained,whichisoftennotthecaseinourdata. 95 7.5 OutdoorTrackingwithRobustPoseRecovery 7.5.1 SituationsofPoseLostandRecovery Therearetwosituationswhentheposerecoveryprocessisrequired,attheinitialization step and when the tracking is lost. The two processes are handled in slightly different waysandwewilltalkaboutthemseparately. Initialposeestimation We use retrieval-based method to estimate the initial position of camera pose. The dataset images are real images as well as synthetic images. SURF and Line Context features are used to generate visual words. We retrieve image of the same scene from thedatasetthatiswellregisteredwithLiDARdata. Aniterativeposerefinementprocess isconductedtoobtainmoreaccuratestartingcamerapositions. Recoveryafterposelost The most common cases under which the tracking will be lost are 1) occlusions by movingpassengersorvehicles2)cameramovingtoofastthateithermakesimageblurs orconsecutiveframes differtoomuch. In eithercases, thesystem is able to recoverthe poseimmediatelyoncethecameraisunoccludedorsloweddownagain. Torecoverthecamerapose,themethodproposedin6.4.2isused. Wedetectfeatures ontheinitialframeandbackprojectthesefeaturesontotheLiDARdata. KLTkeypoints with SURF descriptors are used in the matching process. When the 2D-3D feature correspondences are re-established, the pose is successfully recovered. If the camera is moving away from the previous scene, we cannot find any matches with the created feature points cloud. There are two ways to solve this problem. The first way is to 96 re-initialize the pose, i.e. to match SURF and Line Context features and estimate the pose again. The matching can be done in a local neighrhood area. The second way is by periodically create new feature point cloud. This way is used when the camera moves smoothly from one scene to another. Before creating the point cloud, iterative poserefinementisconductedtoavoiderroraccumulations. Whenthenewframecomes after pose lost, we can first match features with the recent created points cloud. If this fails,areinitializationprocesswillbenecessary. 7.5.2 ContinuousTrackingforLongDistance Oncewehave2Dto3Dfeaturecorrespondences,wefirstquicklyidentifythosefeatures thatareondifferentplanes. Foreachgroupoffeaturesonthesameplanep i ,wecalculate thecameraposewithmethod[74]. TherotationR p i andtranslationT p i arethenobtained. We also use all the features from all planes to calculate the pose R all and T all . To make theposemorestable,weonlyselectpointcorrespondencesthatcangenerateconforming poses. Therefore, we discard any calculated pose [R p i |T p i ] or [R all |T all ] that diverges fromotherones. 
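One simple way to implement this consistency test (a sketch under assumptions, not the exact rule prescribed in this work) is to measure each candidate rotation's geodesic distance and each candidate translation's Euclidean distance to a consensus pose, and to drop candidates beyond fixed thresholds:

```python
import numpy as np

def rotation_angle(R1, R2):
    """Geodesic distance (radians) between two rotation matrices."""
    cos_theta = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

def filter_divergent_poses(poses, max_rot_deg=5.0, max_trans=0.5):
    """Keep only the poses that agree with the consensus pose.

    poses: list of (R, t) candidates, one per plane plus the one from all features.
    The thresholds are illustrative and would be tuned to the scene scale.
    """
    ts = np.array([t.reshape(3) for _, t in poses])
    t_med = np.median(ts, axis=0)
    # The consensus rotation is approximated by the candidate closest to all the others.
    total = [sum(rotation_angle(Ri, Rj) for Rj, _ in poses) for Ri, _ in poses]
    R_med = poses[int(np.argmin(total))][0]

    kept = []
    for R, t in poses:
        if (np.degrees(rotation_angle(R, R_med)) <= max_rot_deg
                and np.linalg.norm(t.reshape(3) - t_med) <= max_trans):
            kept.append((R, t))
    return kept
```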
After removing the non-conforming poses, we combine the remaining poses in the following way,

R = \mathrm{Rod}^{-1}\Big( \frac{1}{2} \sum_{p_i \in P} a_i\, \mathrm{Rod}(R_{p_i}) + \frac{1}{2}\, \mathrm{Rod}(R_{all}) \Big),
T = \frac{1}{2} \sum_{p_i \in P} a_i\, T_{p_i} + \frac{1}{2}\, T_{all},   (7.1)

where a_i is the proportion of features on plane p_i, and Rod denotes the Rodrigues representation of a rotation.

One characteristic of the method in [74] is that its computational cost is linear in the number of correspondences n. Therefore, the computational cost of the above process is only about twice that of simply using all the point correspondences. We found that by removing the non-conforming poses, the generated pose exhibits less jitter and is more stable during the tracking process.

7.6 System Performance and Augmented Reality

Figure 7.4-(a) shows the segmentation results for the range scan in Figure 7.2. The segmentation process removes most points on trees, grass and other non-planar structures, while the majority of points on building facades are retained. It is noted that there are some holes in the range data. These holes are mainly caused by occlusions and by surfaces with non-reflective characteristics. After modeling, there are some planes which are too small for the tracking purpose. Figure 7.4-(b) shows the results with these small planes removed.

Figure 7.4: (a) The segmentation of LiDAR into facades, trees and grounds. (b) The LiDAR data is modeled with a set of 3D planes.

In the pose initialization process, we use both SURF and Line Context features to match the initial frame. While SURF is fast and robust, Line Context can handle more challenging cases such as larger lighting changes. We have tested the robustness of matching an image within a large database; a database with over 1500 images is constructed. Figure 7.5 shows comparison results with SIFT, SURF and color histogram. It indicates that on retrieving these images, the performance is significantly improved with Line Context features.

Figure 7.5: Precision-recall for SURF and Line Context in the pose recovery scheme. SIFT and color histogram are also compared.

Table 7.1 shows the success rate of using SURF with and without Line Context to match the initial frame. We have tested the pose recovery scheme on different datasets under various outdoor environments. It is shown that with only the SURF feature in the recovery scheme, the frame is successfully matched about 70% of the time. If both types of features are used, the success rate increases to 91%.

Table 7.1: Success rate of using SURF and Line Context in the pose recovery scheme.

                        input frames   correctly matched   success rate (%)   failure rate (%)
  SURF                  505            351                 69.5               30.5
  Line Context          154            113                 73.4               26.6
  SURF + Line Context   505            464                 91.2               8.8

Figure 7.6: Frames in a video sequence with an augmented virtual axis.

We have tested our system on many captured video sequences. Figure 7.6 shows several frames of one video sequence. As shown in the figure, our method can continuously track the camera pose from one end of the range scan to the other end. Figure 7.7 shows the projected LiDAR data with the pose estimated from the corresponding frame. It is shown that the pose is very accurate and the projected LiDAR data is well aligned with the frame.

Figure 7.7: The LiDAR data is projected with the pose estimated from frames.

7.7 Miscellaneous I: GPS-aided Tracking with Handoff Process

When GPS is available in the tracking process, the GPS information can be used to locate the pose at the initial step. The rough location of the camera is obtained through the GPS device so that the closest cluster is determined.
Due to the inaccuracy of GPS devices, the locations of two or more cluster centers that are nearly close will be assigned at the same time. An overlapped clustering method that is based on k-means is used. The more accurate location information and the camera pose can be recovered from 2D to 3Dmatchings. WithUTMlocationprovidedbyGPS,theclusterwhosecenterisclosesttothecam- era can be found. The corresponding database will then be loaded into the memory. However,whenthecameraisatsomelocationsthatarenearlyclosetotwoclustercen- ters and the GPS location jumps around, there will be unnecessary swaps of memories. Toovercomethisproblem,wesetahandoffthresholdH toavoidsuchoverheads. Figure7.8: Thesofthandoffprocess. H isthethreshold. 101 As shown in Fig. 7.8,C 1 andC 2 are two clusters. The camera is initially at point a, which is closer to clusterC 1 . When the camera crosses the boundary to point b which is closer to cluster C 2 , it is still associated with cluster C 1 . The association will not change if the camera goes back to point a. However, if the camera goes further with extra distance H to point c, the association will be changed to clusterC 2 . When it goes back to b or a, the camera is still associated to clusterC 2 . In our system, since the GPS has an error E and the camera may take pictures with distance D from the location of interest,wewillusemax(E, D, H)insteadof H. 7.8 Miscellaneous II: Tracking-based Virtual World Navigation Navigation through 3D virtual environment is required in many interactive graphics and virtual reality applications. While much research works on how to navigate the virtual environment efficiently, there is little work on the combination of real world navigationwithvirtualworldnavigation. Suchcombinationhasinterestingapplications. For example, we can use virtual world to exactly simulate how a user navigates the real world. Besides, it can also be used in the battlefield, where the commander can monitor the soldiers whose locations and orientations are reflected in the virtual world andmakeordersaccordingly. Thelocationandorientationinformationcanbeobtained bysensorssuchasGPS,compass,accelerometerandgyroscope. However,suchsensors arenotalwaysavailableandsometimestheyarenotreliable. Forexample,GPSrequires line-of-sight and the accuracy highly depends on the weather. Compasses are easily affected by the surrounding environments. Accelerometers and gyroscopes suffer from integration drift errors. Therefore, other types of sensors are necessary for providing 102 more robust and accurate information. Optical sensors are among the most promising choices. In this section, we present a tracking-based navigation system for large-scale 3D virtual environments. When a user navigates the real world with a handheld camera, the captured image is used to estimate its location and orientation. These location and orientationinformationarealsoreflectedinthevirtualenvironment. Experimentsshow thatourproposednavigationsystemisefficientandwellsynchronizednavigationinthe realworld. VirtualWorldReconstruction WeuseLIDARdataandaerialimagesastheinputtoreconstruct3Dvirtualenvironment which mainly includes buildings and the ground. The method [109] is used to convert LIDAR to triangular meshes and [100] is used to automatically map the textures. The reconstructedvirtualenvironmentisshowninFig. 7.9. Figure7.9: Thereconstructedvirtualenvironmentwithmethod[109]and[100]. 
Virtual World Navigation

We fix the field of view for the virtual camera and set its position and direction accordingly. When we reconstruct the virtual environment, a 3D point cloud is generated for each location of interest from images taken at different positions and views. The 3D point cloud and the 3D models are manually registered. In this way, the camera pose can be calculated from the 2D to 3D matchings.

Let the camera position for the returned patch be P'_w and the reference direction of the 3D point cloud be D'_w in world coordinates. Let the camera pose of the returned patch in the point cloud coordinates be C'_{3D} = [R'_{3D} | T'_{3D}], and the pose of the current camera be C_{3D} = [R_{3D} | T_{3D}]. The virtual camera position P_w and orientation D_w in world coordinates can then be calculated as

P_w = P'_w + (T_{3D} - T'_{3D}),   (7.2)
D_w = R_{3D}\, D'_w.   (7.3)

With the P_w and D_w values, we can set the position and orientation of the virtual camera in the virtual environment.

The navigation delay is caused by several processes in the system: feature generation, patch retrieval and pose calculation. In our system, the FAST corner detector and the SURF descriptor are used. Table 7.2 shows the computing time for each process.

Table 7.2: The delay between virtual environment navigation and real world navigation. The two types of navigation are considered well synchronized and happening simultaneously.

  Camera Pose Recovery   Time (ms)
  FAST detector          5-25
  SURF descriptor        15-35
  Patch retrieval        10-30
  Pose calculation       3-5
  Total delay            50-90

We conducted many experiments and tested all the locations of interest for our simultaneous navigation system. A laptop with an Intel Core i7 720 processor and 4GB of DDR3 memory is used for the experiments. Some of the results are shown in Fig. 7.10. As we can see from the figure, the estimated camera positions and orientations are not exactly the same as those of the real camera. There are mainly two causes for such errors. Firstly, some errors are caused by the automatic reconstruction of the 3D models: the buildings are not modeled with accurate heights and distances from neighboring buildings. Secondly, the manual registration of the 3D models and the 3D point cloud also introduces errors. However, these errors are quite acceptable for the navigation system.

Moreover, we can see that in all the examples shown in Fig. 7.10, the buildings are partially occluded by trees. The proposed patch-based retrieval method in Chapter 3 avoids the occluded parts and picks the most distinctive patch to match images in the database. Therefore, our proposed navigation system is effective in handling occlusions and dynamics.

Figure 7.10: (a) to (e) show some examples of the captured images in the real world and the corresponding views in the virtual world with estimated camera positions and orientations. (f) shows their locations on a map.

7.9 Summary

We have presented a complete pipeline of an augmented reality system that works in a large scale outdoor environment. The 3D LiDAR data and real images are needed for the system. A preprocessing step is conducted to register these images with the LiDAR data, and synthetic images are generated to cover more viewpoints. Both real images and synthetic images are used to get the initial camera pose. Due to the outdoor lighting issues, we designed a novel feature that is robust to lighting variations. When the tracking starts, we back project KLT features onto the LiDAR to generate a feature point cloud. The 3D planes are built in the offline part to model the LiDAR data. Frame-to-frame tracking is necessary to maintain point correspondences with the point cloud.
Wealsoprovidea poserefinementprocesswhichisrequiredperiodicallytogenerateaccuratepointscloud andavoiderroraccumulations. Moreover,whenGPSisavailableinthetrackingprocess, weprovideahandoffmethodthatcanmakecameratransitionsmoothlybetweenneigh- boring clusters. We also propose a virtual worldnavigation application that is based on ourtrackingsystem. Therealworldtrackingcanbereflectedinvirtualworldwithwell synchronization in real time. We have tested the presented systems on different urban settingsandrobusttrackingperformancecanbeachievedconsistently. 106 Chapter8 ConclusionandFutureWork 8.1 Conclusion Thisthesisstudiesvariousproblemsinrobustimagematchingandtrackinginlargescale environment. Therearemanysubproblemsinthetrackingsystem,andeachsubproblem is studied in one chapter. Chapter 3 presents a fast pose recovery algorithm. A grid method is presented to increase the matching speed. Retrieval techniques are used to detect scales of keypoints and an adaptive scoring scheme is designed for performance improvements. Propagations are applied to achieve more matches on the image. Both efficiencyandrobustnessareachievedwithourmatchingalgorithms. In outdoor environment, matching usually fails due to lighting variations. In Chap- ter 4, a new type of feature named Line Context is presented. The feature is formed by describingthelayoutoflinesegmentsinthecontextofaninterestpoint. Pixelgrouping is enforced after Laplacian-of-Gaussian to detect a more repeatable scale under large illumination changes. Moreover, to achieve robustness to non-consistent edge detec- tions and linkings, multiple sample points are used to represent each segment. A vot- ing scheme is designed to generate a three dimensional histogram. The unnormalized weights in the histogram form a line context descriptor. Similar to most local features, linecontextsarerobusttoocclusionsandviewpointchanges. Significantimprovements areachievedonmatchingchallengingimagessuchaslargeilluminationvariations. From Chapter 5 to Chapter 7, LiDAR point clouds are used in the tracking system. WithLiDARdata,wecanextendoursystemtolargescaleareas. 107 Chapter 5 presents a pose estimation method with respect to 3D terrestrial LiDAR data. The data also contains intensity information. Point clouds are projected onto sev- eral pre-selected viewpoints and SURF features are detected on each synthetic image. These features are back projected onto point clouds to obtain their 3D positions. They are used to compute the initial camera pose. In the next stage, camera pose is further refined by minimizing an objective function. The function is composed of two com- ponents,intensitydifferencesandgeometricstructuredisplacementsbetweenprojected point clouds and camera image. The alignment method has been demonstrated to be moreaccuratethantraditionalkeypointbasedmethods. AmethodthatmatchesasequenceofimageswithpointcloudsispresentedinChap- ter6. Apreprocessstepisrequiredtoremovethosepointsthatarenotreliableforfeature trackings. Oncethecameraposeisinitialized,weuseKLTfeaturestotracktheposefor subsequent frames. The point cloud is modeled with a set of 3D planes. These planes areusedtoidentifycoplanarfeaturepointsthatcanmaintainastablepose. Themethod has been tested on different point clouds and stable poses are maintained for the whole imagesequence. Finally in Chapter 7, we have presented a complete pipeline of augmented reality system that works in large scale outdoor environment. The system contains offline pre- processing part and online tracking part. 
Each part has been discussed in more detail. When GPS is available, a handoff method is used to make smooth camera transition between neighboring clusters. We also present a virtual world navigation application that is based on our tracking system. The real world camera pose can be reflected in virtual world in the real time. We have tested the presented systems on different urban settingsandrobusttrackingperformanceisachievedconsistently. 108 8.2 FutureWork Possiblefutureworklieswithinfollowingtwodirections. Fusionwithothersensors There are many other types of sensors available besides optical sensors. In Chapter 7, we talked about how GPS can be used for the tracking system. In practice, many other sensors such as accelerometer and gyroscope can be applied. They provide measures of translations and orientations. Though the precision is not as high as vision-based sensors, they are not affected by occlusions or illumination changes. Therefore, it will bemorerobustifoutputsfrommultiplesensorsarecombined. ExtendedKalmanFilter (EKF)isoftenappliedtofusedifferenttypesofsensors. Ourtrackingsystemwillbenefit fromsuchadditionalinputs. Trackingwithrangecameras Many cameras are able to capture the depth information of an image. For example, time-of-flightcameraisonetypeofsuchcameras. Differentfromthesparsepointcloud which is generated by stereo triangulation method, range cameras will generate dense depth information. The depth image does not suffer from many problems in camera images such as lighting variations and parallax problem etc. With developed 3D to 3D registration techniques, the pose can be recovered by registering camera range images with LiDAR data. The AR system will become more robust by using the additional depthinformation. 109 ReferenceList [1] S. Agarwal, N. Snavely, I. Simon, S. Seitz, and R. Szeliski. Building rome in a day. InProceedingsoftheInternationalConferenceonComputerVision(ICCV), 2009. [2] D. G. Aguilera, P. R. Gonzalvez, and J. G. Lahoz. An automatic procedure for co-registrationofterrestriallaserscannersanddigitalcameras. ISPRSJournalof Photogrammetryand RemoteSensing,64(3):308–316,2009. [3] N. Ansari and E. J. Delp. On detecting dominant points. Pattern Recognition, 24(5):441–451,1991. [4] P.Azad,T.Asfour,andR.Dillmann. Combiningharrisinterestpointsandthesift descriptor for fast scale-invariant object recognition. In EEE/RSJ International Conferenceon IntelligentRobots andSystems(IROS),2009. [5] H.Bay,A.Ess,T.Tuytelaars,andL.V.Gool. Speeded-uprobustfeatures(surf). ComputerVisionandImageUnderstanding,110(3):346–359,2008. [6] H. Bay, V. Ferrari, and L. V. Gool. Wide-baseline stereo matching with line segments. In CVPR,2005. [7] H. Bay, T. Tuytelaars, and L. V. Gool. Surf: Speeded up robust features. In ECCV,pages404–417,2006. [8] S.BeckerandN.Haala. Combinedfeatureextractionforfacadereconstructio. In ISPRSWorkshopon LaserScanning,2007. [9] P. N. Belhumeur, D. J. Kriegman, and A. L. Yuille. The bas-relief ambiguity. IJCV,35(1):33–44,1999. [10] A. Blake. Boundary conditions of lightness computation in mondrian world. In CVGIP,1985. [11] Y. Boykov and D. Huttenlocher. daptive bayesian recognition in tracking rigid objects. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2000. [12] M. Brooks. Two results concerning ambiguity in shape from shading. AAAI, pages36–39,1983. 110 [13] T. Brox, B. Rosenhahn, D. Cremers, and H.-P. Seidel. High accuracy optical flow serves 3-d pose tracking: Exploiting contour and flow based constraints. In ECCV,2006. 
[14] M.Calonder,V.Lepetit,andP.Fua. Brief: Binaryrobustindependentelementary features. In ECCV,pages778–792,2010. [15] J.Canny. Acomputationalapproachtoedgedetection. PAMI,8:679–698,1986. [16] O.CarmichaelandM.Hebert.Shape-basedrecognitionofwiryobjects.InCVPR, pages401–408,2003. [17] C. Carson, M. Thomas, S. Belongie, J. M. Hellerstein, and J. Malik. Blobworld: A system for region-based image indexing and retrieval. Visual Information and InformationSystems,1999. [18] T.ChamandJ.Rehg. Amultiplehypothesisapproachtofiguretracking. InProc. IEEEConf.Computer VisionandPatternRecognition(CVPR),1999. [19] M. J. Chantal Revol. A new minimum variance region growing algorithm for imagesegmentation. In PatternRecognitionLetters,1997. [20] Y.Chen,Y.Rui,andT.Huang. Jpdaf-basedhmmforreal-timecontourtracking. In Proc.IEEE Conf.Computer VisionandPatternRecognition(CVPR),2001. [21] S.ChristyandR.Horaud. Iterativeposecomputationfromlinecorrespondences. ComputerVisionandImageUnderstanding,73(1):137–144,1999. [22] A. I. Comport, E. Marchand, M. Pressigout, and F. Chaumette. Real-time mark- erless tracking for augmented reality: the virtual visual servoing framework. IEEE Transactions on Visualization and Computer Graphics, 12(4):615–628, July2006. [23] P.DavidandD.DeMenthon. Objectrecognitioninhighclutterimagesusingline features. In ICCV,2005. [24] A. Davison, Y. Cid, and N. Kita. Real-time 3d slam with wide-angle vision. In In The 5th IFAC/EURON Symposium on Intelligent Autonomous Vehicles (IAV), 2004. [25] A. Davison, I. Reid, N. Molton, and O. Stasse. Monoslam: Real-time single cameraslam. In PAMI,2007. [26] P.E.DebevecandJ.Malik. Recoveringhighdynamicrangeradiancemapsfrom photographs. In SIGGRAPH,1997. [27] M. Ding, K. Lyngbaek, and A. Zakhor. Automatic registration of aerial imagery with untextured 3d lidar models. In Computer Vision and Pattern Recognition (CVPR),2008. [28] P. Dupuis and J. Oliensis. An optimal control formulation and related numerical methodsforaprobleminshapereconstruction. The Annals of Applied Probabil- ity,4(2):287–346,1994. 111 [29] E. Eade and T. Drummond. Edge landmarks in monocular slam. In Image and VisionComputing,2006. [30] B.Fan,F.Wu,andZ.Hu.Aggregatinggradientdistributionsintointensityorders: Anovellocalimagedescriptor. In CVPR,2011. [31] V. Ferrari, L. Fevrier, F. Jurie, and C. Schmid. Groups of adjacent contour seg- mentsforobjectdetection. PAMI,30(1):36–51,2008. [32] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fiting with applications to image analysis and automated cartography. Graphicsand ImageProcessing,24(6):381–395,1981. [33] C. Frueh, R. Sammon, and A. Zakhor. Automated texture mapping of 3d city modelswithobliqueaerialimagery. InSymposiumon3DDataProcessing,Visu- alizationand Transmission,pages3963–403,2004. [34] J.Gall, B.Rosenhahn, andH.peterSeidel. Drift-freetrackingofrigidand artic- ulatedobjects. In CVPR,2008. [35] L. Goncalves, E. D. Bernardo, D. Benson, M. Svedman, J., N. Karlsson, and P. Pirjanian. A visual front-end for simultaneous localization and mapping. In ICRA,2005. [36] H. Grabner, C. Leistner, and H. Bischof. Semi-supervised on-line boosting for robusttracking. InProceedingsoftheEuropeanConferenceonComputerVision (ECCV),2008. [37] W. Guan, L. Wang, M. Jonathan, S. You, and U. Neumann. Robust pose estima- tion in untextured environments for augmented reality applications. In ISMAR, 2009. [38] W. Guan, S. You, and U. Neumann. Recognition-driven 3d navigation in large- scalevirtualenvironments. 
In IEEEVirtualReality,2011. [39] W. Guan, S. You, and U. Neumann. Efficient matchings and mobile augmented reality. In ACMTOMCCAP,2012. [40] R.GuptaandA.Mittal.Smd: Alocallystablemonotonicchangeinvariantfeature descriptor. In ECCV,2008. [41] R. Gupta, H. Patak, and A. Mittal. Robust order-based methods for feature description. In CVPR,2010. [42] C.HarrisandM.Stephens. Acombinedcornerandedgedetector. InProceedings ofthe 4thAlveyVisionConference,pages147–151,1988. [43] R. I. Hartley and A. Zisserman. Multiple view geometry in computer vision. CambridgeUniversityPress,ISBN:0521540518,secondedition,2004,2004. [44] N.Henze,T.Schinke,andS.Boll. Whatisthat? objectrecognitionfromnatural features on a mobile phone. In Proceedings of the Workshop on Mobile Interac- tionwiththe Real World,2009. 112 [45] R. Horaud, F. Dornaika, B. Lamiroy, and S. Christy. Object pose: The link betweenweakperspective,paraperspectiveandfullperspective. ComputerVision andImageUnderstanding,22(2),1997. [46] S. S. Intille, J. W. Davis, and A. E. Bobick. Real-time closed-world tracking. In IEEEConferenceon ComputerVisionandPatternRecognition(CVPR),1997. [47] M. Isard and A. Blake. Condensation-conditional density propagation for visual tracking. Intâ ˘ A ´ Zl J.ComputerVision,29(1),1998. [48] P. Jensfelt, D. Kragic, J. Folkesson, and M. Bjorkman. A framework for vision basedbearingonly3dslam. In ICRA,2006. [49] H. Jin, P. Favaro, and S. Soatto. Real-time feature tracking and outlier rejection withchangesinillumination. In Proceedings of the International Conference on ComputerVision(ICCV),2001. [50] J. Lalonde, A. Efros, and S. Narasimhan. Estimating natural illumination from a singleoutdoorimage. In ICCV,2009. [51] G.W.LarsonandR.Shakespeare. Rendering with radiance: the art and science of lighting visualization. MorganKaufmannPublishersInc.,SanFrancisco,CA, USA,1998. [52] T.LemaireandS.Lacroix. Monocular-visionbasedslamusinglinesegments. In ICRA,2007. [53] V.Lepetit,F.Moreno-Noguer,andP.Fua. Epnp: Anaccurateo(n)solutiontothe pnpproblem. InternationalJournalofComputerVision(IJCV),2009. [54] V.Lepetit,J.Pilet,andP.Fua. Pointmatchingasaclassificationproblemforfast androbustobjectposeestimation. In CVPR,2004. [55] S.Leutenegger,M.Chli,andR.Siegwart. Brisk: Binaryrobustinvariantscalable keypoints. In ICCV,2011. [56] B. Li and R. Chellappa. Simultaneous tracking and verification via sequential posteriorestimation. InProc.IEEEConf.ComputerVisionandPatternRecogni- tion(CVPR),2000. [57] T. Lindeberg. Scale-space theory in computer visionobject recognition by affine invariantmatching. KluwerAcademicPublishers,1994. [58] A. Lipton, H. Fujiyoshi, and R. Patil. Moving target classification and tracking fromreal-timevideo. InProc.IEEEWorkshopApplicationsofComputerVision, 1998. [59] L.LiuandI.Stamos.Automatic3dto2dregistrationforthephotorealisticrender- ingofurbanscenes.InComputerVisionandPatternRecognition,pages137–143, 2005. 113 [60] L.LiuandI.Stamos.Asystematicapproachfor2d-imageto3d-rangeregistration inurbanenvironments.InInInternationalConferenceonComputerVision,pages 1–8,2007. [61] D. G. Lowe. Object recognition from local scale-invariant features. In ICCV, pages1150–1157,1999. [62] E. Malis, M. Vargas, T. Num, E. Malis, and M. Vargas. Deeper understanding of the homography decomposition for vision-based control. In INRIA Research Report,2007. [63] A. Mastin, J. Kepner, and J. Fisher. Automatic registration of lidar and optical images of urban scenes. In Computer Vision and Pattern Recognition (CVPR), pages2639–2646,2009. [64] J. 
Matas, O. Chum, M. Urba, and T. Pajdla. Robust wide baseline stereo from maximallystableextremalregions. In BMVC,pages384–396,2002. [65] J. Meltzer and S. Soatto. Edge descriptors for robust wide-baseline correspon- dence. In CVPR,pages1–8,2008. [66] N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller. Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21:1087–1092,June1953. [67] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. PAMI,2005. [68] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffal- itzky,T.Kadir,andL.V.Gool. Acomparisonofaffineregiondetectors. InIJCV, 2005. [69] K. Mikolajczyk, A. Zisserman, and C. Schmid. Shape recognition with edge- basedfeatures. In BMVC,pages779–788,2003. [70] M.MilfordandG.Wyeth. Featurelessvehicle-basedvisualslamwithaconsumer camera. In Proceedings of the 2007 Australasian Conference on Robotics and Automation,2007. [71] N.J.MitraandA.Nguyen. Estimatingsurfacenormalsinnoisypointclouddata. InProceedingsofthenineteenthannualsymposiumonComputationalgeometry, 2003. [72] D.Miyazaki,R.Tan,K.Hara,andK.Ikeuchi. Polarization-basedinverserender- ingfromasingleview. In ICCV,2003. [73] J. Morel and G. Yu. Asift: A new framework for fully affine invariant image comparison. In SIAM Journalon ImagingSciences,pages438–469,2009. [74] F.Moreno-Noguer,V.Lepetit,andP.Fua. Accuratenon-iterativeo(n)solutionto thepnpproblem. In ICCV,pages1–8,2007. 114 [75] E.Mouragnon,M.Lhuillierand,M.Dhome,F.Dekeyser,andP.Sayd. Monocular visionbasedslamformobilerobots. In ICPR,2006. [76] N. Muhammad, D. Fofi, and S. Ainouz. Current state of the art of vision based slam. In Image Processing: Machine Vision Applications II, volume 7251, Feb 2009. [77] P.Musialski,P.Wonka,D.G.Aliaga,M.Wimmer,L.vanGool,andW.Purgath- ofer. A survey of urban reconstruction. In EUROGRAPHICS 2012 State of the ArtReports,pages1–28,May2012. [78] H. Najafi, Y. Genc, and N. Navab. Fusion of 3d and appearance models for fast objectdetectionandposeestimation. In ACCV,pages415–426,2006. [79] D. Oberkampf, D. DeMenthon, and L. Davis. Iterative pose estimation using coplanarfeaturepoints. In CVGIP,1996. [80] M. Pollefeys, L. V. Gool, M. Vergauwen, F. Verbiest, K. Cornelis, J. Tops, and R. Koch. Visual modeling with a hand-held camera. International Journal of ComputerVision,59(3):207–232,2004. [81] M. Pollefeys, D. Nistér, J. M. Frahm, A. Akbarzadeh, P. Mordohai, B. Clipp, C. Engels, D. Gallup, S. J. Kim, P. Merrell, C. Salmi, S. Sinha, B. Talton, L. Wang, Q. Yang, H. Stewénius, R. Yang, G. Welch, and H. Towles. Detailed real-timeurban3dreconstructionfromvideo. InternationalJournalofComputer Vision,78(2-3):143–167,2008. [82] T. Pylvanainen, J. Berc., T. Korah, V. Hedau, M. Aanj., and R. Grzesz. 3d city modelingfromstreet-leveldataforaugmentedrealityapplications. In3DIMPVT, 2012. [83] L.QuanandZ.Lan. Linearn-pointcameraposedetermination. In PAMI,1999. [84] A.Rahimi,L.P.Morency,andT.Darrell. Reducingdriftindifferentialtracking. ComputerVisionandImageUnderstanding,109(2),2008. [85] R. Rosales and S. Sclaroff. 3d trajectory recovery for tracking multiple objects and trajectory guided recognition of actions. In Proc. IEEE Conf. Computer Visionand PatternRecognition(CVPR),1999. [86] P.L.Rosin. Measuringcornerproperties. CVIU,73(2),1999. [87] E.Rublee,V.Rabaud,K.Konolige,andG.Bradski. Orb: anefficientalternative tosiftorsurf. In ICCV,2011. [88] J. M. S. Belongie and J. Puzicha. 
Shape matching and object recognition using shapecontexts. IEEETransactionsonPatternAnalysisandMachineIntelligence (PAMI),24(4):509–522,2002. [89] E. Shechtman and M. Irani. Matching local self-similarities across images and videos. In CVPR,pages1–8,2007. [90] J.ShiandC.Tomasi. Goodfeaturestotrack. In CVPR,1994. 115 [91] I.StamosandP.K.Allen. Geometryandtexturerecoveryofscenesoflargescale. ComputerVisionandImageUnderstanding,88(2):94–118,2002. [92] I. Stamos, L. Liu, C. Chen, G.Wolberg, G. Yu, and S. Zokai. Integrating auto- matedrangeregistrationwithmultiviewgeometryforthephotorealisticmodeling oflarge-scalescenes. Int. J.Comput. Vision,pages237–260,2008. [93] D.-N. Ta, W.-C. Chen, N. Gelfand, and K. Pulli. Surftrac: Efficient tracking and continuousobjectrecognitionusinglocalfeaturedescriptors.InIEEEConference onComputer VisionandPatternRecognition(CVPR),2009. [94] R. T. Tan and K. Ikeuchi. Separating reflection components of textured surfaces usingasingleimage. PAMI,27(2):179–193,February2005. [95] F.Tang,S.H.Lim,N.Chang,andH.Tao. Anovelfeaturedescriptorinvariantto complexbrightnesschanges. In CVPR,pages2631–2638,2009. [96] M. Tappen, W. Freeman, and E. Adelson. Recovering intrinsic images from a singleimage. In NIPS,2002. [97] P.TsaiandM.Shah. Shapefromshadingusinglinearapproximation. IVC,12(8), October1994. [98] A. Vasile, F. R. Waugh, D. Greisokh, and R. M. Heinrichs. Automatic align- ment of color imagery onto 3d laser radar data. In Applied Imagery and Pattern RecognitionWorkshop,2006. [99] D. Wagner, G. Reitmayr, A. Mulloni, T. Drummond, and D. Schmalstieg. Pose tracking from natural features on mobile phones. In Proceedings of the Interna- tionalSymposium on Mixed and AugmentedReality(ISMAR),2008. [100] L.WangandU.Neumann. Arobustapproachforautomaticregistrationofaerial imageswithuntexturedaeriallidardata. In Computer Visionand PatternRecog- nition(CVPR),pages2623–2630,2009. [101] L. Wang, U. Neumann, and S. You. Wide-baseline image matching using line signatures. In ICCV,2009. [102] R.Wang,F.Ferrie,andJ.Macfarlane. Automaticregistrationofmobilelidarand spherical panoramas. In Computer Vision and Pattern Recognition Workshops (CVPRW),pages33–40,2012. [103] Z.Wang,B.Fan,andF.Wu. Localintensityorderpatternforfeaturedescription. In ICCV,2011. [104] Z.Wang, F.Wu, andZ. Hu. Msld: A robustdescriptor forline matching. In PR, 2009. [105] G. Yang, J. Becker, and C. Stewart. Estimating the location of a camera with respecttoa3dmodel. In 3DDigital ImagingandModeling,2007. 116 [106] Y. Yu, K. Huang, W. Chen, and T. Tan. A novel algorithm for view and illu- mination invariant image matching. IEEE Transactions on Image Processing, 21(1):229–240,2012. [107] Z. Zhang. Iterative point matching for registration of free-form curves and sur- faces. InternationalJournalofComputerVision,13(12):119–152,1994. [108] Z. Zhang. Flexible camera calibration by viewing a plane from unknown orien- tations. In ICCV,pages666–673,1999. [109] Q. Zhou and U. Neumann. A streaming framework for seamless building recon- struction from large-scale aerial lidar data. In IEEE Conference on Computer Visionand PatternRecognition(CVPR),2009. [110] T. Zinfier, C. Grafil, and H. Niemann. Efficient feature tracking for long video sequences. In PatternRecognition,2004. [111] C.L.ZitnickandK.Ramnath. Edgefociinterestpoints. In ICCV,2011. 117
Abstract
This thesis presents new matching algorithms that work robustly in challenging situations. Image matching is a fundamental and challenging problem in vision community due to varied sensing techniques and imaging conditions. While it is almost impossible to find a general method that is optimized for all uses, we focus on those matching problems that are related to augmented reality (AR). Many AR applications have been developed on portable devices, but most are limited to indoor environments within a small workspace because their matching algorithms are not robust out of controlled conditions. ❧ The first part of the thesis describes 2D to 2D image matching problems. Existing robust features are not suited for AR applications due to their computational cost. A fast matching scheme is applied to such features to increase matching speed by up to 10 times without sacrificing their robustness. Lighting variations can often cause match failures in outdoor environments. It is a challenging problem because any change in illumination causes unpredicted changes in image intensities. Some features have been specially designed to be lighting invariant. While these features handle linear or monotonic changes, they are not robust to more complex changes. This thesis presents a line‐based feature that is robust to complex and large illumination variations. Both feature detector and descriptor are described in more detail. ❧ The second part of the thesis describes image sequence matching with 3D point clouds. Feature‐based matching becomes more challenging due to different structures between 2D and 3D data. The features extracted from one type of data are usually not repeatable in the other. An ICP‐like method that iteratively aligns an image with a 3D point cloud is presented. While this method can be used to calculate the pose for a single frame, it is not efficient to apply it for all frames in the sequence. Once the first frame pose is obtained, the poses for subsequent frames can be tracked from 2D to 3D point correspondences. It is observed that not all points on LiDAR are suitable for tracking. A simple and efficient method is used to remove unstable LiDAR points and identify features on frames that are robust in the tracking process. With the above methods, the poses can be calculated more stably for the whole sequence. ❧ With provided solutions to above challenging problems, we have applied our methods in an AR system. We describe each step in building up such a system from data collections and preprocessing, to pose calculations and trackings. The presented system is shown to be robust and promising for most AR‐based applications.