ACCURATE IMAGE REGISTRATION THROUGH 3D RECONSTRUCTION

by Yuping Lin

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

August 2010

Copyright 2010 Yuping Lin

Acknowledgments

When I first entered my Ph.D. program, I wasn't sure that I could finish the program and become a doctor. It is because of the people mentioned below that I am able to make my dream come true.

Firstly, I would like to thank my advisor, Professor Gérard Medioni. When I was only a first-year master's student, he generously offered me a research project, opening me to the field of computer vision research. He encouraged me to continue the research, and helped me to enter the USC computer science Ph.D. program. During the course of my Ph.D. study, there were countless times when he provided me with valuable insight into problems, pointing out possible ways to approach them, and motivating me to hang on in the face of frustration. My admiration of and gratefulness to him are beyond words.

I would also like to thank Professor Ram Nevatia, my instructor in CS 574 Computer Vision and CS 674 Advanced Computer Vision. His lectures are the building blocks on which I built my foundations in computer vision. I appreciate him, along with Professor C.-C. Jay Kuo, Professor Laurent Itti, and Professor Alexander Walsh, for the time they served as committee members of my qualifying exam, and for their constructive advice.

Throughout the years of my Ph.D. study, I have also been grateful to a number of senior students who shared their valuable knowledge with me. First and foremost, I would like to thank Wei-Kai Liao. We have known each other since before I came to the States, and have maintained close contact to the present. He never tires of long discussions of my problems, from which I always come away with a clearer picture of what to do next. I also deeply thank Tae-Eun Choe and Qian Yu, who helped me comprehend the full picture in retinal image registration research and UAV image registration research respectively. My appreciation also goes to Yann Dumorier, who helped me develop a better defense presentation.

I cherish my friendships with past and present members of the USC vision lab: Dr. Sung Chun Lee, Dr. Douglas Fidaleo, Dr. Pradeep Natarajan, Dr. Chang Yuan, Dr. Matheen Saddiqui, Dr. Bo Wu, Dr. Chang Huang, Dr. Jongmoo Choi, Cheng-Hua Pai, Cheng-Hao Kuo, Huei-Heng Liao, Yuan Li, Li Zhang, Xuemei Zhao, Dian Gong, Bo Yang, Eunyoung Kim, Thang Ba Dinh, Jan Prokaj, Vivek Kumar Singh, Prithviraj Banerjee, Pramod K. Sharam, Nancy Levien, and Bill Ortiz.

I would also like to thank my girlfriend, Ying-Chun Hsiao, and my friends, for accompanying me through these years. They made my Ph.D. life a lot happier and easier.

Finally, I want to express my greatest love to my parents. They focus much attention on my education, and are always supportive of my decisions. I hope my becoming a Ph.D. makes them proud.

Table of Contents

Acknowledgments
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
Chapter 2: Related Work
  2.1 Invariant Feature Extraction and Matching
  2.2 2D Registration and Image Mosaicing
  2.3 3D Reconstruction of Cameras and Structure
  2.4 Dense Structure Inference
Chapter 3: Aerial Image Registration
  3.1 Introduction
    3.1.1 Problem Formulation and Issues
    3.1.2 Related Work
    3.1.3 Overview of our approach
  3.2 UAV Inter Frame Registration
  3.3 UAV-to-Map Registration
    3.3.1 UAV-to-Map Correspondence Establishment
    3.3.2 Extracting Reasonable Correspondences
    3.3.3 Local Mosaic
  3.4 Experimental Results
  3.5 Summary and Concluding Remarks
Chapter 4: Retinal Image Registration: A 2D Approach
  4.1 Introduction
    4.1.1 Issues
    4.1.2 Related Work
    4.1.3 Overview of Our Approach
  4.2 SIFT on Edge Response Images
    4.2.1 Image Pre-processing
    4.2.2 Experimental Results
  4.3 Feature Matching in Two Images
    4.3.1 Iterative Nearest Neighbor Matching
  4.4 Chained Registration
    4.4.1 Cost Function for Chained Registration
    4.4.2 Experimental Results
  4.5 Summary
Chapter 5: Retinal Image Registration: 3D Reconstruction
  5.1 Introduction
    5.1.1 Issues
    5.1.2 Related Work
    5.1.3 Overview of Our Approach
  5.2 Epipolar Geometry Estimation
  5.3 Camera Pose Estimation Using 4-Pass Bundle Adjustment
  5.4 Dense Surface Depth Inference
  5.5 Mutual Information Computation on the GPU Platform
    5.5.1 Approximate Value
    5.5.2 Approximate Derivative
  5.6 Experimental Results
    5.6.1 3D Reconstruction
    5.6.2 GPU Acceleration
  5.7 Summary
Chapter 6: 3D Urban Scene Reconstruction
  6.1 Introduction
  6.2 Related Work
  6.3 Approach
    6.3.1 Matching Score
    6.3.2 Height Estimation along an Image Column
    6.3.3 Two-Pass Dynamic Programming
    6.3.4 Height Estimation for the Entire Image
  6.4 Line Extraction
  6.5 Experiments
  6.6 Summary
Chapter 7: 3D Face Reconstruction from Weakly Calibrated Wide Baseline Images
  7.1 Introduction
  7.2 Related Work
  7.3 Approach
    7.3.1 Sparse Reconstruction
    7.3.2 Dense Reconstruction
  7.4 Experimental Results
    7.4.1 Synthetic Face Models
      7.4.1.1 Accuracy of reconstruction using silhouette
      7.4.1.2 Accuracy in imperfect profile views
    7.4.2 Real Face Models
      7.4.2.1 Computation complexity
      7.4.2.2 Quality of reconstruction results
  7.5 Summary
Chapter 8: Conclusion and Future Directions
  8.1 Summary of Contributions
  8.2 Future Directions
Bibliography

List of Tables

3.1 Experimental results of two UAV sequences
5.1 Average local depth variance comparison of the reconstructed surfaces
5.2 CPU vs. GPU hardware spec
6.1 Computation time for each module
7.1 Number of inliers vs. number of total 3-view correspondences

List of Figures

1.1 Image registration through 3D reconstruction
1.2 Flow chart of our image registration framework
3.1 Flow chart of our UAV image registration to a reference map
3.2 The correspondences between an UAV image and a map
3.3 Registering a partial local mosaic
3.4 Registration results of two UAV sequences
4.1 A retinal fundus camera
4.2 Retinal images in different modalities
4.3 Flow chart of retinal image registration
4.4 Retinal images and their corresponding edge responses
4.5 Image preprocessing at each stage
4.6 Number of true matches in the original and edge response images
4.7 A typical matching result in each iteration of Iter-NN match
4.8 Iterative Nearest Neighbor Matching algorithm
4.9 Chained registration
4.10 Registration rate of three matching methods
4.11 Registration result of 2 images with large modality difference
4.12 Poor dynamic range
5.1 Flow chart of 3D retinal image registration
5.2 Fundamental matrix estimation using plane+parallax method
5.3 Flow chart of 4-pass bundle adjustment
5.4 Piecewise planar retinal surface
5.5 Mutual information value algorithm
5.6 Mutual information derivative algorithm
5.7 The camera configurations of the 4P-BA experiment
5.8 Errors with respect to different camera baselines
5.9 Comparison of the 3D retina surface reconstruction
5.10 Computational time of mutual information and its derivatives
5.11 Computation time for image alignment using mutual information
6.1 Samples of the input UAV images
6.2 Flow chart of our dense reconstruction of urban scene
6.3 Assumption on UAV images
6.4 Reconstruction results using winner-take-all vs. dynamic programming
6.5 Bias term computed in the first-pass DP
6.6 Dense matching results of a stereo image pair
6.7 Linear features extracted from urban imagery
6.8 Typical 2-pass DP vs. our 2-pass DP
6.9 Reconstruction results using 1-pass DP and 2-pass DP respectively
6.10 Reconstructed 3D dense point cloud in multiple viewing angles
6.11 3D building mesh models
6.12 A reconstructed 3D point cloud from CLIF'06 dataset
7.1 The setup for capturing the five facial images of a subject
7.2 The five facial images and two profile images
7.3 Flow chart of the sparse reconstruction module
7.4 Iterative bundle adjustment algorithm
7.5 Cylindrical voxel sampling for dense 3D reconstruction
7.6 Reconstruction results without using vs. using profile contours
7.7 Reconstruction error comparison
7.8 Mean error distributions of the synthetic faces and camera poses
7.9 Reconstruction of real images

Abstract

Image registration is a fundamental problem in image analysis, and is used in a variety of applications such as tracking, moving object detection, and remote sensing. In this thesis, we study image registration problems where the images are taken at different times, from different sensors, and from different viewpoints. Moreover, the scene may contain arbitrary 3D structure. With all these variations, image registration becomes a challenging task, where standard techniques may fail to produce accurate results.

According to the geometry in the image, we categorize image registration problems into 2D registration and 3D registration. In 2D registration, image features are extracted and matched to establish correspondences, from which the epipolar geometry can be estimated. Images are then registered using a derived 2D model. In 3D registration, on top of epipolar geometry estimation, a sparse reconstruction step is required which recovers the camera parameters and the sparse structure. A dense reconstruction step follows, which recovers structure in the entire scene; images are then registered through 3D inference. The main contribution of this thesis is the in-depth study of a number of issues in image registration applications which a general framework does not address, particularly in the 3D reconstruction pipeline.

We start from 2D image registration, and look at two applications, UAV image registration and retinal image registration. In UAV image registration, we are given a UAV image sequence, and the goal is to produce a mosaic in a progressive manner. As inter-frame registration error accumulates along the process and results in deviation, we introduce an additional map as a global reference, and perform UAV-to-map registration to compensate for the error. In retinal image registration, the input is a set of retinal images in multiple modalities. We propose an iterative nearest neighbor matching method to account for issues raised in multi-modal imagery, and achieve both a high registration rate and high efficiency.

We then extend our study in both imageries to 3D, to account for the underlying 3D geometry. In addition, we research some other 3D reconstruction problems using human facial images. In 3D retinal image registration, we address the issues which arise from the near-planar property of the retinal surface, and propose a 4-pass bundle adjustment method to account for it. Our approach is shown to be very robust and efficient, and is state-of-the-art in 3D retinal image registration. For UAV image registration, we focus on the dense 3D reconstruction of urban environments. Images of urban environments are characterized by significant occlusions, sharp edges, and textureless regions, leading to poor 3D reconstruction using standard multi-view stereo algorithms. Our approach makes a general assumption that urban scenes consist of planar facets that are either horizontal or vertical. These two assumptions provide very strong constraints for the underlying geometry.
The contribution of this work is the way we translate these constraints into intra-image-column and inter-image-column constraints respectively, and formulate the dense reconstruction problem as a 2-pass dynamic programming problem, which can be solved efficiently. Moreover, our algorithm is fully parallelizable, which makes it appropriate for GPU computing. Our results show that we can preserve a high level of detail, and achieve high visual quality.

In 3D human face reconstruction, we are given a set of 5 wide-baseline images that are only weakly calibrated. The focus in this work is on both sparse reconstruction and dense reconstruction. First, to calibrate the cameras, we propose an iterative bundle adjustment approach to solve the challenging wide-baseline feature matching problem. Then, for dense reconstruction, we propose to use a face-specific cylindrical representation which allows us to solve a global optimization problem for N-view dense aggregation. We explicitly use profile contours extracted from the images in both the sparse reconstruction and dense reconstruction steps to provide strong constraints for the underlying geometry. Experimental results show that our method provides accurate and stable reconstruction results on wide-baseline images. We compare our method with state-of-the-art methods, and show that it provides significantly better results in terms of both accuracy and efficiency.

Chapter 1: Introduction

Image registration is the process of aligning two or more images of the same scene. Through this alignment, images can share the same coordinate system, upon which more advanced image analysis tasks can be performed. It is therefore a very critical step in various image analysis domains. It can be used in remote sensing, including applications such as scene change detection, image mosaicing, and environmental monitoring. It can be used in medical imaging, including applications such as retinal image registration and monitoring of tumor evolution, or cancer detection. It can also be used in computer vision for moving object detection and tracking.

Typically, image registration deals with images that are taken from different sensors, at different times, and from different viewpoints, which exhibit multiple modalities and illumination changes. In this thesis, in addition to these factors, we deal with scenes that have arbitrary 3D structure. In such circumstances, the appearance of a scene varies greatly. Objects may disappear due to occlusion. The 3D structure in the scene may introduce strong parallax. Moreover, there is no globally applicable transformation model. As shown in Figure 1.1(a), one image is registered to the other using a 2D perspective model, where only the ground plane is aligned. To obtain individual point transformations between images, two elements must be determined: the camera poses and the 3D structure, as shown in Figure 1.1(b). Hence, the recovery of such information is necessary in order to achieve accurate registration results. Although some image stitching applications [SS00, BL] avoid reconstructing the 3D structure of the scene, they assume a stationary camera optic center, which is not general enough.

We thus classify image registration into 2D registration and 3D registration, and approach them differently.

Figure 1.1: (a) 2D registration cannot account for structure-induced parallax; (b) point transformation between images. The transformation is determined by the cameras and the 3D structure.
A high-level flowchart of such an image registration process is shown in Figure 1.2. With all these variations, image registration becomes a challenging task, and the issues can be summarized as follows:

1. Feature extraction. The extracted features have to be both invariant across different views and descriptive in order to provide reliable matches. This is difficult in the presence of different modalities, lighting conditions, and viewing perspectives. In addition, we need to produce enough features so that a large number of correspondences can be established.

2. Feature matching. The matching process establishes point correspondences, and the quality of the matches is critical in the estimation of the 3D geometry. From the previous stage, we are given a large set of features that may not be descriptive enough due to perspective distortions or illumination changes. This increases both the search space and the ambiguity in the matching process. Thus, the issue here is to design an efficient and accurate feature matching process.

3. Epipolar geometry and camera pose estimation. To recover the correct 3D geometry, the epipolar geometry and the camera poses must be known. In our imagery, the epipolar geometry is estimated from a set of correspondences that are very noisy due to false matching, dynamic content, or ill-conditioned camera motions. The structure of the scene may also be near planar, which makes the estimation of the epipolar geometry error-prone, and in turn leads to an incorrect estimation of the camera poses.

4. Dense depth inference and 3D representation. The ultimate goal in 3D reconstruction is to produce a compact representation of the scene so that the scene can be stabilized from arbitrary view perspectives. First, accurate dense correspondences between images have to be established. This is a challenging task, especially for regions with high matching ambiguity, such as occluded regions and homogeneous regions. In addition, we need to explicitly handle correspondences between moving objects, as they produce incorrect 3D structures. From the correspondences, a dense set of 3D points can then be triangulated given the camera calibrations. Next, we need to convert this dense 3D point cloud into a compact representation, such as 3D mesh models. This is also very difficult, since the reconstructed 3D dense points may contain a large number of outliers. The accurate production of 3D mesh models from such input may require some prior knowledge of the scene.

Figure 1.2: Flow chart of our image registration framework

Although a lot of research effort has been devoted over the past decades to each building block in the pipeline, and a number of quality results have been produced, there are still unsolved issues, especially for 3D image registration. We thus emphasize image registration in a 3D context, and study the problems in three specific image domains: retinal images, UAV images, and human face images.

In retinal image registration, the imagery is multi-modal, and the structure of the retinal fundus is near planar. We start by formulating it as a 2D registration problem, and address the feature extraction and matching problem in the presence of multi-modality.
Our contribution in this work is the proposed Iterative Nearest Neighbor Matching algorithm, which gradually reduces the ambiguity in a noisy matching process and produces accurate correspondences and homography estimates. Our results are shown to outperform those of [CC], which is state-of-the-art in its class, in all aspects, including successful registration rate, computational time, and registration error.

However, 2D registration cannot achieve perfect registration, since the retinal surface is only near planar. We address this problem by explicitly reconstructing the 3D surface of the retina, and performing the registration through it. This is a very difficult task, as recovering the correct epipolar geometry from a retinal fundus, which has near-planar geometry, is a degenerate case for classical epipolar geometry estimation approaches. The contribution here is our proposed 4-pass bundle adjustment that recovers the optimal 3D structure and camera poses. We have performed a qualitative comparison with the approach in [CM], and shown ours to produce better results in all cases.

In UAV image registration, the scene contains moving vehicles, and the intrinsic structure is arbitrary, such as buildings or elevated terrain. We start from a 2D registration approach, and address the drifting problem that occurs when registering a long video sequence, in which we use an additional map image as a global reference. The issue here is the difficulty of establishing accurate correspondences between the map and the UAV images. We propose a two-step method to reduce matching ambiguities, and use mutual information as a similarity measure in producing matches. Results show that our method produces mosaics without drift in long video sequences. To account for parallax effects induced by 3D objects, especially in urban environments with a large number of buildings, we perform 3D reconstruction of the scene. It is an ill-posed problem due to the presence of severe occlusions, sharp edges, and textureless regions, which results in poor 3D reconstruction using standard multi-view stereo algorithms. Inspired by state-of-the-art urban reconstruction approaches, we also assume some shape priors of the scene to regularize the 3D reconstruction. Specifically, we assume that urban scenes consist of planes that are either parallel or perpendicular to the gravity vector. Our contribution here is the way we transform this assumption into intra-image-column and inter-image-column constraints, and formulate the dense reconstruction problem as a 2-pass dynamic programming problem, which can be solved efficiently. It is also a fully parallelizable process which can make use of GPU architecture to gain a drastic speed-up. The reconstructed dense point clouds are shown to be very accurate and detail-preserving, and can be used directly as input to a 3D building modeling module to produce 3D building mesh models.

In 3D face reconstruction, we are given 5 wide-baseline images in a weakly calibrated setup, and the goal is to perform a dense 3D reconstruction of the face. Dense 3D reconstruction from wide-baseline images is a very difficult problem, and many researchers start directly from a calibrated set of images. Our contribution in this work is to start from a set of images that are uncalibrated, and to produce dense reconstruction results accurately and efficiently. First, in the sparse reconstruction step, we propose an iterative bundle adjustment approach to remove outliers from the established corresponding points.
Contours in the profile views are matched using the iterative closest point algorithm to provide reliable correspondences that link the two opposite sides of views together. Next, in the dense reconstruction step, we propose to use a cylindrical representation of the face volume which allows us to solve a global optimization problem for N-view dense aggregation. Profile contours are used once again to provide constraints in the optimization step. From experimental results using synthetic and real images, we show that our method produces accurate and stable reconstruction results on challenging inputs, while state-of-the-art methods and other commercial software packages fail in the first step of sparse reconstruction.

This thesis is organized as follows. First, we review related work in Chapter 2. Next, we begin with 2D registration, and introduce our UAV image registration and 2D retinal image registration in Chapter 3 and Chapter 4 respectively. For 3D registration, we first present 3D retinal image registration in Chapter 5, followed by 3D reconstruction of urban scenes in Chapter 6, and then the work on 3D face reconstruction in Chapter 7. Finally, we give conclusions and future work in Chapter 8.

Chapter 2: Related Work

Research in image registration and 3D reconstruction covers decades, and the literature is large. Here we survey a few articles that are relevant to the general framework: the choice of matching features, 2D mosaicing, 3D reconstruction of cameras and structure, and dense structure inference. Works that are more relevant to individual applications are reported in the corresponding chapters.

2.1 Invariant Feature Extraction and Matching

Here, we are only interested in point features, since they are common and widely spread across all kinds of imagery, and they can be efficiently extracted. Note that it is important to first detect interest points that appear across different views before we can associate them with invariant descriptors. To detect interest points, the Harris corner detector [HS88] was the most notable work in the 90s. It has been shown to be very effective in tracking and image recognition problems. Zhang [ZDFL95] then used a fixed-size local window around each corner as its descriptor, and solved the matching problem by finding the highest cross correlation. There are alternatives other than cross correlation, including normalized cross correlation, the sum of squared differences, the sum of absolute differences, and mutual information. In information theory [Gra90], mutual information is a quantity that measures the information shared between two random variables. Applied to image processing problems, it can be used to measure the similarity between two images. It is very powerful, since no limiting constraints are imposed on the image content of the modalities involved, which is particularly useful in measuring multi-modal images. However, mutual information is computationally demanding, and hence not suitable for many applications. We address this problem later in Section 5.5.

The main drawback of window-based matching is that it is neither scale nor rotation invariant. A rotation-invariant descriptor was later introduced in [SM97], and used in general image recognition problems over a large database. More recently, a joint design of an interest point detector over all scales and a scale- and rotation-invariant descriptor was introduced by Lowe [Low04]. The interest points are first detected by finding extrema in difference-of-Gaussian images at different scales. Then the major orientation of each interest point is assigned based on local image gradients. Finally, a 128-dimensional descriptor that accounts for the scale and the orientation is computed and transformed into a representation that allows for significant levels of local shape distortion and change in illumination. SIFT features have been shown to be stable, thus providing robust matching, and they have been widely used in vision tasks including image recognition and image registration.
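As a concrete illustration of the detector and descriptor just described, the short sketch below extracts SIFT keypoints and their 128-dimensional descriptors with OpenCV. This is only a minimal example under the assumption that the opencv-python package (4.4 or later, where SIFT is included) is available; it is not the implementation used in this thesis, and the file names are placeholders.

```python
import cv2

# Two views of the same scene (file names are placeholders).
img1 = cv2.imread("view_1.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view_2.png", cv2.IMREAD_GRAYSCALE)

# SIFT: difference-of-Gaussian extrema detection over scale space, orientation
# assignment from local gradients, and a 128-dimensional descriptor per keypoint.
sift = cv2.SIFT_create()
kp1, desc1 = sift.detectAndCompute(img1, None)
kp2, desc2 = sift.detectAndCompute(img2, None)

print(len(kp1), "and", len(kp2), "keypoints detected")
print("descriptor dimensionality:", desc1.shape[1])                      # 128
print("scale and orientation of the first keypoint:", kp1[0].size, kp1[0].angle)
```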
Since then, a number of variations have followed. PCA-SIFT [KS] applies Principal Component Analysis to a local gradient patch, and derives a more compact representation of the descriptor, which can make the matching process more efficient. GLOH [MS05] is another SIFT-like descriptor that considers more spatial regions to enhance robustness and distinctiveness. Concerning efficiency, DAISY [TLF09] inherits the same spatial regions from GLOH, but pre-computes the orientation maps and produces dense descriptors efficiently. It has been shown to provide very accurate results in wide-baseline dense matching problems. SURF [BETG08] is a Haar-wavelet-based descriptor that can also be efficiently extracted using pre-computed integral images. To account for modality differences, Kelman [KSS] proposed another SIFT-like descriptor using gradient mirroring and edge precursors.

With a descriptor associated with each SIFT-like feature, a correspondence can be determined by the nearest neighbor, i.e., the keypoint with minimum Euclidean distance in the descriptor space [Low04]. To produce reliable matches, there are two possible matching strategies [MS05]. In the first strategy, two features are matched if the distance between their descriptors is below a threshold. In the second strategy, features are matched if the distance ratio between the first and the second nearest neighbor is below a threshold.

2.2 2D Registration and Image Mosaicing

From the point correspondences established between images, we are able to recover the underlying epipolar geometry. In 2D registration, where the scene is planar, the epipolar geometry reduces to a homography. Image mosaicing is the process that takes an image sequence as input, estimates the homography between pairs of images, and aligns them all seamlessly. The main problem in image mosaicing is that the registration error accumulates over time, and results in large distortion in later frames.

To reduce the accumulated registration error, more global alignment approaches are adopted. A common approach is to align a new frame with the mosaic, instead of with its previous frame [SK99]. However, the accuracy still degrades over time, and the computation cost increases as the mosaic grows. Davis [Dav] used pairwise registration results from additional image pairs beyond the consecutive ones, and built an over-constrained linear system. The optimal transformations can then be computed by minimizing the least-squares error of the system. Kang [KCM] performed global registration by minimizing the sum of color differences of the points in each image that correspond to the same point in the mosaic. The search for corresponding points can be performed efficiently by using a graph-based representation which considers temporal and spatial adjacency jointly. Marzotto [MFM] also considered the spatial adjacency of the frames.
A graph is defined, where the nodes and the edges represent the frames and the approximate spatial distance between image pairs respectively. The minimum spanning tree of the graph is then computed, from which the best composition of homographies for each frame can be determined. However, these approaches can only be used offline, and some of them cannot handle a large number of images. There are cases where an image mosaicing process can produce a panoramic view of a scene that contains arbitrary 3D structure [BL, SS00]. However, the cameras need to be conditioned to have a fixed optic center, which is not general enough.

2.3 3D Reconstruction of Cameras and Structure

In 3D registration, the underlying 3D geometry needs to be reconstructed. This includes the reconstruction of both the cameras and the scene structure. The reconstruction can be performed using 2 views, 3 views, or more than 3 views.

In the two-view case, the reconstruction starts from the recovery of the epipolar geometry. In the presence of 3D structure, the epipolar geometry is encoded in a 3×3 rank-2 matrix named the fundamental (F) matrix, which captures the geometric relations between the 3D points and their projections in the 2D images. From the F matrix, the camera matrices can be determined, and the 3D structure can then be triangulated. The F matrix can be computed from 7 point matches [Har94], since it has only 7 degrees of freedom. It may also be computed from 8 or more point matches, in which case the singularity constraint on the fundamental matrix is enforced by using SVD [Har], or by iteratively minimizing the algebraic error [HZ04]. If the homography induced by a plane in the scene can be determined, the F matrix can be computed using a plane+parallax method [tL96]. The advantage of this approach is that only two additional off-the-plane correspondences are required in addition to the homography.

In the three-view case, a trifocal tensor [HZ04] serves an analogous function to the fundamental matrix in the two-view case. From a trifocal tensor, one can derive the fundamental matrix of each pair of views, or retrieve the camera matrices. Most importantly, given point/line correspondences in two views, the trifocal tensor determines the corresponding point/line in the third view [FR]. This transfer property is very useful for establishing correspondences over multiple views.

In the N-view scenario, bundle adjustment is often used. It is a non-linear optimization that iteratively minimizes the reprojection error (the geometric distance between the reprojected point and the measured point). As pointed out in [HZ04], bundle adjustment should generally be used as the final step in reconstruction algorithms. However, it is critical for the bundle adjustment algorithm to start from a good initialization.

So far, the reconstruction described is performed in projective space. The projective reconstruction can be upgraded to a metric reconstruction to provide better visualization, given the camera parameters. This is known as the camera calibration problem, which has also drawn much research interest. Many of the works [FLM, Tri, PVG] assume constant intrinsic camera parameters, and make use of the absolute conic to derive constraints for the unknowns. A more advanced technique was later presented by Pollefeys [PKVG], in which the constraints on the intrinsic camera parameters are translated to constraints on the absolute quadric.
The method is very versatile, as it allows the intrinsic camera parameters to vary across images.

2.4 Dense Structure Inference

The full body of dense structure reconstruction algorithms can first be classified into two classes: two-view and multi-view reconstruction. In the two-view case, images are first rectified into stereo pairs, where the correspondence search is limited to corresponding scanlines. A thorough survey of early work on two-view dense stereo matching algorithms is presented in [SS02]. In brief, strictly local methods do not work well. Homogeneous and occluded areas require global approaches, such as graph cuts [KZa, BVZ01, RC] or dynamic programming [KLCL, BT, OK85]. Note that the dense correspondences are computed in the rectified image pair. They have to be transformed back to their original image space, where the depth map can be computed through triangulation.

Matching ambiguities in two views can be greatly reduced by incorporating more views. A great deal of research, known as multi-view stereo (MVS) [SCD+], starts from a set of images each associated with a calibrated camera, and aims at producing highly accurate dense 3D reconstruction results. Merrell et al. [MAW+] select depth estimates for each pixel that minimize violations of visibility constraints, and thus remove errors and inconsistencies from the depth maps to produce a consistent surface. Goesele [GCS] computes individual depth maps using a window-based voting approach, then merges the depth maps into a single mesh using a volumetric approach. Other than fusion of depth maps, there are space carving methods [KS00, Bro], which generally start from a large initial volume and gradually remove voxels by enforcing photometric consistency, and volumetric graph cuts methods [KZb, LPK], which minimize certain energy functions defined over surfaces.

However, the performance of these methods is not well established due to the lack of benchmark data sets. Moreover, the assumptions made and the initialization requirements vary from one method to another. It is difficult for a single method to work in all scenarios. We propose different approaches for our different applications, as detailed in the following chapters.

Chapter 3: Aerial Image Registration

3.1 Introduction

Aerial imagery reveals massive amounts of information about our environment, such as weather, traffic, or population distribution. As the acquisition of such images is easier than ever, it is crucial to interpret the content and extract meaningful information automatically, in which image registration plays a very important role. By registering aerial images, it is possible to perform scene change detection, vehicle tracking, urban building modeling, and many other geographical tasks.

In this work, we apply an image registration technique to unmanned aerial vehicle (UAV) image sequences. The scene may contain dynamic content such as moving vehicles, and arbitrary 3D shapes from man-made structures or non-flat terrain. Our goal is to produce stabilized image sequences in a progressive manner, as this functionality can be further integrated into real-time applications such as moving vehicle detection and tracking.
3.1.1 Problem Formulation and Issues

We start with a 2D registration process that assumes the camera motion transitions smoothly with small translation, and use a homography to represent the transformation between two UAV images, i.e.,

    x_j = H_{i,j} x_i                                   (3.1)

where x_i and x_j represent the same point in the UAV frames I_i and I_j respectively (we address the cases where the structure-induced parallax is too large for such an assumption later in Chapter 6). Equivalently, we use I^j_i to represent the image obtained by warping I_i to the coordinates of I_j:

    I^j_i = H_{i,j} I_i                                 (3.2)

This is then an image mosaicing problem. In the simplest approach, each frame is registered to its previous frame in the sequence. The inter-frame homographies are then concatenated to derive the frame-to-mosaic homography.

The issue here is to register these images without drifting. This is equivalent to asking the question: "Can the same point in the scene be registered to the same location once it is revisited?" If we consider the first frame as the reference image for the entire sequence, the registration error accumulates along the process, and results in a large distortion in the last frame. Another issue is to produce a seamless mosaic when multiple frames overlap.

3.1.2 Related Work

A great deal of research has considered 2D image registration, as previously mentioned in Section 2.2. Here we address a particular class of image registration problems, known as image geo-registration, as it is closely related to our proposed approach. In geo-registration, aerial images are registered to a reference map that is geodetically calibrated, so that the aerial images can inherit the reference coordinates. In most works, the reference map is associated with a digital elevation map (DEM), where each pixel has longitude, latitude, and height information. The telemetry data of each video frame is also available. This includes the position of the vehicle in terms of latitude, longitude, height, roll, pitch, and heading, as well as the camera extrinsic and intrinsic parameters. In [SKSC03], Sheikh uses the DEM of the reference map and the telemetry data to produce a projective projection of the reference image that has the same viewing angle as the UAV frame, so that the two images have roughly the same orientation and scale. To perform further alignment, the projection matrix of the reference map is iteratively updated to maximize the overall normalized cross correlation over the entire image. In his later work [SKS04], instead of warping the reference map, the UAV frames are ortho-rectified. Then, a feature-based registration is performed by maximizing a similarity measure for salient features on the basis of a transformation model, where the similarity measure is defined using normalized cross correlation, and the features are detected using Gabor filters. In [WHH+], Wildes uses a similar approach to [SKSC03] and projects the reference map to the UAV frames to provide a coarse alignment. It then starts from a local alignment that establishes correspondences using normalized cross correlation. Subsequently, a global alignment is performed via a procedure that simultaneously considers all local correspondences in all frames to estimate a set of alignment parameters that are globally optimized. This two-stage alignment scheme iterates in a progressive refinement manner, where a pure translation model is estimated at the beginning, and a projective model is estimated in the end.
Note that most of these works use normalized cross correlation (NCC) as the similarity measure for establishing correspondences between the UAV frames and the reference map. As a large period of time may elapse between the capture of the two types of images, the same scene may appear very different due to illumination changes, seasonal changes, or human construction. NCC in this case may not be robust enough. We therefore propose another similarity measure to serve this purpose, as described next.

3.1.3 Overview of our approach

To perform image registration in a progressive manner, while still preserving accuracy, we propose to use an associated map image as a global reference, i.e., we register the UAV frames to a map. There is another benefit to such a registration. By registering images to a map, objects in the image are associated with geo-coordinates, which are more physically meaningful. Moreover, it is then possible to integrate information from a geographic information system (GIS), such as road networks, to perform more advanced image analysis tasks, such as video annotation from geospatial databases. However, the UAV images are captured at different times and from different views than the map image. The illumination and the dynamic content (such as vehicles, trees, shadows, and so on) can be very different, and direct registration may fail. Therefore we use a two-step procedure to solve the problem.

Figure 3.1: Flow chart of our UAV image registration to a reference map

Figure 3.1 illustrates the flow chart of our approach. Let M denote the map image. We assume the scene depth is small with respect to the distance from the UAV camera, and use a homography to represent the transformation between UAV frame I_i and M, similarly to Equations 3.1 and 3.2:

    x_M = H_{i,M} x_i                                   (3.3)
    I^M_i = H_{i,M} I_i                                 (3.4)

The first frame is manually aligned to the map, i.e., H_{0,M} is given. In the first step, we estimate H_{i,i−1} from the point correspondences established between images I_i and I_{i−1}. With H_{i,i−1}, we can roughly align I_i with M:

    H_{i,M} ≈ H_{i−1,M} × H_{i,i−1}                     (3.5)

The first step provides an initialization which roughly compensates for the scale and rotation between the UAV image and the map, and reduces matching ambiguities. This allows us to directly register I_i to the map, and derive a tuning homography H' that compensates for the registration error in H_{i,i−1}:

    H_{i,M} = H' × H_{i−1,M} × H_{i,i−1}                (3.6)

Registering a single frame to the map may not produce a smooth transition [LYM07]. Instead, we combine multiple temporally adjacent frames to form a local mosaic, and register it to the map. Results show that we produce seamless mosaics with this approach.

In the following sections, we first describe the method we use to perform inter-frame registration. Then we introduce our method to perform a direct image-to-map registration. In the experimental results, we demonstrate the registration with and without a global reference image on two UAV sequences.

3.2 UAV Inter Frame Registration

To compute the transformation H_{i,i−1}, we extract feature points from both I_i and I_{i−1}, and match them to form correspondences. After trying many kinds of features, we selected SIFT [Low04] features. SIFT features are invariant to image scale and rotation, and provide robust descriptions across changes in 3D viewpoint. In the feature matching step, we use nearest neighbor matching [BL]. Then we use RANSAC [FB81] to filter outliers from the set of correspondences and derive H_{i,i−1}.
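To make this step concrete, the sketch below estimates H_{i,i−1} from matched point coordinates with RANSAC and chains it with the previous frame-to-map homography as in Equation 3.5. It is a minimal OpenCV/NumPy illustration, not the exact implementation of this chapter; the placeholder point lists, the 3-pixel RANSAC threshold, and the identity initialization of H_{i−1,M} are assumptions.

```python
import cv2
import numpy as np

# Matched point coordinates between frame I_i and frame I_{i-1}
# (e.g., from SIFT nearest-neighbor matching); placeholder values here.
pts_curr = np.float32([[100, 120], [400, 80], [250, 300], [600, 420], [50, 350]])
pts_prev = np.float32([[102, 118], [403, 79], [252, 297], [604, 418], [53, 346]])

# H_{i,i-1}: maps points of I_i into the coordinates of I_{i-1}; RANSAC rejects outliers.
H_i_prev, inlier_mask = cv2.findHomography(pts_curr, pts_prev, cv2.RANSAC, 3.0)

# Chain with the previous frame-to-map homography (Equation 3.5):
#   H_{i,M} ~= H_{i-1,M} x H_{i,i-1}
H_prev_map = np.eye(3)          # placeholder for H_{i-1,M} from the previous iteration
H_i_map = H_prev_map @ H_i_prev

# A point x_i in frame I_i then maps to the reference as x_M ~ H_{i,M} x_i (Equation 3.3).
x_i = np.array([100.0, 120.0, 1.0])
x_M = H_i_map @ x_i
x_M /= x_M[2]                   # back to inhomogeneous coordinates

# Warping the whole frame into map coordinates (Equation 3.4) would be, for example,
# cv2.warpPerspective(I_i, H_i_map, (map_width, map_height)).
```

The drift discussed next comes from repeatedly applying this chaining: any small error in an individual H_{k,k−1} is multiplied into every later H_{i,M}.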
Having H_{i,i−1} and given H_{0,M}, we can roughly register the UAV image to the map by estimating H_{i,M} as

    H_{i,M} ≈ H_{i−1,M} × H_{i,i−1} = H_{0,M} × ∏_{k=1..i} H_{k,k−1}        (3.7)

This shows that if there is a subtle transformation error in each H_{k,k−1}, these errors accumulate and result in a significant overall error. Thus, a direct registration that establishes correspondences between the UAV image and the map to refine the homography is necessary.

3.3 UAV-to-Map Registration

Registering an aerial image to a map is a challenging problem [Med, SKSC03, WHH+]. Due to significant differences in lighting conditions, resolution, and 3D viewpoint between the UAV image and the map, the same point may not be detected in both images, nor yield similar SIFT descriptors. Therefore, poor feature matching and poor registration can be expected. Since it is difficult to register a UAV image to the map directly, we make use of H_{i,i−1} derived from the UAV inter-frame registration described in the previous section, and approximate H_{i,M} (Equation 3.5). Our goal is to derive a tuning homography H' that compensates for the error introduced in the UAV inter-frame registration, so that the image is accurately aligned to the map.

The advantage of this approach is that with the approximated H_{i,M}, I_i is roughly aligned to M. We can then perform a local search for correspondences at the same scale and orientation. Therefore, the matching ambiguity and the computation time are far less than when directly registering I_i to the map.

3.3.1 UAV-to-Map Correspondence Establishment

To compute H', we have to find correspondences between Ĩ^M_i and the map M, where Ĩ^M_i is the image obtained by warping I_i using the approximated homography, i.e.,

    Ĩ^M_i = (H_{i−1,M} × H_{i,i−1}) I_i                 (3.8)

We use a window-based method to establish correspondences. A point p in M is matched to the point q in Ĩ^M_i that has the highest similarity measure, defined using mutual information [VWMW97], namely,

    q = argmax_x mi(w(x, Ĩ^M_i), w(p, M)),              (3.9)

where mi(·) is the function that computes the mutual information between two image patches, and w(x, I) is an image patch in I centered at point x. The search can be performed efficiently by looking at patches in the local area, since Ĩ^M_i and M are roughly aligned.

In information theory, mutual information measures the mutual dependence of random variables. Taking two images as the random variables, it measures how much information the two images share, or how much one image depends on the other. The use of mutual information as the similarity measure is justified because the UAV images and the reference map are captured at different times and from different views: the illumination and the dynamic content can be very different. As mutual information requires no a priori model of the relationship between scene intensities in different views, this measure is more robust under our conditions than measures such as cross-correlation or the sum of squared differences. We give a formal definition of mutual information later in Section 5.5.
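As an illustration of the similarity measure in Equation 3.9, the sketch below computes the mutual information of two equally sized grayscale patches from their joint intensity histogram. It is a minimal NumPy version for clarity; the actual (and much faster GPU) implementation is discussed in Section 5.5, and the choice of 32 histogram bins here is an assumption.

```python
import numpy as np

def mutual_information(patch_a, patch_b, bins=32):
    """Mutual information of two equally sized grayscale patches (intensities 0..255).

    MI(A, B) = H(A) + H(B) - H(A, B), estimated from a joint intensity histogram.
    """
    a = np.asarray(patch_a, dtype=np.uint8).ravel()
    b = np.asarray(patch_b, dtype=np.uint8).ravel()
    # Joint histogram of quantized intensities, normalized to a joint probability table.
    joint, _, _ = np.histogram2d(a, b, bins=bins, range=[[0, 256], [0, 256]])
    p_ab = joint / joint.sum()
    p_a = p_ab.sum(axis=1)          # marginal over patch A intensities
    p_b = p_ab.sum(axis=0)          # marginal over patch B intensities

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    return entropy(p_a) + entropy(p_b) - entropy(p_ab)

# Example: identical patches share maximal information; independent noise shares little.
rng = np.random.default_rng(0)
patch = rng.integers(0, 256, size=(100, 100))
noise = rng.integers(0, 256, size=(100, 100))
print(mutual_information(patch, patch))   # high
print(mutual_information(patch, noise))   # close to zero
```

Matching a sample point p then amounts to evaluating this score for every candidate position x inside the local search window around p and keeping the maximum, as in Equation 3.9.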
However, mutual information is computationally demanding, as it requires computing the entropy and the joint entropy of two variables, which involves estimating the marginal and joint probabilities of each sample pixel. Our goal is to provide a stabilized image sequence for real-time moving object detection. Therefore, speeding up the mutual information computation becomes a critical issue. We have ported the computation to a GPU platform and achieved a factor of 400 speed-up, as described later in Section 5.6.2.

3.3.2 Extracting Reasonable Correspondences

Figure 3.2: The correspondences between a map (left) and a UAV image (right). The red points are correspondences whose spatial distances are over a threshold, and which are excluded from the input to RANSAC. The green points and blue points are the RANSAC inliers and outliers respectively.

We can generate as many correspondences as we want by matching different pixels in the map. In our experiments, the correspondences generated can be classified into three categories: outliers, correspondences not on the reference planar surface, and correspondences on the reference plane. For image patches in homogeneous areas, such as those containing only road or field, the similarity measure is meaningless and tends to produce outliers. Even if the correspondences are correct, they may come from structures that do not belong to the reference plane, such as roofs, and should be discarded when deriving the homography. Therefore, we need to filter out the first two types of correspondences, and use only the correct ones on the ground plane when deriving H'.

To achieve this, we rely on the approximated homography. As mentioned for Equation 3.9, with a good approximation of H_{i,M}, a point p in M should have its correspondence q in Ĩ^M_i nearby. Hence, we only take the correspondences whose spatial distance is under a threshold when deriving H'. As shown in Figure 3.2, the red points illustrate the correspondences that are not close enough to each other. They are most likely to appear on structures that cause parallax, or in homogeneous regions.

3.3.3 Local Mosaic

Figure 3.3: Results of registering a single frame vs. a local mosaic. The left figure is the result of registering only a single frame. The dashed lines highlight places of discontinuity. The right figure is the result of registering a local mosaic in each iteration, which produces a smooth transition.

In addition to using the filtered correspondences as described in the previous sections, we can include the inlier correspondences computed in the previous frames to estimate H'. Namely, we are registering a local mosaic, defined as {I^M_i} (we use 10 for the set size), to the map.

There are several advantages to registering a local mosaic instead of only a single frame to the map. First, it provides a large number of correspondences as input to RANSAC, so the derived H' is more robust. Second, since we use correspondences from multiple frames to derive H', the correspondences in I_i have only a marginal effect, which implies that H_{i,M} will not change abruptly. Most importantly, we can consider the correspondences in the previous frames as prior knowledge when deriving H'. Therefore, even if the correct correspondences in I_i are not dominant after extracting good correspondences as described in the previous section, they will still stand out after performing RANSAC. As shown in Figure 3.3, registering only a single frame results in significant discontinuities, while registering a local mosaic results in a smooth transition.
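A minimal sketch of this local-mosaic strategy is shown below: correspondences that drift more than a threshold from their predicted positions are discarded (Section 3.3.2), the surviving pairs are pooled with the RANSAC inliers kept from the last few frames, and RANSAC on the pooled set yields the tuning homography H'. The window size, distance threshold, and data layout are assumptions for illustration, not the exact implementation of this chapter.

```python
import cv2
import numpy as np
from collections import deque

MAX_SHIFT = 15.0   # assumed max distance (pixels) between a warped sample and its match
WINDOW = 10        # number of recent frames pooled into the local mosaic

# Sliding window of per-frame inlier correspondences, one Nx4 array per frame:
# columns 0..1 hold the matched point q in the warped frame, columns 2..3 the point p in M.
history = deque(maxlen=WINDOW)

def tuning_homography(map_pts, matched_pts):
    """Estimate H' from the current frame's matches plus inliers pooled from recent frames."""
    map_pts = np.float32(map_pts)
    matched_pts = np.float32(matched_pts)

    # Keep only correspondences that stay close to their predicted position; distant ones
    # tend to lie on off-plane structure (e.g., roofs) or in homogeneous regions.
    close = np.linalg.norm(map_pts - matched_pts, axis=1) < MAX_SHIFT
    frame_pairs = np.hstack([matched_pts[close], map_pts[close]])

    # Pool with the inlier correspondences kept from the previous frames (the local mosaic).
    pooled = np.vstack([frame_pairs] + list(history)) if len(history) else frame_pairs
    src, dst = pooled[:, :2], pooled[:, 2:]

    H_prime, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)

    # Remember only the current frame's RANSAC inliers for future frames.
    n = len(frame_pairs)
    keep = inlier_mask.ravel()[:n].astype(bool)
    history.append(frame_pairs[keep])
    return H_prime
```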
3.4 Experimental Results

We show results on two data sets. The UAV image sequences are provided with latitude and longitude information. The satellite images are acquired from Google Earth. The size of each UAV image is 720×480. We manually register the first frame of each UAV sequence to its corresponding satellite image, i.e., H_{0,M} is given.

In each UAV-to-map registration step, we select 200 key points in the UAV image as samples. We use Harris corners instead of SIFT features, since we are only interested in obtaining the key point positions, not their descriptors; extracting SIFT features here would be a computational overhead. We require the distance between any two samples to be no less than 10 pixels. For each sample, an image patch of size 100×100 is used to compute the mutual information, and the neighborhood region where we search for the best match is a window of size 40×40. We found a window size of 100×100 to be a proper size for a discriminative local feature in our UAV image registration. Since the mutual information computation is very costly, we only perform UAV-to-map registration every 50 frames.

The results of sequence 1 without and with UAV-to-map registration are shown in Figure 3.4 (a) and (b) respectively, and the results of sequence 2 without and with UAV-to-map registration are shown in Figure 3.4 (c) and (d) respectively. We can see that by using a global map as a reference and performing UAV-to-map registration, there is no drift at the end of the sequences. We have also implemented a GPU version of the mutual information computation, and achieved a significant speed-up. This effort is described in detail in Section 5.5. Table 3.1 compares registration with and without UAV-to-map registration in the two examples.

Table 3.1: Experimental results of two UAV sequences

                                               Sequence #1            Sequence #2
                                               with map  without map  with map  without map
  Number of frames                             1000      1000         900       900
  Total registration time (minutes)            349       83           322       75
  Average per-pixel error in the last frame
  compared with ground truth (pixels)          6.24      12.04        3.16      109.98

3.5 Summary and Concluding Remarks

We have proposed a new method to improve the accuracy of mosaicing. An additional map image is provided as a global reference to prevent the accumulation of error in the mosaic. We use mutual information as a similarity measure between two images to generate correspondences between an image and the map.

The main limitation of our approach is the assumption that the scene structure is shallow compared with the height of the camera. When the UAV camera is low, structure-induced parallax is strong, and the similarity measured by mutual information may produce an ambiguous affine registration. In such a situation, the registration has to be performed by explicitly taking the 3D structure of the scene into account, as described later in Chapter 6.

Figure 3.4: Registration results of two UAV sequences. (a) and (b) show the registration results of the first case without and with a further registration to the map respectively; (c) and (d) show the registration results of the second case without and with a further registration to the map respectively. Note the great distortion at the end of the image sequence in (a) and (c).
Chapter 4: Retinal Image Registration: A 2D Approach

4.1 Introduction

Multi-modality image registration is a fundamental task in medical image applications. Images captured from different sensors or across time frames often have different modalities, and the alignment of these images is critical. Retinal image registration is one of these applications. In retinal diagnosis, patients are injected with a sodium fluorescein dye. With a fundus camera designed to photograph the interior surface of the eye (including the retina, optic disc, macula, and posterior pole), photographic and fluorescein angiographic images are then captured at multiple stages of the circulation, which reveal symptoms of possible eye diseases, such as leaky blood vessels. Figure 4.1 and Figure 4.2 show a fundus camera and a few images from a typical retinal image sequence respectively. Retinal image registration serves here as an important tool that helps analysts integrate information from multiple scans, or localize abnormalities.

4.1.1 Issues

The main issue in retinal image registration is to find features that are invariant across different modalities for robust matching. As shown in Figure 4.2, the intensity of the images varies substantially. Similarity measures based on intensity values cannot work in such circumstances. Moreover, the extracted features may not be descriptive enough, which results in a noisy matching process where a large number of false correspondences is produced.

Figure 4.1: A retinal fundus camera
Figure 4.2: Retinal images in different modalities

4.1.2 Related Work

Several attempts have been made at finding invariant features in retinal images, where the use of vascular structure [CFF06, CF, HKG00, SAN+04, CC] is the most popular due to its nice properties: 1) it is widely spread over the whole retina, and 2) it is very stable in position. Choe [CC] presents a similar but more local approach, which extracts Y features at vessel junctions. These features work well in general, but extracting such high-level features can be time-consuming. Cattin [CBGS06] proposed to use SURF features, which is more computationally efficient. However, it did not address the issue of images with very different intensity profiles.

To align images, in addition to the generally adopted affine model, Can [CSRT02] showed that the curved nature of the retina is best taken into account by a quadratic model. Stewart [STR03] proposed the dual-bootstrap iterative closest point algorithm, which has an almost perfect registration rate. It starts from a local region where two landmarks (branching and cross-over points) are matched, and a lower-order transformation model is estimated. The region is then gradually expanded by aligning the detected blood vessel centerlines using the Iterative Closest Point (ICP) algorithm, from which a higher-order quadratic model is estimated. Similarly, Chanwimaluang et al. [CFF06] estimate a lower-order translation model by maximizing mutual information, and gradually upgrade to higher-order affine or quadratic transformations if necessary. However, none of these models can account for the anatomical features of the retina. In fact, the correct approach must be performed in a 3D manner, which we describe in the next chapter.

4.1.3 Overview of Our Approach

In this work, we use a homography model for image registration. The use of a homography model can be justified as follows: First, we only register a local portion of the retinal surface, which can be considered near planar.
Second, a homography model has fewer parameters and is easier to estimate, especially when the matches are noisy. Figure4.3showstheflowchartofourapproach. First, each imageispreprocessedand transformed into an edge response image, from which a set of SIFT [Low04] features are extracted. Since the edge response is relatively stable across modalities, SIFT features extracted in such a way are more stable. However, they are still difficult to match. Our main contribution in this work is the proposed Iterative Nearest Neightbor Matching algorithm which can gradually reduce the ambiguity in a noisy matching process, and produce accurate homography estimates. This method registers most image pairs in 27 the image set. The result is further improved by chained registration, in which the homography of a poorly registered image pair is estimated by chaining a sequence of related accurate homographies. Our method has several advantages: 1. High registration rate. Our method produces over 97% registration rate over 40 image sets with 574 image pairs. 2. It makes no assumptions about the content of the images.Compared with [STR03, CC] which extracts features from vessel junctions, our method extracts SIFT features that can be used in other multi-modality image registration problems that do not present a similar vascular structure. 3. Itisfast. SIFTfeatureextraction ismuchfasterthanhighlevel featureextraction. The most time-consuming process is the iterative feature matching that can be limited by using fewer iterations. Compared with [CC], our approach is 8 times faster. 4. It can be extended to 3D registration. The correspondences produced from SIFT feature matching are sub-pixel level accurate, and their numbers are large enough, which is suitable for 3D registration. Iterative nearest neighbor matching Chained registration Image preprocessing SIFT feature extraction Image sequence Registered image sequence edge response images SIFT features Initial homography estimates Globally optimized homographies INPUT OUTPUT Figure 4.3: Flow chart of retinal image registration In the following sections, we first show our feature extraction approach. The Itera- tive Nearest Neighbor Matching algorithm is introduced next, followed by the chained registration. Finally, we give the summary of our work. 28 4.2 SIFT on Edge Response Images Figure 4.4: Retinal images and their corresponding edge responses SIFT feature extraction can produce a large number of descriptive features that increases the number of reliable matches. Edge information can be retrieved easily, and is widely spread across the image. More importantly, edge response preserves the gradient magnitude of the surface and ignores its gradient direction, which is relatively invariant in multi-modality imagery. [KSS] uses a similar idea, gradient mirroring, which associates opposite gradient directions in the SIFT descriptor. Although this approach is invariant to contrast reversals, the descriptor is less descriptive and would degrade the performance of matching. Figure 4.4 shows two retinal images and their corresponding edge response images. Though the two images are quite different in intensity structure, their edge responses are similar. It is expected that features extracted from edge response images are more likely to be matched. 29 (a) (b) (c) (d) Figure 4.5: Image preprocessing at each stage. In the top row, the left and right image are the original input and the computed binary filter respectively. 
Following each stage of the process, (a) is the contrast enhanced image, (b) is the denoised image, (c) is the corresponding edge map, and (d) is the final contrast enhanced edge map. 4.2.1 Image Pre-processing To extract a sufficient large number of SIFT features from the edge response images, we preprocess the images by enhancing their contrast and removing noise. First, a single binary filter is automatically created to remove the black border which appears at the four corners of the image (the location of the black border is fixed over the entire image sequence). Note that the intensity value in the black border is not strictly 0, which means a threshold must be determined. In addition, we cannot compute the threshold from an arbitrary image since it may be too dark to determine a threshold. Hence, we select one of images with average intensity value larger than 80, and compute its intensity histogram of bin size equals 16. Then the threshold is determined by finding the bin that corresponds to the global minimum in the lower band (intensity< 64) of the histogram. This approach generates accurate binary filters over all tested data sets. Figure 4.5 shows an example of the binary filter. Next, the intensity histogram inside the binary filter is equalized to a Gaussian distri- bution with μ = 128,σ = 48. In images that have very low contrast, such an operation 30 generates much noise. We denoise such images with a Non-local Mean Filter [BCM]. Then we compute the edge response on the contrast enhanced images using a Sobel fil- ter. Finally, we use contrast limited adaptive histogram equalization (CLAHE) [Rez04] to enhance the contrast of the edge. In our experiments, such a preprocessing yields the greatest number of true matches. Figure 4.5 shows a low contrast image at each stage of the process. 4.2.2 Experimental Results We compare the quality of matching using SIFT features extracted from edge response images with those from the original images. We select 10 pairs of images that have very different intensity profiles. In each image pair, the SIFT features on the original images and edge response images are extracted and matched respectively. We also label some control points so that we can compute the ground truth homographies of the image pairs and the number of true matches available. The true matches are the ones that comply with the ground truth homography. Figure 4.6 shows that matching features extracted from edge response images produce more true matches than original images. (a)(b) Figure 4.6: Number of true matches in the original images and edge response images 31 Figure 4.6 also plots the number of true matches and total matches in the edge response images. It shows that the number of true matches are too small for RANSAC to estimate a good homography model. The issue now becomes how to estimate an accurate homography from noisy matches. 4.3 Feature Matching in Two Images Since our approach uses some of the techniques in [BL], we start by reviewing their method. Let Γ i and Γ j denote the SIFT features extracted from I i and I j respectively. To match features, a nearest neighbor matching (NN-Match) algorithm is used. Each SIFT feature in Γ i is matched to its nearest neighbor in Γ j by computing the Euclidean distance in the feature space. Moreover, to prevent false matching, the distance of the nearest neighbor has to be less than the second-nearest neighbor by a ratio (we use 0.8). 
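As a reference point for the discussion that follows, the NN-Match step with the ratio test can be sketched as below. This is a minimal NumPy version; the function name and the brute-force distance search are our assumptions, not the exact implementation used in this work.

```python
import numpy as np

def nn_match(desc_i, desc_j, ratio=0.8):
    """Nearest-neighbor SIFT matching with the ratio test.

    desc_i, desc_j: (Ni, 128) and (Nj, 128) arrays of SIFT descriptors.
    Returns a list of (index_in_i, index_in_j) matches.
    """
    if len(desc_j) < 2:
        return []
    matches = []
    for a, d in enumerate(desc_i):
        dists = np.linalg.norm(desc_j - d, axis=1)
        order = np.argsort(dists)
        nearest, second = order[0], order[1]
        # Accept only if the best match is clearly better than the runner-up.
        if dists[nearest] < ratio * dists[second]:
            matches.append((a, nearest))
    return matches
```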
Note that with a larger search area, there will be more matching candidates and less probabilityforafeaturetobematched. LetM i,j denotethesetofmatchesproducedfrom NN-Match, H i,j denotes the unknown homography. To estimate H i,j , RANSAC [FB81] is employed to perform a robust estimation. It performs several iterations on M i,j . At each iteration, a homography model is built and its corresponding inlier matches are determined. The best homography estimate is the one with the largest number of inliers. In our data sets, the displacement between images could be large. To account for large displacement, the size of the search neighborhood should be large. However, as described earlier, this results in fewer matches, which results in a less accurate homog- raphy estimate. Hence, there is a trade off between the displacement we can account for and the number of matches we can produce. We propose Iterative Nearest Neighbor Matching algorithm to solve the problem. 32 4.3.1 Iterative Nearest Neighbor Matching The idea is, given a rough estimate of H i,j , the correspondence of a feature x ∈ Γ i can be found in a neighborhood around H i,j x in image I j . We can thus perform NN-Match in a smaller region, and increase its chance to be matched. With more matches, a more accurate homography can be estimated. Our approach is shown in algorithm 4.8. It starts with a large search area with radius r = r max and assume the homography H i,j as identityI. At each iteration, every feature x ∈ Γ i is matched to its nearest neighbor in Γ j wherethematchingcandidatesareinthesearchareacentersatH i,j xwithradiusr.Then RANSAC is employed on the matches to estimate a more accurate homography for the next iteration. It iterates until the smallest search area r = r min is reached. Figure 4.7 shows a typical progress of matching. It starts from a search radius of 200 (image size 640 × 512), and ends with a search radius of 25. Each green point corresponds to a pair of match which is an inlier of the estimated homography. As the process iterates over time, more features are matched correctly, and a more accurate homography is estimated. Although this approach appears to be similar to ICP algorithm [Zha94], it is different as the estimated homography in the previous iteration can be used in the next to narrow down the search region and produce more reliable matches. The presented method maximizes the number of reliable matches in each image pair, and produces an initial set of pairwise homographies. In the next section, we show how these pairwise relationships can be linked to achieve global optimization. 4.4 Chained Registration Chained registration is a global optimization approach after all the images are pairwise registered. With the pairwise registration result, images can be registered to one another indirectly. This is illustrated in Figure 4.9 (a). Let a circle and a cross in row i column j denotes a successful and a failed registration from image I i to I j respectively. In the example, I 2 cannot be registered to I 3 directly. However, since I 2 can be registered to I 1 33 Search radius = 200 #Inliers = 12 Search radius = 100 #Inliers = 33 Search radius = 50 #Inliers = 89 Search radius = 25 #Inliers = 132 Figure 4.7: A typical matching results in each iteration of Iter-NN match 34 Figure 4.8: Iterative Nearest Neighbor Matching algorithm directly, and I 1 can be registered to I 3 directly, I 1 becomes the bridge to register I 2 to I 3 . 
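Since the next step modifies the initialization in line 1 of Algorithm 4.8, a compact sketch of the iterative matching loop may be useful here. It is a minimal version only; the radius-halving schedule, thresholds, helper names, and the use of OpenCV's RANSAC homography estimator are our assumptions.

```python
import numpy as np
import cv2  # assumed available for RANSAC homography estimation

def iter_nn_match(kp_i, desc_i, kp_j, desc_j,
                  r_max=200.0, r_min=25.0, H_init=np.eye(3)):
    """Iterative Nearest Neighbor Matching (sketch of Algorithm 4.8).

    kp_*: (N, 2) keypoint positions; desc_*: (N, 128) SIFT descriptors.
    The homography estimated at one iteration restricts the search area
    of the next, while the search radius shrinks toward r_min.
    """
    H, r = H_init.copy(), r_max
    while r >= r_min:
        matches = []
        for a, (p, d) in enumerate(zip(kp_i, desc_i)):
            # Predicted position of p in image j under the current H.
            q = H @ np.append(p, 1.0)
            q = q[:2] / q[2]
            # Candidates restricted to a disc of radius r around q.
            cand = np.where(np.linalg.norm(kp_j - q, axis=1) < r)[0]
            if len(cand) < 2:
                continue
            dists = np.linalg.norm(desc_j[cand] - d, axis=1)
            o = np.argsort(dists)
            if dists[o[0]] < 0.8 * dists[o[1]]:
                matches.append((a, cand[o[0]]))
        if len(matches) >= 4:
            src = np.float32([kp_i[a] for a, _ in matches])
            dst = np.float32([kp_j[b] for _, b in matches])
            H_new, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
            if H_new is not None:
                H = H_new
        r /= 2.0
    return H, matches
```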
We can perform another run of registration, where at algorithm 4.8 line 1, we initialize H 2,3 = H 1,3 H 2,1 instead of I,and r = r min instead of r max . In other words, we perform a NN-Match directly in a small area that roughly centered at the matching point. In such a way, we can overcome failure pairs that do not have enough true matches for Iter-NNM to converge to a correct homography. 123 4 1 2 3 4 to from 123 4 1 0.03 0.01 0.02 2 0.03 0.5 0.05 3 0.01 0.5 0.01 4 0.02 0.05 0.01 to (b) Pair-wise registration cost (c) Shortest path 123 4 1 0.03 0.01 0.02 2 0.03 0.04 0.05 3 0.01 0.04 0.01 4 0.02 0.05 0.01 to from from (a) Direct registration Figure 4.9: Chained registration 4.4.1 Cost Function for Chained Registration To select the best images for chained registration, we use the shortest path algorithm, in which each node represents an image. The internal nodes along the shortest path from 35 i to j are the images used for chained registration. The edge cost C i,j is defined as the inverse of the number of inliers of H i,j C i,j =1/#inlier(H i,j ) (4.1) Figures 4.9 (b) and (c) give a simple example. (b) is the cost table that results from direct pair-wise registration, in which the registration from 2 to 3 has fewer homography inliers (higher cost). To improve it, we run an additional direct registration in which we initialize the homography as H 1,3 H 2,1 since I 2 → I 1 → I 3 is the shortest path from I 2 to I 3 . We run such a registration over each image pair (I i ,I j ). Every time #inlier(H i,j ) increases, all image pairs that have their shortest paths going through edge I i I j are re-registered again since the initialization may improve. The process continues until no image pairs get more homography inliers. Note that the final number of inliers could still be low so that the corresponding shortest path goes through other nodes. In such a case, we use the chained homographies along the shortest path as the best homography estimate. Finally, we use all-pair shortest path algorithm [GGST86] on the set of images. The image with the least total cost to all the other images is selected as the reference frame I ref , ref=argmin i j C i,j (4.2) where C i,j is the minimum cost from I i to I j . 4.4.2 Experimental Results We compare three methods, the NN-Match, the iterative NN-Match and the iterative NN-Match + chained registration over 10 data sets. Figure 4.10 plots the successful registration rate. It shows that by using chained registration, most of the data sets’ registration rate can be improved to 100%. 36 Figure 4.10: Registration rate comparison of NN-Match, Iterative NN-Match, and Iter- ative NN-Match + Chained registration We perform another thorough test over 40 data sets. Among all 574 image pairs which contain large image variance in terms of intensity, scale, and displacement, only 16 images are not registered (97.2% registration rate). The average time required for an image set with 20 images is 40 minutes. On the machine with the same computational power, the Y feature based registration method takes 300 minutes. Figure 4.11 shows one of the difficult cases for which our method can still produce accurate registration. (a) and (b) are two images significantly different from each other. (c) is the registration resultshowninacheckerboardview. (b) (c) (a) Figure 4.11: Checkerboarddisplay ofthe registration resultof2images withlarge modal- ity difference. Figure 4.12showsoneofthefailurecases. (a) istheintensity histogram oftheoriginal image. 
We can see the entire image information is squeezed within 10 discrete levels. (b) and (c) are the results of the original image thresholded at 9 and 11 respectively. 37 (a)histogram (b)I>9 (c)I>11 Figure 4.12: Poor dynamic range 4.5 Summary We have presented a 2D retinal image registration approach. SIFT features are extracted from edge maps and iteratively matched to produce a larger set of reliable matches, from which accurate homographies can be estimated. The chained registration further optimized the pairwise registration results globally by registration in an indirect way. Compared with state-of-the-art 2D retinal image registration methods, ours achieves a near perfect registration rate over a massive real data set, while the time required is much less. This module is currently deployed in Doheny Eye Institute, where tera bytes of retinal images from patients are being processed. 38 Chapter 5 Retinal Image Registration: 3D Reconstruction 5.1 Introduction In the previous chapter, we presented a 2D registration approach that achieves high registration rate in less time than the approach in [CC]. However, the 2D approach assumes planarity of the retinal surface, which results in residual error where the surface do not agree with modeled plane. We address the problem in this chapter, and present an advanced retinal image registration approach through 3D metric reconstruction to account for the anatomical features of the retinal fundus. The reconstruction of the retinal surface not only results in better accuracy, but also provide 3D visualization, which is very useful in diagnosing eye disease, such as blisters that result from lesions. 5.1.1 Issues The issues are the same as those in 2D retinal image registration, since we also need point correspondences to recover the epipolar geometry. In addition, the new issue here is the 3D metric reconstruction of the retinal surface. In the classical approach, a metric reconstruction starts from projective reconstruction, then upgrades to affine reconstruc- tion, and finally upgrades to metric reconstruction. In the projective space, the cam- era matrices corresponding to a fundamental matrix Fcanbechosenas P 1 =[I|0], P 2 =[[e ] × F|e ], where e is the epipole. With the camera matrices, the 3D structure X in the projective space can be identified through triangulation. To upgrade the recon- struction from projective space to affine space, we need to identify the transformation 39 H = I|0 π∞ such that P A 1 = P 1 H −1 , P A 2 = P 2 H −1 , and X A = HX,where π ∞ is the plane at infinity, which requires 3 vanishing points to identify. In our imagery, only a local portion of the retinal surface is captured, which is considered as near planar. As pointed out in [CM], the shallowness of the surface makes it difficult to determine the vanishing point in the depth direction. Hence, the plane at infinity cannot be found, and the stratified reconstruction cannot proceed, i.e., the projective reconstruction cannot be upgraded to a metric reconstruction. Another issue is the dense depth inference. The difficulty here is the generation of dense correspondences from multimodal imagery. 5.1.2 Related Work To our knowledge, Choe’s work [CM] is the only one that explicitly registers retinal images through the reconstructed 3D retinal surface. He addressed the reconstruction problem of a near planar surface, and proposed a 3-pass bundle adjustment that recon- structs the sparse 3D structure of the retina and the camera poses jointly. 
The dense reconstruction of the retina is then performed on one of the image pairs using a dense stereo matching approach. However, the image pair is not carefully selected, and the results are not optimal. In our work, we explicitly select the image pair with the widest base line, and reconstruct the optimal structure and camera poses. Results show ours to be more accurate in all cases. 5.1.3 Overview of Our Approach Dense reconstruction Registration by back projection Estimate epipolar geometry 4 pass bundle adjustment Images + point correspondences + homographies Registered image sequence F matrix + F matrix inliers Camera poses + sparse 3D points 3D retinal surface INPUT OUTPUT Figure 5.1: Flow chart of 3D retinal image registration 40 The 3D surface reconstruction and registration processes are illustrated in Figure 5.1. Thepointcorrespondencesandhomographiescomputedfromour2Dregistration method are provided as input to the process. First, the fundamental matrix for each image with respect to the reference frame is estimated. The corresponding fundamental matrix inliers are then input to the proposed 4-pass bundle adjustment process to estimate all camera poses and the sparse 3D structure. To reconstruct the 3D surface, we use the image pair that correspond to the cameras with the widest baseline for dense stereo matching, from which the dense surface depth can be inferred through triangulation. Finally, images are registered to the reference frame by back projection. The novelty of our approach is the 4-pass bundle adjustment whose objective is to estimate the optimal poses of all cameras. In [CM], the camera selection strategy does not take the baseline into account, and produces poor results when two cameras are close. In the following sections, our approach to epipolar geometry estimation is first described. We then present how the camera poses and the dense structures are recon- structed in Section 5.3 and Section 5.4 respectively. In Section 5.5, we present a GPU implementation of mutual information computation. The evaluation of its performance in terms of accuracy is presented next, followed by a summary of our work. 5.2 Epipolar Geometry Estimation 3D reconstruction starts from estimating epipolar geometry. To estimate the fundamen- tal matrix, classical approaches (7-point or 8-point algorithms) do not work well in the presence of near planar structure. Due to the shallowness, the sampled correspondences usually to not provide sufficient constraints to compute the correct fundamental matrix. This is known as a quasi-degenerate case, which has been carefully studied in [FPa]. Hence, it is crucial to verify the validity of these correspondences. We choose to use the plane+parallax method [tL96] for estimating the F matrix. In [tL96], the fundamental matrix is computed as F=[e ] × H,where e is the epipole, 41 (a) (b) e ˜ x 2 ˜ x 1 x 1 x 2 Figure 5.2: Fundamental matrix estimation using plane+parallax method and H is the homography induced by the plane π as shown in Figure 5.2 a. Since we have already estimated the homography H for each image pair previously in 2D regis- tration, we can explicitly select two off-the-plane correspondences (homography outliers) to determine e (Figure 5.2 b), and therefore F can be derived accurately. Using this approach, we compute the fundamental matrix F i with respect to I ref for each image I i in the sequence. 
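The plane+parallax construction just described can be sketched in a few lines. The sketch below assumes homogeneous image coordinates and illustrative variable names; it only shows how the epipole is obtained from two off-plane correspondences and how F follows from it.

```python
import numpy as np

def skew(v):
    # Cross-product matrix [v]_x such that skew(v) @ w == np.cross(v, w).
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

def fundamental_from_plane_parallax(H, x1_a, x2_a, x1_b, x2_b):
    """F = [e]_x H from a plane-induced homography H and two off-plane
    correspondences (homography outliers), as in Section 5.2.

    x1_a, x2_a and x1_b, x2_b: homogeneous image points (3-vectors) of the
    two off-plane points in image 1 and image 2 respectively.
    """
    # Each residual-parallax line joins the warped point H x1 and the
    # observed point x2 in the second image; both lines pass through e.
    line_a = np.cross(H @ x1_a, x2_a)
    line_b = np.cross(H @ x1_b, x2_b)
    e = np.cross(line_a, line_b)   # epipole = intersection of the two lines
    e = e / np.linalg.norm(e)
    return skew(e) @ H
```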
The corresponding set of fundamental matrix inliers, denoted as τ_i, is then input to the 4-pass bundle adjustment (4P-BA) process described next.

5.3 Camera Pose Estimation Using 4-Pass Bundle Adjustment

As mentioned earlier in Section 5.1.1, stratified reconstruction usually fails when the structure is flat. Fortunately, we know that the retinal fundus is roughly fronto-parallel to the camera, which allows us to initialize a set of reasonable 3D points, and in turn initialize a reasonable camera pose for each image. This motivates us to use bundle adjustment, since it is a nonlinear optimization that needs to start from a good initialization. However, such an initialization may still not be good enough. If bundle adjustment optimizes the camera poses and the structure jointly from this initialization, it is very likely to get stuck at a local optimum since there are too many free parameters. Hence, a better initialization is required, and we present the 4-pass bundle adjustment (4P-BA) method.

Figure 5.3: Flow chart of the 4-pass bundle adjustment. For each image I_i, the fundamental matrix and its inliers are estimated from the Iter-NNM correspondences; the first bundle adjustment assumes the 3D points lie on a planar surface and estimates the camera pose; the second fixes the camera pose and estimates the 3D structure; the third refines the camera pose and structure jointly. The fourth bundle adjustment fixes the structure reconstructed from I_best, the camera with the largest baseline to I_ref, and refines the other camera poses.

4P-BA aims at the optimal recovery of all camera poses. Figure 5.3 illustrates the flow of the process. In the first-pass bundle adjustment, we assume the structure X_i lies on a plane fronto-parallel to the reference camera, and estimate the camera pose by minimizing the re-projection error:

\mathrm{Err}(P_i) = \sum_{m^k \in \tau_i} \left\| \mathbf{x}_i^k - P_i X_i^k \right\|^2    (5.1)

where m^k = (\mathbf{x}_i^k, \mathbf{x}_{ref}^k) is the k-th pair in τ_i and P_i = K[R_i | T_i], with K the set of intrinsic camera parameters. In the second-pass bundle adjustment, we fix the camera pose and estimate the 3D structure X_i using bundle adjustment again. With the first and second passes, the camera pose and the 3D structure are optimized separately with fewer free parameters, providing a better initialization for a joint optimization. In the third-pass bundle adjustment, we refine both the camera pose and the 3D structure by minimizing the re-projection error:

\mathrm{Err}(P_i, X_i) = \sum_{m^k \in \tau_i} \left\| \mathbf{x}_i^k - P_i X_i^k \right\|^2    (5.2)

where P_i and X_i are initialized using the output of the first-pass and second-pass bundle adjustment respectively. Note that the first three passes of bundle adjustment are employed on each image to compute its individual camera pose. These camera poses are further globally optimized in the fourth bundle adjustment. Let I_best denote the best image with respect to I_ref for estimating the geometry. After we have estimated the poses of all cameras, we select the camera that has the widest baseline to the reference camera (illustrated as the green camera in Figure 5.3) and use its corresponding image as I_best to estimate the final sparse structure. We then fix this structure and refine all other camera poses. This ensures that the sparse structure is globally optimal, and thus the camera poses derived from it are also optimal. With the camera poses, we still need to establish dense correspondences in order to infer the dense structure, as described next.
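Before moving on to the dense stage, the residual minimized in the first pass (Equation 5.1, with the 3D points held fixed) can be sketched as follows. The rotation-vector parameterization, the SciPy optimizer, and all names are our assumptions; the actual passes also optimize the structure and all cameras as described above.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def pose_only_residuals(pose, K, X, x_obs):
    """Reprojection residuals of Eq. 5.1 with the 3D points X held fixed.

    pose: 6-vector (rotation vector, translation); X: (N, 3) 3D points
    initialized on a fronto-parallel plane; x_obs: (N, 2) image points.
    """
    R = Rotation.from_rotvec(pose[:3]).as_matrix()
    t = pose[3:]
    cam = (R @ X.T).T + t               # points in camera coordinates
    proj = (K @ cam.T).T
    proj = proj[:, :2] / proj[:, 2:3]   # perspective division
    return (proj - x_obs).ravel()

# First-pass bundle adjustment (sketch): optimize the camera pose alone.
# K, X_plane, x_i are assumed inputs; the zero initialization mirrors the
# fronto-parallel assumption described above.
# result = least_squares(pose_only_residuals, np.zeros(6),
#                        args=(K, X_plane, x_i))
```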
44 5.4 Dense Surface Depth Inference Affine Transformation Figure 5.4: Piecewise planar retinal surface To infer the dense surface depth, we start by establishing dense correspondences. However, instead of rectifying the images and formulate the problem as a classical dense stereo matching one as in [CM], we use a piecewise image alignment approach. First, we superimpose uniform mesh points in the reference image, and then use an affine transformation model between each pair of mesh face, as shown in Figure 5.4. In such a setting, finding the dense correspondences between two images becomes a piecewise image alignment problem, and here the alignment is performed by maximizing overall mutual information. In the mutual information maximization process, we compute the derivatives of mutual information over the displacements of the three mesh points for each mesh face. In general, a mesh point is shared by six mesh faces. Therefore the derivatives are the average of its six associated derivatives. Once the derivatives for each mesh point are computed, we find the step length which maximizes the mutual information along the gradient direction, and then recompute the derivatives from there. The process iterates until themutual information converges. Afterthe images are piecewise aligned, thedense 45 correspondence at each pixel can be computed using bilinear interpolation. Along with the camera poses, the dense surface depth can be triangulated. Compared with classical dense stereo matching formulation, ours has two major advantages. First, dense stereo matching finds the correspondence for each pixel, while we only find correspondences for the mesh points, which is more computational efficient. Second, dense stereo matching usually performs global optimization by minimizing a certain energy function that encodes the smoothness constraint on the disparity field. The objective of our optimization is to maximize overall pairwise mutual information, which is more meaningful. Note that the process is extremely computational demanding due to an enormous number of mutual information computations. Fortunately, we have been able to port it onto a GPU platform, and compute it in a more satisfactory time. This is described in more detail in the next section. Finally, with all camera poses and the 3D surface, images are registered to the refer- ence frame by back projection. Let x ref be the projection of X in image I ref ,andB(·) denotes the back projection function: X =B(x ref ,P ref ), (5.3) where P ref is the projection matrix of the reference camera. Image I i is then registered to I ref by x i = P i X = P i B(x ref ,P ref ) (5.4) 5.5 Mutual Information Computation on the GPU Plat- form In this and previous chapters, mutual information is used as the similarity measure for image alignment. As mentioned, computing mutual information is time consuming, 46 and a speed-up is imperative. We present here a CUDA implementation of Viola’s approach [VWMW97] to computing both mutual information and its derivatives. In his approach, the entropy and the joint entropy of two variables is approximated using a Gaussian mixture, which requires enormous number of exponential computations. Thesimplestway toreducethenumberofexponentialcomputations isbyprecomput- ing the Gaussian densities of all possible intensity values into a lookup table. However, if the intensity range is large, the precomputation becomes an overhead. 
Meihe [MSN99] presents an approach that maps all the intensities into a smaller dynamic range, and computes and stores the Gaussian densities for them. Nevertheless, such efforts yield only a limited speed-up, which can still be impractical for many applications. In addition, extra interpolation is required if the intensity values are in floating point precision. We have found a way to parallelize the computation of Viola's algorithm, and achieve a significant speed-up by using the GPU. We program the GPU using CUDA [cud], a new hardware and software architecture for computing on the GPU. Although Shams [SB07] has presented a CUDA implementation of mutual information computation, he uses histograms to compute the probability and joint probability, which is hard to extend to a GPU version of computing mutual information derivatives.

We start by reviewing the formulas for computing mutual information and its derivatives. First, from information theory, the entropy of a random variable is defined as:

h(v) = -\int p(v) \ln p(v)\, dv,    (5.5)

and the joint entropy of two random variables is defined as:

h(u, v) = -\int p(u, v) \ln p(u, v)\, du\, dv.    (5.6)

The mutual information of two random variables u, v is then defined as

mi(u, v) = h(u) + h(v) - h(u, v).    (5.7)

5.5.1 Approximate Value

In [VWMW97], the probability density p(v) is estimated using the Parzen window method:

p(v) \approx \frac{1}{N_A} \sum_{v_j \in A} G_{\psi}(v - v_j),    (5.8)

where N_A is the number of samples in A, and G_{\psi} denotes a Gaussian function with variance \psi. The Parzen window is a widely adopted technique for estimating a probability density, since it directly uses the samples drawn from an unknown distribution and fits a Gaussian mixture model to them, which is robust to noise.

Equation 5.5 can be expressed as the negative expectation of \ln p(v), i.e., h(v) = -E_v(\ln p(v)). Together with Equation 5.8, the entropy of a random variable v is approximated as:

h(v) \approx \frac{-1}{N_B} \sum_{v_i \in B} \ln \frac{1}{N_A} \sum_{v_j \in A} G_{\psi}(v_i - v_j),    (5.9)

where B is another sample set. The joint entropy can be approximated in the same manner by drawing pairs of corresponding samples from the two variables. Hence mutual information is approximated as:

mi(u, v) \approx \frac{-1}{N_B} \Bigl[ \sum_{u_i \in B} \ln \frac{1}{N_A} \sum_{u_j \in A} G_{\psi_u}(u_i - u_j) + \sum_{v_i \in B} \ln \frac{1}{N_A} \sum_{v_j \in A} G_{\psi_v}(v_i - v_j) - \sum_{w_i \in B} \ln \frac{1}{N_A} \sum_{w_j \in A} G_{\psi_{uv}}(w_i - w_j) \Bigr]    (5.10)

in which w = [u, v]^T. There exist other approximation formulas for computing mutual information [KKZ, SHH99], which we have not considered here.

Algorithm 5.5 implements Equation 5.10 to compute the mutual information of two images. I_u and I_v denote the two images, while I(x) is the intensity value at x in image I. B_u, B_v, B_uv and A_u, A_v, A_uv are the values of the outer summation and inner summation in Equation 5.9 respectively. Note that we assume the two images to be independent and approximate the joint probability as the product of the two marginal probabilities (Algorithm 5.5, line 9). We also assume the two images have the same variance \psi, i.e., \psi_u = \psi_v = \psi.

Figure 5.5: Mutual information value algorithm

5.5.2 Approximate Derivative

In image registration problems, we usually want to estimate the transformation T that best aligns two images I_u and I_v. This is formulated as a mutual information maximization in [VWMW97], which looks for the optimal T:

\hat{T} = \arg\max_T \; mi\bigl(I_u(\mathbf{x}), I_v(T(\mathbf{x}))\bigr).

We can think of mi as a function of T.
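Before turning to the derivative, the value approximation of Equation 5.10 (Algorithm 5.5) can be sketched in a few lines of NumPy. This is a minimal illustration only: the function names and the variance value are assumptions, and the joint term uses the product-of-marginals approximation mentioned above.

```python
import numpy as np

def gaussian(d2, psi):
    # 1D Gaussian density evaluated at squared distance d2, variance psi.
    return np.exp(-0.5 * d2 / psi) / np.sqrt(2.0 * np.pi * psi)

def mi_parzen(u_A, v_A, u_B, v_B, psi=4.0):
    """Mutual information approximation of Eq. 5.10 (sketch).

    u_A, v_A and u_B, v_B are the intensities of sample sets A and B drawn
    from the two images; psi is the Parzen window variance.
    """
    def entropy_term(b, a, var):
        # -1/N_B * sum_i ln( 1/N_A * sum_j G_psi(b_i - a_j) )
        d2 = (b[:, None] - a[None, :]) ** 2
        dens = gaussian(d2, var).mean(axis=1)
        return -np.log(dens + 1e-12).mean()

    h_u = entropy_term(u_B, u_A, psi)
    h_v = entropy_term(v_B, v_A, psi)
    # Joint term: w = [u, v]^T, with the joint density approximated as the
    # product of the two marginal Gaussians, as in Algorithm 5.5.
    d2u = (u_B[:, None] - u_A[None, :]) ** 2
    d2v = (v_B[:, None] - v_A[None, :]) ** 2
    joint = (gaussian(d2u, psi) * gaussian(d2v, psi)).mean(axis=1)
    h_uv = -np.log(joint + 1e-12).mean()
    return h_u + h_v - h_uv
```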
To maximize the mutual information, [VWMW97] approximates its derivative with respect to T as follows:

\frac{d}{dT}\, mi\bigl(I_u(\mathbf{x}), I_v(T(\mathbf{x}))\bigr) = \frac{1}{N_B} \sum_{x_i \in B} \sum_{x_j \in A} (v_i - v_j)^T\, W(x_i, x_j)\, \frac{d}{dT}(v_i - v_j).    (5.11)

The weighting factor W(x_i, x_j) is defined as:

W(x_i, x_j) = W_v(v_i, v_j)\, \psi_v^{-1} - W_{uv}(w_i, w_j)\, \psi_{uv}^{-1},    (5.12)

W_v(v_i, v_j) = \frac{G_{\psi_v}(v_i - v_j)}{\sum_{x_k \in A} G_{\psi_v}(v_i - v_k)},    (5.13)

W_{uv}(w_i, w_j) = \frac{G_{\psi_{uv}}(w_i - w_j)}{\sum_{x_k \in A} G_{\psi_{uv}}(w_i - w_k)},    (5.14)

where u_i = I_u(x_i), v_i = I_v(T(x_i)) and w_i = [u_i, v_i]^T.

Algorithm 5.6 shows the implementation of Equation 5.11. The first inner loop computes the denominators in Equation 5.13 and Equation 5.14, and the second inner loop computes the weighting factor in Equation 5.12. In our implementation, the goal is to maximize the mutual information between two triangular patches by moving the triangle vertices in I_v, i.e., dv_i/dT in line 16 is the derivative of an intensity value with respect to the movements of the vertices. This is computed by the chain rule

\frac{dv}{dT} = \frac{dv}{dA} \frac{dA}{dT},

where A is an affine transformation.

Figure 5.6: Mutual information derivative algorithm

It is obvious that in both algorithms, the statistics computed for each element i ∈ B are independent of each other, and can therefore be computed efficiently on a GPU.

5.6 Experimental Results

5.6.1 3D Reconstruction

The first experiment evaluates the accuracy of the recovered camera poses. As depicted in Figure 5.7, we synthesize a near-spherical surface that mimics the shape of the retina, and create a reference camera with zero translation and zero rotation that faces the z-direction, where the surface has a depth of 1000. We then create four sets of images that are translated 10, 20, 40 and 80 away from the reference camera respectively, covering the range found in the real image data. Each set has 4 images that are translated in different directions on the X-Y plane. The cameras are rotated accordingly so that a large area in the two images can overlap.

Figure 5.7: The camera configurations of the experiment. b and d are the length of the baseline and the depth respectively (b = 10, 20, 40, 80; d = 1000).

To understand the camera pose error with respect to the baseline, we use each camera as the best camera and employ 4P-BA to compute the corresponding average error of all cameras. Given the true camera translation and rotation, the relative errors of the estimated translation and rotation are determined by E_trans = \|t_{true} - t\| / \|t\| and E_rot = \|r_{true} - r\| / \|r\| respectively. Figure 5.8 (a) and (b) plot the average relative error of the 4 sets of images. It shows that the camera pose recovered by 3P-BA has an error rate of less than 7.5% when the translation of the camera is 80. It also shows that cameras with the largest baseline (b = 80) have the lowest error. The second experiment evaluates the reconstruction error with respect to different camera poses using the same 16 images. For each image, we recover the corresponding camera pose and estimate the corresponding 3D surface. We then compute the average error of the 3D surface with respect to the baseline.
Given the true position of the 3D structure, the average error of the estimated structure is determined by E_struct = \frac{1}{n} \sum_{j=1}^{n} \| X_{true}^j - X^j \|, where n is the number of points. The result is shown in Figure 5.8 (c). The error decreases significantly when the baseline increases.

Figure 5.8: Errors with respect to different camera baselines (b = 10, 20, 40, 80): (a) translation error, (b) rotation error, (c) reconstruction error, (d) registration error (back projection vs. 2D registration).

The third experiment compares the registration errors of 3D registration and 2D registration. We use the same data set. For each image, we use it as I_best to estimate all camera poses P_i and the corresponding 3D surface. Then we project the ground truth 3D points and the estimated 3D points using P_i_true and P_i respectively and compute the projection error in pixels. We also compute the average 2D registration error. The result is shown in Figure 5.8 (d). The back projection error drops significantly as the baseline increases.

Finally, we compare our method with Choe's on the reconstructed surfaces of real image sets. Without ground truth surface data, we measure the average local depth variance of the surface, since a smooth shallow surface should have a relatively low variance in depth. The local depth variance is computed at every dense point with a neighborhood of size 100. The result is shown in Table 5.1. In all 9 image sets, we produce more accurate surfaces. Particularly in cases 1 and 9, the surfaces reconstructed using Choe's method are totally wrong and have extremely high variance in depth. There is only one case where Choe's method selects the same image for 3D reconstruction. Figure 5.9 shows some of the cases. In each row, the left and right surfaces are the reconstructed results from the same view point using our approach and Choe's respectively. The surfaces on the right side are poorly reconstructed. Our surfaces, on the other hand, are smooth and slightly curved, just the way retinal surfaces should be. Our ability to recover the true 3D shape of an object is very important in retinal diagnosis and other operations where 3D volumes reveal information that images do not.

Image set        1      2      3      4      5      6      7       8      9
4P-BA (ours)     3.93   4.68   4.89   5.01   6.68   4.03   2.45    6.59   3.05
3P-BA (Choe's)   4.13   10.78  5.56   15.43  3.48   6.59   636.18  28.32  8.14
Table 5.1: Average local depth variance comparison of the reconstructed surfaces

Figure 5.9: Comparison of the 3D retina surface reconstruction (left: ours; right: Choe's)

5.6.2 GPU Acceleration

Our experiment on GPU speed-up is performed on the following hardware platforms:

GPU: GeForce 8800GTX      128 streaming processors at 575 MHz, 768 MB DDR3 at 1.8 GHz, 16 KB shared memory per block
CPU: Dell Precision 690   Xeon Quad Core 2.33 GHz, 4 GB DDR2 at 667 MHz
Table 5.2: CPU vs. GPU hardware specifications

In the first experiment, we run a thousand iterations of both implementations with respect to different numbers of samples. The result is shown in Figure 5.10. Using 1000 samples, the GPU is a factor of 170 and 400 faster in computing mutual information and its derivatives respectively, compared with a single CPU. As the number of samples increases, the computation time for the CPU grows quadratically. Although a multi-core CPU can spread the task, the computation time can still be unacceptable when the number of samples is large.
On the other hand, the computation time for the GPU is less than 3ms, and is relatively constant compared with that of the CPU. The result explains that the GPU implementation indeed computes mutual information more efficiently. Mutual information Mutual information derivatives (ms) (ms) (#samples) (#samples) Figure 5.10: Computational time of mutual information and its derivatives In the second experiment, we evaluate the performance difference of a full image alignment process as described previously in Chapter 5. For a single triangular patch in 55 side length: 40 # of faces: 373 side length: 60 # of faces: 160 side length: 80 # of faces: 86 side length: 100 # of faces: 47 (s) Figure 5.11: Computation time for image alignment using mutual information I u , we arrange all inner pixels in scan-line order, and group pixels with odd and even indices into sample sets A and B respectively. The corresponding samples in I v are usually at fractional positions, where we use bilinear interpolation to approximate their intensities and intensity gradients. To find the step length, we start at 0.1, and double it until the mutualinformation cannot increase. Note that this requiresmutual information to be computed at every step length, and the samples in I v to be re-interpolated. We run 50 iterations for an image pair of size 640×480 and compare the computation time with respect to the numbers of triangle sizes. We ignore the measurement of a single CPU since we know it is roughly 4 times longer than a quad-core CPU. The result is showninFigure5.11. We can see that the time saved by performing a full registration is not as much as computing mutual information or its derivatives alone. This is because the interpolation of the samples in I v is performed on the CPU, and these samples have to be copied from the host memory to the GPU memory at each iteration, which is a substantial overhead. Nevertheless, registration on the GPU takes only a few minutes, whereas CPU takes up to hours. For applications that require to run such computation over hundreds of iterations, CPU implementation is certainly impractical. 56 5.7 Summary We have presented a 3D metric reconstruction method that recovers accurate camera poses and infers a 3D surface for 3D registration. The proposed 4-Pass bundle adjust- ment method accurately recovers camera poses, which gives us the clue to select the image corresponds to the camera with the largest baseline, thus reconstruct the opti- mal structure and camera poses. The images are then registered by back projecting points through the structure, and produce a smaller registration error than our previous 2D approach. Compared with the state-of-the-art near planar surface reconstruction method, we produce more accurate surfaces in all data sets. Nevertheless, the epipolar geometry estimated using plane+parallax method may still not be robust enough. Given the relation, F=[e ] × H, from Section 5.2, if the depth is shallow, the parallax is small , and the parallax vector becomes highly sensitive to the measured point (˜ x in Figure 5.2). This may result in an unstable estimation of the epipole, and in turn leads to an unstable estimation of the fundamental matrix. This problem should be addressed in future work. In addition, a quantitative evaluation and a more extensive validation of our approach to near planar surface reconstruction is required. 
57 Chapter 6 3D Urban Scene Reconstruction 6.1 Introduction In Chapter 3, we presented a 2D method that registers a UAV image sequence, where the camera motions are restrickted and may not be general enough for general scenarios. To account for general camera motions, the underlying geometry needs to be explic- itly reconstructed in order to achieve parallax free image registration. There are more benefits in knowing the 3D. To perform scene change detection, image sequences can be stabilized to a reference view through 3D inference, thereby accounting for parallax induced by buildings. To perform vehicle tracking, knowing the 3D not only allows us to distinguish between motions of moving objects and those caused by parallax, but also tells us where to stop tracking due to occlusions. The TAILWIND program (DARPA- BAA-09-4 [tai]) defines a realistic operational scenario where a UAV observes a scene in order to produce a 3D model “on the fly” from the collected imagery. This model can then be used for detection and tracking. In this work, we focus on 3D reconstruc- tion in urban scenes. This is a very challenging task. First, architectural surfaces are often textureless, which results in ambiguities in matching. Second, severe occlusion is ubiquitous in the image, which also produces uncertainties in 3D inference. Third, the reconstructed 3D volume is very large, which raises efficiency issues in terms of both time and memory consumption. We are given a set of images that are oblique views of a persistent urban area cap- tured by a UAV hovering, as shown in Figure 6.1. In addition, the camera calibration parameters corresponding to each image are known. Our goal is to produce 3D mod- els for the area in an efficient manner for further image analysis. We present a novel 58 Figure 6.1: Samples of the input UAV images approach that performs dense reconstruction of urban scenes from these aerial images both efficiently and accurately. Figure 6.2 shows the flow chart of our approach. For a reference view, we first compute at each pixel the matching scores (with respect to nearby views) at all discrete height levels using a simple yet robust algorithm similar to [GCS]. Next, we estimate the most likely height for each pixel, and derive a 3D point cloud for each reference view. Finally, we merge all the point clouds together and use the method presented by Zhou [ZN] to generate the 3D mesh models for the urban scene. The novelty of our approach resides in the dense height estimation module. Inspired by state-of-the-art urban scene reconstruction approaches [PNF + 08, FCSS], we also incorporate structural prior in our work. In addition, we make use of edges that are prevalent in urban scene imagery to provide additional constraints. We then formulate the dense reconstruction problem as a constrained optimization problem, which is solved efficiently using a 2-pass dynamic programming algorithm [KLCL]. The reconstructed 3D point clouds are highly accurate, and can be used directly as raw input to the 3D building modeling module to produce urban 3D models. Our proposed approach has the following desirable characteristics: 1. Speed. The computation complexity of both Dense matching score computation and Dense height estimation modules are O(nml), where n and m are the image width and height respectively, and l is the number of discrete heights. 
Moreover, the algorithm is fully parallizable, and can be implemented on a GPU platform, which allows the entire reconstruction to be completed in a few hundred seconds. 59 Dense matching score computation UAV images + Calibrated cameras Dense height estimation For each pixel the matching score at each discrete height level Dense 3D point cloud 3D building models INPUT OUTPUT Building Modeling Operates on each reference view Operates on the merged point cloud Figure 6.2: Flow chart of our dense reconstruction of urban scene 2. Accuracy. By imposing the structural prior, the matching ambiguities that typi- cally result from homogeneous areas can be eliminated. 3. Compactness. For most volumetric approaches, the bounding volume for a scene in aerial images is large, which give rise to memory and resampling issues. In contrast, ours is a view-based approach that reconstructs one 3D point at each pixel, thus avoiding such problems. 4. Generality. State-of-the-art urban reconstruction algorithms assume two domi- nant orientations for vertical planes, which is commonly violated in aerial imagery. In contrast, we allow an arbitrary number of orientations, which is a more general, and realistic, assumption. Inthefollowingsections, wefirstreviewrelated work. Next, wedescribeourapproach in Section 6.3, starting from a simplified version, then evolve to an advanced one. In Section 6.4, we present out line extraction method which produces line segments used in the 2-pass approach. We show experimental results in Section 6.5, and offer concluding remarks in Section 6.6. 60 6.2 Related Work In the last few years a significant amount of work has been devoted to 3D reconstruc- tion from aerial imagery. Here, we discuss the research on urban environments, as the methodology used for suburban or rural areas can be quite different. It is common to use geometry primitives to assist the reconstruction of urban scenes. Fradkin [FRML] segments the image based on color, and approximate a plane for each segment using the estimated depth at each containing pixel. Baillard [BSZF] uses matched line segments over multiple images to compute piecewise planar patches, and achieves near complete roof reconstructions. Recently, advances in urban scene reconstruction algorithms have produced com- pelling results in terms of both efficiency and accuracy. Mundy et al. present a vol- umetric approach in [CMT, PM]. Each voxel is associated with both an appearance model and an occupancy probability. Any arbitrary view of the underlying scene can then be synthesized from this model, which is useful for scene change detection since structure induced parallax is compensated. Pollefeys et al. [PNF + 08] develop a sys- tem that generates textured 3D models of a large scale urban scene from video. Their proposed multi-view plane sweeping stereo algorithm incorporates plane priors to dis- ambiguate the matching in textureless regions, and produces more accurate depth maps. Cornelis [CLCG08] assumes facades are parallel to the gravity direction. By finding the best match for each image column in the rectified stereo image pair, the 3D facade can be reconstructed. Note that both of the above methods utilize parallel computing on the GPU platform to achieve realtime performance. Furukawa et al. [FCSS, MK] assumes Manhattan world [CY], i.e., scenes consist of axis aligned planes, which greatly reduces the plane hypothesis space. 
The plane assigning problem for each pixel (or superpixel in [MK]) is then formulated as a MRF which can be solved using graph cuts. Although the input imagery for the previously mentioned approaches is mostly ground-based, we can summarize the ingredients in a successful urban reconstruction 61 algorithm. First, they all have certain shape priors, such as piecewise planar model used in [PNF + 08, MK], and ruled surface model used in [CLCG08]. Second, they make use of the gravity direction. The ground plane is generally used as the reference hori- zontal plane, and building facades are planes perpendicular to it. Third, most of them are depth map based approaches, as a depth map representation of the scene is very compact, which is very suitable for large scale reconstruction applications. 6.3 Approach As mentioned in Section 6.2, imposing certain shape prior can help to resolve matching ambiguities, which is a widely adopted technique. In our work, the geometry is modeled as piecewise planar surfaces that have their surface normals either parallel or perpen- dicular to the gravity vector g. This is a more general assumption than the Manhattan world [CY], in that our vertical planes in the scene may have an arbitrary number of ori- entations, as opposed to just two. Although multi-plane sweeping algorithm [PNF + 08] claims the ability to reconstruct planes with arbitrary number of orientations, the com- putation cost also increases accordingly. Figure 6.3 a illustrates such a geometry (the building in the right) that does not follow the Manhattan world but conforms to our model. We also assume that the projection of a vertical line is locally aligned with an image column. As shown in Figure 6.3 a, the vertical boundary of a building, depicted in a red solid line, is roughly aligned with an image column, as depicted in the white dashed line. This assumption is in general true when the camera’s up vector is close to g. In circumstances where the assumption is violated, it can be strictly enforced by an additional rectification step, as discussed in [CY]. With these two assumptions, a set of spacial constraints are derived. Let P i denotes the 3D point which correspond to an image pixel p i (we use bold face and italic face for imagespaceandworldspace, respectively, throughoutthepaper),thespatialrelationship 62 (a) (b) (c) (d) Image column 3D vertical line Figure 6.3: Our assumption on UAV images. See text for details. between two 3D points, (P m ,P n ) which correspond to a pair of adjacent pixels on the same image column, is one of the following: 1. P m and P n belong to the same horizontal plane, 2. P m and P n belong to the same vertical line, and 3. P m and P n belong to different planes Theabove relationships areshowninFigure6.3, wherethe color barsalong theyellow column in (b), (c), and (d) highlight the groups of points with relationships 1, 2, and 3 respectively. Next we present our detailed implementation of the dense reconstruction. We first describe how the dense matching scores are computed. Then we present a simplified version of reconstruction in Section 6.3.2, which leads to the advanced version of recon- struction in Section 6.3.4. 6.3.1 Matching Score Different from typical view-based reconstruction methods which compute matching scores at a number of depth levels, we perform matching score computation at each 63 pixel p in the reference image I ref at a number different height levels z, denoted as M(p,z). 
We define the matching score as

M(\mathbf{p}, z) = \sum_i ncc\bigl( w_i(P(z)), w_{ref}(P(z)) \bigr)    (6.1)

where ncc(\cdot) is the normalized cross correlation of two image patches, w_i(P(z)) is the image patch centered at the projection of P(z) in I_i, and P(z) is the 3D point which projects to p in I_ref with height z. In practice, we choose view I_i to be a view close to I_ref, so that the projections of the same 3D surface in the two views do not show severe perspective distortions. Therefore w_i and w_ref are simply two equal-sized n×n windows. We use n = 7 and two nearby views (i = 2) in our implementation.

6.3.2 Height Estimation along an Image Column

In our simple version of reconstruction, the height estimation is performed along each image column independently. Let Z(p_i) denote the estimated height (the z component) of P_i; our objective is to find the optimal Z that maximizes the following function for each image column v:

E_v(Z(\mathbf{p})) = \sum_k M\bigl(\mathbf{p}_k, Z(\mathbf{p}_k)\bigr) - \sum_k \rho\bigl(Z(\mathbf{p}_k), Z(\mathbf{p}_{k+1})\bigr)    (6.2)

where M(\mathbf{p}_k, Z(\mathbf{p}_k)) is the data term previously described, and \rho(Z(\mathbf{p}_k), Z(\mathbf{p}_{k+1})) is the smoothness term defined as

\rho\bigl(Z(\mathbf{p}_k), Z(\mathbf{p}_{k+1})\bigr) = \begin{cases} 0, & \text{if } P_k.z = P_{k+1}.z \\ 0, & \text{if } D_h(P_k, P_{k+1}) < L \\ K, & \text{otherwise} \end{cases}    (6.3)

where K is a positive value (we use K = 5), and D_h is the horizontal distance between two points. Obviously, \rho encodes the spatial relationship between two 3D points P_k, P_{k+1} as described in the previous section. Note that we use a slightly relaxed form D_h(P_k, P_{k+1}) < L to encode the notion "P_k and P_{k+1} belong to the same vertical line". This is because the horizontal coordinates (x, y) of P are determined by the discrete z value, and therefore P_k and P_{k+1} do not have identical horizontal coordinates. Intuitively, the smoothness term penalizes (P_k, P_{k+1}) being on different 3D planes, and encourages them to be on the same horizontal plane or on the same vertical line.

Equation 6.2 can be expressed as a dynamic programming (DP) problem which has time complexity O(ml), where m is the number of pixels in a column, and l is the number of discrete height levels. Thus, the time complexity for reconstructing all points in the image is O(nml), where n is the image width. Most importantly, the algorithm can be ported onto a GPU platform, which dramatically decreases the computation time. We show statistics in Section 6.5.

Figure 6.4: Reconstruction results using (a) winner-take-all vs. (b) dynamic programming

To demonstrate the superiority of our formulation, we compare our method with a winner-take-all (WTA) strategy, which simply assigns to each pixel the height that yields the best matching score, i.e., Z(p) = argmax_z M(p, z). Figure 6.4 shows the reconstruction results (point clouds rendered in MeshLab [mes]) of Figure 6.1 left using WTA and our method respectively in (a) and (b). We can see that the result using our approach, compared with the naive approach, is much cleaner and preserves structure better. However, this approach only imposes constraints on intra-column pixels. Since no constraints are imposed between adjacent columns, results may be inconsistent. This is a very typical problem for dynamic programming approaches, and there are related works studying this problem [OK85, KLCL]. Since our approach follows a similar idea to the one described in [KLCL], we start by reviewing this method in the next section.
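A minimal sketch of the per-column optimization of Equation 6.2 is given below. It assumes the smoothness term has been precomputed into a penalty matrix over pairs of height levels (in our formulation, ρ actually depends on the 3D points P_k, P_{k+1} through the horizontal-distance test of Equation 6.3); function and variable names are illustrative only.

```python
import numpy as np

def column_height_dp(M, penalty):
    """Per-column dynamic programming of Eq. 6.2 (sketch).

    M: (m, l) matching scores for the m pixels of one image column at
    l discrete height levels.  penalty: (l, l) matrix holding the
    smoothness cost between the height levels of consecutive pixels.
    Returns the height index chosen for every pixel in the column.
    """
    m, l = M.shape
    score = np.full((m, l), -np.inf)
    back = np.zeros((m, l), dtype=int)
    score[0] = M[0]
    for k in range(1, m):
        # total[j_prev, j]: best score ending at level j_prev at pixel k-1,
        # minus the transition cost to level j at pixel k.
        total = score[k - 1][:, None] - penalty
        back[k] = np.argmax(total, axis=0)
        score[k] = M[k] + total[back[k], np.arange(l)]
    # Backtrack the optimal path from the best final level.
    z = np.empty(m, dtype=int)
    z[-1] = int(np.argmax(score[-1]))
    for k in range(m - 2, -1, -1):
        z[k] = back[k + 1][z[k + 1]]
    return z
```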
Given a stereo image pair (I l ,I r ), it estimates a disparity value at every pixel p in I l , denoted by D(p), so that a pixel at location (x,y)in I l is matched to the pixel at location (x+D,y)in I r . In the 1 st pass, dynamic programming is performedalong image rows that maximizes a function in a very similar form as Equation 6.2: E h (D(p)) = k N(p k ,D p k ) − k δ D(p k ),D(p k+1 ) (6.4) where N is the matching score as the data term, and δ is the smoothness term. In this first pass, the objective is not to compute the best D for each pixel, but to compute a bias term for each disparity level at each pixel, denoted as B, that is used in the 2 nd pass DP. In the 2 nd pass, dynamic programming is performed along image columns that maximizes the following function: E v (D(p)) = k N(p k ,D p k ) − k δ D(p k ),D(p k+1 ) + k B(p k ,D(p k )) (6.5) and the disparity along the optimal path is then the final disparity of the image. As can be seen from Equation 6.5, if B(p k ,j) has a larger value, the optimization will favor disparity level j at p k more. Next we describe how this bias term is computed. We start from looking at the last pixel in an image row. In a typical DP process, a best score for each disparity level is accumulated along the row, denoted as S(p k ,j), where p k is the k th pixel in the row, and j is the disparity level index. Say there are 66 n pixels in the row, as DP reaches the last pixel, each disparity level j is associated with a best accumulated score S(p n−1 ,j), and it is the amount level j biases DP in the second pass. Figure 6.5 (b) illustrates the description with 3 different disparity levels marked in red, green and blue respectively. In this example, the 2 nd pass DP favors the red level the most as it has the largest accumulated score at the end of the row. Intuitively, the bias term aggregates information along the row direction. By applying the bias term in the 2nd pass DP, the optimization is carried out by considering row and column information jointly. We can compute the bias term for the first pixels in the number of levels + − → S − → S ← − S ← − S (a) (b) (c) (d) Figure 6.5: Bias term computed in the first pass DP. (a) the image row to be computed; (b) the bias term for the last pixel in the row; (c) the bias term for the first pixel in the row; (d) the bias term for all pixels in the row. See text for detail. image rows in a similar fashion by performing DP in the reverse direction, as shown in Figure 6.5 (c). In this example, the 2 nd pass DP favors the green level the most as it has the largest accumulated score at the first pixel. Finally, the idea is generalized to all the pixels in the image by adding the forward pass and the backward pass together to form the bias term for each pixel at each level, i.e., B = − → S + ← − S (6.6) 67 where − → S and ← − S denote the best accumulated score computed in the forward and the backward directions respectively, as shown in Figure 6.5 (d). Note that the entire 2- pass DP requires performing DP 2 times in the row direction, and 1 time in the column direction, which still has O(nml) time complexity. [KLCL] compares the results of 1-pass DP and 2-pass DP. Figure 6.6 (a) is one of the input images, and Figure (b)-(c) are the results using 1-pass DP and 2-pass DP respectively. We can see that the streak effects produced by 1-pass DP is significantly reduced in 2-pass DP. (a) (b) (c) Figure 6.6: Dense matching results of a stereo image pair. 
6.3.4 Height Estimation for the Entire Image

Our advanced version of reconstruction is motivated by the observation that there exist a number of linear features in urban UAV imagery, produced by road marks and building boundaries. Under our assumption that all planes in the scene are either horizontal (parallel to the ground plane) or vertical (perpendicular to the ground plane), it follows that edges on the boundary of all such planes should exhibit the exact same behavior as well, i.e., these edges in the 2D image should also correspond to either horizontal or vertical edges in the world. In fact, a large number of edges within such planes also exhibit this behavior, such as the horizontal and vertical window patterns commonly found on building facades. Figure 6.7 shows a typical urban scene with the detected linear features overlaid. We leave the description of our line extraction method to the next section, and assume line segments are given from now on.

Figure 6.7: Linear features extracted from urban imagery.

As most horizontal edges span multiple columns in the image, we use them to establish inter-column constraints. Let $\bar{e}$ denote an edge which corresponds to a horizontal line segment in the world space; the new constraint is that all 3D points which project to pixels along $\bar{e}$ share the same height. To collect the set of edges $\{\bar{e}\}$ from the set of line segments extracted from an image, we ignore those that span fewer image columns than a threshold (we use 3). Although this may remove horizontal edges as well, the consequence is minor, since the constraint on such edges only affects a small number of columns.

Our novelty in this work is to adapt Kim's approach [KLCL], originally used for solving a 2-view dense stereo matching problem as described in Section 6.3.3, to our dense height estimation problem, and to enforce both intra-column and inter-column constraints jointly. To employ our inter-column constraint, instead of a typical 2-pass DP approach that performs the 1st pass along each image row, we perform our 1st pass along each edge, as shown in Figure 6.8.

Figure 6.8: Left: A typical 2-pass DP performs DP first along image rows then along image columns. Right: Our 2-pass DP performs DP first along world space horizontal edges then along image columns.

The function to be maximized in the first pass DP is defined as

$$E_e(Z(p)) = \sum_k M\big(p_k, Z(p_k)\big) - \sum_k \rho_e\big(Z(p_k), Z(p_{k+1})\big), \qquad (6.7)$$

where $\rho_e(\cdot)$ is the smoothness term defined as

$$\rho_e\big(Z(p_k), Z(p_{k+1})\big) = \begin{cases} 0, & \text{if } P_k.z = P_{k+1}.z \\ K_1, & \text{otherwise} \end{cases} \qquad (6.8)$$

Clearly, it favors points along an edge having the same height. The bias term $E_{\bar{e}}$ is then defined in the same way as in Equation 6.6,

$$E_{\bar{e}} = \overrightarrow{E_e} + \overleftarrow{E_e} \qquad (6.9)$$

where $\overrightarrow{E_e}$ and $\overleftarrow{E_e}$ are $E_e$ computed in the forward and backward directions along the edges respectively. Note that in this way, the non-edge pixels have value 0 for all discrete height levels, i.e., they do not bias the second pass DP. Finally, the second pass DP is performed on each image column, maximizing the following function:

$$E_v(Z(p)) = \sum_k M\big(p_k, Z(p_k)\big) - \sum_k \rho\big(Z(p_k), Z(p_{k+1})\big) + \sum_k E_{\bar{e}}\big(p_k, Z(p_k)\big) \qquad (6.10)$$
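Putting Equations 6.7-6.10 together, the edge-driven first pass can be viewed as filling a per-pixel bias volume that is zero everywhere except on the extracted edges, followed by the usual per-column DP with the bias added to the data term. The sketch below reuses the two helper functions sketched earlier (column_height_dp and row_bias_term); the representation of edges as pixel chains and all names are illustrative assumptions, not our GPU implementation.

```python
import numpy as np

def edge_biased_height_map(M, rho, edges, rho_e):
    """Sketch of the 2-pass DP of Eqs. 6.7-6.10.

    M     : (H, W, l) matching-score volume over l discrete heights.
    rho   : (l, l) intra-column smoothness penalties (Eq. 6.3).
    edges : list of pixel chains [(row, col), ...], each a world-horizontal edge.
    rho_e : (l, l) along-edge penalties (Eq. 6.8).
    """
    H, W, l = M.shape
    bias = np.zeros((H, W, l))                           # non-edge pixels keep a zero bias
    for chain in edges:
        scores = np.array([M[r, c] for r, c in chain])   # (len(chain), l)
        B = row_bias_term(scores, rho_e)                 # first pass along the edge (Eq. 6.9)
        for (r, c), b in zip(chain, B):
            bias[r, c] = b
    Z = np.empty((H, W), dtype=int)
    for c in range(W):                                   # second pass: per-column DP with bias (Eq. 6.10)
        Z[:, c] = column_height_dp(M[:, c] + bias[:, c], rho)
    return Z
```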
Figure 6.9 shows the improvement brought by this additional constraint. We highlight two areas in the top row that best illustrate the improvements. In the middle row, (a) exhibits typical inter-column inconsistency (zigzag line patterns on the facade) using only 1-pass DP. This inconsistency is significantly reduced by our proposed 2-pass DP, as it takes into account the spatial cues from prominent structural edges, as shown in (b). In the bottom row, (c) shows the over-smoothing effect of a typical DP algorithm. The vertical planes of the building in this case are erroneously stretched out into a large horizontal plane, a result of the smoothness term in DP incorrectly dominating over the data term. In (d), however, by aggregating measures across image columns in the first pass DP, the distinction between the measures at the true height and the rest is magnified. By taking this distinction into account as the bias term in the second pass DP, the optimal path is less likely to be over-smoothed, and more likely to pass through the right height.

Figure 6.9: Reconstruction results of the two highlighted areas in the top row using 1-pass DP (left column) and 2-pass DP (right column) respectively. See text for detail.

6.4 Line Extraction

Our line extraction method first computes two quantities at each pixel, an orientation $\theta$ and a line saliency measure $\varphi$, and then groups pixels of the same orientation into line segments, based on these two quantities.

Let $l(p,\theta)$ denote a line passing through $p$ with orientation $\theta$. For a typical line, the local image patches along the line should look similar. Therefore, we first define a local self-similarity measure

$$U(\theta, p) = \sum_{k=-S}^{S} \mathrm{sad}\big(w(q_k^{\theta}),\, w(p)\big), \qquad (6.11)$$

where $\mathrm{sad}$ is the sum of absolute pixel differences of two image patches, $w(p)$ is a square window (we use $5 \times 5$) centered at pixel $p$, and $q_k^{\theta}$ is the pixel on line $l(p,\theta)$ that is $|k|$ distant from $p$. $S$ defines the size of the local neighborhood (here we use $S=8$). The orientation at $p$ is then defined as

$$\theta = \arg\min_{\theta_i} U(\theta_i, p) \qquad (6.12)$$

In our implementation, we find $\theta$ with a two-tier approach that first finds the best orientation on a coarse level (among 16 equally spaced bins with distance $\pi/16$ between each consecutive two), and then on a fine level (another 16 bins with distance $\pi/256$) around the coarse-level minimum, to speed up the process. With the orientation $\theta$ determined, a line saliency measure with respect to $\theta$ is defined as

$$\varphi = r \times \Big(1 - \frac{U(\theta, p)}{T}\Big) \qquad (6.13)$$

The first term here, $r$, is defined as the maximum running length of the sequence of pixels $\{q_k^{\theta}\},\, k = -S \ldots S$ that satisfy

$$\hat{g}(p) \cdot \hat{g}(q_k^{\theta}) \ge g_t \qquad (6.14)$$

where $\hat{g}(\cdot)$ is the unit gradient vector and $g_t$ is a constant threshold (0.95 in our case). Clearly a line feature will result in a large $r$ value, since the gradient directions of pixels along $l(p,\theta)$ are the same as that of $p$, thus producing a long run. The second term, where $T$ is a pre-defined upper bound of $U$, is a normalized, linearly decreasing function of $U$. These two terms then jointly define the saliency of a line segment.
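As a rough illustration of Equations 6.11-6.14, the following sketch computes the orientation and saliency at a single pixel. It only performs the coarse-level orientation search, assumes the pixel is far enough from the image border, and the upper bound T is an arbitrary illustrative value; it is a sketch of the definitions above, not the GPU implementation described in the text.

```python
import numpy as np

def pixel_orientation_and_saliency(img, p, angles, S=8, win=5, T=2000.0, g_t=0.95):
    """Orientation theta (Eq. 6.12) and saliency phi (Eq. 6.13) at pixel p = (row, col)."""
    gy, gx = np.gradient(img.astype(float))

    def patch(q):                       # win x win window centered at q
        r, c, h = int(round(q[0])), int(round(q[1])), win // 2
        return img[r - h:r + h + 1, c - h:c + h + 1].astype(float)

    def U(theta):                       # Eq. 6.11: self-similarity along the line l(p, theta)
        d = np.array([np.sin(theta), np.cos(theta)])
        return sum(np.abs(patch(np.add(p, k * d)) - patch(p)).sum()
                   for k in range(-S, S + 1))

    def unit_grad(q):
        r, c = int(round(q[0])), int(round(q[1]))
        g = np.array([gy[r, c], gx[r, c]])
        return g / (np.linalg.norm(g) + 1e-12)

    costs = [U(a) for a in angles]
    theta = angles[int(np.argmin(costs))]          # Eq. 6.12 (coarse level only)

    d = np.array([np.sin(theta), np.cos(theta)])
    run = best = 0
    for k in range(-S, S + 1):                     # Eq. 6.14: longest run of aligned gradients
        if unit_grad(np.add(p, k * d)) @ unit_grad(p) >= g_t:
            run += 1
            best = max(best, run)
        else:
            run = 0
    phi = best * (1.0 - min(costs) / T)            # Eq. 6.13
    return theta, phi
```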
Once $\theta$ and $\varphi$ have been computed for each pixel, we use a simple grouping process to extract salient line segments across the entire image. For each pixel $p$ whose $\varphi$ value is a local maximum along the $\hat{g}(p)$ direction and is above a given threshold $\phi_t$, a line segment is initiated along its $\theta$ direction. This line segment is incrementally extended along the $\theta$ direction, for as long as the following two conditions hold:

$$|\theta(p) - \theta(p_i)| \le \theta_t, \qquad \varphi(p_i) \ge \phi_t \qquad (6.15)$$

where $p_i$ is, along the $\theta$ direction, the next adjacent point to be added to our line segment, and $\theta_t$ is a constant threshold (we use 0.05 here). The first condition ensures that the entire line has a highly consistent orientation around $\theta(p)$. The second condition ensures high line saliency, and eliminates homogeneous, textureless image regions from our line segment extraction. In cases where the extracted line segments cross one another, we favor longer line segments over shorter ones, in that a longer line segment spans more image columns, and therefore provides a stronger spatial constraint for our 2-pass dynamic programming. The result is a set of line segments extracted from the 2D image that correspond to salient structural lines in the 3D scene. The computation of $\theta$ and $\varphi$ at each pixel can be carried out in parallel, which we perform on a GPU. We report the computation time in Table 6.1.

6.5 Experiments

Our experiments are carried out on an Intel Xeon 3GHz (quad-core) machine with 3GB of RAM, equipped with an NVidia GeForce GTX 295 graphics card, using the CUDA 2.3 [cud] architecture. All images are of size 1000×1000, i.e., 1M points are reconstructed from each view, and the number of discrete height levels is 160. The average computation time of each main module is listed in Table 6.1 below.

Table 6.1: Computation time for each module — dense matching, edge extraction, the first and second pass DP of the dense height estimation, and building modeling (CPU); the recorded times are 4.3, 71, 3.2, 317, and 6.6 seconds.

Next we show in Figure 6.10 the reconstructed 3D point cloud from a set of 3 images, with the image in Figure 6.1 right as the reference view. We can see that our proposed algorithm produces a clean and accurate reconstruction, especially for homogeneous regions such as the rooftops and the roads, where naive approaches produce noisy results. In Figure 6.10 (a)-(c), it is clear that our algorithm cuts precisely at the occluding boundaries, as there are no zigzag patterns, and only very little clutter is found within the occluded regions. Figure 6.10 (d) shows the high level of detail our algorithm is capable of achieving. Notice that the step-like structure of the buildings is correctly reconstructed.

Figure 6.10: Reconstructed 3D dense point cloud taking Figure 6.1 right as the reference view, plotted from multiple viewing angles. See text for details.

We then use [ZN] to produce 3D mesh models directly from our reconstructed point clouds. Originally designed for LiDAR data, [ZN] is a purely data-driven algorithm that estimates the principal directions of roof boundaries from dense points, and uses them in footprint production. Figure 6.11 shows two views of the modeling result. From 480K points, it generates a 3D model with 28484 vertices and 41593 faces. We can see the stair structure of the building is well preserved. Although there are a few fractures due to an insufficient number of sample points, they could be recovered by incorporating more views. It also shows that our approach is capable of reconstructing a building with facades in several different orientations, which is more powerful than the approaches in [CLCG08, PNF+08, FCSS].

We have applied our method on other datasets with different characteristics. In particular, we have used the CLIF '06 dataset [cli]. Figure 6.12 plots the point clouds from multiple viewpoints. The scene also contains a number of poorly textured regions and severe occlusions. However, we are still able to produce a dense reconstruction with high visual quality.
Figure 6.11: 3D building mesh models generated from our reconstructed dense 3D point clouds.

6.6 Summary

In this work, we have proposed a novel approach to the reconstruction of 3D urban scenes from aerial images. Our algorithm works under the assumption that all planes in the scene are either horizontal or vertical, which is a more relaxed and general assumption than the one used in state-of-the-art ground-based urban reconstruction methods. We proposed a 2-pass dynamic programming algorithm that takes into account both the intra-column and the inter-column constraints, and produces highly accurate results. Especially in key areas such as homogeneous or occluding regions, our algorithm is able to faithfully reproduce the underlying structure. Moreover, our algorithm is fully parallelizable, capable of reconstructing 1M points over 160 discrete heights within 1 minute. Quality results produced from several challenging data sets demonstrate the generality and robustness of our approach.

The main limitation of our approach is the assumption that the setup is fully calibrated. We discuss these extensions, as well as future work, in Section 8.2.

Figure 6.12: A reconstructed 3D point cloud from the CLIF '06 dataset.

Chapter 7
3D Face Reconstruction from Weakly Calibrated Wide Baseline Images

7.1 Introduction

In this chapter, we address a very difficult problem, the inference of an accurate 3D face model from a set of images separated by a wide baseline in a weakly calibrated setup. Unlike other statistical model-based methods, such as Vetter's [ABF+07], our approach is purely data-driven, and thus produces faithful models. This gives two important advantages. First, our method provides stable reconstruction results while other methods require a critical initialization step. Second, our method generates a more accurate shape of the face. The use of a generic face model allows the reduction of the computational complexity or the inference of uncertain face shape, but restricts the use of the obtained 3D surface for some critical applications such as face biometrics and medical surgery. Our method should handle the asymmetry of a face, and local geometry (e.g., a scar), which are important factors for realistic 3D face modeling.

We process 5 images acquired from a fixed camera and a head successively facing 5 different targets which are roughly 45° apart, as shown in Figure 7.1. The head motion is mostly rotational, but may include translation. This results in 5 images that include a left and a right profile view, as shown in Figure 7.2. The main contribution of this chapter is the inference of an accurate 3D face model from this weakly calibrated setup.

Figure 7.1: The setup for capturing the five facial images of a subject. The five boxes are the targets to face when taking pictures. The circle represents the head of the subject.

Figure 7.2: The five facial images and two profile images. Image $I_i$ is the image when the subject faces target $i$ in Figure 7.1.

We start by estimating camera poses in all views, building a sparse reconstruction, then inferring a dense 3D face model. We propose a set of key algorithmic components in order to produce high quality 3D face models.

The quality of 3D face models depends first on the accuracy of the estimated head-camera motion. It is very difficult to establish correspondences between views in wide-baseline stereo images due to perspective distortions and lighting changes.
We propose a new method to estimate the relative head-camera motion. First, we use an iterative bundle adjustment method to detect outliers among corresponding points, since even a single outlier might yield catastrophic errors in the motion estimation result. Second, silhouette matching links two profile views that have no point-wise correspondences, and provides additional constraints for calibration.

For dense reconstruction, we employ a face-specific cylindrical representation, which allows us to solve a global optimization problem for N-view dense aggregation. In [FM], individual pairwise reconstructions are integrated into a single 3D point cloud, and then an additional post-processing step (e.g., tensor voting [MM06]) is applied to enforce surface self-consistency. We aggregate all information to get a single 3D point cloud using a well-known optimization technique. With this representation, dynamic programming can be applied directly for all views. Profile contours are then used once again to produce an accurate shape along the center (vertical) face region.

Experimental results using synthetic and real images show that our method provides accurate and stable reconstruction results with wide-baseline images on which conventional data-driven techniques, such as Boujou [bou], Arc3D [arc], and Photosynth [pho], fail to reconstruct correct camera poses in the first place. Even after providing accurate calibration from our setup, the state-of-the-art PMVS [FPb] software package failed to build the model.

The remainder of the chapter is organized as follows. Section 7.2 summarizes the related work. Our approach is presented in Section 7.3, Section 7.4 shows the experimental results, and Section 7.5 concludes the chapter.

7.2 Related Work

Most proposed approaches to solve this problem start from a generic model, which is then deformed to fit the data. In [ABF+07], a deformable 3-D face model was introduced, which is fit to an image using a nonlinear optimization technique and a suitable distance measure. In [ASW93], the authors create a 3D head model from only a frontal view and a profile view, by adjusting a generic model. Tang [TH] creates a 3D face model from the locations of 32 feature points on the face in multiple images, and relates them to the nodes of a generic face model. This method builds a reasonable 3D model, but is not dense enough to represent the face shape accurately. Fua [Fua] uses a model-driven bundle adjustment method to obtain the head model from a raw video sequence. It takes advantage of stereo data, silhouette edges and 2D feature points. It has shown good results, but it is not clear whether the silhouette edges and 2D feature points help; combining the three into a uniform representation is not a trivial problem. In [SLZ], the authors propose a model-based bundle adjustment algorithm that takes a set of image tracks as input, without a prior 2D to 3D association. The use of a generic face model allows the reduction of the complexity of the bundle adjustment algorithm, but restricts the use of the obtained 3D surface for identification purposes. The 3D features of the face are mapped onto the generic model, and consequently ignore the very specific features that allow the identification of a person from his/her 3D characteristics.

On the other hand, the modeling problem fits in the multi-view stereo (MVS) framework, which makes no assumption on the reconstructed shape. Both internal and external camera parameters, however, must be provided to locate the point projections in images.
A thorough survey of early MVS works has been presented in [SCD+]. Several methods have achieved sub-millimeter accuracy on the benchmark objects. In recent developments, Furukawa [FP09] presents a patch-based approach that ranks at the top of the high-resolution imagery benchmark [SvHVG+]. Starting from a sparse set of matched keypoints, the corresponding surface (patch) normals are estimated, which are then used in iterative expansion and filtering phases that gradually complete the entire scene. Despite the outstanding performance in terms of accuracy, experimental results show that it fails to perform dense reconstruction on our image sets. In addition, the processing time for high resolution images takes hours, which is too computationally demanding for many applications.

Note that most MVS approaches do not scale well with images that have wide baselines. Tola et al. [TLF09] present an algorithm that can efficiently produce SIFT-like local descriptors, DAISY, for dense matching purposes. The sophistication of its descriptors allows wide-baseline matching to be performed on low resolution images. Strecha [SFG] addresses the self-occlusion problem in wide baseline images. He presents a probabilistic approach in which the depth with respect to each image is optimized using EM algorithms.

We reiterate that the aforementioned methods all assume known external and internal camera calibration, which is difficult in practice. Although structure from motion (SfM) techniques [HZ04, PVGV+04] can perform camera calibration directly from images, the accuracy relies on the quality of correspondences, which may be poor in wide baseline images. We present our work, which starts from a set of 5 weakly calibrated wide baseline images and still produces qualitatively competitive results, as described in the next section.

7.3 Approach

There are two main modules in our approach, sparse reconstruction and dense reconstruction. Sparse reconstruction aims at recovering the unknown camera poses, which is a challenging problem given wide baseline images. For dense reconstruction, the goal is to reconstruct the dense 3D surface of the face, where we use contour lines in the profile images as a strong constraint.

7.3.1 Sparse Reconstruction

To define a set of cameras in the same space, we need to establish point correspondences in at least 3 views. Finding such correspondences in our imagery is a very challenging task since any 2 adjacent views are 45° apart. Lighting changes and perspective distortions make the matching process error prone. In classical SfM methods with known camera intrinsic parameters, 3-view correspondence inliers can be naturally extracted by employing RANSAC on top of PnP [LMNF09] algorithms. However, such an approach fails most of the time due to both the scarcity of true inliers and the small ratio of inliers to outliers (we typically get fewer than 10 inliers out of a total of 30 in the frontal 3 views, as shown in the experiment section). We thus present a new algorithm to extract 3-view correspondence inliers.

Figure 7.3: Flow chart of the sparse reconstruction module.
Our approach is illustrated in Figure 7.3. We choose to use state-of-the-art SURF features [BETG08] as interest points, and match them between image pairs using nearest neighbor matching with a certain threshold on the nearest neighbor distance ratio [MS05]. The next step is to estimate the epipolar geometry from the matches. Ideally, given the camera intrinsic matrices, we are able to estimate the essential matrix, from which the relative camera poses can be derived. However, in practice, we find that the recorded focal length information is not accurate enough, and the estimated essential matrix does not yield reasonable camera poses. We thus instead estimate the fundamental matrix using a plane+parallax method [BM]. Note that with the estimated F matrix, we perform another pass of guided feature matching to improve the quality of matches, from which a more accurate epipolar geometry can be estimated, and a more accurate set of F matrix inliers can be produced. We denote the F matrix inliers between images $I_m$ and $I_n$ by $\pi_{m,n}$.

We then use a bundle adjustment based approach to estimate the 5 camera poses. Here we use $B$ to denote the bundle adjustment process, and use subscripts $R$, $T$, and $S$ to represent three types of free parameters, which are camera rotation, camera translation, and structure respectively (e.g., $B_{R|S}$ means that the camera rotational components and the structure are free parameters, while the camera positions are fixed). To initialize the camera poses, we exploit the knowledge that any two adjacent cameras are roughly 45° apart, and have roughly the same distance to the subject. Along with the estimated F matrices, we are able to initialize the cameras at locations that are very close to their true ones. For the correspondences, we extract sets of 3-view correspondences in 3 groups of images, $(I_0, I_1, I_2)$, $(I_1, I_2, I_3)$, and $(I_2, I_3, I_4)$ respectively. Let $p_i^j$ denote the $j$th feature in $I_i$; a 3-view correspondence is a triple $(p_{i-1}^a, p_i^b, p_{i+1}^c)$ where $(p_{i-1}^a, p_i^b) \in \pi_{i-1,i}$ and $(p_i^b, p_{i+1}^c) \in \pi_{i,i+1}$. We use $\Omega_i$, $i = 1, 2, 3$, to denote the set of 3-view correspondences and the 3D points they triangulate to interchangeably. To relate $I_0$ and $I_4$, contour lines in $J_0$ and $J_4$ are matched using the iterative closest point (ICP) algorithm, and a set of correspondences, denoted as $\bar{\Omega}$, is extracted. We then use two passes of bundle adjustment, $B_{R|S}$ followed by $B_{R|T|S}$, on the correspondences ($\Omega_1$, $\Omega_2$, $\Omega_3$ and $\bar{\Omega}$) to produce the final camera calibration result. The objective of $B_{R|S}$ is to produce a suboptimal set of 3D points that leads to a faster convergence in $B_{R|T|S}$. We also fit a vertical cylinder to the reconstructed sparse structure, and estimate its center axis and radius, which are used in dense reconstruction as discussed later in Section 7.3.2.

Figure 7.4: Iterative bundle adjustment algorithm.

As mentioned earlier, the challenge here is to obtain accurate sets of $\Omega_i$. We propose to use an iterative bundle adjustment approach, as shown in Figure 7.4, that produces $\Omega_1$, $\Omega_2$, and $\Omega_3$ individually. The idea is to iteratively remove a small portion of the points in $\Omega_i$ that have the largest re-projection error, and then perform bundle adjustment again on the remaining points to recalibrate the cameras. We use $\pi_{i-1,i}$ and $\pi_{i,i+1}$ to construct an initial set of $\Omega_i$. Since outliers in $\Omega_i$ greatly influence the optimization result, we fix the camera positions and adjust their rotation components only, i.e., we use $B_{R|S}$. We also include $\pi_{i-1,i}$ and $\pi_{i,i+1}$ in the bundle adjustment so that the optimization will favor the points that conform to the estimated epipolar geometry. We then use the adjusted 3D points and cameras to compute the reprojection error for each point in $\Omega_i$. If the largest reprojection error is over a threshold, we remove the 5% of points in $\Omega_i$ that have the largest reprojection error, and perform another iteration. Compared with RANSAC based approaches, ours can still work when the number of inliers is small.
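The iterative procedure of Figure 7.4 can be sketched as follows. The bundle adjustment step $B_{R|S}$ is represented here by a placeholder function that is assumed to refine the rotations and the 3D points with camera positions fixed and to return per-correspondence reprojection errors; the threshold, the drop fraction exposed as a parameter, and all names are illustrative, not the actual implementation.

```python
def prune_three_view_inliers(omega, cameras, reproj_thresh=2.0, drop_frac=0.05):
    """Iterative bundle adjustment (sketch of Figure 7.4).

    omega   : list of 3-view correspondences (triples of 2D observations).
    cameras : initial camera parameters for the three views.
    bundle_adjust_rotation_structure() is a hypothetical placeholder for B_{R|S}.
    """
    while omega:
        cameras, errors = bundle_adjust_rotation_structure(cameras, omega)
        if max(errors) <= reproj_thresh:
            break                                   # remaining correspondences are consistent
        # drop the fraction of correspondences with the largest reprojection error
        order = sorted(range(len(omega)), key=lambda i: errors[i])
        keep = order[: int(len(order) * (1.0 - drop_frac))]
        omega = [omega[i] for i in keep]
    return omega, cameras
```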
Note that the correspondences in $\bar{\Omega}$ extracted by aligning the two profile contours are only accurate when the subject's facing direction is exactly orthogonal to the camera. As the face pans away, the correspondences are less accurate, which in turn degrades the accuracy of the camera poses. We give a full quantitative analysis of this matter in our experiments.

7.3.2 Dense Reconstruction

Once the cameras are calibrated, we can perform dense 3D reconstruction of the face from the five images, which is a multi-view stereo problem. It has been studied for years and a number of approaches have been developed. Unlike reconstruction of a scattered scene, reconstruction of faces is relatively easy due to the smoothness of the surface. However, the demand for accuracy is also higher, since the reconstructed shape is critical for applications.

We use a voxel-based approach. However, instead of sampling voxels in a Cartesian coordinate system, we sample voxels in a cylindrical coordinate system. The advantage of using a cylindrical coordinate system is twofold. First, it yields a compact representation of a 3D surface. The geometry of a facial surface can be represented using an image $D$, where the value at $D(\theta,\varphi)$ is the horizontal distance to the cylinder center axis. The 3D point corresponding to pixel $D(\theta,\varphi) = d$ is therefore

$$X_{\theta,\varphi,d} = \big(d\cos(\theta) + C_x,\; k\varphi + C_y,\; d\sin(\theta) + C_z\big), \qquad (7.1)$$

as shown in Figure 7.5 (left). $(C_x, C_y, C_z)$ is the bottom end point of the cylinder center axis, and $k$ is the distance between adjacent vertical samples. Second, $D$ is a smooth field. This is a desirable characteristic for global optimization methods that enforce smoothness constraints. Figure 7.5 (right) shows an example of $D$ that is generated from a 3D face model.

Figure 7.5: Cylindrical voxel sampling for dense 3D reconstruction.

To locate the surface voxel of pixel $(\theta,\varphi)$, we compute a matching score $M$ at each discrete level $d$ along the corresponding ray. Our matching score at point $X_{\theta,\varphi,d}$ is defined as

$$M(\theta,\varphi,d) = \max\big(\mathrm{ncc}(p_i, p_{i-1}),\; \mathrm{ncc}(p_i, p_{i+1})\big), \qquad (7.2)$$

where $i$ is the index of the image that is most fronto-parallel to $X_{\theta,\varphi,d}$, $p_i$ is the projection of $X_{\theta,\varphi,d}$ in image $I_i$, and $\mathrm{ncc}(a,b)$ is the normalized cross correlation of two image patches centered at pixels $a$ and $b$ respectively. Ideally, the matching score is maximum when the voxel is closest to the surface. However, in textureless or shadowed regions, such a local method produces poor results.
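A minimal sketch of the cylindrical sampling of Equation 7.1 and the matching score of Equation 7.2 is given below. The projection and NCC routines are assumed helpers, and the way the most fronto-parallel view is chosen from $\theta$ is only indicative; this is an illustration of the definitions, not the GPU implementation used in our experiments.

```python
import numpy as np

def cylinder_voxel(theta, phi, d, center, k):
    """3D point of cylindrical sample (theta, phi, d) -- Eq. 7.1."""
    cx, cy, cz = center                 # bottom end point of the cylinder axis
    return np.array([d * np.cos(theta) + cx,
                     k * phi + cy,
                     d * np.sin(theta) + cz])

def matching_score(theta, phi, d, images, project, ncc, center, k):
    """Matching score M(theta, phi, d) -- Eq. 7.2 (sketch).

    project(X, i)       : assumed helper, projection of 3D point X into image I_i.
    ncc(I_a, a, I_b, b) : assumed helper, NCC of patches centered at pixel a in I_a
                          and pixel b in I_b.
    """
    X = cylinder_voxel(theta, phi, d, center, k)
    # pick the view most fronto-parallel to X; this simple mapping from theta is illustrative
    i = int(np.clip(round(theta / (np.pi / 4)), 1, len(images) - 2))
    p_prev, p_cur, p_next = project(X, i - 1), project(X, i), project(X, i + 1)
    return max(ncc(images[i], p_cur, images[i - 1], p_prev),
               ncc(images[i], p_cur, images[i + 1], p_next))
```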
Our reconstruction results are further optimized by exploiting the smoothness property of a facial surface. We use the two-pass dynamic programming (DP) method presented in [KLCL] that optimizes $D$ in both the horizontal and vertical directions. A detailed description of the algorithm has been given in Section 6.3.3. In our first pass, the bias term, $E_{\bar{h}}$, is computed as

$$E_{\bar{h}} = E_{h_1} + E_{h_2} \qquad (7.3)$$

where $E_{h_1}$ and $E_{h_2}$ are $E_h$ computed in the forward (increasing $\theta$) and backward (decreasing $\theta$) directions respectively. $E_h$ is defined as

$$E_h(D(\theta,\varphi_i)) = \sum_{\theta} M\big(\theta, \varphi_i, D(\theta,\varphi_i)\big) - \sum_{\theta} \rho\big(D(\theta,\varphi_i) - D(\theta',\varphi_i)\big) \qquad (7.4)$$

where $\rho$ is an increasing function that penalizes discontinuity between two adjacent points, and $\theta'$ is the previous point of $\theta$ along the scanline. Most papers use the Potts model [BVZ01] for the function $\rho$ that accounts for discontinuities. Since $D$ is smooth, we use the following equation instead:

$$\rho(t) = \alpha \times t^4 \qquad (7.5)$$

Then the second pass dynamic programming is performed along a vertical scanline $\theta_j$ to maximize the following function:

$$E_v(D(\theta_j,\varphi)) = \sum_{\varphi} M\big(\theta_j, \varphi, D(\theta_j,\varphi)\big) + \sum_{\varphi} E_{\bar{h}}\big(\theta_j, \varphi, D(\theta_j,\varphi)\big) - \sum_{\varphi} \rho\big(D(\theta_j,\varphi) - D(\theta_j,\varphi')\big) \qquad (7.6)$$

Figure 7.6 (a) shows an example after optimization. Except for the collapsed nose tip, other regions are well reconstructed. The collapse is due to strong highlights in this region that cause the optimization to over-smooth it. We therefore introduce the contour lines in the left and right profile images to provide a boundary constraint for the reconstruction. Let $Q$ denote the family of planes that pass through the cylinder center axis. We make a reasonable assumption that all the points along the center contour belong to one of these planes. The best fitting plane $Q_{\hat{\theta}}$ can be determined by

$$\hat{\theta} = \arg\min_{\theta} \sum_i E_{proj}\big(X_{\theta}^i, \bar{\Omega}_i\big), \qquad (7.7)$$

where $X_{\theta}$ is the set of 3D points on plane $Q_{\theta}$ which correspond to the matches in $\bar{\Omega}$, and $E_{proj}(\cdot)$ is the total reprojection error of a 3D point $X_{\theta}^i$ with respect to $\bar{\Omega}_i$. Once $\hat{\theta}$ is determined, we apply the silhouette constraint by setting the matching score of the voxels along the contour to a very large number, and run the dynamic programming algorithm again. Since those voxels are set to a large number, the optimal path definitely goes through them, thus reconstructing the correct shape. Figure 7.6 (b) shows the result on the same person after applying the silhouette constraint. We can see that the subject's profile is then very well reconstructed.

Note that the number of discrete depth levels in cylindrical coordinates is relatively small, which leads to low computational complexity. Most importantly, the memory consumption is low enough for our dense reconstruction (both the NCC computation and the 2-pass dynamic programming) to run on a GPU platform in a few seconds. The computation speed is shown in our experiments.
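The silhouette constraint amounts to boosting the matching score of the contour voxels before re-running the two-pass DP, so that the optimal path is forced through them. A minimal sketch, with an illustrative constant and an assumed list of contour voxel indices:

```python
import numpy as np

def apply_silhouette_constraint(M, contour_voxels, big=1e6):
    """Force the optimal DP path through the profile-contour voxels (Section 7.3.2).

    M              : (n_theta, n_phi, n_d) matching-score volume in cylindrical coordinates.
    contour_voxels : list of (theta_idx, phi_idx, d_idx) lying on the best-fitting plane.
    """
    M = M.copy()
    for t, f, d in contour_voxels:
        M[t, f, d] = big            # a very large score pulls the optimal path through this voxel
    return M
```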
7.4 Experimental Results

We evaluate our method using two datasets. The first dataset contains synthetic, but realistic, face models and known cameras, which allows us to quantitatively assess the performance with respect to the ground truth. The second dataset includes a set of real facial images. First, to show that our problem is challenging, we tried both our datasets on several commercial software packages, including Boujou [bou], Arc3D [arc], and Photosynth [pho]. All of them fail to reconstruct reasonable camera poses. We also tested our synthetic data sets on the state-of-the-art PMVS [FPb] software package, providing the ground truth camera poses. However, it fails to establish an initial sparse set of matches for further reconstruction in its feature matching stage.

Figure 7.6: Reconstruction results: (a) without using vs. (b) using profile contours.

Figure 7.7: Reconstruction error comparison: with profile vs. without profile. Top row: ground truth depth maps. Middle row: error maps of the reconstruction without using profile information. Bottom row: error maps of the reconstruction using profile information.

7.4.1 Synthetic Face Models

The dataset consists of 10 synthetic 3D faces, including different genders and ethnic groups, which are generated by the FaceGen [fac] software. Each face model has a mesh and a texture map. To obtain photo-realistic skin texture maps, all texture maps in the synthetic faces are replaced with real facial images from the real dataset. We then use a 3D rendering software package [3ds] to simulate lighting conditions similar to the ones we have when taking real pictures, and render the views from different angles. The following 3 experiments are based on the synthesized images. For each face, we produce a 180 × 150 ground truth depth map from the model that covers the central area of the face, as shown in Figure 7.7 top row.

7.4.1.1 Accuracy of reconstruction using silhouette

We compare the improvement obtained by using the additional silhouette information in the dense reconstruction step. The middle and bottom rows of Figure 7.7 are the error maps for each individual without and with using silhouettes respectively. We can see that by using silhouette information, results are more accurate, particularly for the nose region. To visualize the accuracy of the reconstruction, Figure 7.8 (a) shows the mean error distribution of our 10 data sets. First, using profile contours provides better results, which is consistent with the results shown in Figure 7.7. Second, by using profile contours, 82.5% of the points show less than 2mm error. Compared with modern MVS approaches, our method can still produce comparable results even without prior knowledge of the camera poses.

7.4.1.2 Accuracy in imperfect profile views

As mentioned earlier in Section 7.3.1, the correspondences obtained by matching points on the two profile contours are only accurate when the subject's facing direction is exactly orthogonal to the camera. In this experiment, we analyze the accuracy degradation with respect to the angle of deviation. We perform the experiments on 3 sets of images that have 5°, 10°, and 15° of deviation respectively, and measure the error in terms of the reconstructed structure and the reconstructed camera poses. Given the true camera translation and rotation, the relative errors of the estimated translation and rotation are computed as $E_T = \|t_{true} - t\| / \|t\|$ and $E_R = \|r_{true} - r\| / \|r\|$ respectively. The results obtained by averaging the 10 datasets are shown in Figure 7.8 (b) and (c). Results show that the accuracy in terms of the reconstructed structure and the reconstructed camera poses both degrades slowly within 10° of deviation. This verifies the robustness of our algorithm, since the profile views are hardly ever perfect.

Figure 7.8: (a) mean error distributions of the 10 synthetic data sets: with profile vs. without profile; (b) mean error distributions of the 10 synthetic datasets with respect to different amounts of deviation angle in the profile views; (c) mean camera pose errors in terms of rotation and translation with respect to different amounts of deviation angle in the profile views.

7.4.2 Real Face Models

In the real face experiments, we collect 10 image sets with a Canon Rebel T1i camera and an EF-S 18-55mm f/3.5-5.6 IS lens. The resolution of each image is 2400×1600, and the distance between the two eyes is roughly 200 pixels.

7.4.2.1 Computation complexity

Sparse reconstruction, including feature extraction, feature matching, 3-view correspondence extraction, and the final 2-pass bundle adjustment, takes less than 5 minutes on average.
For dense reconstruction, we use 192 depth levels, while the sizes of $\theta$ and $\varphi$ are both 540. This amounts to 56 million voxels, for which we run both the per-voxel NCC computation and the dynamic programming optimization on an NVIDIA GTX 295 graphics card (240 stream processors, processor clock 1242 MHz), and complete the task in an average of 25 seconds. This is much faster than PMVS, which usually takes hours on similar inputs.

7.4.2.2 Quality of reconstruction results

To show the difficulty in getting 3-view correspondences, Table 7.1 shows the number of correspondence inliers vs. the total number of correspondences in our data sets. The numbers from 1 to 10 correspond to the 10 subjects, while the three pairs of numbers in each column are the number of 3-view correspondence inliers (left) and the number of 3-view correspondences (right) in $\Omega_1$, $\Omega_2$, and $\Omega_3$ respectively. We can see that the number of correspondence inliers in $\Omega_2$ is much smaller than that in $\Omega_1$ and $\Omega_3$. In the most extreme case, no correspondence inliers can be extracted from $\Omega_2$ for subject #4. However, we are still able to reconstruct a visually reasonable facial shape for subject #4, as shown in Figure 7.9, top row. Also note that a RANSAC based camera pose estimation approach can never work given the limited number of correspondence inliers in $\Omega_2$. This demonstrates the robustness of our proposed camera calibration approach.

Table 7.1: Number of 3-view correspondence inliers (left) vs. total number of 3-view correspondences (right) in $\Omega_1$, $\Omega_2$, and $\Omega_3$ for each of the 10 individuals.

Although there is no ground truth data to compare with, our results are visually accurate. Figure 7.9 shows a few results from real image sets. The left and right columns correspond to the reconstruction results without and with using contours respectively.

Figure 7.9: Reconstruction of real images (291600 points rendered in MeshLab [mes]): left, without profile contours vs. right, with profile contours.

7.5 Summary

We presented a method to generate a highly accurate 3-D model from wide-baseline images acquired in a weakly calibrated setup. To perform accurate calibration, we propose an iterative bundle adjustment approach that produces very stable results given noisy correspondences with a very small number of inliers. Contours in the profile views are matched to provide additional calibration constraints. In dense reconstruction, we use a face-specific cylindrical representation to solve a global optimization problem for N-view dense aggregation, and again use profile contours to provide constraints for the optimization. Experimental results, using carefully designed datasets and real images, confirm the accuracy and stability of our method on these challenging inputs. Our method is also shown to outperform other state-of-the-art methods in terms of robustness and efficiency.

The main limitation is our reliance on very high resolution and high quality images as input. This is discussed further in Section 8.2.

Chapter 8
Conclusion and Future Directions

In this work, we focused on 3D image registration problems, where the 3D geometry has to be explicitly reconstructed to achieve accurate registration. In particular, we address issues which cannot be solved by standard methods.
We have proposed novel approaches to solve these problems, and achieved superior results compared to other existing methods.

8.1 Summary of Contributions

The contributions in this work are summarized as follows:

1. Multi-modal image registration (Chapter 3 and Chapter 4). In UAV image registration, we use mutual information to establish correspondences between a UAV frame and the map, which have very different intensity profiles. This approach is shown to be stable across long image sequences with hundreds of frames. In retinal image registration, we extract SIFT features from corresponding edge response images to account for modality changes, and propose an iterative nearest neighbor matching algorithm to produce reliable matches. Extensive experiments on clinical data sets (from the Doheny Eye Institute) have been conducted to demonstrate the robustness and effectiveness of our approach.

2. Near planar surface 3D reconstruction (Chapter 5). We have presented a 4-pass bundle adjustment algorithm to robustly estimate the camera poses from a near planar structure, which is a degenerate case for 3D reconstruction. We are then able to perform dense reconstruction of the 3D volume, and provide 3D visualization. Experimental results on synthetic and real data sets have shown superior performance to state-of-the-art methods.

3. Dense 3D reconstruction in ill-conditioned cases (Chapters 5, 6, and 7). We have also presented three dense reconstruction methods with respect to different ill-conditioned scenarios, listed as follows:

• Multi-modality. In retinal surface reconstruction, we assume the retinal surface is piecewise planar, and formulate the dense matching problem as a piecewise alignment one. By maximizing the overall mutual information between two images, dense correspondences are derived through interpolation.

• Scenes with extruding structures. In urban scene reconstruction, we added a constraint that all planes in the scene are either vertical or horizontal, and presented a 2-pass dynamic programming algorithm that enforces such a constraint efficiently.

• Poorly textured surfaces. In face reconstruction, we perform the reconstruction in cylindrical coordinates, which allows us to regularize the reconstruction using smoothness constraints. Profile contours are used as a strong constraint to counter over-smoothing.

4. Camera pose estimation from wide baseline images (Chapter 7). From a weakly calibrated setup, we proposed an iterative bundle adjustment algorithm to remove correspondence outliers, and estimated camera poses from a small number of inliers. Experimental results on synthetic and real data sets have shown both high robustness and high accuracy of our approach, whereas all commercial software packages and state-of-the-art methods fail to estimate them correctly.

5. GPU acceleration (Chapters 5, 6 and 7). Many key modules in our work were designed to utilize parallel processing and were implemented on a GPU platform to maximize the computation throughput. They include mutual information computation and maximization (Section 5.5), dense matching score computation (Section 6.3.1 and Section 7.3.2), 2-pass dynamic programming (Section 6.3.4 and Section 7.3.2), and edge extraction (Section 6.4).

8.2 Future Directions

From the current limitations of our work, we have summarized a number of future directions as follows:

1. 3D building mesh model generation from dense point clouds. We plan to extend our work and improve the building-modeling module, as the current module only outputs dense points.
Based on the 3D points we are able to reconstruct, we plan to build a complete and textured 3D mesh of the scene. This mesh can then serve as an accurate representation of the underlying 3D structure of the urban scene, and facilitate applications such as aerial surveillance, urban planning, and navigation. 2. Online camera calibration of UAV image sequences. The quality of urban scene 3D reconstruction depends on the accuracy of camera calibration, which is currently performed offline, and with human intervention. In the future, an online and automatic process should be developed. In addition, we plan to use telemetry data provided by the UAV system as an absolute reference, so that our calibrated cameras are aligned with geo-coordinates. This allows the reconstructed 3D models to be easily integrated with other tasks that also work in geo-coordinates, such as vehicle tracking, or urban planning. 3. 3D face reconstruction with low resolution images. Our current method requires high resolution images to provide a sufficient number of correspondences in the sparse reconstruction step, which may not always be practical. In the future, to adapt our approach to lower resolution images, we plan to use facial landmarks to establish an initial set of correspondences, and use them for camera pose estimation. Preliminary results on a small data set indicate that this approach is likely to overcome the current limitation. 97 Bibliography [3ds] 3ds max. http://usa.autodesk.com. [ABF + 07] Brian Amberg, Andrew Blake, Andrew Fitzgibbon, Sami Romdhani, and Thomas Vetter. Reconstructing high quality face-surfaces using model based stereo. In ICCV 2007, 2007. [arc] Arc 3d. http://www.arc3d.be/. [ASW93] Takaaki Akimoto, Yasuhito Suenaga, and Richard S. Wallace. Automatic creation of 3d facial models. IEEE Comput. Graph. Appl., 13(5):16–22, 1993. [BCM] Antoni Buades, Bartomeu Coll, and Jean-Michel Morel. A non-local algo- rithm for image denoising. In CVPR 2005, volume 2, pages 60–65. [BETG08] Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. Computer Vision and Image Understanding, 110(3):346–359, 2008. [BL] M. Brown and D. G. Lowe. Recognising panoramas. In ICCV 2003,vol- ume 2, pages 1218–1225. [BM] B. Boufama and R. Mohr. Epipole and fundamental matrix estimation using virtual parallax. In ICCV 1995, page 1030. [bou] boujou. http://www.2d3.com. [Bro] Adrian Broadhurst. A probabilistic framework for space carving. In ICCV 2001, pages 388–393. [BSZF] C. Baillard, C. Schmid, A. Zisserman, and A.W. Fitzgibbon. Automatic line matching and 3d reconstruction of buildings from multiple views. In ISPRSGIS 99, pages 69–80. [BT] S. Birchfield and C. Tomasi. Depth discontinuities by pixel-to-pixel stereo. ICCV 1998, pages 1073–1080. 98 [BVZ01] Y.Boykov, O.Veksler, andR.Zabih. Fastapproximateenergyminimization via graphcuts. IEEE Trans. Pattern Anal. Mach. Intell., 23(11):1222–1239, Nov 2001. [CBGS06] P Cattin, H Bay, L Van Gool, and G Szekely. Retina mosaicing using local features. MICCAI, 4191:185–192, 2006. [CC] Tae Eun Choe and I. Cohen. Registration of multimodal fluorescein images sequence of the retina. In ICCV 2005, pages 106–113. [CF] T. Chanwimaluang and G. Fan. An efficient algorithm for extraction of anatomical structures in retinal images. In ICIP 2003, volume 1, pages 1093–1096. [CFF06] T. Chanwimaluang, Guoliang Fan, and S.R. Fransen. Hybrid retinal image registration. IEEE Transactions on Information Technology in Biomedicine, 10:129–142, 2006. 
[CLCG08] Nico Cornelis, Bastian Leibe, Kurt Cornelis, and Luc Gool. 3d urban scene modeling integrating recognition and reconstruction. Int. J. Com- put. Vision, 78(2-3):121–141, 2008. [cli] Columbuslargeimageformat. https://www.sdms.afrl.af.mil/datasets/clif2006. [CM] Tae Eun Choe and G. Medioni. 3-d metric reconstruction and registration of images of near-planar surfaces. In ICCV 2007. [CMT] Daniel Crispell, Joseph Mundy, and Gabriel Taubin. Parallax-free registra- tion of aerial video. In BMVC 2008. [CSRT02] Ali Can, Charles V. Stewart, Badrinath Roysam, and Howard L. Tanen- baum. A feature-based, robust, hierarchical algorithm for registering pairs of images of the curved human retina. IEEE Trans. Pattern Anal. Mach. Intell., 24(3):347–364, 2002. [cud] Nvidia cuda programming guide 2.3. http://www.nvidia.com. [CY] J.M. Coughlan and A.L. Yuille. Manhattan world: Compass direction from a single image by bayesian inference. In ICCV 1999, pages 941–947. [Dav] J. Davis. Mosaics of scenes with moving objects. In CVPR 1998, pages 354–360. [fac] Facegen. http://www.facegen.com. [FB81] Martin A. Fischler and Robert C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and auto- mated cartography. Commun. ACM, 24(6):381–395, 1981. 99 [FCSS] Y. Furukawa, B. Curless, S.M. Seitz, and R.S. Szeliski. Manhattan-world stereo. In CVPR 2009, pages 1422–1429. [FLM] Olivier D. Faugeras, Quang-Tuan Luong, and Stephen J. Maybank. Camera self-calibration: Theory and experiments. In ECCV 1992, pages 321–334. [FM] D. Fidaleo and G. Medioni. Model-assisted 3d face reconstruction from video. In AMFG 2007, pages 124–138. [FPa] J.-M. Frahm and M. Pollefeys. Ransac for (quasi-)degenerate data (qdegsac). In CVPR 2006, pages 453–460. [FPb] Yasutaka Furukawa and Jean Ponce. Patch-based multi-view stereo soft- ware. http://grail.cs.washington.edu/software/pmvs. [FP09] Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multi- view stereopsis. IEEE Trans. Pattern Anal. Mach. Intell., 2009. [FR] Olivier Faugeras and Luc Robert. What can two images tell us about a third one? In ECCV 1994, volume 1, pages 485–492. [FRML] M. Fradkin, M. Roux, H. Maitre, and U.M. Leloglu. Surface reconstruction from multiple aerial images in dense urban areas. In CVPR 1999, pages 262–267. [Fua] P. Fua. Using model-driven bundle-adjustment to model heads from raw video sequences. In ICCV 1999, pages 46–53. [GCS] Michael Goesele, Brian Curless, and Steven M. Seitz. Multi-view stereo revisited. In CVPR 2006, pages 2402–2409. [GGST86] H N Gabow, Z Galil, T Spencer, and R E Tarjan. Efficient algorithms for finding minimum spanning trees in undirected and directed graphs. Com- binatorica, 6(2):109–122, 1986. [Gra90] Robert M. Gray. Entropy and information theory. 1990. [Har] R. I. Hartley. In defense of the 8-point algorithm. In ICCV 1995, page 1064. [Har94] Richard I. Hartley. Projective reconstruction and invariants from multiple images. IEEE Trans. Pattern Anal. Mach. Intell., 16(10):1036–1041, 1994. [HKG00] A.D. Hoover, V. Kouznetsova, and M. Goldbaum. Locating blood vessels in retinal images by piecewise threshold probing of a matched filter response. IEEE Trans. Medical Imaging, 19(3):203–210, March 2000. [HS88] C. Harris and M. Stephens. A combined corner and edge detection. In Proceedings of The Fourth Alvey Vision Conference, pages 147–151, 1988. 100 [HZ04] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Second edition, 2004. [KCM] E. Kang, I. Cohen, and G. 
Medioni. A graph-based global registration for 2d mosaics. In ICPR 2000. [KKZ] Junhwan Kim, Vladimir Kolmogorov, and Ramin Zabih. Visual correspon- dence using energy minimization and mutual information. In ICCV 2003, page 1033. [KLCL] Jae Chul Kim, Kyoung Mu Lee, Byoung Tae Choi, and Sang Uk Lee. A dense stereo matching using two-pass dynamic programming with general- ized ground control points. In CVPR 2005, pages 1075–1082. [KS] Yan Ke and R. Sukthankar. Pca-sift: a more distinctive representation for local image descriptors. In CVPR 2004, volume 2, pages 506–513. [KS00] Kiriakos N. Kutulakos and Steven M. Seitz. A theory of shape by space carving. Int. J. Comput. Vision, 38:307–314, 2000. [KSS] Avi Kelman, Michal Sofka, and Charles V. Stewart. Keypoint descriptors for matching across multiple image modalities and non-linear intensity vari- ations. In CVPR 2007. [KZa] V. Kolmogorov and R. Zabih. Computingvisual correspondencewith occlu- sions using graph cuts. In ICCV 2001, volume 2, pages 508–515. [KZb] Vladimir Kolmogorov andRamin Zabih. Multi-camera scene reconstruction via graph cuts. In ECCV 2002, pages 82–96. [LMNF09] Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. Epnp: An accu- rate o(n) solution to the pnp problem. Int. J. Comput. Vision, 81(2):155– 166, 2009. [Low04] David G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision, 60(2):91–110, 2004. [LPK] P. Labatut, J.-P. Pons, and R. Keriven. Efficient multi-view reconstruction of large-scale scenes using interest points, delaunay triangulation and graph cuts. In ICCV 2007. [LYM07] Yuping Lin, Qian Yu, and Gerard Medioni. Map-enhanced uav image sequence registration. In IEEE Workshop on Applications of Computer Vision, 2007. [MAW + ] P. Merrell, A. Akbarzadeh, Liang Wang, P. Mordohai, J.-M. Frahm, RuigangYang, D.Nister, andM. Pollefeys. Real-time visibility-based fusion of depth maps. In ICCV 2007. 101 [Med] G. Medioni. Matching of a map with an aerial image. In ICPR 1982, pages 517–519. [mes] Meshlab. http://meshlab.sourceforge.net. [MFM] R. Marzotto, A. Fusiello, and V. Murino. High resolution video mosaicing with global alignment. In CVPR 2004, volume 1, pages 692–698. [MK] Branislav Micusik and Jana Kosecka. Piecewise planar city 3d modeling from street view panoramic sequences. In CVPR 2009, pages 2906–2912. [MM06] P. Mordohai and G. Medioni. Tensor Voting: A Perceptual Organization Approach to Computer Vision and Machine Learning. 2006. [MS05] K. Mikolajczyk and C. Schmid. A performance evaluation of local descrip- tors. IEEE Trans. Pattern Anal. Mach. Intell., 27(10):1615–1630, Oct. 2005. [MSN99] XuMeihe, Rajagopalan Srinivasan,andWieslawL.Nowinski. Afastmutual informationmethodformulti-modal registration. InInformation Processing in Medical Imaging, pages 466–471, 1999. [OK85] Yuichi Ohta and Takeo Kanade. Stereo by intra- and inter-scanline search using dynamic programming. IEEE Trans. Pattern Anal. Mach. Intell., 7(2):139–154, March 1985. [pho] Photosynth. http://photosynth.net/. [PKVG] Marc Pollefeys, Reinhard Koch, and Luc Van Gool. Self-calibration and metric reconstruction in spite of varying and unknown internal camera parameters. In ICCV 1998, page 90. [PM] Thomas Pollard and Joseph L. Mundy. Change detection in a 3-d world. In CVPR 2007. [PNF + 08] M. Pollefeys, D. Nist´ er, J. M. Frahm, A. Akbarzadeh, P. Mordohai, B. Clipp, C. Engels, D. Gallup, S. J. Kim, P. Merrell, C. Salmi, S. Sinha, B. Talton, L. Wang, Q. Yang, H. Stew´ enius, R. Yang, G. Welch, and H. Towles. 
Detailed real-time urban 3d reconstruction from video. Int. J. Comput. Vision, 78(2-3):143–167, 2008. [PVG] Marc Pollefeys and Luc Van Gool. A stratified approach to metric self- calibration. In CVPR 1997, page 407. [PVGV + 04] Marc Pollefeys, Luc Van Gool, Maarten Vergauwen, Frank Verbiest, Kurt Cornelis, Jan Tops, and Reinhard Koch. Visual modeling with a hand-held camera. Int. J. Comput. Vision, 59(3):207–232, 2004. [RC] S´ ebastien Roy and Ingemar J. Cox. A maximum-flow formulation of the n-camera stereo correspondence problem. In ICCV 1998, page 492. 102 [Rez04] Ali M. Reza. Realization of the contrast limited adaptive histogram equal- ization (clahe) for real-time image enhancement. J. VLSI Signal Process. Syst., 38(1):35–44, 2004. [SAN + 04] J. Staal, M.D. Abramoff, M. Niemeijer, M.A. Viergever, and B. van Gin- neken. Ridge-based vessel segmentation in color images of the retina. IEEE Trans. Medical Imaging, 23(4):501–509, April 2004. [SB07] Ramtin Shams and Nick Barnes. Speeding up mutual information compu- tation using nvidia cuda hardware. Digital Image Computing Techniques and Applications, pages 555–560, 2007. [SCD + ] Steven M. Seitz, Brian Curless, James Diebel, Daniel Scharstein, and Richard Szeliski. A comparison and evaluation of multi-view stereo recon- struction algorithms. In CVPR 2006, pages 519–528. [SFG] Christoph Strecha, Rik Fransens, and Luc Van Gool. Wide-baseline stereo from multiple views: A probabilistic account. In CVPR 2004, pages 552– 559. [SHH99] C.Studholme, D.L.G.Hill, andD. J.Hawkes. Anoverlap invariant entropy measure of 3d medical image alignment. Pattern Recognition, 32(1):71–86, 1999. [SK99] Harpreet S. Sawhney and Rakesh Kumar. True multi-image alignment and its application to mosaicing and lens distortion correction. IEEE Trans. Pattern Anal. Mach. Intell., 21(3):235–243, 1999. [SKS04] Y. Sheikh, S. Khan, and M. Shah. Feature-based georegistration of aerial images. In Geosensor Networks, 2004. [SKSC03] Y. Sheikh, S. Khan, M. Shah, and R.W. Cannata. Geodetic alignment of aerial video frames. In Video Registration, 2003. [SLZ] Y. Shan, Z.C. Liu, and Z.Y. Zhang. Model-based bundle adjustment with application to face modeling. In ICCV 2001, pages 644–651. [SM97] C. Schmid and R. Mohr. Local grayvalue invariants for image retrieval. IEEE Trans. Pattern Anal. Mach. Intell., 19(5):530–535, May 1997. [SS00] Heung-Yeung Shum and Richard Szeliski. Systems and experiment paper: Construction of panoramic image mosaics with global and local alignment. Int. J. Comput. Vision, 36(2):101–130, 2000. [SS02] Daniel Scharstein and Richard Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Comput. Vision, 47(1-3):7–42, 2002. 103 [STR03] Charles V. Stewart, Chia-Ling Tsai, and B. Roysam. The dual-bootstrap iterative closest point algorithm with application to retinal image registra- tion. IEEE Trans. Medical Imaging, 22(11):1379–1394, 2003. [SvHVG + ] C.Strecha, W.von Hansen, L.J.Van Gool, P.Fua, andU.Thoennessen. On benchmarking camera calibration and multi-view stereo for high resolution imagery. In CVPR 2008. [tai] Tailwind. http://www.darpa.mil/ipto/solicit/solicit.asp. [TH] L.A. Tang and T.S. Huang. Automatic construction of 3d human face models based on 2d images. In ICIP 1996, pages 467–470. [tL96] Q. t. Luong. The fundamental matrix: Theory, algorithms, and stability analysis. Int. J. Comput. Vision, 17:43–75, 1996. [TLF09] Engin Tola, Vincent Lepetit, and Pascal Fua. 
Daisy: An efficient dense descriptor applied to wide baseline stereo. IEEE Trans. Pattern Anal. Mach. Intell., 2009. [Tri] B. Triggs. Autocalibration and the absolute quadric. In CVPR 1997, pages 609–614. [VWMW97] Paul Viola andIIIWilliam M. Wells. Alignment bymaximization of mutual information. Int. J. Comput. Vision, 24(2):137–154, 1997. [WHH + ] R.P. Wildes, D.J. Hirvonen, S.C. Hsu, R. Kumar, W.B. Lehman, B. Matei, and W.Y. Zhao. Video georegistration: Algorithm and quantitative evalu- ation. In ICCV 2001, pages II: 343–350. [ZDFL95] Zhengyou Zhang, Rachid Deriche, Olivier Faugeras, and Quang-Tuan Luong. A robust technique for matching two uncalibrated images through the recovery of the unknown epipolar geometry. Artif. Intell., 78(1-2):87– 119, 1995. [Zha94] Zhengyou Zhang. Iterative point matching for registration of free-form curves and surfaces. Int. J. Comput. Vision, 13(2):119–152, 1994. [ZN] Qian-Yi Zhou and Ulrich Neumann. Fast and extensible building model- ing from airborne lidar data. In GIS 2008: SIGSPATIAL international conference on Advances in geographic information systems. 104
Abstract
Image registration is a fundamental problem in image analysis, and is used in a variety of applications such as tracking, moving object detection, remote sensing, etc. In this thesis, we study image registration problems where the images are taken at different times, from different sensors, and from different viewpoints. Moreover, the scene may contain arbitrary 3D structure. With all these variations, image registration becomes a challenging task, where standard techniques may fail to produce accurate results.