LINE SEGMENT MATCHING AND ITS APPLICATIONS IN 3D URBAN MODELING

by

Lu Wang

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

May 2010

Copyright 2010 Lu Wang

Dedication

This thesis is dedicated to my mother and father, who gave me the most support during my PhD study.

Acknowledgements

I would like to thank my advisor Dr. Ulrich Neumann for his direction, assistance, and guidance. I also wish to thank Dr. Suya You, Dr. Burcin Becerik-Gerber, Dr. Ram Nevatia, and Dr. Gerard Medioni for their comments and suggestions, which were of great help in the projects and in the paper writing. Special thanks go to my lab colleagues, who also helped me in many ways.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Wide-Baseline Image Matching
  1.2 Automatic Registration of 2D Aerial Images with Untextured 3D Aerial LiDAR Data
  1.3 Semi-Automatic Registration of Ground-Level Panoramas with Orthorectified Aerial Images
  1.4 Line Detection
  1.5 Summary

Chapter 2: Edge Detection
  2.1 Related Work
  2.2 Saliency Measure
  2.3 Segment Based Hysteresis Thresholding
  2.4 Experimental Results
  2.5 Summary

Chapter 3: Line Signature and Its Application in Wide-Baseline Image Matching
  3.1 Related Work
  3.2 The Detection of Line Signatures
  3.3 The Similarity Measure of Line Signatures
    3.3.1 The Description of a Pair of Line Segments
    3.3.2 The Similarity Measure of Segment Pairs
    3.3.3 The Similarity of Line Signatures
  3.4 Fast Matching with the Codebook
  3.5 Wide-Baseline Image Matching
  3.6 Experimental Results
    3.6.1 Experimental Settings
    3.6.2 Scenes without Rich Texture
    3.6.3 Non-Planar Scenes with Large Viewpoint Change
    3.6.4 Images Related by Homography
    3.6.5 Comparison with Existing Line Matching Approaches
    3.6.6 Conclusion
  3.7 Summary

Chapter 4: Automatic Registration of Aerial Images with Untextured Aerial LiDAR Data
  4.1 Related Work
  4.2 Line Segment Detection
  4.3 Detection and Description of 3CS Features
    4.3.1 The Detection of 3CS Features
    4.3.2 The Description of 3CS Features
  4.4 Two-level RANSAC Algorithm
  4.5 Experimental Results
  4.6 Summary

Chapter 5: Automatic Registration between Ground-Level Panoramas and Orthorectified Aerial Images for Building Modeling
  5.1 Related Work
  5.2 Preprocessing
  5.3 Correspondence Detection and Camera Pose Estimation
    5.3.1 Overview
    5.3.2 Voting for the Camera Pose
    5.3.3 Additional Constraints
    5.3.4 User Correction
  5.4 Model Construction and Optimization
  5.5 Experimental Results
  5.6 Summary

Chapter 6: Conclusion

Bibliography

List of Tables

3.1 Comparison of different local features on matching low-texture images
3.2 Comparison of different local features on matching wide-baseline non-planar scenes
3.3 Comparison of different local features on the first image sequence from the ZuBud image database
3.4 Comparison of different local features on the second image sequence from the ZuBud image database
3.5 Matching results of Line Signature for images related by homographies

List of Figures

1.1 Low-texture images
1.2 Example results of line signature matching
1.3 Screen shots of photo-realistic 3D city models
1.4 The registration of 3D models and street views in Google Earth
1.5 3D reconstruction from ground-level panoramas and orthorectified aerial images
2.1 Example of supporting range map
2.2 Adaptive parameter experiment
2.3 Fixed parameter experiment
3.1 Line segment clustering
3.2 An example of line signatures
3.3 The description of two line segments
3.4 Matching results of line signature on textureless scenes
3.5 Matching results of line signature for non-planar scenes
3.6 Two image sequences from the ZuBud image database
3.7 Charts of the number of correct matches and matching precision for different local features
3.8 Image matching for handwriting
3.9 Image sequences
3.10 Comparison of repeatability, number of correspondences, matching score and number of correct matches between different local features
3.11 Comparison with existing line matching approaches
4.1 The projection according to the initial camera pose
4.2 Line segment detection
4.3 An example of 3CS features
4.4 Spatial distribution of feature matches
4.5 Registration before and after the refinement
4.6 Screen shots of textured 3D models
5.1 An aerial image and the user interaction on it
5.2 The estimation of camera rotation and location
5.3 Example user interactions and screen shots of a reconstructed 3D model
5.4 Screen shots of a 3D model reconstructed from 2 panoramas

Abstract

Man-made environments are full of line segments, and a complex curve can be approximated with multiple straight-line segments. Therefore, line segment matching is an important computer vision problem, and it is a powerful tool for solving registration problems in 3D urban modeling.

Image-based 3D modeling is one of the core goals in computer vision, in which a great challenge is image matching. The approaches based on local features are currently the most effective matching techniques for wide-baseline images. The detection and description of existing local features are directly based on pixels. In the first part of the work, a novel local feature called the Line Signature is introduced. Its construction is based on curves: curves are extracted from images and approximated with line segments, and a line signature is a local cluster of these line segments. Extensive experiments have shown that wide-baseline image matching using line signatures has significant advantages over existing approaches in handling low-texture images, large viewpoint changes of non-planar scenes, and illumination variations.

Next, a robust approach is presented for automatic registration of aerial images with untextured aerial LiDAR data. Airborne LiDAR has become an important technology in large-scale 3D city modeling. To generate photo-realistic models, aerial images are needed for texture mapping, in which a key step is to obtain accurate registration between the two data sources. Existing registration approaches based on matching corners are not robust, especially for areas with heavy vegetation. Our approach based on line segments has a 98 percent success rate.

Airborne remote sensing technologies provide information about building rooftops, but the detailed structures of building facades can only be captured from the ground level. Therefore, in order to create 3D models with high-resolution geometry and texture for both roofs and facades, it is necessary to integrate the two data sources. In the final part of the work, an interactive system is proposed that can rapidly create georeferenced 3D models of groups of buildings with high-resolution texture for both roofs and facades by integrating orthorectified aerial images and ground-level panoramas. To greatly reduce the user interaction, a semi-automatic approach is proposed for matching line segments detected in ground-level panoramas with those in orthorectified aerial images.

Chapter 1: Introduction

Photo-realistic 3D urban models have numerous applications in map services, city planning, real estate, the film industry, computer games, and home security. According to the data sources used, current 3D modeling approaches can be divided into two kinds: image-based and LiDAR-based.
Image-based approaches reconstruct 3D models from 2D images or videos, while the inputs to LiDAR-based approaches are 3D point clouds from laser scanners.

This work is focused on three registration problems encountered in 3D urban modeling under different scenarios: image matching in 3D reconstruction from wide-baseline images; registration of 2D aerial images with untextured 3D aerial LiDAR data for generating photo-realistic large-scale city models; and registration of ground-level images with orthorectified aerial images for reconstructing 3D building models with high-resolution texture for both roofs and facades. The foundation of our approaches to all three problems is matching based on line segments.

1.1 Wide-Baseline Image Matching

Image-based 3D modeling is one of the ultimate goals of computer vision. Its main advantage over LiDAR-based approaches is that only a common digital camera is required, whereas laser scanners are still expensive and most of them are not portable. The two essential problems to be tackled in image-based approaches are image registration and 3D inference based on camera geometry. While progress in the study of multi-view camera geometry can be counted among the major achievements in computer vision over the past two decades, image registration, or image matching, remains one of the fundamental problems that have not been solved very well.

The image registration problem is much easier when the 3D reconstruction is from calibrated stereo images with a small baseline (the viewpoint change is small), or from video sequences in which consecutive images are very similar and tracking techniques can be applied. Image registration in 3D reconstruction from uncalibrated wide-baseline images (with large viewpoint change) is still a very difficult problem. However, it is the methods based on wide-baseline images that provide the greatest flexibility in usage, in terms of the ease of image capture and the required storage space of the data. A few images taken with a regular photo camera from several different viewpoints are usually enough to reconstruct a high-quality 3D model [62, 64].

Local feature technology is becoming increasingly popular in wide-baseline image matching due to its robustness to occlusion and clutter. Most existing local features are pixel-based. Each feature is a group of pixels in a connected local region (a so-called local patch). The region is typically of a regular shape, e.g., the rectangular window in traditional template matching and SIFT [44], the circular area in Shape Context [13], and the elliptical region in Harris-Affine features [11]. Under large image deformation, however, it is very common that similar regions in two images cannot be enclosed by templates with a fixed shape without including a considerable part of the background, which may be totally different in the two images. This is one of the major reasons that the repeatability of these local features decreases rapidly with viewpoint change. Many feature descriptors use some kind of histogram, such as those used in SIFT [44], Shape Context [13] and GLOH [48]. Histograms with fixed bin size are often not distinctive when image deformation is large. Some descriptors are based on moments [30], which can handle large deformation of planar patterns but have limited power to deal with non-planar distortion such as parallax. In addition, most existing local features fail in matching textureless images.
However, scenes without rich texture are common in man-made environments, such as those shown in Figure 1.1(a) and (b).

Figure 1.1: Images without rich texture.

Our approach [69] clusters detected line segments into local groups according to spatial proximity. Each group is treated as a feature called a Line Signature. Similar to local features, line signatures are robust to occlusion and clutter. They are also robust to events between the segments, which is an advantage over features based on connected regions [49]. Moreover, their description depends mainly on the geometric configuration of the segments, so they are invariant to illumination. However, unlike existing affine-invariant features, we cannot assume affine distortion inside each feature area, since neighboring line segments are often not coplanar.

Line signatures are more biologically sound than pixel-based approaches. A human cannot detect SIFT features but can perceive line segments, and we suspect that part of the function of human selective attention is to cluster segments into distinctive features. In some sense, the line signature shares a similarity with the bag-of-features technique, in that line segments can be regarded as primitive features and each line signature is a bag of line segments. However, the description of a line signature depends largely on the spatial relationships between its line segments, whereas in the original bag-of-features approaches the spatial relationships between features are usually ignored [63, 19, 52].

There are two challenges in constructing robust features based on line segment clustering. The first is to ensure feature repeatability under unstable line segment detection. In our approach, this is handled by multi-scale polygonization and grouping in line segment extraction, a clustering criterion that considers the relative saliency between segments, and a matching strategy that allows unmatched segments. The second challenge is to design a distinctive feature descriptor robust to large viewpoint changes, taking into account that the segments may not be coplanar and that their endpoints are inaccurate. Our approach describes line signatures based on pairwise relationships between line segments, whose similarity is measured with a two-case algorithm robust not only to large affine transformations but also to a considerable range of 3D viewpoint changes for non-planar surfaces.

Extensive experiments validate that the line signature has much better performance than existing local features in matching textureless structured images, non-planar scenes under large viewpoint change, and illumination variation. It is also robust to scale change and image blur. These advantages make the line signature ideal for image matching in 3D modeling of man-made environments from wide-baseline images. Figure 1.2 demonstrates some example outputs of line signature matching. The two images in each row are an image pair to be registered. The red line segments in the images are the matched line segments automatically detected with our approach. To show the matching results, the line segments are labeled at their middle points, and those with the same labels are pairs of corresponding line segments.

1.2 Automatic Registration of 2D Aerial Images with Untextured 3D Aerial LiDAR Data

LiDAR-based 3D modeling has drawn a lot of attention in recent years due to its robustness and high accuracy.
LiDAR (Light Detection and Ranging) technology determines the distance to an object or surface by measuring properties of scattered light (mostly laser pulses). The outputs of LiDAR devices are directly 3D point clouds. The remaining problems in 3D modeling include how to merge different laser scans into a unified 3D model, and how to smooth and simplify the raw point clouds so that the data size is reduced and suitable for real-time rendering without sacrificing the visual quality of the 3D models. Another important issue is that LiDAR does not provide texture information; therefore, to generate photo-realistic 3D models, 2D images are acquired and need to be registered with the 3D LiDAR data for texture mapping.

Figure 1.2: The two images in each row are an image pair to be registered. The red line segments are the matched line segments automatically detected with line signature matching. To show the matching results, the line segments are labeled at their middle points, and those with the same labels are pairs of corresponding line segments.

The second part of the work is focused on the problem of registering aerial images with untextured aerial LiDAR data in large-scale city modeling. Airborne LiDAR has emerged as a mainstream technology for rapid modeling of large-scale urban environments. While the problem of simplifying the raw 3D point clouds with building and terrain segmentation has been addressed in several recent papers [45, 55, 75], automatic registration of aerial images with LiDAR data for texture mapping remains an open problem. Usually, the cameras capturing the aerial images are calibrated, and an approximate camera pose for each aerial image is provided by GPS/INS systems. However, these camera parameters are often not accurate enough for precise texture mapping, even with the most advanced and expensive hardware devices. In addition, in many cases GPS/INS data is not available continuously for every frame and is distorted by significant biases and drift [74]. In recent work by Ding and Zakhor [22], an automatic registration method was proposed, but it achieves only a 61% overall success rate.

We present a more robust approach [68]. For the 2120 aerial images tested in our experiments, it correctly registers 97% of them with the 3D LiDAR models to obtain accurate texture mapping. The robustness is due to three contributions: 1) a robust line detector, introduced in Section 1.4; 2) matching based on a novel feature called 3CS (3-connected-segments), which is more distinctive than the features used in most existing work; and 3) a two-level RANSAC approach that considers not only the number of features but also their spatial distribution when selecting inliers. Some screen shots of the 3D city models with photo-realistic texture generated with our approach are given in Figure 1.3.

Figure 1.3: Screen shots of 3D city models with photo-realistic texture generated with our approach.

1.3 Semi-Automatic Registration of Ground-Level Panoramas with Orthorectified Aerial Images

Airborne remote sensing technologies, including aerial images and airborne LiDAR, provide information about building rooftops, but the 3D modeling of detailed facade structures with high-resolution texture can only be based on information from ground-level sensors. To generate 3D models with high-resolution geometry and texture for both building rooftops and facades, it is necessary to integrate the information from both data sources.
As an example, Google Earth [1] has 3D building models created from airborne LiDAR and aerial images. Figure 1.4(a) shows a close-up view of the facade of one of these 3D buildings; it has very low resolution. Google Earth also provides street views roughly every 10 meters on most streets of major US cities, which are images taken from ground level and stitched into panoramas. Figure 1.4(b) is one of these ground-level images, which obviously has much higher resolution than Figure 1.4(a). The ground-level images are calibrated and their poses are given by GPS/INS systems. However, these camera parameters are not accurate enough for texture mapping. This can be seen from Figure 1.4(c), which is a transparent overlay of the previous two images provided by Google. To integrate the two data sources, image-based registration is required. Due to the large difference in their view angles, automatic registration is challenging, and there are very few related papers in the literature.

Figure 1.4: (a) A screen shot of the facade of a 3D building model in Google Earth. (b) A ground-level street view image in Google Earth. (c) The transparent overlay of (a) and (b) according to the camera pose provided by GPS/INS.

In the last part of the work, we present an interactive modeling system that integrates the information from ground-level images and orthorectified aerial images. Orthorectified aerial images are available from websites such as Google Earth and Microsoft Virtual Earth [1, 3]; an example is shown in Figure 1.5(a). Therefore, our system has a low cost and can be used by most people who have access to the Internet and a regular digital camera. Ground-level images are taken in such a way that they can be stitched into panoramas to increase the camera field of view; Figure 1.5(b) is an example. The users are required to draw building outlines in the aerial images, such as the red line segments shown in Figure 1.5(a). We focus on the automatic detection of correspondences between the line segments detected in the ground-level panoramas and those on the building outlines given by the users in the aerial images. Based on these corresponding line segments and multi-view camera geometry, 3D building models can be automatically reconstructed that have high-resolution texture for both facades and roofs. Figure 1.5(c)-(f) show several screen shots of the 3D model created from Figure 1.5(a) and (b). With minor modification, the automatic registration algorithm can also be used to register ground-level panoramas with 3D models created from airborne LiDAR or aerial images, such as the case shown in Figure 1.4.

Figure 1.5: (a) An orthorectified aerial image downloaded from Google Earth. The red line segments are building outlines drawn by the user. (b) A panorama stitched from ground-level images. (c)-(f) Screen shots of the 3D model created from (a) and (b).

1.4 Line Detection

Line segments are abundant in man-made environments. Our approaches to the three registration problems are all based on the matching of line segments; therefore, the robustness of the line segment detector is very important. The line detection algorithm in most existing work on line-based matching [22, 41, 26] can be described as follows: edge pixels are detected with the Canny detector and then linked into curves according to 8-neighbor connectivity, and the curves are divided into straight line segments by thresholding the line fitting error; a sketch of this splitting step is given below.
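To make the baseline splitting step concrete, the following is a minimal sketch (Python with NumPy; a generic recursive fitting-error split written for illustration, not the exact procedure used in [22, 41, 26], and the function name and error threshold are our own choices):

```python
import numpy as np

def split_into_segments(curve, max_error=2.0):
    """Split an ordered chain of linked edge pixels into straight segments.

    curve : (N, 2) array of (x, y) points along one linked edge curve.
    A piece is accepted when every point lies within max_error pixels of the
    chord joining its endpoints; otherwise the curve is split at the point of
    maximum deviation and both halves are processed recursively.
    Returns a list of (start_point, end_point) pairs.
    """
    curve = np.asarray(curve, dtype=float)
    if len(curve) < 3:
        return [(curve[0], curve[-1])]
    a, b = curve[0], curve[-1]
    chord = b - a
    norm = np.linalg.norm(chord) + 1e-12
    # perpendicular distance of every point to the chord a-b
    dists = np.abs((curve[:, 0] - a[0]) * chord[1]
                   - (curve[:, 1] - a[1]) * chord[0]) / norm
    k = int(np.argmax(dists))
    if dists[k] <= max_error:
        return [(a, b)]                      # fits a single line segment
    return (split_into_segments(curve[:k + 1], max_error) +
            split_into_segments(curve[k:], max_error))
```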
Although this approach is simple and efficient, it is not very robust. Our line detector improves on it in two respects. First, a novel thresholding technique based on a supporting range and segment-based hysteresis thresholding is used in edge detection. It performs better at keeping object boundaries complete while removing the edge pixels caused by random noise or texture. It also makes the edge detection less sensitive to the selection of thresholds, so that the parameters can be fixed without manual adjustment for each image during the experiments. Second, to improve the robustness of registration, the basic strategy of our line detection is to maximize the detection of correct line segments, even though this may also increase the number of spurious line segments, which can be removed according to the constraints in the registration process. This strategy is implemented with multi-scale polygonization and line segment grouping. For the different registration problems, the line detection algorithms differ slightly.

1.5 Summary

The thesis is focused on three registration problems in 3D urban modeling: wide-baseline image matching, 2D aerial image to 3D aerial LiDAR data registration, and ground-level panorama to orthorectified aerial image registration. The approaches are all based on line segment matching, and their performance is validated with extensive experiments.

The rest of the thesis is organized as follows. In Chapter 2, the edge detection algorithm used in all our approaches is described. Chapter 3 introduces the line signature and its application in wide-baseline image matching. In Chapter 4, a robust approach for automatic registration of 2D aerial images with untextured 3D aerial LiDAR data is proposed. Chapter 5 presents an interactive 3D modeling system and a semi-automatic approach for registering ground-level panoramas with orthorectified aerial images. Chapter 6 gives a conclusion.

Chapter 2: Edge Detection

2.1 Related Work

Edge detection is a fundamental problem in computer vision. Currently the most popular edge detectors are gradient-based [47]. These approaches typically have three steps: gradient computation, non-maximum suppression, and thresholding. The gradient can be computed with templates such as Sobel. Non-maximum suppression selects the pixels whose gradient magnitude is a local maximum along the gradient direction; these pixels are called edge pixels. Among them, the salient ones are selected by thresholding.

There are many variations of the thresholding step. The simplest is to keep the edge pixels whose gradient magnitude is above a threshold. However, it is very difficult to choose a single threshold that filters out most of the noise while still keeping most of the useful edge pixels. Hysteresis thresholding in the Canny detector [16] is a better approach, in which two thresholds are used: edge pixels with gradient magnitude below the lower threshold or above the upper threshold are definitely removed or kept, respectively, and the remaining edge pixels are kept if they are next to pixels that have already been selected. Dynamic thresholding [57] is another advanced approach, in which the threshold varies across the image and the threshold for each pixel is decided by its neighboring edge pixels. Some researchers found that gradient operators may give a large spurious response in an apparently unstructured neighborhood because they usually have a large null space [47].
To solve this problem, some approaches [47, 50] compute a confidence value for each pixel by fitting an ideal edge model to the local image structure; thresholding is then based on both the gradient magnitude and the confidence value.

We present two ideas to improve the robustness of thresholding. First, a novel saliency value called the supporting range is introduced. It measures the range along the gradient direction of an edge pixel within which its gradient magnitude is a local maximum. Under the assumption of uniform random noise, the probability that an edge pixel is caused by noise is small if this range is large. Second, a hysteresis thresholding based on curve segments instead of pixels is proposed. Before thresholding, edge pixels are linked into connected curve segments, and the thresholding decides whether to remove or keep whole segments based on their saliency values. The reasoning is that the probability that a number of edge pixels aligned into a salient connected chain are all caused by noise is small, even if each individual pixel has low saliency. In [40], thresholding is also based on segments, but it does not use the hysteresis technique.

In Section 2.2, the saliency measure based on the supporting range is introduced. Section 2.3 describes the segment-based hysteresis thresholding. Experimental results are given in Section 2.4.

2.2 Saliency Measure

The absolute value of the gradient magnitude is not the only criterion for evaluating the saliency of an edge pixel. Its spatial distribution is also an important cue for judging whether an edge pixel lies on an object boundary or is due to noise or the texture of the object material. Figure 2.1 shows an example: Figure 2.1(b) is the gradient map of Figure 2.1(a), with the gradient magnitudes of the edge pixels normalized to [0, 255]. The edge pixels on the middle part of the banana boundary have very low gradient magnitudes, but people still quickly notice them because, compared to the other edge pixels in a large neighborhood (the table area), their gradient magnitudes are higher.

To describe this property, we introduce the concept of the supporting range. For an edge pixel p with gradient magnitude g, we search along its gradient direction to find the closest edge pixel p1 with a larger gradient magnitude, and denote the distance between the two edge pixels as d1. We then search in the opposite direction to find the closest edge pixel p2 with a gradient magnitude larger than g; the distance between p2 and p is denoted d2. The supporting range of the edge pixel p is defined as:

    d = max(d1, d2)    (2.1)

In addition, if d1 > d2 (or d1 <= d2), the average gradient magnitude ḡ of the edge pixels on the line segment p p1 (or p p2) is computed. The larger of d1 and d2 is selected because the image area on one side of an object boundary is often very cluttered. To improve speed, the search stops once the distance exceeds 100 pixels, so the supporting range is clamped to be within 100; the number 100 was chosen according to our experimental results. Figure 2.1(c) is the supporting range map of Figure 2.1(a), in which the supporting range is normalized to [0, 255] so that it can be compared with the gradient map in Figure 2.1(b). The outline of the banana becomes clear in Figure 2.1(c), while it is very weak in Figure 2.1(b). An important advantage of the supporting range is that it is invariant to illumination variations. A sketch of this computation is given below.
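The following is a minimal sketch (Python with NumPy; not the thesis's implementation, and the function and argument names are our own) of how the supporting range of Eq. (2.1) can be computed for a single edge pixel by marching along the gradient direction until a stronger edge pixel is found, with the search clamped at 100 pixels as described above. It also returns the average gradient magnitude ḡ accumulated on the longer side, which is needed for the saliency value defined next in Eq. (2.2).

```python
import numpy as np

def supporting_range(grad_mag, gx, gy, edge_mask, y, x, max_range=100):
    """Supporting range d = max(d1, d2) of the edge pixel at (y, x), Eq. (2.1).

    grad_mag, gx, gy : 2D arrays with the gradient magnitude and components.
    edge_mask        : 2D bool array, True at non-maximum-suppressed edge pixels.
    Returns (d, g_bar), where g_bar is the mean gradient magnitude of the edge
    pixels passed on the longer of the two search directions (the g-bar of the
    text).  Border handling (stopping at the image edge) is our own assumption.
    """
    h, w = grad_mag.shape
    g = grad_mag[y, x]
    norm = np.hypot(gx[y, x], gy[y, x]) + 1e-12
    dy, dx = gy[y, x] / norm, gx[y, x] / norm          # unit gradient direction

    def march(sign):
        """Distance to the closest stronger edge pixel along +/- gradient."""
        mags = []
        for step in range(1, max_range + 1):
            yy = int(round(y + sign * step * dy))
            xx = int(round(x + sign * step * dx))
            if (yy, xx) == (y, x):
                continue
            if not (0 <= yy < h and 0 <= xx < w):
                return step, mags                      # hit the image border
            if edge_mask[yy, xx]:
                mags.append(grad_mag[yy, xx])
                if grad_mag[yy, xx] > g:               # stronger edge pixel found
                    return step, mags
        return max_range, mags                         # clamp at 100 pixels

    d1, m1 = march(+1)
    d2, m2 = march(-1)
    d, mags = (d1, m1) if d1 >= d2 else (d2, m2)
    g_bar = float(np.mean(mags)) if mags else 0.0
    return d, g_bar
```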
This invariance makes it much easier to find an optimal threshold that is suitable for a large set of images, which is validated in our experiments. The saliency value of an edge pixel is computed as:

    s = (g − ḡ)^α · d^β    (2.2)

The parameters α and β control the relative importance of the gradient magnitude and the supporting range. In our experiments we set α = 1 and β = 1. In practice, they can be selected according to the application; for example, to find the boundaries of the major objects in a scene, it is better to use a larger β.

Figure 2.1: (a) Gray image. (b) Gradient map. (c) Supporting range map.

2.3 Segment Based Hysteresis Thresholding

The hysteresis thresholding in the Canny detector is pixel-based. In our approach, the edge pixels are linked into connected curve segments and the thresholding is based on segments. The reason is that some individual edge pixels may not be salient enough to be distinguished from noise, but if they are aligned into a long connected chain, the probability that this alignment is caused by random noise is small; thresholding should incorporate this global information. The approach in [40] also uses segments in thresholding but does not exploit the hysteresis technique. Some curve segments may be too short to have a saliency value above the threshold; however, if they are very likely to be the extension of some salient segments, it is better to keep them in order to preserve the completeness of object boundaries.

In the first step of the segment-based hysteresis thresholding, edge pixels are linked into curve segments with the following approach. Starting from the pixel with the highest gradient magnitude, a curve segment grows at each step by including the most salient edge pixel among the 16 neighbors of its two endpoints (each endpoint has 8 neighbors). The growing ends when no more edge pixels can be added. Then a new segment is started from the edge pixel with the highest gradient magnitude among the remaining edge pixels that have not yet been included in a curve segment. This process is repeated until no new curve segment can be generated. The saliency value of each curve segment is the sum of the saliency values of its edge pixels, computed with Eq. 2.2.

Hysteresis thresholding uses a lower threshold L and an upper threshold H. The curve segments whose saliency values are above H are selected. The segments with more than 50% of their edge pixels having saliency values below L are removed. Among the rest of the segments, a segment is selected if it is connected with a segment that has already been selected (the gap between the closest endpoints of the two segments is within 3 pixels). This process is iterated until no more segments can be selected.

Not all of the edge pixels on the selected segments are kept. Assume the edge pixel sequence on a segment is {p1, ..., pi, ..., pj, ..., pn}. If all of the edge pixels {p1, ..., pi} at one end of the segment, or all of the pixels {pj, ..., pn} at the other end, have saliency values smaller than L, these pixels are removed and only the pixels {pi+1, ..., pj−1} are kept. In other words, the saliency values of the two endpoints of the trimmed segment are larger than L, and only the edge pixels in the middle can have saliency values below L.

2.4 Experimental Results

We conducted extensive experiments to compare the proposed approach with the five edge detectors studied in [33]: the Canny, Nalwa, Iverson, Bergholm and Rothwell edge detectors.
Following the same methodology as [33], two groups of tests were conducted. In the first group, for each edge detector the best overall parameter set for a set of images is identified and fixed during edge detection for all the images; this is called the fixed parameter experiment. In the second group, the optimal parameter set of each edge detector is found for each image; this is called the adaptive parameter experiment.

Figure 2.3 shows the results on 4 of the 28 real images used in the fixed parameter experiment. Due to space limitations, only the edge maps generated by the Bergholm and Iverson edge detectors are displayed for comparison with our approach; according to [33], they are the best two of the five edge detectors in fixed parameter experiments. From left to right, the first column contains the gray images and the next three columns are the edge maps generated by the Bergholm, Iverson and our algorithms, respectively. All the gray images, the edge maps and the parameters of the five edge detectors can be obtained from the website of the Image Analysis Research Laboratory at USF (http://marathon.csee.usf.edu/edge/edgecompare main.html). The parameters used in our algorithm for this image set are (σ = 3, H = 10000, L = 75), where σ is the size of the Gaussian template used for image smoothing before the gradient computation, and H and L are the upper and lower thresholds in the hysteresis thresholding. The gradient magnitudes of the edge pixels are normalized to be between 0 and 255.

The edge maps output by our algorithm are clearly less noisy, and most of the edge pixels on the object boundaries are detected. Some of the edge pixels that are important for object recognition are missing in the edge maps generated by the other edge detectors due to low gradient magnitudes. For example, the left part of the tire is missing in Figure 2.3(b) and Figure 2.3(c), the table outline in Figure 2.3(f) is incomplete, and the banana cannot be recognized from Figure 2.3(j) and Figure 2.3(k). Our approach successfully detects these edge pixels.

Figure 2.2 shows the comparison of the proposed algorithm with the Canny and Bergholm edge detectors in the adaptive parameter experiment. According to [33], they are the best two of the five detectors in this kind of experiment. The parameters used in our algorithm for the two images are (σ = 3, H = 6250, L = 25) and (σ = 3, H = 7500, L = 25), respectively. The edge map of the first image given by the proposed algorithm contains the most complete outlines of the cone and the car. For the second image, the boundary of the orange is more complete and less noisy in our output than in those given by the other detectors. The time our approach takes to process a 640x480 image is around 250 ms on a 3 GHz PC.

Figure 2.2: Adaptive parameter experiment. The first column contains the gray images; the second, third and fourth columns are the edge maps generated by the Canny detector, the Bergholm detector and our detector.

2.5 Summary

A robust thresholding method for edge detection is proposed. In this approach, the saliency measure of edge pixels depends not only on the gradient magnitude but also on its spatial distribution, described by the supporting range. In addition, the hysteresis thresholding is based not on pixels but on curve segments. The approach outperforms five existing edge detectors in the fixed-parameter and adaptive-parameter experiments.

Figure 2.3: Fixed parameter experiment. The first column contains the gray images;
the second, third and fourth columns are the edge maps generated by the Bergholm detector, the Iverson detector and our detector.

Chapter 3: Line Signature and Its Application in Wide-Baseline Image Matching

3.1 Related Work

Local features are widely used in image matching and object recognition; excellent reviews are provided in [49, 48]. Each local feature technique has two main components: a detector and a descriptor. The detector extracts local areas from images, and the descriptor represents them mathematically so that their similarity can be measured. There are two criteria for evaluating a local feature: repeatability and distinctiveness. Repeatability is the ability of a feature's counterpart in one image to be detected in the other image even under significant image deformation. Distinctiveness means that the description of a feature should be similar to that of its corresponding feature and very different from the descriptions of all other features. Usually there is a trade-off between them.

Most existing local features are pixel-based. Each feature is a group of pixels in a connected local region. The region is typically of a regular shape, such as a rectangle or an ellipse. Under large image deformation, however, it is very common that similar regions in two images cannot be enclosed by templates with a fixed shape without including a considerable part of the background, which may be totally different in the two images. This is one of the major reasons that the repeatability of these local features decreases rapidly with viewpoint change. Many feature descriptors use some kind of histogram, such as those used in SIFT [44], Shape Context [13] and GLOH [48]. Histograms with fixed bin size are often not distinctive when image distortion is large. Some descriptors are based on moments [30], which can handle large deformation of planar patterns but have limited power to deal with non-planar distortion such as parallax. In addition, existing local features usually fail on low-texture scenes, which are common in man-made environments.

In the recent literature, some region-based features have been proposed [34, 72, 7]. Most of them are based on Region Adjacency Graphs (RAGs) or segmentation trees to describe the regions extracted with image segmentation approaches. However, current image segmentation techniques have difficulty generating stable regions under large illumination change, occlusion and perspective distortion. In addition, region-based approaches usually cannot provide matching results with pixel-level accuracy.

We present a novel local feature called the line signature. It is based on curves: curves are extracted from images and divided into straight line segments, and a line signature is a cluster of nearby segments with an arbitrary spatial distribution. The number of segments directly controls its distinctiveness. Compared to pixel-based approaches, it is also much easier to design a descriptor of a line signature that can handle large image distortion, since the number of its segments is small.

Many line matching approaches match individual segments based on their position, orientation and length, and take a nearest-line strategy [21]. They are better suited to image tracking or small-baseline stereo. Some methods start by matching individual segments and resolve ambiguities by enforcing a weak constraint that adjacent line matches have similar disparities [46], or by checking the consistency of segment relationships, such as left of, right of, connectedness, etc. [35, 67].
These methods require known epipolar geometry and still cannot handle large image deformation. Many of them are also computationally expensive because they solve global graph matching problems [35]. The approach in [58] is limited to scenes with dominant homographies. Some methods match individual segments based on the intensity [59] or color [12] distribution of pixels on both sides of the line segments; they are not robust to large illumination changes. Moreover, [59] requires known epipolar geometry, and [12] is not suited to matching B/W images.

Perceptual grouping of segments is widely used in object recognition and detection [38, 43, 24]. It is based on perceptual properties such as connectedness, convexity, and parallelism, chosen so that the grouped segments are more likely to lie on the same object. Although this strategy is useful for reducing the search space when detecting the same objects against totally different backgrounds, it is not suited to image matching: it is quite common that the curve fragments detected on the boundary of an object are indistinctive by themselves, yet they can form a distinctive feature together with several fragments on neighboring objects whose spatial configuration is stable under a considerable range of viewpoint changes (Figure 3.2 shows an example). Therefore, in our approach, the grouping is based only on the proximity and relative saliency of segments, with feature repeatability and distinctiveness the only concerns. A line signature usually contains segments on different objects. This property is similar to many region features, where the regions refer to any subset of the image, unlike those in image segmentation [49]. With indistinctive features, the approach in [24] resorts to the bag-of-features paradigm, which is also not suited to image matching. Moreover, the description and similarity measure of segment groups in most object recognition systems are not designed to handle large image deformation.

The rest of this chapter is organized as follows. The detection of line signatures is presented in Section 3.2. Section 3.3 describes their similarity measure. Section 3.4 provides a codebook approach to improve matching speed. An efficient algorithm for removing outliers in image matching is presented in Section 3.5. Experimental results are given in Section 3.6, and the chapter is summarized in Section 3.7.

3.2 The Detection of Line Signatures

The segment clustering is based on the spatial proximity and relative saliency between segments. For a segment i with saliency value s, we search the neighborhood of one of its endpoints for the top k segments that are closest to this endpoint (based on the closest point, not the perpendicular distance) and whose saliency values satisfy s' >= r*s, where r is a ratio. The k segments together with i form a line signature. Segment i and the endpoint are called its central segment and its center, respectively. Similarly, another line signature can be constructed centered at the other endpoint of i. For example, in Figure 3.1(a) the line signature centered at the endpoint p1 of segment i consists of segments {i, c, a, b} when k = 3. Note that although segment e is closer to p1 than a and b, it is not selected because it is not salient compared to i. The line signature centered at p2 includes {i, d, a, c}.

Figure 3.1: (a) Line segment clustering. (b) Overlapping segments. (c) Parallel segments.

Spatial proximity improves repeatability, since the geometric configuration of nearby segments usually undergoes only moderate variations over a large range of viewpoint changes. Our clustering approach is scale invariant; a sketch of the clustering rule is given below.
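A minimal sketch of this clustering rule (Python with NumPy; the segment representation and helper names are our own assumptions, and the multi-scale handling and the suppression of nearby parallel segments discussed below are omitted):

```python
import numpy as np

def point_to_segment_dist(p, a, b):
    """Closest-point distance from point p to the segment with endpoints a, b."""
    p, a, b = np.asarray(p, float), np.asarray(a, float), np.asarray(b, float)
    ab = b - a
    t = np.clip(np.dot(p - a, ab) / (np.dot(ab, ab) + 1e-12), 0.0, 1.0)
    return np.linalg.norm(p - (a + t * ab))

def line_signature(segments, i, endpoint, k=5, r=0.5):
    """Cluster the top-k segments around one endpoint of segment i.

    segments : list of dicts {'p1': (x, y), 'p2': (x, y), 's': saliency}.
    endpoint : 'p1' or 'p2' of the central segment i (the signature's center).
    Only segments whose saliency is at least r times that of the central
    segment are eligible (the relative-saliency criterion s' >= r*s); the k
    closest of them, by closest-point distance to the center, form the
    signature together with segment i itself.
    """
    center = np.asarray(segments[i][endpoint], float)
    s_central = segments[i]['s']
    candidates = []
    for j, seg in enumerate(segments):
        if j == i or seg['s'] < r * s_central:
            continue
        d = point_to_segment_dist(center, seg['p1'], seg['p2'])
        candidates.append((d, j))
    candidates.sort()
    return [i] + [j for _, j in candidates[:k]]
```

Calling this for both endpoints of every detected segment yields the two line signatures associated with each segment.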
Compared to the method in [51], where a rectangular window with a fixed size relative to the length of the central segment is used to group contours, our clustering is more robust to large image distortion and can guarantee feature distinctiveness.

Relative saliency is neglected in many perceptual grouping systems [38, 43, 24]. Nevertheless, it is important for handling the instability of segment extraction. Weak segments in one image often disappear in another image (e.g., e in Figure 3.1(a)). However, if the central segment of a line signature in one image exists in the other image, its other segments are less likely to disappear, since they have comparable or higher saliency (s' >= r*s). As a result, line signatures centered at corresponding segments in two images are more likely to be similar. In addition, this strategy gives us features of different scales, since with a strong central segment the other segments in a line signature are also strong, while those associated with weak central segments are usually also weak. In our experiments, the ratio r = 0.5; the results are not sensitive to it within a reasonable range.

The number k is called the rank of a line signature. Increasing k improves distinctiveness but decreases repeatability and increases the computation in matching. In our experiments, we found k = 5 to be a balanced choice.

A practical issue during line signature construction is illustrated in Figure 3.1(b). Due to multi-scale polygonization and grouping, segments ab, bc and de can be regarded either as three separate segments or as one segment ae. Therefore, for these two possibilities, we construct two different rank-3 line signatures centered at the endpoint f of segment fg: {fg, ab, bc, de} and {fg, ae, jk, hl}. Another practical issue is depicted in Figure 3.1(c), where several parallel segments are very close, a common case in man-made scenes. This configuration is usually unstable, and line signatures composed mostly of such segments are indistinctive. Therefore, from a set of nearby parallel segments, our approach selects only the most salient one into a line signature.

Figure 3.2 shows two rank-5 line signatures whose central segments (in blue) are corresponding segments in two images. Their centers are indicated with yellow dots and their other segments are highlighted in red; the other detected line segments are in green. Although most segments in the two line signatures can be matched, two of them (ab in (a) and cd in (b)) cannot. To be more robust to unstable segment extraction and clustering, the similarity measure between line signatures should allow unmatched segments.

Figure 3.2: Two corresponding rank-5 line signatures in two real images. The blue segments are the central segments and the red ones are their other segments.

3.3 The Similarity Measure of Line Signatures

The similarity between two line signatures is measured based on the geometric configuration of their segments. Checking whether the two segment groups satisfy an epipolar geometry is impractical, since the endpoints of the segments are often inaccurate while infinite lines provide no epipolar constraint. Moreover, the segment matches between two line signatures are few and some segments may share endpoints, so the number of matched endpoints is usually insufficient to determine an epipolar geometry. It is also infeasible to compute the similarity based on whether the segments satisfy an affine matrix or a homography, because the segments in a line signature are often not coplanar.
Figure 3.4(c)-(d) provide a good example: the neighboring lines forming the ceiling corner lie on different planes, but their configuration gives important information for matching the images. Our approach measures similarity based on the pairwise relationships between segments. Similar strategies are also used in [73] and [37], where pairwise configurations between edgels or line segments are used to describe the shape of symbols. Unlike [24], in which only the relationships of segments with the central segment of a feature are described, our approach is more distinctive because it describes the relationship between every two segments.

3.3.1 The Description of a Pair of Line Segments

Many methods [35, 67, 12] describe the relationship of two segments with terms such as left of, right of, connected, etc. In [37] it is described with an angle and a length ratio between the segments, and in [24] with a vector connecting their middle points. None of these descriptions is very distinctive.

We describe the configuration of two line segments by distinguishing two cases. In the first case, the segments are coplanar and lie in a local area, so their transformation between images is affine. As shown in Figure 3.3(a), the lines of two segments p1p2 and q1q2 (the arrows in the figure represent their orientations, Section 3.1) intersect at c. The signed length ratios

    r1 = (p1c · p1p2) / |p1p2|^2  and  r2 = (q1c · q1q2) / |q1q2|^2

are affine invariant, so they are good choices for describing the two-segment configuration. Moreover, they neatly encode the information of connectedness (r1 or r2 in {0, 1}) and intersection (r1, r2 in (0, 1)), which are important structural constraints. Since the invariance of r1 and r2 is equivalent to an affinity, we can judge whether the transformation is affine by thresholding the changes of r1 and r2.

If the two segments are not coplanar or the perspective effect is significant, any configuration is possible if the underlying transformation can be arbitrarily large. However, since the two segments are proximate, in most cases the variations of the relative positions between their endpoints are moderate over a large range of viewpoint changes. The limit on the extent of transformation provides important constraints for measuring similarity, which is also the theory behind the SIFT descriptor [44] and is used in the psychological model of [23]. In the SIFT descriptor, the limit on the change of pixel positions relative to the feature center is set by the bins of its histogram. It was reported in [44] that although SIFT features are only scale invariant, they are more robust to changes in 3D viewpoint for non-planar surfaces than many affine-invariant features.

For two line segments, there are 6 pairwise relationships between the 4 endpoints. We select one of them, the vector p1p2 in Figure 3.3(a), as the reference to achieve scale and rotation invariance. Each of the other 5 endpoint pairs is described with the angle and the length ratio (relative to p1p2) of the vector connecting its two points. Specifically, the attributes are l1 = |q1q2|/|p1p2|, l2 = |q1p1|/|p1p2|, l3 = |q1p2|/|p1p2|, l4 = |q2p1|/|p1p2|, l5 = |q2p2|/|p1p2|, and the angles θ1-θ5 in Figure 3.3(a). Note that although the absolute locations of points q1 and q2 can be described with only 4 coordinates, it is necessary to use 10 attributes to describe the location of each point relative to all the other points, since the judgment of shape similarity may differ with different reference points.
An example is shown in Figure 3.3(b), where the position change from q1 to q1' relative to p2 is not large (θ5 and l3 are small), whereas the change relative to p1 is very large (θ2 > π/2). A similar strategy of using the relative relationship between every two points is also exploited in [73] to describe the configuration of N points, which provides O(N^3) constraints.

In addition, we use an extra attribute g = g2/g1 to describe appearance information, where g1 and g2 are the average gradient magnitudes of the two segments. It is robust to illumination changes and helps to further improve distinctiveness. Therefore, the feature vector describing a two-segment configuration contains 13 attributes in total: v = {r1, r2, l1-l5, θ1-θ5, g}. Note that there is no conflict with the degrees of freedom: the 12 attributes describing the geometric configuration are computed from the 8 coordinates of the 4 endpoints. They are dependent, and are essentially intermediate computation results stored in the feature vector to make the subsequent similarity computation more efficient. A similar strategy of increasing the dimensionality of the original feature space to facilitate the subsequent classification is used in kernel-based SVM approaches [14].

Figure 3.3: (a) Description of a segment pair. (b) Different reference points generate different judgments. (c) Impractical affine transformation. (d) Impractical deformation. (e) Quantization of the feature space of the general similarity. (f) The general similarity is −∞ but the affine similarity is not.

3.3.2 The Similarity Measure of Segment Pairs

Assume the feature vectors of two segment pairs are v = {r1, r2, l1-l5, θ1-θ5, g} and v' = {r1', r2', l1'-l5', θ1'-θ5', g'}, respectively. As mentioned before, if |r1 − r1'| and |r2 − r2'| are smaller than a threshold Tr, the underlying transformation can be regarded as affine. In this case, we call the similarity of the two segment pairs the affine similarity; otherwise, it is called the general similarity.

Affine Similarity: To be completely affine invariant, the affine similarity should be based only on ri and ri'. However, from experiments we found that matching results can be improved by limiting the range of the changes of the length ratio l1 and the angle θ1, which increases feature distinctiveness. In our approach, the affine similarity Sa is computed as:

    Sa = d_r1 + d_r2 + d_θ1 + d_l1 + d_g,  if the conditions in (3.2) hold;  −∞, otherwise.    (3.1)

where

    d_ri = 1 − |ri − ri'| / Tr,  i in {1, 2};
    d_θ1 = 1 − |θ1 − θ1'| / Tθ;
    d_l1 = 1 − (max(l1, l1') / min(l1, l1') − 1) / Tl;
    d_g  = 1 − (max(g, g') / min(g, g') − 1) / Tg;
    with {d_ri, d_θ1, d_l1, d_g} >= 0  and  (θ1 − π)(θ1' − π) >= 0.    (3.2)

Tr, Tθ, Tl and Tg are thresholds. If the change of an attribute is larger than its threshold (i.e., any of {d_ri, d_θ1, d_l1, d_g} < 0), the deformation between the segment pairs is regarded as impractical and their similarity is −∞. Note that these thresholds play a role similar to the bin dimensions in the histograms of local features, in which the location change of a pixel between two images is regarded as impractical if it falls into a different bin. The thresholds also normalize the contributions of the different attributes, so that d_ri, d_θ1, d_l1, d_g lie in [0, 1]. In addition, they greatly reduce the computation: if the change of any attribute is larger than its threshold, the similarity is −∞ without any further computation, which leads to the codebook approach discussed in Section 3.4. A sketch of the affine-similarity computation is given below.
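As an illustration, the affine branch of this measure can be written as the following sketch (Python; the descriptor is stored as a plain dict and the names are our own; the general-similarity branch described next follows the same pattern over l1-l5 and θ1-θ5). The default thresholds are the values reported below.

```python
import math

def affine_similarity(v, w, Tr=0.3, Ttheta=math.pi / 2, Tl=3.0, Tg=3.0):
    """Affine similarity S_a of two segment-pair descriptors (Eqs. 3.1-3.2).

    v, w : dicts with keys 'r1', 'r2', 'theta1', 'l1', 'g' (the subset of the
    13-attribute descriptor used by the affine case).  Returns -inf when the
    deformation exceeds any threshold or when the theta1 side condition
    (theta1 - pi)(theta1' - pi) >= 0 is violated.
    """
    def ratio_term(a, b, T):
        # 1 - (max/min - 1)/T, the normalized change of a positive ratio
        return 1.0 - (max(a, b) / min(a, b) - 1.0) / T

    d_r1 = 1.0 - abs(v['r1'] - w['r1']) / Tr
    d_r2 = 1.0 - abs(v['r2'] - w['r2']) / Tr
    d_t1 = 1.0 - abs(v['theta1'] - w['theta1']) / Ttheta
    d_l1 = ratio_term(v['l1'], w['l1'], Tl)
    d_g = ratio_term(v['g'], w['g'], Tg)

    terms = [d_r1, d_r2, d_t1, d_l1, d_g]
    same_side = (v['theta1'] - math.pi) * (w['theta1'] - math.pi) >= 0
    if min(terms) < 0 or not same_side:
        return float('-inf')          # deformation regarded as impractical
    return sum(terms)
```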
Through extensive experiments we found Tr = 0.3, Tθ = π/2, Tl = Tg = 3 to be good choices; the approach is not sensitive to them within a reasonable range. The condition (θ1 − π)(θ1' − π) >= 0 (with θ1, θ1' in [0, 2π]) is used to avoid the situation in Figure 3.3(c), where the deformation from q1q2 to q1'q2' is affine but rarely happens in practice.

Note that an alternative way to measure affine similarity is to fit an affinity to the 4 endpoints of the segments and estimate the fitting error. However, computing an affine matrix for every combination of two segment pairs between two images is inefficient compared to estimating ri just once for each segment pair in each image. More importantly, unlike ri, the affine matrix does not directly reflect the constraints of connection and intersection.

General Similarity: It is measured based on the relative positions between the 4 endpoints:

    Sg = Σ_{i=1..5} d_li + Σ_{i=1..5} d_θi + d_g,  if {d_li, d_θi, d_g} >= 0 and not C;  −∞, otherwise.    (3.3)

where d_li, d_θi and d_g are computed in the same way as in Eq. 3.2. Since we are handling line segments, not disconnected points, the situation C depicted in Figure 3.3(d), where q2 jumps across segment p1p2 to q2' (in which case segments p1p2 and q2q2' intersect), is impractical.

Overall Similarity: Combining the above two cases, the overall similarity S of two segment pairs is computed as:

    S = Sa,  if |ri − ri'| <= Tr;  (1/4)·Sg,  otherwise.    (3.4)

The coefficient 1/4 gives Sg a lower weight, so that its maximal contribution to S is 2.75, smaller than the maximal contribution of Sa, which is 5. This reflects the fact that affinity is a stronger constraint, so segment pairs satisfying an affinity are more likely to be matched.

3.3.3 The Similarity of Line Signatures

Given two line signatures, their similarity is the sum of the similarities between their corresponding segment pairs. However, the mapping between their segments is unknown, except for the central segments. The approach in [24] sorts the segments in each feature according to the coordinates of their middle points, and the segment mapping is determined directly by the ordering. However, this ordering is not robust under large image deformation and inaccurate endpoint detection. In addition, as mentioned before, some segments in a line signature may not have counterparts in the corresponding line signature, due to unstable segment detection and clustering. Instead of ordering segments, our approach finds the optimal segment mapping that maximizes the similarity measure between the two line signatures.

Denote a one-to-one mapping between segments as M = {(l1, l1'), ..., (lk, lk')}, where {l1, ..., lk} and {l1', ..., lk'} are subsets of the segments in the first and the second line signatures. Note that the central segments must form a pair in M. Let the similarity of the two segment pairs (li, lj) and (li', lj') be Sij, where (li, li') and (lj, lj') are in M. The similarity of the two line signatures is computed as:

    S_LS = max over M of ( Σ_{i<j} Sij ).    (3.5)

In the following section, we show that the combinatorial optimization in Eq. 3.5 can be largely avoided with a codebook-based approach.

3.4 Fast Matching with the Codebook

For most segment pairs, their similarities are −∞ because their differences on some attributes are larger than the thresholds. Therefore, by quantizing the feature space into many types (subspaces) such that the difference between different types on at least one attribute is larger than its threshold, segment pairs with −∞ similarity can be found directly from their types, without explicit computation.
This is similar to the codebook approaches used in [24] and many other object detection systems in which the feature space quantization is based on clustering the features detected in training images, whereas it is uniform in our approach. Since the similarity measure has two cases, the quantization is conducted in two feature spaces: the space spanned byfr i ; 1 ;l 1 g for ane similarity and the space spanned byf i ;l i g for general similarity. From experiments, we found for a segment pair whose segments are not nearly intersected, if its general similarity with another segment pair is1, their ane similarity is almost always1. If its two segments are intersected or close to be intersected, this may be wrong as shown in Figure 3.3(f) where the general similarity is1 becauseq 1 jumps acrossp1p2 toq 0 1 but the ane similarity is not, which usually happens due to inaccurate endpoint detection. Therefore, the quantization of the feature space spanned byfr i ; 1 ;l 1 g is conducted only in its subspace where r i 2 [0:3; 1:3]. The segment pairs within this subspace have types in two feature spaces, while the others have only one type in the space of f i ;l i g which is enough to judge if the similarities are1 . 36 The space offr i ; 1 ;l 1 g is quantized as follows: The range [0:3; 1:3] of r 1 and r 2 , and the range [0; 2] of 1 are uniformly divided into 5 and 12 bins respectively. The range of l 1 is not divided. Therefore, there are 300 types. The quantization of the space spanned byf i ;l i g is not straightforward since it is large and high dimensional. Based on experiments, we found the following approach is eective: The image space is divided into 6 regions according to ! p 1 p 2 as shown in Figure 3.3(e). Thus there are 36 dierent distributions of the two endpoints of ! q 1 q 2 in these regions. When they are on dierent sides of ! p 1 p 2 , the intersection of the two lines can be above, on or below the segment ! p 1 p 2 . The reason to distinguish the 3 cases is that they cannot change to each other without being the condition C in Eq.3. For example, if ! q 1 q 2 changes to q 0 1 q 0 2 in Figure 3.3(e), point p 1 will jump across ! q 1 q 2 . The range of 1 is uniformly divided into 12 bins. Thus, there are 264 types (Note some congurations are meaningless, e.g., when q 1 and q 2 are in region 6 and 2 respectively, 1 can only be in (0;=2)). Therefore, there are totally 564 types (more types can be obtained with ner quan- tization but it requires more memory). Each segment pair has one or two types depending on if r i 2 [0:3; 1:3]. The types of a segment pair are called its primary keys. In addition, according to the deformation tolerance decided by the thresholds T r ,T , the condition ( 1 )( 0 1 ) 0 in Eq.2, and the conditionC in Eq.3, we can predict the possible types of a segment pair in another image after image deformation (e.g. an endpoint in region 2 can only change to region 1 or 3). These predicted types are called its keys. The similarity of a segment pair with another segment pair is not 1 only if one of its primary keys is one of the keys of the other segment pair. 37 For each line signature, the keys of the segment pairs consisting of its central segment and each of its other segments are counted in a histogram of 564 bins with each bin representing a type. In addition, each bin is associated with a list storing all the segments that fall into it. 
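As an illustration, the type (primary key) of a segment pair in the affine-similarity space can be indexed as follows. The uniform 5 x 5 x 12 bin layout follows the description above, while the exact lower bound of the quantized r_i range and the function name are assumptions of this sketch.

    import math

    R_LO, R_HI = -0.3, 1.3        # quantized range of r_1, r_2 (sign of the lower bound assumed)
    N_R_BINS, N_THETA_BINS = 5, 12

    def affine_space_type(r1, r2, theta1):
        # Primary key in the {r_i, theta_1} space: 5 x 5 x 12 = 300 uniform bins.
        # Returns None when (r1, r2) falls outside the quantized range, in which
        # case the segment pair only has a type in the general-similarity space.
        if not (R_LO <= r1 <= R_HI and R_LO <= r2 <= R_HI):
            return None
        b_r1 = min(int((r1 - R_LO) / (R_HI - R_LO) * N_R_BINS), N_R_BINS - 1)
        b_r2 = min(int((r2 - R_LO) / (R_HI - R_LO) * N_R_BINS), N_R_BINS - 1)
        b_t  = int(theta1 / (2 * math.pi) * N_THETA_BINS) % N_THETA_BINS
        return (b_r1 * N_R_BINS + b_r2) * N_THETA_BINS + b_t

The 264 types of the general-similarity space are indexed analogously from the region layout of Figure 3.3(e) and the 12 bins of theta_1.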
To measure the similarity of two line signatures, assume the segment pair consisting of segment i and the central segment in the rst line signature has a primary key ofp. The segments in the second line signature that may be matched with i can be directly read from the list associated with the p-th bin of its histogram. From experiments, we found the average number of such candidate segments is only 1:6. Only these segments are checked to see if one of them can really be matched withi (by checking if the dierences on all feature attributes are smaller than the corresponding thresholds). If there is one, we set a variablec i = 1; otherwise c i = 0. An approximate similarity of two line signatures is computed withS 0 LS = P i c i . Only if S 0 LS 3 (meaning at least 3 segments besides the central segment in the rst line signature may have corresponding segments in the second signature), Eq.5 is used to compute the accurate similarity S LS ; otherwise S LS is 0. In average, for each line signature in the rst image, only 2% of the line signatures in the second image have S 0 LS 3. Eq.5 is solved with an exhaustive search but the searching space is greatly reduced since the average number of candidate segment matches is small. 3.5 Wide-baseline Image Matching To match two images, for each line signature in an image, its top two most similar line signatures in the other image are found whose similarity values are S 1 and S 2 respectively. If S 1 > T 1 and S 1 S 2 > T 2 , this line signature and its most similar 38 line signature produce a putative correspondence, and the segment matches in their optimal mapping M in Eq.5 are putative line segment matches. T 1 = 25 and T 2 = 5 in our experiments. To remove outliers, we cannot directly use RANSAC based on epipolar geometry since the endpoints of line segments are inaccurate. In [12], putative segment matches are rst ltered with a topological lter based on the sidedness constraints between line segments. Among the remaining matches, coplanar segments are grouped using homographies. The intersection points of all pairs of segments within a group are computed and used as point correspondences based on which the epipolar geometry is estimated with RANSAC. Instead of using the topological lter in [12] which is computationally expensive, we provide a more ecient approach to remove most of the outliers. All the putative line signature correspondences are put into a list L and sorted by descending the value of S 1 S 2 . Due to the high distinctiveness of line signatures, the several candidates on the top ofL almost always contain correct matches if there is one inL. LetR as the set of line segment matches of the two images, and initialize R =. Two segment matches are regarded as consistent if the similarity of the two segment pairs based on them is not1. A segment match in the optimal mapping M of a line signature correspondence is called a reliable match if the similarity of the two segment pairs formed by its segments and the central segments is larger than 3:5. Starting from the top ofL, for a line signature correspondence, if all the segment matches in its optimal mapping M are consistent with all the existing matches in R, the reliable ones will be added intoR. To reduce the risk that the top one candidate on L is actually wrong, each of the top 5 candidates will be used as a tentative seed 39 to grow a set R. The one with the most number of matches will be output as the nal result. 
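The consistency-based outlier removal described above can be sketched as follows. The attribute segment_matches and the helpers consistent and reliable are assumed names for the quantities defined in the text, namely the segment matches of the optimal mapping M, the test that two segment matches form segment pairs whose similarity is not −∞, and the similarity-above-3.5 test with the central segments.

    def grow_consistent_matches(L, consistent, reliable, n_seeds=5):
        # L: putative line-signature correspondences sorted by descending S1 - S2.
        best = []
        for seed in L[:n_seeds]:                  # each top candidate tried as a seed
            R = [m for m in seed.segment_matches if reliable(m)]
            for corr in L:
                if corr is seed:
                    continue
                ms = corr.segment_matches
                if all(consistent(m, r) for m in ms for r in R):
                    R.extend(m for m in ms if reliable(m))
            if len(R) > len(best):
                best = R
        return best                               # the grown set with the most matches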
After the above consistency checking, most of the outliers will be removed. To further reduce the outliers and estimate the epipolar geometry, the approach in [12] based on grouping coplanar segments and RANSAC can be used. 3.6 Experimental Results Extensive experiments demonstrate the performance of line signatures under large viewpoint change, image scaling, illumination variation, and image blur. In this section, we present the comparison of Line Signature with existing local features, including the original SIFT, Harris-Ane, MSER, and several others studied in [49], under various conditions. 3.6.1 Experimental Settings The comparison is based on three dierent situations: (1)Images without rich texture. (2)Non-planar scenes with large viewpoint change, in which the underlying transfor- mation is not a homography. (3)Images related by a homography, as those discussed in [49]. Unlike [49] in which only situation (3) is considered, we think it is important to distinguish situation (2) and (3), because only in the former case the capability of a local feature in handling occlusion, parallax and non-uniform illumination vari- ation, such as shadows, can be tested. Situation (2) is more common in practice. 40 In addition, we will also give some results on the comparison of line signature with existing line matching approaches. The parameters of Line Signature used in all the experiments are the same as those given in prevous sections. These parameters are xed and there is no need to adjust them throughout the experiments. The number of Line Signatures in an image is limited to be 2000 by selecting those associated with the most salient line segments. The binary code of SIFT from [2] is used, and the executables of Harris-Ane and MSER are downloaded from [4]. As in [49], the descriptor of the original SIFT detector is used for the Harris-Ane and MSER detectors. In the original paper of Lowe [44], the threshold on the ratio of the smallest distance to the second smallest distance in selecting correct feature matches is 0.8, whereas it is 0.6 in the source code provided by the author [2]. Therefore, we set it 0.7 in the experiments in order to reach a balanced performance. 3.6.2 Scenes without Rich Texture Figure 3.4 shows three pairs of images without rich texture. In each row, the rst two images are the original images, and the last two show the line segment matches detected with Line Signature. The matched line segments are labeled at their middle points. Two segments with the same labels in the two images are a pair of corre- sponding segments. The correctness of the matches is examined visually by a human. If a corresponding segment has two thirds of overlapping with the manually labeled ground truth, it is deemed correct. Table 3.1 lists the matching results of SIFT, Harris-Ane, MSER, and Line Signature for each image pair. In each cell of the table, the rst number is the number of correct 41 matches, and the second one is the total number of detected matches. From this table, it is quite obvious that Line Signature has much better performance than the other local features on textureless scenes. a-b e-f i-j SIFT 5/11 0/2 1/7 Harris-Ane 0/0 0/2 0/3 MSER 0/1 1/2 0/1 Line Signature 19/20 27/29 32/35 Table 3.1: Matching results for the low-texture images in Figure 3.4. In each cell, the rst number is the number of correct matches and the second number is the total number of detected matches. 
3.6.3 Non-Planar Scenes with Large Viewpoint Change Most scenes in reality are not completely at so that their transformation under large viewpoint change is not a homography. To match these images, a local feature should be able to handle occlusion, parallax, and non-uniform illumination variation, such as shadows. The following experiments validate that Line Signature is much better than the other local features in matching this kind of scenes. The three image pairs shown in Figure 3.5 are downloaded from Internet or taken by ourselves. They are all non-planar scenes under large view point change. Figure 3.5(i) and Figure 3.5(i) also have signicant non-uniform illumination variation, e.g. the shadow and specularity change in the window areas. Table 3.2 compares the performance of dierent local features on these image pairs. As before, the numbers in each cell are number of correct matches and total number of detected matches. The correctness is judged manually by a human. 42 (a) (b) (c) (d) (e) (f) (g) (h) (i) (j) (k) (l) Figure 3.4: In each row, the rst two images are the original images. The last two display the segment matches detected with Line Signature. Any two segments with the same labels in the two images are a pair of corresponding segments. We also did extensive tests on the ZuBuD image database [6]. Figure 3.6 shows two examples. The images in each row are the same building taken from dierent viewpoints. The rst image is selected as the reference image and it is paired with all the other images in the same row to do matching. For the images in the top row, the matching results of dierent local features are listed in Table 3.3. Table 3.4 shows the results for the images in the second row. (The segment matches detected with Line Signature can be found in the corresponding images in the image folder of our supplementary material.) For dierent local features, the relationship of number of correct matches and match- ing precision with viewpoint change is plotted in Figure 3.7. Matching precision is 43 a-b e-f i-j SIFT 13/34 0/7 2/27 Harris-Ane 2/6 6/10 3/15 MSER 5/6 8/10 0/7 Line Signature 76/76 19/20 50/50 Table 3.2: Matching results for the wide-baseline non-planar scenes in Figure 3.5. In each cell, the rst number is the number of correct matches, and the second number is the total number of detected matches. the ratio of number of correct matches to total number of detected matches. Figure 3.7(a) and (b) are the charts of number of correct matches and matching precision, respectively, for the images in the top row of Figure 3.6. Figure 3.7(c) and (d) are the charts for the bottom row in Figure 3.6. We can see line signature performs better than the other local features under large viewpoint change. Figure 3.8 shows two handwritings with clutters. Our approach detects 18 matches (all correct) while all the other methods mentioned above can hardly nd any matches. This also demonstrates the advantage of making no hard assumption such as epipolar geometry or homographies. a-b a-c a-d a-e SIFT 352/372 31/85 16/64 2/30 Harris-Ane 125/156 21/45 18/42 5/18 MSER 72/79 26/37 23/39 13/24 Line Signature 225/230 79/106 84/96 20/34 Table 3.3: Matching results for the images in the rst row of Figure 3.6. In each cell, the rst number is the number of correct matches, and the second number is the total number of detected matches. 3.6.4 Images Related with Homography In [49], the authors compared dierent local features in matching images with the underlying transformation as a homography. 
These situations include planar scenes 44 (a) (b) (c) (d) (e) (f) (g) (h) (i) (j) (k) (l) Figure 3.5: In each row, the rst two images are the original images, and the last two display the segment matches detected with Line Signature. The matched segments are labeled at their middle points. Two segments with the same labels are a pair of corresponding segments. with viewpoint change, scale change together with in-plane rotation, image blur, JPEG compression, and illumination variation. Figure 3.9 shows the rst and the last images in some of the image sequences they tested. Each image sequence has 6 images, in which the rst one is selected as the reference image and matched with all the other 5 images. Table 3.5 gives the matching results of Line Signature for the 5 image pairs in each image sequence. As before, the numbers in each cell are the number of correct matches and the total number of detected matches. As we can see, Line Signature is good at matching structured scenes under viewpoint change(Grati), scale change(boat), image blur(bike), and illumination change(Leuven). 45 (a) (b) (c) (d) (e) (f) (g) (h) (i) (j) Figure 3.6: Two image sequences from the ZuBud image database. (a) (b) (c) (d) Figure 3.7: (a) and (b) are the charts of number of correct matches and matching precision, respectively, for the images in the rst row of Figure 3.6. (c) and (d) are the charts of number of correct matches and matching precision, respectively, for the images in the second row of Figure 3.6. 46 f-g f-h f-i f-j SIFT 260/316 110/221 39/99 16/47 Harris-Ane 63/119 33/68 17/39 7/29 MSER 92/99 48/78 23/37 3/13 Line Signature 120/123 110/115 68/81 37/49 Table 3.4: Matching results for the images in the rst row of Figure 3.7. In each cell, the rst number is the number of correct matches, and the second number is the total number of detected matches. (a) (b) Figure 3.8: The matching result of our method on two handwritings. It can also deal with moderate viewpoint change for textured scenes(wall), but it per- forms poorly for scale change of textured scened(bark). To compare Line Signature with other local features, we plotted the curves of re- peatability, number of correspondences, matching score, and number of matches in the same way as in [49]. Since the description of Line Signature is not based on pixels, we cannot use the area of overlapping between regions to judge if a correspondence is correct. Instead, if a Line Signature and its corresponding signature transformed with the homography have four pairs of segments with enough overlapping (the overlap- ping part takes up two thirds of the smallest length of the two segments), the match is deemed correct. Figure 3.10 shows the charts of the image sequences in Figure 3.9. From the top row to the bottom one, they are of the Grati, wall, boat, bark, bike and leuven sequences, respectively. In each row, from left to right the charts are 47 repeatability, number of correspondences, matching score, and number of matches, respectively. In our experiments, the number of Line Signatures in each image is limited to be 2000 (and hence xed for images with large number of edges) which is larger than the number of detected features of most of the other local features. Therefore, for Line Signature, the curves of ratios including those of repeatability (number of corre- spondences/number of features) and matching score (number of matches/number of features) are lower than those of the most other local features for many of the image sequences. 
However, its corresponding curves of absolute values including number of correspondences and number of matches are higher. In some charts, such as Figure 3.10(v), the values for Line Signature are over the range depicted in the original charts in [49]. In this case, we just put the curves of Line Signature on the top. From Figure 3.10, we can see Line Signature is better than most of the other local features in matching structured scenes. It also has the advantage in handling illu- mination variation. However, it is worse than most of the other local features in matching textured scenes, especially for textured scenes with scale change. 1-2 1-3 1-4 1-5 1-6 Grati 437/448 257/266 143/146 63/63 14/14 wall 85/88 79/80 64/67 24/24 0/0 boat 192/198 103/106 62/63 39/40 19/20 bark 0/0 0/0 0/0 0/0 0/0 bike 277/281 236/240 102/106 71/73 24/25 Leuven 364/374 335/339 296/300 259/263 243/251 Table 3.5: Matching results of Line Signature for dierent image sequences in Figure 3.9. In each cell, the rst number is the number of correct matches, and the second number is the total number of detected matches. 48 (a) (b) (c) (d) (e) (f) (g) (h) (i) (j) (k) (l) Figure 3.9: Image sequences tested in [49]. Each image sequence has 6 images, but only the rst and the last ones are shown here. From top to bottom, and left to right, the sequences are Grati, wall, boat, bark, bike and Leuven. 3.6.5 Comparison with Existing Line Matching Approaches Figure 3.11 shows 2 image pairs from the line matching papers [12] and [59]. The red segments with the same labels at their middle points are corresponding line segments detected with our approach (please zoom in to check the labels). The correctness of the matches is judged visually. For (c)-(d), the approach in [12] detects 21 matches (16 correct), and the number increases to 41 (35 correct) after matching propagation, while our method detects 39 matches (all correct) without matching propagation. For (e)-(f), the method in [59] detects 53 matches (77% are correct) with known epipolar geometry, while our method detects 153 matches (96% are correct) with unknown epipolar geometry. 49 (a) (b) (c) (d) (e) (f) (g) (h) (i) (j) (k) (l) (m) (n) (o) (p) (q) (r) (s) (t) (u) (v) (w) (x) Figure 3.10: From left to right, the four columns are the charts of repeatability, number of correspondences, matching score and number of correct matches. From top to bottom, the rows are for the Gratti, wall, boat, bark, bike and Leuven sequences in Figure 3.9 50 (a) (b) (c) (d) Figure 3.11: Two image pairs in the papers of [12] and [59]. The red segments with the same labels at their middle points in each image pair are corresponding line segments detected with our approach. 3.6.6 Conclusion From the above experiments, we can see Line Signature is much better than all the other local features in matching textureless images, and non-planar scenes with large viewpoint change. It also has the advantage in handling large illumination variation. In matching the images discussed in [49] where the underlying image distortion can be described by a homography, Line Signature performs better than most of the other local features on structured scenes. However, it is worse than most of the other local features in matching textured scenes. 51 3.7 Summary We have introduced a novel local feature called line signature. Unlike most current local features, it is constructed based on line segments so that it has high repeatability and distinctiveness. 
As validated in the experiments, it can handle large view point change, scale change and illumination variation. In addtion, It also has the advantage over all existing local featuers in matching textureless man-made scenes. As far as we can see, line signature is the rst curve-based local feature approach whose performance is comparable with state-of-the-art pixel-based local features such as SIFT. However, pixel-based approaches also have their advantages, such as handling natural scenes in which curve extraction is very dicult. Therefore, in practice, we can usually combine these two complementary feature types to achieve better performance. 52 Chapter 4 Automatic Registration of Aerial Images with Untextured Aerial LiDAR Data 4.1 Related Work 3D modeling of large-scale urban environments is an active research area in recent years. It has wide applications in map service, city planning, entertainment, and surveillance. The existing approaches can be divided into two kinds: imagery-based (using videos or images) [54] and LiDAR-based [29, 55]. Due to its robustness, accu- racy and eciency, airborne LiDAR technology draws increasing interest in creating large-scale 3D city models [45, 75]. Aerial LiDAR data typically does not provide texture information. To generate photo- realistic 3D models, oblique aerial images are needed for texture mapping purpose. 53 This requires accurate registration between aerial images and 3D LiDAR data. Usu- ally, the cameras are calibrated and an approximate camera pose for each aerial image is provided by GPS/INS systems. However, these camera parameters are often not accurate enough for precise texture mapping even with the most advanced and expen- sive hardware devices. In many cases, GPS/INS data is not available continuously for every frame and is distorted by signicant biases and drift. Similar to [22], in this work we consider the case where there may be a large transformation between the building boundaries in aerial images and the projection of 3D building outlines according to the GPS/INS data. Figure 4.1 shows an example, where the green con- tours represent the projected 3D outlines of building rooftops detected in the LiDAR data. PointA on the projected 3D outline and pointB on the 2D building boundary in the aerial image are an example of corresponding points. To rene the initial cam- era pose, feature correspondences between 2D aerial images and 3D LiDAR data are needed. For modeling large-scale environments, manual selection of such correspon- dences is expensive. Therefore, automatic aerial images to LiDAR data registration is a key component in generating photo-realistic 3D city models. There is a considerable amount of prior work on 2D image to 3D model registration. Stomas and Liu studied the registration of ground-level images with ground-level 3D LiDAR model [41, 66]. Their approach decouples camera rotation and translation in pose estimation. Camera rotation relative to the 3D model is computed based on at least two vanishing points. In [41], camera translation is then estimated with a hypothesis-and-test scheme in matching 2D rectangles in images and parallelepipeds in 3D models. Since rectangular parallelepipeds are not always available, the authors improved their approach in [66] where the estimation of camera translation is based on line segment matching. 
Although this approach works well for ground-level data, 54 (a) Figure 4.1: The green contours are the projected outlines of 3D building rooftops in the LiDAR data according to the GPS/INS data. Point A on the projected outlines and point B in the aerial image are an example of corresponding points. it has diculties in handling aerial images. In many cases, it is hard to detect vertical vanishing points from aerial images in which building facades are barely visible. It is also quite often that there are no dominant clusters of horizontal parallel lines in aerial images, especially for maintain areas where the distribution of buildings is irregular. These lead to inaccurate estimation of camera rotation, and further cause the failure of translation computation. The same strategy of decoupling the estimation of camera rotation and translation is also exploited by some researchers in matching ground-level images with 3D models generated from aerial images [39, 70]. Their approaches also rely on vanishing points, and hence are not suitable for complicated aerial images. Zhao et al. worked on aligning continuous videos onto 3D point clouds [74]. Although the approach avoids the problem of plane detection from 3D point clouds, it requires structure from motion techniques for video-based 3D reconstruction which are computationally expensive 55 and have drawbacks in accuracy and robustness. Frueh et al. proposed an approach [26] for texture mapping 3D models with oblique aerial images. Their registration method is to exhaustively search a 7-dimensional space of camera parameters so that the projected 3D model lines can be as close as possible to the 2D lines in the aerial images, which is computationally expensive. The existing work most similar to ours is presented in [22] in which oblique aerial images are registered with untextured aerial LiDAR models. The approach utilizes the vertical vanishing point in an aerial image to estimate the pitch and roll angles of its camera rotation. The camera position and heading angle are read from GPS and compass. To rene these initial camera parameters, the features called 2D orthogonal corners (2DOCs) are extracted from both the aerial image and the digital surface model (DSM) of the LiDAR data. Each 2DOC feature corresponds to an orthogonal corner on building outlines. The 2DOCs on the DSM are projected into the aerial image according to the initial camera pose. Putative matches between the projected 2DOCs and those in the aerial image are generated by thresholding on their spatial proximity and similarity measure. The outliers are then removed with a method combining Hough transform and RANSAC. As reported in [22], the correct pose recovery rate of this approach is 91% for downtown areas but only about 50% for campus or residential areas. The main factors causing incorrect registration are: (1)failure in the extraction of 2DOC features; (2)too many outliers that cannot be handled by Hough transform or RANSAC. We present an ecient and more robust approach for aerial image to aerial LiDAR data registration. It does not require vanishing point detection. The pitch, roll and yaw angles given by GPS/INS devices are directly used as initial camera rotation. Our main contributions are: (1)A line detection algorithm with special strategies to ensure 56 that the line segments detected in the aerial images and those in the LiDAR data have as many matches as possible. (2)A novel feature called 3CS (3 connected segments) is introduced. 
Each 3CS has 3 segments connected into a chain. Compared to the 2DOC features in [22], 3CS features are more distinctive, and hence the number of outliers in putative feature matches is greatly reduced. (3)According to the characteristics of our problem, a two-level RANSAC algorithm is proposed in which putative feature matches are divided into multiple groups. In the rst level processing, a separate RANSAC routine is applied for each group. The output is then input to a global RANSAC to remove the remaining outliers. Compared to the traditional RANSAC, the two-level scheme is more ecient, and more robust in the situations where outliers are much more than inliers. Please note we cannot directly use the line signatures described in Chapter 2 for solving this registration problem. This is because aerial images and 3D LiDAR data are two very dierent modalities of data sources. Their dierence is not only due to the viewpoint change or illumination change. For example, there are many salient line segments in the aerial images that are on roads, vegetations, and building facades, but they are not on the rooftop outlines detected from the LiDAR data. Therefore, using the line clustering approach presented in Chapter 2, the detection of line signatures has very low repeatability in this case. The line signatures detected in the aerial images will inevitably include many line segments that are not of the rooftops, but the line signatures extracted from the projection of the LiDAR data only contains line segments on the rooftops. Therefore, to adjust the approach in Chapter 2, we need to change the way of line clustering. Instead of depending on spatial distance, the line clustering in 3CS features is based on connectivity in order to ensure that many 3CS features only contain line 57 segments on the building rooftops. In addition, the number of line segments in a 3CS feature is limited to 3 instead of 6 as in a line signature. This is also for the purpose of increasing the repeatability of feature detection. The rest of the paper is organized as follows: The line segment detection algorithm is presented in Section 2. Section 3 describes the detection and description of 3CS features. In Section 4, the two-level RANSAC algorithm is introduced. Some experi- mental results are given in Section 5, and the paper is concluded in Section 6. 4.2 Line Segment Detection Our approach requires line segment detection in aerial images and LiDAR data. In most existing 2D image to 3D model registration approaches [22, 41, 26], the following line detection algorithm is used: Edge pixels are detected with Canny detector and then linked into curves. The curves are divided into straight line segments based on thresholding on line tting error. Although this method is ecient, it often misses the detection of many useful line segments on building outlines. To ensure the robustness of automatic registration, the strategy in our line detector is to detect as many as possible the line segments on building outlines, even though this may increase the number of spurious segments (such as those on trees and roads). With enough useful segments, the following feature matching and RANSAC process are able to distinguish them from the spurious ones. Based on this, our line detection algorithm has the following 5 steps. Step 1: Edge pixels are detected with the approach in [71]. Compared to Canny detector, this approach is less sensitive to the selection of thresholds. 
The default 58 (a) (b) (c) (d) (e) Figure 4.2: (a):Multi-scale polygonization. (b): Link the segments into AD and EJ. (c): Merge the parallel segments intoAB. (d): SegmentAB is divided intoAG,GH, HB, AH and GB. (e): The description of a 3CS feature. parameters given in [71] are used for all the aerial images in our experiments. The edge pixels are then linked into curves based on 8-neighbor connection. Step 2: The curves are divided into line segments with an approach similar to [9] in which a split-and-merge algorithm is used to select break points from a set of points with local extreme curvature. There are two parameters: the scale ! of the Gaussian lter in curvature computation and the threshold T on line tting errors. Since there is no single scale or threshold suitable for all curves in all images, multiple scales and thresholds are used. The line segments obtained under dierent settings are all kept for the following processing. In our experiments ! =f5; 11; 21; 31g, and the thresholds on line tting errors are decided based on the length of the segments. Assume the length of a curve segment is L, then T =f5; 10;:::;max(0:1L; 20)g. As shown in Figure 4.2(a), the curve can be divided into 4 line segments AB, BC, CD and DE, while the curve will be approximated with only one segment AE with a larger threshold T . The reason to keep all these segments is that there are two possible factors causing the zigzag patterns on building outlines: building structures and noise (such as the occlusion by trees). In the former case, the zigzag details 59 are useful for the registration purpose, whereas in the latter one, only the longest segments (such as AE) can appear in both aerial images and LiDAR data. Step 3: It is often that a continuous line segment is broken into several fragments in an aerial image due to occlusion or failure in edge detection. Therefore, a linking process is needed to recover the original segment. For each segment, we search the neighborhood around each of its endpoints to nd the segments that can be linked with it. As demonstrated in Figure 4.2(b), two segments AB and CD can be linked if the following conditions are satised: 1)The dierenced a of their orientation is less than a threshold (10 degrees in our experiments). 2)The distance d h between B and C is smaller than min(jABj;jCDj). 3)The vertical distance d v from point C to the underlying line of AB is smaller than a threshold min(d v ; 10). In the case that a line segment can be linked with multiple segments, its most linkable segment is the one with the smallest value of w 1 d a +w 2 d v +w 3 d h (w 1 = 1;w 2 = 0:1;w3 = 2 in our experiments). Two segments are actually linked into a new segment only when they are the most linkable segments of each other. New segments can be further linked with other segments. As an example, for the segments in Figure 4.2(b), AB andCD will be linked into AD; EF , GH, and IJ will be linked into AJ. Note that the new segments and the original segments before the linking are all kept. Step 4: As depicted in Figure 4.2(c), it is a common situation in urban environments that multiple parallel line segments are closely located in a narrow area in aerial images. Usually, these segments correspond to a single line segment in LiDAR data, so it is necessary to merge them. The merging process is similar to the linking process in Step 3. Two segments can be merged if they satisfy the following conditions: 1)They are almost parallel. 2)They have overlapping along their line directions. 
3)Their 60 vertical distance is smaller than a threshold. As an example, a new segment AB is generated after merging in Figure 4.2(c). Step 5: To construct 3CS features as described in Section 4, line segments are split based on their intersection relationships. As shown in Figure 4.2(d), segments AB and CD intersect at point G. If the gapjGDj is smaller than a threshold T g = 0:3 min(jABj;jCDj), AB will be split by CD into two segments AG and GB. The newly generated segments may further be split by other segments. Therefore, in Figure 4.2(d), AB will be divided into 5 segments: AG, GH, HB, AH and GB. The original segment AB is still kept for the following processing. In addition, the segments with their length shorter than a thresholdT L (15 pixels in our experiments) will be removed. Figure 4.3(a) shows an example of the line detection result, where the detected line segments are displayed in green color. To extract line segments in 3D LiDAR data, the approach in [75] is used to detect planar facets on building rooftops. Unlike the approach based on DSM in [22] where only height dierences are examined for plane detection, in [75] the orientation of normals is also considered so the roof structures composed of slopes can be extracted. These detailed roof structures are very useful in registration, especially for residential areas where the exterior contours of rooftops are often corrupted by trees. The contours of the planar facets are then projected into aerial images according to the initial camera parameters given by GPS/INS systems (visibility will be handled with the Z-buer technique). As an example, Figure 4.3(b) is the projection of the rooftop outlines extracted from the LiDAR data of the area shown in Figure 4.3(a). These contours will then be divided into line segments with the same approach as described above in Step 2-5. 61 In our experiments, the resolution of the aerial images is 4992 3328 pixels. The average number of line segments detected in an aerial image is 29,000. Although many of them are spurious, most of the meaningful segments on building outlines are correctly extracted (judged visually by a human), which is important for the following registration. (a) (b) Figure 4.3: (a): The result of line detection. Several 3CS features are shown in red color (see Section 4). (b): The projection of the 3D rooftop outlines extracted from the LiDAR data. 62 4.3 Detection and Description of 3CS Features The 2DOC features used in [22] are not very distinctive. Each 2DOC is described by only two angles: the orientation of its two lines. This leads to a large number of outliers in the putative 2DOC matches. We introduce a more distinctive and still repeatable feature called 3CS (3 connected segments) which is similar to the kAS features applied in object detection in [24]. 4.3.1 The Detection of 3CS Features Each 3CS feature consists of 3 segments that are connected one after another into a chain. As an example, four 3CS features are shown in Figure 4.3(a) whose line segments are displayed in red color. Since the endpoints of the detected segments are often inaccurate, two segmentsAB andCD are regarded as connected if the following conditions are satised. As demonstrated in Figure 4.2(e),P 1 is the intersection ofAB andCD. ! jAP 1 j is the signed distance fromA toP 1 in the direction of ! BA ( ! jAP 1 j< 0 ifP 1 is on the same side ofB relative toA). ! jCP 1 j is the signed distance fromC toP 1 in the direction of ! DC. Then, if 0 ! jAP 1 j 0:3jABj and 0 ! 
|CP_1| ≤ 0.3|CD|, AB and CD are a pair of connected segments. In the case |AP_1| < 0, AB will be divided into two sub-segments in Step 5 of Section 4.2, and these sub-segments will be checked to see whether they are connected with CD.
To detect 3CS features in an aerial image, for each line segment, denoted AB, we search the neighborhood of its two endpoints to find the line segments connected with it. Assume EF is one of the segments connected with AB at point A, and GH is connected with AB at point B. The three segments AB, EF and GH form a 3CS feature. AB is called the central segment, and the middle point of AB is called the center of the 3CS feature. With the same approach, 3CS features are also extracted from the 3D LiDAR data based on the line segments detected on the projected outlines of building rooftops.
In practice, to reduce the number of 3CS features detected in an aerial image, a 3CS is removed if: 1) its central segment is almost parallel to the other two segments; or 2) the length ratio between any two of its segments is larger than a threshold (7 in our experiments). Such a 3CS feature is less likely to be composed of segments on building outlines. In our experiments, the average number of 3CS features detected in an aerial image (4992 x 3328 pixels) is about 150,000. Note that unlike some object recognition systems in which the number of line segments in a segment group can be arbitrarily large, the number of segments in a 3CS feature is limited to 3. Increasing this number would decrease the repeatability of the features and greatly increase their number, and hence the computation in the following registration process.
4.3.2 The Description of 3CS Features
For a 3CS feature composed of segments AB, CD and EF, with AB as the central segment, it can be described with 6 attributes (l, l_1, l_2, θ, θ_1, θ_2), as demonstrated in Figure 4.2(e). P_1 and P_2 are the intersection points of AB with CD and of AB with EF, respectively. Then l = |P_1P_2|, l_1 = |DP_1| / |P_1P_2|, and l_2 = |FP_2| / |P_1P_2|. θ is the angle from the vector AB to the X axis, θ_1 is the angle from CD to AB, and θ_2 is the angle from EF to AB.
To measure the dissimilarity of two 3CS features with descriptions (l, l_1, l_2, θ, θ_1, θ_2) and (l', l'_1, l'_2, θ', θ'_1, θ'_2), the following equation is applied:
    D = Σ_{k=1..6} d_k   if d_k < 1 for all k = 1, ..., 6,
      = ∞                otherwise,        (4.1)
where
    d_1 = (max(l, l') / min(l, l') − 1) / T_1,
    d_2 = (max(l_1, l'_1) / min(l_1, l'_1) − 1) / T_2,
    d_3 = (max(l_2, l'_2) / min(l_2, l'_2) − 1) / T_2,
    d_4 = |θ − θ'| / T_3,
    d_5 = |θ_1 − θ'_1| / T_4,
    d_6 = |θ_2 − θ'_2| / T_4.        (4.2)
D is the dissimilarity value of the two 3CS features; a smaller D means the two 3CSs are more similar. In Eq. 4.2, T_1 to T_4 are thresholds, and d_1 to d_6 are the normalized dissimilarities of the two 3CS features with respect to each of the 6 attributes. If the difference on any attribute is larger than the corresponding threshold (that is, the condition d_k < 1 in Eq. 4.1 is violated), the two 3CS features are not likely to be matched, and their dissimilarity D is set to ∞. The thresholds T_1 to T_4 determine the size of the search space when looking for putative 3CS matches. In our experiments, T_1 to T_4 are 1, 0.5, 45 degrees and 30 degrees, respectively. Note that T_2 and T_4 are smaller than T_1 and T_3, respectively, because the variations of the relative length ratio and angle between neighboring line segments are usually smaller than those of the absolute length and orientation.
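A minimal sketch of Eqs. 4.1 and 4.2 is given below; the tuple layout of the descriptor is an assumption of the sketch, angles are taken in degrees, and wrap-around of the orientation difference is ignored for brevity.

    import math

    TH1, TH2, TH3, TH4 = 1.0, 0.5, 45.0, 30.0   # thresholds from the text (angles in degrees)

    def tcs_dissimilarity(f, g):
        # Eq. 4.1/4.2 for two 3CS descriptors (l, l1, l2, theta, theta1, theta2).
        l, l1, l2, t, t1, t2 = f
        m, m1, m2, u, u1, u2 = g
        d = [(max(l,  m)  / min(l,  m)  - 1.0) / TH1,
             (max(l1, m1) / min(l1, m1) - 1.0) / TH2,
             (max(l2, m2) / min(l2, m2) - 1.0) / TH2,
             abs(t  - u)  / TH3,
             abs(t1 - u1) / TH4,
             abs(t2 - u2) / TH4]
        return sum(d) if all(dk < 1.0 for dk in d) else math.inf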
65 For each projected 3CS featuref of the LiDAR data, its putative corresponding 3CS features in an aerial image are found in a circular neighborhood of its center, whose dissimilarity values with f are smaller than1. If multiple such 3CS features are found, only the best two with the smallest dissimilarity values are kept. Since the number of 3CS features in an aerial image is usually very large, to improve the speed in searching for putative 3CS matches, an index structure is created by dividing the aerial image into a grid and splitting the range of each attribute of the 3CS descriptor into bins. For each projected 3CS feature of the LiDAR data, the buckets that possibly contain its putative corresponding 3CSs in the aerial image can be computed. 4.4 Two-level RANSAC Algorithm In most cases, we can make the same assumption as in [22] that the error in the camera location given by a GPS device is very small relative to the distance from the camera to the buildings. Therefore the transformation between the projected 3CSs of the LiDAR data and the 3CSs in the aerial image can be considered to be purely caused by camera rotation, and hence is a homography. Therefore, RANSAC can be used to remove outliers from putative feature matches [22, 25]. However, due to the large size of the aerial images and the huge number of putative matches, direct applying the traditional RANSAC approach has the following problems. Problem (a): When outliers are much more than inliers, some outliers can be t with a homography accidentally. If the number of these outliers is larger than that of the real inliers, it will cause completely wrong registration. Problem 66 (b): Due to image radial distortion or because the error of the initial camera location cannot be completely ignored, sometimes the underlying transformation is not a strict homography. This makes it hard to choose a good threshold on homography tting errors for selecting inliers. If the threshold is small, many of real inliers will be removed. If it is large, many of the outliers will be included as inliers and harm the accuracy of the camera pose estimation. Problem (c): Although two 3CS matches provide enough constraints to decide a homography (as discussed later), the homography estimated with two 3CS matches is usually not accurate enough to describe the transformation of the whole image since the constraints are limited in a small part of the image. Therefore, more than two 3CS matches should be sampled at each RANSAC iteration, which will greatly increase the required number of iterations and make the computation expensive. We present a two-level RANSAC algorithm. Instead of processing all the putative 3CS matches as a whole, they are divided into groups so that the 3CS features in each group are contained in a local area. During the rst-level processing, a separate RANSAC routine is run for each group, and the selected inliers are called qualied 3CS matches which are input to the second-level processing where a global RANSAC is applied. Rather than selecting individual 3CS matches, the global RANSAC operates based on groups and decides which groups are inliers. The qualied 3CS matches of these groups are output as the nal matches. The reasons that this two-level scheme can handle the aforesaid problems of the traditional RANSAC are analyzed below. For Problem (a), our approach not only considers their number but also their spatial distribution in evaluating if a group of feature matches are inliers. 
This is illustrated in Figure 4.4, where the feature matches depicted in Figure 4.4(a) (only the projected 3CS features of the LiDAR data are shown) are more likely to be inliers than those in Figure 4.4(b), since they are close to each other, even though the numbers of matches are the same in the two cases. The reason is that a small area contains fewer features than a large area, so the probability that the same number of features satisfying a homography arises from random noise is smaller when they are clustered in a small area than when they are distributed sparsely over the image. Another reason is that buildings usually form blocks or clusters, which is important knowledge for distinguishing inliers from outliers in areas with heavy vegetation.
Figure 4.4: (a) Closely located feature matches. (b) Sparsely distributed feature matches.
In our approach, the above strategy is implemented in the second-level RANSAC by giving different groups different weights. At each of its iterations, assume there are p groups selected as inliers, and the number of qualified 3CS matches in the k-th group is n_k. Then the score of the iteration is calculated with:
    G = Σ_{k=1..p} n_k √n_k,        (4.3)
where √n_k serves as the weight for group k. Because of this weight, a situation in which the qualified 3CS matches concentrate in a few groups receives a higher score than one in which they are uniformly spread among all of the groups, even if the total numbers of these 3CS matches are the same in both situations.
The two-level scheme also has an advantage with regard to Problem (b). Although the transformation over the whole image may not be a strict homography, so that a larger threshold T_2 on the homography fitting error should be used in the global RANSAC, the transformation within a local area can be described more accurately by a homography. Thus, the threshold T_1 used in the first-level RANSAC for each group can be set smaller than T_2. This double-threshold strategy relaxes the assumption of a strict homography over the whole image while removing the 3CS matches that are not compatible with their neighbors.
Finally, the two-level RANSAC algorithm is computationally efficient. Only two 3CS matches are sampled at each iteration of the first-level RANSAC. This is because the local area of each 3CS group is much smaller than the whole image, and the homography computed from two 3CS matches is usually accurate enough to describe the transformation inside the local area. The second-level RANSAC operates on groups. At each iteration, three groups are sampled and all of their qualified 3CS matches are used to estimate a homography. Since the number of groups is much smaller than the total number of putative 3CS matches, the required number of iterations in the second-level RANSAC is not large.
The details of the two-level RANSAC algorithm are described as follows (a code sketch of the two levels is given at the end of this section):
Step 1: The putative 3CS matches are divided into groups according to spatial proximity. In doing this, an aerial image is divided into windows. The window size is set to s x s pixels (s = 1000 in our experiments). Starting from the top left corner of the image, a window is shifted from left to right and top to bottom. The step size is s/4 so that neighboring windows overlap. For each window, if the number of projected line segments of the LiDAR data inside it is larger than m, it is split into four sub-windows. A sub-window is further split until the number of projected segments inside it is smaller than m. The projected 3CS features inside
The projected 3CS features inside 69 each window or sub-window form a group. Each projected 3CS of the LiDAR data may have one or two putative corresponding 3CSs in the aerial image (see Section 4). Therefore, each window or sub-window denes a group of putative 3CS matches. The number m controls the size of a group. In our experiments, the average number of projected line segments of the LiDAR data in an aerial image is 3500. We nd m = 50 is a balanced choice, so that each image has about 300 groups. Note that dierent groups may have overlapping. Step 2: Assume that a 3CS feature has segments AB, CD and EF , with AB its central segment, and the segments in its corresponding 3CS areA 0 B 0 ,C 0 D 0 andE 0 F 0 . Since the endpoints of the detected line segments are often inaccurate, they cannot be used to estimate the homography. However, the two intersection pointsP 1 andP 2 as shown in Figure 4.2(e) are more accurate, and the correspondence of each of them provides two linear constraints on the homography. In addition, the orientation of the line segments is also reliable. The fact that pointD (orF ) after the homography transformation is on line C 0 D 0 (or E 0 F 0 ) gives a linear constraint. Therefore, each 3CS feature match provides 6 linear constraints, and two matches are enough to completely decide the homography [32]. For each group of putative 3CS matches, a separate RANSAC routine is applied to remove its outliers. At each iteration, two 3CS matches are uniformly sampled, from which a homography is then estimated. The other putative matches are regarded as inliers if their homography tting errors are smaller than a thresholdT 1 (The homog- raphy tting error of the line constraint is the vertical distance from the transformed point to the line). In our experiments, T 1 = 10 pixels. The inliers selected from each group are called qualied 3CS matches. 70 Step 3: A global RANSAC is applied based on groups. At each of its iterations, three groups of qualied 3CS matches are uniformly sampled, from which a homography is estimated. If any of the homography tting errors of these 3CS matches is larger than a thresholdT 2 , a new sample of three groups will be generated. Otherwise, each of the other groups will be examined to see if the homography tting errors of its qualied 3CS matches are all smaller than T 2 . If so, it will be selected as an inlier. The score at each iteration is computed with Eq.3. When the RANSAC terminates, the groups selected as inliers at the iteration with the highest score are returned together with the homography computed based on the qualied 3CS matches in these groups. In our experiments, the threshold T 2 = 25 pixels, which is larger than T 1 = 10 used in Step 2. The reason has been given previously in this section. Step 4: For groups identied as inliers in Step 3, their qualied 3CS matches are kept in the nal list of correct 3CS matches. For the other groups, all of their putative 3CS matches will be checked again against the homography returned at Step 3, and those with the homography tting error smaller thanT 1 will also be kept. The reason that the lower threshold T 1 is used instead of T 2 is that these matches are isolated and have less condence to be correct unless they can satisfy the homography more strictly. Based on the corresponding intersection points in the obtained 3CS matches, camera parameters can be estimated and rened with the approach in [32]. The approach in [26] is then used for texture mapping with multiple aerial images. 
71 4.5 Experimental Results We did extensive experiments to examine the proposed system. Two datasets are tested. The rst one consists of 306 oblique aerial images covering a 1:5 1:4 km 2 area in the city of Oakland. The second one has 1984 oblique aerial images covering a 2:7 2:6 km 2 area in the city of Atlanta. As in [22], the urban environments in these two datasets can be classied into three types: downtown, campus and residential areas. In downtown areas, buildings are tall and dense while trees are sparse. In regions such as campus, large buildings are sparsely distributed in dense trees. In residential areas, houses are usually short and small, and located among dense vegetation. The correctness of the automatic registration is evaluated by checking the average distance between the corresponding points manually labeled on the aerial images and the building outlines in the LiDAR data projected according to the rened camera parameters. It is also validated by visually examining the quality of the texture mapping. For all of the 306 aerial images in the rst dataset, our system accurately recovered their camera parameters. For the second dataset, 1951 aerial images are correctly registered with the 3D LiDAR data while the registration for the other 33 images is wrong. Therefore, the overall correct pose recovery rate of our approach is 98.5%. The images that cannot be correclty registered are mainly from residential areas where most of the buildings are seriously occluded by trees. In these situations, the detection of planar facets in the 3D LiDAR data and the detection of line segments on building outlines in aerial images become very dicult. 72 Figure 4.5 shows an example of the alignment between the aerial image and the projection of 3D building outlines (the green contours) with the initial camera pose (Figure 4.5(a)) and with the rened camera pose after the automatic registration (Figure 4.5(b)). Figure 4.6(a)-(d) are several screen shots of the textured 3D models. To make the 3D models look cleaner, most of the trees are removed in the LiDAR data with the approach in [75]. To prove that 3CS features are more distinctive than the 2DOC features in [22], we computed the average percentage of inliers in the putative feature matches. It is 19% for 3CS features, higher than that of 2DOC features which is 4% according to [22]. In average, the approach takes about 1 minute on a PC with a 3 GHz CPU to register a 4992 3328 aerial image. Most computation is spent on the line detection. The two-level RANSAC only takes several seconds. 4.6 Summary We have presented an approach for automatic registration of aerial images with un- textured aerial LiDAR data. Several strategies are taken to improve the robustness of line segment detection. A novel feature called 3CS that is more distinctive than a single line segment or the 2DOC feature in [22] is used, which greatly increases the percentage of inliers in the putative feature matches. Finally, a two-level RANSAC algorithm is proposed that is more robust and ecient than the traditional RANSAC approach in our situations where the number of putative feature matches is very large while the percentage of inliers is low, and the underlying transformation is not a strict homography. 73 Compared to existing approaches, our system is more robust. Its overall correct pose recovery rate is above 98%. To further improve the approach, more ecient and robust algorithms of line detection, and planar facet detection in the LiDAR data should be developed. 
Building detection and segmentation with high-level knowledge are also helpful. 74 (a) (b) Figure 4.5: (a):The alignment between the aerial image and 3D outline of building rooftops projected with the initial camera pose. (b): The alignment with the rened camera pose. 75 (a) (b) (c) (d) Figure 4.6: Screen shots of textured 3D models. 76 Chapter 5 Automatic Registration between Ground-Level Panoramas and Orthorectied Aerial Images for Building Modeling 5.1 Related Work There are many existing methods and systems of architectural modeling. Some are based on remote sensing techniques, such as stereoscopic aerial images or airborne LIDAR [36]. Some depend on ground-level imagery including images [20, 17], videos [8, 31, 53] and laser scans [27, 65]. Aerial imagery can model building roof structures that are invisible from ground while ground-level imagery can provide detailed facade texture and 3D structures. Furthermore, remote sensing aerial imagery can model 77 a large area without signicant error accumulation whereas in the techniques based on ground-level imagery GPS/INS units are usually necessary [8, 18] to overcome drifting. Therefore, aerial imagery and ground-level imagery are two complementary data sources. In order to create large-scale, complete urban models with both roof and facade structure and photo-realistic texture, it is necessary to integrate them and their automatic registration becomes an important topic. In literature, there are researchers trying to register ground-level images with existing 3D models created from stereo aerial images [39]. In [27], the authors rened their results by matching aerial photos with ground-level laser scans. In [28], the 3D models constructed from ground-level laser scans and the DSM from airborne LIDAR are merged. In our work, the 3D model is created by combining ground-level images and orthorectied aerial images that can be downloaded from popular websites 1 . These aerial images can have resolutions as high as one-foot per pixel. Figure 5.1(a) is an example. Compared to the approaches using stereo aerial images or LIDAR, our system has a lower cost and can be used by people with only a digital camera and the access to Internet. However, since there is no 3D model beforehand, the automatic registration between ground-level views and aerial views is more dicult in our case than in the previous cases. The most similar work to ours is [56] in which the 3D model is created by combining ground-level images and a detailed map. However, in [56] the correspondences be- tween the ground-level images and the map are selected manually. To model a large area, this makes the approach very time-consuming. In our work, we focus on au- tomating the correspondence detection. The users only need to correct the automatic output when some errors occur. This greatly reduces the user interaction. 1 Such as http://terraserver.microsoft.com and http://earth.google.com 78 In our system, the users draw building footprints in the orthorectied aerial image. Multiple ground-level images are taken at the same viewpoint and are stitched into a 360-degree panorama to obtain a wide eld of view because in many cases the number of constraints contained in a single image is not enough to decide the camera pose. More importantly, panorama stitching gives an accurate camera calibration [61] and helps to combine the constraints from all directions to obtain a more accurate camera pose and hence a more accurate 3D model. 
The automatic registration starts from line segment extraction in the ground-level images. The corresponding line segments on the building footprints in the aerial image are then detected through a voting process.

The rest of the chapter is organized as follows. Section 5.2 describes the user interaction in aerial images and briefly introduces panorama stitching and line detection. The automatic correspondence detection is presented in Section 5.3, in which Subsection 5.3.1 is an overview, Subsection 5.3.2 gives the details of the voting process, Subsection 5.3.3 explains some additional constraints and Subsection 5.3.4 describes additional user interaction. 3D model generation and the final optimization are presented in Section 5.4. Some experimental results are shown in Section 5.5 and the chapter is concluded in Section 5.6.

5.2 Preprocessing

The user interaction in an aerial image can be illustrated with the example in Figure 5.1. The green line segments in Figure 5.1(b) are selected by the user. The polygons formed by these line segments will be the facets composing the roofs in the 3D model. At the beginning, the corners of these polygons are given the same default height from the ground. The user changes this by selecting the corners that are really at the same height (judged from the user's experience) and dragging them up or down to indicate that they have a relatively higher or lower height. This process is directly displayed in 3D with an interface similar to [5]. The 3D model resulting from this interaction is shown in Figure 5.1(c). Only the relative spatial relationships between the corners are correct in this model; their actual height values still need to be estimated.

Figure 5.1: (a) An aerial image. (b) The green line segments are selected by the user. (c) The 3D model resulting from this step.

To obtain a wide field of view, multiple images (about 16) are taken at each viewpoint by rotating the camera on a tripod (or even held in hand [60]), and stitched into a 360-degree panorama. The approach in [15] is used to stitch the images. In this process, an accurate estimate of the camera calibration and of the rotations between the images is obtained [61]. Line segments are detected in each ground-level image with the approach presented in Chapter 4. The final representation of a line segment in a panorama is a pair of endpoints, each of which is associated with a ground-level image.

5.3 Correspondence Detection and Camera Pose Estimation

5.3.1 Overview

In [56], the correspondences between the corners in the ground-level images and those on the building footprints are given by the user. The camera pose of each ground-level image is then computed by solving a linear equation. In our approach, both the correspondences and the camera pose are obtained through a voting process.

For a panorama, one of its component images is selected as the reference image. Once the camera pose of this image is known, the camera poses of all the other images in the panorama can be determined from their rotations relative to it. In the rest of the chapter, the camera pose (or camera frame) of a panorama refers to the camera pose (or camera frame) of its reference image.

We set up the world frame in the following way. Its X and Y axes are parallel to the X and Y axes of the aerial image respectively. Its origin is right above the origin of the aerial image, with its height from the ground plane equal to the height of the camera center. This simplifies the following computation.
The approach in [10] is used to detect the vanishing point of the parallel 3D lines that are perpendicular to the ground plane. This process is robust since in general there are abundant such 3D lines visible in a panorama. Once this vanishing point is obtained, the direction of the world Z axis in the camera frame can be computed, since it equals the direction of the ray back-projected from the vanishing point [31]. Suppose this direction is r = (r1, r2, r3). The rotation R from the camera frame to the world frame can be decomposed into two parts. The first one rectifies r into its canonical form (0, 0, 1). This rotation can be:

R0 = (1/λ) · [ r1·r3   r2·r3   −(r1² + r2²) ]
             [ −r2     r1      0            ]
             [ λ·r1    λ·r2    λ·r3         ],   where λ = √(r1² + r2²)   (5.1)

(if λ = 0, the identity matrix can be chosen as R0). The second part is a rotation around the vector (0, 0, 1) by a certain angle α, so R can be expressed as:

R(α) = [ cos α   −sin α   0 ]
       [ sin α    cos α   0 ] · R0.   (5.2)
       [ 0        0       1 ]

Hence, the rotation from the camera frame to the world frame has only one degree of freedom left, the angle α. Denote the location of the camera center in the aerial image as (o_x, o_y). The camera pose of the panorama relative to the world frame is then fully determined by Θ = (α, o_x, o_y).

In the next subsection, we show that given a correspondence between a line segment in the panorama and one on the building footprints, the possible Θ can be computed analytically. Therefore, the most likely camera pose can be obtained through a voting process in which all possible correspondences vote in the camera pose space.

To reduce the search range, the line segments on the building footprints that may be visible in the panorama need to be determined. For this purpose, the user selects one point in the aerial image as the approximate camera location. The possible camera location is limited to a circular region Ω centered at this point (its radius is 100 pixels in our experiments). By tracing the 2D rays starting from the selected point in the aerial image, the buildings that are visible in the panorama can be determined. Those that are too far from the center of Ω are not considered, since their projections in the panorama would be too small (the threshold is determined according to the camera focal length). Next, the visibility of each line segment on the footprints of the visible buildings is determined for each location (pixel) in Ω.

These line segments are divided into two kinds. The first, hereinafter called exterior segments, are those that form the building outlines (like e1, e2 and e3 in Figure 5.1(b)). The second are those inside the building outlines (like i1, i2 and i3 in Figure 5.1(b)) and are called interior segments. Generally, the interior segments are higher than the exterior segments in 3D space. An endpoint of an exterior segment is visible from a location if the line segment connecting them is not intersected by any line segment on the building footprints. An exterior segment is fully visible from this location if both of its endpoints are visible; if only one is visible, it is called partially visible. The visibility of an interior segment is computed in the same way, except that the exterior segments of the same building are not used in the intersection test. The segments that can be visible from a location inside Ω are stored in a list together with their visibility type: fully or partially visible.
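The per-location visibility test just described reduces to 2D segment intersection checks on the footprint. A minimal sketch is given below; it assumes footprint segments are given as pairs of (x, y) corners in aerial-image coordinates, and the function names are illustrative rather than taken from the actual implementation.

```python
import numpy as np

def _ccw(a, b, c):
    # Signed area of the triangle (a, b, c); sign tells which side c lies on.
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def segments_intersect(p1, p2, q1, q2):
    """Proper 2D segment intersection test (degenerate collinear overlaps ignored)."""
    d1, d2 = _ccw(q1, q2, p1), _ccw(q1, q2, p2)
    d3, d4 = _ccw(p1, p2, q1), _ccw(p1, p2, q2)
    return (d1 * d2 < 0) and (d3 * d4 < 0)

def endpoint_visible(loc, corner, footprint_segments):
    """A footprint corner is visible from `loc` if the segment loc->corner crosses
    no footprint segment; segments touching the corner itself are skipped."""
    for a, b in footprint_segments:
        if np.allclose(a, corner) or np.allclose(b, corner):
            continue
        if segments_intersect(loc, corner, a, b):
            return False
    return True

def visibility_type(loc, seg, footprint_segments):
    """Return 'full', 'partial' or 'none' for an exterior segment seg = (c1, c2)."""
    v1 = endpoint_visible(loc, seg[0], footprint_segments)
    v2 = endpoint_visible(loc, seg[1], footprint_segments)
    if v1 and v2:
        return 'full'
    return 'partial' if (v1 or v2) else 'none'
```

Running visibility_type for every exterior segment at every pixel of Ω produces the visible-segment list used in the voting below; for interior segments, the loop would simply skip the exterior segments belonging to the same building, as described above.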
In general, a visible exterior segment given by the above test really is visible in the panorama, unless it is occluded by trees or other objects. A visible interior segment, however, may not actually be visible, since it can be occluded by the exterior segments of the same building, which is the typical case for tall buildings. Therefore, only the visible exterior segments are used in the voting process.

5.3.2 Voting for the Camera Pose

Suppose a line segment c1c2 (c1 and c2 are its endpoints) on the building footprints corresponds to a line segment u1u2 (u1 and u2 are its endpoints, lying in image i and image j respectively) in a panorama. Based on this correspondence, the possible camera pose of the panorama can be computed. There are five cases, discussed below.

In the first case, c1c2 is fully visible in the panorama, and c1 and c2 have the same height from the ground plane in 3D space (this information comes from the user interaction described in Section 5.2). That means the 3D line connecting c1 and c2 is parallel to the ground plane. Its direction in the world frame is d = (c2 − c1, 0). We order the endpoints so that c1 and c2 correspond to u1 and u2 respectively (this can be done by assuming that all the buildings in the panorama are right-side-up). The camera frame of the panorama is rectified by rotating it with R0 (see Eq. 5.1). The new camera frame differs from the world frame only by a rotation of angle α around the world Z axis. Figure 5.2(a) shows this rectified camera frame with a unit sphere centered at its origin O. v1 and v2 are the two rays back-projected from u1 and u2; their directions in the rectified camera frame are denoted v1^c and v2^c. Suppose the camera calibration matrix is K and the camera rotations from image i and image j to the reference image of the panorama are Ri and Rj respectively. Then v1^c and v2^c can be computed as:

v1^c = R0 Ri K⁻¹ u1,   v2^c = R0 Rj K⁻¹ u2.   (5.3)

Π is the plane spanned by v1 and v2. The intersection of Π with the unit sphere is a great circle that meets the great circle in the XY plane at two points; the one on the side of v2 is denoted P. Since the vanishing point of d lies on the line u1u2, d must lie on Π in the world frame. In addition, d lies in the XY plane and is oriented from c1 to c2 (consistently with the direction from u1 to u2), so the direction OP must coincide with d in the world frame. Denote the angle from the world X axis to d as φ. The normal of Π is n = v1^c × v2^c; its projection onto the XY plane is m, whose angle from the X axis of the rectified camera frame is denoted ψ. Therefore, the angle from the X axis of the rectified camera frame to OP is ψ + π/2, and the angle of the rotation from the rectified camera frame to the world frame is:

α = φ − ψ − π/2.   (5.4)

Figure 5.2: (a) Computing the rotation from the rectified camera frame to the world frame. (b) Computing the camera location.

After obtaining α, the rotation R from the camera frame to the world frame is determined by Eq. 5.2. The directions of the two rays v1 and v2 in the world frame are denoted v1^w and v2^w, and they can be computed as:

v1^w = R Ri K⁻¹ u1,   v2^w = R Rj K⁻¹ u2.   (5.5)

The orthographic projections of v1^w and v2^w onto the XY plane are denoted v'1 and v'2; they must pass through c1 and c2 respectively, as shown in Figure 5.2(b), where o_c is the camera location. ω1 and ω2 are the angles from the X axis to v'1 and v'2 respectively, and β is the angle from the X axis to the vector from c1 to c2. Therefore, ∠c1 o_c c2 = ω2 − ω1 and ∠c1 c2 o_c = β − ω2. Let l = |c1c2|.
The location (o_x, o_y) of the camera center can then be computed as:

o_x = [sin(ω2 − β) / sin(ω2 − ω1)] · l · cos ω1 + x1,
o_y = [sin(ω2 − β) / sin(ω2 − ω1)] · l · sin ω1 + y1,   (5.6)

where (x1, y1) are the coordinates of the corner c1 in the aerial image. Therefore, the camera pose can be estimated from Eq. 5.4 and Eq. 5.6.

However, the coordinates of c1 and c2 given by the user may drift slightly from their real values, and the endpoints of the line segments detected in the ground-level images may not be very accurate. In addition, the camera calibration and the rotations between ground-level images estimated during panorama stitching may also have errors. All these factors can cause the parameters in Eq. 5.4 and Eq. 5.6 to vary in a small range around their estimated values. Hence, from Eq. 5.4 and Eq. 5.6 we obtain a range of possible camera poses instead of an exact value. In our experiments, the coordinates of c1 and c2 are allowed up to a 2-pixel drift, from which the ranges of β and l can be computed. We also assume that ω1 and ω2 may deviate by up to 3 degrees from their estimated values, and ψ by up to 1 degree. To simplify the computation, Eq. 5.4 and Eq. 5.6 are assumed to be monotonic in each parameter within its local range. Based on this, the range of possible camera poses can be easily computed.

In the second case, c1c2 is also fully visible, but c1 and c2 have different heights from the ground plane. Therefore, Eq. 5.4 cannot be used to compute α. Eq. 5.6 still holds, but the angles ω1 and ω2 are unknown. However, the directions v1^c and v2^c in the rectified camera frame can still be computed with Eq. 5.3. Denote the angles from the X axis to their orthographic projections onto the XY plane of the rectified camera frame as ω'1 and ω'2. It is easy to see that ω2 − ω1 = ω'2 − ω'1 and ω1 = α + ω'1. Therefore, Eq. 5.6 can be rewritten as:

(o_x − a)² + (o_y − b)² = l² / (4 sin²(ω'2 − ω'1)),   (5.7)

and

sin(2α + ω'1 + ω'2) = (o_x − a) · 2 sin(ω'2 − ω'1) / l,
cos(2α + ω'1 + ω'2) = −(o_y − b) · 2 sin(ω'2 − ω'1) / l,   (5.8)

where (a, b) is the center of the circle passing through c1 and c2 on which the chord c1c2 subtends the angle ω'2 − ω'1; it is determined by x1, y1, l, β and ω'2 − ω'1. Obviously, the possible camera location lies on a circle centered at (a, b). By intersecting this circle with the region Ω, the range of the possible o_x can be computed. Given a possible value of o_x, o_y can be computed from Eq. 5.7 (there may be two values), and α can then be computed from Eq. 5.8. For the same reason as in the first case, o_y and α lie in a local range around their estimated values. By scanning all the possible values of o_x (with a 1-pixel step), the range of the possible camera pose can be obtained.

In the third case, c1c2 is partially visible, and c1 and c2 are known to have the same height. Suppose c1 is visible while c2 is not. Eq. 5.4 can still be used to compute α. Eq. 5.6 does not hold, since v'2 no longer passes through c2. However, v'1 still passes through c1, and hence the camera location is on the ray c1o_c whose angle from the X axis of the world frame is ω1. By intersecting this ray with the region Ω, the range of the possible o_x can be computed. In addition, since u1u2 is the projection of only part of c1c2, the distance from c1 to the intersection of v'2 with c1c2 must be shorter than the length of c1c2. This gives the following constraint:
o_x < [sin(ω2 − β) / sin(ω2 − ω1)] · l · cos ω1 + x1.   (5.9)

Given a possible value of o_x, o_y can be computed with:

o_y = (o_x − x1) · tan ω1 + y1.   (5.10)

Again, o_y can vary within a local range around the estimated value. The range of the possible camera pose is obtained by scanning all possible o_x values.

In the fourth case, c1c2 is partially visible, and c1 and c2 have different heights from the ground. α cannot be computed by Eq. 5.4, and the camera location can be anywhere inside the region Ω as long as c1c2 is partially visible from it. Therefore, we scan all the locations from which c1c2 is partially visible. At each location, denote the angle from the X axis of the world frame to the vector from o_c to c1 (see Figure 5.2(b)) as γ. The rotation angle can then be computed as α = γ − ω'1, where ω'1 is defined as in the second case. In addition, since c1c2 is only partially visible, the length of c1c2 provides a constraint on the possible camera location similar to that in the third case.

Finally, we consider the situation where the height values of c1 and c2 have already been computed from previous panoramas, which happens when multiple panoramas are used to model a large environment. Suppose c1c2 is fully visible. Denote the angles from v1 and v2 (the rays back-projected from u1 and u2) to the ground plane as θ1 and θ2. The lengths of the segments c1o_c and c2o_c can then be computed as |c1o_c| = h1/tan θ1 and |c2o_c| = h2/tan θ2, where h1 and h2 are the height values of c1 and c2 relative to the camera center. Given |c1o_c|, the camera location is constrained to lie on the circle centered at c1 with radius |c1o_c|. Due to errors, the camera location actually lies inside a circular band: we assume h1 and h2 may have up to a 5-foot error and that θ1 and θ2 may vary by 1 degree, from which the width of the band can be computed. Similarly, |c2o_c| constrains the camera location to another circular band. In addition, as in the second case, the angle ∠c1o_cc2 = ω'2 − ω'1 constrains the camera location to the neighborhood of yet another circle. Therefore, the possible camera location lies inside the intersection of the three areas. At each pixel in this region, the camera rotation angle is computed in the same way as in the fourth case: α = γ − ω'1.

In the above discussion, both c1 and c2 are assumed to be visible in the panorama. If c1c2 is only partially visible, a similar analysis applies.

Each exterior segment in the visible-segment list can form a possible correspondence with each line segment in the panorama (except the vertical segments found during vertical vanishing-point detection). According to the visibility type of the exterior segment, the appropriate case above is selected to estimate the range of the possible camera pose. Each camera location in this range is then checked for consistency with the assumed visibility type. Some exterior segments may have two possible visibility types, since they are fully visible from some locations in Ω and partially visible from others; both possibilities are considered for them. The camera pose space is divided into buckets. In our experiments, each bucket represents a 3-pixel by 3-pixel area for the camera location (o_x, o_y) and a one-degree range for the rotation angle. Each possible correspondence votes for the buckets within the estimated range of the compatible camera pose.
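The closed-form pose of the first case (Eqs. 5.4 and 5.6) and the bucket voting just described can be summarized in the sketch below. It is illustrative only: the function and variable names are mine, each putative match's pose range is represented as a pre-sampled list of (α, o_x, o_y) triples (which the other four cases would produce from their own formulas), and the bucket sizes are the ones quoted in the text.

```python
import numpy as np
from collections import defaultdict

LOC_STEP, ANG_STEP = 3.0, 1.0      # 3x3-pixel location buckets, 1-degree angle buckets

def pose_case1(phi, psi, omega1, omega2, beta, l, c1):
    """First case (Eqs. 5.4 and 5.6): c1c2 fully visible, equal corner heights.
    phi and psi come from the vanishing-direction construction; omega1, omega2,
    beta and l from the footprint geometry; c1 = (x1, y1) in aerial-image pixels."""
    alpha = phi - psi - np.pi / 2.0                       # Eq. 5.4
    s = np.sin(omega2 - beta) / np.sin(omega2 - omega1)   # common factor of Eq. 5.6
    ox = s * l * np.cos(omega1) + c1[0]
    oy = s * l * np.sin(omega1) + c1[1]
    return np.degrees(alpha), ox, oy

def bucket_of(alpha_deg, ox, oy):
    return (int(alpha_deg // ANG_STEP), int(ox // LOC_STEP), int(oy // LOC_STEP))

def vote(correspondences):
    """correspondences: iterable of (exterior_segment_id, pose_samples), where
    pose_samples is a list of (alpha_deg, ox, oy) covering the pose range of one
    putative match.  Votes from the same exterior segment are counted at most
    once per bucket, so the winning bucket is the pose that explains the largest
    number of exterior segments."""
    voters = defaultdict(set)      # bucket -> set of exterior segment ids
    for seg_id, pose_samples in correspondences:
        for alpha_deg, ox, oy in pose_samples:
            voters[bucket_of(alpha_deg, ox, oy)].add(seg_id)
    if not voters:
        return None, set()
    return max(voters.items(), key=lambda kv: len(kv[1]))
```

The connection-constraint check of the next subsection then operates on the segment ids collected per bucket.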
It is possible for an exterior segment to form multiple possible correspondences with segments in the panorama and thus to vote multiple times for the same bucket. However, these votes are counted only once, since only one of them can be correct. In other words, the most likely camera pose is the one that gives the largest number of exterior segments a corresponding segment in the panorama.

5.3.3 Additional Constraints

As mentioned above, multiple segments in the panorama can be matched with the same exterior segment and vote for the same bucket. This happens either because of the discretization of the camera pose space or because the orthographic projections of their corresponding 3D line segments onto the ground plane almost coincide. An example is shown in Figure 5.3(f), in which all the yellow (red) segments matched with segment L2 (L3) in Figure 5.3(b) vote for the same bucket. If the bucket represents the correct camera pose, any of them may be the corresponding segment of that exterior segment. We call them candidates, since only one of them can be selected.

The voting process does not consider the connection relationships between the exterior segments. If two segments on the building footprints are connected, their corresponding line segments in the panorama should also be connected. This constraint can be used to select the correct corresponding segments out of all the candidates. The exterior segments that vote for a bucket are divided into groups; within each group they form a connected sequence. We need to find an optimal assignment of their corresponding segments so that the connection constraints are satisfied as well as possible. If the connection constraint is broken between two consecutive segments in a solution, a cost is incurred. The optimal solution is the one with the minimum total cost, and it can be found efficiently with the Viterbi algorithm. If the total cost of the optimal solution is not zero, the exterior segments in the longest subsequence satisfying the connection constraints are kept and all the other segments in the sequence are discarded; the number of votes received by the bucket is reduced accordingly.

It is possible that there are multiple optimal solutions with the same total cost. An example is shown in Figure 5.3(g), in which selecting (E1, E2, E3, E4) or (E5, E6, E7, E8) as the corresponding segments of the exterior segments (L1, L2, L3, L4) in Figure 5.3(b) makes no difference with regard to the connection constraints. In this case, we make the choice according to the positions of the line segments in the panorama: the upper ones are selected, since they are more likely to lie on the building roofs.

To improve speed, not every bucket in the pose space is checked for connection constraints. We divide the pose space into bigger buckets, each of which represents a 10-pixel by 10-pixel area for the camera location and a 90-degree range for the rotation angle. After the voting process, only the small buckets that have the maximum number of votes inside each big bucket are checked for connection constraints. Among these, the one with the maximum number of votes after the checking represents the most likely camera pose, and the correspondences selected according to this bucket are kept.

5.3.4 User Correction

Since the line detection is not perfect, some correct line segments may not be detected or may be inaccurate. This causes errors in the detected correspondences.
Therefore, the line segments selected by the automatic approach in the panorama as the corresponding segments of those on the building footprints are highlighted in the panorama. If some correct segments are missed or inaccurate, the user can select their accurate endpoints directly in the panorama. In addition, if some of the interior segments on the building footprints are visible, the user can also select them in order to obtain an accurate model of the rooftops. Once a new segment is selected, it triggers a new voting process, built on the previous one, in which the new segment is matched with every line segment in the visible-segment list; this time the visible interior segments also join the voting. Only the buckets that receive new votes can contain the correct camera pose. Hence, the connection-constraint checking is applied only to these buckets, with the additional requirement that the segments input by the user must be selected. Among these buckets, the one with the largest number of votes represents the most likely camera pose. The correspondences associated with this bucket are kept and the highlighted line segments in the panorama are updated accordingly. In this way, the automatic approach intelligently adjusts its prediction according to the user's input, and further user interaction is based on the new prediction. This process may be repeated several times until all the correct line segments on the building roofs are highlighted in the panorama.

5.4 Model Construction and Optimization

At this point, the line segment correspondences and an estimate of the camera pose have been obtained. Each line segment correspondence gives one or two corner correspondences, depending on its visibility type. Assume corner c on the building footprints corresponds to image corner u in the panorama. Their corresponding 3D point is at the intersection of the ray back-projected from u with the 3D line that passes through c and is perpendicular to the ground plane (in practice, the point on the vertical 3D line with the minimal distance to the ray is computed; a small sketch of this computation is given at the end of this section).

Bundle adjustment is then applied to optimize all the estimated parameters, including the camera calibration, the camera rotations between the ground-level images in the panorama, the camera pose of the reference image and the 3D coordinates of the roof corners. There are three kinds of errors to minimize. The first is the image matching error between ground-level images in the panorama [15]. The second is the projection error of the 3D roof corners. The third is the difference between the height values of two 3D roof corners that are supposed to have the same height from the ground. The x and y coordinates of the roof corners should also stay close to the values provided by the user; to ensure this, a soft constraint is added [31]. The sparse bundle adjustment package of Lourakis and Argyros is used to implement the optimization [42]. Note that all the corner heights computed so far are relative to the camera center. We can fix the height of a corner that is known to be on the ground plane to 0; the heights of the other corners relative to the ground plane can then be easily computed.

The 3D model is composed of the polygons of the building roofs and the vertical wall planes associated with the exterior segments on the building footprints. The heights of the roof corners that are invisible in the panorama are obtained when they are known to have the same height as some visible corners.
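Below is the sketch of the corner-height computation referred to above: the roof corner is recovered as the point on the vertical 3D line through the footprint corner c that is closest to the back-projected ray. It is a minimal illustration under the chapter's frame convention (world origin at camera height, Z up); the function name and argument layout are mine, not the thesis implementation.

```python
import numpy as np

def corner_height(c_xy, cam_center, ray_dir):
    """Height of a roof corner relative to the camera center.

    c_xy       : (x, y) of the footprint corner in the aerial image (world X/Y).
    cam_center : 3D camera center in the world frame (its z is 0 by convention).
    ray_dir    : direction of the ray back-projected from the matched image corner.
    Returns the parameter t of the point on the vertical line
    (c_x, c_y, 0) + t * (0, 0, 1) closest to the ray, i.e. the corner height."""
    p0 = np.array([c_xy[0], c_xy[1], 0.0])
    u = np.array([0.0, 0.0, 1.0])                  # vertical line direction
    v = np.asarray(ray_dir, dtype=float)
    w0 = p0 - np.asarray(cam_center, dtype=float)
    b, c, d, e = u @ v, v @ v, u @ w0, v @ w0      # a = u.u = 1
    denom = c - b * b
    if abs(denom) < 1e-12:                         # ray (nearly) parallel to the vertical line
        return None
    return (b * e - c * d) / denom                 # closest-point parameter along the vertical line
```

The bundle adjustment described above then refines these initial heights together with the camera parameters.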
The ground-level images are then projected onto the 3D model as the facade texture.

To model a large area, multiple panoramas are taken and processed one by one. In the final step, a bundle adjustment can be applied to optimize the parameters obtained from all the panoramas simultaneously.

5.5 Experimental Results

We did extensive experiments to test our approach. An example is depicted in Figure 5.3. Figure 5.3(a) is the aerial image, in which the green dot indicates the approximate camera location. Figure 5.3(b) shows the detailed structure of one building in Figure 5.3(a). Figure 5.3(c) shows the 360-degree panorama stitched from 16 ground-level images. Line segments are detected in each ground-level image; Figure 5.3(d) and Figure 5.3(e) are two samples. After the voting process, an exterior segment may have multiple candidate corresponding segments in the panorama within a single bucket. An example is illustrated in Figure 5.3(f), in which all the yellow (red) segments are the candidate corresponding segments of L2 (L3) in Figure 5.3(b). After checking the connection constraints, the optimal correspondences can be found. However, ambiguities may still exist, as demonstrated in Figure 5.3(g), where selecting (E1, E2, E3, E4) or (E5, E6, E7, E8) makes no difference with regard to the connection constraints. In this case, (E1, E2, E3, E4) is selected since these segments have higher positions in the panorama and hence are more likely to be on the building roof. The line segments selected by the automatic approach in the panorama are shown in Figure 5.3(h) and Figure 5.3(i). All of them are correct, but two are inaccurate due to poor image quality (circled in Figure 5.3(h)) and occlusion (circled in Figure 5.3(i)). The user selected their accurate endpoints in the panorama and added three line segments for the interior segments, shown as the green segments in Figure 5.3(j) and Figure 5.3(k). After this, all the correct correspondences are detected. The constructed 3D model is shown in Figure 5.3(l)-(o).

Another example is shown in Figure 5.4, in which two panoramas are used to model a larger area. Figure 5.4(a) is the aerial image with the two blue dots marking the approximate camera locations. Figure 5.4(b)-(e) are several views of the created 3D model. To test the accuracy of the constructed model, we measured the actual size of these buildings with an accurate laser rangefinder. After being multiplied by a common scale (1 foot per pixel), the average error in the x and y dimensions of these buildings in the 3D model is around half a meter; this largely depends on the resolution of the aerial image. The average error in the building heights is around 1 meter.

In the first example, the user only needs to click 10 points (for 5 line segments) in the ground-level panorama in order to register it with the aerial image. If the registration were completely manual, the user would have to click 40 points (20 in the ground-level panorama and 20 in the aerial image) to obtain the same result. In the second example, the user only needs to click 12 points in the ground-level panoramas; otherwise 78 points would have to be clicked in the ground-level panoramas and the aerial image. Therefore, the user interaction is greatly reduced with our registration approach.

5.6 Summary

We have presented a system that can model a group of buildings from an orthorectified aerial image and multiple ground-level panoramas.
The aerial image provides the building footprints and the texture for the terrain and building roofs. The ground-level panoramas are used to compute the height of each roof corner and provide the facade texture. The user interaction is mainly in the aerial image, to select the line segments on the building roofs. The approach automatically extracts line segments in the ground-level panoramas and detects their correspondences in the aerial image. The user monitors the result of the automatic approach and modifies it by selecting the correct line segments in the panorama; the automatic approach then adjusts its prediction based on the user's input. In most cases, the user only needs to select those line segments that are very hard to extract automatically due to poor image quality, complex background or occlusion, while the remaining line segments are detected automatically. There is no need for the user to manually indicate the mapping between the line segments in the aerial image and those in the panoramas. The created 3D model is accurate and has photo-realistic texture for the building roofs and facades.

Currently the building facades are simply composed of vertical wall planes. One direction of our future work is to recover more detailed 3D structures, such as windows and balconies, from multiple overlapping panoramas, based on the obtained coarse model and the accurately estimated camera poses of the ground-level images.

Figure 5.3: (a) Aerial image. The red line segments are provided by the user as the building footprints. The green dot is the approximate camera location. (b) Zoom-in on part of the aerial image. (c) Panorama. (d)-(e) Line detection in two of the ground-level images. (f) All yellow (red) line segments are the candidate counterparts of line segment L2 (L3) in (b). (g) The segment sequence E1 E2 E3 E4 is selected instead of E5 E6 E7 E8 according to the additional constraints. (h)-(i) Line segments selected by the automatic approach. Some errors are indicated by the yellow circles. (j)-(k) The user corrected the output of the automatic approach by selecting the green line segments. (l)-(o) Several synthesized views of the created 3D model.

Figure 5.4: (a) Aerial image. The blue dots indicate the approximate camera locations of the two panoramas. (b)-(e) Several views of the constructed model.

Chapter 6
Conclusion

Line segments are crucial in the visual analysis of man-made environments, and line-based matching approaches are powerful tools for solving the different registration problems encountered in 3D urban modeling.

For wide-baseline image matching, a novel local feature called a line signature is introduced. To my knowledge, the line signature is the first curve-based local feature whose description is explicitly based on the shape of the curves. The image matching approach based on line signatures performs better at matching low-texture images, and at handling large viewpoint changes and illumination variations, than existing local features that are directly based on pixels.

For the registration of aerial images with untextured aerial LiDAR data, a robust automatic approach is proposed in which line segments are detected both in the aerial images and in the projection of the 3D outlines of building rooftops. The line segment matching between the two data sources is based on the matching of 3CS features, which are composed of three connected line segments.
Due to the high distinctiveness and repeatability of 3CS features, the percentage of inliers in the putative feature matches is much higher than in existing methods, which leads to the high robustness of the approach.

An interactive system is presented that can quickly reconstruct a 3D model of a group of buildings, with high-resolution texture for both rooftops and building facades, by combining the information from ground-level panoramas and orthorectified aerial images. The user interaction is greatly reduced by automating the registration between these two data sources taken from very different view angles. The approach exploits the constraints from multi-view camera geometry and high-level knowledge about the structure of buildings, such as vanishing points.

Bibliography

[1] http://earth.google.com/.
[2] http://www.cs.ubc.ca/spider/lowe/research.html.
[3] http://www.microsoft.com/virtualearth/.
[4] http://www.robots.ox.ac.uk/~vgg/research/affine/.
[5] http://www.sketchup.com/.
[6] http://www.vision.ee.ethz.ch/showroom/zubud/index.en.html.
[7] N. Ahuja and S. Todorovic. Learning the taxonomy and models of categories present in arbitrary images. ICCV, 2007.
[8] A. Akbarzadeh, J.-M. Frahm, P. Mordohai, et al. Towards urban 3d reconstruction from video. Int. Symp. on 3D Data Processing, Visualization and Transmission, 2006.
[9] N. Ansari and E. J. Delp. On detecting dominant points. Pattern Recognition, 24(5):441-451, 1991.
[10] M. E. Antone and S. Teller. Automatic recovery of relative camera rotations for urban scenes. Computer Vision and Pattern Recognition Conference, 2:282-289, 2000.
[11] A. Baumberg. Reliable feature matching across widely separated views. CVPR, pages 774-781, 2000.
[12] H. Bay, V. Ferrari, and L. Van Gool. Wide-baseline stereo matching with line segments. CVPR, pages 329-336, 2005.
[13] S. Belongie, J. Malik, and J. Puzicha. Shape context: A new descriptor for shape matching and object recognition. NIPS, pages 831-837, 2000.
[14] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. 5th Annual ACM Workshop on COLT.
[15] M. Brown and D. G. Lowe. Recognising panoramas. International Conference on Computer Vision, 2:1218-1225, 2003.
[16] J. Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8:679-698, Nov. 1986.
[17] R. Cipolla, D. Robertson, and E. Boyer. Photobuilder: 3d models of architectural scenes from uncalibrated images. International Conference on Multimedia Computing and Systems, 1:25-31, 1999.
[18] N. Cornelis, K. Cornelis, and L. Van Gool. Fast compact city modeling for navigation pre-visualization. Computer Vision and Pattern Recognition Conference, 2:1339-1344, 2006.
[19] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization with bags of keypoints. ECCV'04 Workshop on Statistical Learning in Computer Vision, pages 59-74, 2004.
[20] P. E. Debevec, C. J. Taylor, and J. Malik. Modeling and rendering architecture from photographs: A hybrid geometry- and image-based approach. SIGGRAPH, pages 11-20, 1996.
[21] R. Deriche and O. Faugeras. Tracking line segments. European Conference on Computer Vision, 1990.
[22] M. Ding, K. Lyngbaek, and A. Zakhor. Automatic registration of aerial imagery with untextured 3d lidar models. CVPR, pages 1-8, 2008.
[23] S. Edelman, N. Intrator, and T. Poggio. Complex cells and object recognition. Unpublished manuscript: http://kybele.psych.cornell.edu/~edelman/archive.html, 1997.
[24] V. Ferrari, L. Fevrier, F. Jurie, and C. Schmid. Groups of adjacent contour segments for object detection. PAMI, 30(1):36-51, 2008.
[25] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Graphics and Image Processing, 24(6):381-395, 1981.
[26] C. Frueh, R. Sammon, and A. Zakhor. Automated texture mapping of 3d city models with oblique aerial imagery. 3DPVT, pages 396-403, 2004.
[27] C. Frueh and A. Zakhor. 3d model generation for cities using aerial photographs and ground level laser scans. Computer Vision and Pattern Recognition Conference, 2:31-38, 2001.
[28] C. Frueh and A. Zakhor. 3d model generation for cities using aerial photographs and ground level laser scans. Computer Vision and Pattern Recognition Conference, 2:31-38, 2003.
[29] C. Fruh and A. Zakhor. An automated method for large-scale, ground-based city model acquisition. IJCV, 60(1):5-24, 2004.
[30] L. Van Gool, T. Moons, and D. Ungureanu. Affine/photometric invariants for planar intensity patterns. ECCV, pages 642-651, 1996.
[31] R. Hartley and A. Zisserman. Multiple view geometry in computer vision. Cambridge, 2000.
[32] R. I. Hartley and A. Zisserman. Multiple view geometry in computer vision, 2004.
[33] M. D. Heath and S. Sarkar. A robust visual method for assessing the relative performance of edge-detection algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19:1338-1359, Dec. 1997.
[34] V. Hedau, H. Arora, and N. Ahuja. Matching images under unstable segmentations. CVPR, pages 1-8, 2008.
[35] R. Horaud and T. Skordas. Stereo correspondence through feature grouping and maximal cliques. PAMI, pages 1168-1180, 1989.
[36] J. Hu, S. You, and U. Neumann. Approaches to large-scale urban modeling. IEEE Computer Graphics and Applications, 23(6):62-69, 2003.
[37] B. Huet and E. R. Hancock. Line pattern retrieval using relational histograms. PAMI, (12):1363-1370, 1999.
[38] D. Jacobs. Groper: a grouping based object recognition system for two-dimensional objects. IEEE Workshop on Computer Vision, pages 164-169, 1987.
[39] S. C. Lee and R. Nevatia. Automatic integration of facade textures into 3d building models with a projective geometry based line clustering. Computer Graphics Forum (Eurographics), 21(3):511-519, 2002.
[40] T. Lindeberg. Edge detection and ridge detection with automatic scale selection. International Journal of Computer Vision, 30:117-156, Nov. 1998.
[41] L. Liu and I. Stamos. Automatic 3d to 2d registration for the photorealistic rendering of urban scenes. CVPR, pages 137-143, 2005.
[42] M. Lourakis and A. Argyros. The design and implementation of a generic sparse bundle adjustment software package based on the levenberg-marquardt algorithm. Tech. Rep. 340, Inst. of Computer Science-FORTH, Heraklion, Crete, Greece, 2004.
[43] D. Lowe. Three-dimensional object recognition from single two-dimensional images. Artificial Intelligence, 1987.
[44] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91-110, 2004.
[45] B. C. Matei, H. S. Sawhney, S. Samarasekera, J. Kim, and R. Kumar. Building segmentation for densely built urban regions using aerial lidar data. CVPR, pages 1-8, 2008.
[46] G. Medioni and R. Nevatia. Segment-based stereo matching. CVGIP, 1985.
[47] P. Meer and B. Georgescu. Edge detection with embedded confidence. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23:1351-1365, Dec. 2001.
[48] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. PAMI, 27(10):1615-1630, 2004.
[49] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. IJCV, 65(1):43-72, 2005.
[50] V. S. Nalwa and T. O. Binford. On detecting edges. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8:699-714, 1986.
[51] R. C. Nelson and A. Selinger. Large-scale tests of a keyed, appearance-based 3-d object recognition system. Vision Research, (15):2469-2488, 1998.
[52] E. Nowak, F. Jurie, and B. Triggs. Sampling strategies for bag-of-features image classification. ECCV, 2006.
[53] M. Pollefeys, L. V. Gool, M. Vergauwen, F. Verbiest, K. Cornelis, J. Tops, and R. Koch. Visual modeling with a hand-held camera. International Journal of Computer Vision, 59(3):207-232, 2004.
[54] M. Pollefeys, D. Nister, J.-M. Frahm, A. Akbarzadeh, P. Mordohai, B. Clipp, C. Engels, D. Gallup, S.-J. Kim, P. Merrell, C. Salmi, S. Sinha, B. Talton, L. Wang, Q. Yang, H. Stewenius, R. Yang, G. Welch, and H. Towles. Detailed real-time urban 3d reconstruction from video. IJCV, 78(2-3):143-167, 2008.
[55] C. Poullis, S. You, and U. Neumann. Rapid creation of large-scale photorealistic virtual environments. IEEE Virtual Reality, pages 153-160, 2008.
[56] D. P. Robertson and R. Cipolla. Building architectural models from many views using map constraints. European Conference on Computer Vision, pages 155-169, 2002.
[57] C. A. Rothwell, J. L. Mundy, W. Hoffman, and V. D. Nguyen. Driving vision by topology. Proceedings of the International Symposium on Computer Vision, pages 395-400, 1995.
[58] C. Sagues and J. J. Guerrero. Robust line matching in image pairs of scenes with dominant planes. Optical Engineering, (6):067204/1-12, 2006.
[59] C. Schmid and A. Zisserman. Automatic line matching across views. CVPR, 1997.
[60] H. Y. Shum, M. Han, and R. Szeliski. Interactive construction of 3d models from panoramic mosaics. Computer Vision and Pattern Recognition Conference, pages 427-433, 1998.
[61] S. N. Sinha and M. Pollefeys. Pan-tilt-zoom camera calibration and high-resolution mosaic generation. Computer Vision and Image Understanding, 103(3):170-183, 2006.
[62] S. N. Sinha, D. Steedly, R. Szeliski, M. Agrawala, and M. Pollefeys. Interactive 3d architectural modeling from unordered photo collections. ACM Trans. on Graphics (SIGGRAPH Asia'08), 2008.
[63] J. Sivic and A. Zisserman. Video google: a text retrieval approach to object matching in videos. ICCV, pages 1470-1477, 2003.
[64] N. Snavely, S. M. Seitz, and R. Szeliski. Photo tourism: exploring photo collections in 3d. ACM Trans. on Graphics (SIGGRAPH'06), pages 835-846, 2006.
[65] I. Stamos and P. K. Allen. Geometry and texture recovery of scenes of large scale. Computer Vision and Image Understanding, 88(2):94-118, 2002.
[66] I. Stamos, L. Liu, C. Chen, G. Wolberg, G. Yu, and S. Zokai. Integrating automated range registration with multiview geometry for the photorealistic modeling of large-scale scenes. IJCV, 78(2-3):237-260, 2008.
[67] V. Venkateswar and R. Chellappa. Hierarchical stereo and motion correspondence using feature groupings. IJCV, (3):245-269, 1995.
[68] L. Wang and U. Neumann. A robust approach for automatic registration of aerial images with untextured aerial lidar data. Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[69] L. Wang, U. Neumann, and S. You. Wide-baseline image matching using line signatures. Proc. of IEEE International Conference on Computer Vision, 2009.
[70] L. Wang, S. You, and U. Neumann. Semiautomatic registration between ground-level panoramas and an orthorectified aerial image for building modeling. ICCV Workshop on Virtual Representation and Modeling of Large-scale Environments, pages 8-15, 2007.
[71] L. Wang, S. You, and U. Neumann. Supporting range and segment-based hysteresis thresholding in edge detection. ICIP, 2008.
[72] P. L. Worthington and E. R. Hancock. Region-based object recognition using shape-from-shading. ECCV, pages 445-471, 2000.
[73] S. Yang. Symbol recognition via statistical integration of pixel-level constraint histograms: a new descriptor. PAMI, (2):278-281, 2005.
[74] W. Zhao, D. Nister, and S. Hsu. Alignment of continuous video onto 3d point clouds. PAMI, 27(8):1305-1318, 2005.
[75] Q. Y. Zhou and U. Neumann. Fast and extensible building modeling from airborne lidar data. ACM GIS, pages 1-8, 2008.
Abstract
Man-made environments are full of line segments, and a complex curve can be approximated with multiple straight-line segments. Therefore, line segment matching is an important computer vision problem, and it is a powerful tool for solving registration problems in 3D urban modeling.