AUTOMATIC IMAGE MATCHING FOR MOBILE MULTIMEDIA APPLICATIONS

by Quan Wang

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

May 2011

Copyright 2011 Quan Wang

Table of Contents

List of Tables
List of Figures
Abstract
Chapter I: Introduction
Chapter II: Related Works
Chapter III: Real-Time Image Matching System
  3.1 The Walsh-Hadamard Kernels Projection
    3.1.1 General projections in Euclidean space
    3.1.2 The Walsh-Hadamard kernel
  3.2 MVKP for Real-time Feature Matching
    3.2.1 Offline training stage
    3.2.2 Feature set construction for query image
    3.2.3 Establishing feature correspondences
  3.3 Distinctive Feature Selection
    3.3.1 Pre-processing Stage
    3.3.2 Feature Selection
  3.4 Experimental Results and Evaluations
    3.4.1 Effect of projection kernels
    3.4.2 Feature distinctiveness
    3.4.3 Matching accuracy and robustness
    3.4.4 Matching speed
    3.4.5 Feature selection evaluation
Chapter IV: Augmented Distinctive Features for Efficient Image Matching
  4.1 Feature Augmentation Process
    4.1.1 Sub-images
    4.1.2 Relative features
    4.1.3 Normalization
  4.2 Establishing Correspondences using Augmented Features
  4.3 Experimental Results
    4.3.1 Standard tests
    4.3.2 Dense matching from distinctive features
    4.3.3 Matching visualization and analysis
Chapter V: Fast Similarity Search for High-Dimensional Dataset
  5.1 FFVA for Similarity Search
  5.2 Experimental results
    5.2.1 Searching accuracy
    5.2.2 Data access rate
    5.2.3 Query speed
    5.2.4 Memory requirement
Chapter VI: Application Example, Augmented Museum Exhibitions
  6.1 Augmented Reality and Related Works
  6.2 System Overview
  6.3 Adapted MVKP and Information Retrieval
    6.3.1 MVKP adaptations
    6.3.2 Information retrieval
  6.4 Real Museum Test
Chapter VII: Extend to 3D Range Data
  7.1 Introduction and Related Works
  7.2 Registration handling different sensors
    7.2.1 LiDAR segmentation
    7.2.2 Aerial images segmentation
    7.2.3 Region matching
    7.2.4 Principal directions
  7.3 Registration handling viewpoint changes
    7.3.1 MVKP revisited
    7.3.2 Feature Fusion for Registering Wide-baseline Urban Images
  7.4 Experimental results
    7.4.1 LiDAR segmentation results
    7.4.2 ROI extraction from aerial images
    7.4.3 Different sensor registration
    7.4.4 Matching wide-baseline urban images
    7.4.5 Applications
Chapter VIII: Conclusion
Bibliography

List of Tables

Table 3.1, Feature Selection Sample Results for One View Set
Table 3.2, Feature Ranking Change Due to New Selection Criteria
Table 4.1, Algorithm Structure for Tied Ranking Handler
Table 5.1, FFVA algorithmic structure
Table 5.2, Matching Accuracy of FFVA

List of Figures

Figure 1.1, Image Matching Result
Figure 1.2, 2D-3D Registration Result
Figure 3.1, Kernel Projection
Figure 3.2, Overview of the Whole MVKP System
Figure 3.3, Major Components of the Training Stage
Figure 3.4, Major Components of Online Query Stage
Figure 3.5, Rotation Dominant Views
Figure 3.6, Components Overview of Feature Selection
Figure 3.7, Effect of Kernels
Figure 3.8, Procedures for Generating CDIKP Descriptors
Figure 3.9, Kernel Projection Evaluation with Standard Test Dataset
Figure 3.10, Feature Distinctiveness Test
Figure 3.11, Matching Results
Figure 3.12, Accuracy Comparison
Figure 3.13, Query Speed Comparison
Figure 3.14, Speed Comparison
Figure 3.15, Visualization of Feature Ranking Change
Figure 3.16, Feature Selection Improvement
Figure 3.17, Results of Image Matching with Feature Selection
Figure 3.18, Changing the Weight Parameter
Figure 4.1, Overview of Our Feature Augmentation Process
Figure 4.2, Illustrations of Sub-image Concept
Figure 4.3, Relative Features
Figure 4.4, Augmented Features
Figure 4.5, Standard Tests Results
Figure 4.6, Comparison of Feature Distinctiveness
Figure 4.7, Matching Visualization and Comparison
Figure 5.1, Algorithmic Structure of FFVA Approach
Figure 5.2, Query Time of VA-File
Figure 5.3, Data Access Rate Test (results are the averages of 1000 queries)
Figure 5.4, Query Speed Test (Synthetic Data)
Figure 5.5, Query Speed Test (Real Data)
Figure 5.6, Memory Requirement
Figure 5.7, Test for Query Speed Considering Memory Paging
Figure 6.1, AR Example
Figure 6.2, System Overview of the Developed Augmented Exhibition System
Figure 6.3, Low and High Texture Paintings
Figure 6.4, Real Museum Test
Figure 7.1, Overview of Our Proposed 2D-3D Registration System
Figure 7.2, System Illustration
Figure 7.3, Overview of the Proposed 2D-3D Registration System
Figure 7.4, Comparison of Normalization Methods
Figure 7.5, Illustration of Various Interest Points
Figure 7.6, Effectiveness of LCC
Figure 7.7, Effectiveness of Region Refinement
Figure 7.8, Samples of Extracted Shapes
Figure 7.9, Color-coded LiDAR ROI Extraction Results
Figure 7.10, ROI Extraction from Aerial Images
Figure 7.11, Vege-map
Figure 7.12, Shadow-map
Figure 7.13, Seed Locations for One Sample Aerial Image
Figure 7.14, Color-coded Initial Segmentations of the Whole Aerial Image
Figure 7.15, ROI Descriptors
Figure 7.16, ROI Partial Matching
Figure 7.17, Cost Ratio
Figure 7.18, Results of an Easy Scene
Figure 7.19, Detection of Principal Directions
Figure 7.20, Histogram-based Regulations
Figure 7.21, Feature Selection for Urban Scenes
Figure 7.22, Feature Fusion Framework
Figure 7.23, Effectiveness of In-region Edges and Selective Smoothing
Figure 7.24, Overview of the Edge Grouping Component
Figure 7.25, LiDAR Segmentation Comparison
Figure 7.26, ROI Extraction Results (from Aerial Images)
Figure 7.27, Aerial Photos Segmentation Comparison
Figure 7.28, Whole System Results
Figure 7.29, Registration Results of Our Proposed Approach
Figure 7.30, Composition of Average CPU Time for Major Components
Figure 7.31, Results from Feature Fusion
Figure 7.32, Refined and Propagated Matchings
Figure 7.33, Matchings from SIFT
Figure 7.34, Matchings from Our Proposed Method
Figure 7.35, The Proposed Method Applied to Urban Modeling and Rendering Task
Figure 7.36, Apply Our Approach to UAV Localization (Moderate Scale Example)
Figure 7.37, Apply Our Approach to UAV Localization (Large Scale Example)

Abstract

Image matching is a fundamental task in computer vision, used to establish correspondences between two or more images taken, for example, at different times, from different viewpoints, or by different sensors. Image matching is also the core of many multimedia systems and applications.
Today, the rapid convergence of multimedia, computation and communication technologies with techniques for device miniaturization is ushering us into a mobile, pervasively connected multimedia future, promising many exciting applications such as content-based image retrieval (CBIR), mobile augmented reality (MAR), handheld 3D scene modeling and texturing, and vision-based personal navigation and localization. Although notable progress has been achieved in recent years, automatic image matching remains a challenging problem, especially for applications on mobile platforms. Major technical difficulties include the algorithms' robustness to viewpoint and lighting changes, processing speed and storage efficiency on mobile devices, and the capability to handle inputs from different sensors and data sources.

This research focuses on advanced technologies and approaches for image matching, particularly targeting mobile multimedia applications. First, a real-time image matching approach is developed. The approach uniquely combines a kernel projection technique with feature selection and multi-view training to produce efficient feature representations for real-time image matching. To address computational and storage efficiency on mobile devices, the produced feature descriptors are highly compact (20-D) in comparison to the state of the art (e.g., SIFT: 128-D, SURF: 64-D, and PCA-SIFT: 36-D), making them well suited to applications on mobile platforms. Second, coupled with the matching approaches, a fast data search technique has been developed that can rapidly recover and screen possible matches in a large high-dimensional database of features and images. Third, in order to enhance the distinctiveness and efficiency of image features, a feature augmentation process has been proposed that integrates geometric information into local features and produces semi-global image descriptors. Next, combining the developed techniques, an application system called Augmented Museum Exhibitions has been built to demonstrate their effectiveness. Finally, the research is extended to matching images acquired from different sensor modalities, i.e., corresponding 2D optical images to 3D range data from LiDAR sensors. The developed high-level, feature-based matching approach is efficient and is able to automatically register heterogeneous data with significant differences.

Chapter I: Introduction

Image spatial-temporal matching is a fundamental task in computer vision, used to relate two or more images typically taken at different times, from different viewpoints, or even from different sensors. Many intelligent image processing systems and applications require image matching, or closely related operations, as intermediate steps. Examples of application systems in which image matching is a significant component include object recognition, content-based image database retrieval, medical image processing, augmented reality, 3D scene reconstruction, and vision-based autonomous navigation.

In recent years, the rapid convergence of multimedia, communication and vision-based computation technologies, combined with techniques for device miniaturization, has been ushering us into a mobile, wireless and pervasively connected future.
It is now quite common for compact mobile devices to integrate mobile phone, digital camera and PDA functions, heralding many exciting new fields for applications and services, as exemplified by several recently launched multimedia phone services for content-based image retrieval (CBIR), mobile augmented reality (MAR), handheld 3D scene modelling and texturing, and vision-based personal navigation and localization. A specific example is Nokia Point & Find (MyClick) in the telecommunication market. The service employs image recognition and search techniques that allow users to capture designated images, such as product advertisements, and quickly match them to vendors' databases in order to obtain detailed product information associated with the images.

While the development trend is clear, the main challenge posed by such systems is that the integrated image matching algorithms have to be reliable, fast, robust and capable of handling the diverse viewing conditions of the real world. Technically, the image matching and recognition process generally consists of three components: (1) feature detection and selection, which finds stable matching primitives over the spatial-temporal space; (2) feature representation and description, which encodes the detected features into compact and robust descriptors for image matching; and (3) optimal matching and searching, which uses the feature descriptions and additional constraints to locate, index, and recognize targets and scenes of interest.

Focusing on mobile multimedia applications, this thesis presents several advanced methods for image matching, targeting the limitations of current state-of-the-art image matching systems, including robustness to viewing condition changes, processing speed, space efficiency and adaptation to different sensors. Specifically, our major contributions are:

First, Chapter 3 introduces our novel image matching system, Multiple View Kernel Projection (MVKP), for robustly establishing correspondences among 2D images captured by conventional optical sensors in real time. The proposed system utilizes Walsh-Hadamard kernel projection combined with a multiview training stage for real-time matching speed and robustness against viewing condition changes. Targeting applications on cell phones or other mobile devices, where computational power and storage raise important concerns, we exploit the dimension reduction functionality of kernel projection. Experimental results indicate that, in comparison to the state of the art, our novel descriptors provide comparable robustness and invariance while requiring much less CPU time and disk storage due to their compactness. Additionally, we propose a feature selection approach based on feature distinctiveness and invariance, which combines feature selection and multiple-view training into a unified framework. Our selected feature points are less likely to arise from unstable region boundaries and small repeated items in the scene. Therefore, the number of features that needs to be searched is reduced, making the system more suitable for mobile platforms.

Third, we propose a feature augmentation process (Chapter 4), which treats the image matching problem as a recognition problem over spatially related image patch sets. We construct augmented semi-global descriptors (ordinal codes) based on subsets of scale- and orientation-invariant local keypoint descriptors. The tied-ranking problem of ordinal codes is handled by incrementally sampling additional keypoints around the image patch sets. Similarities of augmented features are measured using the Spearman rank correlation coefficient. The produced augmented features are more distinctive and efficient than the base features.
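As a small, hedged illustration of the similarity measure just mentioned (an assumption-laden sketch, not the Chapter 4 implementation, and omitting the tied-ranking handler of Table 4.1), the following Python fragment compares two ordinal codes with the Spearman rank correlation; the helper names ordinal_code and augmented_similarity are invented for this example.

import numpy as np
from scipy.stats import spearmanr

def ordinal_code(descriptor_responses):
    # Turn a vector of responses into an ordinal code (the rank of each entry).
    responses = np.asarray(descriptor_responses, dtype=float)
    return np.argsort(np.argsort(responses))

def augmented_similarity(code_a, code_b):
    # Spearman correlation between two ordinal codes; 1.0 means identical ranking.
    rho, _ = spearmanr(code_a, code_b)
    return rho

# Toy usage with hypothetical response vectors from two image patch sets.
a = ordinal_code([0.91, 0.12, 0.55, 0.33])
b = ordinal_code([0.88, 0.10, 0.60, 0.30])
print(augmented_similarity(a, b))   # close to 1.0 for consistent orderings

Note that spearmanr ranks its inputs internally, so it could be applied to the raw responses directly; the explicit ordinal_code step only makes the ranking visible.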
Next, for the feature matching part, methods that search exhaustively in a high-dimensional image feature space are computationally expensive and unsuitable for real-time applications, while hierarchical approaches frequently encounter the "curse of dimensionality". To tackle this problem, we develop an effective approach, Fast Filtering Vector Approximation (FFVA), a nearest neighbor search technique that rapidly indexes and locates the most similar matches, efficiently matching against a very large high-dimensional database of image features in real time (Chapter 5).

In one of our experiments, the above system is used to build an augmented reality application for museum exhibits, using natural features in place of calibrated fiducials (Chapter 6). The ability to augment and annotate exhibits, such as paintings, with virtual content represents a potentially powerful tool for museum curators. By presenting visitors with rich, interactive content, an augmented museum can add a dynamic element to otherwise static content. We demonstrate that MVKP's real-time performance and robustness to lighting and viewpoint changes make it ideal for this kind of application. After an exhibit is identified, related information is retrieved from a remote server and displayed as virtual content. Experimental results in a real-world museum demonstrate the effectiveness of the proposed approach.

Figure 1.1, Image Matching Result: (a) an example of how a museum visitor might use our augmented exhibit system on a cell phone; (b) one screenshot from our real museum test.

In recent years, there has been an increasing awareness of the growing need to fuse range sensing data with traditional digital photography. For example, the photorealistic modeling of urban scenes and the UAV (unmanned aerial vehicle) localization problem using range data from airborne or ground laser scanners require the registration of 3D range data onto aerial or ground 2D images for texture mapping or region recognition purposes. In the medical image processing domain, there has been a long-standing concern about how to automatically align CT (Computed Tomography) or MR (Magnetic Resonance) images with optical camera images. Traditional image matching approaches based on local texture analysis cannot be directly adapted to these tasks, essentially because, in contrast to optical sensors, no texture information is available from range sensors. In this thesis, we also explore the adaptation of feature matching approaches to the different-sensor problem (Chapter 7). In particular, an automatic system is developed to robustly and efficiently register 2D aerial photography onto 3D LiDAR data. It is based on matching regions of interest (ROIs) from 2D camera images to the depth images of the 3D range data and is able to handle orientation, scale and location differences between the inputs without any assistance from positioning hardware (Figure 1.2(a)).
Besides invariant to rotation and scale changes, experimental results also indicate that the proposed approach can tolerate large perspective distortion and is able to handle noisy and extremely low resolution inputs. The proposed 2D-3D registration system was also successfully applied to emerging popular applications such as photorealistic 3D reconstruction of urban scenes (figure 1.2(b)). (a) (b) Figure 1.2, 2D-3D Registration Result: (a) initial correspondences from LiDAR data onto 2D aerial photography generated by our automatic 2D-3D registration system; (b) application result: photorealistic 3D urban scene reconstruction 5 6 Chapter II: Related Works Among the image matching approaches based on local features, early works mostly focused on the information provided by one single view of the object. The pioneer work of Schmid and Mohr [78] introduced rotationally invariant descriptors for local image patches based on local greylevel invariants, which are used for indexing 2D local grayscale patterns. The proposed method’s robust performance against challenging factors such as occlusion and rotation brought public attentions to local features. The ground-breaking work of D.G. Lowe [60] demonstrated that rotation as well as scale invariance can be achieved by first using difference-of-Gaussian function to detect stable interest points, then construct the local region descriptor using assigned orientation and several histograms. The proposed Scale Invariant Feature Transform (SIFT) descriptor constructs orientation histograms using local gradient directions. Orientation invariance is achieved by assign one or multiple dominant orientations to each selected interest point. Finally, each interest point is described using histograms of its local image gradients measured at the selected scale with the assigned orientations. Besides impressive performance considering scale, rotation, lighting and distortion invariance, one major strength of the proposed SIFT descriptor is its feature distinctiveness, which is an important property considering feature retrieval from a large database. Its feature distinctiveness, however, primarily comes from its high-dimensional (128 dimensions for the original SIFT descriptor) feature vectors. As a direct result, efficient searching technique handling database of high dimensional vectors is crucial. In chapter 3 and 4 of this thesis, we introduce our novel methods to reduce feature number and dimensionality respectively, as well as improving features distinctiveness without increasing their dimensionalities and an 7 efficient nearest neighbor search technique (chapter 5) specially designed for high dimensional features of image matching systems. Other limitations include: first, as a histogram based approach, the computational costs for both the orientation assignment and final feature description are very high. Even on today’s workstations, the detection plus description part and the final matching part each take a few seconds for typical images. Chapter 3 of this thesis proposes our novel real-time image matching system, which is able to robustly establish the correspondences for around 20 image pairs independently in one second and produce matching results with comparable accuracy as other state-of-the-art image matching systems. Second, although robust to various geometric distortions, when it comes 3D non-planar object with large viewing direction changes, SIFT descriptors have very limited power. 
Last, because SIFT and its many variations describe features based on local texture analysis, they have almost no power for registering data from different sensors when texture information has different meanings from one sensor to the other, or when there is no consistent texture available at all. In contrast, our 2D-3D registration approach (Chapter 7) is able to match such data robustly, handling location, rotation and scale changes, and even geometric distortions.

SIFT has had a significant influence on later works. For example, Ke and Sukthankar [48] applied Principal Components Analysis (PCA) to an image gradient patch in order to reduce the descriptor's dimensionality, instead of using smoothed weighted histograms as in the original SIFT. However, the PCA projection kernels need to be trained before the matching process, which is time-consuming and may not be practical for certain applications. Moreover, the trained kernels can reasonably be expected to work well only for categories of images similar to the training images, so generalization is a problem. Last, with the feature vector dimensionality reduced from 128 to around 36, the features' distinctiveness, which is one major advantage of the original SIFT, is naturally compromised. Other variations of SIFT include the following. Bay et al. proposed a fast implementation of SIFT using integral images [3]. The proposed Speeded Up Robust Feature (SURF) is built upon other experimentally verified and successful detectors and descriptors, especially SIFT, but simplifies those steps to the essentials. The new descriptors can be constructed and matched more efficiently. However, our own experiments using the published binary also indicate a considerable accuracy drop when compared with the original SIFT. Instead of using circular regions, Deng et al. [25] extended the descriptor to affine-invariant elliptical bins and provided an efficient algorithm to sample the gradient within these regions. The resulting descriptors are more robust to geometric (especially affine) distortions because global context information is implicitly combined into the local feature matching process, although extracting pixel information from such irregular local regions is even more computationally expensive than the original SIFT.

To achieve viewpoint invariance, another line of research combines the information of multiple views and trains the system in an offline stage, so that it learns the main characteristics of the same object under different viewing conditions. Consequently, the online matching process can be much faster, even real-time. Concerning the data source of the multiple views, some works use affine transformations to synthesize a number of views from one single input view ([55], [54], [14] and [32]), while others take real images captured by a camera as input ([69], [102]). Our real-time image matching system uses a number of synthetic views generated from a single image because of the ground-truth correspondences among training images that this provides. We use a kernel projection scheme to extract the significant components contained in the synthesized images and to construct compact feature descriptors. Recent representative works on multiview-based image matching include the following. Lepetit et al. [55] treat the multiple-view point matching problem as a classification problem. They synthesize small patches of each individual feature point to serve as training input. PCA and k-means algorithms are applied to those patches to provide local descriptors.
After the offline training stage, the same keypoint detector and PCA projection matrix are used on the query image patches. Eventually, the feature vectors of the training and query images are matched by a simple exhaustive linear scan. In their subsequent work [54], classification trees are used to replace PCA and k-means as well as the final nearest neighbour search. The branching of the trees is decided by simple comparisons of nearby intensity values, and the final classification is determined by statistical analysis at the leaf nodes. Their online matching process is fast enough for real-time applications. However, the forest construction is very slow (10-15 minutes), and it has also been pointed out that the actual performance can vary considerably depending on viewpoint and illumination conditions when compared with classical histogram-based approaches [14]. Further experimental results also indicate that a huge number of training views is essential to the success of their real-time image matching system using randomized trees. The details of this problem and our solution are discussed in Chapter III. Ozuysal et al. [69] propose an extension of randomized trees (RT) focusing on non-planar object tracking without a 3D model. With the help of the RT structure, features can be updated and selected dynamically, a process called "harvesting". The training views are obtained by moving the object slowly in front of the camera. Tuzel et al. [90] proposed the covariance of features as a new descriptor, computed from a set of integral images; a suitable distance metric for the new descriptor is also given. Boffy et al. [14] use additional information about the appearance of the object under the actual viewing condition to update the classification trees at run time. They also use specially designed, spatially distributed trees to enhance reliability and speed.

Chapter III: Real-Time Image Matching System

This chapter presents our real-time image matching system based on multiple-view training and a projection-rejection scheme. The projection and rejection scheme used in our system has long been proven efficient for pattern recognition and general classification problems. Various projection vectors have been studied. Many previous works emphasized the discrimination abilities of the projection kernels ([28], [50]), while Y. Hel-Or et al. [41] argued that besides discrimination power, it is also important to choose projection kernels that are "very fast to apply". For this purpose, they chose Walsh-Hadamard (WH) kernels and achieved a speed enhancement of almost two orders of magnitude. Furthermore, experimental results indicate that their projection kernel method is robust to noise and lighting changes. However, as a fast "window matching" technique, the method cannot handle the geometric distortion brought by viewpoint changes. In order to achieve invariance and tolerance to geometric distortions, our real-time image matching system integrates a training stage based on generated synthetic views of the object. The two robust and efficient methods cooperate, forming the core of our Multiple View Kernel Projection (MVKP) method.

3.1 The Walsh-Hadamard Kernels Projection

The projection scheme in our MVKP method is based on WH kernels, which are a special case of Gray-Code Kernels [9] and of general projection kernels in Euclidean space.

3.1.1 General projections in Euclidean space

Suppose there are two sets of image patches of size k × k.
Each patch can be directly expressed as a k²-dimensional vector. Intuitively, the similarity between two patches can be measured as the Euclidean distance between the two corresponding vectors. It is easy to see that such a similarity is impractical to compute, especially when the number of patches to be measured is large. The projection strategy is to project the original vectors onto a smaller set of projection kernels, which are fast to compute and still maintain the distance relationship.

Assume b_1, b_2, b_3, … are orthonormal projection bases in the k²-dimensional Euclidean space (Figure 3.1(a)). P is a point in the k²-dimensional space with projected components v_1, v_2, v_3, … respectively, and define the scalars c_i = v_i^T v_i. Let d(P) represent the squared Euclidean distance from P to the origin O; then we have:

d(P) = Σ_{i=1}^{k²} c_i

It follows trivially from the above equation, or from the Cauchy-Schwarz inequality since the Euclidean distance is a norm, that lower bounds of d(P) can be calculated using a subset of the projection scalars. The layers of lower bounds can be expressed as:

LB_n(P) = Σ_{i=1}^{n} c_i,   LB_1(P) ≤ LB_2(P) ≤ … ≤ LB_{k²}(P) = d(P)

As more projection kernels are involved in the calculation, the lower bound becomes tighter; when all k² kernels are involved, the lower bound becomes the actual squared Euclidean distance. For the projection scheme to be efficient, two factors need to be considered: on the one hand, the projection bases should be ordered such that the lower bound becomes tight after only a small number of projections; on the other hand, equally important is the requirement that "the kernels should be efficient to apply, enabling real-time performance" [41].

3.1.2 The Walsh-Hadamard kernel

The WH kernel is a special case of the Gray-Code projection kernels satisfying the above two requirements. First, the WH projection kernels are very efficient to generate and apply. One-dimensional kernels can be generated using a binary tree in which consecutive kernels are α-related [9]. In the context of 2D image processing, two-dimensional kernels can be generated as the outer product of one-dimensional kernels. All the coordinates of the WH kernels' basis vectors are either +1 or -1. Consequently, projection onto WH kernels involves only as many additions or subtractions as the dimensionality, which can be performed very fast.

Figure 3.1, Kernel Projection: (a) general projection; (b) two-dimensional 4×4 WH kernels in increasing order of frequency. Blank represents the value "+1" and shading represents "-1".

Second, when the kernels are ordered according to increasing frequency of sign changes, experimental results show that a tight lower bound can be achieved using only a small number of kernels. Thus, we can greatly reduce the complexity of the similarity computation while still capturing the major differences between feature vectors. Figure 3.1(b) shows a list of two-dimensional WH kernels in increasing order of frequency. [9] introduced an efficient algorithm to compute the ordering of the kernels, which captures the increase in spatial frequency.
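To make the projection and lower-bound idea concrete, the following Python sketch builds the 2D WH kernels by brute force from a Hadamard matrix and orders them by sign changes, rather than using the efficient Gray-Code recursion of [9]; it is an illustration under those assumptions, not the thesis code, and the helper names are invented here.

import numpy as np
from scipy.linalg import hadamard   # natural-order Hadamard matrix

def wh_kernels_2d(k):
    # k*k two-dimensional Walsh-Hadamard kernels (entries +1/-1), ordered by
    # increasing number of sign changes; k must be a power of two.
    H = hadamard(k)                                   # rows are 1-D kernels
    sign_changes = [np.count_nonzero(np.diff(row) != 0) for row in H]
    walsh_1d = H[np.argsort(sign_changes)]            # sequency order
    kernels = [np.outer(u, v) for u in walsh_1d for v in walsh_1d]
    def seq2d(K):                                     # total 2-D sign changes
        return (np.count_nonzero(np.diff(K, axis=0) != 0)
                + np.count_nonzero(np.diff(K, axis=1) != 0))
    return sorted(kernels, key=seq2d)

def lower_bounds(patch_a, patch_b, kernels, n_used):
    # Accumulate LB_n(P) for the difference vector P = a - b using the first
    # n_used kernels; each c_i is the squared, orthonormally scaled projection.
    diff = (patch_a - patch_b).astype(float).ravel()
    lb, bounds = 0.0, []
    for K in kernels[:n_used]:
        proj = diff @ K.ravel() / np.linalg.norm(K)   # divide by Frobenius norm
        lb += proj * proj                             # c_i = v_i^T v_i
        bounds.append(lb)
    return bounds                                     # non-decreasing, <= ||a-b||^2

# Toy usage: the partial sums approach the exact squared Euclidean distance.
rng = np.random.default_rng(0)
a, b = rng.integers(0, 256, (2, 8, 8))
ks = wh_kernels_2d(8)
print(lower_bounds(a, b, ks, 20)[-1], np.sum((a - b) ** 2.0))

Because the scaled kernels form an orthonormal basis, using all k² of them reproduces the exact squared distance, which is exactly the layered lower-bound behaviour described above.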
3.2 MVKP for Real-time Feature Matching

Kernel projection using WH kernels is able to measure the similarity between two large sets of image patterns in real time; however, it cannot handle the geometric variation caused by viewing condition changes. In order to achieve invariance and tolerance to geometric distortions, we combine the WH kernel projection method with a multiple-view training stage. The training stage is aimed at providing the system with additional information concerning affine distortions, so that the same object can still be matched under different viewing angles.

Figure 3.2, Overview of the Whole MVKP System.

3.2.1 Offline training stage

During the offline training stage, the MVKP method takes one object image as input, smoothes it using a Gaussian filter, generates 50-100 synthetic training images from it, and then describes the main characteristics of each selected object location. The output of the training stage is a set of feature vectors, subsets of which correspond to each selected object location.

The method first synthesizes a number of training views of the input object image using affine transformations. A general affine transformation can be expressed as:

x' = A x + t,   A = R_θ · R_{−φ} · diag(λ1, λ2) · R_φ

where R_θ and R_φ are rotation matrices and t is a translation with components t1 and t2. The matrix A corresponds to a rotation of θ first, followed by a rotation of −φ, then scale changes of λ1 and λ2 in the horizontal and vertical directions respectively; at last, the image is rotated back by φ. The six affine transformation parameters are generated randomly to cover the whole parameter space of rotation and shear angles. A typical set of parameters is: θ ∈ [−π, π], φ ∈ [−π/2, π/2], λ1, λ2 ∈ [0.4, 1.6], t1, t2 = 0, 1, 2, or 3.

Figure 3.3, Major Components of the Training Stage.

Interest points are identified by searching for local maxima of eigenvalues within 3×3 local patches; patches whose minimum eigenvalue is smaller than a threshold are discarded. The detector is designed to guarantee that no feature point is too close (for example, within 3 pixels) to another one; otherwise, two features might have very similar descriptions and consequently fail the distance-ratio criterion. After all the keypoints in all the synthetic views are detected, we can tell how many of them belong to the same physical location in the object image, since all the affine transformations are synthesized. It is assumed that the object locations that appear more often in the synthetic views have a higher probability of being detected in a query image containing the same object. Therefore, we select the 100-200 most commonly appearing object locations for future feature matching use. Each object location is represented as a linked list containing its coordinates in the corresponding views.

Within each synthetic view, we extract a 32×32 patch around each detected and selected feature point. Because the projection of the image patch onto the first WH kernel conveniently gives its DC value, robustness to lighting changes can be enhanced by simply disregarding the first projection kernel. In addition, we normalize (translate and rescale) each patch's intensity values to the same range in order to further enhance the performance under different lighting conditions. The lists of extracted image patches contain the information about the various possible appearances of all feature locations. The last step of the training stage is to describe the extracted patches as feature vectors. Each patch's intensity values, forming a very-high-dimensional vector, are processed by the kernel projection method so that the final descriptors belonging to the same object location are more effective and compact, and capture the information from the various viewing situations. WH kernels are used for the kernel projection; in our experiments, we found that typically the first 20 WH kernels are enough for reliable feature description. After kernel projection, the k-means method is used to further reduce the size of the feature set. For all the feature vectors representing the same object location, 10-20 clusters are formed, and the center vector of each cluster is used to represent that cluster.
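The sketch below illustrates the training-stage computations just described under stated assumptions: it samples one random affine transform of the form A = R_θ R_{−φ} diag(λ1, λ2) R_φ, turns a 32×32 patch into a compact descriptor by intensity normalization plus projection onto the first WH kernels with the DC kernel dropped, and reduces a view track with k-means. It is not the thesis code; it reuses the hypothetical wh_kernels_2d() helper from the earlier sketch, and the actual warping of the image is left to any resampling routine.

import numpy as np
from scipy.cluster.vq import kmeans2

def rot(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

def random_affine(rng):
    # A = R_theta R_{-phi} diag(l1, l2) R_phi, plus an integer translation.
    theta = rng.uniform(-np.pi, np.pi)
    phi = rng.uniform(-np.pi / 2, np.pi / 2)
    l1, l2 = rng.uniform(0.4, 1.6, size=2)
    A = rot(theta) @ rot(-phi) @ np.diag([l1, l2]) @ rot(phi)
    t = rng.integers(0, 4, size=2)            # t1, t2 in {0, 1, 2, 3}
    return A, t                                # warp the view with any resampler

def describe_patch(patch, kernels, n_kernels=20):
    # Normalize intensities to a common range, skip the DC kernel, and keep
    # the next n_kernels projections as the compact descriptor.
    p = patch.astype(float).ravel()
    p = (p - p.min()) / max(p.max() - p.min(), 1e-6)
    return np.array([p @ K.ravel() / np.linalg.norm(K)
                     for K in kernels[1:n_kernels + 1]])

def reduce_view_track(descriptors, n_clusters=15):
    # Represent one object location by 10-20 k-means cluster centers.
    centers, _ = kmeans2(np.asarray(descriptors), n_clusters, minit='points')
    return centers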
3.2.2 Feature set construction for query image

Given a query image containing the same object, our goal is to find correspondences between the query image and the object image. After the offline training stage, we have lists of object feature vectors, each corresponding to a selected object location of interest. Now we need to construct a similar feature set for the query image.

Figure 3.4, Major Components of Online Query Stage.

After the query image is read and smoothed, the same feature point detector is applied. Because this is an online stage that should be as fast as possible, we only select a number of the "strongest" feature points reported by the detector. Let x1 be the number of selected stable object locations in the training stage, x2 the number of selected feature points in this stage, and y the number of final reported correspondences (NFRC); then we have:

y ≤ min(x1, x2)

Typically, x2 is around 500; assuming x1 is 100, the NFRC will be no larger than 100. After keypoint detection, the intensity values of the image patch around each keypoint give us the original vector description. These original vectors are normalized to the same intensity range to enhance robustness against lighting changes. The normalized vectors are projected onto a number of WH kernels (the same number as in the training stage), resulting in compact final descriptors. As the first part of the online query process, the feature set construction is comparatively fast to execute. The most time-consuming part is actually finding the correspondences between the feature sets.

3.2.3 Establishing feature correspondences

Given the feature descriptors covering various viewing conditions for each object location and the feature descriptors of the query image, the final task is to establish correct correspondences between the two feature sets efficiently. The rejection scheme in [41] cannot be directly adapted to our problem because it requires all the query image patches to be continuously distributed. Thus we use a different technique based on lower-bound rejections to accomplish the task. We employ the Euclidean distance as the similarity metric due to its simplicity and low computational cost. Various Nearest Neighbor (NN) search techniques have been studied in this context. The authors of [55] use a linear scan, while in [5] an approximate NN-search method over a traditional kd-tree structure is introduced in order to efficiently index their high-dimensional (128-D) feature vectors. To decide on the proper NN-search technique for the MVKP method, we first investigated the properties of the features generated by WH kernel projection. The following is an example of three feature vectors generated by kernel projection at the training stage:

Feature vector #1: 22875, 2962, -1843, -935, 1037…
Feature vector #2: 17886, -2797, 1175, 315, -1008…
Feature vector #3: 19568, -3567, 1338, 347, 1572…

The dimensionality of our feature vectors typically ranges from 20 to 100 (depending on how many kernels are used), while their magnitude is comparatively large.
It can also be seen from the experiments that our feature vectors are more sparsely distributed in the space than the feature vectors of many other methods such as [55], where features are more likely to cluster together. Our feature vectors are more distinctive and farther away from one another. Accordingly, we propose the FFVA method to perform the NN-search, which is described in detail in Chapter V. FFVA was proposed for fast indexing and matching of high-dimensional features in large databases. It has proved to be efficient when dealing with large-magnitude, semi-uniformly and sparsely distributed, high-dimensional feature sets, providing search accuracy close to an exact linear scan. Experimental results showed that when the dimensionality is within 100 and the number of database vectors is within 4000, FFVA, as an "exact" search technique, demonstrates faster query speed even when compared with tree-based "approximate" NN-search methods. Like all vector-approximation-based NN-search techniques with static partition lengths, FFVA works better when the feature vectors are semi-uniformly or sparsely distributed, in which case the majority of the data never has the chance to enter the second (more expensive) level. The vector approximation strategy, which provides compact representations of the original vectors, is especially efficient for large-magnitude vectors.

After the NN-search using FFVA, there are two additional layers to further refine the matching results. The first layer removes false alarms from complex backgrounds. We use the distance ratio as the evaluation criterion, that is, the second closest neighbor should be significantly farther away than the closest one [60]. Fixed threshold values of α = 0.5~0.9 were used in some of our experiments. Another way to determine the thresholds for the distance ratio, and even for the RANSAC process, is to dynamically adjust the values at run time. Take the threshold of the distance-ratio criterion for example: first, we set a global goal for how many correspondences we would like to keep after applying the distance-ratio filter. At run time, we periodically (10 times in our experiments) check the number of correspondences our method has found so far, compare it with the global goal, and adjust the threshold accordingly. Dynamic thresholds are particularly useful when the category of query images changes dramatically at run time. Only those correspondences that pass the distance-ratio layer enter the last layer, which is a consistency check using RANSAC. Experimental results show that the last layer is only necessary for challenging real-image tests with complex backgrounds; for the relatively easy synthetic-image tests, our method is reliable enough to skip the consistency-check layer.
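The following sketch illustrates the two refinement ideas just described, the distance-ratio test and a dynamically adjusted ratio threshold driven by a global target. It is an assumption-laden illustration rather than the thesis implementation: a brute-force linear scan stands in for FFVA, and the function names and step size are invented here.

import numpy as np

def ratio_test(query_desc, db_desc, alpha=0.7):
    # Return (query_idx, db_idx) pairs whose nearest / second-nearest
    # Euclidean distance ratio is below alpha.
    matches = []
    for qi, q in enumerate(query_desc):
        d = np.linalg.norm(db_desc - q, axis=1)
        first, second = np.argsort(d)[:2]
        if d[first] < alpha * d[second]:
            matches.append((qi, first))
    return matches

def adaptive_alpha(alpha, kept_so_far, checked_so_far, total, goal,
                   step=0.05, lo=0.5, hi=0.9):
    # Nudge the ratio threshold toward a global goal of kept correspondences;
    # intended to be called periodically (e.g., 10 times) during a query.
    expected = goal * checked_so_far / max(total, 1)
    if kept_so_far < expected:
        alpha = min(hi, alpha + step)    # too strict: loosen
    elif kept_so_far > expected:
        alpha = max(lo, alpha - step)    # too lenient: tighten
    return alpha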
Feature selection has been widely used to reduce computation time or improve accuracy. The 22 recent work of Fan et al. [29] used multi-class SVM to select the most informative features for face recognition. According to the statistical relationship between the two tasks, feature selection and multi-class classification, they are integrated into a single consistent framework and effectively achieve the goal of discriminative feature selection. The proposed SVM-DFS can speed up classification without degrading the matching accuracy. Mahamud and Hebert [62] proposed discriminative object parts selection and used conditional risks as the distance measure in nearest neighbor search. The optimal distance measure was modeled directly by a linear logistic model that combined more elementary distance measures associated with simple feature spaces. Dorko and Schimid [27] introduced a method for selecting most discriminative object-part classifiers based on likelihood ratio and mutual information. The importance of feature selection was illustrated through car detection tasks with significant variations in viewing conditions. The paper also compared different techniques showing that likelihood is well suited for object recognition and mutual information for focus of attention mechanisms. None of these approaches focuses on invariance improvement or utilize additional information from specifically designed and labeled training views. The following sub-sections provide the details of our feature selection approach to reduce the number of necessary features and enhance the invariance performance of our real-time image matching system. The pre-processing step (section 3.3.1) is a necessary preparation for our feature selection framework. Details of the proposed feature selection technique are provided in section 3.3.2. Selected feature points are matched and tested using our MVKP (section 2) method. 3.3.1 Pre-processing Stage Both MVKP and the proposed feature selection method are based on multiple view frameworks. The first task is to obtain multiple training views of the interested object. Similar like the training stage of MVKP, we use training views synthesized by affine transformation, which can provide ground truth correspondences during the training. In order to select features that are distinctive and invariant, for example, under various rotations, we independently generate a small number of rotation-dominant synthesized views during the pre-processing stage, shown in figure 3.5. These views are for feature selection only and will not be directly used to produce the final descriptors. Figure 3.5, Rotation Dominant Views. An interest point detector will then detect potential feature points for all the generated training views. The detector searches for local maxima of eigenvalues within 3 by 3 patches and guarantees that two feature points should not be too close. The outputs of this stage are synthesized training views with potential feature points detected and stored. 3.3.2 Feature Selection Our method’s overview is given in figure 3.6. First, sets of local patches belonging to the 23 same physical location (view tracks) are constructed and their corresponding feature descriptors computed. Then we compute three ranking scores for each view track based on distinctiveness, invariance, and stability. Finally, feature points with high weighted ranking scores are selected to build the object feature database. The first step of our feature selection method is View Set Construction. 
Suppose N V and N F are the numbers of training views and detected feature points. Let V k (k=0~N V ), F i (i=1~N F ) represent training view and feature descriptor respectively. Because the affine transformations from V 0 (the input object image) to all the synthesized training views are known, we are able to identify subsets of F i belonging to the same physical locations. Such a subset is called a view track of the object, represented by T j (j=1~N T ). Formally we can define: All features are described by local patches around feature points and then projected onto space with much lower dimensionality (represented by N k ) using WH kernels. The outputs of view set construction are the compact feature representations and δ i,j values for all feature points and for all view tracks. After all the view tracks are constructed, each corresponding to a physical location on the object, our goal is to select those view tracks that are distinctive, invariant and stable. The stability is measured by the feature repeat rate across all the training views, equivalent to the size of view tracks. The stability score for view track j is defined as: 24 , 1 F N ji i SS j δ = = ∑ Figure 3.6, Components Overview of Feature Selection Because the interest point detector we use is essentially a corner detector, and thus robust to rotations, feature points with high stability scores are generally robust to rotations. Our feature selection method will first eliminate those feature points with stability scores lower than a certain threshold (L SS ). In addition, we propose other criteria by utilizing further information provided by view tracks. Features that are not distinctive from one view track to another are more likely to come from small repeated elements of the scene, which causes confusion to image matching systems. Features that have large variance within one view track have a high possibility to come from boundaries of a region, forming more diverse descriptors from different view-pointes and 25 potentially compromising the system’s robustness. Therefore, the goal is to select those feature points that are distinctive and invariant during multiple view training and to improve the geometric invariance of the whole system. The “distinctiveness” is measured by the expected value of pair-wise distances between view track mean vectors. We measure the “variance” of each view track by the expected value of single dimension variances. Thus, Mean and Variance Calculation is needed for evaluating feature ranking scores. To measure the distinctiveness of one view track T j , we first compute the mean vector of all the feature vectors belonging to T j : , 1 , 1 F F N ij i i j N ij i F M δ δ = = = ∑ ∑ Let Dist j,j’ represent the mean vector distance between view track T j and T j’ . The distinctiveness score for view track T j is defined as: ,' '1 1 T N jj j T j DSDi N = = ∑ st We perform variance computation by independently handling each dimension of one view track’s all feature vectors, which provides a N k -dimensional variance vector for each view track: 2 ,, , 1 , , 1 () F F N ij il jl i jl N ij i FM VV δ δ = = − = ∑ ∑ , 26 where l = 1~ N k is the dimension of vectors. The variance score for view track T j is defined as the expected value of all VV j ‘s components: , 1 1 K N jj l K VS VV N = = ∑l Good features should have high distinctiveness and low variance. Therefore, we combine distinctiveness score and variance score together. 
The raw ranking score (RRS) for view track T_j is defined as:

RRS_j = \frac{DS_j}{VS_j}

Substituting the definitions of M_j, DS_j, VV_{j,l} and VS_j above, the unified formula for the raw ranking score expressed in terms of the original feature descriptors is:

RRS_j = \frac{\dfrac{1}{N_T} \sum_{j'=1}^{N_T} \left\| \dfrac{\sum_{i=1}^{N_F} \delta_{i,j} F_i}{\sum_{i=1}^{N_F} \delta_{i,j}} - \dfrac{\sum_{i=1}^{N_F} \delta_{i,j'} F_i}{\sum_{i=1}^{N_F} \delta_{i,j'}} \right\|}{\dfrac{1}{N_K} \sum_{l=1}^{N_K} \dfrac{\sum_{i=1}^{N_F} \delta_{i,j} (F_{i,l} - M_{j,l})^2}{\sum_{i=1}^{N_F} \delta_{i,j}}}

Because we initially eliminate features with stability score SS_j less than L_SS, the SS_j values range from L_SS to N_V. In order to combine the raw ranking score with the stability score, we rescale RRS into the same range:

RRS_{rescale,j} = L_{SS} + \frac{RRS_j - min\_rrs}{max\_rrs - min\_rrs} \, (N_V - L_{SS}),

where min_rrs and max_rrs are the minimum and maximum RRS values over all view tracks. Finally, we use a weight parameter α to combine the rescaled ranking score with the stability score and output the final ranking score (FRS):

FRS_j = \alpha \cdot RRS_{rescale,j} + (1 - \alpha) \cdot SS_j,

where α = 0~1. α = 0 corresponds to the traditional criterion in which only repeatability is considered, while α = 1 is the extreme case using only our raw ranking scores. Table 3.1 illustrates the step-by-step results for one sample view track with α = 0.5.

Table 3.1, Feature Selection Sample Results for One View Set.

Given all the ranking scores, the final feature selection step is to select object features (corresponding to view tracks) with scores higher than a threshold, or to select a certain percentage of the highest-scoring features; both cases are tested in our experiments. We use the selected feature points to construct an object feature database and apply k-means to further reduce the view track size. After the offline training, given a query image, the online query stage runs the same interest point detector and selects a number of "strongest" feature points reported by the detector. There is no feature selection using ranking scores during the online query; all query image features are matched against the object feature database.

3.4 Experimental Results and Evaluations

The proposed image matching system has been tested using synthetic and real images as well as combinations of both. Real images are captured using a DSLR camera with high light-sensitivity settings (ISO = 800~1600) and in-camera noise reduction turned off, which gives the input pictures hardware-generated (CCD) noise points with random positions and intensities. To obtain the synthetic images, we either performed synthetic viewpoint and lighting changes on the real images or downloaded computer-generated images from the Internet. We compared the MVKP method with the SIFT method, the classification method using PCA [55] (denoted CPCA) and the randomized-tree-based method [54] (denoted CRTR). In all the experiments, the number of generated training views is 100 for MVKP and 1000 for CRTR (the method's default), the number of selected object feature locations is 200, the maximum number of keypoints returned by the detector is 500, the image patch size is 32 by 32, and the number of k-means kernels representing each object location is 20. The test computer is a desktop PC with a 1.4 GHz CPU.

3.4.1 Effect of projection kernels

Figure 3.7, Effect of Kernels: lower-bound distance obtained in terms of the number of basis vectors used; WH projection kernels versus the standard basis.

To evaluate the effect of the projection kernels, we use two original 256-dimensional feature vectors (one from the training stage and the other from the query image) and project them onto the first 5, 10, …, 125 WH kernels respectively.
Each time, we calculate a lower bound of the squared Euclidean distance (around 3×10^8 for this test) between the projected vectors. Figure 3.7 demonstrates the kernels' effectiveness. The maximum coordinate of the vertical axis roughly corresponds to the exact Euclidean distance we are trying to approximate. As can be observed from the figure, convergence is much faster with WH kernel projection: compared with the standard basis vectors, the projections onto the first 20~50 WH kernels already capture the majority of the difference between the two vectors. To further test the effectiveness of WH kernel projection in the image matching domain, we also combine the scale-invariant feature detection of the SIFT method with the WH projection kernel technique to produce a highly efficient feature representation [87]. The produced feature descriptors are highly compact, do not require any pre-training step, and show comparable performance to other state-of-the-art methods; they are therefore particularly suitable for mobile multimedia applications. Figure 3.8 shows the major steps for generating local image descriptors by combining the SIFT detector and WH kernel projection. The approach selects multi-scale salient features/regions with the scale-invariant detector [60], in which 3D peaks are detected in a DoG scale-space, and the dominant orientation and scale are computed and assigned to each detected feature. Next, under the assumption of local planarity, a new canonical view of the local patch (with fixed size and scale) is produced by warping the image according to the feature's dominant orientation and scale, centered at the location provided by the scale-invariant detector. Finally, we apply WH kernel projection to the canonical view of the local patch to produce the final highly compact feature descriptor. The new feature descriptor is named Compact Descriptor through Invariant Kernel Projection (CDIKP); we use only 20 dimensions in our experiments (denoted CDIKP-20).

Figure 3.8, Procedures for Generating CDIKP Descriptors.

The proposed CDIKP descriptors are tested using the standard INRIA dataset [66]. These are images of real scenes with known ground truth for image correspondences. Figure 3.9 shows the results for several cases: (a) rotation and scale (Boat), (b) viewpoint changes (Wall), (c) image blur (Bikes), and (d) lighting changes (Leuven). The competitors are the original SIFT [60] with 128-D descriptors and the well-known PCA-SIFT [48] with 36-D descriptors. These results show that the proposed CDIKP descriptor demonstrates comparable performance, sometimes even outperforming the original SIFT, while being more compact and efficient to compute, since the feature vectors are low in dimensionality and need no pre-training stage.

Figure 3.9, Kernel Projection Evaluation with Standard Test Dataset.
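For illustration, the projection and lower-bound computation evaluated in this subsection can be sketched as follows. This is a minimal sketch under our own assumptions, not the thesis implementation: it uses scipy's Hadamard matrix in natural order (an actual system would more likely traverse the kernels in sequency/low-frequency order), flattens a 32×32 patch, and relies on the orthonormality of the scaled basis so that the distance between partial projections never exceeds the exact squared Euclidean distance:

import numpy as np
from scipy.linalg import hadamard

def wh_project(patch, num_kernels):
    # Project a flattened square patch onto the first num_kernels WH kernels.
    # Rows are scaled so the full basis is orthonormal (Parseval's relation holds).
    x = patch.astype(np.float64).ravel()
    n = x.size                           # must be a power of two, e.g. 32*32 = 1024
    H = hadamard(n) / np.sqrt(n)         # Walsh-Hadamard basis, natural order (assumption)
    return H[:num_kernels] @ x

def lower_bound_sq_dist(proj_a, proj_b):
    # Squared distance between partial projections: a lower bound of the exact
    # squared Euclidean distance between the original patches.
    d = proj_a - proj_b
    return float(d @ d)

# toy usage: the bound tightens as more kernels are used
a, b = np.random.rand(32, 32), np.random.rand(32, 32)
exact = float(((a - b).ravel() ** 2).sum())
for m in (5, 20, 50, 125):
    print(m, lower_bound_sq_dist(wh_project(a, m), wh_project(b, m)), exact)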
3.4.2 Feature distinctiveness

This experiment shows that the feature vectors generated from WH kernel projections are more sparsely distributed in the space than those of the CPCA method. In other words, our feature vectors are more distinctive from one another, resulting in more reliable feature matching and allowing vector-approximation-based NN-search techniques such as FFVA to work more efficiently. Figure 3.10 shows, for the CPCA and MVKP methods, the number of reported correspondences under the same distance ratio α = 0.5. For the same feature set size, MVKP reports a much higher number of correspondences, indicating that MVKP's feature vectors are less likely to cluster together than CPCA's. Therefore, it is easier to find a distinctive match in MVKP's feature space.

Figure 3.10, Feature Distinctiveness Test: number of returned correspondences after the distance ratio (α = 0.5) layer.

3.4.3 Matching accuracy and robustness

For the synthetic image tests, even without a consistency check step like RANSAC, the proposed MVKP method is able to return a large number of correct matches in real time. The really challenging experiments are those using real images or combinations of real-world and computer-generated images. For the pure real image tests, pictures of the same object are captured from different view angles, under different lighting conditions, and possibly with partial occlusion. We also ran a live-demo comparison in which query images arrive in real time from a live web camera. Overall, the MVKP method (although only 100 views are used for training) shows better accuracy, a comparable number of reported correspondences, and faster query speed compared with CRTR (1000-view training). In some very difficult test cases, CRTR gives no output at all while our method still produces rather good results. The figure below shows some examples of the evaluation results. For the experiments on whole-system performance with feature selection, we compared the speed and accuracy of MVKP combined with our feature selection method (now referred to as Rotation Invariant Feature Selection, or "RIFS" for short) against several other image matching methods.

Figure 3.11, Matching Results: visualized matching results with real-time correspondences indicated by lines. (a) Synthetic images test result; (b) real images test result; (c) real and computer-generated images test result. The right-side training images are computer-generated images downloaded from the Internet. In the "Burger King" example, red (dark dashed) lines are outliers removed by the consistency check.

For the accuracy test, we synthesized challenging views of object images by affine transformation. For each feature match returned by a particular image matching method, we can determine the location of the ground-truth correspondence since the transformation is known. The average error is defined as the average Euclidean distance from all the reported matches to the ground-truth matches. Unless otherwise stated, all average errors are computed on the initial matchings before RANSAC. Among histogram and single-view based methods, we found that standard SIFT still demonstrates the best accuracy in these challenging tests, while SURF [3], as a fast SIFT-style method, shows the best speed. Among existing multiple-view based real-time methods, the BazAR (CRTR) [54] system demonstrates the best tradeoff between accuracy and speed. We use the default 1000 training views for the BazAR system and adjust its threshold so that BazAR and RIFS report a similar number of matches; we observe a 15~25% accuracy drop when changing BazAR's training views from 1000 to 100. The same constant distance ratio (0.8) is used for both RIFS and SIFT.

Figure 3.12, Accuracy Comparison: average error versus test index for RIFS, MVKP, BazAR, SIFT and RIFS(r).

Figure 3.12 shows the accuracy comparison among the four methods. For the average error before RANSAC, SIFT gives the best results, followed by RIFS.
However, the accuracy of RIFS after RANSAC (represented by RIFS(r)) easily exceeds that of SIFT and, as shown in figure 3.11, RIFS is still ten times faster.

3.4.4 Matching speed

CPCA treats feature matching as a classification problem and achieves an online matching speed five times faster than SIFT. It is fast enough for many application areas but still not real-time. The introduction of randomized trees (the CRTR method) brought the performance into the real-time range and also made it more robust. Figure 3.13 shows the results of our query speed test using large-size synthetic images. The number of feature vectors generated from the training stage is 4,000, while the number of query image feature vectors ranges from several hundred to 5,000. The original real images include an aerial image, an ad poster, a small indoor item and a complex scene; synthetic scale, rotation, shear and lighting changes are applied to these real images to generate the query images. Linear scan is used for both the CPCA and MVKP methods. Even when the simple and slow linear scan is used, MVKP's query speed is much faster than CPCA's and comparable with CRTR's. The typical training time is around 15 minutes for CRTR, around 1 minute for CPCA, and only around 30 seconds for MVKP. Analyzing the matching time composition of the MVKP method, we found that the time spent on NN-search using linear scan accounts for more than 70% of the total online matching time. When the feature set size is large or the application demands a large number of correspondences, a more efficient NN-search technique is key to whole-system performance. We choose the FFVA method because of two inherent properties of our feature vectors: large magnitude and sparse distribution. Our experimental results show a significant speed enhancement (more than double the speed) over the linear scan method. Although an approximate NN-search technique, FFVA has accuracy close to linear scan. Details can be found in chapter 6.

Figure 3.13, Query Speed Comparison: query time (1/1000 second) versus test index for CRTR, MVKP and CPCA. Eleven tests are included; the index is ordered according to the query time of MVKP. Exact linear scan is used as the NN-search component for both MVKP and CPCA.

Figure 3.14, Speed Comparison: average query time (1/1000 s) for RIFS, BazAR, SURF and SIFT.

Figure 3.14 is the average online query time comparison for the same test sets. We use the default number of trees and tree levels for the BazAR system, and we add SURF to the comparison to show the speed limitation of histogram and single-view based methods. For all four methods, the online query time is the time needed to process one frame, including: (1) feature detection and description time for the query image (SIFT and SURF need no training, so we assume the feature descriptors of the object image are ready before the test); (2) nearest-neighbor search for RIFS, SIFT and SURF, or randomized tree classification for BazAR; (3) RANSAC time for the consistency check.

3.4.5 Feature selection evaluation

To evaluate the proposed feature selection component, we use both synthesized images with known ground-truth correspondences and challenging real images. The input object images cover a large variety, from small indoor objects to large outdoor scenes and aerial images. In all the experiments, MVKP with feature selection uses only 50 views for general training and 72 rotation-dominant views for the feature selection part.
The weight parameter α is 0.5 unless otherwise stated. The number of selected features for one object is 100. We use 20 WH projection kernels and 10 k-means clusters. The partition length in FFVA is 100. First, we perform high-ranking feature visualization. We apply the feature selection program with different weight parameters to the same set of training views; α = 0 is the traditional repeat-rate criterion and α = 1 corresponds to the extreme case using only our proposed ranking scores.

Table 3.2, Feature Ranking Change Due to New Selection Criteria.

Table 3.2 shows the changes in feature ranking under the two different experimental settings. The input object image is the classic book cover image [60]. The traditional repeatability score and our proposed ranking score demonstrate considerably different preferences on the same feature set.

Figure 3.15, Visualization of Feature Ranking Change (α = 0 and α = 1).

Figure 3.15 visualizes the difference in selected features. For both α values, only the top 100 ranking features are marked as white plus symbols. It can be clearly observed that: (1) high-ranking features selected by our approach are more likely to lie within highly textured regions instead of on the boundaries of such regions; features on region boundaries introduce large variances when rotated and have a high probability of being mixed with and affected by background textures. (2) Our high-ranking features have a smaller chance of appearing on repeated elements of the scene (the top left part of figure (c)); features describing repeated elements are less distinctive and can easily confuse matching systems. Based on these visualization results, selecting fewer but better features should therefore improve the robustness and speed of the whole system, an expectation verified in the next experiment.

Figure 3.16, Feature Selection Improvement: average error versus test index for α = 0.5 and α = 0.

Figure 3.17, Results of Image Matching with Feature Selection: top (a), challenging synthesized image test; middle (b), multiple objects under a cluttered background (the original image was adapted from [60]); bottom (c), images from different sources.

Second, to measure the improvement from feature selection, we carry out 11 independent pair-wise image matching tests and keep only the top 30 percent of ranking feature points for both α = 0 and α = 0.5. The test indexes are ordered by the increasing error of the α = 0.5 case. Since feature selection only occurs during training, it does not prolong online processing time; if a smaller number of features is selected, processing time may even be reduced. When the same percentage of features is selected, figure 3.16 clearly indicates that our new feature selection method notably improves the matching accuracy. Figure 3.17 illustrates the image matching and pose recovery results for real and synthesized images; the new distribution of feature point locations can be easily observed. Overall, MVKP with feature selection clearly demonstrates more stable performance than without. The new feature selection method is able to select distinct features and to discard features on unstable boundaries and repeated elements. The selected features are robust even for extremely noisy and blurred scenes.
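For reference, the accuracy measure used throughout these comparisons (the average error against ground truth under a known transformation, as defined in section 3.4.3) could be computed along the following lines. This is an illustrative sketch with hypothetical inputs, assuming reported matches are given as keypoint pairs and the synthesized view was created with a known 2×3 affine matrix:

import numpy as np

def average_error(matches, affine):
    # matches: list of ((x, y), (x', y')) reported correspondences from the
    # reference image to a synthesized view; affine: the 2x3 matrix used to
    # create the view. Returns the mean Euclidean distance between each
    # reported match location and the ground-truth location of its reference point.
    errors = []
    for (x, y), (xm, ym) in matches:
        gt = affine @ np.array([x, y, 1.0])           # ground-truth position
        errors.append(np.hypot(gt[0] - xm, gt[1] - ym))
    return float(np.mean(errors)) if errors else 0.0

# toy usage with a translation-only transform
A = np.array([[1.0, 0.0, 5.0], [0.0, 1.0, -3.0]])
print(average_error([((10, 10), (15.2, 7.1)), ((0, 0), (5.0, -3.0))], A))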
Last, concerning the influence of the weight parameter: RIFS combines the traditional repeatability criterion with our newly proposed ranking score through a weight parameter α. We vary the weight parameter from 0 to 1 but always select the top 30% ranking features for each image pair. Optimal performance is observed around α = 0.5 (figure 3.18). We observe small changes in the best weight value for different kinds of images; if the image category is given, a good weight parameter can be found beforehand through machine learning techniques such as holdout or cross validation.

Figure 3.18, Changing the Weight Parameter: average error versus α value.

Chapter IV: Augmented Distinctive Features for Efficient Image Matching

Although local feature based approaches have gained much attention recently, when used alone, the locality of such features frequently leads to noisy results, and geometric constraints and consistency checks generally need to be applied as an additional layer to refine the results. Furthermore, although experimentally proven to be remarkably robust and invariant to viewpoint changes and distortions (usually at the cost of heavy computational demands), local features are also typically not very distinctive, essentially because of the lack of built-in geometry information. Therefore, they tend to face great difficulty with scenes that lack highly textured surfaces or that contain repeated patterns, which easily confuse texture-based image matching methods. To tackle the problem, many recent works in the matching and recognition domains try to directly extract and describe high-level matching primitives, such as constellations of edges ([24] and [80]) and regions ([98] and [37]), to enhance descriptor distinctiveness. However, the accurate and clean acquisition of those primitives remains a challenging open problem in computer vision, and the acquired curves and shapes are often insufficiently stable between images due to viewpoint and lighting changes.

Figure 4.1: Overview of Our Feature Augmentation Process. Input images pass through keypoint detection and grouping to form sub-images, keypoint description and offset aggregation to form relative features, and rank computation with tied-rank handling to form ordinal codes, which populate the augmented features database.

This chapter presents our novel approach, in a general framework, to generate more distinctive features based on invariant local feature descriptors, without the need for high-level feature extraction. We propose to treat the classic image matching problem as a collection of image recognition problems, which naturally integrates geometry information into our augmented features. Figure 4.1 shows the major steps involved in generating augmented features. First, we detect stable interest points in the input images and extract corresponding image patches based on the positions, orientations and scales of those points. Sub-images, which are basically subsets of image patches that are close to each other in image space, are then constructed, converting the original problem into an image recognition problem over those sub-images. Next, the semi-global descriptor for each sub-image is computed by first accumulating offsets between the sub-image center and the member patches' descriptors, which produces what we call relative features, containing geometry information of the neighborhood.
The 47 48 augmented features (ordinal codes) are finally generated by considering the relative rankings of features’ components instead of their original values, which can be regarded as a feature normalization process. From certain point of view, our feature augmentation process is analogous to the coding process in the general framework of image recognition. One difference is what we are coding here are sub-images, not local patches (e.g. [55]) or the whole image. Additionally, we propose increasingly keypoint sampling around sub-images to handle tied ranking problem of ordinal codes. Finally, Spearman correlation coefficient is used to measure the similarity between augmented features and establish the point-to-point correspondences for image pairs. Extensive experimental results based on SURF features have demonstrated the effectiveness of our augmented features. We conduct experiments using benchmark datasets plus precision- recall analysis, together with supplemental real-word challenging image pairs, both indicating higher distinctiveness and lower outlier level of our augmented features compared with base features. The additional computational cost of the feature augmentation process is nominal, which makes building interactive or even real-time computer vision applications possible when combined with fast base features. Furthermore, our feature augmentation framework is general enough to be compatible with a wide range of existing invariant local features, providing important performance gain to image representation methods or upgrading existing image feature databases at a minimum cost. We also provide theoretical analysis about compatibility issues of our method working with different kinds of local features. Our proposed approach is related to the following recent works: In [15], Boureau, et al. provided a systematic evaluation about combinations of various coding and pooling techniques for recognition, and concluded that large performance increase can be obtained by 49 merely representing neighboring descriptors jointly. Similar idea was explored in [45] which also proposed a novel way to optimize dimension reduction and approximated nearest neighbor searching all at the same time. Concerning the ordinal description method we used, the early work of ordinal measurement tracks back to M. Kendall [49]. Recent works applying ordinal description on image correspondence and recognition problems include R. Zabih and J. Woodfill [106], J. Luo, et al. [61] and M. Toews and W. Wells [86], which built upon SIFT descriptors and reported superior results in terms of precision and recall when compared with many widely-used descriptors such as original SIFT, PCA-SIFT and GLOH. Our sub-image concept is also enlightened by the following classic techniques: first of all, the shape context [7] takes sampled contour points as inputs and constructs shape descriptors in log-polar histograms using relative positions (direction and distance) between contour points. Two fundamental differences are: our sub-images are described by counting offsets in invariant feature space, not in image space. And the construction of sub-images is directly based on stable interest points, therefore avoiding the need to acquire clean shapes and contour points from images, which is a challenging fundamental vision problem by itself. 
Second, our sub-image description is also related to Fisher kernel, which provides the offset directions in parameter space into which the learnt distribution should be modified in order to better fit the newly observed data. Recent work of Perronnin et al. [70] uses fisher kernel to obtain compact and efficient image representations for image classification. Last, the famous bag-of-features approach [82] uses K-mean to generate code-books based on invariant local features, assigns each query descriptor to one item in the code-book, producing a histogram of codes representing the whole image, while our approach proposes sub-image as matching 50 primitives and uses accumulated offsets instead of the cluster centers in the feature space. Each base descriptor is converted into exactly one augmented descriptor, facilitating dense point-to-point correspondences. 4.1 Feature Augmentation Process This section presents the reasoning and details of our feature augmentation process. Based on keypoint locations and descriptions, we construct sub-images representing interested neighborhood in image space, then compute relative features for each sub-image integrating geometry information in feature space, which are normalized using ordinal description and produce final augmented features. 4.1.1 Sub-images Many existing image matching methods use local image patches around interested keypoints as matching primitives. The image patches can be extracted either from scale and rotation- adaptive neighborhoods, where transformation parameters are determined through searching in scale space and orientation histograms, or from regular regions of fixed sizes, which achieve viewpoint invariance through separated multiple-view training process. The produced local features can be very robust against viewpoint and lighting changes and distortions, but usually insufficient distinctive due to the lack of global geometry information. On the other hand, image classification and recognition works focus more on distinctiveness among different images rather than robustness again viewing condition changes, generally producing one unified descriptor for each image and can by no means provide robust and dense point-to- point correspondences. We propose to use sub-images as matching primitives, aiming to fill the gap between the above two extremes. Sub-images are sets of local image patches that are close to each other in image space (figure 4.2). As relative local structures, they are robust to factors such distortions and occlusions, but they also integrate semi-global geometry information in order to improve feature distinctiveness. Figure 4.2, Illustrations of Sub-image Concept: (k = 5 cases). Red and blue dots are leaders and members of each sub-image respectively. Notice for figure clarity and due to the overlapping between different sub-images, only four sub-image structures are visualized in this figure, although the total number of sub-images is very large (equal to the number of original keypoints, see the formula below for details). Grey crosses are other interest points returned by SURF detector. To construct sub-images, we first detect stable interest points (represented by P i ) and extract fixed or invariantly adaptive image patches around each interest point depending on which detector is used. 
After the position of each P_i is stored and properly indexed (a hierarchical tree structure is preferred, especially when the total number of keypoints, denoted num, is large), we construct one sub-image structure for each P_i (represented by S(P_i)), with P_i as its leader and its k nearest neighbors in image space as members. Naturally, the number of resulting sub-images is the same as the total number of original interest points, and there are many overlaps between neighboring sub-images. In other words, each original interest point is converted into exactly one sub-image, integrating its neighborhood information. Formally:

S(P_i) = \{P_i\} \cup \{k\text{-NN}(P_i)\}, \quad 1 \le i \le num

Each S(P_i) is an abstract type of "image" containing information about the neighborhood of P_i. Both leaders and members of sub-images are organized around keypoints. As a result, the keypoint detector should be sufficiently stable that the constitution of the same sub-image remains consistent, to a certain degree, between different input images under various viewing conditions. In our current experiments, we use the efficient keypoint detector in [3] based on integral images, which is very fast to apply and provides us with sufficient repeatability. After the sub-image construction, finding matching points in the input image pair is equivalent to classifying a large set of sub-images generated from both images.

4.1.2 Relative features

Given sub-images, our next task is to form invariant descriptors for them, which will later be used to measure the similarities between sub-images efficiently. This step is analogous to the coding process in image recognition. Suppose we choose an invariant local feature descriptor (with descriptor dimensionality dim), D(P_i) is the feature vector of P_i, and D_j(P_i) is its j-th vector component. Notice that the keypoint detector and descriptor need not belong to the same image matching system, as long as the selected detector is stable and the descriptor invariant, meeting the general requirements of those components. This provides extra flexibility for our augmented features to work well with various combinations of existing keypoint detection and image representation and description techniques. If we were to combine a sub-image's descriptors directly, for example by simply concatenating D_j(P_i) for all the P_i belonging to the same sub-image, at least two fundamental drawbacks would immediately follow. One is the rather high dimensionality of the resulting descriptors (when k = 5 and the standard SURF [3] descriptor is used, each sub-image would be associated with a 384-D vector), which raises practical obstacles for later indexing and searching steps because of the curse of dimensionality; even hierarchical data structures have great difficulty indexing and searching exact nearest neighbors in such a high-dimensional feature database. The other is that such a combination would very likely compromise the invariance of the base descriptors, because the order in which the concatenation is performed varies. Encouraged by successful classic techniques such as the Fisher kernel [70] and shape context [7], we believe that, generally speaking, features based on relative and aggregated values (either in image space, like shape context, or in feature space, like the "macro features" in [15]) usually demonstrate more robust performance than features based directly on the raw values of the base descriptors.
Figure 4.3, Relative Features: sub-images (red crosses as sub-image leaders and blue ones as members) at similar physical locations in one image matching pair, and their corresponding relative features visualized as 2D (8 by 8) histogram patterns, treating the accumulated offsets in each dimension as intensities. Notice both the overall similarity and the detailed magnitude differences between corresponding relative features.

Based on the above reasoning and analysis, we compute the relative feature for S(P_i) (denoted R(P_i)) by accumulating, for each dimension, the k offsets between the leader and member descriptors of S(P_i). Formally, we define the indicator δ_{i,n}, which equals 1 if P_n is a member of the sub-image S(P_i) and 0 otherwise. The j-th dimension of the relative feature is then computed as:

R_j(P_i) = \sum_{n \in [1, num]} \delta_{i,n} \cdot \left| D_j(P_i) - D_j(P_n) \right|, \quad 1 \le j \le dim

The computed relative features have a constant low dimensionality regardless of k, the number of members in each sub-image. Moreover, assuming the base descriptors are scale- and rotation-invariant and the keypoint detector is stable, it is easy to show that the relative features produced are also invariant, since the sub-image members are considered in an order-independent way. Each relative feature captures the characteristics of the invariant-feature distribution of one sub-image, which is a semi-global structure capturing the keypoints' spatial relationships in image space. This is a considerable advantage for image matching applications, achieved by computing relative features in the invariant feature space and maintaining invariance throughout the computation. In comparison, features computed by aggregating offsets in image space, such as shape context, are fundamentally not invariant to orientation and typically obtain only a limited degree of scale invariance through normalization techniques. In our current experiments, we use the SURF descriptor [3] with dim = 64. We visualize the generated relative features by first transforming their components into the 0~255 intensity range and then rendering each descriptor as a 2D (8 by 8) histogram. Figure 4.3 shows some of our relative features generated using a modified version of the famous testing image pair from [60] (extra lighting changes and distortions are added).
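To make the offset aggregation concrete, the sub-image construction and the relative-feature computation above could be implemented roughly as follows. This is our own illustrative sketch rather than the thesis code; the use of scipy's cKDTree for the k-nearest-neighbor query in image space, and all variable names, are assumptions introduced for exposition:

import numpy as np
from scipy.spatial import cKDTree

def relative_features(points, descriptors, k=5):
    # points: (num, 2) keypoint coordinates in image space;
    # descriptors: (num, dim) invariant base descriptors (e.g. 64-D SURF).
    # Returns (num, dim) relative features: for each leader P_i, the
    # per-dimension sum of |D_j(P_i) - D_j(P_n)| over its k nearest members.
    tree = cKDTree(points)
    # query k+1 neighbors: the closest one is the leader itself (distance 0)
    _, idx = tree.query(points, k=k + 1)
    members = idx[:, 1:]                                   # (num, k) member indices
    offsets = np.abs(descriptors[:, None, :] - descriptors[members])  # (num, k, dim)
    return offsets.sum(axis=1)

# toy usage with random keypoints and 64-D descriptors
pts = np.random.rand(50, 2) * 640
desc = np.random.rand(50, 64)
print(relative_features(pts, desc).shape)   # -> (50, 64)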
4.1.3 Normalization

After the relative features in each image are computed, we can directly measure their similarities using standard distance metrics such as the Euclidean distance and establish initial correspondences between the two feature sets. For simple image matching pairs, e.g. when one image is an in-plane rotated or uniformly scaled version of the other, robust and accurate results can be obtained. However, for challenging image pairs involving complicated viewpoint and lighting changes (e.g. figure 4.3), we found such initial matching results to be generally very noisy. The primary reason lies in the linear relationship assumed by today's widely used, computationally efficient metrics such as the Euclidean or chi-squared distances. To handle violations of the linear assumption (e.g. from illumination changes), the original feature vectors are usually normalized to remove non-linearity before entering the matching process. Many existing methods use a set of experience- or experiment-determined parameters to threshold, translate and rescale the original data into a fixed range, which is naturally difficult to generalize to other viewing conditions or combinations of conditions. As can be observed from figure 4.3, in terms of the raw values' magnitudes (the intensities of the 2D histogram cells), sub-images at similar physical locations sometimes produce rather different relative features due to viewing condition changes, which is exactly the reason for the noisy results when similarities of relative features are measured directly. However, it can also be observed that although the bins' absolute values are not identical across different views, their relative orders and relationships are overall consistent. Based on the above analysis and observations, we propose to use ordinal description as a parameter-free and computationally efficient way to normalize our relative features. First, for each relative feature R(P_i), which has the same dimensionality as the base descriptors, we locally sort its vector components to produce a sorted array, which provides the relative ranks (ranging from 1 to dim) of the relative feature vector's components. The augmented feature (represented by A(P_i)) is generated by replacing each of R(P_i)'s original component values with its corresponding rank in the sorted array. For example, suppose R(P_1)'s first vector component has a value of 50; after sorting all the vector components of R(P_1), we find nine components less than 50, so the first vector component has a relative ranking of 10, which becomes the first vector component of the corresponding augmented feature A(P_1). The overall computational complexity of normalizing each relative feature is only dim·log(dim), due to the sorting. Formally:

A_j(P_i) = \left| \{ R_m(P_i) \mid R_m(P_i) \le R_j(P_i) \} \right|, \quad 1 \le j \le dim

Figure 4.4, Augmented Features: corresponding augmented features visualized for similar physical locations in the two input images. Separated by the middle dashed line: on the left are four augmented features (locations indicated as red dots) extracted from the book cover image, and on the right are the four corresponding augmented features from the scene image. All augmented features are visualized as barcodes, and corresponding ones from different images are aligned in the same row. Notice the improved similarity between augmented features in the same row and the maintained distinctiveness among features in different rows.

Our final augmented features consider the relative ranking of each feature dimension instead of its original value. As a result, each augmented feature is represented as an integer vector of dim dimensions, more specifically a permutation of the set {1, 2, …, dim}. The ordinal description we use generates normalized image representations that are invariant to monotonic deformations and also robust against a certain degree of challenging non-linear and non-uniform factors such as partial shadows and lighting. Since each ordinal code is a permutation of the integer set {1, 2, …, dim}, we treat those integers as intensity values and simply visualize each ordinal code as a barcode-like 1D pattern. Figure 4.4 shows the "barcodes" associated with the same sub-images as in figure 4.3 in the two input images. It indicates that after the ordinal description, feature similarity is notably improved: under different viewing conditions, features belonging to the same physical locations are more similar while others remain distinctive.
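The rank-based normalization just described could be sketched as follows. This is illustrative code under our own assumptions: ranks are made 1-based to match the {1, …, dim} permutation description, and ties are broken by index order (the simple fallback discussed next, rather than the iterative sampling scheme of table 4.1):

import numpy as np

def ordinal_code(relative_feature):
    # Replace each component of a relative feature with its 1-based rank among
    # all components. A stable argsort breaks ties by index order.
    order = np.argsort(relative_feature, kind="stable")   # indices sorted by value
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, relative_feature.size + 1)
    return ranks

# toy usage: a 6-D relative feature and its permutation code
print(ordinal_code(np.array([50.0, 3.0, 50.0, 7.0, 1.0, 20.0])))   # -> [5 2 6 3 1 4]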
One crucial issue for ordinal description is how to handle tied rankings. Given a relative feature vector, when two components with different indices contain the same original value, the produced ordinal code will contain identical (tied) rankings for the two indices, resulting in unexpected outputs. The simplest way to handle a tied ranking is to use the differing indices as a fail-safe reference to break the tie: when two different vector components produce the same ranking, the one with the smaller index is simply given priority. This method is equivalent to using a vector of original indices as a reference for breaking ties. As an improvement, in [86] such a reference vector is obtained offline by averaging a large set of descriptors, e.g. the descriptors of the whole image, which is appropriate for coding a whole image for recognition purposes. However, we argue that for image matching, the reference vector used by a semi-local ordinal description should reflect the local ordering trend, which is usually not consistent with the global trend.

Algorithm: Tied Ranking Breaker
1: Assume: R_j = R_j'; j ≠ j';
2: Initialize: set S = {all keypoints} − S(P_i);
3: do iteration t = 1, 2, …, M;
4:     P_n = NN(P_i, S);
5:     S = S − P_n;
6:     S(P_i) = S(P_i) + P_n;
7:     compute new R(P_i, S(P_i));
8:     if (R_j ≠ R_j') break;
9: end iteration;
10: if (R_j ≠ R_j') use (R_j, R_j') to break the tied ranking;
11: else use (j, j') instead;
Table 4.1, Algorithm Structure of the Tied Ranking Handler

Therefore, we propose increasing keypoint sampling around sub-images to break tied rankings. The idea is to iteratively include additional nearest neighbors of the current sub-image, one by one, and to compute new relative features in each iteration until the tied ranking is broken or, for efficiency, until a maximum number of iterations (denoted M) is reached, in which case we fall back to using the original indices as the reference vector. Pseudocode for the tied ranking handler is given in table 4.1; NN(P_i, S) is a function returning the nearest neighbor of P_i in S.

4.2 Establishing Correspondences using Augmented Features

Although standard distance metrics such as the Euclidean distance can be used to measure similarities, better comparison results can be obtained more efficiently with metrics based on the characteristics of ordinal codes; for example, given dim, there is a fixed total number of unique codes (dim!) and all codes have the same length. Ordinal description and measurement have been studied for nearly a century, and many specially designed distance metrics have been proposed. In this thesis, we study three measurements that fit well with our features and problem domain. The first is element-wise consistency counting. Assume A and A' are the augmented feature vectors of two sub-images, and A_j is the j-th vector component. Their similarity can be efficiently computed as:

Dist(A, A') = \left| \{ j \mid A_j \ne A'_j \} \right|, \quad 1 \le j \le dim

The second metric is based on the relative ordering of pairs of vector components, measured by a sign() function that returns 1 if its input differences have the same sign and −1 otherwise. Formally, the Kendall coefficient [49] is defined as:

Dist(A, A') = \frac{2 \sum_{j \in [1, dim]} \sum_{m \in [j+1, dim]} sign(A_j - A_m,\; A'_j - A'_m)}{dim(dim - 1)}
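As an illustration, the first two measurements could be implemented as in the following sketch (our own code, not the thesis implementation); sign(x, y) is interpreted, as in the text, as +1 when the two pairwise differences share a sign and −1 otherwise:

import numpy as np

def consistency_distance(a, b):
    # element-wise consistency counting: number of positions where the codes disagree
    return int(np.sum(a != b))

def kendall_coefficient(a, b):
    # Kendall-style coefficient over all component pairs (j < m): +1 when the
    # pairwise orderings agree, -1 when they disagree, normalized by dim(dim-1).
    dim = len(a)
    da = a[:, None] - a[None, :]          # pairwise differences within each code
    db = b[:, None] - b[None, :]
    agree = np.sign(da) * np.sign(db)     # +1 same ordering, -1 opposite
    upper = np.triu_indices(dim, k=1)     # pairs with m > j
    return 2.0 * agree[upper].sum() / (dim * (dim - 1))

# toy usage with two 5-D permutation codes
a = np.array([1, 2, 3, 4, 5]); b = np.array([2, 1, 3, 4, 5])
print(consistency_distance(a, b), kendall_coefficient(a, b))   # -> 2 0.8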
Last, based on the ordinal codes' property of equal length, our modified version of the Spearman correlation coefficient [85] returns the similarity of two ordinal codes in the range 0~1:

Dist(A, A') = 1 - \frac{3 \sum_{j \in [1, dim]} (A_j - A'_j)^2}{dim(dim^2 - 1)}

Our own experiments indicate that the Spearman correlation coefficient generally provides better matching results than the other two. As an additional, rather practical consideration, its conventional unit range also works well with many existing image matching components and frameworks, such as the distance ratio (used in [60] and [66]) and consistency check methods like RANSAC. Therefore, we select the Spearman correlation coefficient as our distance metric in the next section's evaluation.

4.3 Experimental Results

Our proposed augmented feature has been intensively tested using standard benchmark datasets [66] with known ground truth. We primarily focus on outdoor scenes, in line with our project's interests. Besides the standard test data, we also conducted experiments on other real-world image pairs of outdoor natural scenes with man-made objects such as buildings and signs. Many of these additional testing image pairs contain objects of interest with low-textured surfaces or repeated patterns, which are particularly challenging for existing local image features. We experimented with different kinds of invariant image description methods and finally selected SURF (Speeded Up Robust Features) [3] as our base descriptor, because it provides the best tradeoff between matching speed and feature robustness. SURF is built upon other experimentally verified and successful detectors and descriptors, such as the famous SIFT [60], but simplifies those steps to the essential. It proposes the use of integral images to drastically reduce the number of operations for simple box convolutions, independent of the chosen scale, and its feature description is based on sums of Haar wavelet components, which can be constructed and matched more efficiently than in other state-of-the-art methods. In our evaluation, the widely used 64-dimensional SURF descriptor is used. We set k, the only parameter of the whole feature augmentation process, to 5, which we experimentally find to provide the best results. Overall, our experimental results demonstrate that, compared with the base descriptors, the proposed augmented features achieve a remarkably higher level of distinctiveness without loss of robustness, at a nominal additional computational cost. Specifically, besides outperforming the base features in the standard precision-recall curves, the new augmented features, which integrate geometry information, generally produce a larger number of correspondences under the same distance ratio requirement and fewer noisy matchings, especially for challenging real-world scenes. The proposed method is integrated into our image matching system described in the previous chapters, which generates augmented features of one target image offline and then matches the features of query images online, frame by frame. The total processing time per frame, including keypoint detection, description, feature augmentation and finally establishing correspondences, is only 100~150 ms with a peak memory occupation of around 15 MB, making the system an attractive solution for image matching in interactive or even real-time applications such as outdoor navigation and augmented reality on small mobile devices.

4.3.1 Standard tests

The standard test sets we experimented with consist of image sets of various scenes (K. Mikolajczyk and C.
Schmid [66]), each containing six images (five pairs when the first image is fixed as the reference) with successive increments of one type of image deformation, including image blur, compression, lighting and viewpoint changes. One image transformation matrix is associated with each image pair, so that for any keypoint position in the reference image we can compute the ground-truth corresponding position in any of the other five images, which can be used to verify the matchings returned by the various methods. During the evaluation, for each image pair we construct base descriptors and two versions of augmented descriptors, using the local (L) and global (G) reference vectors respectively, for the same set of interest points, and then use the Euclidean distance and the Spearman correlation coefficient to measure their similarities and establish initial matchings. Within one test, the same distance ratio is applied to both the base and augmented tracks to obtain the final reported matchings; distance ratios are varied between tests to generate the data points in the figure. Next, for each keypoint position in the reference image, by comparing its reported matching positions with the ground-truth positions, we are able to distinguish true and false matchings. This information, combined with the total numbers of reported and possible correspondences, allows us to compute precision and recall for one image pair under one distance ratio. In line with our project interests, we focus our experiments on challenging outdoor scenes with non-planar objects and low-textured surfaces and on handling complicated viewpoint changes. To generate figure 4.5, we chose 13 image pairs meeting the above guidelines from three image sets (the boat set, the graffiti set, and the bike set), which also represent challenging factors in outdoor mobile applications such as rotations of the mobile device and blurred input images. Each image pair is tested using base and augmented features and ten different distance ratios. Figure 4.5 summarizes the precision and recall results of over 300 image matching tests, demonstrating the superior performance of our proposed method.

Figure 4.5, Standard Tests Results: recall versus 1-precision for the augmented features (L and G reference vectors) and the base features, evaluated on 64-D SURF as base descriptors. Results are averaged over 13 image pairs of K. Mikolajczyk's testing data set.

4.3.2 Dense matching from distinctive features

In this subsection, we demonstrate that our augmented features are more distinctive than the base features, resulting in denser and cleaner correspondences. Towards this goal, we compute base and augmented features for the five image pairs of the boat image set (with increasing levels of zoom and rotation), apply the same distance ratio, and record the number of reported matchings for the two feature tracks respectively. Since the distance ratio is defined as the ratio of the best matching's similarity over the second best, we believe that, from a certain point of view, the number of matchings remaining after applying a fixed distance ratio can also serve as a simple indicator of feature distinctiveness. Assume one method generates similar, clustered and thus less distinctive features while the other produces features well scattered and distributed in the feature space.
After applying the same distance ratio filter, the first method will have far fewer matchings remaining, because the differences between its matchings' similarities are much smaller.

Figure 4.6, Comparison of Feature Distinctiveness: number of features and matchings per test index (increasing zoom and rotation) after applying the same distance ratio — total number of features, reported matchings for augmented features, and reported matchings for base features. Evaluation based on the boat image set (five pairs with the first image fixed as reference).

Figure 4.6 shows that our augmented features lead to a significantly larger number of reported matchings than the base features under the same distance ratio. This result can be considered jointly with section 4.3.1, which indicates a better precision-recall tradeoff for the denser matchings, and with section 4.3.3, which demonstrates consistent results for images outside the standard datasets as well.

4.3.3 Matching visualization and analysis

This subsection provides line-by-line matching visualization results for both standard test images and other real-world scenes. We apply the same distance ratio and, for each remaining final matching, draw a blue line from its keypoint in the reference image to the matched keypoint location in the current query image. As can be observed from figure 4.7, and consistent with the quantitative evaluations in the two subsections above, the augmented features yield denser correspondences and fewer matching outliers than the base features. The primary reason is that our augmented features enhance distinctiveness by integrating semi-global geometry information, through the sub-image concept and offset aggregation in feature space, while maintaining and even improving feature invariance and robustness through ordinal description. Last, regarding the compatibility of our feature augmentation, the proposed method should in principle work well with a wide range of existing invariant local features, including but not limited to SIFT [60], SURF [3], and GLOH [66]. However, a few issues are worth considering beforehand. First of all, the base feature needs to have descriptors in vector form; otherwise the computation of relative features and the ordinal description cannot be applied directly, so approaches based on direct learning on local image patches without vector representations, such as [54] and [16], cannot benefit from the feature augmentation process. Next, bag-of-features [82] and some multi-view based image matching techniques such as [55] and [95] use k-means cluster centers instead of the original interest points to represent images or features. In such cases, special caution is needed to make sure that, in addition to stable keypoints, those cluster centers are also sufficiently stable; otherwise it is hard to guarantee that the sub-images, the very foundation of our feature augmentation process, will remain repeatable and consistent across different viewing conditions. Finally, concerning the normalization component of our method, ordinal description generally works better when the feature vectors it is applied to have high dimensionality, providing a large pool of unique codes. Image matching methods that advocate dimension reduction of their feature vectors (e.g. by using PCA), such as [55] and [48], are ill-suited for our normalization component.
It is reported in [86] that directly applying ordinal description onto PCA-SIFT [48] can even lead to inferior performance. (a) (b) Figure 4.7, Matching Visualization and Comparison: of base (top) and augmented (bottom) features. Standard dataset: (a) Graffiti; (b) Boat; Extra testing dataset: (c) building with similar or repeated patterns; (d) street scene with large viewpoint changes. 68 (c) (d) Figure 4.7 (Continued): Matching visualization and comparison of base (top) and augmented (bottom) features. Standard dataset: (a) Graffiti; (b) Boat; Extra testing dataset: (c) building with similar or repeated patterns; (d) street scene with large viewpoint changes. Finally, as a summary, this chapter presents our feature augmentation process, producing 69 70 more distinctive features for efficiently image matching. The proposed method is built upon the concept of sub-images, which connects close small image patches in image space, converting one image matching problem into a collection of image recognition problems. Our relative features aggregate descriptor offsets in invariant feature space and within sub-images, in order to integrate geometry information and produce semi-global features. Based on visualization and analysis of corresponding relative features, we propose to use ordinal description to normalize and generate augmented features, invariant to monotonic deformations and beyond. Last, similarities are measured by Spearman correlation coefficient and correspondences are established. Experimental results using standard and supplemental datasets verified the augmented features containing additional geometry information are more distinctive and efficient, leading to denser and cleaner correspondences, particularly suitable for multimedia applications on small mobile devices. 71 Chapter V: Fast Similarity Search for High-Dimensional Dataset Similarity search is crucial to multimedia database retrieval applications, which involves finding the most similar objects in a dataset to a query object based on a defined similarity metric. To achieve a robust and effective querying, many current multimedia systems use highly distinctive features as basic primitives to represent original data objects and perform data matching. While these feature representations have many advantages over original data including distinctiveness, robustness to noise, and invariance and tolerance to geometric distortions and illumination changes, they typically produce high-dimensional feature spaces that need to be searched and processed. Methods that search exhaustively over the high- dimensional spaces are time-consuming, resulting in painfully slow evolution of related applications. Efficient search strategies are needed to rapidly and robustly screen the vast amounts of data that contain features and objects of critical interest to users’ applications. Traditional tree-structure techniques hierarchically partition and cluster the entire data space into several subspaces and then use special tree structures to index objects. KD-tree is a widely used NN-search algorithm with sub-linear complexity [33]. It performs well for low- dimensional datasets if there are many tree branches that can be pruned and consequently only a small fraction of data needs to be processed. However, their performance will rapidly degrade when directly adapting to high dimensionality. This is referred to “curse of dimensionality” [6] and places a practical limit on the partitioning based techniques. 
Weber, et al [101] conducted a quantitative analysis of various NN-search methods in high- dimensional vector spaces. It shows when the dimensionality of data excesses around 10, the searching and indexing techniques based on hierarchical data-partitioning, including KD-tree, 72 R-tree [38], R*-tree [4], SR-tree [47], M-tree [19], and TV-tree [56] can be easily outperformed by a simple sequential-scan. X-tree [11] has demonstrated good performance for datasets with medium dimensionality (e.g. 30-dimension). X-tree is constructed in such a way that the minimum bounding hyper- rectangle of two splitting sets is minimal. The concept of supper node is introduced intending to tackle the problem of dimensionality curse. Experiments showed that the number and size of the supper nodes increase with dimensionality, which can be viewed as a proof of the limitation of hierarchical structured NN search. Best-Bin-First (BBF) [5] performs an approximated searching over KD-tree structure to accelerate the search speed for high-dimensional datasets. The tree-nodes are examined in an ascending order according to the minimum distance from the query point to the bin containing the leaf nodes. Similarity search terminates after certain number of leaf nodes are examined. Another line of research is to use linear and flat data structures to index objects [10, 13, 31, 99, 100 and 101]. VA-File proposed in [13] has shown to be an effective technique for high- dimensional databases. The method divides the original data space into rectangular cells quantized as bit-vector representations. Similarity search of a query object is performed by first screening the data using lower/upper-bound distances, and then refining the candidates. VA-File, however, suffer from the assumption of database uniform distribution, which is not always applicable to complex data, especially a heterogeneous multimedia dataset. There exist several variant approaches intending to overcome the problem. Ferhatosmanoglu [31] improved the VA-File search by using non-uniform bit allocation and optimal statistic quantization. Weber [99] discussed the issue of selecting appropriate bit number for each 73 dimension, and also suggested a formula for evaluating the bound error distribution. Furthermore, a parallel VA-File search technique with multiple workstations was investigated in [100]. 5.1 FFVA for Similarity Search This section presents an improved vector approximation NN search method, called Fast Filtering Vector Approximation (FFVA), for rapidly searching and matching high- dimensional features from large multimedia databases. Figure 5.1 illustrates the structure of FFVA and its major components for an efficient nearest neighbor search. The basic data structure of FFVA and standard VA File method is a space-partition-table (SPT). Each dimension of input feature vectors is quantized as a number of bits used to partition it into a number of intervals on that dimension. In the whole vector space, each rectangle cell with the bit-vector representation approximates the original vectors that fall into that cell, resulting in a list of vector approximations of the original vectors. It is noteworthy that our FFVA method clusters the original vectors to the corners of SPT cells, thereby enable us to use a list of corners, instead of cells, to approximate the original vectors. Such strategy is efficient for the following fast lower bounds filtering. 
There are two major levels involved in the FFVA NN-search: 1) a coarse search level that sequentially scans the approximation list and eliminates a large portion of the data, and 2) a real data search level that calculates accurate distances for the resulting candidates and decides the final k-nearest neighbors.

Figure 5.1, Algorithmic Structure of FFVA Approach

In previous works, lower bounds (the Euclidean distances from the SPT cell's nearest corner to a query point) are used for the coarse search. Experiments show that the approximation quality and computation cost of the bound distances determine the performance of the entire searching system ([31], [88]). Figure 5.2 shows the total time (sum of 900 queries) of a typical VA-File query using real data sets (128-D SIFT features). According to the graph, the calculation of lower bounds takes around 90% of the total querying time. This result is consistent with the experimental results in [19]. Therefore, a more efficient method for lower-bound computation is essential for improving the entire search performance.

In the coarse search level, only the vector approximations are accessed and the block distance is used as the similarity metric, calculated as the Manhattan distance between corresponding corner points of two SPT cells:

d_block(v_1, v_2) = sum_{i=1..n} |v_1i − v_2i|,

where v_1i and v_2i are the coordinates of two corresponding corner points in an n-dimensional space.

Let "max_bd" represent the longest exact distance (squared Euclidean distance) from a query point to the current k-nearest neighbors. Since in our experiments all the vector coordinates are integers, the exact distance is strictly lower-bounded by the block distance; therefore, whenever we encounter an approximation whose block distance from the query point is larger than max_bd, it is guaranteed that at least k better candidates have already been found. Therefore, we can eliminate data with block distances larger than max_bd. Updating max_bd is also fast, as we dynamically sort and maintain the k-NN structure.

Figure 5.2, Query Time of VA-File

Only those candidates with a block distance no larger than max_bd will enter the real data search level. In this level, their original vectors are accessed in order to calculate their exact distances. If the exact distance turns out to be shorter than any of the current k-NN distances, the k-NN list as well as max_bd will be updated. Below is the algorithmic structure of the FFVA approach.

Input: query and database vectors
Output: kNNQ containing the k-nearest neighbors
/* Note: kNNQ is a list containing the k-nearest neighbors found so far, sorted in ascending order of exact distance from the query point */
1. Calculate or load (if pre-computed) the approximations of the database vectors (aprov) and of the query point (aproq);
2. Initialize kNNQ;
3. max_bd = maximum possible value;
4. For each approximation aprov of the database vectors {
       current_bd = block distance between aprov and aproq;
       if (current_bd < max_bd) {
           Calculate the corresponding exact distance;
           Insert this vector into kNNQ if it is closer to the query point than the last element in kNNQ;
           Update max_bd to be the distance from the query point to the last element of the k-NN found so far;
       }
   }

Table 5.1, FFVA algorithmic structure

5.2 Experimental results

In this section, we provide an extensive performance evaluation and comparison of the proposed FFVA approach with four commonly used k-NN search techniques: exhaustive linear scan, KD-tree, BBF, and VA-File. The tests cover search accuracy, data access rate, query speed, and memory requirement.

Both synthetic and real data are used in our experiments. The real data sets are large numbers of high-dimensional feature vectors (128 dimensions per feature) generated from a range of real images containing various objects and scenes. The SIFT (Scale Invariant Feature Transform) approach [60] was used to extract the image features. In our tests, the partition length for the VA-File and FFVA methods is 15. For BBF, the size of the buffer for "best bins" is 25 and the limit on the number of examined nodes is 100. During the similarity search, the top two nearest neighbors (2-NNs) were returned. To evaluate the matching correctness, we used the distance ratio as the evaluation criterion, that is: "the second closest neighbor should be significantly farther away than the closest one" [60]. A threshold value of 0.6 was used throughout the experiments.

5.2.1 Searching accuracy

In the accuracy test, we randomly pick 1000 vectors from a synthetic database to serve as query vectors (test set). Since the test set is a subset of the database, ideally a perfect match of a query vector should have zero distance. Table 5.2 summarizes the number of correct matches (out of 1000 queries) of the proposed FFVA method against the synthetic database. Gaussian noise (variance = 0.3) is added to the uniformly distributed vectors in the test set. We also conducted tests on various real data sets. Overall, the matching performance is consistent with the results on the synthetic data above.

Table 5.2, Matching Accuracy of FFVA
Dimension:        60    70    80    90    100   110   115
Correct matches:  1000  999   1000  1000  998   1000  1000
Dimension:        120   125   130   135   140   145   150
Correct matches:  1000  1000  1000  1000  1000  1000  1000

5.2.2 Data access rate

Data access rate is an important aspect in evaluating the effectiveness of an approach for very large high-dimensional database retrieval problems. The KD-tree and BBF approaches utilize a hierarchical tree structure to skip the nodes that are not along the search path. The FFVA and VA-File methods use compact vector approximations to avoid accessing the majority of the original database vectors. Figure 5.3 illustrates the test results of data access rates for the three methods. The FFVA NN-search method clearly demonstrates the best performance. Its lower bound is much tighter than that of the standard VA-File method. As a result, the proposed lower-bound computation based on block distance is efficient in reducing the amount of necessary data access and distance calculation. In this experiment on synthetic data, fewer than 10 exact distance calculations are needed for FFVA to find the 2-nearest neighbors among 15,000 120-dimensional feature vectors, while the other two methods (BBF and standard VA-File) spend 10-20 times more on exact distance calculations.
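This reduction in exact-distance computations comes directly from the block-distance filter of Table 5.1. The following is a minimal, self-contained Python sketch of that two-level loop, under simplifying assumptions: integer-valued vectors (so that, as noted above, the squared Euclidean distance is lower-bounded by the block distance), and, for brevity, vectors acting as their own approximations in the toy usage; in FFVA the corner codes from the SPT would be used instead. Names such as ffva_knn are illustrative, not the thesis implementation.

import numpy as np

def ffva_knn(query, database, db_approx, q_approx, k=2):
    """Two-level search in the spirit of Table 5.1 (sketch, not the exact code).

    Level 1: sequentially scan the approximations and prune by block
    (Manhattan) distance.  Level 2: compute exact squared Euclidean
    distances only for the surviving candidates and maintain the k-NN list.
    """
    knn = []                      # list of (exact_sq_dist, index), kept sorted
    max_bd = np.inf               # exact distance of the current k-th neighbour

    for i, approx in enumerate(db_approx):
        block_dist = np.abs(approx - q_approx).sum()          # coarse level
        if block_dist >= max_bd:
            continue                                           # pruned: no real data access
        exact = float(np.square(database[i] - query).sum())   # real data level
        if len(knn) < k or exact < knn[-1][0]:
            knn.append((exact, i))
            knn.sort(key=lambda t: t[0])
            knn = knn[:k]
            if len(knn) == k:
                max_bd = knn[-1][0]                            # update the pruning bound
    return knn

# Toy usage with integer vectors; a noisy copy of one database entry is the query.
rng = np.random.default_rng(1)
db = rng.integers(0, 16, size=(5000, 128)).astype(np.int64)
q = db[123] + rng.integers(-1, 2, size=128)
print(ffva_knn(q, db, db_approx=db, q_approx=q, k=2))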
Figure 5.3, Data access rate test: number of exact distance calculations vs. database size for VA-File, FFVA and BBF (results are the averages of 1000 queries)

5.2.3 Query speed

In this test, we assume that all the feature vectors and their data structures are loaded into main memory (i.e., full memory access) to perform the NN-search. The total query time also includes the time spent on the tree construction (BBF) and vector approximation (VA-File and FFVA) processes. Figure 5.4 shows the test results (total query time of 100 queries) on synthetic data with the dimension fixed at 100. Under the full memory access assumption, BBF demonstrates the best performance among the three approaches once the database exceeds a certain size.

Figure 5.4, Query Speed Test (Synthetic Data): query time vs. database size for BBF, VA-File and FFVA

Figure 5.5 shows the test results on real data. The test data are collections of 128-dimensional image features extracted from image pairs using the SIFT approach. The search time per query is the total query time divided by the number of features in the query image. An exact NN-search using a hierarchical structure becomes even slower than an exhaustive sequential scan when the dimensionality is high. The standard VA-File approach spends too much time on bound-distance calculation when facing non-uniform real data, while FFVA and BBF demonstrate better and comparable performance.

Figure 5.5, Query Speed Test (Real Data): CPU time per query vs. database size for exhaustive scan, BBF, VA-File, FFVA and KD-tree

5.2.4 Memory requirement

Memory usage has to be considered when developing an effective algorithm for searching large databases on mobile platforms. BBF and similar tree-structure approaches typically need to load and store the entire dataset in system memory for online processing: constructing the tree structure, iteratively tracing the tree branches, and searching for the optimal tree nodes. This strategy is apparently impractical for searching very large databases, where the data size easily overwhelms the available memory space. One of the major advantages of FFVA and VA-File is their low memory requirement, provided by the flat and linear SPT data structure. Figure 5.6 shows the memory usage of the three methods, BBF, VA-File and FFVA, when querying a real feature database containing 3,000 feature vectors. We also tested the relationship among query time, data dimensionality and memory blocks (i.e., memory pages). These results clearly indicate that the memory usage of FFVA is far more efficient than that of BBF (figure 5.7). Therefore, FFVA is suitable for large database retrieval applications or for systems with limited memory space, such as mobile computing devices.
Figure 5.6, Memory Requirement: memory required for the search process vs. database size for BBF, VA-File and FFVA (note: one unit approximately equals 512 bytes)

Figure 5.7, Test for Query Speed Considering Memory Paging: query time vs. page size for BBF and FFVA (database size: 50,000; feature dimension: 120; page size ranges from 240KB to 2.4MB)

Chapter VI: Application Example, Augmented Museum Exhibitions

This chapter presents an application system, called Augmented Museum Exhibitions, that combines mobile computation, Augmented Reality and image matching techniques to demonstrate the effectiveness and utility of our proposed real-time image matching system.

Today, museums commonly offer a simple form of virtual annotation in the form of audio tours. Attractions ranging from New York's Museum of Modern Art to France's Palace of Versailles provide taped commentary, which visitors may listen to on headsets as they move from room to room. While a fairly popular feature, audio tours only offer a linear experience and cannot adjust to the particular interests of each visitor.

An AR museum guide (illustrated in figure 6.1) is able to add a visual dimension. Pointing a handheld device at an exhibit, a user might see overlaid images and written explanations. This virtual content may include background information, schematic diagrams, or labels of individual parts, all spatially aligned with the exhibit itself.

Figure 6.1, AR Example: An example of how a museum visitor might use an augmented exhibit implementation on a cell phone.

The virtual content could be interactive as well. The AR content for a piece of art, for example, might allow the user to switch between background information about the artist, a description of the historical context of the piece, and a list of related works on display in the museum. An application for a science museum could display the inner workings of a machine, with the user able to adjust the level of detail.

6.1 Augmented Reality and Related Works

Augmented Reality (AR) is a natural platform on which to build an interactive museum guide. Rather than relying solely on printed tags or pre-recorded audio content to aid visitors, an AR system can overlay text and graphics on top of an exhibit image and thus provide interactive, immersive annotations in real time. Graffe et al. [36], for example, designed an AR exhibit to demonstrate how a computer works. Their system relies on a movable camera that the user can aim at various parts of a real computer. A nearby screen then displays the camera image annotated with relevant part names and graphical diagrams. Schmalstieg and Wagner [77] presented a similar system using a handheld device. As the user walks from place to place, the AR content not only provides information about the current exhibit, but also acts as a navigational tool for the entire museum.

Both of the above systems, and many others such as [75], [1] and [67], rely on printed markers for recognition and tracking. That means that for every object to be incorporated into the AR application, a marker needs to be printed and placed in the environment in such a way that it is always clearly visible. If at any time no marker is visible inside the camera's field of view, then no AR content can be rendered.
This can lead to frustration when a particular exhibit is partially or even fully visible but its associated marker is obscured, perhaps because another visitor is standing in the way. Our work seeks to avoid the need for artificial markers by recognizing the target objects themselves, in this case 2D drawings and paintings. Thus, as long as an exhibit is visible to the user, the application can render the associated AR content. A number of approaches have been proposed for building natural-feature-based AR, such as [68], [23] and [81]. In the presented work, we use our real-time image matching technique, MVKP, as presented in chapter 3.

Our information retrieval system is based on a simplified version of the multi-tier client/server architecture described in [67]. The user interacts with a client application that recognizes exhibits and sends their unique ID numbers to the server. The server then responds with relevant data for that exhibit. Thus, even with a large number of clients, all content can be controlled from a single server.

6.2 System Overview

Figure 6.2: System Overview of the Developed Augmented Exhibition System

The vision-based augmented exhibition system we propose is composed of four major components:

• Acquiring the query image: the system accepts query images captured by a simple camera attached to a mobile device. It can also accept a single JPG image or a video clip.

• Adapted MVKP: there are two main tasks for this component. First, given a painting, it builds the feature set for the painting. Low-resolution images (around 200x150) are enough, and there is no need to extract the painting from the image in order to remove the background. Second, given a query image, it matches the painting against the database. If one of the trained paintings is matched, it establishes feature correspondences between the query image and the database image. The output is the painting's ID and its 3D pose with respect to the camera.

• Remote Server: After the server receives the painting ID over the local network, it retrieves the corresponding information from its database (an XML file), which it sends back to the client.

• Overlaid Display: The client application, upon receiving the associated annotations from the server, displays them as overlaid virtual content on top of the current camera image. The virtual content includes the name of the painting and the artist as well as a URL pointing to related information on the Internet. The visitor can click on the URL, which will open a web browser and bring up even more information.

Figure 6.2 illustrates the overall structure of our augmented exhibition system. The two major components, adapted MVKP and the remote server, are described in the following section.

6.3 Adapted MVKP and Information Retrieval

Based on the practical requirements of the application, we chose to use MVKP as the foundation for painting recognition and 3D pose recovery. As described in chapter 3, the major advantages of MVKP are: (1) robustness to lighting changes and image noise, (2) invariance to geometric distortion, (3) the ability to handle complex conditions like occlusion and cluttered backgrounds, (4) sufficient accuracy for pose recovery, (5) particularly good performance for rigid planar objects like art paintings, (6) real-time, reliable performance, and (7) feature distinctiveness when considering a large feature database. All of these advantages make it ideal for a vision-based augmented exhibition system.
We also introduce several important adaptations to MVKP to better accommodate the requirements of AR. In our augmented exhibition system, the outputs of the MVKP method are a painting's ID and its 3D pose. The ID is then sent to a remote server through a WiFi LAN connection to retrieve the related complementary information, which is displayed as virtual content on top of the painting.

6.3.1 MVKP adaptations

Originally, the MVKP method was used to find correspondences between two input images, which means: (1) there is no need to detect the existence of the object of interest, and no search among multiple objects is involved, and (2) thresholds, like the one in the distance ratio criterion, can be set manually, since the user knows the query image beforehand. However, we have to make several important adaptations to the original MVKP method to meet the application requirements of the augmented exhibition system.

First, in the augmented exhibition system there can be hundreds of different paintings displayed in the museum; some of them are highly textured and some are not. Figure 6.3 shows two representative paintings. The right painting returns 50% more feature points than the left one after running the same detector. For paintings with low texture, the number of feature points returned by the detector will also be low, which means the threshold in the distance ratio criterion should also be low for it to work properly. Furthermore, there are other factors, such as the feature distinctiveness of a specific painting, that also affect the same threshold. And there are other thresholds sharing the same dilemma, for example those in the RanSac algorithm. To tackle this problem, we introduce a Dynamic Threshold into the MVKP method. Take the threshold of the distance ratio criterion as an example. First we set up a global goal for how many correspondences we would like to keep after applying the distance ratio filter. At run time, we periodically (10 times in our experiments) check the number of correspondences the method has found so far, compare it with the global goal, and adjust the threshold accordingly (a small code sketch of this adjustment loop appears later in this subsection). Experimental results show that, with the help of the automatically adjusted thresholds, for highly textured paintings we can keep the number of correspondences low, and accordingly keep the computational cost low, while for low-textured paintings we still obtain enough correspondences to recognize the painting and recover its pose.

Figure 6.3, Low and High Texture Paintings: All artwork courtesy of Riko Conley, USC Roski School of Fine Arts

Second, the user of the augmented exhibition system can point the camera anywhere inside the museum, so the query image might contain no painting at all. If there is one, we need to search and decide which painting it is. Based on our experiments, we found that the size of the largest consistent correspondence set after running RanSac is the best criterion for determining which painting, if any, is contained inside the query image.

Last but not least, for image matching methods based on local features, especially when the query image has significant viewpoint and lighting changes, consistency check methods like RanSac are necessary in order to incorporate the global information.
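The dynamic-threshold mechanism described above can be summarized in a few lines. The sketch below applies it to the distance-ratio filter only and is illustrative: the check period, step size, bounds and goal value are hypothetical placeholders rather than the tuned system parameters, and the running match count would come from whatever ratio-test matcher the system uses. The same pattern extends to the other thresholds mentioned, such as those inside RanSac.

def adjust_ratio_threshold(threshold, matches_so_far, features_seen,
                           total_features, goal, step=0.02,
                           lo=0.4, hi=0.8):
    """One periodic adjustment of the distance-ratio threshold (sketch).

    If the matcher is on pace to exceed the global goal of correspondences,
    tighten the threshold; if it is falling short, loosen it.  Values for
    step and the bounds are placeholders, not the thesis parameters.
    """
    if features_seen == 0:
        return threshold
    projected = matches_so_far * total_features / features_seen
    if projected > goal:
        threshold = max(lo, threshold - step)   # high-texture painting: be stricter
    elif projected < goal:
        threshold = min(hi, threshold + step)   # low-texture painting: be looser
    return threshold

# Example: check 10 times while matching 500 query features, aiming for ~60 matches.
threshold = 0.6
for checkpoint in range(1, 11):
    seen = 50 * checkpoint
    found = 4 * checkpoint        # stand-in for the matcher's running count
    threshold = adjust_ratio_threshold(threshold, found, seen, 500, goal=60)
print(threshold)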
One problem involving RanSac in an AR system is stability. RanSac randomly chooses three correspondences to fit an affine transformation and, for performance reasons, terminates after a limited number of iterations. Therefore, there is no guarantee that the correct affine transform can always be found. A failure of RanSac typically means one or two frames of "target lost", which should be avoided in AR applications. To solve this problem, we assume that when a certain painting is detected in one frame by the system, it is more likely that the same painting will appear in the following frames. In practice, after one painting is detected, the system will focus only on that painting's features in the following frames, even after it encounters a RanSac failure. The system will revert to general search mode only when the RanSac process fails a certain number of consecutive times. Through this implementation technique, we achieve stable and smooth displays for the augmented exhibition system. Beyond this simple technique, every frame of the input is processed independently and no tracking technique is involved in our current system.

6.3.2 Information retrieval

Our system is based on a client/server architecture, where the client performs all of the visual processing and recognition and the server maintains a database of all known exhibits and their associated information. When the client positively identifies an exhibit, it sends a unique ID to the server. The server looks up the ID in its database and retrieves the relevant data, which may include the name of the work, the name of the artist, and possibly links to related web pages. It sends this data back to the client to be displayed over the current camera image of the exhibit. The advantage of using a client/server model is that changes to the underlying information can be made in one place. Whenever a client application recognizes an exhibit that it has not seen recently, it sends a new request to the server to retrieve the latest data. Due to the ready availability of wireless LAN technologies such as WiFi, it is easy to have a mobile client make periodic requests to a server. Only one send-receive round trip is needed for each exhibit, so the client and server do not need to maintain a persistent open communications channel.

6.4 Real Museum Test

In our real museum test, we captured a video from a gallery at the USC School of Fine Art and processed the video in our augmented exhibition system with four paintings trained. During the video capture, we intentionally included many challenging cases such as out-of-plane rotation of the camera, moving highlights on the paintings, sudden changes of illumination, intense shaking of the video camera, etc. Overall, our augmented exhibition system demonstrates fast and reliable performance. Figure 6.4 illustrates the results with the matched correspondences displayed or hidden. A video clip showing the real-time processing together with the overlaid virtual content is also available online.

Figure 6.4, Real Museum Test: (left) with image correspondences and recovered pose displayed under lighting and viewpoint changes, and (right) with retrieved information displayed.

Chapter VII: Extending to 3D Range Data

The traditional image matching problem is to find matching primitives between digital images from optical sensors. Many well-known existing methods are based on texture analysis around interest points. Recently, there has been a growing need to register 3D range sensing data with digital photography in nadir, oblique or even ground views.
For example, the newly emerged handheld 3D scene modelling and texturing, and the photorealistic modelling and real-time rendering of urban scenes, all require efficient registration of 3D range data, or models derived from it, onto aerial or ground 2D images. In the medical image processing domain, there is a long-standing concern about how to automatically align CT or MR images with optical camera images. Frequently, those data are captured not only by different sensors, but also at significantly different times and with possibly large 3D viewpoint changes.

7.1 Introduction and Related Works

Many existing methods try to reconstruct 3D points from 2D image sequences, and then match 3D primitives from both sides. The limited availability of appropriate multiple images associated with 3D range data, the well-known challenge of inferring 3D from 2D, and the difficulty of establishing correspondences among 3D primitives when there is no prior knowledge about the initial pose lead us to a different approach based on region matching between optical images and depth images projected from range data.

Chapter 7 presents our automatic 2D-3D registration system (overview in figure 7.1), which efficiently registers 3D LiDAR data with 2D images of diverse viewing directions and from various sources. Figure 7.2 shows typical inputs and outputs of our system. The first stage focuses on the different-sensors problem and assumes similar viewing directions (nadir views) for the inputs. Based on our basic assumption that the dominant contours of many interest regions (ROI), which typically are well-separated regions of individual buildings, are repeatable under both optical and range sensors, we extract, describe and match ROI from the optical images and from the projected depth image of the LiDAR data.

Specifically, we propose an automatic registration method based on matching local interest regions extracted from 2D images and from depth images of 3D range data for urban environments. The regions we are interested in (ROI) are typically well-separated regions of individual buildings. Global context information is implicitly used for outlier removal and matching. Our approach can register images from different sensors with large initial location and scale errors. Although today there exist systematic ways to obtain initially well-aligned 3D and 2D data at the same time for large-scale scenes, possible applications of our work include data fusion from different sources and sensors, and updating an existing GIS (Geographic Information System) with new content, when data from different sources may have non-unified calibration or no georeference at all, e.g. historic photos or photos from common users. Furthermore, the ROI extraction component proposed in this thesis is an important prerequisite to a variety of recognition, understanding, and rendering tasks in urban environments.

The next section (section 7.2) provides details of our LiDAR segmentation method (section 7.2.1), our aerial image segmentation method (section 7.2.2) and the region matching component (7.2.3). We also introduce several techniques that significantly improve the whole stage's efficiency (7.2.4).

Figure 7.1, Overview of Our Proposed 2D-3D Registration System

The inputs to the second stage (section 7.3) are all from optical sensors but may have large viewpoint changes. In order to match other images with those already registered nadir images, we propose two different methods. One (section 7.3.1) is based on the MVKP method introduced in the previous sections.
We first obtain distinctive and invariant features of the nadir images through a multiview-based training process. Next, other images with different viewing directions are rapidly matched with those nadir images and thereby indirectly registered with the 3D range data. Excluding the training process, which happens offline and only once for each location, our system is able to process about 10 frames per second at this particular stage, which makes it ideal for efficient data fusion and urban scene rendering, even on mobile devices with limited computational resources.

Figure 7.2, System Illustration: (a) inputs; (b) registration results illustrating our two-stage system; (c) applications to urban rendering.

Addressing the low success rate of the first approach, we also propose a novel urban image matching method (section 7.3.2) utilizing multiple image clues. Specifically, we explore robust image features generated by combining interest regions and edge groups extracted from urban images. Initial correspondences are established based on the similarity measurement of our high-level features. A cost ratio, combined with global context information, is used to remove outliers. We believe our hybrid approach is more suitable than traditional texture features for such scenes. This proposed urban image registration method has been tested using real aerial urban photos under diverse viewing conditions. Experiments report that our registration success rate is nearly doubled when compared with classical methods such as [60] and [95].

Concerning related works, traditional texture-based image matching approaches such as [60], [95] and [89] cannot be directly adapted to this newly emerged 2D-3D registration problem, basically because range sensors capture no texture information. To tackle the problem, many recently developed methods first reconstruct dense or sparse 3D point clouds from 2D images and then use 3D features to establish correspondences, with an initial alignment provided by positioning hardware. Zhao et al. [107] use motion stereo to recover dense 3D point clouds from video clips. The ICP (Iterative Closest Point) algorithm is used to register those reconstructed 3D points to LiDAR data with an initial alignment provided by positioning hardware such as GPS (Global Positioning System) and IMU (Inertial Measurement Unit). Later work of Ding, Lyngbaek and Zakhor [26] detects 2D orthogonal corners (2DOC) as primitives to match single oblique images onto LiDAR data using similar positioning hardware. It achieves overall 61% accuracy in a test of 358 images, and the processing time per image is only several minutes, in contrast to 20 hours in [35]. We agree that accurate and uniform georeference can be assumed to be associated with particular inputs at certain stages of the registration to significantly simplify the problem, so that only small offset errors need to be corrected. However, it is not reasonable to assume such assisting information is always available for general inputs from diverse sources, especially for data fusion and GIS updating using historic data and oblique or ground photos from common users.

An automatic system for texture mapping 2D images onto ground-scanned 3D range data is proposed by Liu et al. High-level features such as rectangles are extracted and matched for 2D-3D registration [58]. A ground-based manual range scan is able to provide rich 3D details about a particular building's structure. However, it is not feasible to obtain such data efficiently for large urban scenes.
Moreover, because both the range and optical data are ground-captured, focusing on one building for each test, the range of possible initial misalignments is restricted. Their experiments report that only a portion of the 2D images can be independently registered with the 3D range data, and the system will fail in parts of the scene without sufficient features for matching. The working range can be expanded [59] by applying structure from motion to a sequence of images to produce sparse point clouds and aligning the dense and sparse point clouds. Both [107] and [59] try to use multiview geometry to recover 3D point clouds from image sequences. The limited availability of appropriate multiple 2D images associated with 3D range data, the well-known challenge of inferring 3D from 2D, and the difficulty of finding correspondences among 3D primitives without good initial pose estimates all place practical restrictions on such approaches.

Other related works include the following. BAE Systems [42] uses high-resolution stereo images to reconstruct 3D. Specially designed hardware is needed for the inputs, the problem of fusing other views still remains, and, unlike LiDAR, such approaches typically have difficulty handling vegetation areas. Viola and Wells [93] use mutual information to align un-textured models to a scene. The first problem is that clean models are difficult to obtain from noisy, low-resolution 3D range data. Moreover, using mutual information as the similarity measurement lacks spatial information, since it processes each pixel independently, and consequently tends to produce unstable results.

Our registration system has been tested using datasets containing nearly 1,000 images. Experimental results in terms of robustness to different sensors, viewpoint changes, large geometric distortions and missing partial data demonstrate the potential of our system.

7.2 Registration handling different sensors

This section presents the first stage of our two-stage registration system, which handles the different-sensors problem. The inputs are nadir-view photography of urban scenes without georeference and depth images projected from LiDAR data. Their relative scale, rotation and translation are unknown. The outputs are an initial point-to-point registration and the recovered transformation.

7.2.1 LiDAR segmentation

The first component of this stage is ROI extraction from 3D range data. Similar topics have been intensively studied ever since range sensing technology emerged. Early work of Vestri and Devernay [92] extracts planar surface patches from a Digital Elevation Model, which is obtained from laser altimetry. Tall trees, which produce regions of high range responses similar to buildings, have always been a major problem for LiDAR data segmentation. Kraus and Pfeifer [52] propose an iterative approach, in which 3D measurements with residuals to the current surface larger than a threshold are removed and the surface is re-computed using the remaining data. In [76], Rottensteiner et al. use local plane estimation to group local plane patches into larger regions. Initial plane patches are obtained from seed locations and co-planarity constraints are used to merge over-segmented regions. Recent work of Verma et al. [91] proposes a region-growing based approach using Total Least Squares to extract local planar patches and merge consistent patches into larger regions. The largest region is simply labeled as ground while the others are buildings, which is not true for heavily urbanized areas. Later work by Matei et al.
[65] overcomes this limitation by applying ground classification prior to building segmentation, with the help of minimum ground filtering and surfel grouping.

Different from many previous approaches, which use LiDAR segmentation results directly for modeling purposes and desire as many details as possible for each individual building in order to produce an accurate model, as the first component of our registration system we are not interested in the slope angle of a particular roof or other interior details. Our goal is to extract the separated and most external boundaries of interest regions, which will be used for region matching later in our 2D-3D registration system. Based on this goal, we propose a novel ROI segmentation method for airborne LiDAR data based on a local consistency check (LCC) and region growing with watershed. Figure 7.3 shows the whole system overview, including our building extraction component for LiDAR data.

Figure 7.3, Overview of the Proposed 2D-3D Registration System: ROI extraction from the 3D range data and from the 2D image, followed by region matching to produce 2D-3D point-to-point correspondences.

The first step of our LiDAR segmentation method is normalization. Because the intensities of the input depth images tend to be clustered, we normalize the depth images to enlarge the effective range and facilitate later processing. Two different normalization methods are compared. The first one utilizes a lookup-table mechanism based on the integral of histograms. The other method simply translates and rescales all the intensities to the same range. Experiments show that the first method reveals a lot of detail inside building or ground regions, which is ideal for direct modeling but causes confusion for our ROI extraction (figure 7.4). So we apply the second method to normalize the input depth images.

The second step is region growing with the local consistency check. Initial segmentations are generated using region growing with watershed, and the LCC is used to remove outliers like trees. The original watershed algorithm ([12], [2]) treats the strength of detected edges as elevation information, which is used to hold the flooding water within separated "lakes". For depth images, the intensity value of each pixel is already the hardware-obtained elevation. So we simply negate all the intensity values in the depth images and then fill in water from marker locations in a natural bottom-up way. This corresponds to a recursive region-growing process starting from marker locations as seeds, with the depth information as the boundary constraint.

Figure 7.4, Comparison of Normalization Methods: Left: original input image; Middle: normalized by the histogram-based method; Right: normalized to the same range.

Each currently growing region is recursively expanded subject to three boundary conditions (a compact code sketch of these tests is given at the end of this subsection).

First, the current pixel's depth value in the positive depth image must be above a soft threshold (WL_dyn), which fluctuates dynamically around the global water level (WL). A region with a smaller current area (area_cur) will have a larger adjustment range so that the expansion won't be stuck in isolated small peak regions, such as a high clock tower on top of a building. When the current area is relatively large (area_cur > area_MIN), we have WL_dyn = WL. Otherwise,

WL_dyn = WL − (1 − e^(area_cur − area_MIN)) · range_MAX,

where area_MIN is the minimum acceptable area and range_MAX is the maximum adjustable range.
Figure 7.5, Illustration of Various Interest Points: illustration of edge points (blue), corner points (green) and noise points (red).

Second, if the neighboring pixels along both the positive and negative directions of the x/y axis fail the first condition, we say the central pixel is fully isolated; if only one direction is isolated, we call it a partial isolation. A partial isolation on either the x or y axis indicates a possible edge point, while partial isolation on both the x and y axes indicates a possible corner point. However, a full isolation on either the x or y axis likely indicates a noise pixel outside the ideal boundary, and further expansion will be stopped. Sample edge, corner and noise points are illustrated in figure 7.5.

Last, the current pixel (I_c) should pass the LCC. Here a two-dimensional Gaussian is centered at the current pixel, and the neighboring pixels with approximately identical distances from I_c form neighboring circles (suppose I_{i,j} is the j-th pixel on the i-th circle away from I_c). We apply a stricter consistency requirement to closer pixels. The current pixel will pass the LCC only if the number of consistent neighboring pixels is above a certain percentage (P) of the total number of scanned pixels (Num).

Figure 7.6, Effectiveness of LCC: from left to right, the original depth image, the segmentation with and without LCC.

Let T_i represent the consistency threshold for the i-th circle with radius r_i; we have

T_i = (1 − exp(−r_i² / (2σ²))) · I_c.

Now define a binary function:

δ_{i,j} = 1 if |I_{i,j} − I_c| < T_i, and 0 otherwise.

So the last boundary condition can be expressed as:

∬_{i,j} δ_{i,j} dj di < P · Num.

Trees, especially those that are tall and close to buildings, have always been a challenging problem in LiDAR segmentation [52]. Many recent approaches (e.g. [71], [43]) compute expensive 3D normal vectors for each pixel to remove trees. Experimental results showed that over 95% of the trees can be removed by our method at a much lower computational cost. Figure 7.6 shows a challenging tree-removal example in which a tall tree partially overlaps the building. Without computing the expensive 3D normal vectors, the local consistency check is able to confine the region growing within the actual boundary of the building.

To determine marker locations for the watershed, a uniform grid is placed onto the depth image. The three marker conditions are: (1) the pixel should pass the LCC; this eliminates many tree markers because of the noise-like and inconsistent nature of trees' depth values. (2) The intensity of this pixel, which corresponds to the relative depth of the location, is above the global water level (WL). (3) In case multiple pixels inside the same cell satisfy conditions (1) and (2), the one with the highest intensity is chosen. It is possible that all pixels fail some condition and the corresponding cell has no marker at all (e.g. when the cell is placed on the ground or trees).

Figure 7.7, Effectiveness of Region Refinement: the initial segmented regions (1st and 3rd) and the final regions after refinement (2nd and 4th).

The third step is refinement. The initial segmentations, which generally have a low false positive rate, may have jaggy edges along boundaries that are smooth in reality, and small holes inside the region that come not from physical depth discontinuities but from remote sensing noise. To reduce false negatives, the refinement process scans the background pixels within each segmented region's bounding box.
If the majority of a background pixel's neighbors are foreground pixels, then it will be re-labeled as foreground. A pair of examples is given in figure 7.7. Theoretically, the refinement process should run iteratively until the region's area becomes constant. In practice, we observe that segmented regions become stable after only two iterations. Figure 7.7 demonstrates the improvement brought by our refinement process. The segmented regions are cropped and enlarged to show details. Nearly all the small holes inside a region are filled, and a considerable improvement can be observed on the jaggy edges. Edges can be further smoothed by fitting lines or curves to them; there are many techniques, such as [34] and [73], available for this task. However, we find such further refinement, and the additional computational cost associated with it, unnecessary, because the region descriptor we use later in our registration system is robust enough to handle such noise.

Segmented regions with an extremely small size or area are either noise regions or not reliable for future registration, and are therefore discarded. The rest are still highly redundant, because multiple markers placed on one building generate multiple regions. Since our algorithm obtains consistent results for the same region from different markers, we merge regions with a corner distance of less than 3 pixels and an area difference within 5% to remove the redundancy.

In terms of parameter sensitivity, most parameters can be fixed to reasonable values. We find that the only data-dependent parameter is the global water level (WL). Although ideally its value should be the shortest building's relative height, which depends on the details of re-sampling and normalizing the raw LiDAR, experiments indicate that a large range of WL values is acceptable to our system. Additionally, because we have a single uniform source of LiDAR data for each city, the WL value only needs to be set at most once for each of the four cities' datasets.

Finally, all the ROI_range extracted from the 3D range data and their most-external contours (connected closed contours, guaranteed by the intrinsic properties of the watershed method), represented by contour point lists, are saved (figure 7.8). We also develop an algorithm (not covered here) to extract ROI_optical from optical images.

Figure 7.8, Samples of Extracted Shapes: binary images for some segmented regions and contours from the final segmentation results.

Figure 7.9 shows one result for each of our four testing cities. Our building extraction algorithm is able to efficiently extract the dominant contours of most buildings, whether for "challenging heavily urbanized areas" [65] or for areas with many tall trees. There exists a small portion of buildings that do not appear in the final segmentation results. Two possible reasons are that either they had no marker assigned from the very beginning or their overall height is too low compared with the water level value we specified. Such imperfection in the building extraction part is tolerable to our registration system, as long as the dominant contours of many buildings are correctly extracted.

Figure 7.9: Color-coded LiDAR ROI Extraction Results
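As referenced in the description of the region-growing step above, the following is a compact Python sketch of the first and third boundary tests (the isolation test is omitted for brevity). It is a sketch under simplifying assumptions only: an 8-bit normalized depth image, a fixed Gaussian σ, a small set of coarsely sampled neighbour circles, and the dynamic water level as reconstructed in the formula above; names such as dynamic_water_level are illustrative.

import numpy as np

def dynamic_water_level(area_cur, WL, area_min, range_max):
    """Soft threshold WL_dyn: equals WL for large regions, drops by up to
    range_max for very small ones so growth is not stuck on isolated peaks."""
    if area_cur > area_min:
        return WL
    return WL - (1.0 - np.exp(area_cur - area_min)) * range_max

def local_consistency_check(depth, y, x, radii=(1, 2, 3), sigma=2.0, P=0.6):
    """Local consistency check (LCC) around pixel (y, x), sketch version.

    Neighbours on circle i must stay within T_i = (1 - exp(-r_i^2/(2 sigma^2))) * I_c
    of the centre intensity; the pixel passes if enough neighbours are consistent.
    """
    I_c = float(depth[y, x])
    consistent, total = 0, 0
    for r in radii:
        T_i = (1.0 - np.exp(-(r * r) / (2.0 * sigma * sigma))) * I_c
        for dy, dx in [(-r, 0), (r, 0), (0, -r), (0, r)]:   # coarse circle sampling
            yy, xx = y + dy, x + dx
            if 0 <= yy < depth.shape[0] and 0 <= xx < depth.shape[1]:
                total += 1
                if abs(float(depth[yy, xx]) - I_c) < T_i:
                    consistent += 1
    return total > 0 and consistent >= P * total

# Toy usage on a synthetic 8-bit depth image with one raised "building".
depth = np.zeros((64, 64), dtype=np.uint8)
depth[20:40, 20:40] = 180
print(dynamic_water_level(area_cur=10, WL=120, area_min=50, range_max=40))
print(local_consistency_check(depth, 30, 30))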
7.2.2 Aerial image segmentation

One important component of our 2D-3D registration method is ROI extraction from aerial images (major steps in figure 7.10), which can be viewed as a special case of the general image segmentation problem. Related recent works include the following. Comaniciu and Meer [21] propose the nonparametric mean shift segmentation algorithm, which filters the image while preserving regions and edges. The approach computes a weighted average over the points in the neighbourhood to obtain a mean and then shifts the center of the neighbourhood to the new mean. The whole process is repeated until convergence, which is mathematically proven to be guaranteed given a proper weight kernel and shift steps. A graph-based segmentation method using normalized cuts is proposed by Shi and Malik [79]. The method constructs a graph for the input image with vertices representing pixels and edges weighted according to a pixel similarity measurement. The image segmentation problem is treated as a graph partitioning process, during which the dissimilarity between different sub-graphs and the similarity within the same sub-graphs are both maximized. A more efficient graph-based image segmentation method based on pairwise region comparison is proposed in [30], with performance similar to [79]. The method is reported to be able to preserve detail in low-variability image regions while ignoring detail in high-variability regions, a desirable characteristic for many application areas, including building extraction from aerial images. We will compare our aerial image segmentation results with the segmentations produced by [21] and [30].

In contrast to depth images projected from 3D range data, aerial images captured by traditional optical sensors generally exhibit significantly higher variation and complication. Although the effect of viewing parameters such as the distance from the camera to the ground (which affects the scale factor) and the viewing angle (which affects multiple rotation factors and produces hidden surfaces and perspective distortion) is mathematically well understood, the variability brought by those factors can still significantly complicate the building extraction task. Furthermore, different parts of the same building may have distinctive colors or intensities, while some buildings may have an intensity similar to the roads and grounds around them, which leads to over-segmentation or segmentation leaking during the building extraction process. Finally, the actual boundaries of buildings can, in reality, be frequently obscured by shadows or trees. Unlike the variability brought by viewing parameters, factors like the color/intensity variations of the real world and shadows due to different times of the day are beyond our control and difficult (or impossible) to analyze mathematically.

Early work of Lin and Nevatia [57] aggregates lower-level features (edges and lines) in aerial images into higher-level ones (junctions, U-constructs and parallelograms), which are eventually used to reach hypotheses about building locations. The proposed building detection and description system (BUDDS) evaluates the rooftop candidates according to whether sufficient evidence can be retained. The proposed system, especially the rooftop selection stage, is improved through machine learning techniques (Maloof et al. [34]). Four machine learning techniques (nearest neighbour, naive Bayes, C5.0 and perceptron) are evaluated and verified to play an important role in improving the accuracy and robustness of the building extraction system. Recent work of Wu et al. [104] proposes a hierarchical method, compositional boosting, for the similar task of aggregating low-level features into high-level ones.
Their method is able to infer 17 graphlets from images in a common Bayesian framework. They use Adaboosting for the initial detection. The joint And-Or graph of those 17 common structures in one image is iteratively refined through bottom-up proposal and top-down validation steps. The compositional boosting method is used in [72] to detect roofs and roads, and other low-level features such as color histograms and bags of SIFT [60] are used to detect other objects of interest for the aerial image understanding task. Many false positives are reported after the initial detection stage, but the model is allowed to prune inconsistent detections and verify low-probability detections through learning techniques focusing on local context. A different approach is presented by Xiao, Cheng, Han and Sawhney [105] for oblique aerial video understanding. Without 3D information from a range sensor, they produce an estimated depth using 3D reconstruction from the video clip and obtain image registration through SIFT features and on-board positioning hardware. The estimated depth information is used to guide the scene segmentation of the aerial video in a way similar to the segmentation problem of 3D range data.

Sohn and Dowman [83] proposed a building extraction method for satellite images with high object density and scene complexity. They use local Fourier analysis to obtain the dominant orientation angle in a building cluster in order to extract focused perceptual cues such as line segments. The concept is similar to the principal direction in our work, but obtained and used in different ways. The regularized lines partition the image domain into elementary units (Building Unit Shapes) and are later grouped for reconstructing building structures.

Fractal geometry [64] was introduced by Mandelbrot in the early 1980s and soon used for aerial image understanding in [22] (Cooper, Chenoweth and Selvage) and [74] (Priebe, Solka and Rogers). Based on the property of natural image features to fit a fractional Brownian motion model, Cooper et al. propose the fractal error metric as a discriminant function to detect man-made structures of interest in aerial imagery. Experiments on aerial photographs of a residential area with a high content of various vegetation demonstrate that the fractal error metric performs reasonably well in detecting man-made features and is robust to lighting, contrast and even season changes. In the later work of Solka et al. [84], the fractal measurement is combined with classical statistical features, such as the coefficient of variation, for the interest region identification problem using unmanned aerial vehicle imagery. Two fundamental differences from the previous works are: (1) the fractal dimension is defined in the presence of boundaries, which are produced through a wavelet transformation. The boundary map prevents values on the boundary from being used outside the boundary and improves the feature extraction quality in the presence of many different object/texture types. (2) Instead of a global threshold on the fractal error responses of the input aerial image, their final classification stage uses a standard likelihood ratio procedure to decide class membership. Recent work of Cao et al. [18] computes the fractal error metric for each pixel based on a local window and generates a fractal error image. Then the Discrete Cosine Transformation is applied to extract the texture edge features and produce a texture edge image.
Finally, their proposed aerial image segmentation algorithm tries to minimize an energy function representing how well the current boundary contains the interest region.

This subsection presents our ROI extraction algorithm for aerial photography. The proposed algorithm produces initial ROI through a region-growing process utilizing various image cues, from low-level features such as intensity and color preference to high-level ones such as fractal errors, together with multiple assistant information maps (AIMs). The detected initial ROI are further refined by a learning-based region regulation step. This component extracts, from aerial images, buildings' most external contours, repeatable and consistent with those extracted from the 3D range data.

Figure 7.10, ROI Extraction from Aerial Images: preprocessing; AIMs construction and selective smoothing; region growing; learning-based region refinement; output ROI_image.

After preprocessing of the input images, the first key step is AIMs construction and selective smoothing. There are three kinds of assistant information maps the region-growing process frequently refers to: vege-maps, shadow-maps and edge-maps. As the name suggests, AIMs in our system are for assistance purposes only, because perfect extraction of vegetation, shadows and edges is not practical today.

Figure 7.11, Vege-map.

Vege-map (M_vege): By utilizing color information in the aerial photograph, we identify pixels that are dominated by the green channel and are possibly vegetation (see figure 7.11 for an example).

Shadow-map (M_shadow): For each pixel, let I represent the intensity value and (C_r, C_g, C_b) represent its RGB color channels. A pixel is said to be a shadow pixel if

I < T_shadow1 and max{C_r, C_g, C_b} < T_shadow2,

where T_shadow1 and T_shadow2 are thresholds specifying how low the intensity and color channels need to be for a shadow pixel. Because vegetation typically forms low-reflection regions, the shadow-map typically has many overlaps with the vege-map. Figure 7.12 shows both the shadow-map and the result of the shadow-map minus the vege-map.

Figure 7.12, Shadow-map: (a) shadow-map with all the shadow pixels marked blue; (b) those shadow pixels that are not in the vege-map.

Edge-map (M_edge): There are two kinds of edges in our edge-map, the true edges and the in-region edges. Among the initial edges returned by the Canny operator, most are not actual boundaries of ROI (true edges) but rather edge responses within those regions (in-region edges), due to the slope or textures of the roofs, items like air conditioners on top of the building, or even noise from the image sensors. The existence of in-region edges is one primary reason for over-segmentation. Moreover, since our ROI extraction process is a combination of region-driven and edge-driven processing, it is meaningful to distinguish those two kinds of edges from the very beginning. For urban scenes with regular buildings, an edge pixel is deemed part of a true edge unless neighboring horizontal or vertical non-edge pixels have similar hues. The HSV instead of the RGB color space is used because the neighboring pixels of either true or in-region edges tend to be affected by different lighting, and hue is generally more robust under such circumstances.
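A minimal Python sketch of the vege-map and shadow-map construction follows. It is illustrative only: the specific threshold values, the green-dominance test, and the use of the mean RGB value as intensity are assumptions for the sketch; the thesis text only specifies the general conditions, not these exact parameters.

import numpy as np

def build_vege_map(rgb):
    """Vege-map: pixels where the green channel dominates (sketch)."""
    r, g, b = rgb[..., 0].astype(int), rgb[..., 1].astype(int), rgb[..., 2].astype(int)
    return (g > r) & (g > b)

def build_shadow_map(rgb, t_shadow1=60, t_shadow2=70):
    """Shadow-map: low intensity and low maximum colour channel.

    The two thresholds are placeholders; the text only states that
    T_shadow1 and T_shadow2 specify how low intensity and colour must be.
    """
    intensity = rgb.mean(axis=-1)
    max_channel = rgb.max(axis=-1)
    return (intensity < t_shadow1) & (max_channel < t_shadow2)

# Toy usage on a synthetic image: dark lower half, greenish patch in one corner.
img = np.full((32, 32, 3), 200, dtype=np.uint8)
img[16:, :, :] = 30
img[:8, :8] = (40, 180, 40)
print(build_vege_map(img).sum(), build_shadow_map(img).sum())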
The separation of in-region edges from true edges serves two purposes. First, while the true edges become a strict barrier during the region-growing process, the in-region pixels do not: the region-growing process is allowed to pass those in-region pixels, with a certain penalty to the confidence attribute. Second, we perform selective smoothing based on the detected in-region edges. The color and intensity of each confirmed in-region edge pixel are replaced by the average of its non-edge eight-neighbors, helping us eliminate those in-region details which would otherwise compromise segmentation performance.

The next key step is region growing. A uniform grid is placed on top of the aerial image to determine seed locations. Each cell's center P is used as a tentative seed location; if it fails the seed conditions P ∉ M_vege, P ∉ M_shadow and P ∉ M_edge, the cell is equally divided into four smaller cells and the center of each of those four sub-cells is tested again. It is possible that all five tests fail and the corresponding cell has no marker at all (e.g., when the cell is placed on trees). Figure 7.13 shows the locations of the seed points as blue plus signs.

Figure 7.13, Seed Locations for One Sample Aerial Image.

During the region growing process, the current pixel (p_current) will be accepted and recursively expanded only if it meets the three expansion requirements:

1) The fractal error requirement: The theory is based on the property of natural features to fit a fractional Brownian motion model. The definition of fractal error in the image domain concerns two pixel locations (p_c and p_r). The measurement (e.g. intensity) difference of those two locations should be normally distributed with a mean of zero and a variance proportional to the 2H power of the Euclidean distance. For the intensity measurement, if the model fits, the average absolute intensity change across several pairs of pixels should follow exponential scaling. Equivalently, the following equation should hold for two arbitrary pixels (p_c and p_r):

E[|I(p_c) − I(p_r)|] = k · |p_c − p_r|^H,

where E is the topological dimension (the number of independent variables), and in the image domain E = 2; k > 0 and 0 < H < 1 are two parameters. The parameter H is related to the fractal dimension D by D = E + 1 − H. The above equation can be linearized by taking logarithms:

ln(E[|I(p_c) − I(p_r)|]) = ln(k) + H · ln(|p_c − p_r|).

With the linear equation, we can use a machine learning technique to obtain the estimates of H and k. To obtain training data, a window operator is placed on one aerial image's non-building regions. After collecting pixel distances and their associated intensity changes in those regions, least-squares linear regression is used to compute the optimized H and k. This training phase is conducted only once for each city's dataset and has little impact on the computational cost of the whole registration system.

The individual fractal error for a pixel location p_c is calculated as the difference between the actual and estimated values with respect to one of its neighboring pixels p_r:

F_error(p_c, p_r) = |I(p_c) − I(p_r)| − k · |p_c − p_r|^H.

Finally, the overall fractal error (OFE) for p_c is computed as the root mean square (RMS) of these individual errors over a local window centered at p_c:

OFE(p_c) = sqrt( (1/n) · Σ_{p_r} F_error(p_c, p_r)² ),

where n is the number of pixels considered in the local window. Our experiments indicate that a window size of 7 × 7 is good for most cases.
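The following Python sketch illustrates the offline fit of k and H and the per-pixel OFE computation, following the formulas as reconstructed above. It is a sketch under stated assumptions: synthetic training and test patches, randomly sampled pixel pairs, and a plain log-log least-squares fit standing in for the learning step; the function names are illustrative.

import numpy as np

def fit_fbm_parameters(image, samples=2000, max_offset=5, rng=None):
    """Estimate k and H from ln E[|dI|] = ln k + H ln |dp| by least squares (sketch)."""
    if rng is None:
        rng = np.random.default_rng(0)
    h, w = image.shape
    dists, dvals = [], []
    for _ in range(samples):
        y, x = rng.integers(max_offset, h - max_offset, 2)
        dy, dx = rng.integers(-max_offset, max_offset + 1, 2)
        if dy == 0 and dx == 0:
            continue
        dI = abs(float(image[y, x]) - float(image[y + dy, x + dx]))
        if dI > 0:
            dists.append(np.log(np.hypot(dy, dx)))
            dvals.append(np.log(dI))
    H, ln_k = np.polyfit(dists, dvals, 1)          # slope = H, intercept = ln k
    return np.exp(ln_k), H

def overall_fractal_error(image, y, x, k, H, win=7):
    """RMS of the individual fractal errors F_error over a win x win window."""
    half = win // 2
    errs = []
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            if dy == 0 and dx == 0:
                continue
            actual = abs(float(image[y, x]) - float(image[y + dy, x + dx]))
            estimated = k * np.hypot(dy, dx) ** H
            errs.append(actual - estimated)
    return float(np.sqrt(np.mean(np.square(errs))))

# Toy usage: fit on a noisy "natural" patch, then compare the OFE of a natural
# pixel with that of a pixel sitting on a sharp man-made edge.
rng = np.random.default_rng(2)
natural = rng.normal(128, 10, size=(128, 128))
k, H = fit_fbm_parameters(natural, rng=rng)
edge = np.tile(np.r_[np.zeros(64), np.full(64, 255)], (128, 1))
print(round(overall_fractal_error(natural, 64, 64, k, H), 1),
      round(overall_fractal_error(edge, 64, 63, k, H), 1))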
The returned overall fractal error measures the degree to which the local neighborhood of a particular pixel fits the fBm model previously trained on labeled non-building regions. A low OFE indicates that the center pixel's neighborhood is more likely to belong to a non-building region, so the center pixel is excluded from the current growing region. A center pixel with sufficiently high OFE passes this expansion requirement. We select this threshold conservatively, so that a high false positive rate is allowed: because the fractal error is the first but not the only requirement for p_current to be accepted, a false positive has many chances to be removed by the following requirements, while there is no way for a false negative to be recovered later in our current system.

In contrast to other approaches using a fractal error metric, such as [22] and [18], we never compute a fractal map for the entire aerial image, because many regions of the aerial image are never reached during the region-growing process due to one expansion requirement or another. Instead, we compute on demand: a pixel's OFE is computed when the region-growing process first reaches that pixel, and the result is cached in case the same pixel needs to be accessed again later.

2) Requirements from the AIMs: After the fractal error requirement, the current pixel is further tested using the three AIMs. It fails this requirement if either p_current ∈ M_vege or p_current ∈ M_shadow. The shadow-map requirement can be relaxed in heavily urbanized scenes with long, building-overlapping shadows, but the vege-map requirement is strict, since the chance of the vege-map significantly overlapping building regions is very low. If the current pixel belongs to an in-region edge of the edge-map, it still passes this test, though with a penalty to the region segment's confidence measurement, or equivalently a contribution to its uncertainty measurement. If the current pixel belongs to a true edge, it is neither accepted nor further expanded, but no penalty is applied either.

3) The dynamic intensity range: Finally, the current pixel's intensity must lie within the current dynamic intensity range, defined by two variables: the upper bound (U_range) and the lower bound (L_range). Both are initialized to the intensity of the initial seed point. The range is expanded simultaneously with the region-growing process, with a limit on the range's length (range_len). The current pixel immediately passes the dynamic intensity range requirement without any update if

L_range < I(p_current) < U_range.

Otherwise, we introduce a tolerance threshold T_range as an expansion limit. The threshold is softened based on the current area, to handle the case where the current point falls into a small distinct region contained in a larger region we are interested in. The current pixel still passes this requirement and updates the range if, when the current area is smaller than the minimum acceptable area,

I(p_current) > L_range − (T_range + (1 − e^{−(area_cur − Area_min)}) · Range_max) and
I(p_current) < U_range + (T_range + (1 − e^{−(area_cur − Area_min)}) · Range_max),

or, when the current area is larger,

L_range − T_range < I(p_current) < U_range + T_range,

where Range_max is the maximum adjustable range for T_range.
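The range test can be written compactly. The sketch below follows the reconstruction above; the sign of the exponential term and the variable names are assumptions where the source is ambiguous, and the caller is assumed to update L and U (subject to range_len) when the test passes.

```python
import math

def passes_intensity_range(I_p, L, U, T_range, Range_max, area_cur, area_min):
    """Dynamic intensity range requirement, sketched from the description above.
    Returns True when the candidate pixel intensity I_p may join the region."""
    if L < I_p < U:            # already inside the current range: accept, no update
        return True
    if area_cur < area_min:    # adjust the tolerance relative to the minimum area
        soft = T_range + (1.0 - math.exp(-(area_cur - area_min))) * Range_max
    else:
        soft = T_range
    return (L - soft) < I_p < (U + soft)
```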
Each pixel of the aerial image is associated with a 2-bit attribute called color preference. It is set to 1, 2 or 3 if the corresponding channel is dominant, or 0 if no channel dominates. A region's color preference is set to the color preference of its seed pixel. We use a stricter T_range value if the current expanding pixel has a color preference different from that of the growing region. The 2-bit color preference attribute is very robust against challenging factors such as contrast changes and shadows: most buildings have the same color preference for all of their pixels regardless of the season of the year or the time of day. However, a few buildings have different color preferences for different parts and are likely to be over-segmented because of it. This problem is handled by region merging, which is the last step of our aerial image segmentation component.

The pixels that pass the above three expansion requirements form the initial ROI. Regions with high confidence should be those clearly distinguished from the surrounding background, and consequently have a small dynamic intensity range (DIR). Moreover, a region has high uncertainty if it contains a large number of in-region edge pixels (#IREP). Therefore, we define a region R's uncertainty as

UCT(R) = (1 + DIR / range_len) · #IREP.

A larger region has a higher chance of encountering in-region edge pixels. To avoid penalizing size, we compute the uncertainty per pixel (UPP) as

UPP(R) = UCT(R) / R.area.

Figure 7.14, Color-coded Initial Segmentations of the Whole Aerial Image.

Initial segments with comparatively large UPP or small size/area are discarded; the rest are called ROI candidates. UPP is also used in the region merging step and in the final region matching component.

The last step is ROI candidate refinement. The actual number of buildings in the scene is typically less than half the number of ROI candidates, because many candidates are false positives such as grounds and roads, and some buildings are over-segmented due to factors such as shadows. The candidate refinement consists of two steps handling these two problems respectively. First, learning-based region regulation prunes those ROI that are too irregular to be building regions or parts of such regions. For each ROI contour, we construct x and y histograms in the rotation-relative frame and compute two attributes measuring their peak strength. Linear Discriminant Analysis is applied in the 5D augmented space to decide a linear boundary, which results in a quadratic decision boundary in the original space. Around half of the ROI candidates are pruned by this step (details in section 7.2.4). Second, region merging iteratively merges regions that are spatially close to each other (especially when their color preferences are compatible) to form additional interest regions. Only ROI candidates with higher confidence (lower UPP) enter the region merging step, because regions with high UPP already contain too many discontinuities. The outputs of our aerial image ROI extraction are the interest regions (ROI_aerial) and their contour point lists.
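As a small illustration of the confidence bookkeeping above, the following sketch computes UPP for a candidate region; the threshold in the usage comment is a hypothetical value, not one from the thesis.

```python
def region_uncertainty_per_pixel(dyn_intensity_range, range_len, n_in_region_edge_pixels, area):
    """UPP = UCT / area, with UCT = (1 + DIR / range_len) * #IREP, as defined above.
    Lower UPP means a cleaner, more confident region candidate."""
    uct = (1.0 + dyn_intensity_range / float(range_len)) * n_in_region_edge_pixels
    return uct / float(area)

# Hypothetical usage: keep only candidates below an empirical UPP threshold.
# candidates = [r for r in initial_regions
#               if region_uncertainty_per_pixel(r.dir, RANGE_LEN, r.n_irep, r.area) < UPP_MAX]
```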
7.2.3 Region matching

Given dominant and most-external ROI contours from both the aerial images and the 3D range data, we choose the shape context [7] as our contour descriptor. As a histogram-based approach, it handles issues such as pixel location error well. It also tolerates various shape deformations (a common situation in our case due to imperfect segmentation) while capturing the essence of similarity. Last, shape context generates one descriptor for each contour point, which enables us to establish point-to-point dense correspondences.

Each ROI's contour points are uniformly sampled to form a contour point list (CPL) of fixed size (N_CPL). We order the list counter-clockwise, starting from the point with the smallest y coordinate. Each CPL point j on ROI i is described by its relative angle difference θ_{j,k} (to the other points k of the CPL, k ≠ j) and its logarithmically normalized distance r_{j,k}, using a log-polar histogram (visualized in figure 7.15):

H_{i,j}(b_r, b_θ) = #{ k : 0 < k < N_CPL, k ≠ j, r_{j,k} ∈ bin(b_r) and θ_{j,k} ∈ bin(b_θ) }.

Scale invariance is achieved by distance normalization (we normalize distances using the size of the ROI bounding box) and by placing shapes of different scales into histograms with a fixed number of r bins. For rotation invariance [8], a tangent vector is computed at each point and treated as the x-axis, so that the descriptors are based on a relative frame that automatically rotates with the tangent angle.

Figure 7.15, ROI Descriptors: The first column is one ROI extracted from LiDAR data. The second and third are the corresponding ROI extracted from an optical image. Notice the similarity between the descriptors of the 1st and 2nd columns.

Despite the many efforts in our ROI extraction stage, over-segmentation and segmentation leaking can still be observed among ROI_a and ROI_r. It is therefore still important to allow partial matching (figure 7.16) in the region matching stage, which we achieve by forming partial descriptors: continuous subsets of the original sampled contour points are used, and we re-sample each partial contour to form new partial descriptors. Although imperfectly segmented regions then have a better chance of matching, adding more descriptors also enlarges the necessary search space and raises the distinctiveness requirement. To handle this trade-off, only partial contours containing a larger number of corners, and consequently generating richer and more distinctive partial descriptors, are considered. To further restrict the total number of ROI descriptors, we generally compute partial descriptors only for ROI_r, which are relatively clean and more accurate than ROI_a.

Figure 7.16, ROI Partial Matching.

To search for optimal correspondences, for each ROI_r described by N_CPL histograms H_r(j), all the ROI_a,i (0 ≤ i < num_a) described by H_a,i(j) are sequentially scanned. We measure the similarity of two ROI as the minimum average histogram distance (matching cost) of their corresponding CPL points:

min_{0 ≤ i < num_a}  min_{0 ≤ k < N_CPL}  (1 / N_CPL) · Σ_{j=0}^{N_CPL − 1}  ‖H_r(j) − H_{a,i}((j + k) % N_CPL)‖ / ‖H_r(j) + H_{a,i}((j + k) % N_CPL)‖.

The search for the minimum has a constant low computational cost of O(N_CPL), because the CPL is organized as a counter-clockwise list of most-external contour points: once one point's matching is determined, the remaining points are automatically corresponded. There is no need to solve the general bipartite matching problem. After the search, each ROI_r is associated with its best and second-best matched ROI_a. Among all those tentative correspondences, typically only 10%-40% are correct. The final task is to detect and correct the outliers.
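Before turning to outlier handling, a minimal sketch of the per-point descriptor and of the cyclic matching-cost search described above is given below. It assumes NumPy; the bin counts, the bounding-box normalization and the omission of the tangent-frame rotation handling are simplifications, not the exact implementation.

```python
import numpy as np

def log_polar_descriptor(points, j, n_r_bins=5, n_theta_bins=12):
    """Shape-context style descriptor for contour point j: a 2D histogram of
    the log distances and relative angles to all other sampled contour points,
    with distances normalized by the contour's bounding-box diagonal."""
    pts = np.asarray(points, dtype=float)
    rel = np.delete(pts, j, axis=0) - pts[j]
    dists = np.hypot(rel[:, 0], rel[:, 1])
    angles = np.arctan2(rel[:, 1], rel[:, 0])
    bbox_diag = np.hypot(*(pts.max(0) - pts.min(0)))
    log_r = np.log(dists / bbox_diag + 1e-9)
    hist, _, _ = np.histogram2d(log_r, angles,
                                bins=[n_r_bins, n_theta_bins],
                                range=[[log_r.min(), 0.0], [-np.pi, np.pi]])
    return hist.ravel()

def matching_cost(H_r, H_a):
    """Minimum average histogram distance over all cyclic alignments of two
    contour point lists, each given as an (N_CPL, d) array of descriptors."""
    best = np.inf
    for k in range(len(H_r)):                    # try every cyclic offset
        shifted = np.roll(H_a, k, axis=0)
        num = np.abs(H_r - shifted).sum(axis=1)  # per-point difference norm
        den = np.abs(H_r + shifted).sum(axis=1) + 1e-9
        best = min(best, float(np.mean(num / den)))
    return best
```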
We define the "cost ratio" of each ROI_r as the ratio of the matching cost of its best match to that of its second-best match. A lower cost ratio, combined with a lower UPP attribute, indicates a higher matching confidence. For example, regular rectangular buildings are generally ambiguous and produce a high cost ratio because many buildings have similar shapes, while buildings with unique shapes produce a low cost ratio and high matching confidence (figure 7.17).

Figure 7.17, Cost Ratio: the 1st column shows ROI contours extracted from range data; the 2nd and 3rd columns show the best and second-best matches from the input aerial image. The cost ratio for each row is given.

For comparatively easy tests with a few distinguished buildings in the scene (figure 7.18), correct initial matchings can be found by simply picking several ROI_r with the lowest cost ratios. Each selected ROI_r contributes 10 uniformly sampled contour points, providing a large set of point-to-point correspondences from which a global perspective transformation is estimated using the least-squares method. The result is propagated to the unselected ROI_r using the recovered transformation, producing the final point-to-point correspondences across the entire scene.

Figure 7.18, Results of an Easy Scene: Outliers and inliers are determined only by cost ratio. (a) initial inliers; (b) matchings expanded across the entire scene using the estimated transformation.

For challenging scenes, the correctness of the initial matchings cannot be decided by cost ratio alone. We propose a unified framework combining outlier removal and matching propagation. We first construct a subset of matchings with relatively low cost ratios; this high-confidence subset serves as the foundation group for transformation estimation. In each iteration, we randomly pick one pair of matchings from the subset and compute a global transformation using the least-squares method. The remaining matchings are scanned to locate those consistent with the estimated transformation, by comparing the point-to-point correspondences generated by region matching with the matching propagation results. The transformation matrix is updated every time the size of the consistent set increases. We compare and evaluate the results of different iterations using two criteria: the number of propagated matching points that lie within the spatial range of ROI_a, and the average UPP of the consistent set. Throughout the process, global context is implicitly taken into consideration. The whole process runs iteratively until a global transformation (T_1) meeting predefined criteria is found, in which case matchings have already been propagated to all buildings across the scene, or until the list is exhausted, in which case the system declares that no correspondence could be established.
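To make the transformation-estimation step concrete, the sketch below fits a perspective transformation with OpenCV from high-confidence point correspondences and keeps the matchings consistent with it. It is a simplified sketch of the framework above: the pixel tolerance is an assumed value, and the iteration and ranking logic over random pairs is omitted.

```python
import numpy as np
import cv2

def estimate_global_transform(src_pts, dst_pts):
    """Least-squares fit of a global perspective transformation (homography)
    from point-to-point correspondences, e.g. sampled contour points of the
    high-confidence ROI matchings."""
    src = np.asarray(src_pts, dtype=np.float32).reshape(-1, 1, 2)
    dst = np.asarray(dst_pts, dtype=np.float32).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, method=0)   # method=0: plain least squares
    return H

def consistent_matchings(H, candidate_pairs, tol=8.0):
    """Keep candidate ROI matchings whose propagated points agree with H to
    within tol pixels (tol is an assumption, not a value from the source)."""
    kept = []
    for src, dst in candidate_pairs:                 # each side is an (N, 2) point array
        warped = cv2.perspectiveTransform(np.float32(src).reshape(-1, 1, 2), H)
        err = np.linalg.norm(warped.reshape(-1, 2) - np.float32(dst), axis=1).mean()
        if err < tol:
            kept.append((src, dst))
    return kept
```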
7.2.4 Principal directions

We also propose two techniques particularly designed for urban scenes to improve the system's performance. In this stage, when dealing with nadir or slightly oblique views, a reasonable assumption is that many buildings are aligned with two principal directions. Our experimental results verify that, using techniques based on principal directions, we can significantly reduce the whole system's computational cost without affecting accuracy or robustness for the majority of inputs. In some experiments the system did fail to obtain the correct principal directions, due to detection errors or special scene compositions. However, it should be noted that, as a speed-enhancement technique, the principal directions are optional to our 2D-3D registration system: the system operates just fine without them, although the whole registration process then takes notably longer.

Figure 7.19, Detection of Principal Directions: normalized image (top left) with detected edge pixels (bottom left) and line segments (bottom middle), the histogram (bottom right), and the rotated depth image (top right).

To detect the principal directions, we first use the Canny operator ([17]) to produce edge responses at each pixel. Probabilistic Hough transformation ([51]) then groups them into line segments. To handle irregular buildings and tolerate noise from edge detection and line grouping, the direction of each line segment is placed into a histogram of 36 bins. The first two peaks of the histogram correspond to the two principal directions. For better accuracy, the 3 histogram bins closest to each peak are used to interpolate the peak position.

Given the principal directions, two techniques significantly reduce the computational cost of this stage. First, the input depth and optical images are rotated according to the smaller positive of the two principal directions (figure 7.19). When the angle difference between the optical and depth images is large, a full search over all four possible directions (positive and negative for each principal direction) is needed. In this way we can generate rotation-invariant ROI descriptors by dealing with at most four possible directions, without computing the expensive tangent vectors for each point.

Second, a learning-based method is used to prune those initial ROI that are too irregular to be building regions, reducing the necessary search space for later registration. For similar tasks, many previous works define models of common building shapes and try to fit one of those prototypes to each segmented region. Such methods fail when encountering a new shape that is not in the model library, for example a uniquely shaped building or a part of a regularly shaped building produced by over-segmentation. In contrast to those model-fitting approaches, our histogram-based region regulation using a machine learning technique is more efficient and more general, because no pre-defined models are needed.

Figure 7.20, Histogram-based Regulations: X (b) and Y (c) histograms for segmented regions (a).

The basic assumption of our histogram-based region regulation is that most lines belonging to a regular building's contour should be compatible with the principal directions. For the contour pixels of each ROI candidate, their x and y coordinates in the rotated relative frame are projected into two 10-bin histograms respectively. Because each ROI candidate produced by the region-growing process is a closed external contour, each histogram should have at least two distinct peaks if the contour lines align well with the two principal directions. Figure 7.20 shows two segmented regions and their corresponding contours and x, y histograms. The distinctiveness of the highest peak is measured by the peak neighborhood ratio (PNR): the average of the peak's two closest neighboring bin values divided by the highest peak's bin value. The PNR of the second peak is the ratio of its bin value to the highest peak's. Finally, the PNR of each histogram is the average of the two peaks' PNR.
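The PNR computation can be sketched as follows; how the second peak is located is not fully specified in the text, so taking the largest bin not adjacent to the first peak is an assumption of this sketch.

```python
import numpy as np

def peak_neighborhood_ratio(hist):
    """PNR of a 10-bin contour-coordinate histogram, per the description above:
    first peak's PNR = mean of its two neighboring bins / its own value;
    second peak's PNR = its value / first peak's value;
    histogram PNR = average of the two."""
    h = np.asarray(hist, dtype=float)
    p1 = int(np.argmax(h))
    neighbors = [h[i] for i in (p1 - 1, p1 + 1) if 0 <= i < len(h)]
    pnr1 = np.mean(neighbors) / (h[p1] + 1e-9)
    # Second peak: largest bin that is not the first peak or one of its neighbors.
    mask = np.ones(len(h), dtype=bool)
    mask[max(p1 - 1, 0):p1 + 2] = False
    p2 = int(np.argmax(np.where(mask, h, -np.inf)))
    pnr2 = h[p2] / (h[p1] + 1e-9)
    return 0.5 * (pnr1 + pnr2)
```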
Given the two PNR values for each ROI, PNR(H_x) and PNR(H_y), our goal is to determine whether the region belongs to the building class (c = 1) or not (c = 0). We choose Linear Discriminant Analysis (LDA) to find a quadratic decision boundary for ((PNR(H_x), PNR(H_y)), c) because, first, LDA gives better accuracy particularly when the amount of training data is small, and second, LDA can be computed directly from the training data in one pass. For each city's depth images, we choose one with its ROI candidates manually labeled as c = 1 or 0 to form the training set. The joint likelihood P(PNR, c) is computed as

P(PNR, c) = P(c) · P(PNR | c) = P(c) · ( 1 / ((2π)^{n/2} |Σ|^{1/2}) ) · exp( −(1/2) [PNR − µ_c]^T Σ^{−1} [PNR − µ_c] ),

where µ_c is the mean vector of class c and, in our case, Σ is a 2×2 covariance matrix shared by both classes (the linear model). Their estimates (the class means and the pooled covariance) are computed directly from the training data.

A new test region R_test is classified into class 1 if

log( P(c = 1 | PNR_{R_test}) / P(c = 0 | PNR_{R_test}) ) > 0.

Suppose N is the total number of training regions, of which N_1 have label 1 and N_0 have label 0. Using the formula for conditional probability and the P(PNR, c) equation above, the final condition for R_test to be classified into class 1 is

log( P(PNR_{R_test}, c = 1) / P(PNR_{R_test}, c = 0) )
  = log( N_1 · exp( −(1/2) [PNR_{R_test} − µ_1]^T Σ^{−1} [PNR_{R_test} − µ_1] ) / ( N_0 · exp( −(1/2) [PNR_{R_test} − µ_0]^T Σ^{−1} [PNR_{R_test} − µ_0] ) ) )
  = log(N_1 / N_0) + PNR_{R_test}^T Σ^{−1} (µ_1 − µ_0) − (1/2) µ_1^T Σ^{−1} µ_1 + (1/2) µ_0^T Σ^{−1} µ_0 > 0.

This condition defines a linear decision boundary in the PNR space. We obtain a quadratic decision boundary by applying LDA in an augmented (5-dimensional) space:

(PNR(H_x), PNR(H_y), PNR(H_x)·PNR(H_y), PNR(H_x)^2, PNR(H_y)^2).

Linear decision boundaries in this 5D space correspond to quadratic ones in the original space. Applying the decision boundary reduces the number of ROI candidates and raises the true positive rate markedly in our experiments (from 45.7% to 83.3% for the top-left image of figure 7.9).

When the principal directions can be detected correctly, applying the above two techniques reduces the CPU time needed for the first stage from roughly one and a half minutes to 15 seconds. Based on our hundreds of pair-wise registration tests, the correct detection rate is roughly 90% of the total images. Otherwise, when no clear principal directions can be found (insufficient detected lines, or no distinctive peaks in the 36-bin histogram), the two related components are automatically disabled.
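A compact sketch of this LDA-based region regulation is given below; the pooled-covariance estimate is the standard one and is assumed rather than quoted from the thesis, and fit_lda is expected to receive the augmented 5D vectors.

```python
import numpy as np

def augment(pnr_x, pnr_y):
    """Lift (PNR_x, PNR_y) into the 5D space used for the quadratic boundary."""
    return np.array([pnr_x, pnr_y, pnr_x * pnr_y, pnr_x ** 2, pnr_y ** 2])

def fit_lda(features, labels):
    """Two-class LDA: class means, shared (pooled) covariance and class counts,
    computed in one pass over the labeled training regions."""
    X = np.asarray(features, dtype=float)
    y = np.asarray(labels)
    mu = {c: X[y == c].mean(axis=0) for c in (0, 1)}
    pooled = sum(np.cov(X[y == c].T, bias=True) * np.sum(y == c) for c in (0, 1)) / len(y)
    counts = {c: int(np.sum(y == c)) for c in (0, 1)}
    return mu, np.linalg.inv(pooled), counts

def classify(pnr_x, pnr_y, mu, sigma_inv, counts):
    """Decide building (1) vs non-building (0) using the linear rule derived above,
    evaluated in the augmented space, which is quadratic in the original space."""
    z = augment(pnr_x, pnr_y)
    score = (np.log(counts[1] / counts[0])
             + z @ sigma_inv @ (mu[1] - mu[0])
             - 0.5 * mu[1] @ sigma_inv @ mu[1]
             + 0.5 * mu[0] @ sigma_inv @ mu[0])
    return 1 if score > 0 else 0
```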
7.3 Registration handling viewpoint changes

The inputs to the second stage are two images, both from optical sensors; one of them is a nadir view already registered with the 3D range data by the first stage. The next task is to register the other surrounding views with the center nadir view, so that all views are directly or indirectly registered with the 3D range data. Two different methods for this task are proposed, in sections 7.3.1 and 7.3.2 respectively.

7.3.1 MVKP revisited

For inputs from different sensors with no initial alignment assumed, we were not able to find any existing method that produces meaningful outputs. Unlike the first stage, however, the second stage is a typical wide-baseline matching problem for conventional photography, for which many widely accepted methods exist. We first explore the offline multiview training and online query framework of MVKP (chapter 3): besides its robustness to lighting and viewpoint changes, it can efficiently process a large number of independent frames per second, which is ideal for rapid data fusion and scene rendering applications.

Figure 7.21, Feature Selection for Urban Scenes: top-ranking features with (left) and without (right) the additional feature selection.

During the offline training, the MVKP method synthesizes a number of training views for each registered nadir image using randomly generated transformations. An interest point detector detects potential feature points in all views. Since the transformations used to generate the training views are known, we can tell which feature points of different training views correspond to the same physical object location, and group the features accordingly. Feature points with high repeatability are selected. Around each selected feature point we extract a 32x32-pixel image patch, treated as a vector of 1024 pixel intensities. These vectors are projected onto a much lower-dimensional space using Walsh-Hadamard kernel projection while still preserving their distance relationships. Based on the characteristics of urban scenes, we also apply our feature selection method (section 3.3), which computes the variance inside the same view track and the distinctiveness across different view tracks. Many feature points along building boundaries that are unstable under 3D viewpoint changes are automatically discarded (figure 7.21). The output of the offline training stage is a feature database for the nadir images, with feature descriptors labeled by the physical object locations they correspond to.

After images from other viewing directions enter the registration system and strong feature points are detected, normalized patches around those points produce feature vectors for the un-registered images using the same W.H. kernels. Initial matchings are established by an efficient nearest-neighbor search using a two-layer structure and fast filtering with block distances. Finally, standard RANSAC is applied to detect the consistent matchings. If the second stage cannot find a sufficiently large consistent set after RANSAC, it declares that the current query image cannot be matched to the registered nadir image and moves on to the next query image. Otherwise, our system computes a global transformation (T_2) from the training image (nadir view) to the current query image using the consistent matchings. Combined with the transformation T_1 from the first stage, the query image is thereby indirectly registered with the 3D range data.

7.3.2 Feature Fusion for Registering Wide-baseline Urban Images

Although robust to various geometric distortions, many existing image matching methods based on local texture analysis, including SIFT descriptors and their many variations as well as the MVKP-based method introduced in section 7.3.1, face great difficulty when it comes to 3D non-planar objects under large viewing-direction changes, the typical scenario in modern urban photos. The difficulty stems either from the lack of appropriate textures in urban scenes or from the difficulty of modeling textures of non-planar objects under complex 3D viewpoint changes, especially near boundary regions.

Figure 7.22, Feature Fusion Framework: major components of our feature detection and description, containing the ROI track (left) and the edge groups track (right).
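For concreteness, the patch projection used by the MVKP-based method above can be sketched as follows. This is an illustrative sketch only: it uses SciPy's naturally ordered Hadamard matrix and simply drops the first (DC) row for lighting robustness, whereas the actual system's kernel ordering and kernel count may differ.

```python
import numpy as np
from scipy.linalg import hadamard

def walsh_hadamard_descriptor(patch, n_kernels=20):
    """Project a normalized 32x32 patch onto a few Walsh-Hadamard kernels to
    obtain a compact, lighting-robust descriptor."""
    v = np.asarray(patch, dtype=float).reshape(1024)
    v = (v - v.mean()) / (v.std() + 1e-9)        # patch normalization
    H = hadamard(1024)                           # 1024 x 1024 matrix of +/- 1 entries
    return (H[1:n_kernels + 1] @ v) / np.sqrt(1024)   # skip row 0 (the DC kernel)
```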
In this section, we propose to combine two different kinds of features, interest regions and edge groups, for the challenging urban image matching problem. The regions we are interested in (ROI) ideally represent conceptually meaningful parts of buildings that are well distinguished from neighboring areas. Because regions and edges are closely related in nature, instead of treating the two feature tracks independently we construct them in an interleaved manner: the initial edge detection results serve as references for the ROI extraction component, while the extracted ROI guide the meaningful grouping of edges (system overview in figure 7.22).

Our feature detection component for the ROI track is a region-growing process utilizing two assistant information maps and learning techniques, similar to the ROI extraction component proposed in section 7.2.2. First, two AIMs are constructed: the vege-map and the edge-map. The input images are also selectively smoothed during their construction. For the vege-map, we identify pixels dominated by the green channel, which are likely vegetation areas. There are two kinds of edges in our edge-map, true edges and in-region edges: among the initial edges returned by the Canny operator, many are not the actual boundaries of the ROI we are interested in (true edges) but rather edge responses within those regions (in-region edges). Since our whole registration process combines region-driven and edge-driven cues, it is meaningful to distinguish the two kinds of edges from the very beginning. One direct use of in-region edges is selective smoothing: the color and intensity of each confirmed in-region edge pixel is replaced by the average of its non-edge neighbors, which helps eliminate in-region details that would otherwise compromise ROI extraction performance (figure 7.23). The subsequent region-growing process is the same as in section 7.2.2 and is not repeated here.

Figure 7.23, Effectiveness of In-region Edges and Selective Smoothing: (a) the initially detected edges (left) and one patch (right) showing true edges (bright) and in-region edges (grey); (b) the same area before (left) and after (right) selective smoothing.

For the edge grouping component (overview in figure 7.24), first, membership labels are assigned to each edge pixel based on its spatial relationship with its neighboring ROI; when multiple ROI are sufficiently close to one edge pixel (e.g., the distances to the closest and the second-closest ROI are similar), multiple labels are assigned (a minimal sketch of this labeling step appears below). Second, separated edge pixels are grouped into short straight lines, called "edgelets". Even when a long line has been detected, our algorithm still returns a number of short lines for it, because edgelets are more flexible for describing complex boundaries, especially curved ones; another reason is the convenience of forming dense descriptors starting from edgelets. Traditional sparse descriptors for long lines or curves, based simply on length, angles, curvature, etc., are not robust enough for challenging urban scenes, because such lines frequently have broken or missing parts from one view to the other due to shadows, occlusions and viewpoint changes, while our experiments demonstrate that dense descriptors built from edgelets are considerably more tolerant of those factors.
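The labeling sketch referenced above follows; distances to ROI centroids and the ambiguity ratio are simplifying assumptions of this sketch (the original system uses the spatial relationship to the ROI themselves).

```python
import numpy as np

def label_edge_pixels(edge_pixels, roi_centroids, ambiguity_ratio=1.3):
    """Assign each edge pixel the label(s) of its nearest ROI; when the closest
    and second-closest ROI lie at similar distances, both labels are kept."""
    centroids = np.asarray(roi_centroids, dtype=float)
    labels = []
    for (y, x) in edge_pixels:
        d = np.hypot(centroids[:, 0] - y, centroids[:, 1] - x)
        order = np.argsort(d)
        pixel_labels = [int(order[0])]
        if len(order) > 1 and d[order[1]] < ambiguity_ratio * d[order[0]]:
            pixel_labels.append(int(order[1]))   # ambiguous: keep a second label
        labels.append(pixel_labels)
    return labels
```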
A voting histogram is constructed for each edgelet, recording the number of votes each label receives. The label of the whole edgelet is set to the label of the histogram's first (highest) peak, and edgelets with the same label are grouped together. A second label is assigned if the height of the second peak is relatively close to the first.

Figure 7.24, Overview of the Edge Grouping Component: the extracted ROI and the initial edge pixels feed edge pixel labeling and edgelet construction, followed by membership voting and edgelet ordering, producing grouped edgelets in ordered lists.

Finally, each edge group is stored as one or more ordered edgelet lists (OEL). An ordered list initially contains only the edgelet of the group with an end point closest to its bounding box. Other edgelets with an end point sufficiently close to the outgoing end point of the current list are iteratively added, until the list returns to the initial edgelet or no close edgelet can be located any more. Then all the edgelets in the list are removed from the group, and the process repeats until the group is exhausted.

Edge groups are described based on their OEL using a set of weighted 2D histograms. First, the basic description primitives in our case are edgelets instead of the contour points of shape context. Because our edgelets are already organized in a spatially continuous order, we can uniformly sample the whole OEL to produce a fixed number of virtual anchor points (N_VAP = 30~50 in our experiments) serving as description centers. They are described by the relative logarithmic distances and angles from all edgelet midpoints (index 0 < k < N_edgelet) in the current list (index i), weighted by the edgelet lengths, thus producing a 2D weighted histogram for each VAP_j. Second, rotation invariance is achieved by using the edgelet directions for the midpoints and computing a tangent vector for each VAP, so that a rotation-relative frame can be used instead of the absolute one. Next, scale invariance is enhanced by normalizing the distances using the size of the current OEL bounding box, which our experiments show to be more stable than other criteria under edge grouping errors. Finally, to improve performance with respect to over-grouping of edges, which occasionally happens in heavily urbanized areas, we construct additional partial descriptors using continuous edgelet subsets containing larger numbers of corners.

The ROI description is similar, with two major differences. Instead of VAP, actual contour points are used. Moreover, the histograms no longer contain edgelets but sampled contour points excluding the current description center; as a result, the histograms are no longer weighted. More details regarding our high-level feature description can be found in section 7.2.3.

Finally, our urban image registration method establishes initial matchings for the edge group track and the ROI track respectively, after which the two sets of initial matchings are merged and outliers are removed to produce the final correspondences.

To establish correspondences for edge groups, for the N_VAP histogram descriptors (H_1) of one OEL in one input image, the descriptors (H_2) of every OEL (index i) in the other image are sequentially scanned and the most similar one is selected as the match. We measure the similarity of two OEL as the minimum average histogram distance ("matching cost") of their corresponding VAP. Each OEL matching can provide N_VAP point-to-point correspondences. Similarly, the ROI track can also produce a large number of initial correspondences based on the ROI matching results.
We define "cost ratio" for each matching as the matching cost ratio of its best matching over the second best. Low cost ratio indicates a more distinctive matching and high matching confidence. Matchings with high cost ratio are discarded. Next, we apply RANSAC to locate the best consistent subset among the remaining matchings. Finally, a global transformation (T 2 ) is estimated from the selected consistent subset. The correspondence for any image point can also be roughly estimated using T 2 . 7.4 Experimental results The proposed whole 2D-3D registration system was intensively tested using LiDAR data of four cities: Atlanta, Baltimore, Denver and Los Angeles, mainly focusing on urban areas. We have the LiDAR datasets and many aerial images for those cities covering the downtown and surrounding rural areas. We have single source for each city’s LiDAR data. But there is distinctive difference between different cities. For example, range data for Los Angeles contains a large potion of ground and vegetation, and were captured years ago. The resolution is low with a lot of noises due to the technology restriction of early days and naturally, some 140 data sections are out of date because old buildings are torn down and new buildings are erected over the years. Some data like Baltimore contain heavily urbanized areas and are very current with high resolution and low noise. The optical images we tested are even more diverse. Many of them are from various sources (e.g. returned from online image search engine), captured in early years with low resolution. No georeference data can be tracked at all. When aligned with corresponding depth image, initial displacement of optical images is unknown. They may have any rotation (in-plane rotation for nadir images), even upside down. The scale errors range from 0.3 to 3. The location error is very large, even comparable with the image size. The corresponding ROI might lie on the opposite corner of the image or the input optical image may not correspond to the depth image at all. Our testing set currently contains nearly 1,000 images, half of which are intentionally distorted with various geometric and viewpoint changes for performance evaluation. 7.4.1 LiDAR segmentation results Our LiDAR segmentation algorithm has been tested on over 400 depth images projected from the four cities’ range dataset. Each contains around 30 buildings. Figure 7.9 shows one result for each city. Figure 7.25, LiDAR Segmentation Comparison: (a) input depth images; (b) our extracted ROI; (c) graph-based segmentation. Figure 7.25 shows our segmentation results compared with Efficient Graph-based Segmentation [30]. Our extracted regions are more focused on interest buildings and can provide more accurate external contours. The computation time is roughly half of [30]. An average of more than 80% buildings can be correctly extracted by our algorithm. The number of false positives is generally around 10%. It’s true that there may exist a few or several building in a scene wrongly labeled as background. However, we believe it is neither reasonable today nor necessary to request perfect image segmentation. The important thing is how to make the best use of imperfect segmentation results. In our case, how to establish 141 correct correspondences for parts of the scene and expand the partial results to the rest. 
7.4.2 ROI extraction from aerial images

Figures 7.26 and 7.27 show the color-coded ROI extraction results from aerial and depth images, compared with results generated by a classical segmentation algorithm [30]. Our ROI extraction meets the particular needs of our registration system considerably better: the returned segmented regions are more focused on the buildings of interest and provide more accurate dominant external contours.

Figure 7.26, ROI Extraction Results (from Aerial Images).

For parameter setting, we choose the UPP and OFE thresholds rather conservatively, removing only those ROI that are clearly false positives. T_range is dynamically related to the current ROI size. We found that changing range_len has no significant impact on the segmentation results: ROI that are distinctive from the background can be obtained robustly unless some unreasonable value is used, while we were not able to find a universal value that helps all the remaining ambiguous ones. An average of more than 80% of buildings are correctly extracted from the 3D range data in our experiments, while the percentage of correct ROI extraction from aerial images is around 60%. Nonetheless, instead of asking for perfect image segmentation, which is still not feasible today, we believe the important question is "how to make the best use of imperfect segmentation results" [40], in our case how to establish correct matchings for at least parts of the scene and propagate the partial results to the rest.

Figure 7.27, Aerial Photo Segmentation Comparison: (a) our ROI extraction algorithm; (b) graph-based segmentation.

We decided to develop two new ROI extraction algorithms specially designed for the proposed 2D-3D registration system for two reasons. First, although a great number of general-purpose image segmentation algorithms exist, such as [21], [79] and [30], as can be seen from the evaluation figures in sections 7.4.1 and 7.4.2 they cannot be relied upon for our ROI extraction task, essentially because they were never meant to produce the building regions we are interested in and do not exploit the many helpful constraints and cues of this particular kind of imagery. Even the widely used efficient graph-based segmentation algorithm [30] tends to produce many broken segments within the same building region as well as segmentation leaking that covers many buildings and grounds (figure 7.27(b)); consequently, it is difficult even to use their results as an initial segmentation. Second, we cannot directly use or adapt any existing special-purpose image segmentation algorithm: as far as we can see, most such algorithms either focus on a particular application domain (e.g., medical images) or provide specially tailored inputs to certain systems, and are difficult or impossible to generalize to meet our system's needs. Moreover, such special-purpose segmentation algorithms rarely publish their source code or executables, which makes it hard to use them even for evaluation and comparison, desirable as such comparisons would be.

7.4.3 Different sensor registration

First, the evaluation of any registration and matching system generally involves invariance to rotation, lighting and scale changes as well as other distortions. For the first stage, orientation invariance is achieved by generating ROI descriptors in relative frames.
Scale invariance comes from distance normalization and from placing ROI of different scales into histograms with a fixed number of r bins. Distortion is tolerated by the histogram-based dense descriptors, and the lighting issue does not apply to this stage. For the second stage, invariance comes primarily from the multiview training, while robustness to lighting comes from patch normalization and the W.H. kernel projection (by discarding the first, DC component). Both stages search for local or semi-local matching primitives across the entire scene, so location invariance is naturally handled. Furthermore, both stages are robust against partially missing data (e.g., due to occlusion or historic data). Experiments using nearly 1,000 images have demonstrated the invariance and robustness of the proposed system.

Second, concerning the registration success rate, the first stage currently achieves nearly 70% success on our whole testing set of depth and nadir images. For example, for the city of Los Angeles, despite the low resolution and historic nature of the data, 230 out of the total 324 images are correctly registered. The registration failures are mainly due to insufficient matching primitives (ROI), either because such primitives are simply lacking in the scene, in which case even a human finds the registration difficult or impossible, or because such primitives cannot be accurately acquired through segmentation even though they "seem" obvious to human observers. Still, to the best of our knowledge, there is no existing registration method that achieves similar performance on the different-sensors problem without support from positioning hardware. The success rate of the second stage largely depends on how far the viewpoint departs from the nadir view. Although wide-baseline matching is still a challenging problem in itself, our second stage based on MVKP is typically able to correctly register two or three out of ten frames per second. Experimental results also show that the success rate can be increased by up to 10% by using other methods such as [60] and [89], but the registration of each frame then generally takes several seconds; we therefore consider MVKP more suitable for the second stage.

Next, regarding the accuracy of the successful registrations, the average pixel error (APE) of the first stage is within 5 pixels even for propagated matchings (e.g., the APE for the Baltimore dataset is 2.89). Those errors come primarily from the difficulty of locating exact pixel locations inside high-level features, due to shadows, segmentation leaking and breaking, etc. The APE of the second stage, however, can be as large as half of a building's size, mainly because our current system simply fits a homography for T_2 using the consistent matchings instead of recovering the actual camera model. The current accuracy is sufficient for tasks like recognition; if required by some applications, a refinement process similar to [107] and [26] could be applied, though it is not yet implemented in our current system.

Figure 7.28, Whole System Results: (a) initial correspondences of the first stage, which are used to compute T_1; (b) and (c) results after matching propagation, where T_1 and T_2 are applied respectively to the ROI contour points extracted by LiDAR segmentation. The results are visualized by the bounding boxes and centers of the ROI's point-to-point matchings.
Figure 7.29, Registration Results of Our Proposed Approach: (a) initial correspondences (left: normalized depth image; right: input aerial image; middle: aerial image warped by the recovered transformation); (b) the final results after matching propagation, visualized by the bounding boxes and centers of all interest regions' point-to-point correspondences; (c) partially missing inputs due to historic data; (d) results registering oblique views.

Finally, considering the proposed registration system's possible future application areas, such as mobile augmented reality, handheld 3D scene modeling and texturing, and vision-based personal navigation and localization, it is very important for the system to occupy as little system resource as possible while still providing registration results interactively, so that future applications on mobile or handheld devices with very constrained computational power are feasible. On one hand, although much more challenging than the registration of image pairs from the same sensor, which the human vision system accomplishes in a blink of half a second, many registration tasks from 2D optical images to their corresponding 3D range data can be accomplished by untrained human testers in 10-30 seconds. On the other hand, even with support from positioning hardware that keeps the initial registration errors minimal (e.g., within 50 pixels of location error), current state-of-the-art approaches still need at least several minutes [26] or even 20 hours [35] to achieve a success rate of around 60%.

Figure 7.30, Composition of Average CPU Time for Major Components: pre-processing, ROI extraction from 2D, ROI extraction from 3D, ROI descriptions, and ROI matching.

Throughout the design of our 2D-3D registration system, we have treated computational efficiency as a primary concern. The newly proposed region-growing-based ROI extraction algorithms, using the fractal error measurement and histogram-based region refinement, are highly efficient: ROI for the buildings of interest can be extracted from the aerial image and from the depth image in a few seconds each. Training of the fractal and LDA/QDA models happens only once per dataset. The subsequent ROI description takes only a few seconds, and the region matching component several seconds. Thanks to the introduction of the ordered contour point list and the use of the cost ratio as an initial ranking score, the search range of region matching is greatly reduced. The final outlier removal and matching propagation steps are combined into one unified and efficient framework. Additionally, there are a number of algorithm-level optimizations; for example, the two-layer structure is used several times in our proposed system: whenever possible, we compute inexpensive criteria to filter out the majority of the data in the first layer so that less data enters the more expensive second layer. In summary, without any detailed low-level optimization, the entire registration process for a scene of around 30 buildings takes roughly 15 seconds on a P4 3.4 GHz PC, with a peak memory occupation of 35 MB. Figure 7.30 presents the average CPU time consumption of each major component of the proposed registration system. Overall, the most time-consuming parts are the ROI extraction components, especially the ROI extraction from 2D aerial images, which takes nearly half of the total processing time.
Although a similar segmentation framework is used, ROI extraction from 3D range data is notably faster because the 3D range data is significantly cleaner than the 2D optical data. The CPU time for pre-processing is mainly spent globally rotating the inputs according to the principal directions. The ROI description component is the most efficient. The ROI matching portion in the above figure also includes the outlier removal and matching propagation components.

7.4.4 Matching wide-baseline urban images

Besides the MVKP-based urban image registration method, we also propose a novel method fusing region and edge features. This method has been tested using both real and synthesized challenging aerial photos from several major cities' datasets. Around 30% of our testing images have viewing directions almost perpendicular to the ground; the rest are oblique views with various unknown 3D camera rotations and zoom levels. Experimental results indicate that the ROI track generally provides better correspondences for areas that are more homogeneous in nature, while the edge group track is more suitable for the parts lying between those areas. By merging the initial matchings from both tracks and refining them together using global context information, our method adapts between the two kinds of features automatically.

In terms of registration accuracy and success rate, our method is compared with existing approaches including [60], [54], [95] and [48]. We found that the original SIFT, although comparatively slow to compute, produced the best results in terms of both accuracy and success rate among the existing methods we tried. In each experiment, the distance ratio of SIFT is varied from 0.6 to 0.9 and the best result is kept. We count a registration as successful if the refined inliers recover a transformation that roughly aligns the two input images properly (through visual inspection of corresponding buildings). Accuracy is measured by the average pixel error of the point-to-point correspondences after RANSAC, compared with manually labeled ground truth. Primarily due to the basic lack of appropriate textures in urban scenes, throughout our experiments the registration success rate of SIFT is below 40%. In comparison, our method fusing two different kinds of features achieves a success rate of 86%. Approximately 23% of the total tests are successfully registered by both methods. Most correct matchings returned by SIFT lie on highly textured planar grounds, while our method is able to register 3D man-made structures using region shapes and edge groups.

Figure 7.31, Results from Feature Fusion: registration results including initial matchings with low cost ratio from the edge groups track (a) and the ROI track (b), and final propagated matchings estimated by H_T (c).

Figure 7.31 shows the initial correspondences provided by the edge groups (a) and the ROI (b) respectively, which are merged, refined and propagated to produce the final matchings displayed in 7.31(c).

Figure 7.32, Refined and Propagated Matchings: correspondences after RANSAC (a), which are used to generate (b) for the 2D-3D registration application (see fig. 2), roughly recovering the scene poses and building locations.

Figure 7.33, Matchings from SIFT: matchings produced by the original SIFT method with distance ratio equal to 0.6 (left) and 0.8 (right) respectively.
Figure 7.34, Matchings from Our Proposed Method: results produced by our proposed method for real (top) and synthesized (bottom) images, including the inputs (left and right columns) and the final warped results (middle column).

Concerning registration accuracy, as a local texture-based approach SIFT has sub-pixel accuracy: our experiments report that, for those image pairs SIFT can successfully register, the average pixel error after RANSAC is within 2 pixels. The same error for our method is 5-10 pixels. The reason comes primarily from the difficulty of locating exact pixel locations inside the high-level features we use, intensified by challenging issues such as shadows, imperfect segmentation and line grouping.

7.4.5 Applications

Figure 7.35, The Proposed Method Applied to an Urban Modeling and Rendering Task.

We have applied the proposed techniques to photorealistic modeling and rendering of urban scenes as well as to UAV localization. First, wide-baseline urban images are registered with nadir images using the feature fusion technique proposed in section 7.3.2. Next, the nadir images are registered with 3D city models or depth images using the method proposed in section 7.2. Consequently the input urban images become "indirectly" registered with the 3D models. Figures 7.35, 7.36 and 7.37 show some representative results of these applications, obtained without utilizing positioning hardware at any point in the registration process.

Figure 7.36, Applying Our Approach to UAV Localization (Moderate-Scale Example).

Figure 7.37, Applying Our Approach to UAV Localization (Large-Scale Example).

Chapter VIII: Conclusion

Targeting mobile multimedia applications, this thesis focuses on the major challenge posed by such mobile systems: the integrated image matching method has to be efficient, robust and invariant under the diverse viewing conditions of the real world. Each chapter of the thesis addresses one specific challenge of the automatic image matching problem on mobile multimedia devices, targeting various limitations of existing algorithms within the general framework of the image matching and recognition process.

Chapter 3 presents our real-time image matching system, Multiple View Kernel Projection, which combines Walsh-Hadamard kernel projection with a multiview training stage for viewpoint invariance, together with a feature selection approach that efficiently reduces the number of features needed, making it more suitable for applications on mobile platforms. Experimental results indicate that our compact features provide comparable robustness and invariance while occupying much less disk storage thanks to their low dimensionality; they are consequently more efficient for indexing and searching, and particularly friendly to cell phones and other mobile devices with limited computational power and storage.

In chapter 4, we propose a feature augmentation process that treats the classic image matching problem as a collection of recognition problems over spatially related image patch sets. Experiments on a standard dataset demonstrate that the produced augmented features are more distinctive and efficient than the base features, providing an effective way to upgrade existing features at a nominal computational cost.
Next, for the feature matching component, existing methods that search exhaustively or hierarchically in a high-dimensional image feature database are computationally expensive due to the curse of dimensionality and are unsuitable for mobile applications. We develop an effective approach (chapter 5), Fast Filtering Vector Approximation, a nearest neighbor search technique that facilitates rapidly indexing and locating the most similar matches and efficiently matches against a very large high-dimensional database of image features.

To demonstrate the effectiveness of the proposed individual components, we build an augmented reality application for museum exhibits using natural features in place of artificial calibrated markers (chapter 6), and show that our proposed methods are well suited to mobile multimedia applications such as AR museums.

Finally, targeting recently emerged applications such as handheld 3D scene modeling and texturing, and vision-based personal navigation and localization, we explore the adaptation of our feature matching methods to different sensors (chapter 7). The resulting system robustly registers 2D aerial photography onto 3D LiDAR data. The whole 2D-3D registration process takes merely seconds on today's PCs, taking a significant step towards possible applications on small mobile devices.

The end result of this thesis is a complete image matching system integrating our novel components, capable of efficiently establishing correspondences for images under various viewing conditions or from different sensors, and compatible with the speed and storage limitations of small mobile devices.

Bibliography

[1] F. Ababsa and M. Mallem, Robust camera pose estimation using 2d fiducials tracking for real-time augmented reality systems, in ACM SIGGRAPH VRCAI, pp. 431–435, 2004.
[2] M. C. Andrade, G. Bertrand, and A. A. Araujo, Segmentation of microscopic images by flooding simulation: A catchment basins merging algorithm, in Proc. SPIE Non-linear Image Processing, pp. 164–175, 1997.
[3] H. Bay, T. Tuytelaars and L. V. Gool, SURF: Speeded Up Robust Features, Proceedings of the Ninth European Conference on Computer Vision, May 2006.
[4] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger, The R*-tree: An efficient and robust access method for points and rectangles, in Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data, pp. 322-331, Atlantic City, NJ, 23-25 May 1990.
[5] J. S. Beis and D. G. Lowe, Shape indexing using approximate nearest-neighbor search in high-dimensional spaces, IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1997.
[6] R. Bellman, Adaptive Control Processes: A Guided Tour, Princeton University Press, 1961.
[7] S. Belongie, J. Malik and J. Puzicha, Shape matching and object recognition using shape contexts, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 4, pp. 509-522, Apr 2002.
[8] S. Belongie, J. Malik, and J. Puzicha, Shape matching and object recognition using shape contexts, Technical Report UCB//CSD-00-1128, UC Berkeley, January 2001.
[9] G. Ben-Artzi, H. Hel-Or, Y. Hel-Or, Filtering with Gray-code kernels, International Conference on Pattern Recognition, 2004.
[10] S. Berchtold, C. Böhm, H. V. Jagadish, H.-P. Kriegel, J. Sander, Independent Quantization: An Index Compression Technique for High-Dimensional Data Spaces, ICDE, 2000.
[11] S. Berchtold, D. Keim, and H.-P.
Kriegel, The X-tree: An index structure for high-dimensional data, in Proc. of the Int. Conference on Very Large Databases, pp. 28-39, 1996.
[12] S. Beucher, Watershed, hierarchical segmentation and waterfall algorithm, in Mathematical Morphology and its Applications to Image Processing, pp. 69–76, Kluwer, 1994.
[13] S. Blott, R. Weber, A Simple Vector-Approximation File for Similarity Search in High-dimensional Vector Spaces, Technical Report 19, ESPRIT project HERMES (no. 9141), March 1997.
[14] A. Boffy, Y. Tsin and Y. Genc, Real-Time Feature Matching using Adaptive and Spatially Distributed Classification Trees, in British Machine Vision Conference, 2006.
[15] Y. L. Boureau, F. Bach, Y. LeCun and J. Ponce, Learning mid-level features for recognition, 2010 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2559-2566, 13-18 June 2010.
[16] M. Calonder, V. Lepetit, P. Fua, K. Konolige, J. Bowman and P. Mihelich, Compact signatures for high-speed interest point description and matching, 2009 IEEE 12th International Conference on Computer Vision, pp. 357-364, Sept. 29 - Oct. 2 2009.
[17] J. Canny, A Computational Approach to Edge Detection, IEEE Transactions on Pattern Analysis and Machine Intelligence, 8:679-714, 1986.
[18] G. Cao, X. Yang and Z. Mao, A two-stage level set evolution scheme for man-made objects detection in aerial images, IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 474-479, June 2005.
[19] P. Ciaccia, M. Patella, and P. Zezula, M-tree: An efficient access method for similarity search in metric spaces, in Proc. of the Int. Conference on Very Large Databases, Athens, Greece, 1997.
[20] P. Ciaccia, M. Patella, PAC Nearest Neighbor Queries: Approximate and Controlled Search in High-Dimensional and Metric Spaces, 16th International Conference on Data Engineering (ICDE'00), p. 244, 2000.
[21] D. Comanicu, P. Meer, Mean shift: A robust approach toward feature space analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, pp. 603-619, May 2002.
[22] B. E. Cooper, D. L. Chenoweth and J. E. Selvage, Fractal error for detecting man-made features in aerial images, Electronics Letters, 30(7), pp. 554-555, 1994.
[23] V. Coors, T. Huch, and U. Kretschmer, Matching buildings: Pose estimation in an urban environment, ISAR, pp. 89–92, 2000.
[24] O. Danielsson, S. Carlsson and J. Sullivan, Automatic learning and extraction of multi-local features, 2009 IEEE 12th International Conference on Computer Vision, pp. 917-924, Sept. 29 - Oct. 2 2009.
[25] H. Deng, E. N. Mortensen, L. Shapiro and T. G. Dietterich, Reinforcement Matching Using Region Context, Computer Vision and Pattern Recognition Workshop, pp. 11-11, 17-22 June 2006.
[26] M. Ding, K. Lyngbaek and A. Zakhor, Automatic registration of aerial imagery with untextured 3D LiDAR models, IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-8, 23-28 June 2008.
[27] G. Dorko and C. Schmid, Selection of scale-invariant parts for object class recognition, Ninth IEEE International Conference on Computer Vision, pp. 634-639 vol. 1, 13-16 Oct. 2003.
[28] M. Elad, Y. Hel-Or, and R. Keshet, Pattern detection using maximal rejection classifier, in International Workshop on Visual Form, pp. 28–30, May 2000.
[29] Z. Fan and B. Lu, Fast recognition of multi-view faces with feature selection, Tenth IEEE International Conference on Computer Vision, pp. 76-81 Vol. 1, 17-21 Oct. 2005.
[30] P. F. Felzenszwalb and D. P.
Huttenlocher, Effie Graph-Based Image Segmentation, International Journal of Computer Vision, Vo.59, No.2, 2004. [31] H. Ferhatosmanoglu, E. Tuncel, D. Agrawal, A. El Abbadi, Vector Approximation based Indexing for Non-uniform High Dimensional Data Sets, CIKM: ACM CIKM 2000. [32] V. Ferrari, T. Tuytelaars, Luc Van Gool, Wide-baseline Multiple-view Correspondences, In Conference on Computer Vision and Pattern Recognition, 2003. [33] J. H. Friedman, J. L. Bentley, and R. A. Finkel, An algorithm for finding best matches in logarithmic expected time, ACM Transactions on Mathematical Software, 3(3):209--226, 1977. [34] C. Fruh and A. Zakhor, Constructing 3D city models by merging ground-based and airborne views. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 562–569, 2003. [35] C. Fruh and A. Zakhor, An automated method for large-scale, ground-based city model acquisition, International Journal of Computer Vision, 60(1):5–24, Oct. 2004. 161 [36] M. Grafe, R.Wortmann, and H. Westphal, AR-based Interactive Exploration of a Museum Exhibit. Augmented Reality Toolkit, The First IEEE International Workshop pp. 5-9. 2002. [37] C. Gu, J. J. Lim, P. Arbelaez and J. Malik, Recognition using regions, IEEE Conference on Computer Vision and Pattern Recognition, pp.1030-1037, 20-25 June 2009. [38] A. Guttman. R-trees: A dynamic index structure for spat-ial searching, In Proc. of the ACM SIGMOD Int. Conf. On Management of Data, pp. 47-57, Boston, MA, June 1984. [39] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, 2000. [40] V. Hedau, H. Arora and N. Ahuja, Matching Images under Unstable Segmentation, in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Anchorage, AL, 2008. [41] Y. Hel-Or and H. Hel-Or, Real-time pattern recognition using projection kernels, IEEE Transactions on Pattern Analysis and Machine Intelligence, Sep 2005. [42] H. Hirschmüller, H. Mayer, G. Neukum, and HRSC CoI-Team, Stereo Processing of HRSC Mars Express Images by Semi-Global Matching, International Symposium on Geospacial Databases for Sustainable Development, 2006. [43] J. Hu, S. You, and U. Neumann, Approaches to large-scale urban modeling, IEEE Computer Graphics and Applications, 23(6):62–69, 2003. [44] A. B. Huguet, R. L. Carceroni, A. A. Araujo, Towards automatic 3D reconstruction of urban scenes from low-altitude aerial images, 12th International Conference on Image Analysis and Processing, pp. 254-259, Sep. 2003. [45] H. Jegou, M. Douze, C. Schmid and P. Perez, Aggregating local descriptors into a compact image representation, 2010 IEEE Conference on Computer Vision and Pattern Recognition, pp.3304-3311, 13-18 June 2010. [46] R. Jonker and A. Volgenant, A shortest augmenting path algorithm for dense and sparse linear assignment problems, Computing, 38:325–340, 1987. [47] N. Katayama and S. Satoh. The SR-tree: An index struc-ture for high-dimensional nearest neighbor queries. In Proc. of the ACM SIGMOD Int.Conj.on Management of Data, pp. 369-380, Tucson, Arizon USA, 1997. [48] Y. Ke and R. Sukthankar, PCA-SIFT: A more distinctive representation for local image descriptors, In Proceedings of the Conference on Computer Vision and Pattern Recogni-tion, pp. 511.517, 2004. [49] M. Kendall, A new measure of rank correlation. Biometrika, 30:81–89, 1938. [50] D. Keren, M. Osadchy, and C. 
Gotsman, Antifaces: A novel, fast method for image detection, IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(7):747–761, 2001. [51] N. Kiryati, Y. Eldar and A. M. Bruckstein, A Probabilistic Hough Transform, Pattern Recognition, 24(4):303-316, 1991. [52] K. Kraus and N. Pfeifer, Determination of terrain models in wooded areas with airborne laser scanner data, ISPRS Journal of Photogrammetry and Remote Sensing, 53(4):193–203, 1998. [53] V. Lepetit and P. Fua, Towards Recognizing Feature Points using Classification Trees, Technical Report IC/2004/74, EPFL, 2004. [54] V. Lepetit and P. Fua, Keypoint Recognition using Randomized Trees, Transactions on Pattern Analysis and Machine Intelligence, Vol. 28, Nr. 9, pp. 1465 – 1479, 2006. 162 [55] V. Lepetit, J. Pilet and P. Fua, Point matching as a classification problem for fast and robust object pose estimation, IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. [56] K. Lin, H. Jagadish, and C. Faloutsos, The TV-tree: An index structure for high-dimensional data. The VLDB Jo-urnal, 3(4):517-549, Oct. 1994. [57] C. Lin and R. Nevatia, Building detection and description from a single intensity image, Computer Vision and Image Understanding, 72(2), 101–121, 1998. [58] L. Liu and I. Stamos, Automatic 3D to 2D registration for the photorealistic rendering of urban scenes. IEEE Conference on Computer Vision and Pattern Recognition, volume II, pp. 137-143, San Diego, CA, June 2005. [59] L. Liu, G. Yu, G. Wolberg and S. Zokai, Multiview Geometry for Texture Mapping 2D Images Onto 3D Range Data, IEEE Conference on Computer Vision and Pattern Recognition, vol.2, pp. 2293-2300, 2006 [60] D. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision, 2004. [61] J. Luo, S. P. Etz, and R. T. Gray, Normalized Kemeny and Snell Distance: A Novel Metric for Quantitative Evaluation of Rank-Order Similarity of Images, IEEE TPAMI, 24(8):1147–1151, 2002. [62] S. Mahamud and M. Hebert, The optimal distance measure for object detection, IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. I-248-I-255 vol.1, 18-20 June 2003. [63] M. Maloof, P. Langley, T. Binford, R. Nevatia, and S. Sage, Improved rooftop detection in aerial images with machine learning. Machine Learning, 2003. [64] B. B. Mandelebrot, The Fractal Geometry of Nature, W.H. Freeman and Co., ISBN: 0716711869, New York, August 15, 1983. [65] B. C. Matei, H. S. Sawhney, S. Samarasekera, J. Kim and R. Kumar, Building segmentation for densely built urban regions using aerial LIDAR data, IEEE Conf, on Computer Vision and Pattern Recognition, June 2008. [66] K. Mikolajczik and C. Schmid, A performance evaluation of local descriptors, in Conference on Computer Vision and Pattern Recognition, June 2003, pp. 257–263. [67] J. Mooser, W. Lu, S. You, U. Neumann, An Augmented Reality Interface for Mobile Information Retrieval, ICME 2007 pp. 2226-2229, 2007. [68] U. Neumann and S. You, Natural feature tracking for augmented reality, IEEE Transactions on Multimedia”, 1(1):53–64, 1999. [69] M. Ozuysal, V. Lepetit, F. Fleuret, and P. Fua. Feature Harvesting for Tracking-by-Detection, European Conferen-ce on Computer Vision, 2006. [70] F. Perronnin and C. R. Dance, Fisher kernels on visual vocabularies for image categorization, IEEE Conference on Computer Vision and Pattern Recognition, June 2007. [71] M. Pollefeys, L. J. V. Gool, M. Vergauwen, F. Verbiest, K. Cornelis, J. Tops, and R. 
Koch, Visual modeling with a hand-held camera. International Journal of Computer Vision, 59(3):207–232, 2004. [72] J. Porway, K. Wang, B. Yao, and S. Zhu, A hierarchical and contextual model for aerial image understanding, IEEE Conference on Computer Vision and Pattern Recognition, 23-28 June 2008. 163 [73] C. Poullis, S. You, U. Neumann, Rapid Creation of Large-scale Photorealistic Virtual Environments, IEEE Virtual Reality Conference, pp.153-160, 8-12 March 2008. [74] C. E. Priebe, J. L. Solka, and G. W. Rogers, Discriminant Analysis in Aerial Images Using Fractal Based Features, Adaptive and Learning Systems II, F.A. Sadjadi, ed., Proc. SPIE, vol. 1,962, pp.196-208, 1993. [75] J. Rekimoto and Y. Ayatsuka, CyberCode: designing augmented reality environments with visual tags, in Designing Augmented Reality Environments, pp. 1–10, ACM Press, 2000. [76] F. Rottensteiner, J. Trinder, S. Clode, and K. Kubik, Automated delineation of roof planes from LiDAR data, In Laser05, pp. 221–226, 2005. [77] D. Schmalstieg and D.Wagner, A handheld augmented reality museum guide, in IADIS Mobile Learning, 2005. [78] C. Schmid and R. Mohr, Local grayvalue invariants for image retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 530–534, May 1997. [79] J. Shi and J. Malik, Normalized cuts and image segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 731-737, 1997. [80] Z. Si, H. Gong, Y. Wu and S. Zhu, Learning mixed templates for object recognition, IEEE Conference on Computer Vision and Pattern Recognition, pp.272-279, 20-25 June 2009. [81] G. Simon, A. Fitzgibbon, and A. Zisserman, “Markerless tracking usingplanar structures in the scene.” In ISAR, pp. 120–128, 2000. [82] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. IEEE International Conference on Computer Vision, 2003. [83] G. Sohn, and I. Dowman, Extraction of Buildings from High Resolution Satellite Data, In E. Baltsavias, A. Gruen, L. Van Gool (Eds.), Automated Extraction of Man-Made Objects from Aerial and Space Images (III), Balkema Publishers, Lisse, pp. 345-355, 2001. [84] J. L. Solka, D. J. Marchette, B. C. Wallet, V. L. Irwin and G. W. Rogers, Identification of man- made regions in unmanned aerial vehicle imagery and videos, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.20, no.8, pp.852-857, Aug 1998. [85] C. E. Spearman, Proof and measurement of association between two things, American Journal of Psychology, 15:72–101, 1904. [86] M. Toews and W. Wells, SIFT-Rank: Ordinal description for invariant feature correspondence, 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp.172-177, 20-25 June 2009. [87] Y. Tsai, Q. Wang, and S. You, "CDIKP: A Highly-Compact Local Feature Descriptor", Proceedings of 19th conference on Pattern Recognition, Tampa, Florida, U.S.A, December 2008. [88] E. Tuncel, H. Ferhatosmanoglu, Kenneth Rose, VQ-Index: An Index Structure for Similarity Searching in Multimedia Databases, ACMMM, 2002. [89] T. Tuytelaars and L. V. Gool, Matching widely separated views based on affine invariant regions, International Journal of Computer Vision, 59(1):61-85, 2004 [90] O. Tuzel, F. Porikli, P. Meer, Region Covariance: A Fast Descriptor for Detection and Classification, European Conference on Computer Vision, May 2006. [91] V. Verma, R. Kumar, and S. 
Hsu, 3D building detection and modeling from aerial LIDAR data, in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol.2, pp. 2213- 2220, 2006. 164 [92] C. Vestri and F. Devernay, Using robust methods for automatic extraction of buildings, in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol.1, pp. I-133-I-138, 2001. [93] P. Viola, W. M. Wells III, Alignment by maximization of mutual information, International Journal of Computer Vision, pp.. 137–154, 1995. [94] Q. Wang and S. You, Fast Similarity Search for High-Dimensional Dataset, IEEE International Workshop on Multimedia Information Processing and Retrieval, 2006. [95] Q. Wang and S. You, Real-Time Image Matching Based on Multiple View Kernel Projection, IEEE Conference on Computer Vision and Pattern Recognition, 2007. [96] Q. Wang and S. You, Feature Selection for Real-time Image Matching Systems, Proceedings of 19th Conference on Pattern Recognition, Tampa, Florida, December 2008. [97] Q. Wang, J. Mooser, S. You and U. Neumann, Augmented Exhibitions Using Natural Features, the International Journal of Virtual Reality, Vol. 7, No. 4, pp. 1-8, 2008. [98] Q. Wang and S. You, A Vision-based 2D-3D Registration System, IEEE Workshop on Applications of Computer Vision (WACV), Snowbird, Utah, December 7-8, 2009. [99] R. Weber, K. Böhm, Trading Quality for Time with Nearest-Neighbor Search, Lecture Notes in Computer Science, 2000. [100] R. Weber, K. Böhm, Hans-J. Schek, Interactive-Time Similarity Search for Large Image Collections Using Parallel VA-Files, ICDE, 2000. [101] R. Weber, Hans-J. Schek, S. Blott, A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Space, Proc. 24th Int. Conf. Very Large Data Bases, VLDB, 1998. [102] J. Winn and A. Criminisi. Object Class Recognition at a Glance, Video demo in Conference on Computer Vision and Pattern Recognition, 2006. [103] P. L. Worthington and E. R. Hancock, Region-Based Object Recognition Using Shape-from- Shading, Proceedings of the 6th European Conference on Computer Vision, pp. 455 - 471 , June 26-July 01, 2000. [104] T. F. Wu, G. S. Xia, and S. C. Zhu, Compositional Boosting for Computing Hierarchical Image Structures, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , June 2007. [105] J. Xiao, H. Cheng, F. Han and H. Sawhney, Geo-spatial aerial video processing for scene understanding and object tracking, IEEE Conference on Computer Vision and Pattern Recognition, pp.23-28 June 2008. [106] R. Zabih and J. Woodfill, Non-parametric local transforms for computing visual correspondance. European Conference on Computer Vision, pp. 151–158, 1994. [107] W. Zhao, D. Nister, and S. Hsu, Alignment of continuous video onto 3d point clouds, IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8): pp.1305–1318, 2005.
Abstract
Image matching is a fundamental task in computer vision, used to establish correspondences between two or more images taken, for example, at different times, from different viewpoints, or with different sensors. Image matching is also the core of many multimedia systems and applications. Today, the rapid convergence of multimedia, computation, and communication technologies with techniques for device miniaturization is ushering us into a mobile, pervasively connected multimedia future, promising many exciting applications such as content-based image retrieval (CBIR), mobile augmented reality (MAR), handheld 3D scene modeling and texturing, and vision-based personal navigation and localization.