INFERENCE OF 2D LAYERS FROM UNCALIBRATED IMAGES

by

Eun Young Kang

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

August 2003

Copyright 2003 Eun Young Kang

Acknowledgements

First of all, I thank my advisor, Professor Gerard Medioni, chairman of the Department of Computer Science, for his guidance, brilliant insight and enormous patience throughout all my years of research. He provided me with the best environment for my research and was the greatest role model of an academic advisor. I also deeply thank my co-advisor and co-chairman of the defense committee, Dr. Isaac Cohen, for his innovative and invaluable remarks on my research. His remarks, from the idea level down to the code level, guided me in accomplishing my approach. I thank my qualifying examination committee members, Dr. Ulrich Neumann and Dr. Margaret McLaughlin, for their constructive suggestions. I also thank my final defense committee members, Dr. Laurent Itti and Dr. Elaine Chew, for their thorough comments and direction in finalizing my thesis. I thank all the IRIS lab members: Changki, Fengjun, Jinman, Sungchun, Sunguk, and Tao. My special thanks go to Dr. Alexandre François and Philippos Mordohai. Alex and Philippos have been a great support; they never hesitated to spend their time providing useful comments, and they also gave me many research challenges.
I also thank my dearest friend Ilmi and the GS members, my Korean Bible study group, for their prayers and encouragement. I am truly grateful to my family, my parents and sisters. They have given me unconditional and endless support in every situation, along with the freedom and trust to make my own decisions. My deepest thanks go to my husband, Clint Chua. Clint has been an immense comfort and outlet through his encouragement, charm and humor. He has shown consistent love through good times and bad, and has stayed like a rock beside me. Most of all, I thank God for leading me in His best way until now and into the future.

Table of Contents

Acknowledgements
Table of Contents
List of Tables
List of Figures
List of Equations
Abstract
Chapter 1 Introduction
Chapter 2 Related work
  2.1 Layered representation
  2.2 2D motion parameter estimation
  2.3 Projective geometry for motion estimation
Chapter 3 2D parametric motion analysis
  3.1 Camera motion and 2D transformation
  3.2 Correspondences as input data
    3.2.1 Feature extraction
    3.2.2 Feature matching
    3.2.3 Global feature matching in frequency domain
    3.2.4 Coarse-to-fine hierarchical approach
  3.3 Parameter estimation
  3.4 Robust estimation
  3.5 Complexity
  3.6 2D motion estimation in 2D mosaic
    3.6.1 Mosaic for a set of still images
    3.6.2 Mosaic for a moving sequence
    3.6.3 Image integration
    3.6.4 Averages
    3.6.5 Pixel selection
Chapter 4 2D parametric motion analysis based on tensor voting
  4.1 Tensor voting
    4.1.1 Representation in 3D
    4.1.2 Tensor decomposition
    4.1.3 Tensor communication
    4.1.4 Feature extraction
  4.2 Affine motion and tensor voting
    4.2.1 Affine model and joint image space
    4.2.2 Exploitation for affine motion
    4.2.3 Initial tensor
    4.2.4 2D contour consideration
  4.3 Outlier removal
  4.4 Layer segmentation
  4.5 Layer merging
  4.6 Parameter recovery
  4.7 Multiple affine motion groups from sparse data
  4.8 Motion layer and color homogeneity
  4.9 Complexity
  4.10 Affine to homography
    4.10.1 Homography and scene planes
    4.10.2 Affine to homography
    4.10.3 Image rectification from known matches
    4.10.4 Grouping with normal in 2.5D disparity space
    4.10.5 Results
Chapter 5 Conclusion
  5.1 Summary
  5.2 Comparison between parametric and non-parametric approach
  5.3 Future extensions
References
Appendix A Global registration
  A.1 Previous work
  A.2 Global registration
  A.3 Topology inference
  A.4 Global registration constraint
  A.5 Global parameter refinement
  A.6 Global registration in the presence of moving objects
Appendix B Processing for non-parametric motions
  B.1 Previous work
  B.2 Background layer extraction
  B.3 Foreground layer extraction
  B.4 Application to compression

List of Tables

Table 1 Comparison of estimated parameters to given real parameters
Table 2 Estimation result - I
Table 3 Estimation result - II

List of Figures

Figure 1 Illustration of a layered representation
Figure 2 Objective of layered representation
Figure 3 Illustration of a rigid transformation in 3D
Figure 4 Illustration of a planar homography transformation
Figure 5 Comparison between Harris corners and high curvature points
Figure 6 Parameter estimation in frequency domain
Figure 7 A single motion estimation for creating a mosaic
Figure 8 A mosaic application based on single motion estimation from a set of still telescope images
Figure 9 A mosaic application based on single motion estimation, results from a set of still images, and comparisons between manually stitched images and automatic mosaic images
Figure 10 Motion estimation for a swimming pool surveillance
Figure 11 Motion estimation for a street surveillance
Figure 12 Motion estimation for a basketball sequence
Figure 13 Image integration methods for creating a mosaic
Figure 14 Recovering multiple affine motions using tensor voting
Figure 15 Tensor decomposition in 3D, excerpted from [Medioni00]
Figure 16 The fundamental 2D stick field
Figure 17 Voting in 2D
Figure 18 Illustration of decoupled joint image spaces
Figure 19 Plane-fitted voting field
Figure 20 Voting within boundary
Figure 21 Merging clusters from two decoupled joint image spaces
Figure 22 Synthetically generated correspondences
Figure 23 Synthetically generated correspondences and two views of correspondences from Figure 22 in decoupled joint image space
Figure 24 Synthetic affine motions
Figure 25 Three detected dominant affine motions
Figure 26 Inputs (walking scenes)
Figure 27 Correlation-based initial correspondences
Figure 28 Image difference after motion compensation
Figure 29 Inputs (basketball scenes)
Figure 30 Correlation-based initial correspondences
Figure 31 Image pixel differences after motion compensation
Figure 32 Application to 2D mosaic
Figure 33 Image pixel differences after motion compensation
Figure 34 Motion-based layer segmentation result
Figure 35 Refined motion layers
Figure 36 Motion boundaries for each affine motion layer
Figure 37 Extracted motion layers for the mobile&calendar sequence
Figure 38 Planar homography transformation
Figure 39 Plots of homography transformations in a decoupled joint image space (x, y, x') with parameter changes
Figure 40 Normal vector to a surface (excerpted from MathWorld)
Figure 41 Input images (three books)
Figure 42 Affine patches
Figure 43 Affine patches in 2.5D disparity space with their normal directions after image rectification
Figure 44 Three homography groups from local affine patches
Figure 45 Motion compensation result
Figure 46 Input images
Figure 47 Affine groups
Figure 48 Homography groups and motion compensation
Figure 49 One global motion compensation result by RANSAC
Figure 50 Input images
Figure 51 Estimated epipolar lines
Figure 52 Two homography groups in 2.5D disparity space and motion-compensated images
Figure 53 Synthetic motions - I
Figure 54 Synthetic affine motions - II
Figure 55 Result comparison - I
Figure 56 Result comparison - II
Figure 57 Result comparison - III
Figure 58 Result comparison - IV
Figure 59 Result comparison - V
Figure 60 Residual error based on local registration
Figure 61 Global registration method
Figure 62 Initial linear tree linked by black edges, and a frame graph linked by black and gray edges, for the "palace" sequence
Figure 63 Globally registered mosaic image - I
Figure 64 Globally registered mosaic image - II
Figure 65 Globally registered mosaic images - III
Figure 66 Global registration result for a basketball sequence without blending
Figure 67 Global registration result for a basketball sequence with blending
Figure 68 Local registration for the "soccer" sequence
Figure 69 Global registration for the "soccer" sequence
Figure 70 Input "Girl Walking" sequence, three frames from 158 input frames
Figure 71 Background layer extracted from the "Girl Walking" sequence
Figure 72 Extracted foreground layers
Figure 73 Synopsis mosaic superimposing the foreground layers on the background layer
Figure 74 Mosaic video displaying the superimposed foreground layers on top of the background layer at each time t
Figure 75 Compression example

List of Equations

Equation 1 A rigid transformation in 3D
Equation 2 A planar homography transformation
Equation 3 2D transformation
Equation 4 Feature matching
Equation 5 Estimation of affine transformation
Equation 6 Estimation of homography transformation
Equation 7 Symmetric transfer error
Equation 8 Affine transformation
Equation 9 Rewritten affine transformation in parametric space
Equation 10 Symmetric transfer error
Equation 11 Correlation matrix from a group of inliers
Equation 12 Planar homography transformation
Equation 13 Planar homography transformation in a joint image space
Equation 14 Planar homography transformation in decoupled joint image space
Equation 15 Definition of the (x, y, f(x,y)) space
Equation 16 Normal vector computation from affine patches
Equation 17 Derivation of the normal of the plane in the 2.5D disparity space
Abstract

In video processing, it is desirable to have a structured representation that reduces data redundancy and facilitates operations for video analysis and manipulation. A conventional frame-based representation treats a video as a set of independent frames containing redundancies, and provides VCR-like operations. In contrast, a layer-based representation provides structures, called layers, that characterize the spatial and temporal correlation between frames. This reduces redundancy and allows layer-based operations, which are consistent across frames and therefore more efficient. In this representation, a layer is a 2D image region with time-varying information generated by camera motion and object motion. Each layer is associated with a corresponding motion descriptor and a compact form of image regions from the original images. Layers can efficiently represent non-rigid objects, the motion of transparent objects, and shallow 3D surfaces. Each layer also facilitates encoding additional information such as texture and blurring masks.

To obtain a layer-based video representation, grouping based on motion and spatial support is essential, and challenging. In this dissertation, a robust layer inference method is presented. There are two types of layers: 3D layers and 2D layers. A 3D layer consists of a 3D plane equation, the texture of the plane, and a depth ordering per pixel. A 2D layer consists of a 2D transformation equation and a 2D sub-image texture including the pixels under the same motion. Extracting 3D layers requires the analysis of camera motion, which is a non-trivial and unnecessary task for some applications such as video coding and analysis. This dissertation focuses on extracting 2D layers from uncalibrated images. The targeted motion models are affine and homography transformations.

My approach reliably extracts multiple 2D motion layers, affine or homography, from noisy initial matches. This approach is based on: (1) a parametric method to detect and extract 2D affine or homography layers; (2) the representation of the matching points in decoupled joint image spaces; (3) the characterization of the property associated with affine transformations in the defined spaces; (4) a process to extract multiple 2D motions simultaneously based on tensor voting; (5) local affine to global homography estimation; and (6) layer refinement based on a hybrid property: motion and color homogeneity.
The robustness of my approach is demonstrated with many results in diverse applications.

Chapter 1 Introduction

Video contains more information than a set of independent still images because it captures the temporal evolution of scenes. Frames in a video are correlated in time and vary only slightly from the previous frame. A conventional frame-based representation treats a video as a set of independent frames containing redundancies. This frame-based representation allows VCR-like operations, such as forward, rewind and pause. It also carries out repetitive manipulations of the common pixels appearing in many frames, although these could be addressed as unified operations based on frame coherence. Consequently, the frame-based representation discourages the development of efficient video tools.

In video processing, it is desirable to have a structured representation that reduces data redundancy and facilitates operations for video analysis and manipulation. As opposed to the frame-based representation, a layer-based representation provides structures called layers that characterize the spatial and temporal correlation between frames. This reduces redundancy and allows layer-based operations, which are consistent across frames and therefore more efficient. In a layered representation, a layer specifies a 2D image region with time-varying information generated by camera motion and object motion. Each layer is associated with a motion descriptor and a compact form of image regions from the original images. Layers can efficiently represent non-rigid objects, the motion of transparent objects, and shallow 3D surfaces. Each layer also facilitates encoding additional information such as texture and blurring masks. Many researchers have popularized the concept and usage of the layered representation over the past decade [Adelson91] [Darrell91] [Ayer95] [Irani96] [Ju96] [Shi98] [Koenen01] [Torr01] [Ke02]. They have demonstrated that a layered representation plays a promising role between a 2D image and 3D structures.

Figure 1 depicts the concept of the desired layers. Video frames are captured in the presence of camera motion and independent object motions. The desired layers are extracted by analyzing these motions from a set of moving images. Each layer represents a 2D sub-image under a consistent motion. Hence, grouping based on motion and spatial support is essential to obtain a layer-based video representation. Many researchers have worked toward a layered representation; however, many methods still rest on assumptions, such as a pre-specified number of layers, and are sensitive to the initial setting or to noise.

Figure 1 Illustration of a layered representation (frames captured at times t through t+N in the presence of camera and object motions, and the desired layers extracted from them)

There are two types of layers: 3D layers and 2D layers. In a 3D-layer based representation, each layer consists of a 3D plane equation, the texture of the plane, a depth ordering per pixel, and more. In a 2D-layer based representation, each layer is associated with a 2D transformation equation and its corresponding image texture. Extracting 3D layers is a non-trivial, computationally sensitive and excessive task for many applications, such as video coding and analysis. For these applications, a 2D-layered representation is sufficient. This dissertation focuses on extracting 2D layers from uncalibrated images and presents a robust layer inference method.
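To make this 2D-layer structure concrete, the following minimal sketch (in Python) shows the kind of data a 2D layer carries: a parametric transformation, a support mask, and a compact texture. The class and field names are illustrative assumptions, not the dissertation's actual implementation, and the affine case is assumed.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class Layer2D:
        """Illustrative container for one 2D motion layer (names hypothetical)."""
        params: np.ndarray   # 2x3 affine matrix [[a1, a2, a3], [a4, a5, a6]];
                             # a projective layer would store a 3x3 homography
        support: np.ndarray  # HxW boolean mask: pixels assigned to this layer
        texture: np.ndarray  # compact sub-image holding the layer's appearance

        def warp_points(self, pts: np.ndarray) -> np.ndarray:
            """Map Nx2 pixel coordinates into the next frame (affine case)."""
            homogeneous = np.hstack([pts, np.ones((len(pts), 1))])
            return homogeneous @ self.params.T  # Nx2 warped coordinates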
There are several major computational issues in extracting layers: (1) the determination of the number of layers; (2) the representation of the motion of each layer; and (3) the assignment of pixels to layers. To address these issues, my approach provides the following:

• No requirement to pre-specify the number of layers.
• Each layer is associated with a parametric 2D transformation; the targeted motion models are 2D affine and homography transformations.
• Pixels are assigned to a layer based on motion saliency and color-region matching.

This approach is based on the following features.

• A parametric approach to detect and extract 2D affine or homography layers. My approach reliably extracts multiple 2D motion layers, affine or homography, from noisy initial matches. Grouping is driven by the objective motion transformation, and discontinuities between layers are defined by the parametric motion models. In general, non-parametric approaches, which extract motion layers without a specific motion model, aim at general clustering based on motion smoothness. However, non-parametric approaches sometimes produce accidental groupings or non-groupings that parametric approaches can avoid. A parametric motion estimation can approximate and generate motion information where proper matches cannot be determined locally. Moreover, many applications, such as compression, require extracting the motion transformation parameters in order to encode layers.

• The representation of the matching points in decoupled joint image spaces. My approach recasts the algebraic motion estimation problem as a geometric problem in order to apply stronger coherence and constraints for motion layer clustering. The geometric spaces formed from the initial motion matches are decoupled joint image spaces.

• The characterization of the property associated with affine transformations in the defined spaces. In decoupled joint image spaces, affine transformations appear as planes, and the pixels of a layer are clustered based on this plane property (a short derivation is given after this list).

• A process to extract multiple affine motions simultaneously based on tensor voting. Tensor voting identifies and estimates locally salient multiple affine motions while removing outliers. This process is non-iterative and requires no prior assumptions, such as a pre-specified number of layers; the only parameters are the neighborhood size for voting and the sub-pixel discontinuity for clustering. Many existing motion extraction methods are iterative: they either perform iterative fitting or require a fixed number of layers in advance. Iterative fitting methods often fail to distinguish and extract several motions, because they fit the most globally dominant motion first from the initial matches, remove the selected matches, and then find the next dominant motion iteratively. In the presence of multiple motions and many mismatches, a dominant motion is not conspicuous, so finding a globally dominant motion is unstable.
Tensor voting allows us to infer likelihood information, which provides a smooth clustering criterion and the saliency of pixels belonging to a smooth motion region. The corresponding affine parameters are recovered by analyzing the correlation matrix of the clustered pixels.

• Local affine to global homography estimation. The affine layers are clustered into homography motion groups, based on the fact that affine patches under the same homography reside on a plane in disparity space. My method estimates local affine motions first; all small local motions are identified as affine motions. The local motions are then grouped into homography motions.

• Layer refinement based on a hybrid property: motion and color homogeneity. After the dominant motion layers are extracted, the layers are refined using color-region homogeneity. The underlying assumption is that each homogeneous color region belongs to the same motion region. My approach uses this assumption to define an accurate boundary for each layer and to fill holes within the extracted layers. In general, the local computation of correlation between pixels does not provide a correct motion around occluding/occluded boundaries and in noisy areas, where the extracted layers contain indecisive pixels (holes). For this reason, homogeneous color regions are used to assign indecisive pixels to the most likely motion layer, determined by computing image residual errors based on the extracted motion parameters.
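The plane property referenced above follows directly from the affine model; as a sketch of the reasoning (the formal treatment appears in Chapter 4), write the affine map between matched points (x, y) and (x', y') as

    x' = a_1 x + a_2 y + a_3, \qquad y' = a_4 x + a_5 y + a_6 .

In the decoupled joint image space (x, y, x'), the first equation reads a_1 x + a_2 y - x' + a_3 = 0, which is a plane with normal (a_1, a_2, -1); likewise, the space (x, y, y') yields a plane with normal (a_4, a_5, -1). Each affine motion therefore appears as a pair of planes, one per decoupled space, and clustering matches into layers reduces to grouping coplanar points.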
Figure 2 shows the objective of the layered representation and its applications. A layered representation can be applied directly to video encoding (compression); for the same reason, the compression community has already standardized a motion-based layer as a shape encoding unit. The structure of layers can be used to facilitate matching and recognition of objects. Layers derived from multiple frames also provide enhanced views, such as a panoramic view. Consequently, these properties can be exploited in many applications such as video surveillance and tele-commerce.

My approach to inferring a layer-based representation has advanced with the complexity of the motions considered: from sparse single motion analysis to multiple affine motion analysis. For this reason, this dissertation begins by explaining techniques for 2D single motion extraction and the associated issues, and then focuses on techniques for multiple affine motion extraction.

The chapters of this dissertation are organized as follows. Chapter 2 introduces previous work and compares it to my approach. Chapter 3 addresses 2D parametric motion analysis. Chapter 4 introduces the tensor-voting-based multiple affine motion segmentation method. Chapter 5 presents the summary of this dissertation and suggests possible future extensions. Appendices A and B present global registration and the background layer extraction method.

Figure 2 Objective of layered representation (diagram: uncalibrated moving images, which are non-structural, redundant, and limited to VCR-like operations, pass through inference modules of motion layers — 2D parametric models (affine, projective), simultaneous multiple motion detection, and a hybrid method using motion and color homogeneity — to yield a compact, structural 2D layered representation that makes geometric correlation between frames explicit and supports impacts and applications such as video encoding (compression), matching and recognition, synthesis, enhanced visualization, video editing, content-based query, and video surveillance)

Chapter 2 Related work

Many previous approaches are partially related to my approach; this chapter therefore reviews them in several categories. First, approaches to extracting motion layers are addressed. Then, estimation methods for 2D motion parameters are reviewed. In the last section, methods using homography geometry for 3D analysis are reviewed.

2.1 Layered representation

Grouping based on common motion, with a motion descriptor, is one of the strongest cues for segmenting an image into separate regions or objects. Rather than using an edge-based approach to represent the segmentation of a scene, representing an image stream as layers focuses on grouping individual pixels into image chunks under commonly constrained groups. This grouping concept has been popularized and has evolved around the idea of estimating motion models and their spatial supports. Over the past decade, the perspective of viewing a video stream as motion layers has been developed in [Adelson91] [Darrell91] [Ayer95] [Irani96] [Ju96] [Shi98] [Koenen01] [Torr01] [Ke02].

Parametric approaches to extracting motion layers use a specific motion model to define the grouping constraint: each pixel or region is grouped into a layer based on an object motion model. Several iterative parametric approaches exist [Adelson91] [Wang94] [Irani92,93,96]. [Adelson91] defines layers as a representational unit for video encoding and presents corresponding decoding methods based on the described layer representation. The extraction of layers is based on a pre-specified number of 2D affine motions, and the segmentation is refined based on image warping error. [Wang94] iteratively clusters motions using pre-computed optical flow; this method computes binary ownership of each region throughout the iterative process and thereby determines the regions. Although it uses the same affine motion models, the process is iterative and uses a predefined number of clusters (k-means clustering).
The process involves iterative estimation, and is challenged by multiple large moving objects inducing multiple significant dominant motions. In [Irani93], Irani et al. detect and track moving objects, and they reconstruct the incident background occlusions by moving objects at time t. In their work, 2D motion parameters are used to model object motions and use tracking information to refine segmented regions. As an extension in [Irani96], Irani et al. describe a hierarchical representation including layers for the video stream and show its usability according to the applications. They focus on the usage of 2D mosaic rather than layer extraction. The iterative approaches either start with a pre-specified number of layers to fit pixel motions, or perform repetitive extraction-and-removal of dominant motions. In other words, they are based on an assumption of a known number of layers or an existence of conspicuously dominant motion. In [Ju96], the method extracts multiple motion layers based on affine motion descriptor, refines the noisy areas by re-estimating optical flow based on the regularization and repeats these steps from coarse to fine resolutions. The regularization-based method focuses on the boundary discontinuity between regions. This approach attempts to apply different smoothing factors and detect accurate layers based on unreliable discontinuity around motion boundaries extracted from locally measured correlation. Approaches based on the use of Markov Random Fields (MRF) focus on handling discontinuities in the optical flow [Heitz93][Gelgon97][Kervrann95][Boykov98]. These methods give some good results, but they rely on a spatial segmentation result in the early stage o f algorithm, which may not be practical in many cases. 9 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. The main issue associated with layer-based segmentation is the selection of the number of motion regions and the likelihood measure for clustering pixels in the presence of multiple motions and many outliers. To cope with problems associated with multiple motions, several EM-based methods were proposed [Darrell91] [Ayer95] [Weiss96] [Weiss97] [Baker98] [Szeliski99] [TorrOl]. [Darrell91] described a framework for representing image regions as homogeneous chunks (layers) based on the homogeneous support. In this framework, the initial set of hypotheses are set, collect support for those hypotheses and finalizes the number of hypotheses. The framework is eventually based on the fixed number of motions and an extension of the M- estimation methods, and it utilizes many o f estimators including motion or color. [Ayer95] starts layer segmentation without a specified number of layers, and performs an exhaustive search at the M-estimation to extract reasonable number of layers. [Baker98] and [Szeliski99] proposed to use planar layers to model 3D scenes from stereo images, and they also used parametric motion estimation to extract layers. EM-based methods have showed good results. However, these EM- based methods also require specifying the number of regions and a significant amount o f iteration to reach satisfactory performance. In case that EM-based methods start without a specified number of layers, they usually require massive search, which might not guarantee the optimal grouping of regions. 
As opposed to EM-based method, which requires careful initial setting, my approach extracts dominant motion layers using tensor voting and produces precise layer boundaries without any prior assumptions. Also, my approach utilizes color homogeneity to refine motion layers. The used assumption is that each homogeneous color region belongs to one uniform motion. This color homogeneity property has been extensively used in many motion analysis [Pardas94] and stereo [TaoOl]. [Pardas94] used affine motion to compensate several intensity or texture regions based on dense motion fields. The method is a region-based motion compensation and segments 10 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. image into different motion regions by estimating affine motions for each segmented texture or intensity image regions. In [TaoOl], by adopting a color segmentation based plane-plus-residual- disparity depth representation, the smoothness constraint in textureless regions is enforced without sacrificing accuracy on depth discontinuities. My method is a parametric approach cooperated with tensor voting for region segmentation. The considered motion model is a 2D motion. Non-parametric and non-iterative region segmentation methods based on tensor voting were proposed [Gaucher99] [Nicolescu02]. [Gaucher99] proposed a method extracting motion layers based on 2D tensor voting in 2D vector fields estimated across three frames and cooperating boundary information in order to refine motion boundaries. As an general extension from [Gaucher99] which has a limitation in extracting general non-parametric motion layers solely based on 2D voting, [Nicolescu02] extracts motion layers from two consecutive frames by using Tensor Voting in a 4D space defined by image coordinate and motion vectors in each pixel. [Shi98] proposed a motion segmentation method by using normalized cuts, which is non-parametric. The grouping is based on the probability distribution of the image velocity and the distance between motion profile at two pixels. Generally, grouping based on normalized graph cut method suffers from its performance as the image size grows. Non-parametric approaches solve general clustering problem. However, the applications requiring parameter recovery, such as compression, benefit from a motion segmentation approach that groups pixels into region of similar parametric motions. 2.2 2D motion parameter estimation Parametric layered approach involves a motion estimation step. Parameter recovery is to estimate correct values for parameters in a given motion model. We introduce previous work for estimating parameters for different models by using different approximation methods. [Chen94] [Reddy96] introduce a method to get 2D rigid transformation parameters in frequency domain. 11 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. They convert images into frequency domain by using Fast Fourier Transformation (FFT) and compute phase correlation to get the peak global similarity o f two images. By converting images into polar coordinate or log-polar coordinate, it detects rotation or scale variance in frequency domain. This method is actively used in pattern matching application to identify object and retrieve it from database. The method works for largely displaced images. But the recovered parameters are not sub-pixel accurate and the rotation and scale factor requires higher quantization than usual NTSC input. 
Therefore, this approach is not sufficient for precise parameter estimation. In our approach, this method is used to estimate initial guess for parameter estimation in 2D image registration. Several techniques were proposed for extracting correspondences that fit the targeted parametric model. RANSAC and its enhanced variations are considered robust and are capable of efficiently removing outliers [Fisher81] [LaceyOO]. Recently, [Torr96] introduced the enhanced version of RANSAC called MLESAC. MLESAC carefully guides the selection of the random samples for the next step. It is based on the distribution of the points in order to reduce or reach the theoretical maximum boundary of RANSAC repetition. For a single motion model estimation, RANSAC based methods work very well to remove outliers. However, when several motions exist in the set of correspondences, these RANSAC based approaches do not discriminate or detect several inlier groups from outliers. Alternatively, tensor voting based-methods for estimating fundamental matrix [Tang99b] [TongOl] were recently proposed to remove the outliers robustly. The tensor voting formalism was used as a pre-processing step for outlier removal in 8D and 4D space by characterizing hyper-surfaces. In 8D approach, a plane is parameterized by 8 fundamental matrix variables, and the outliers are the points not on the plane. In this approach, a local smoothness and a global structural constraint are used at the same time. However, the parameters defining the 8D basis 12 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. make the space neither orthogonal nor isotropic. Therefore, the input must be properly scaled prior to processing. In [TongOl], a 4D approach using the joint image space is derived from point correspondences to reduce the dimensionality and to provide isotropic and orthogonal properties. Inliers are detected as points on a 4D cone as defined by the epipolar constraint in 4D joint image space. Although these methods remove outliers robustly, they do not enforce the structural information during voting stage, recover parameters via an additional RANSAC procedure, and multiple motions are detected by iterative approach. More recently, Jurie et al. use the hyperplane approximation for real-time region tracking [Jurie02], The parameter estimation between two regions in consecutive frames can be obtained by the approximation of Jacobian image. But it involves re-computation of image gradient with respect to the parameters at each iteration, and this step is computationally expensive. Instead, Jurie augments the Jacobian matrix with objective parameters and rewrites it as a coefficient matrix of a set of hyperplane equations. Then, he estimates the matrix minimizing the error from the set of training data set in the least square manner. Therefore, for the tracking, this approach is to find parameters leading to minimum error from the pre-determined coefficients of the set of hyper-plane equations in learning stage. In this paper, a hyperplane is defined by the derivatives of two vectors with the intensity values of target regions and each parameter, which is different from our approach, and the major role of those hyperplanes is for the prediction of next parameters. The extensive use of 2D motion parameter estimation can be also found in mosaic technique. The purpose of mosaic technique is to create a panoramic view after compensating 3D planar motions. 
The motion compensation technique is usually referred as a 2D image registration. Traditional MPEG-2 method recovers a motion vector for each of small blocks (8x8 or 16x16) 13 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. [KoenenOl], In this case, it cannot model a global motion that correlates each of block motions efficiently. For this reason, most of methods prefer to parametric approach. McMillan et al. generate a panoramic view for a rotating fixed camera and represent the images in plenoptic function with respect to the camera center [McMillan95]. Moreover, they generate a view for a virtual viewpoint from a set of constructed panoramic view and simulate the depth by re-sampling plenoptic function. Szeliski et al. also focus on the mono-centric mosaic, which has a fixed camera rotation axis, and show some extension for 8 parameter recovery [Szeliski94] [Szeliski97]. The method recovers the rotation parameters with focal length or 8 parameters by using Levenberg Marquardt. The idea is extended for the concentric mosaics to allow us to generate virtual camera views within the concentric mosaic views [Shum99] [QuanOl]. The limitation of M cMillan’s work is that it can only create mono-centric panoramic views and requires well-controlled camera system. Szeliski’s work suffers from the local minima problem in case of large interffame motion. Irani et. al recover the 2D quadratic transformation based on the direct image matching [Irani93] [Irani95] [Kumar95] [Irani96]. They also exploit the parallax information to process residual information after the image alignment resulting in 3D corrected mosaic image. Morimoto et al. address how to extract dominant motion between two frames based on a 2D parametric model [Morimoto97] [Morimoto98]. In [Morimoto97], they show that 3D model based stabilization using Kalman filter to estimate the rotation between frames and derotate the camera sequence to compose mosaic. In addition, they show the accuracy of image stabilization technique by using power signal-to-noise ratio [Morimoto98]. Sawhney et al. propose a mosaic method based on the similar minimization technique to Irani [Sawhney95] [Sawhney97] [Sawhney99], But they take into account lens distortion and perform the minimization for all the parameters rather than pair-wise parameters. Irani and Morimoto groups take into account of the 3D correction for 14 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. mosaic construction, but they do not consider the case of large interframe motion nor and the accumulated error caused by pair-wise registration. Sawhney group partly solves the accumulated error by estimating all transformation parameters at one time. However, considering number of parameters which include 2D quadratic parameter and lens distortion factors for each pair of frames, the minimization technique does not converge unless good initial estimations are given, and the process takes longer time to converge. The result mosaic image is influenced by the shape of the environment map. Peleg et al. introduce manifold mosaic to minimize the image warping caused by the camera motion [Peleg97] [Rousso97] [Rousso98]. The manifold is adaptively defined by the optical flow. They use ‘slit’ as a mosaic unit and stitch the slits. Unlike other methods, they well handle the zoom- in/out case by using pipe projection. 
This slit-based method is dedicated to creating better mosaic images; therefore, it does not give a straightforward method to recover parameters that would be useful for other operations, for example, inversely regenerating the input video. Several surveys explain and compare parameter recovery methods [Hartley97] [Triggs99] [Zhang97]. [Zhang97] presents a survey of epipolar geometry estimation. [Hartley97] presents and compares parameter recovery and error minimization techniques for triangulation. The paper points out that complicated non-linear methods are not necessarily better, and that an iterative linear technique can approximate the solution. [Triggs99] gives an extensive survey of bundle adjustment techniques and shows that a non-linear optimization technique can perform as fast as a linear method by analyzing the properties of the objective matrix. All the described methods are based on correspondences, and their robustness depends significantly on correct outlier removal.

2.3 Projective geometry for motion estimation

Beyond the conventional algebraic estimation of the 2D homography parameters from image correspondences, my approach exploits the homography induced by the same world plane in order to group affine motion groups into several homography groups. This builds on the properties of planar homographies in projective geometry. In this chapter, several methods using homography properties for motion estimation or grouping are introduced. [Luong93] [Szeliski98] show that knowing and using a planar homography from the scene, or the points belonging to the same plane, can lead to a much better 3D reconstruction. In [Luong93], the homography property between planes is used for the estimation of the fundamental matrix from uncalibrated images in order to cope with the instability of general fundamental matrix estimation methods. When estimating a fundamental matrix based on point correspondences, the paper shows that the quality of the estimation increases with the number of planes on which the correspondences lie. Alternatively, a number of homographies are estimated, and the fundamental matrix is then computed from the estimated homographies. [Szeliski98] shows that using external geometric knowledge can lead to significantly better 3D reconstruction, especially by hallucinating additional correspondences in areas of known planar motion and applying higher-order constraints. Instead of exploiting camera motion parameters for 3D reconstruction, the homography framework has been exploited in many papers. The homography invariant is used for reconstruction and reprojection in [Shashua94]. Unlike the classic approach for camera motion estimation involving a depth parameter, the approach represents a homography depth and the associated homographies for fixed planes as independent of the camera internal parameters. Kumar [Kumar94] addressed a motion estimation method that is based on decomposing the motion field into the image motion of a parametric surface and a residual parallax field. The
residual parallax field is an epipolar field and is used for a homography-based 3D reconstruction of the scene, based on the fact that, once the motion of points on a parametric surface is compensated, the parallax field is directly proportional to the height of a point and inversely proportional to its depth from the camera. The scene description depends on a reference plane, so the method is not likely to work for scenes composed of several planar surfaces facing different directions. In [Demirdjian99], the method allows for egomotion (camera motion) and independent motions. The egomotion is characterized by 3D homography constraints, and the independent motions are then computed. In this paper, the camera motion is assumed to be the dominant motion, so inlier and outlier groups are discriminated based on rigid motion constraints in homography space. The remaining independent object motions are recovered from the extracted outlier points. The outlier points are grouped together based on a distance metric in the image after projection. The grouping is done by a hierarchical clustering algorithm [Johnson67]. In this paper, the clustering constraint depends purely on distance without a proper motion estimation: it groups points that are close together and throws out isolated points. Unlike other methods, [Anandan00] addresses the fundamental matrix property in the 4D joint image space. The estimation of the fundamental matrix proceeds from estimating local affine parameters, which define the local tangent surface, to estimating global homography parameters, which define the global 4D cone in the 4D joint image space. The joint image space and the associated matching constraints are also introduced in [Triggs95]. The analysis of the motion of a planar surface across multiple views is addressed in [Zelnik-Manor99] and [Zelnik-Manor00]. Those papers show that the collection of homographies of multiple planar surfaces across multiple views is embedded in a low-dimensional linear subspace, and this constraint is used to detect moving objects. Similarly, [Ke01] and [Ke02] exploit the subspace constraint derived from the relative affine transformations across multiple frames in order to extract 2D layers. In addition, subspace constraint-based motion estimation shares the idea of enforcing clustering constraints in a different space. [Kanatani01] proposed a Tomasi-Kanade factorization-like motion segmentation method with subspace separation, which segments different motion regions based on features tracked across multiple frames. The method enhances the original mathematical formulation by incorporating dimension correction, model selection, and least-median fitting in order to increase the stability against noise. The subspace constraint-based approaches successfully analyze the common motion property across multiple frames. However, these approaches are very sensitive to the construction of the matrix, which determines the extraction of the low-dimensional basis, and they require additional iterative grouping processes to cluster pixels after the initial segmentation in the low-dimensional space.

Chapter 3 2D parametric motion analysis

My approach focuses on estimating 2D parametric transformations: affine or homography.
Therefore, it is necessary to specify under which circumstances 2D transformations can approximate the scene motions induced by a camera motion or by object motions in 3D. In the first section, the relationship between a camera motion and a 2D transformation is described. Then, the next section addresses how to estimate the 2D motion parameters.

3.1 Camera motion and 2D transformation

The motions captured by a video stream are related to camera motions in 3D. The mapping between two camera coordinates in 3D, (X, Y, Z) and (X', Y', Z'), can be represented by a rotation matrix (R) and a translation vector (T), as shown in Figure 3 and Equation 1.

Figure 3 Illustration of a rigid transformation in 3D

\[
\begin{pmatrix} X' \\ Y' \\ Z' \end{pmatrix} = R \begin{pmatrix} X \\ Y \\ Z \end{pmatrix} + T = \begin{pmatrix} R_{11} & R_{12} & R_{13} \\ R_{21} & R_{22} & R_{23} \\ R_{31} & R_{32} & R_{33} \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \end{pmatrix} + \begin{pmatrix} t_1 \\ t_2 \\ t_3 \end{pmatrix}
\]

Equation 1 A rigid transformation in 3D

A point in the (X, Y, Z) camera coordinate frame maps to (X', Y', Z') by a rigid transformation. For a rigid camera motion viewing a 3D planar surface, there exist two planar homography transformations between the world plane and each camera image plane. There also exists a planar homography transformation between the two images. Therefore, if all points (X, Y, Z) lie on a world plane, there exists a planar homography between the world plane and the camera image plane. Images of points on a plane are related to the corresponding image points in a second view by a planar homography, as shown in Figure 4. This is a homography relation since it depends only on the intersections of planes with rays: through the homography, a point in one view determines a point in the other view, which is the image of the intersection of the ray with the plane. Therefore, a 2D homography models the transformation of a planar object in 3D between two imaging planes. In general scenes, a 3D planar surface transformation exists in scenes taken by a camera rotating about its axes (pan-tilt). It also exists in scenes taken with a tele-photo lens, in which the depth of the objects is much smaller than the distance between the objects and the camera. Scenes taken by a pan-tilt camera or with a tele-photo lens can thus be approximated by 2D transformations [Faugeras93] [Hartley00]. For closely related views or far-distant scenes, an affine transformation can be used to approximate the transformation. For discrete views separated by a significant camera rotation, or for closely-taken scenes, a homography transformation is used.

Figure 4 Illustration of a planar homography transformation

\[
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = H P = \begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{pmatrix} \begin{pmatrix} X \\ Y \\ 1 \end{pmatrix}, \qquad
\begin{pmatrix} x'_1 \\ x'_2 \\ x'_3 \end{pmatrix} = H' P = \begin{pmatrix} h'_{11} & h'_{12} & h'_{13} \\ h'_{21} & h'_{22} & h'_{23} \\ h'_{31} & h'_{32} & h'_{33} \end{pmatrix} \begin{pmatrix} X \\ Y \\ 1 \end{pmatrix}
\]

Equation 2 A planar homography transformation

Based on the mathematical relation between camera motion and 2D transformations, recovering 2D transformations is equivalent to extracting the groups of objects belonging to the same 2D motion group. In other words, extracting a consistent affine motion identifies world planar surfaces that are nearly parallel to the image plane, or non-planar surfaces whose depth is small compared to the camera-to-object distance. Extracting a homography identifies general planar surfaces facing the same direction in 3D.
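To make the planar homography mapping of Equation 2 (and Equation 3 below) concrete, the following is a minimal NumPy sketch of transferring points through a 3x3 homography; the matrix H and the points are arbitrary illustrative values, not data from this dissertation.

import numpy as np

def apply_homography(H, pts):
    # Map Nx2 image points through a 3x3 homography and divide out the
    # third homogeneous coordinate, as in Equation 2.
    pts_h = np.hstack([pts, np.ones((pts.shape[0], 1))])
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

# An arbitrary example: small rotation, translation, and a mild projective term.
theta = np.deg2rad(5.0)
H = np.array([[np.cos(theta), -np.sin(theta), 10.0],
              [np.sin(theta),  np.cos(theta),  4.0],
              [5e-4,           0.0,            1.0]])
pts = np.array([[100.0, 50.0], [200.0, 80.0]])
print(apply_homography(H, pts))

With the last row set to (0, 0, 1), the same routine reduces to an affine map, which is why the affine model below can be treated as a special case of the homography.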
The equations of the affine and homography transformations between two image coordinates, one denoted (x, y) and the other denoted (x', y'), are the following:

(a) Affine transformation:
\[ x' = ax + by + t_x, \qquad y' = cx + dy + t_y \]

(b) A planar homography transformation:
\[ x' = \frac{p_{11}x + p_{12}y + p_{13}}{p_{31}x + p_{32}y + p_{33}}, \qquad y' = \frac{p_{21}x + p_{22}y + p_{23}}{p_{31}x + p_{32}y + p_{33}} \]

Equation 3 2D transformations

Mathematically, the affine motion can be estimated from three correspondences and the planar homography from four correspondences. The accuracy of the parameter estimation is directly influenced by the selection of correspondences. In general, correspondences are generated by automatic image matching and include many outliers. In the next section, a robust method to generate correspondences is introduced, because generating correspondences is a common task for motion analysis.

3.2 Correspondences as input data

The layer extraction method starts from the generation of correspondences. Although my approach provides a robust parameter estimation method in a later stage, the correspondences provided as inputs for motion estimation should be as accurate as possible.

3.2.1 Feature extraction

Features can be defined as corners [Zoghlami97], high curvature points, lines, and so on. In this dissertation, features are extracted by the Harris corner detector. The Harris corner detector computes a locally averaged moment matrix from the image gradients, and then combines the eigenvalues of the moment matrix into a corner strength whose local maxima indicate the corner positions. The difference between high curvature points (curvature = I / |∇I|) and Harris corners is shown in Figure 5. For fast parameter estimation, sparse correspondences based on feature extraction and matching are used. For a dense layer computation, all image pixels are treated as features.

Figure 5 Comparison between Harris corners (score map, α = 0.3) and high curvature points

3.2.2 Feature matching

After feature point extraction from the two images, we select initial correspondences. The similarity of correspondences is measured by either the Cross Correlation value (denoted CC from this point) or the Sum of Squared Difference value (denoted SSD from this point). The two measures can be used at different levels of the parameter estimation. Computing CC takes longer than computing SSD because the CC computation compensates for the mean value of the matching block. On the other hand, the CC computation removes spurious high correlations between pixels in homogeneous intensity/color blocks. In my approach, both CC and SSD are used, selectively mixed according to the image resolution being processed. CC is computed at the coarse resolution, where the matching window is relatively large compared to the original image resolution. SSD is computed at the finest resolution with a relatively small window.
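A minimal sketch of the two block similarity measures just described, assuming grayscale floating-point images and square matching windows; the function names and the exhaustive search strategy are illustrative, not the implementation used here.

import numpy as np

def ssd(block_a, block_b):
    # Sum of squared differences between two equally sized blocks.
    d = block_a - block_b
    return np.sum(d * d)

def cc(block_a, block_b):
    # Mean-compensated (normalized) cross correlation between two blocks.
    a = block_a - block_a.mean()
    b = block_b - block_b.mean()
    denom = np.sqrt(np.sum(a * a) * np.sum(b * b))
    return np.sum(a * b) / denom if denom > 0 else 0.0

def best_match(img1, img2, x, y, half, search):
    # Scan a (2*search+1)^2 neighborhood of (x, y) in img2 for the window of
    # img1 centered at (x, y); windows falling outside the image are skipped.
    ref = img1[y - half:y + half + 1, x - half:x + half + 1]
    best_score, best_xy = np.inf, (x, y)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y0, x0 = y + dy - half, x + dx - half
            if y0 < 0 or x0 < 0:
                continue
            cand = img2[y0:y0 + 2 * half + 1, x0:x0 + 2 * half + 1]
            if cand.shape != ref.shape:
                continue
            score = ssd(ref, cand)  # swap in 1 - cc(ref, cand) at coarse levels
            if score < best_score:
                best_score, best_xy = score, (x + dx, y + dy)
    return best_xy

The SSD/CC switch in the inner loop mirrors the strategy above: the mean-compensated measure at coarse levels, the cheaper SSD at the finest level.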
The coarse-to-fine approach becomes necessary because it expedites the computation and reduces the ambiguity between points in homogeneous intensity/color blocks. It is especially desirable for producing dense correspondences. In the dense case, the well-known Lucas-Kanade (multi-resolution) optical flow estimation can also be used as an alternative solution.

Equation 4 Feature matching

3.2.3 Global feature matching in the frequency domain

When there is a large translation, rotation, or scale change between the images, it is hard to estimate good initial correspondences. Failing to obtain good initial matches leads to false estimations for feature-based methods, or to a local minimum problem in the optimization step for direct methods such as Levenberg-Marquardt. To cope with this situation, the initial motion is estimated in the frequency domain. Fast Fourier Transform (FFT) methods differ from other registration strategies in that they search for the optimal match according to information in the frequency domain. Let I_1 and I_2 be two input frames that differ only by a displacement (x_0, y_0):

\[ I_2(x, y) = I_1(x - x_0, y - y_0) \]

The corresponding Fourier transforms F_1 and F_2 are related by

\[ F_2(\xi, \eta) = e^{-j 2\pi (\xi x_0 + \eta y_0)} F_1(\xi, \eta) \]

The cross-power spectrum of two frames I and I' with Fourier transforms F and F' is defined as

\[ \frac{F(\xi, \eta)\, F'^{*}(\xi, \eta)}{\left| F(\xi, \eta)\, F'^{*}(\xi, \eta) \right|} = e^{j 2\pi (\xi x_0 + \eta y_0)} \]

where F'^{*} is the complex conjugate of F'. The translation property of the Fourier transform (the Fourier shift theorem) guarantees that the phase of the cross-power spectrum is equivalent to the phase difference between the images. By taking the inverse Fourier transform of this representation, we obtain an impulse function; that is, a function that is approximately zero everywhere except at the displacement needed to optimally register the two frames. In polar coordinates, a rotation between two images appears as a translation. In the same way, a scale change between two images appears as a translation in log-polar coordinates. Therefore, we obtain the scale and rotation parameters by converting the inputs to different coordinate systems and computing cross-power spectra. More precisely, the input images are first converted into log-polar images. The scale factor is computed from the cross-power spectrum of the two Fourier transforms of the log-polar images. Then, the original images are rectified according to the computed scale. The rectified images are converted into polar coordinates for the rotation parameters. The rotation is computed from the cross-power spectrum of the two Fourier transforms of the polar images, and the images are rectified again according to the rotation parameters. Finally, the translation parameters are computed from the images rectified in both scale and rotation. This approach can approximate a stable solution for frames overlapping by at least fifty percent, due to properties inherited from the conversion of a Cartesian image to a Fourier image.
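The translation step of this pipeline can be sketched as follows with NumPy's FFT (the scale and rotation steps work the same way after log-polar and polar resampling); sign conventions depend on how the shift is defined, and this is an illustration rather than the dissertation's code.

import numpy as np

def phase_correlation(frame1, frame2):
    # Estimate the integer displacement (x0, y0) such that frame2 is frame1
    # shifted by (x0, y0), via the normalized cross-power spectrum.
    F1 = np.fft.fft2(frame1)
    F2 = np.fft.fft2(frame2)
    cross = F2 * np.conj(F1)
    cross /= np.maximum(np.abs(cross), 1e-12)   # keep only the phase
    impulse = np.abs(np.fft.ifft2(cross))       # peaks at the displacement
    dy, dx = np.unravel_index(np.argmax(impulse), impulse.shape)
    h, w = frame1.shape
    if dy > h // 2:                             # wrap large shifts to negative values
        dy -= h
    if dx > w // 2:
        dx -= w
    return dx, dy

Because the spectrum is normalized to unit magnitude, the peak location is insensitive to global intensity changes, which is what makes this initialization robust for large displacements.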
The diagram in Figure 6 shows the registration sequence in the frequency domain. The sequence flows from left to right, and a dashed line means rectification with the appropriate parameters (scale or rotation).

Figure 6 Parameter estimation in the frequency domain: the cross-power spectrum and impulse function are computed from the two frames in log-polar coordinates (scale), in polar coordinates (rotation), and in Cartesian coordinates (translation)

3.2.4 Coarse-to-fine hierarchical approach

Although the registration in the frequency domain gives good starting points, it is computationally expensive to convert each pair of input images to the frequency domain and register them. To support efficient processing, we use a hierarchical approach as well. A hierarchical approach performs the matching steps from coarse to fine image resolution. The method is the following. In the first step, Gaussian pyramid images are created for each input frame [Burt83]. Then, the initial matching in the frequency domain is performed on the coarsest Gaussian image. Since noise creates high frequencies in the frequency domain, creating the Gaussian images is important: this step increases the resistance to noise, and therefore the computation is more robust. The rotation or scale parameters may not be accurately recovered at the coarsest level, because the polar and log-polar coordinates represented at a coarse resolution are not sufficiently precise to characterize a small rotation or scale change. But in the presence of large rotation or scale changes between consecutive scenes, loose initial estimates from the coarse level are still useful. Once the initial parameters are obtained, the features are propagated to the finer resolution. This step is repeated until the finest input resolution is reached.

3.3 Parameter estimation

This section presents the estimation of 2D motion parameterized by a 2D affine or homography transformation. In a parametric approach, the initial parameters can be estimated in many ways; the parameters are then iteratively refined so as to minimize the residual errors. These errors are measured by considering either all image intensities or features only. The former is called a direct method and the latter a feature-based method. Single motion estimation raises the following technical issues: (1) fast and efficient computation; (2) robust computation; and (3) globally registered estimation across the frames of a video. The method should be fast enough to process a video interactively, because many applications nowadays require interactive throughput. The method should be robust in the presence of many mismatches, called outliers; therefore, the method should provide a way to detect and remove outliers. Finally, the method should provide a consistent estimation registering multiple video frames. My method attempts to address the technical issues mentioned above. It is based on the robust feature matching method described previously, a RANSAC-based outlier removal, and a hierarchical parameter refinement. For globally registering multiple frames, we use a graph-based global registration, which is described in Appendix A. Figure 7 shows the overview of the single motion estimation approach. As depicted there, the process starts from creating multi-resolution images for each frame. The estimation of the homography parameters is highly dependent on the initial estimates.
Unfortunately, an arbitrary initialization tends to lead to a local minimum [Szeliski94]. Such a limitation can be overcome by estimating the initial parameters using global matching in the frequency domain [Chen94] [Reddy96]. However, the search for the parameters in the frequency domain is computationally expensive. The computation cost can be drastically reduced using a hierarchical approach. This coarse-to-fine registration technique is based on a several-level pyramid representation of the frames that allows us to recover, consecutively, the translation, affine, and homography parameters. This scheme allows us to refine the parameter estimation by matching feature points at different resolutions with an increasingly complex model. In the last stage of the parameter estimation, we improve the stability by using RANSAC.

Figure 7 A single motion estimation for creating a mosaic: for a pair of frames, multi-resolution images are created; the initial parameters are estimated by global matching at the coarsest image resolution; then, from the coarsest to the finest level, features are extracted and updated and the parameters are estimated; finally, image integration produces the mosaic

The input correspondences are computed as described earlier in this chapter. From the correspondences, the parameters (six parameters for an affine transformation and eight parameters for a homography) are computed. Recovering the parameters of the transformation is performed by minimizing the gray-level error with a least-squares measure such as

\[ E = \sum_k \left\| (x', y')_k - M_{ij}(x, y)_k \right\|^2 \]

where M_{ij} is the affine or homography transformation and (x, y) and (x', y') are corresponding points. To measure the error, we use a feature-based method. A feature-based method is relatively vulnerable to noise or moving objects. Also, if the features are not well distributed over the frame, the error measurement misleads the parameter estimation. But the parameters can be recovered very fast compared to a direct method. For this reason, we use only features. The drawbacks of this approach are reduced by forcing a uniform distribution of features and by dynamically updating the features and outliers at each level of the hierarchy. Using the initial translation parameters and the correspondences, we minimize the error as follows. Let us denote by (x, y) and (x', y') two corresponding feature points. We obtain the following equations:

\[
\begin{pmatrix} x' \\ y' \end{pmatrix} =
\begin{pmatrix} x & y & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & x & y & 1 \end{pmatrix}
\begin{pmatrix} a & b & t_x & c & d & t_y \end{pmatrix}^T
\]

Equation 5 Estimation of the affine transformation

\[
\begin{pmatrix} x' \\ y' \end{pmatrix} =
\begin{pmatrix} x & y & 1 & 0 & 0 & 0 & -xx' & -yx' \\ 0 & 0 & 0 & x & y & 1 & -xy' & -yy' \end{pmatrix}
\begin{pmatrix} p_{11} & p_{12} & p_{13} & p_{21} & p_{22} & p_{23} & p_{31} & p_{32} \end{pmatrix}^T
\]

Equation 6 Estimation of the homography transformation

We use the linear least-squares method to compute the parameters. The parameters obtained at a coarser resolution are propagated to the finer resolution by changing the translation parameters with respect to the scale factor. Based on the new parameters, we adjust the locations of the feature points and target points and repeat the error minimization steps.
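A minimal sketch of this linear least-squares step for the affine case (Equation 5), assuming Nx2 NumPy arrays of matched points; the helper name fit_affine is illustrative.

import numpy as np

def fit_affine(src, dst):
    # Solve Equation 5 in the least-squares sense for (a, b, tx, c, d, ty).
    # src, dst: Nx2 arrays of corresponding points, N >= 3.
    n = src.shape[0]
    A = np.zeros((2 * n, 6))
    A[0::2, 0:2] = src
    A[0::2, 2] = 1.0       # rows for x' = a*x + b*y + tx
    A[1::2, 3:5] = src
    A[1::2, 5] = 1.0       # rows for y' = c*x + d*y + ty
    rhs = dst.reshape(-1)  # interleaved targets (x'_1, y'_1, x'_2, y'_2, ...)
    params, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return params          # a, b, tx, c, d, ty

With exactly three correspondences the system is square and the fit is exact; with more points, the residual is minimized in the least-squares sense.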
In the homography transformation, the correspondence is given by x' = X/W and y' = Y/W, where

\[ X = p_{11}x + p_{12}y + p_{13}, \qquad Y = p_{21}x + p_{22}y + p_{23}, \qquad W = p_{31}x + p_{32}y + 1 \]

Note that in the homography case, this linear least-squares method minimizes the error term

\[ e = (X, Y) - W (x', y') \]

But we wish to minimize

\[ \varepsilon = (X/W, Y/W) - (x', y') \]

If the equation had been weighted by the factor 1/W, the resulting error would have been the one we want to minimize. Since W depends on (x, y), we cannot use a fixed weight W in the equation until we have solved the equation. Therefore, we proceed iteratively to adapt W. Let us denote the weight at the first step by W_0. At the next step, we can compute W_1 by finding p_31 and p_32. We repeat this process at each step, multiplying the equation by 1/W_i. If the number of repeated steps is n, the error measured by this process is

\[ \varepsilon = (X_n / W_n, \, Y_n / W_n) - (x', y') \]

which approximates the error that we want to minimize. As an experiment, we recovered the parameters from 4 correspondences between images with and without this process. With this iterative refinement, the errors are visibly reduced.

3.4 Robust estimation

Traditional least-squares algorithms use all the data to obtain the objective parameters. If there is a significant number of outliers, the estimated parameters are nowhere close to the real parameters. To prevent this situation, we use the RANdom SAmple Consensus (RANSAC) procedure in the parameter estimation stage. The RANSAC procedure is different in that it attempts to eliminate the invalid matches. As introduced by Fischler and Bolles [Fisher81], RANSAC uses as small an initial data set as feasible and enlarges this set with consistent data when possible. It partitions the data set into inliers and outliers based on a distance threshold t. In this dissertation, we use the symmetric transfer error d for this threshold:

\[ d = \sum_k \left[ d\big((x, y)_k, \, M_{ji}(x', y')_k\big) + d\big(M_{ij}(x, y)_k, \, (x', y')_k\big) \right] \]

Equation 7 Symmetric transfer error

The idea is the following: 3 points (affine) or 4 points (homography) are selected randomly to estimate the transformation. Then the support for this transformation is measured by the number of points that follow the transformation within the distance threshold t. The random selection (called a sample) is repeated until the number of selections reaches a preset maximum iteration count or an adaptively determined number of samples. After the trials, the largest consensus set is selected to estimate the transformation parameters. As can be seen, RANSAC performance is related to the selection of points and to the number of samples. In our work, for the data set selection at each iteration, we use a bucket-based selection to choose points that are well distributed over the image [Lacey00]. Also, the chosen sample data are normalized before the estimation in order to increase their stability [Hartley00] [Zhang95].
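The RANSAC loop of Section 3.4 can be sketched as follows for the affine case, reusing the illustrative fit_affine helper above; the threshold, iteration count, and sampling scheme (uniform rather than bucket-based) are simplifying assumptions.

import numpy as np

def ransac_affine(src, dst, n_iters=500, t=2.0, seed=0):
    # Sample 3 correspondences, fit Equation 5 exactly, and count the points
    # whose transfer error is below the distance threshold t (in pixels).
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(n_iters):
        idx = rng.choice(len(src), size=3, replace=False)
        a, b, tx, c, d, ty = fit_affine(src[idx], dst[idx])
        M = np.array([[a, b], [c, d]])
        pred = src @ M.T + np.array([tx, ty])
        inliers = np.linalg.norm(pred - dst, axis=1) < t
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Re-estimate the parameters from the largest consensus set.
    return fit_affine(src[best_inliers], dst[best_inliers]), best_inliers

Replacing the uniform sampling by a bucket-based draw, and normalizing the sampled points before the fit, gives the variant used in this work.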
3.5 Complexity

The complexity of the 2D parametric motion estimation is computed from its subtasks. Given an image having x rows and y columns, generating the pyramid images takes O(k), where k denotes the number of pyramid levels, usually 3. For global matching in the frequency domain, the fast Fourier transform is used; this task takes O(log(n/k)), where n denotes the number of data points, which is xy. Extracting features requires detecting Harris corners; the Harris corner detector takes O(log(n/k)). All other tasks take constant complexity. Therefore, the complexity of the motion estimation is O(log(n/k)) = O(log(xy/k)).

3.6 2D motion estimation in 2D mosaics

This section demonstrates the proficiency of the single motion estimation through 2D mosaic results. The estimation of a single 2D motion has been exploited in many 2D mosaicking techniques. A mosaicking technique recovers the camera motions and creates a static panoramic view by stitching video frames based on the recovered motions. My approach performs a pair-wise registration. Pair-wise image registration estimates the transformation parameters for each pair of consecutive frames. The transformation between each frame and any arbitrary frame is computed by concatenating the pair-wise transformations. To create a panoramic view, each frame is warped onto one coordinate space referred to as the reference frame coordinate. The results of 2D mosaicking are shown in the context of applications. Each application exemplifies different issues, and we show that our fundamental approach adapts to each.

3.6.1 Mosaic for a set of still images

BioSight. The goal of BioSight is to develop a virtual microscope environment on a PC. This system simulates a real microscope to provide the same facilities to the user via a graphical user interface. With a real microscope, users observe specimens on top of the stage by panning the lens and changing the magnification. To simulate this, we generate a large specimen image by stitching a set of images captured at the maximum microscope magnification for panning, and we generate different image resolutions according to the user's magnification input. Focusing on the stitching task, the captured images have a couple of characteristics. Firstly, each image is larger (1540x1140 resolution) than typical video surveillance imagery, so efficient computation is required. Secondly, the overlapping area between corresponding images is small; the average overlap is around 25%. Therefore, a good initial estimate of the translation parameters is required to avoid massive computation. To solve this, we used deeper levels of the hierarchy and matching on four corner blocks.

Theater-Loc. My 2D mosaicking method was also used in the Theater-Loc project to create panoramic views from sets of images to aid movie directors. Before movie directors shoot a film in unfamiliar places, they visit the places or refer to pictures of the places taken in advance. To give a full view of the location, manually stitched pictures are used. But manual stitching cannot provide the transformation of the structure between the pictures. Part of the TheaterLoc project is to create a panoramic view and build a database for this purpose. The issues involved here are to recover the parameters representing the overall transformation from less than ten percent of overlap and to create a panoramic picture. With such overlaps, it is not possible to estimate sound features and correspondences automatically. Therefore, we extended our system to interactively accept rough feature locations and correspondences given by hand, on top of the same architecture and process described previously. Some of the results are shown in
Figure 9. The comparison between the manually stitched pictures and the interactively generated mosaics is demonstrated.

Figure 8 A mosaic application based on single motion estimation from a set of still microscope images: (a) captured microscopic pattern-like images with large displacements; (b) onion root-tip specimen, three input frames out of 56 frames; (c) onion root-tip specimen, static mosaic image

Figure 9 A mosaic application based on single motion estimation, results from a set of still images, and comparisons between (a) manually stitched mosaic images and (b) automatically generated mosaic images

3.6.2 Mosaic for a moving sequence

Video surveillance: Video surveillance is one major area that uses camera stabilization techniques. Video surveillance requires detection and analysis of targeted locations and objects. However, when the camera moves, it is hard to perform surveillance on target objects, as in video sequences taken by a camera mounted on a moving vehicle, by a hand-held camera, or by an unstable fixed camera. Here, we show that our method provides an effective solution for this kind of work. Figure 10 shows the stabilization of a video sequence over a pool. This video surveillance attempts to detect accidental drowning, but the fixed camera keeps shaking because it is placed outside and the wind makes it move. Therefore, the video sequence includes a moving background. Figure 10 (b) shows the stabilized mosaic images that solve this situation. Figure 11 shows the mosaic application for a moving camera.

Figure 10 Motion estimation for swimming pool surveillance: (a) two input frames from a 384-frame sequence taken by a fixed surveillance camera placed on top of an outside pole; (b) two stabilized mosaic images

Figure 11 Motion estimation for street surveillance: (a) three input frames from a 486-frame sequence taken by a hand-held camera; (b) dynamic mosaic showing the mosaic being built

Sports sequences: A sports sequence is usually taken by a pan-tilt-zoom camera whose motion can be perfectly stabilized by 2D image registration. The challenging issue in sports sequences is to robustly remove the mismatches caused by moving objects. To achieve this, we used the RANSAC method in the parameter estimation stage. Figure 12 shows one example sports sequence. In this example, notice that many of the players are moving. Also, the camera zooms and pans-tilts, and some frames of the video sequence include motion blur. Nevertheless, Figure 12 (b) demonstrates that the stabilization is robustly performed.

Figure 12 Motion estimation for a basketball sequence: (a) three frames from a video sequence of 99 frames; (b) two mosaics

3.6.3 Image integration

A mosaic is an image created by warping all frames into one unified coordinate frame called the reference frame.
The transformation between a frame and the reference frame, denoted M_{Ri}, is computed by concatenating the pair-wise transformations:

\[ M_{Ri} = M_{R,R+1} \, M_{R+1,R+2} \cdots M_{i-1,i} = \prod_{j=R}^{i-1} M_{j,j+1} \]

where M_{Ri} is the transformation matrix between the reference frame R and each frame i, and M_{ij} is the pair-wise transformation matrix between consecutive frames. During warping, more than one pixel from the video frames can map to the same location in the mosaic coordinate frame. In this section, we demonstrate and compare several ways to choose a pixel for each location.

3.6.4 Averages

Although the registration between images is geometrically correct, it is necessary to blend the input images to compensate for changes across frames. This is because human perception is very sensitive to brightness, and an illumination difference can give an impression of mis-registration. The temporal average takes the average pixel value of the overlapping pixels. This method blends all pixels and generates a smooth mosaic in the end. But if there are moving objects, it generates ghosting artifacts. The weighted temporal average is the same as the temporal average in that it uses all the pixel values, but each pixel has a different weight. In general, the weight is given by the distance between the image center and the pixel location. Some approaches use a significant-residual-motion measure to blend images [Irani95]. The mosaic image pixel is computed by

\[ I_m(u, v) = \frac{\sum_i w_i(x, y) \, I_i\big(M_{Ri}(x, y)\big)}{\sum_i w_i(x, y)} \]

where I_m(u, v) is the mosaic image pixel at (u, v), M_{Ri} is the transformation between the mosaic image and each overlapping frame i, and I_i(M_{Ri}(x, y)) is the pixel from the overlapping frame i. The weight w_i(x, y) is given by

\[ w_i(x, y) = \frac{1}{\mathrm{dist}\big((C_{ix}, C_{iy}), (x, y)\big)} \]

where (C_{ix}, C_{iy}) denotes the image center of frame i. The weight of a pixel in each input frame is calculated from the inverse distance to the image center, meaning that if a pixel is closer to the image center, it has a heavier weight. This is the same idea as a Voronoi tessellation; but whereas in a Voronoi tessellation only the one closest pixel is chosen, in this approach the other pixels also contribute, with different significance, to create a naturally blended mosaic image.

3.6.5 Pixel selection

[Peleg97] suggested choosing the less-warped pixel to create the mosaic, based on a Voronoi tessellation. The idea is to choose the pixel that is closer to the center of its frame, since it tends to be less warped. The Voronoi tessellation of the input frames is computed from the centers of the frames projected into the mosaic coordinate frame. This approach can give a relatively sharp output compared to averaging all the intensities or blending [Sawhney98], because it takes only one pixel intensity out of several. But in the presence of large illumination changes, we need blending. We use a weighted temporal average whose weight is the inverse distance from the frame center. The method is designed to combine blending and sharpness. Figure 13 shows the different results from the different image integration methods.

Figure 13 Image integration methods for creating a mosaic: (a) initial mosaic having noticeable intensity changes; (b) temporal average, blending and blurring the image at the same time; (c) Voronoi mosaic, less warping and no blending; (d) weighted temporal averaging, blending the image with less blurring
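A minimal sketch of the weighted temporal average described above, assuming the frames have already been warped into the mosaic coordinate frame together with validity masks; the array layout is an assumption made for illustration.

import numpy as np

def weighted_temporal_average(warped, masks, centers):
    # Blend frames already warped into the mosaic with inverse-distance weights:
    # pixels closer to a frame's (projected) center receive heavier weights.
    h, w = warped[0].shape
    vv, uu = np.mgrid[0:h, 0:w].astype(float)
    acc = np.zeros((h, w))
    wsum = np.zeros((h, w))
    for img, mask, (cu, cv) in zip(warped, masks, centers):
        dist = np.hypot(uu - cu, vv - cv)
        weight = mask / np.maximum(dist, 1.0)  # avoid division by zero at the center
        acc += weight * img
        wsum += weight
    return acc / np.maximum(wsum, 1e-12)       # uncovered pixels stay near zero

Setting each weight map to 1 inside the mask reproduces the plain temporal average, and taking only the frame with the largest weight at each pixel reproduces the Voronoi selection, which is why this scheme sits between the two.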
Chapter 4 2D parametric motion analysis based on tensor voting

The robustness of parameter estimation relies on discriminating inliers from outliers within the set of matching points. When multiple motions are present, the inlier detection for each motion group is harder for conventional global parameter estimation methods. Therefore, a robust method to extract inliers and group them together is desired. This chapter presents a method based on tensor voting to estimate multiple 2D affine motions and to eliminate outliers and select inliers properly. The tensor voting-based method simultaneously recovers the parameters of the corresponding affine motion models. The affine transformation parameters are computed directly from the covariance matrix of the inliers, without an additional parameter estimation step. The approach is based on the representation of the matching points in a decoupled joint image space and on the use of the metric associated with the affine transformation. The method defines a decoupled joint image space from the affine transformation constraint. Additionally, the method enforces the global parametric structure in the defined space during the voting stage, detects several independent affine motions, and directly estimates the parameters from each set of inliers. This method allows a simultaneous multiple 2D motion estimation, because tensor voting makes it possible to identify and extract locally salient motion groups by removing local outliers. Figure 14 shows the overview of the tensor voting-based approach. First of all, input correspondences are generated as explained in the previous chapter. Each estimated correspondence is normalized and encoded into a generic tensor form in the decoupled joint image spaces. To enforce the affine transformation property, the initial tensors may have a directional preference. Tensor voting is performed in two stages: the first stage is used to infer the preferred direction, and stronger support is then collected in the second stage. The saliency of a point is characterized by the plane normal saliency, and the saliency is used to discriminate inliers from outliers. The grouping is performed based on the normal direction, so that inlier points on the same plane are grouped together. For each group of inliers, the parameters are estimated.

Figure 14 Recovering multiple affine motions using tensor voting: the initial correspondences are encoded as tensors in the decoupled joint image spaces; second order sparse tensor voting is performed; inliers are detected and grouped; the spaces and affine groups are merged; and the parameters are recovered as affine parameters

4.1 Tensor voting

Motion estimation using parametric models is widely used for video processing tasks such as image mosaicking, video compression, and video surveillance. Among all 2D transformations, the affine motion model is commonly used for these applications due to its simplicity and the small inter-frame camera motion. Here, I present a robust and non-iterative feature-based method to estimate the affine parameters based on tensor voting. The tensor voting formalism provides a robust approach for extracting salient structures, such as curves and surfaces. The input data are encoded as second order symmetric tensors to capture first order differential geometry information and its singularities. Points can simply be represented by their coordinates.
A local description of a curve is given by the point coordinates and its associated tangent or normal. A local description of a surface patch is given by the point coordinates and its associated normal. Here, however, we do not know in advance what type of entity (point, curve, surface) a token may belong to. Each encoded input (called a token) propagates its information to the neighboring data by voting. In a first voting stage, tokens communicate their information with each other in a neighborhood and refine the information they carry. After this process, each token is a generic second order symmetric tensor, which encodes the confidence of this knowledge (given by the tensor size) and the curve and surface orientation information (given by the tensor orientations). In a second stage, these tensor tokens propagate their information in their neighborhood, leading to a dense tensor map, which encodes the feature saliency at every point in the domain. This dense tensor map is then decomposed. Surface, curve, and junction features are obtained by extracting local extrema of the corresponding saliency values along a direction. The final output is the aggregate of the outputs for each of the components. The extraction of salient structures is inferred from the canonical description of an arbitrary tensor by its eigensystem, which represents the local geometric properties of the data. In the 2D case, the salient structures are points and curves. Each encoded tensor is an ellipse (a 2D second order symmetric tensor). An ellipse is decomposed into its 2x2 eigensystem, where the eigenvectors e_1 and e_2 represent the ellipse orientation and the corresponding eigenvalues λ_1 and λ_2 represent the ellipse size. In compact form, the tensor S is written as

\[ S = \lambda_1 e_1 e_1^T + \lambda_2 e_2 e_2^T, \qquad \lambda_1 \geq \lambda_2 \geq 0 \]

If a token represents a curve element, e_1 represents the curve normal, while λ_1 = 1 and λ_2 = 0. If a token represents a point element, it has no distinctive orientation, with λ_1 = λ_2. In the 3D case, a very salient curve element is represented by a thin ellipsoid, whose major axis represents the estimated tangent direction and whose length reflects the saliency of the estimation. In a more compact form,

\[ S = \lambda_1 e_1 e_1^T + \lambda_2 e_2 e_2^T + \lambda_3 e_3 e_3^T \]

where λ_1 ≥ λ_2 ≥ λ_3 ≥ 0 are the eigenvalues, and e_1, e_2, e_3 are the eigenvectors corresponding to λ_1, λ_2, λ_3, respectively. The eigenvectors represent the principal directions of the ellipsoid and the eigenvalues encode its size and shape. Indeed, any arbitrary symmetric tensor can be decomposed as:

\[ S = (\lambda_1 - \lambda_2)\, e_1 e_1^T + \sum_{i=2}^{N-1} (\lambda_i - \lambda_{i+1}) \sum_{j=1}^{i} e_j e_j^T + \lambda_N \sum_{i=1}^{N} e_i e_i^T \]

where the λ_i denote the eigenvalues (sorted in decreasing order) and the e_i denote the corresponding eigenvectors. In any dimension higher than 3D, the first term of S characterizes the hyper-plane orientation (normal) and the associated saliency (λ_1 − λ_2). These local geometric properties are propagated within a domain of influence depending on the principal orientation (given by e_1) and on the associated saliency.

4.1.1 Representation in 3D

Each input token is encoded into a second order symmetric tensor. A second order symmetric tensor is used to capture first order differential geometry information and its singularities.
It captures both the orientation information and its confidence, or saliency. Such a tensor can be visualized as an ellipsoid in 3D. Intuitively, the shape of the tensor defines the type of information captured (point, curve, or surface element), and the associated size represents the saliency. To express a second order symmetric tensor S, we take the associated quadratic form and decompose it into its eigensystem, leading to a representation based on the eigenvalues λ_1, λ_2, λ_3 and the eigenvectors e_1, e_2, e_3. In compact form, S = λ_1 e_1 e_1^T + λ_2 e_2 e_2^T + λ_3 e_3 e_3^T, where λ_1 ≥ λ_2 ≥ λ_3 ≥ 0 are the eigenvalues and e_1, e_2, e_3 are the eigenvectors corresponding to λ_1, λ_2, λ_3, respectively. The eigenvectors represent the principal directions of the ellipsoid/ellipse and the eigenvalues encode the size and shape of the ellipsoid/ellipse.

4.1.2 Tensor decomposition

As a result of the voting procedure, we produce arbitrary second order symmetric tensors; therefore, we need to handle any generic tensor. The spectrum theorem [Ghosal96] states that any tensor can be expressed as a linear combination of three cases, i.e.,

\[ S = (\lambda_1 - \lambda_2)\, e_1 e_1^T + (\lambda_2 - \lambda_3)(e_1 e_1^T + e_2 e_2^T) + \lambda_3 (e_1 e_1^T + e_2 e_2^T + e_3 e_3^T) \]

where e_1 e_1^T describes a stick, (e_1 e_1^T + e_2 e_2^T) describes a plate, and (e_1 e_1^T + e_2 e_2^T + e_3 e_3^T) describes a ball. Figure 15 shows the decomposition of a general saliency tensor into these components.

Figure 15 Tensor decomposition in 3D into stick, plate, and ball tensors, excerpted from [Medioni00]

At each location, the estimate of each of the 3 types of information, and their associated saliency, is therefore captured as follows. The point saliency has no orientation, and its saliency is given by λ_3. The curve saliency has a tangent orientation given by e_3 and saliency λ_2 − λ_3. The surface saliency is characterized by the normal orientation (e_1) and saliency (λ_1 − λ_2). In 2D, there is no surface saliency, and the curve saliency is expressed by e_2 for the tangent orientation and by λ_1 − λ_2 for the saliency.

4.1.3 Tensor communication

We now turn to our communication and computation scheme, which allows a site to exchange information with its neighbors and infer new information. An efficient voting process propagates each datum to its neighborhood. Each token casts a vote at each site in its neighborhood. The size and shape of the neighborhood and the vote strength and orientation are encapsulated in predefined voting fields. In 2D there are two types of fields: one for curve elements and one for point elements.

Token refinement and dense extrapolation. In 3D, a point token is encoded as a 3D ball, a curve element as a 3D plate, and a surface element as a 3D stick. These initial tensors communicate with each other in order to derive the most preferred orientation information (or refine the initial orientation, if given) for each of the input tokens (token refinement), and to extrapolate the inferred information at every location in the domain for the purpose of coherent feature extraction (dense extrapolation). Note that in the dense extrapolation case, each token is first decomposed into its independent elements, and then it broadcasts this information.
In this case ball tensors do not vote, as they define isolated features, which do not need to propagate their information. While they may be implemented differently for efficiency, these two operations are equivalent to a voting process, and can be regarded as tensor convolutions with voting kernels.

Figure 16 The fundamental 2D stick field: the normal orientations, with intensity-coded strength (saliency)

Derivation of the 3D voting kernels. All voting kernels can be derived from the fundamental 2D stick kernel by rotation and integration; Figure 16 shows this 2D stick kernel. In [Marr79], we explain in mathematical terms that this voting kernel in fact encodes the proximity and smoothness constraints. Denote the fundamental 2D stick kernel by V_F. The 3D stick kernel is obtained by revolving the normal version of V_F by 90 degrees about the z-axis (denote it by V'_F). Then, we rotate V'_F about the x-axis, integrating the contributions by tensor addition during the rotation. To obtain the plate kernel, we rotate the 3D stick kernel obtained above about the z-axis, integrating the contributions by tensor addition. To obtain the ball kernel, we rotate the 3D stick kernel about the y-axis and z-axis, integrating the contributions by tensor addition. The fields are generated based on the scale factor σ (the size of the neighborhood). The vote orientation corresponds to the smooth local curve continuation from a voter to a recipient. The vote strength DF(d, ρ, σ) decays with the curvature ρ and the distance d between a voter and a recipient as follows:

\[ DF(d, \rho, \sigma) = e^{-\frac{|d|^2 + \rho^2}{\sigma^2}} \]

Figure 17 (a) shows how the voting field is generated. A tensor O having curve normal N casts a vote at its neighbor P. The tensor P receives a stick vote from O that smoothly interpolates the normal N along the circle fitted between O and P, with decay factor DF(d, ρ, σ). The whole neighborhood of O receives stick votes according to the stick field, as seen in Figure 17 (b), and sums them up using tensor addition.

Figure 17 Voting in 2D: (a) vote generation, where the vote received at P is the normal of the circular arc fitted between the voter O and the recipient P; (b) the stick field

4.1.4 Feature extraction

At the end of the voting process, we have produced a dense tensor map, which is then decomposed into three dense vector maps: the surface map, the curve map, and the junction map. Each voxel of these maps carries a 2-tuple (s, v), where s is a scalar indicating strength and v is a unit vector indicating direction. These maps are dense vector fields, which are then used as input to our extremal algorithms in order to generate features such as junctions, curves, and surfaces. The definition of point extremality, corresponding to junctions, is straightforward: it is a local maximum of the scalar value s. A point p is on an extremal surface if its strength s is locally extremal along the direction of the normal. A point p is on an extremal curve if any displacement from p in the plane normal to the tangent results in a lower s value. A detailed implementation can be found in [Medioni00]. As a result of the voting procedure, we produce second order symmetric tensors; therefore, we need to handle any generic tensor.
The spectrum theorem states that any such tensor can be expressed as a linear combination of these two cases, i.e.,

\[ S = (\lambda_1 - \lambda_2)\, e_1 e_1^T + \lambda_2 (e_1 e_1^T + e_2 e_2^T) \]

where e_1 e_1^T describes a stick and (e_1 e_1^T + e_2 e_2^T) describes a ball. At each location, the estimate of each of the two types of information, and their associated saliency, is captured as follows. The point saliency has no orientation, and its saliency is given by λ_2. The curve saliency has a tangent orientation given by e_2, and its saliency is given by λ_1 − λ_2. In the same manner, any arbitrary symmetric tensor can be decomposed as:

\[ S = (\lambda_1 - \lambda_2)\, e_1 e_1^T + \sum_{i=2}^{N-1} (\lambda_i - \lambda_{i+1}) \sum_{j=1}^{i} e_j e_j^T + \lambda_N \sum_{i=1}^{N} e_i e_i^T \]

where the λ_i denote the eigenvalues (sorted in decreasing order) and the e_i denote the corresponding eigenvectors. In any dimension higher than 3D, the first term of S characterizes the hyper-plane orientation (normal) and the associated saliency (λ_1 − λ_2). These local geometric properties are propagated within a domain of influence depending on the principal orientation (given by e_1) and on the associated saliency (given by λ_1 − λ_2).

4.2 Affine motion and tensor voting

Here, it is shown that a decoupled joint image space is a 3D space, and that the embedded structure that represents the affine constraint is a 2D plane. By inferring the most salient 2D planes from the input correspondences based on tensor voting, we remove the outliers and estimate the parameters directly from the inliers. This approach formulates the problem in a geometric space and minimizes a geometric distance in a non-iterative manner. It differs from classical techniques, such as the least-squares method, in that those attempt to minimize algebraic errors iteratively.

4.2.1 Affine model and joint image space

A 2D affine model is defined by six parameters. In this model, a correspondence between a feature point (x, y) from one image and the corresponding point (x', y') from the second image is given by the following equation:

\[ \begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix} \]

Equation 8 Affine transformation

which can be rewritten in the parametric space as:

\[ \begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} x & y & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & x & y & 1 \end{pmatrix} \begin{pmatrix} a & b & t_x & c & d & t_y \end{pmatrix}^T \]

Equation 9 Rewritten affine transformation in parametric space

A set of linear equations derived from corresponding points is usually solved by a least-squares method or one of its variations that minimize algebraic errors [Zhang97]. A joint image space is defined as the 4D space (x, y, x', y'), where (x, y) and (x', y') are two matched pixels. In this representation, each point is a combination of the 2D image vectors and is represented by a vector q defined below. The affine transformation can be written as:

\[ P \begin{pmatrix} q^T \\ 1 \end{pmatrix} = 0, \qquad \text{where } q = (x, y, x', y') \text{ and } P = \begin{pmatrix} a & b & -1 & 0 & t_x \\ c & d & 0 & -1 & t_y \end{pmatrix} \]

But in the affine model, the joint spaces (x, y, x') and (x, y, y') are independently constrained and can therefore be decoupled to reduce the dimension of the joint image space. In other words, in the affine model the two equations are independent, and therefore the joint space (x, y, x', y') can be decoupled into the two spaces (x, y, x') and (x, y, y') (see Figure 18).
The decoupled joint image space makes it possible to reduce the dimension of the joint image space and to enforce the affine constraint directly within the tensor voting formalism.

Figure 18 Illustration of decoupled joint image spaces: a correspondence defines a point in the joint image space, which is split into the two decoupled joint image spaces

Therefore, by defining p_x = (a, b, -1, t_x)^T and p_y = (c, d, -1, t_y)^T, we have two separate joint spaces, q_x = (x, y, x', 1)^T and q_y = (x, y, y', 1)^T. We obtain the following equations in the decoupled joint image spaces:

\[ q_x^T C_x q_x = 0, \qquad q_y^T C_y q_y = 0 \]

where C_x and C_y are defined by:
Consequently, grouping the points that belong to the same plane (same motion) naturally leads to region segmentation and to the estimation of the parameters associated with it. In reality, the dense pixel correspondences contain many mismatches. Therefore, we need a robust way to infer salient plane features within the noisy input correspondences. We use tensor voting to achieve this selection.

4.2.2 Exploitation for affine motion

The tensor voting framework is a general method for extracting structure. In our case, we know the structural shape of interest, which is a plane, and we want to focus on extracting planes in a decoupled joint image space. Therefore, in the following sections we modify the tensor voting framework for plane extraction.

4.2.3 Initial tensor

When we encode the initial correspondences into tensors, we do not have any prior knowledge about the normal directions of the planes or their displacement from the origin of the decoupled joint image space. Therefore, we would encode each initial input as a ball tensor (an ellipsoid with equal eigenvalues for each eigenvector), which propagates point information in all directions in any dimension. However, voting with ball tensors is computationally expensive compared to starting with stick tensors, so in the case of dense correspondences it is desirable to avoid encoding points as ball tensors. To avoid this, we encode the initial points as stick tensors whose normal direction is defined based on the following assumptions:

a = 1, b = 0, T_x = x' - x
c = 0, d = 1, T_y = y' - y

where a, b, c, d, T_x and T_y are the six affine parameters. This assumption means that the initial affine motion is considered to be the identity, which is reasonable because the affine motion is relatively small between two consecutive video frames. Based on this assumption, each point is considered to lie on the planes having the normal directions

(a, b, -1) = (1, 0, -1)
(c, d, -1) = (0, 1, -1)

T_x and T_y are implicitly encoded in the defined space. Here, we would like to point out one fact. In a decoupled joint image space, when the difference between two motions is very small (around 1 or 2 pixels), the discontinuity between the planes is not perceptible; consequently, it is not easy to detect the different layers. To handle this, we consider a scaled vector space that converts (x, y, x', 1) and (x, y, y', 1) into (x, y, s(x - x'), 1) and (x, y, s(y - y'), 1), and we show that this conversion preserves the affine property in the new scaled spaces. The new spaces (x, y, s(x - x'), 1) and (x, y, s(y - y'), 1) are defined from the original matching pixels (x, y) and (x', y') using a constant scale parameter s. With this new space definition, the two original affine equations become:

s(x' - x) = s((a - 1) x + b y + T_x)
s(y' - y) = s(c x + (d - 1) y + T_y)

This is a linear conversion of the original space, and the planar constraint still holds in the new spaces. Therefore, the affine regions we are looking for are still planes in the new spaces. However, the inverse conversion must be applied in order to estimate the correct affine parameters appearing in the images.
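The scaled-space conversion is straightforward to implement; the sketch below (the scale value s = 10 and the function name are illustrative assumptions) maps matches (x, y) <-> (x', y') into the two scaled decoupled spaces, where the third coordinate amplifies small motion differences:

    import numpy as np

    def to_scaled_spaces(x, y, xp, yp, s=10.0):
        """Map matches into (x, y, s(x-x')) and (x, y, s(y-y'))."""
        qx = np.column_stack([x, y, s * (x - xp)])
        qy = np.column_stack([x, y, s * (y - yp)])
        return qx, qy

    # Two translations that differ by a single pixel in x end up
    # s pixels apart along the third axis, so their planes separate.
    x = np.array([10.0, 20.0]); y = np.array([5.0, 5.0])
    qx, _ = to_scaled_spaces(x, y, x + np.array([3.0, 4.0]), y)
    print(qx[:, 2])   # [-30. -40.]: a 1-pixel motion gap scaled to 10 pixels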
Following the scaled space definition, the initial normal direction is given by:

(s(a - 1), s b, -s) \propto (a - 1, b, -1) = (0, 0, -1)
(s c, s(d - 1), -s) \propto (c, d - 1, -1) = (0, 0, -1)

We can observe that the initial normal direction is parallel to the last coordinate axis of the space. Once we have all the initial tensors, we perform voting to find the exact plane parameters. In a decoupled joint image space, which is an embedded 3D space, each point is represented as a 3D ellipsoid, and the 3D stick voting field is generated in a manner similar to the one shown in Figure 17 (b). Here, we enforce an additional property in the voting field in terms of shape and saliency propagation. The selected field propagates the planar geometry we are looking for in the processed data. The field considered is depicted in Figure 19, where we also illustrate how it differs from the circle-based propagation mechanism classically used in tensor voting implementations.

Figure 19 Plane-fitted voting field

Figure 19 illustrates our voting procedure given a token O and a normal direction N at that point. A point P collects the vote cast by the point O with normal direction N, based on the relative orientation of the plane fitted to points P and O, and on the distance separating these points. This change enforces the affine metric during the voting stage.

4.2.4 2D contour consideration

Our approach utilizes boundary information during the first voting stage. The contour information is used to inhibit voting between different color regions, i.e., across boundaries. The concept is depicted in Figure 20. In an image, discontinuities appear along image boundaries; therefore, we use the boundary information to maximize the voting interaction within the same smooth regions. This is performed only for the first voting stage, because a motion discontinuity might not coincide with a color discontinuity.

Figure 20 Voting within boundary

4.3 Outlier Removal

There are two steps in the voting process. During the first voting, each point collects the votes cast by its neighbors and infers a principal direction defining a plane. The normal orientation of the 2D plane is defined by the eigenvector e_1 associated with the largest eigenvalue of the decomposed tensor. The saliency of the extracted plane is given by (\lambda_1 - \lambda_2) and characterizes the support of the neighboring pixels for the characterized plane orientation. Therefore, isolated random noise has a small saliency, due to the little support it receives from its neighboring correspondences. During the second voting step, voting is performed with a planar voting field, a thinner voting field defined by the normal direction derived during the first step. At this step, only highly salient points (defined by a threshold on the saliency values from the first voting) participate in the voting and cast their tensor properties to neighboring pixels. The outcome of the process depends on the sigma value used to generate the voting field. The sigma value is chosen adaptively: if the correspondences are dense, the voting field is relatively small, and if the correspondences are irregular and sparse, the voting field is relatively large. In general, the sigma value is set so that the field covers at least an 8x8 or 16x16 image block.
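The saliency test at the core of this outlier-removal step reduces to an eigendecomposition of the tensor accumulated at each point. A minimal sketch (the tensor values and the threshold are illustrative only):

    import numpy as np

    def plane_saliency(T):
        """Return (stick saliency, plane normal) of an accumulated 3D tensor.

        Eigenvalues are sorted in decreasing order; the stick saliency is
        lambda_1 - lambda_2 and the normal is the leading eigenvector.
        """
        w, v = np.linalg.eigh(T)            # ascending eigenvalues
        lam, e = w[::-1], v[:, ::-1]
        return lam[0] - lam[1], e[:, 0]

    # A tensor dominated by one orientation: high saliency -> kept as inlier.
    n = np.array([0.0, 0.0, 1.0])
    T = 5.0 * np.outer(n, n) + 0.2 * np.eye(3)
    sal, normal = plane_saliency(T)
    print(sal > 1.0, normal)    # True: the point supports a salient plane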
4.4 Layer segmentation

The layer segmentation is performed by clustering points into regions, starting from the most salient points characterized by the second voting step described previously. This is a seed-based region growing method. However, our approach does not require specifying the number of regions in advance, and the seeds are not selected randomly: the seeds are determined by the saliency inferred from the voting process, and their saliencies serve as likelihood measures. The coherence is based on plane smoothness in the decoupled joint image space and is measured by the angular difference of the normal directions (reflected by the coefficients a, b, c, d) and the distance from the space's origin (reflected by the parameters t_x and t_y). At each point, the normal orientation of the plane is encoded by the eigenvector e_1 associated with the largest eigenvalue. The saliency of the extracted plane is given by (\lambda_1 - \lambda_2) and characterizes the support of the neighbors for the characterized plane. Given a seed point p_s and its plane description, we compare and cluster the neighboring points p_i according to their likelihood. This likelihood measure is provided by the distance functions:

d_s = e_{11} x_s + e_{12} y_s + e_{13} x'_s
d_i = e_{11} x_i + e_{12} y_i + e_{13} x'_i

where (x_s, y_s, x'_s) and (x_i, y_i, x'_i) are the locations of the seed point p_s and of the neighboring point p_i, and each vector (e_{11}, e_{12}, e_{13}) represents the normal direction of the plane at that location. The likelihood measure is provided by the similarity ||d_s - d_i||^2. This function approximates the Euclidean distance between two parameterized motions. If the motion difference is smaller than 1 pixel, the two points are clustered together. After one region is clustered, the clustering procedure iterates using the next most salient points.

This clustering step is performed in each decoupled joint image space. In order to fully estimate the affine parameters and extract the image regions, the clustered sets from the two joint image spaces must be merged.

Figure 21 Merging clusters from two decoupled joint image spaces

4.5 Layer merging

The layer clustering described in the previous section produces over-segmented regions. This is due to the fact that we enforce a very strict threshold despite the imperfect normal directions inferred by the voting method. We therefore merge similar motion layers based on the estimated affine parameters. The similarity between cluster motions is measured by the symmetric transfer error shown in Equation 10. If the overall error caused by the two affine motions is less than 1 pixel (sub-pixel accuracy), the two regions are merged. Notice that this merging threshold is the same as the threshold used for the plane clustering process in the voting space. Nevertheless, some layers can now be combined, because the affine parameters of each region are estimated in a least-squares manner, which compensates for the strict differences observed in the previous steps. The merging step is iterative and repeats until no more merging takes place.

error = \frac{1}{N} \sum_{i=1}^{N} \left( d(x, A^{-1}(x')) + d(x', A(x)) \right)

Equation 10 Symmetric transfer error
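Equation 10 translates directly into code; a minimal sketch (function and variable names are mine) for a 3x3 matrix A encoding the affine motion on homogeneous 2D points:

    import numpy as np

    def symmetric_transfer_error(X, Xp, A):
        """Equation 10: X, Xp are (N, 2) matched points, A maps X toward Xp."""
        Xh = np.column_stack([X, np.ones(len(X))])
        Xph = np.column_stack([Xp, np.ones(len(Xp))])
        fwd = (Xh @ A.T)[:, :2]                    # A(x)
        bwd = (Xph @ np.linalg.inv(A).T)[:, :2]    # A^{-1}(x')
        d = np.linalg.norm(Xp - fwd, axis=1) + np.linalg.norm(X - bwd, axis=1)
        return d.mean()

    A = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, -1.0], [0.0, 0.0, 1.0]])
    X = np.array([[0.0, 0.0], [10.0, 5.0]])
    Xp = (np.column_stack([X, np.ones(2)]) @ A.T)[:, :2]   # perfect matches
    print(symmetric_transfer_error(X, Xp, A))              # 0.0: regions merge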
4.6 Parameter recovery

For each clustered layer, the affine parameters are estimated by analyzing the correlation matrix of the clustered points in each space. The parameters of the affine transform are characterized by the eigenvector associated with the smallest eigenvalue of the correlation matrix. Consider the two joint-image spaces represented by q_{xi} = (x_i, y_i, x'_i, 1) for estimating the parameters a, b, t_x, and q_{yi} = (x_i, y_i, y'_i, 1) for c, d, t_y, where i is the index of each inlier and n is the number of inliers in the cluster. We then compose the stacked matrices of inliers m_x and m_y:

m_x = \begin{pmatrix} q_{x1}^T \\ q_{x2}^T \\ \vdots \\ q_{xn}^T \end{pmatrix}, \quad m_y = \begin{pmatrix} q_{y1}^T \\ q_{y2}^T \\ \vdots \\ q_{yn}^T \end{pmatrix}

Let M_x = m_x^T m_x and M_y = m_y^T m_y; then M_x p_x = 0 and M_y p_y = 0:

M_x p_x = \begin{pmatrix} \sum x^2 & \sum xy & \sum x x' & \sum x \\ \sum xy & \sum y^2 & \sum y x' & \sum y \\ \sum x x' & \sum y x' & \sum x'^2 & \sum x' \\ \sum x & \sum y & \sum x' & n \end{pmatrix} \begin{pmatrix} a \\ b \\ -1 \\ t_x \end{pmatrix} = 0

M_y p_y = \begin{pmatrix} \sum x^2 & \sum xy & \sum x y' & \sum x \\ \sum xy & \sum y^2 & \sum y y' & \sum y \\ \sum x y' & \sum y y' & \sum y'^2 & \sum y' \\ \sum x & \sum y & \sum y' & n \end{pmatrix} \begin{pmatrix} c \\ d \\ -1 \\ t_y \end{pmatrix} = 0

Equation 11 Correlation matrices from a group of inliers

where p_x and p_y are the parameter representations in the corresponding joint-image subspaces. The parameters of the affine transform are therefore characterized by the eigenvector associated with the smallest eigenvalue. Conceptually, this method acts like conventional parameter estimation. However, we showed that encoding the metric of the joint-image spaces in the tensor voting formalism allows us to perform outlier removal and multiple affine motion estimation directly.
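A sketch of this recovery step for one decoupled space (names are mine): stack the inlier vectors q_x = (x, y, x', 1), form M_x = m_x^T m_x, take the eigenvector of the smallest eigenvalue, and rescale it so that its third component equals -1:

    import numpy as np

    def recover_affine_half(x, y, xp):
        """Estimate (a, b, tx) from the clustered inliers of one space."""
        m = np.column_stack([x, y, xp, np.ones_like(x)])   # stacked q_x vectors
        M = m.T @ m                                        # correlation matrix
        w, v = np.linalg.eigh(M)
        p = v[:, 0]               # eigenvector of the smallest eigenvalue
        p = p / -p[2]             # rescale so the third component is -1
        return p[0], p[1], p[3]   # a, b, tx

    x = np.array([0.0, 100.0, 0.0, 100.0, 50.0])
    y = np.array([0.0, 0.0, 100.0, 100.0, 50.0])
    a, b, tx = 0.98, 0.05, 4.0
    print(recover_affine_half(x, y, a * x + b * y + tx))   # ~(0.98, 0.05, 4.0)

The second half (c, d, t_y) is recovered identically from the vectors q_y = (x, y, y', 1).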
4.7 Multiple affine motion groups from sparse data

Affine parameter estimation has been studied by a large number of authors; the presentation of a new approach therefore has to address challenging situations and naturally compare itself to the state of the art. We chose to compare our method to the results obtained by RANSAC. However, for the first example we chose a synthetic case that cannot be processed by a RANSAC technique: multiple affine motions. We generated a set of correspondences from two different motions (black and red lines) and added random noise (green lines) with a similar motion amplitude, as illustrated in Figure 22. Figure 23 shows the correspondences in a decoupled joint image space. Since the last coordinate of a decoupled joint image space is always 1, we drop it and show the correspondences in 3D. The ratio of correct to noisy (i.e., random) correspondences is 0.5. The blue points are those selected as inliers by the system. Grouping these inliers into different motion groups is done as described later.

Figure 22 Synthetically generated correspondences

Figure 23 Synthetically generated correspondences and two views of the correspondences from Figure 22 in a decoupled joint image space

Table 1 shows the comparison between the real affine parameter values and the parameters estimated by the described method.

Affine parameters for the black lines
            a      b      c      d      t_x    t_y
Real        0.99   0.017  0.017  0.99   20.00  2.00
Estimated   1.00   0.01   0.017  1.00   21.00  1.15

Affine parameters for the red lines
            a      b      c      d      t_x    t_y
Real        1.00   0.00   0.00   1.00   10.00  10.00
Estimated   1.00   0.00   0.00   1.00   9.98   10.10

Table 1 Comparison of the estimated parameters to the given real parameters

In Figure 26 and Figure 29, we show pairs of frames extracted from two video sequences with moving objects in the scene. The purpose here is to compare the described method with the RANSAC algorithm. We start by selecting feature points using a Harris corner detector with a low threshold, allowing us to consider both strong and weak corners. In Figure 27 and Figure 30, we show the correlation-based initial matching. In Figure 28 and Figure 31, we show the image difference after compensating for the motion estimated with the described method. Here again we compare the results with those obtained with the RANSAC method. One can clearly see the better image compensation resulting from the described technique; in particular, in both Figure 31 and Figure 33, the background is more accurately registered. In Figure 32, the mosaic result based on our method is shown. In Figure 33, a closer comparison between RANSAC and our method is shown: clearly, the background is more accurately registered by our method.

Figure 24 Synthetic affine motions: (a) rotation and translation motion, (b) conflicting motion, (c) transparent background motion with translation, (d) random motions

Figure 25 Three detected dominant affine motions

Figure 26 Inputs (walking scenes)

Figure 27 Correlation-based initial correspondences

Figure 28 Image difference after motion compensation: (a) residual error by RANSAC, (b) residual error by our method

Figure 29 Inputs (basketball scenes)

Figure 30 Correlation-based initial correspondences

Figure 31 Image pixel differences after motion compensation: (a) residual error by RANSAC, (b) residual error by our method

Figure 32 Application to 2D mosaic

Figure 33 Image pixel differences after motion compensation: (a) four frames from the input sequence, (b) residual error by RANSAC, (c) residual error by our method

4.8 Motion layer and color homogeneity

Figure 34 presents a segmentation result. Our method successfully extracts multiple affine motion layers while rejecting outliers. In general, the outliers are conspicuous around motion boundaries and in homogeneous color regions. From the two input images in Figure 34 (a), one of which is the reference, the initial pixel correspondences are computed using an optical flow method. Figure 34 (b) shows the initial correspondences in a decoupled joint image space (x, y, x'), where we can observe several affine planes with many outliers located around the tree boundary. Figure 34 (c) shows the affine motion layers extracted based on tensor voting. Each layer is color-coded, and multiple 2D affine motions are successfully extracted.

Figure 34 Motion-based layer segmentation result: (a) two input frames, (b) initial correspondences in (x, y, x'), (c) initial motion layers

After layer clustering, the un-clustered points have to be processed. These points correspond to outliers or to regions where the motion could not be estimated (uniform regions). These outliers create many holes in the image.
Filling these holes using the estimated local affine motion does not provide satisfactory results. Indeed, robust motion estimates cannot be inferred for holes near motion boundaries, which prevents us from clustering these regions based on the motion estimates. To address this problem, we make the following assumption: image pixels in a uniform color region belong to the same motion layer. This assumption is used to refine the extracted motion layers, and it provides a complete labeling of the image pixels. Our approach proceeds as follows: we compute a color-based segmentation of the reference image and propagate that segmentation to the next image using the computed motion layers and their corresponding affine models. We subsequently measure the Sum of Squared Differences (SSD) in order to characterize the residual errors generated by mapping the color properties of the pixels to their estimated locations in the second image. For each pixel that was not labeled by the motion layer segmentation, we consider all adjacent layers and estimate the corresponding SSD residual for the selected pixel. The pixel is then associated with the layer minimizing the SSD residual.
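A sketch of this assignment rule (the nearest-pixel warping and all names are my assumptions, for illustration): for an unlabeled color region, evaluate the SSD residual under each adjacent layer's affine motion and pick the minimizer:

    import numpy as np

    def ssd_residual(img1, img2, region_xy, affine):
        """SSD of region pixels of img1 mapped into img2 by one affine motion.

        region_xy: (N, 2) integer pixel coordinates (x, y) in the reference.
        affine: the six parameters (a, b, tx, c, d, ty).
        """
        a, b, tx, c, d, ty = affine
        x, y = region_xy[:, 0], region_xy[:, 1]
        xp = np.clip(np.rint(a * x + b * y + tx), 0, img2.shape[1] - 1).astype(int)
        yp = np.clip(np.rint(c * x + d * y + ty), 0, img2.shape[0] - 1).astype(int)
        diff = img1[y, x].astype(float) - img2[yp, xp].astype(float)
        return float(np.sum(diff ** 2))

    def assign_region(img1, img2, region_xy, adjacent_layers):
        """Label the region with the adjacent layer minimizing the SSD residual."""
        errs = [ssd_residual(img1, img2, region_xy, aff) for aff in adjacent_layers]
        return int(np.argmin(errs))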
Figure 35 (a) shows the over-segmented regions of the reference image. The color segmentation used in my approach is a simple threshold-based segmentation. Each region is fitted to one of the recovered affine motions, as seen in Figure 35 (b), which shows the refined motion regions and the corresponding relative motion magnitude information (the brighter areas correspond to larger motions). In Figure 36, the extracted region boundaries are shown in order to display the accuracy of the layers. We show the layer segmentation results for the mobile&calendar sequence in Figure 37. The segmentation results are produced using the same threshold values for all images: a sigma value for voting and a less-than-one-pixel constraint for layer merging. Figure 37 (c) shows that our segmentation method extracted several dominant motions (related to the ball, train, calendar and background), and their boundaries are quite accurate. We show the accuracy of our approach by superimposing the detected motion layer boundaries on the reference image. As seen in Figure 36 (a), the detected motion boundaries are very close to the actual region boundaries. The only ambiguous parts are the occluded regions, which cannot be recovered by computing the SSD between two images.

4.9 Complexity

The algorithmic complexity of the described approach is analyzed by subtask:

• Initial matching: multi-resolution Lucas-Kanade optical flow estimation.
• Color segmentation: simple threshold-based segmentation.
• Tensor voting in the decoupled joint image spaces: O(nM), where n is the number of pixels and M is the average number of pixels around each pixel. M depends on the sigma value; we use an 8x8 or 16x16 window size.
• Layer refinement: O(cn) in the worst case, where c is the number of color regions and n is the number of motion regions. The number of color regions depends on the image; in the garden sequence, there are more than 500 regions. The number of motion regions is fewer than 15 on average.

Currently, our method does not provide fast performance. However, we have several initiatives, such as coarse-to-fine estimation, to improve its speed. This extension is possible because my approach is parametric and can provide motion estimates while increasing the resolution from coarse to fine.

So far, we have described how to combine motion and color-region constraints to derive a layer-based segmentation of the image. Motion layers are first derived from noisy input correspondences using the representation of the correspondences in decoupled joint image spaces and tensor voting. My method segments multiple affine motion layers simultaneously. The extracted layers are then refined using the color homogeneity constraint: each segmented homogeneous color region is fitted to the extracted motion layer that minimizes the SSD residual.

Figure 35 Refined motion layers: (a) the reference image (left) and its over-segmented color regions (right), (b) refined motion regions and corresponding motion magnitudes

Figure 36 Motion boundaries for each affine motion layer

Figure 37 Extracted motion layers for the mobile&calendar sequence: (a) two input frames, (b) initial motion layers and over-segmented color regions, (c) refined motion regions and relative motion magnitude indication

4.10 Affine to homography

The previous sections showed that the tensor-voting-based parameter estimation recovers multiple affine motion parameters simultaneously. As mentioned several times, my approach targets both affine and homography estimation. In this section, a method is described to extract 2D homography transformations from multiple local affine motion patches. When a scene contains several planes exposed at various angles, or a plane extends in the z direction, the affine motion extraction method clusters a single plane into several partial planes. This is due to the significant depth or disparity differences that a world plane can exhibit in a scene. In such cases, affine motion is not sufficient to approximate the scene motions. Therefore, we extend the local affine motion estimation method to 2D homography (hereafter simply homography) motion estimation. First, I explain the relationship between scene planes and homographies, and then the parameter estimation method is explained.

4.10.1 Homography and scene planes

The images of points on a plane are related to the corresponding image points in a second view by a planar homography, as shown in Figure 38. This is a homography relation since it depends only on the intersection of planes with lines: through the homography, a point in one view determines a point in the other view, which is the image of the intersection of the viewing ray with the plane.

Figure 38 Planar homography transformation

The homography transformation is defined as:

\begin{pmatrix} \tilde{x}_1 \\ \tilde{x}_2 \\ \tilde{x}_3 \end{pmatrix} = \begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}, \quad x' = \frac{\tilde{x}_1}{\tilde{x}_3}, \quad y' = \frac{\tilde{x}_2}{\tilde{x}_3}

Equation 12 Planar homography transformation

In the joint image space, this transformation becomes:
a pair of quadratic constraints on q = (x, y, x', y', 1)^T,

q^T C_1 q = h_{11} x + h_{12} y + h_{13} - x'(h_{31} x + h_{32} y + h_{33}) = 0
q^T C_2 q = h_{21} x + h_{22} y + h_{23} - y'(h_{31} x + h_{32} y + h_{33}) = 0

Equation 13 Planar homography transformation in the joint image space

In [Anandan00], the epipolar geometry constraint between two images is shown to form a point cone in 4D. The homography transformation, which is the epipolar constraint restricted to a planar object, therefore forms part of this 4D cone, possibly a surface on the cone. A detailed proof describing the geometric shape of the homography transformation in 4D remains to be established. In this case, as shown in [Tong], it is not trivial to discriminate different motion groups simultaneously, unless another parametric estimation approach, such as RANSAC, is used iteratively. Alternatively, as in the affine case, the decoupled joint image spaces can be considered to analyze the transformation. In each decoupled joint image space, the transformation becomes:

h_{11} x + h_{12} y + h_{13} - x'(h_{31} x + h_{32} y + h_{33}) = 0 \quad \text{in } (x, y, x', 1)
h_{21} x + h_{22} y + h_{23} - y'(h_{31} x + h_{32} y + h_{33}) = 0 \quad \text{in } (x, y, y', 1)

Equation 14 Planar homography transformation in the decoupled joint image spaces

Figure 39 Plots of homography transformations in a decoupled joint image space (x, y, x') under parameter changes

As seen in Figure 39, the structure of interest in the decoupled joint image space is a twisted surface for each homography transformation. Since the homography parameters are intertwined in both joint spaces, due to the non-linearity of the equation, it is not trivial to separate and combine salient structures from the two decoupled joint image spaces. In particular, when several homography transformations appear in the scene, the interference among the different surface structures is very difficult to resolve. Instead, I extend the local affine patches to estimate the several homographies by analyzing the epipolar geometry, because the affine structure is well defined in the joint image space and, owing to the linearity of its transformation equation, provides a strong basis for removing outliers.
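This non-planarity is easy to confirm numerically. The following sketch (the H values are arbitrary assumptions) embeds correspondences generated by a homography with nonzero h31, h32 into (x, y, x'); unlike the affine case, the smallest singular value of the centered point cloud is clearly nonzero:

    import numpy as np

    H = np.array([[1.02, 0.01, 3.0],
                  [0.00, 0.99, -1.0],
                  [1e-3, 2e-3, 1.0]])    # arbitrary homography, h31, h32 != 0

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 320, 300)
    y = rng.uniform(0, 240, 300)
    w = H[2, 0] * x + H[2, 1] * y + H[2, 2]
    xp = (H[0, 0] * x + H[0, 1] * y + H[0, 2]) / w   # x' after division by w

    pts = np.column_stack([x, y, xp])
    _, s, _ = np.linalg.svd(pts - pts.mean(axis=0))
    print(s[2])   # clearly nonzero: a curved (twisted) surface, not a plane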
4.10.2 Affine to homography

In this dissertation, we group affine patches belonging to the same homography by using plane-induced homographies, given the fundamental matrix F and image correspondences. When the two images are rectified based on F, each world planar object forms a plane in the 2.5D disparity space. As the first step in computing the homographies, the fundamental matrix F is computed from the locally extracted affine patches. For a static stereo scene, the points of all extracted affine patches can be used to estimate F. For non-static scenes, the affine patches are grouped while estimating multiple F matrices using RANSAC. As in [Luong93], using affine patches that form partial planes leads to a stable estimation of the F matrix.

For each affine patch, we can compute a normal direction. When a space is defined by (x, y, f(x, y)), the normal direction at a point of this space is given by the vector N in Equation 16. The normal direction of an affine patch in (x, y, f(x, y)), where f(x, y) is the motion magnitude, is therefore defined as follows:

f(x, y) = (x - x')^2 + (y - y')^2
x' = a x + b y + T_x
y' = c x + d y + T_y

Equation 15 Definition of the (x, y, f(x, y)) space

Figure 40 Normal vector to a surface (excerpted from MathWorld)

N = \begin{pmatrix} f_x(x, y) \\ f_y(x, y) \\ -1 \end{pmatrix}

f_x(x, y) = \frac{\partial f}{\partial x} = 2(x - ax - by - T_x)(1 - a) - 2(y - cx - dy - T_y) c
f_y(x, y) = \frac{\partial f}{\partial y} = -2(x - ax - by - T_x) b + 2(y - cx - dy - T_y)(1 - d)

Equation 16 Normal vector computation for an affine patch

Mathematically, the normal directions of the affine patches alone do not provide enough support to extract a homography. However, when the images are rectified with respect to the camera motion (expressed in the F matrix), the normal directions of the rectified affine patches in the disparity space lead to a homography group.
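Equation 16 translates directly into code; the following sketch (names are mine) returns the un-normalized normal of an affine patch at a point (x, y):

    import numpy as np

    def patch_normal(x, y, a, b, c, d, Tx, Ty):
        """Normal (fx, fy, -1) of the surface (x, y, f(x, y)) of Equations 15-16."""
        u = x - (a * x + b * y + Tx)    # x - x'
        v = y - (c * x + d * y + Ty)    # y - y'
        fx = 2 * u * (1 - a) - 2 * v * c
        fy = -2 * u * b + 2 * v * (1 - d)
        return np.array([fx, fy, -1.0])

    # For a pure translation (a = d = 1, b = c = 0), fx = fy = 0 and the
    # normal is (0, 0, -1): all points of the patch share one orientation.
    print(patch_normal(50.0, 80.0, 1.0, 0.0, 0.0, 1.0, 5.0, -3.0))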
4.10.3 Image rectification from known matches

Image rectification is the process of re-sampling pairs of consecutive (or stereo) images taken from different viewpoints in order to produce a pair of matched epipolar projections. These are projections in which the epipolar lines run parallel to the x-axis (or y-axis) and match up between views. Consequently, the disparities between the images are in the x-direction (or y-direction) only. The method is based on the fundamental matrix: a fixed number of epipolar lines are estimated from the fundamental matrix, and these epipolar lines are used to rectify the images.

4.10.4 Grouping with normals in the 2.5D disparity space

Let us consider images containing three planar objects. The images in Figure 41 were taken by a hand-held camera (moved from left to right). In this kind of scene, there exist several homographies, one for each planar object; therefore, it is necessary to estimate the homographies locally. For this scene, my approach extracted 33 affine groups, with several groups for each world plane: due to the differences in motion magnitude across each plane, several affine groups are detected per plane. Figure 42 shows three of these extracted affine groups.

Figure 41 Input images (three books)

Figure 42 Affine patches

The computation of the normal is based on the estimated parameters, not on the raw matches; the raw matches have already been fitted to a plane by the parameter estimation. For each affine patch, the new normal direction of the patch is computed using the rectification homographies. If the rectification homographies for the left and right images are H and H', the new normal direction for each patch becomes:

f_{disparity}(x, y) = (x_r - x_r')^2 + (y_r - y_r')^2

x_r = \frac{H_{11} x + H_{12} y + H_{13}}{H_{31} x + H_{32} y + H_{33}}, \quad y_r = \frac{H_{21} x + H_{22} y + H_{23}}{H_{31} x + H_{32} y + H_{33}}

x_r' = \frac{H'_{11} x' + H'_{12} y' + H'_{13}}{H'_{31} x' + H'_{32} y' + H'_{33}}, \quad y_r' = \frac{H'_{21} x' + H'_{22} y' + H'_{23}}{H'_{31} x' + H'_{32} y' + H'_{33}}

x' = a x + b y + T_x, \quad y' = c x + d y + T_y, \quad H_{33} = 1, \quad H'_{33} = 1

N = \begin{pmatrix} f_{disparity,x}(x, y) \\ f_{disparity,y}(x, y) \\ -1 \end{pmatrix}

where, by the chain rule,

f_{disparity,x}(x, y) = \frac{\partial f_{disparity}}{\partial x} = 2(x_r - x_r')\left(\frac{\partial x_r}{\partial x} - \frac{\partial x_r'}{\partial x}\right) + 2(y_r - y_r')\left(\frac{\partial y_r}{\partial x} - \frac{\partial y_r'}{\partial x}\right)

with, for instance,

\frac{\partial x_r}{\partial x} = \frac{H_{11}}{H_{31} x + H_{32} y + 1} - \frac{(H_{11} x + H_{12} y + H_{13}) H_{31}}{(H_{31} x + H_{32} y + 1)^2}

\frac{\partial x_r'}{\partial x} = \frac{H'_{11} a + H'_{12} c}{H'_{31} x' + H'_{32} y' + 1} - \frac{(H'_{11} x' + H'_{12} y' + H'_{13})(H'_{31} a + H'_{32} c)}{(H'_{31} x' + H'_{32} y' + 1)^2}

and f_{disparity,y} is obtained analogously, with b and d in place of a and c.

Equation 17 Derivation of the normal of the plane in the 2.5D disparity space

After the normal direction of each patch has been computed, the grouping process is performed. If two estimated normal directions are similar (the threshold value is 10 degrees), the two patches are grouped into a single homography group, which in general corresponds to a world scene plane. This simple process allows us to extend our layer concept from simple 2D affine motions to scene planes, which convey more structured information.
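The grouping itself is a simple angular test on unit normals; a sketch (the 10-degree threshold follows the text, while the greedy strategy and the names are my assumptions):

    import numpy as np

    def group_by_normal(normals, max_angle_deg=10.0):
        """Greedily group unit patch normals by angular proximity."""
        cos_thr = np.cos(np.deg2rad(max_angle_deg))
        labels = -np.ones(len(normals), dtype=int)
        reps = []                           # one representative normal per group
        for i, n in enumerate(normals):
            for g, r in enumerate(reps):
                if abs(n @ r) >= cos_thr:   # within the angular threshold
                    labels[i] = g
                    break
            else:
                labels[i] = len(reps)       # open a new homography group
                reps.append(n)
        return labels

    N = np.array([[0.0, 0.0, 1.0], [0.02, 0.0, 1.0], [0.5, 0.0, 1.0]])
    N /= np.linalg.norm(N, axis=1, keepdims=True)
    print(group_by_normal(N))   # [0 0 1]: the first two normals form one group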
Figure 43 shows the plotted normal directions in the disparity space. Remarkably, only three distinctive normal directions can be observed, and these three normals correspond to the three world planes in the original images. All 33 affine groups belong to one of these three groups. Figure 44 shows the grouping results: given the 10-degree threshold, the three homography groups are successfully detected. In Figure 45, the warped images and residual images are presented; the original image is warped and compensated according to each homography motion.

Figure 43 Affine patches in the 2.5D disparity space with their normal directions after image rectification

Figure 44 Three homography groups from local affine patches: (a) homography for the front side of the 'office-depot' box, (b) homography for the 'mathematics book' plane on the left, (c) homography for the book on the floor

Figure 45 Motion compensation results: (a) registered with the homography for the front side of the 'office-depot' box, (b) registered with the homography for the 'mathematics book' plane on the left, (c) registered with the homography for the book on the floor

4.10.5 Results

In Figure 46, Figure 47, Figure 48 and Figure 49, the result for an outdoor house sequence is shown. This sequence includes a large zoom, which makes it hard to generate correspondences. In particular, the correspondences generated on the ground appear as random correspondences and could not form any affine plane. In Figure 47, two arbitrarily chosen extracted affine groups are presented. Figure 48 shows the two homography groups extracted from the local affine groups, together with the warped images and residual images according to the homography transformations. Figure 49 shows the warped image and residual images according to a global homography estimation. The global homography is computed by my RANSAC-based single-motion estimation method. As seen, one global motion cannot fit the motions in the scene.

In Figure 50, Figure 51, and Figure 52, the result for another outdoor sequence is shown. This sequence was acquired by a camera mounted on a moving car. Figure 51 shows the epipolar lines plotted based on the recovered fundamental matrix. Each rectification homography is computed from these epipolar lines; in this case, each rectification homography exhibits a small translation along the y-axis. Figure 52 shows the affine patches in the disparity space, where we can observe two distinctive planes, as well as the grouping of the affine patch points into homography groups. This grouping process does not require re-computing new matches: it simply remaps the original correspondences based on the estimated affine parameters and the rectification homographies.

Figure 46 Input images

Figure 47 Affine groups

Figure 48 Homography groups and motion compensation

Figure 49 One global motion compensation result by RANSAC

Figure 50 Input images

Figure 51 Estimated epipolar lines

Figure 52 Two homography groups in the 2.5D disparity space and motion-compensated images

Chapter 5 Conclusion

5.1 Summary

This dissertation presented a 2D layered representation of video and the subsequent processing methods. Rather than representing a video as a set of frames, it is desirable to have an efficient video representation that provides structural information about the video content, both for video size reduction and for analysis. The layered representation characterizes time-varying information caused by motion: camera motion and independent motion. The robust processing methods analyze the multiple 2D motions in a video and represent consistent motions as layers. My approach addresses the following technical issues involved in layer extraction:

• No requirement to pre-specify the number of layers.
• Each layer is associated with a parametric 2D transformation; the targeted motion models are 2D affine and homography transformations.
• The pixels are assigned to a layer based on motion saliency and color-region matching.

Technically, my approach is parametric. It reliably extracts multiple 2D motion layers, affine or homography, from noisy initial matches. As opposed to non-parametric approaches, a parametric approach can avoid accidental groupings or non-groupings. Also, parametric estimation can approximate and generate motion information where proper matches cannot be determined locally. In addition, many applications, such as compression, require extracting the motion transformation parameters in order to encode the layers. My approach recasts an algebraic motion estimation problem as a geometric problem, in order to apply stronger coherence measures and constraints for motion layer clustering. The geometric spaces formulated from the initial motion matches are the decoupled joint image spaces.
In decoupled joint image spaces, affine transformations are 90 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. identified as planes. Each pixel for a layer is clustered based on plane property. My approach provides a process to extract multiple affine motions simultaneously based on tensor-voting. Tensor voting identifies and estimates locally salient multiple 2D motions while removing outliers. This whole process is non-iterative. In addition, it does not require any prior assumptions, such as a pre-specified number of layers. The only assumptions made are the neighborhood size for voting and the sub-pixel discontinuity for clustering. The corresponding affine parameters are recovered by analyzing correlation matrix of the clustered pixels. The affine layers are clustered into homography motion groups based on the fact that affine patches under the same homography reside on a plane in disparity space. My method estimates local affine motions first based on the fact that all small local motions can be identified by affine motions. Then, it groups the local motions to homography motions. After dominant motion layer extraction, layers are refined by using color-region homogeneity. The assumption is that each homogeneous color region belongs to the same motion. My approach uses this assumption to define an accurate boundary o f the layers and to fill holes within the extracted layers. In general, the local computation of correlation between pixels does not provide a correct motion around occluding/occluded boundaries and noisy areas where the extracted layers include indecisive pixels (holes). For this reason, homogeneous color region is used to assign indecisive pixels to the most-likely motion layer extracted by computing image residual errors based on the extracted motion parameters. 5.2 Comparison between parametric and non-parametric approach My approach focuses on 2D parametric motion layers. When a motion that is not characterized in the defined motion appears, my approach does not perceive as a motion layer. Here, the compared results between parametric and non-parametric 4D voting approaches [Mircea02] are presented. [Mircea02] addressed a non-parametric motion extraction method based on tensor 91 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. voting in 4D (x, y, vx, vy) space. The method extracts smooth motion layers. First of all, it generates multiple matches from the input images and chooses the strongest match for each image pixel in order to keep the uniqueness constraint after the first voting. In the second voting stage, it reinforces the inferred information though voting among selected tokens. Then, layer clustering based on 2D surface smoothness in 4D is performed. 2D densification is followed in order to create dense layer information. As the last step, each boundary of the dense layer is refined based on 2D tensor voting. 2D tensor voting uses images edges and its motion vectors around motion boundaries extracted from the previous steps. Two approaches are compared by using several synthetic and real images. For the synthetic case, the error between the ground truth and the estimated parameters is measured by symmetric transfer error in Equation 7. The exact parameters are shown in Table 2 and Table 3. The results shows that my approach produces better parameter estimation from a sparse correspondence set. 
The error of the 4D voting approach arises when the densification is performed based on the normal directions of the extracted initial layers. Due to the non-parametric smoothness, the points are interpolated based on slightly different neighborhood normal directions; therefore, the estimated parameters are no longer correct.

Figure 53 Synthetic motions - I: (a) affine motion, (b) translation motion, (c) random motions, (e) two views of the motions in a decoupled joint image space

Test Environment 1
Image size: 320x320; number of affine motions: 2; total number of points: 754; noise ratio: 100%

Ground-truth parameters
                              a         b          Tx         c          d         Ty
Affine Motion 1 (112 points)  0.950000  0.100000   -5.000000  0.120000   0.970000  3.000000
Affine Motion 2 (265 points)  1.000000  0.000000   -5.000000  0.000000   1.000000  3.000000

Parameters recovered by my approach
                              a         b          Tx         c          d         Ty
Affine Motion 1               0.949998  0.100000   -5.000049  0.119910   0.970020  2.997528
Affine Motion 2               1.000001  -0.000001  -5.000239  -0.000001  1.000001  3.000350

Parameters recovered by the 4D voting approach
                              a         b          Tx         c          d         Ty
Affine Motion 1               0.9499    0.0980     -4.7585    0.1190     0.9650    3.2520
Affine Motion 2               1.0056    -0.0047    -4.4741    -0.0002    1.0103    3.6318

Measured errors (symmetric transfer errors, in pixels): less than 1 pixel for the motions estimated by my approach; more than 1 pixel for the motions estimated by the 4D voting approach.

Table 2 Estimation result - I

Figure 54 Synthetic affine motions - II: (a) affine motion, (b) translation, (c) transparent affine motion, (d) random motion, (e) two views of the motions in a decoupled joint image space

Test Environment 2
Image size: 320x320; number of affine motions: 3; total number of points: 874; noise ratio: 100%

Ground-truth parameters
                              a         b          Tx         c          d         Ty
Affine Motion 1 (81 points)   0.950000  0.100000   -5.000000  0.120000   0.970000  3.000000
Affine Motion 2 (272 points)  1.000000  0.000000   -5.000000  0.000000   1.000000  3.000000
Affine Motion 3 (84 points)   0.990000  0.001000   10.000000  0.002000   0.990000  10.000000

Parameters recovered by my approach
                              a         b          Tx         c          d         Ty
Affine Motion 1               0.950005  0.099987   -4.999668  0.119998   0.970036  2.998835
Affine Motion 2               1.000000  -0.000001  -4.999714  -0.000000  1.000000  2.999859
Affine Motion 3               0.989164  0.000954   10.111085  0.001962   0.989998  10.005099

Parameters recovered by the 4D voting approach
                              a         b          Tx         c          d         Ty
Affine Motion 1               0.9616    0.0878     -5.1022    0.1046     0.9756    3.1011
Affine Motion 2               0.9978    0.0095     -6.2107    -0.0022    1.0095    1.6299
Affine Motion 3               0.9927    0.0006     9.6638     0.0016     0.9909    10.0304

Measured errors (symmetric transfer errors, in pixels): less than 1 pixel for the motions estimated by my approach; more than 1 pixel for the motions estimated by the 4D voting approach.

Table 3 Estimation result - II
My approach extracts the correct number of layers and accurate affine motions based on extracted inliers. However, the result demonstrates that color region based matching is not sufficient to assign color regions around occlusion into proper motion layers. (a) Two input images (b) Two extracted layers, initial layer boundaries and refined boundaries by 4D voting approach Background: a=1.0, b=0.0, Tx-0.0, c=0.0, d=1.0, Ty=0.0 Fish motion: a=1.0, b=-0.0, Tx=6.0, c=0.0, d=1.0, Ty=0.0 (c) Two extracted layers, velocity fields, layer boundaries, and affine parameters by my approach Figure 55 Result comparison - 1 97 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. (a) Two input images (b) Velocity fields and three extracted layers, initial layer boundaries and refined boundaries by 4D voting approach Background : a=l.001378, b=-0.002478, Tx=0.956671, c=0.000000, d=l.000000, Ty=-0.000000 Left Car: a=l.016844, b=-0.000180, Tx=3.488205, c=0.000000, d=l.000000, Ty=-0.000012 Right car: a=l.018364, b=0.013781, Tx=5.238319, c=0.000001, d=l.000001, Ty=0.000100 (c) Three extracted layers, velocity fields, boundaries by my approach, and affine parameters Figure 56 Result comparison - II 98 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. (a) Two input images (b) Velocity fields and two extracted layers, initial layer boundaries and refined boundaries by 4D voting approach Background: a=0.996600, b=-0.000540, Tx=3.189110, c=0.000000, d=1.000000, Ty=0.000000 Foreground: a=0.989366, b=-0.004522, Tx=8.180511, c=0.004830, d=0.989339, Ty=-0.128776 (c) Two extracted layers, velocity fields, and layer boundaries Figure 57 Result comparison - III 9 9 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. (a) Two input images (b) Initial boundaries o f four extracted layers and refined boundaries by 4D voting approach (c) Extracted motion layers and velocity fields by my approach Figure 58 Result comparison - IV In Figure 58, 4D voting approach method missed the rotating ball motion. The boundaries of background and calendar are accurately recovered. My approach recovers most of motions. However, it includes many other regions due to noise in the initial matches. 100 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. (a) Two input images (b) One smooth layer extracted by 4D voting approach Figure 59 Result comparison - V Figure 59 shows that 4D voting approach extracts one smooth layer that corresponds to a camera motion. In this case, velocities extracted through densification might not correspond to the image regions properly because it does not involve boundary refinement. My approach extracts three homography motions as seen in Figure 44. It shows the exact difference between parametric and non-parametric approach. Parametric approach detects several targeted motion layers from a single camera motion. 101 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Through synthetic and real images, several facts are observed. The characteristics of 4D voting approach can be summarized as following. 
• Ability to process non-rigid motions, such as flag motion.
• Accurate boundary extraction by incorporating 2D voting based on the monocular intensity cue.
• Slow computation, caused by processing multiple matches and by the densification.

The 4D voting approach extracts boundaries accurately compared to my approach. To extract accurate motion boundaries, region matching is not sufficient: it is necessary to incorporate edge information, because certain regions are connected across different motion regions. My approach is characterized as follows:

• Fast processing: no multiple matches and no densification steps are involved.
• Accurate estimation of the motion parameters.
• Targets affine motion only.

When 2D affine motion is present, the model-driven method produces results quickly, because the voting is performed in a lower dimension than in the 4D voting approach. The parameters are estimated from strong inliers and are therefore very accurate. However, my approach cannot process non-affine motions, such as flag motion: in a flag motion case, my approach extracts segments of the flag, but it cannot infer the whole flag motion as one smooth motion. Nevertheless, as seen in many of the image examples, most of the motion existing between two consecutive video frames can be modeled as affine.

5.3 Future extensions

In the future, first of all, further evaluation of my approach needs to be performed: a quantitative comparison to previous approaches should be carried out in terms of accuracy and performance. The first limitation of my work is that the motion boundaries need to be further refined around occlusion areas. Another limitation is that the multiple 2D motions are estimated only for a pair of consecutive images. Extensions to multiple frames are therefore required in order to achieve a true layered representation of a video stream; this would lead to a compact representation of the image regions of the original video stream. Integrating multiple frames can be done in two ways: the layers from each pair-wise estimation can be integrated by tracing corresponding layers across frames, or, alternatively, a simultaneous multi-frame estimation can be performed.

Another direction of extension is to integrate several image properties, such as motion, color and edges, in the tensor voting framework. Currently, edge and color properties are not directly used in the tensor voting framework. Work towards integrating these hybrid properties (motion, color and edges) in the tensor voting framework remains a challenging task.

References

[Adelson91] E. Adelson, Layered Representation for Image Coding, Technical Report 181, MIT Media Lab Vision and Modeling Group, December 1991.

[Aggarwal01] M. Aggarwal and N. Ahuja, High Dynamic Range Panoramic Imaging, IEEE International Conference on Computer Vision (ICCV), 2001.

[Anandan00] P. Anandan and S. Avidan, Integrating Local Affine into Global Projective Images in the Joint Image Space, European Conference on Computer Vision (ECCV), 2000.

[Ayer95] S. Ayer and H. Sawhney, Layered Representation of Motion Video, IEEE International Conference on Computer Vision (ICCV), 1995, pp. 777-784.

[Baker98] S. Baker, R. Szeliski and P. Anandan, A Layered Approach to Stereo Reconstruction, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 1998.
[Ben-Ezra98] M. Ben-Ezra, S. Peleg and M. Werman, Efficient Computation of the Most Probable Motion from Fuzzy Correspondences, IEEE Workshop on Applications of Computer Vision (WACV), 1998.

[Benedetti98] A. Benedetti and P. Perona, Real-time 2-D Feature Detection on a Reconfigurable Computer, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 1998.

[Boykov98] Y. Boykov, O. Veksler and R. Zabih, Markov Random Fields with Efficient Approximations, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 648-655, 1998.

[Burt83] P. Burt and E. Adelson, The Laplacian Pyramid as a Compact Image Code, IEEE Trans. on Communications, Vol. 31, No. 4, 1983.

[Can99] A. Can, C. Stewart and B. Roysam, Robust Hierarchical Algorithm for Constructing a Mosaic from Images of the Curved Human Retina, International Conference on Pattern Recognition (ICPR), 1999.

[Capel98] D. Capel and A. Zisserman, Automated Mosaicing with Super-resolution Zoom, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 1998.

[Caspi01] Y. Caspi and M. Irani, Alignment of Non-Overlapping Sequences, IEEE International Conference on Computer Vision (ICCV), 2001.

[Cham98] T. Cham and R. Cipolla, A Statistical Framework for Long-Range Feature Matching in Uncalibrated Image Mosaicing, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 1998.

[Chen94] Q. Chen, M. Defrise and F. Deconinck, Symmetric Phase-Only Matched Filtering of Fourier-Mellin Transforms for Image Registration and Recognition, IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), Vol. 16, No. 12, 1994.

[Darrell91] T. Darrell and A. Pentland, Cooperative Robust Estimation Using Layers of Support, M.I.T. Media Lab Vision and Modeling Group Tech Report No. 163, February 1991.

[Davis98] J. Davis, Mosaics of Scenes with Moving Objects, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 1998.

[Demirdjian99] D. Demirdjian and R. Horaud, A Homography Framework for Scene Segmentation in the Presence of Moving Objects, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 1999.

[Dornaika99] F. Dornaika and R. Chung, Stereo Correspondence from Motion Correspondence, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 1999.

[Elgammal00] A. Elgammal, D. Harwood and L. Davis, Non-parametric Model for Background Subtraction, European Conference on Computer Vision (ECCV), 2000.

[Faugeras93] O. Faugeras, Three-Dimensional Computer Vision: A Geometric Viewpoint, MIT Press, 1993.

[Fischler81] M. A. Fischler and R. C. Bolles, Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography, Communications of the ACM, Vol. 24, No. 6, pp. 381-395, 1981.

[Fitzgibbon01] A. Fitzgibbon, Stochastic Rigidity: Image Registration for Nowhere-Static Scenes, IEEE International Conference on Computer Vision (ICCV), 2001.

[Francois99] A. Francois and G. Medioni, Adaptive Color Background Modeling for Real-time Segmentation of Video Streams, Proceedings of the International Conference on Imaging Science, Systems, and Technology, Las Vegas, NV, June 1999, pp. 227-232.

[Gaucher99] L. Gaucher and G. Medioni, Accurate Motion Flow Estimation with Discontinuities, IEEE International Conference on Computer Vision (ICCV), Kerkyra, Greece, September 1999.

[Gelgon97] M. Gelgon and P. Bouthemy, A Region-Level Graph Labeling Approach to Motion-Based Segmentation, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 514-519, 1997.
Medioni, Accurate Motion Flow Estimation with Discontinuities, IEEE International Conference on Computer Vision (ICCV), Kerkyra, Greece, Sept 1999. [Gelgon97] M. Gelgon, P. Bouthemy, A Region-Level Graph Labeling Approach to Motion- Based Segmentation, CVPR, pp. 514-519, 1997. 105 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. [Ghosal96] S. Ghosal, A Fast Scalable Algorithm for Discontinuous Optical Flow Estimation, IEEE Trans, on Pattern Analysis and Machine Intelligence (PAMI), vol. 18, no. 2, pp. 181-194,1996. [GrossbergOl] M. Grossberg and S. Nayer, A General Imaging Model and a Method for Finding its Parameters, IEEE International Conference on Computer Vision (ICCV), 2001. [GutchessOl] D.Gutchess, M.Trajkovic,E.Cohen-Solal,D.Lyons and A.K Jain, A Background Model Initialization Algorithm for Video Surveillance, IEEE International Conference on Computer Vision (ICCV), 2001. [HanOl] M. Han and T. Kanade, Multiple Motion Scene Reconstruction from Uncalibrated Views, IEEE International Conference on Computer Vision (ICCV), 2001. [Hartley97] R. Hartley and P.Strum, Triangulation, Computer Vision and Image Understanding (CVIU), 1997. [HartleyOO] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, 2000. [Heitz93] F. Heitz, P. Bouthemy, Multimodal Estimation of Discontinuous Optical Flow Using Markov Random Fields, IEEE Trans, on Pattern Analysis and Machine Intelligence (PAMI), vol. 15, no. 12, pp. 1217-1232, 1993. [Hom86] B. Horn. Robot Vision. M IT Press, 1986. [Isgro99] F. Isgro and E. Trucco, Homography Rectification without Epipolar Geometry, Proceedings IEEE International Conference on Computer Vision and Pattern Recognition(CVPR), June, 1999. [Irani92] M. Irani, B. Rousso and S. Peleg, Detection and tracking multiple moving objects using temporal integration, European Conference on Computer Vision (ECCV), May 1992, pp. 282-287. [Irani93] M. Irani and S. Peleg, Motion Analysis for Image Enhancement: Resolution, Occlusion, and Transparency, Journal o f Visual Communication and Image Representation (VCIR),Vo\. 4, No. 4, pp. 324-335, December 1993. [Irani95] M. Irani, P. Anandan and S. Hsu, Mosaic Based Representation of Video Sequences and Their Applications, IEEE International Conference on Computer Vision (ICCV), 1995. [Irani96] M. Irani, P. Anandan, J. Bergen, R. Kumar, and S. Hsu, Efficient Representations of Video Sequences and Their Applications. Signal Processing: Image Communication, special issue on Image and Video Semantics: Processing, Analysis, and Application, Vol. 8, No. 4, May 1996. 106 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. [Irani98] M. Irani, P. Anandan and D. Weinshall, From Reference Frames to Reference Planes: Multi-View Parallax Geometry and Applications, European Conference on Computer Vision (ECCV), June, 1998. [Ju96] S. X. Ju, M. J. Black and A. Jepson, Skin and Bones: Multi-layer, Locally Affine, Optical Flow and Regularizaton with Transpanrency, Proceedings IEEE International Conference on Computer Vision and Pattern Recognition(CVPR), 1996. [Jurie02] F. Jurie and M. Dhome, Hyperplane Approximation for Template Matching, IEEE Trans, on Pattern Analysis and Machine Intelligence (PAMI), 2002. [KangOO] E.Y. Kang, I. Cohen and G. Medioni, A Graph-based Global Registration for 2D Image Mosaic, International Conference on Pattern Recognition (ICPR), 2000. [Kang02] E.Y. Kang, I. Cohen and G. 
Medioni, Robust Affine Motion Estimation in Joint Image Space using Tensor Voting, International Conference on Pattern Recognition (ICPR), 2002. [Kang97] S. Kang and R. Weiss, Characterization of Errors in Compositing Panoramic Images, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 1997. [KanataniOl] K. Kanatani, Motion Segmentation by Subspace Separation and Model Selection, IEEE International Conference on Computer Vision (ICCV), 2001. [KeOl] Q. Ke and T. Kanade, A Subspace Approach to Layer Extraction, IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Hawaii, Dec. 2001 . [Ke02] Q. Ke and T. Kanade, A Robust Subspace Approach to Layer Extraction, IEEE Workshop on Motion and Video Computing, Orlando, Florida, Dec. 2002. [Kervrann95] C. Kervrann, F. Heitz, A Markov Random Field Model-Based Approach to Unsupervised Texture Segmentation Using Local and Global Spatial Statistics, IEEE Trans. On Image Processing, 4:6, pp. 856-862, 1995. [KoenenOl] R. Koenen, MPEG-4 Overview (V.18), International Organization for Standardisation, ISO/TEC JTC1/SC29/WG11 Coding o f Moving Pictures and Audio, 2001. [Krishnan96] A. Krishnan and N. Ahuja, Panoramic Image Acquisition, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 1996. [Kumar94] R. Kumar, P. Anandan and K. Hanna, Shape recovery from multiple views: a parallax based approach, DARPA Image Understanding Workshop (IUW), 1994. 107 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. [Kumar95] R. Kumar, P. Anandan, M. Irani, J. Bergen and K. Hanna, Representation of Scenes from Collection of Images, IEEE Workshop on Representations o f Visual Scenes (WRVS), June 1995. [LaceyOO] A. Lacey, N. Pinitkam and N. Thacker, An Evaluation of the Performance of RANSAC Algorithms for Stereo Camera Calibration, The British Machine Vision Conference (BMVC), 2000. [Lee97] M.S. Lee and G. Medioni, Inferred Descriptions in Terms o f Curves, Regions, and Junctions from Sparse, noisy, binary Data, International Workshop on Visual Form, pp. 350-367, 1997. [Lee98] M.S. Lee and G. Medioni, Inferring Segmented Surface Description from Stereo Data, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 346-352, 1998. [Lee99] M.S. Lee and G. Medioni, Grouping 0 -, into Regions, Curves, and Junctions, Computer Vision and Image Understanding (CVIU), 1999. [Li99] S. Li and S. Tsuji, Qualitative Representation of Scenes along Route, Journal o f Image and Vision Computing (IVC), pp. 685-700, Vol. 17, 1999. [Luong93] Q. Luong and O. Faugeras, Determining the Fundamental Matrix with Planes: Unstability and New Algorithms, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 1993. [Marr79] D. Marr and T. Poggio, A Theory of Human Stereo Vision, Proc. Roy. Soc. London, vol. B204, pp. 301-328, 1979. [McMillan95] L. McMillan and G. Bishop, Plenoptic Modeling: An Image-Based Rendering System, Proceedings o f ACM SIGGRAPH, 1995. [MedioniOO] G. Medioni, M.S. Lee, and C.K Tang, A Computational Framework for Feature Extraction and Segmentation, Elsevier, 2000. [Morimoto97] C. Morimoto, R. Chellappa and S. Balakirsky, Fast Image Stabilization and Mosaicking, DARPA Image Understanding Workshop (IUW), New Orleans, May, 1997. [Morimoto98] C. Morimoto and R. Chellapa, Evaluation of Image Stabilization Algorithms, Proceedings o f IEEE ICASSP, May, 1998. [Nayer97] S. 
Nayer, Catadioptric Omnidirectional Camera, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 1997. 108 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. [Nicolescu02] M. Nicolescu and G. Medioni, Perceptual Grouping from Motion Cues Using Tensor Voting in 4-D, To appear in the Proceedings o f the European Conference on Computer Vision, Copenhagen, Denmark, May 2002. [Nicolescu02] M. Nicolescu and G. Medioni, Perceptual Grouping from Motion Cues Using Tensor Voting in 4-D, European Conference on Computer Vision (ECCV), 2000, vol. Ill, Copenhagen, Denmark, May 2002, pp. 423-437. [Pardas94] M. Pardas, P. Salembier, and B. Gonzalez. Motion and region overlapping estimation for segmentation-based video coding. IEEE International Conference on Image Processing, (ICIP),pp. 428-432, 1994. [Peleg97] S. Peleg and J. Herman, Panoramic Mosaics by Manifold Projection, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 1997. [Qian02] G. Qian, R Chellappa and Q. Zheng, A Bayesian Approach to Simultaneous Motion Estimation of Multiple Independently Moving Objects, International Conference on Pattern Recognition (ICPR), 2002. [QuanOl] L. Quan, L. Lu, H. Shum andM . Lhuillier, Concentric Mosaic(s), Planar Motion and ID Cameras, IEEE International Conference on Computer Vision (ICCV), 2001. [Reddy96] B. S. Reddy and B. N. Chatteiji, An FFT-based Technique for Translation, Rotation and Scale-Invariant Image Registration, IEEE Trans, on Image Processing (IP), Vol. 5, No. 8, 1996. [Rousso97] B. Rousso, S. Peleg and I. Finci, Mosaicing with Generalized Strips, DARPA Image Understanding Workshop (IUW), New Orleans, May, 1997. [Rousso98] B. Rousso, S. Peleg, I. Finci and A. Rav-Acha, Universal Mosaicing Using Pipe Projection, IEEE International Conference on Computer Vision (ICCV), 1998 [Sawheny95] H. Sawhney. S. Ayer and M. Gorkani, Model-based 2D&3D Dominant Motion Estimation for Mosaicing and Video Representation, IEEE International Conference on Computer Vision (ICCV), 1995. [Sawheny97] H. Sawhney and R. Kumar, True Multi-Image Alignment and its Application to Mosaicing and Lens Distortion Correction, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 1997. [Sawheny98a] H. Sawhney, S. Hsu and R. Kumar, Robust Video Mosaicing through Topology Inference and Local to Global Alignment, European Conference on Computer Vision (ECCV), 1998. [Sawheny98b] H. Sawhney, R. Kumar, G. Gendel, J. Bergen, D. Dixon and V. Paragano, Video Brush: Experiences with Consumer Video Mosaicing, IEEE Workshop on Applications o f Computer Vision (WACV), 1998. 109 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. [Sawheny99] H. Sawhney and R. Kumar, True Multi-Image Alignment and Its Application to Mosaicing and Lens Distortion Correction, IEEE Transaction on Pattern Analysis and Machine Intelligence (PAMI), Vol. 21, No. 3,1999. [SchechnerOl] Y. Schechner and S. Nayer, Generalized Mosaicing, IEEE International Conference on Computer Vision (ICCV), 2001. [Shashua94] A. Shashua, Homography Structure from Uncalibrated Images: Structure from Motion and Recognition, IEEE Trans, on Pattern Analysis and Machine Intelligence (PAMI), Vol. 16, No. 8, 1994. [Shi98] J. Shi and J. Malik, Motion Segmentation and Tracking Using Normalized Cuts, IEEE International Conference on Computer Vision (ICCV), 1998. [Shum98] H. Shum and R. 
Szeliski, Construction and Refinement of Panoramic Mosaics with Global and Local Alignment, IEEE International Conference on Computer Vision (ICCV), 1998. [Shum99] H. Shum and L. He, Rendering with Concentric Mosaics, Proceedings o f ACM SIGGRAPH, 1999. [Szeliski94] R. Szeliski, Image Mosaicing for Tele-Reality Applications, IEEE Workshop on Applications o f Computer Vision (WACV), 1994. [Szeliski97] R. Szeliski and H. Shum, Creating Full View Panoramic Image Mosaics and Environment Maps, Proceedings o f ACM SIGGRAPH, pp. 251-258, 1997. [Szeliski98a] R. Szeliski, A Multi-View Approach to Motion and Stereo, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 1998. [Szelisk98b] R. Szeliski and P. Torr. Geometrically constrained structure from motion: Points on planes. European Workshop on 3D Structure from Multiple Images o f Large- Scale Environments (SMILE), pp. 171-186, Germany, June 1998. [Szeliski99] R. Szeliski, P. Anandan and S. Baker, From 2D Images to 2.5 Sprites: A Layered Approach to Modeling 3D Scenes, IEEE International Conference on Multimedia Computing and Systems(ICMCS), Florence, Italy, Vol. 1, June, 1999, pp. 44-50. [Tang98a] C.K. Tang and G. Medioni, Inference of Integrated Surface, Curve, and Junction Descriptions from Sparse 3-D Data, IEEE Trans, on Pattern Analysis and Machine Intelligence (PAMI), Vol. 20, No. 11, pp. 1206-1223,1998. [Tang98b] C.K. Tang and G. Medioni, Integrated Surface, Curve and Junction Inference from Sparse 3-D Data Sets, IEEE International Conference on Computer Vision (ICCV), pp. 818-824,1998. 110 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. [Tang99a] C.K. Tang and G. Medioni, Robust Estimation o f Curvature Information from Noisy 3D Data for Shape Description, IEEE International Conference on Computer Vision (ICCV), pp. 426-433,1999. [Tang99b] C.K. Tang, G. Medioni, and M.S. Lee, Epipolar Geometry Estimation by Tensor Voting in 8D, IEEE International Conference on Computer Vision (ICCV), pp. 502-509, 1999. [TaoOl] H. Tao, H. S. Sawhney, and R. Kumar, A global matching framework for stereo computation, IEEE International Conference on Computer Vision (ICCV), 2001. [TongOl] W. Tong, C. Tang and G. Medioni, Epipolar Geometry Estimation for Non-Static Scenes by 4D Tensor Voting, IEEE International Conference on Computer Vision (ICCV), 2001. [Tordoff02] B. Tordoff and D. Murray, Guidede Sampling and Consensus for Motion Estimation, European Conference on Computer Vision (ECCV), 2002. [Torr96] P. Torr and A. Zisserman, MLESAC: A new robust estimator with application to estimating image geometry, Computer Vision and Image Understanding(CVIU), 1996. [Torr97] P. Torr and D. Murray, The Development and Comparison o f Robust Methods for Estimating the Fundamental Matrix, International Journal o f Computer Vision 24(3), 1997, pp. 271-300. Kluwer Academic Publishers. [TorrOl] P. Torr, R. Szeliski and P. Anandan, An Integrated Bayesian Approach to Layer Extraction from Image Sequences, IEEE Trans, on Pattern Analysis and Machine Intelligence (PAMI), 23(3), March 2001, pp. 297-303. [Toyama99] K. Toyama, J. Krumm, B. Brumitt and B. Meyers, Wallflower: Principles and practice o f background maintenance”, IEEE International Conference on Computer Vision (ICCV), 1999. [Triggs95] B. Triggs, Matching Constraints and Joint Image, IEEE International Conference on Computer Vision (ICCV), 1995. [Triggs99] B. Triggs, P. McLaguchlan, R. Hartley and A. 
[Triggs01] B. Triggs, Joint Feature Distributions for Image Correspondences, IEEE International Conference on Computer Vision (ICCV), 2001.
[Wang94] J. Y. A. Wang and E. Adelson, Representing Moving Images with Layers, IEEE Trans. on Image Processing, Special Issue on Image Sequence Compression, Vol. 3, No. 5, pp. 625-638, September 1994.
[Weiss96] Y. Weiss and E. Adelson, A Unified Mixture Framework for Motion Segmentation: Incorporating Spatial Coherence and Estimating the Number of Models, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 321-326, 1996.
[Weiss97] Y. Weiss, Smoothness in Layers: Motion Segmentation Using Nonparametric Mixture Estimation, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 520-526, 1997.
[Wren97] C. Wren, A. Azarbayejani, T. Darrell and A. Pentland, Pfinder: Real-time Tracking of the Human Body, IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), Vol. 19, No. 7, pp. 780-785, 1997.
[Xiong98] Y. Xiong and K. Turkowski, Registration, Calibration and Blending in Creating High Quality Panoramas, IEEE Workshop on Applications of Computer Vision (WACV), 1998.
[Zelnik-Manor99] L. Zelnik-Manor and M. Irani, Multi-View Subspace Constraints on Homographies, IEEE International Conference on Computer Vision (ICCV), 1999.
[Zelnik-Manor00] L. Zelnik-Manor and M. Irani, Multi-Frame Estimation of Planar Motion, IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), Vol. 22, No. 10, October 2000.
[Zhang95] Z. Zhang, R. Deriche, O. Faugeras and Q.-T. Luong, A Robust Technique for Matching Two Uncalibrated Images Through the Recovery of the Unknown Epipolar Geometry, Artificial Intelligence Journal, Vol. 78, pp. 87-119, October 1995.
[Zhang97] Z. Zhang, Determining the Epipolar Geometry and its Uncertainty: A Review, International Journal of Computer Vision (IJCV), 1997.
[Zhu01] Z. Zhu, E. Riseman and A. Hanson, Parallel-Perspective Stereo Mosaics, IEEE International Conference on Computer Vision (ICCV), 2001.
[Zoghlami97] I. Zoghlami, O. Faugeras and R. Deriche, Using Geometric Corners to Build a 2D Mosaic from a Set of Images, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 1997.
[Zomet00] A. Zomet and S. Peleg, Efficient Super-Resolution and Applications to Mosaics, International Conference on Pattern Recognition (ICPR), pp. 579-583, 2000.

Appendix A Global registration

From the estimated 2D parameters, we want to process multiple frames and generate mosaic images. For efficiency, the described motion estimation method is performed pair-wise. However, the results of pair-wise local registration alone are insufficient to generate a globally registered mosaic image or the background layers of a complete scene. Although each pair-wise registration may be accurate, the concatenation of these pair-wise transformations leads to global alignment errors, as depicted in Figure 60 (b). Therefore, a global registration technique is required to build consistent large views from a collection of images.
The accumulated error arises partly from the accumulation of errors in the pair-wise registration, but mostly from overlooking the topology of the swaths, i.e. the relative positions of the frames. Therefore, in order to build consistent large views from a collection of images, both temporally and spatially adjacent frames have to be registered. Our approach merges the local (temporally adjacent) and the global (spatially adjacent) registration into a single process.

A.1 Previous work

Most parameter estimation work has focused on estimating the parameters that model the camera motion from a pair of images [Irani95] [Peleg97] [Szeliski97]. These methods overlook the error accumulated during mosaic construction and hide it with simple solutions such as median filtering; such approaches prevent the extraction of correct foreground objects, because the mis-registration error blends with the foreground motions. Several methods have been proposed for global registration [Davis98] [Sawhney98a] [Sawhney98b] [Shum98] [Xiong98]. [Davis98] solves a linear system derived from the collection of pair-wise registration matrices. It assumes that all pair-wise registrations are known and minimizes the algebraic errors caused by the different concatenations of pair-wise registrations leading to the frame-to-mosaic transformation. This method does not explicitly consider the topology of the frames in the mosaic space and does not involve real image matching within the overlapping areas; it therefore simply averages all the geometric differences. In [Sawhney98b], a frame-to-mosaic scheme is used: each newly acquired frame is registered to the mosaic as it is being built. However, these approaches have a major drawback: a single erroneous registration is enough to create an inconsistent mosaic. This implies that the complete mosaic has to be updated for each newly acquired frame, and also that the reference frame cannot be changed to accommodate the 2D topology of the swaths. This topology reflects the relative placement of the acquired frames; in most mosaicing approaches, it is either disregarded or fixed initially [Shum98]. Recently, Sawhney et al. [Sawhney98a] proposed an approach based on a graph representation of the topology of the swaths, which relies on a few heuristics to derive a simplified graph description. Finally, none of these methods separately considers the global errors caused by moving objects in the scene.

Based on the fact that mosaicing deals with large overlaps, [Capel98] [Zomet00] extend simple image registration with super-resolution techniques. [Zhu01] proposes a stereo mosaic for sequences taken by a camera mounted on an airborne platform with a dominant translation. Besides developments in general mosaicing, there are some other approaches. [Can99] designed a dedicated mosaicing method for medical retina images. [Li99] uses the panoramic view to create landmarks by incorporating range and color information, segmenting structural information from the created panoramic view. [Fitzgibbon01] considers image registration for nowhere-static scenes by stochastically learning scene changes and using the trained model to predict the parameters. [Nayer97] proposes to use a special camera to create the mosaic rather than relying solely on a software solution.
[Aggarwal01] [Schechner01] propose to add a filter device on top of a normal or omni-directional camera to capture the dynamic range of the pixels.

A.2 Global registration

In this work, we present an approach that performs local and global alignment by efficiently using the 2D topology of the swaths: a graph-based method for creating seamless 2D mosaics. When multiple frames overlap in the mosaic space, a global registration is performed to minimize the accumulated registration errors and obtain a seamless mosaic. We use a graph to describe the temporal and spatial connectivity of the frames, and show that global registration can be obtained by considering the overlapping regions connected in this graph. The method characterizes global registration through the connectivity recorded in a frame graph, and merges local and global registration into a single framework.

The graph representation of the topology of the swaths allows us to search for the frames of the sequence that are connected to the considered reference frame. The construction of the layer in the mosaic space relies on the relative temporal and geometric locations of the frames. In the case of video clips, the frames have a sequential order, and the relative geometric locations have to be estimated. These relative locations are given by an estimate of the camera transformation between two consecutive frames. For this purpose, we perform a pair-wise registration using a homography transformation. This registration is carried out over every consecutive pair of frames, and therefore provides an efficient estimate of the topology of the swath through the concatenation of the estimated homography transformations. The estimation of the pair-wise transformation is explained in detail in the image registration section.

Figure 60 Residual error from local registration: (a) three input frames of the "palace" sequence; (b) mosaic built using pair-wise local registration only.

A.3 Topology inference

Figure 61 shows an overview of the global registration process. The idea is to reduce the accumulated errors in the overlapping regions by refining the obtained frame-to-mosaic registration parameters. The topology of the frames and the overlapping regions are recovered from the initial transformations, which are computed by concatenating the pair-wise transformations. Since the location and shape of a frame, and of the overlapping regions in the mosaic space, depend on the reference frame, we first find a new reference frame that leads to minimum warping. Then, the frame graph is constructed to explicitly express the spatial connectivity of the frames. Based on grid matching and bundle adjustment, we refine the original transformations.

Figure 61 Global registration method: input frames, pair-wise local registration and an initial mosaic are followed by the selection of a new reference frame, frame graph construction, grid matching, parameter refinement and image integration, yielding a globally registered mosaic.

The mosaic image changes with the choice of the reference frame, because each input image is warped by the transformation between the reference frame and that frame, denoted $M_{R,j}$ below.
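As a concrete illustration of this concatenation, the sketch below chains pair-wise homographies into frame-to-reference transforms. It is a minimal sketch under stated assumptions, not the thesis implementation: the matrix convention (pairwise[i] maps frame i+1 into frame i) and the function name frame_to_reference are ours.

```python
import numpy as np

def frame_to_reference(pairwise, ref):
    """Chain pair-wise homographies into frame-to-mosaic transforms.

    pairwise[i] is a 3x3 homography mapping frame i+1 into frame i
    (an assumed convention); N frames yield N-1 pair-wise matrices.
    Returns M[j], mapping frame j into the reference frame `ref`.
    """
    n = len(pairwise) + 1
    M = [None] * n
    M[ref] = np.eye(3)
    for j in range(ref, n - 1):          # frames after the reference
        M[j + 1] = M[j] @ pairwise[j]
    for j in range(ref - 1, -1, -1):     # frames before the reference
        M[j] = M[j + 1] @ np.linalg.inv(pairwise[j])
    return [m / m[2, 2] for m in M]      # normalize the homogeneous scale
```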
Before we compute the connectivity of the frames from the overlapping regions, we find a new reference frame that minimizes the image warping caused by the frame-to-mosaic transformation. The reference frame $R$ is the one minimizing $A(R)$, where

$$A(R) = \sum_{j} \mathrm{Area}\big( M_{R,j}(I_j) \big)$$

and $A(R)$ is the mosaic area obtained when each input frame $I_j$ is warped with respect to the reference frame $R$; $M_{R,j}$ is the transformation between the reference frame $R$ and image $j$. Under our transformation model, a warped image is a quadrangle, which can be decomposed into two triangles; the warped area is the sum of the areas of these two triangles. This computation is simple and fast, because the area is computed from the previously acquired transformation parameters and only the four corners of each image; it does not require any access to the actual image data.

In Figure 62, the new reference frame is marked by the circle with a cross. Each node represents the center of an input frame, and the location of a node is the warped location of that center under the frame-to-mosaic transformation. The global registration requires knowledge of the overlapping areas between projected frames; these areas allow us to identify overlapping frames in the mosaic space. Initially, the connectivity among the frames is a tree whose root is the reference frame. If two frames overlap by more than 50%, we create an edge between them to express their spatial connectivity. After adding the spatial connectivity, the structure forms a graph, which we call the frame graph. The frame graph plays an important role in efficient error minimization. Figure 62 shows the constructed frame graph for the "palace" sequence. Each node represents the center of a frame in the mosaic space. The linear tree linked by black edges is the initial temporal connectivity, and the additional red edges show the spatial connectivity in the mosaic space. The gray node is the initial reference frame, and the gray node with the cross mark is the new reference frame selected by the criterion above.

Figure 62 Initial linear tree linked by black edges, and the frame graph linked by black and gray edges, for the "palace" sequence.
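The reference-frame selection and the frame-graph construction above reduce to a few lines of geometry. The sketch below is a minimal illustration, assuming the frame_to_reference helper from the previous sketch, frames of size w x h, and an axis-aligned bounding-box approximation of the warped quadrangles for the overlap test (the method described above uses the quadrangles themselves); the function names and the 50% threshold parameterization are ours.

```python
import numpy as np
from itertools import combinations

def warp_corners(M, w, h):
    """Map the four image corners through homography M (inhomogeneous result)."""
    c = np.array([[0, 0, 1], [w, 0, 1], [w, h, 1], [0, h, 1]], float)
    p = (M @ c.T).T
    return p[:, :2] / p[:, 2:3]

def warped_area(M, w, h):
    """Area of the warped quadrangle, as the sum of two triangle areas."""
    p = warp_corners(M, w, h)
    tri = lambda a, b, c: 0.5 * abs((b[0] - a[0]) * (c[1] - a[1])
                                    - (b[1] - a[1]) * (c[0] - a[0]))
    return tri(p[0], p[1], p[2]) + tri(p[0], p[2], p[3])

def pick_reference(pairwise, w, h):
    """Reference frame R minimizing A(R) = sum_j Area(M_{R,j}(I_j))."""
    n = len(pairwise) + 1
    cost = [sum(warped_area(M, w, h) for M in frame_to_reference(pairwise, r))
            for r in range(n)]
    return int(np.argmin(cost))

def frame_graph(transforms, w, h, min_overlap=0.5):
    """Temporal chain plus spatial edges for >50% overlap (bounding-box test)."""
    boxes = []
    for M in transforms:
        p = warp_corners(M, w, h)
        boxes.append((p[:, 0].min(), p[:, 1].min(), p[:, 0].max(), p[:, 1].max()))
    edges = {(i, i + 1) for i in range(len(boxes) - 1)}      # temporal edges
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    for i, j in combinations(range(len(boxes)), 2):
        x0, y0 = max(boxes[i][0], boxes[j][0]), max(boxes[i][1], boxes[j][1])
        x1, y1 = min(boxes[i][2], boxes[j][2]), min(boxes[i][3], boxes[j][3])
        inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
        if inter > min_overlap * min(area(boxes[i]), area(boxes[j])):
            edges.add((i, j))                                 # spatial edge
    return edges
```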
A.4 Global registration constraint

Global registration requires estimating the errors in the mosaic space for the purpose of minimization. Formally, the globally registered mosaic should satisfy

$$\min \sum_{i,j} \big\| M_{R,i}(I_i) - M_{R,j}(I_j) \big\|$$

over all pairs of overlapping frames $i$ and $j$. Implementing such an approach directly is computationally expensive, unless an efficient representation of the frame topology is used. Instead, we use a tessellation of the mosaic: we divide the mosaic into regular grids, and the center points defining the grids are used as anchors for grid matching. We extract a grid from each image by re-projecting the anchor position of the mosaic grid, and we perform grid matching when the two grids are extracted from frames connected in the frame graph. Based on this grid matching, we rewrite the initial global registration constraint as

$$\min \sum_{ij \in F} \big| M_{R,i}(G_i) - M_{R,j}(G_j) \big|$$

where $F$ is the frame graph, $ij$ is an edge between frame $i$ and frame $j$, and $G_i$ and $G_j$ are the centers of the grid in images $i$ and $j$. Grid matching is used to reduce the computation time. Even so, considering the number of grids and the number of edges connected to each grid, it remains an expensive operation; to make the global registration more efficient, we perform grid matching only for images overlapping by more than 50%.

A.5 Global parameter refinement

Based on the acquired grid adjustments, a least-squares method is used to estimate the new parameters. For all the global transformations between the reference frame and each frame, we solve $CX = b$, where $X$ contains all the frame-to-mosaic parameters that we want to refine globally, $C$ consists of the re-projected locations of the grids in the images, and $b$ consists of the new grid locations after grid matching. Results are presented in Figures 63 to 65.
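A minimal per-frame sketch of this least-squares step is given below, assuming that matched grid-anchor positions are already available. It refines a single frame-to-mosaic homography with numpy's lstsq, whereas the method above solves one joint system for all frames; the function name and the parameterization (8 homography parameters with the last entry fixed to 1) are illustrative choices.

```python
import numpy as np

def refine_frame(grid_img, grid_mosaic):
    """Refine one frame-to-mosaic homography from matched grid anchors.

    grid_img:    k x 2 anchor positions re-projected into the input frame.
    grid_mosaic: k x 2 matched positions in the mosaic after grid matching.
    Builds the linear system C X = b row by row and solves it in the
    least-squares sense.
    """
    C, b = [], []
    for (x, y), (u, v) in zip(grid_img, grid_mosaic):
        # u = (h0 x + h1 y + h2) / (h6 x + h7 y + 1), likewise for v
        C.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        C.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h, *_ = np.linalg.lstsq(np.asarray(C, float), np.asarray(b, float),
                            rcond=None)
    return np.append(h, 1.0).reshape(3, 3)
```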
Figure 63 Globally registered mosaic image I: (a) mosaic without global registration; (b) mosaic with global registration.

Figure 64 Globally registered mosaic image II: (a) mosaic without global registration; (b) mosaic with global registration.

Figure 65 Globally registered mosaic images III.

Here, we briefly examine the complexity of the global registration process. Pair-wise registration requires O(N-1) operations, where N is the number of input frames; on a Pentium III at 1 GHz, our program performs pair-wise registration at 15 frames per second for 320x240 resolution. Grid matching requires O(BM), where B is the number of grids and M is the number of edges between frames; in general, M is less than N/2. In the worst case, the whole global registration process requires O(BN).

A.6 Global registration in the presence of moving objects

Our global registration method works well, but if there is a significant number of moving objects in the sequence, the correlation between grids yields wrong adjustments for the global registration. We therefore extend our approach to a robust algorithm that handles the presence of moving objects. While keeping the benefits of the graph-based framework, we take moving objects into account as follows: when the correlation between grids is computed, we also compute, for each grid block of a frame, its likelihood with respect to its neighbors. Motion blocks are then identified using the median of the likelihood values of the connected blocks; that is, we single out blocks whose likelihood is below the median. The global constraint is rewritten as

$$\min \sum_{\substack{ij \in F \\ L(B_i) > \mathrm{median},\ L(B_j) > \mathrm{median}}} \big| M_{R,i}(B_i) - M_{R,j}(B_j) \big|$$

where $L(B_i)$ is the likelihood of block $B_i$.

The results show that the described method works in the most challenging situations, such as basketball, soccer, or tennis videos, as shown in Figure 67 and Figure 68. The algorithm differs from the original global registration method as follows:

1. Perform the pair-wise local registrations and find a new reference frame.
2. Construct an image mosaic and a frame graph.
3. Locate a grid on the mosaic (the following steps are performed at each grid point).
4. For each frame containing the whole grid, find the corresponding grid in the other frames; those frames must be connected to it in the frame graph.
5. Compute the mis-registration using local correlation, and store the correlation.
6. When all the corresponding blocks are processed, compute the likelihood of each block of the grid.
7. Construct the sparse matrix using only the block matches whose likelihood is above the median, and refine the parameters.
8. Finally, construct the image mosaic by multi-view registration from the refined frame projection matrices.

Steps 6 and 7 distinguish the blocks containing motion and remove them from the computation of the global registration.
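The median gating used in step 7 can be sketched in a few lines. The data layout (a dict keyed by graph edge and block id) and the function name below are illustrative assumptions, not the thesis data structures.

```python
import numpy as np

def gate_blocks_by_median(likelihood):
    """Keep only grid blocks whose correlation likelihood exceeds the median.

    likelihood maps (frame_i, frame_j, block_id) -> L, the correlation-based
    likelihood of a block (an assumed layout). Blocks covered by moving
    objects correlate poorly across frames, so gating at the median keeps
    them out of the sparse least-squares system.
    """
    med = np.median(list(likelihood.values()))
    return {key: L for key, L in likelihood.items() if L > med}
```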
Figure 66 Global registration result for a basketball sequence, without blending: (a) three frames of the basketball sequence; (b) local registration (no blending); (c) global registration (no blending).

Figure 67 Global registration result for a basketball sequence, with blending: (a) local registration (temporal averaging); (b) global registration (temporal averaging).

Figure 68 Local registration for the "soccer" sequence: (a) three input frames from the soccer sequence; (b) local registration (temporal averaging).

Figure 69 Global registration for the "soccer" sequence: (a) global registration (temporal averaging); (b) global registration (median filter) from 30 frames.

Appendix B Processing for non-parametric motions

Even when independent motions are not characterized by either an affine or a projective transformation, it is still necessary to differentiate those motion regions. To achieve this, we introduce background and foreground layers, identified from the spatio-temporal color distribution. The background layer consists of the pixels that remain static after camera motion compensation; the residuals appear due to moving objects or parallax. Since these residuals are not constrained by the camera motion alone, we categorize them into foreground layers. Here, we model the background and extract the background layer; based on the background model, we also characterize the components of the foreground moving-object layer and extract them. The technical issue in background layer processing is to perform robust background modeling in the presence of moving objects. For object motion layer processing, we need to define the property of foreground pixels and group them together. To this end, we use a hybrid pixel-based background model that takes color and optical flow into account to define the property of background pixels, and we use Gaussian distributions and connected-component grouping for the moving objects.

B.1 Previous work

The purpose of modeling the background is to extract the static component of the video scene and to infer the moving objects, or foreground. In general, background subtraction has been developed for the fixed-camera case. Most background extraction or background initialization methods use unimodal representations, which compute a single value (median or mean) per pixel [Francois99] [Toyama99] [Wren97]; our system does the same. Background extraction using the median intensity relies on the assumption that the background at each pixel is visible more than fifty percent of the time; the advantage of using the median rather than the mean is that it avoids blending pixel values. Adaptive algorithms update the background model over time [Francois99] [Toyama99] [Wren97], but these methods also blend pixel values in areas of the image that have recently changed. Recently, [Gutchess01] pointed out that pixel change alone does not give correct information for background extraction; their method performs background initialization during a learning period, handles only the fixed-camera case, and does not adapt the background over time. Our approach is very similar to their work; the difference is that we consider input from a moving camera and perform the background extraction for each color channel, since many methods have shown the limitations of considering pixel changes with intensity values only.

B.2 Background layer extraction

Background extraction is performed after local pair-wise registration and global registration, to minimize the residuals from mis-registration. After image registration, the situation becomes similar to background initialization for fixed-camera surveillance. The difference is that background initialization for a fixed camera usually relies on a long exposure period without any moving objects, during which the properties of the scene can be modeled, and each pixel has the same exposure period. In our case, however, moving objects appear frequently and each pixel has a different exposure period due to the camera motion. Therefore, a robust method is required to model the background pixels in the presence of moving objects. Before discussing our background modeling method, we characterize the properties of background pixels:

• Each pixel of the background image appears for at least a short interval of the sequence.
• The background is stationary.
• Background adaptation is not necessary, since the whole sequence is given for the process.

Our background model computes the background property per pixel, based on color and net optical flow. Rather than relying on color only, and mainly inspired by [Gutchess01], we use information about the motion in the vicinity of a pixel as an additional criterion: the optical flow in the neighborhood surrounding each pixel. If the optical flow in the neighborhood is directed toward the pixel, a moving object is likely approaching the pixel; if the majority of the optical flow is directed away from the pixel, the moving object is likely leaving the area. For each pixel, we compute the "incoming" and "outgoing" optical flow and combine them into the net-flow:

$$\text{Net-flow} = \text{Inflow} - \text{Outflow}$$

where Inflow is the optical flow entering the pixel and Outflow is the optical flow exiting the pixel. The negative of the net-flow at each pixel determines the strength of the background property. The corresponding algorithm is described below.

Algorithm description:
1. For each pixel location in the mosaic space, find all intervals of stable intensity.
2. Initialize the accumulated net-flow and the likelihood to zero.
3. Compute the optical flow for each pair of consecutive image frames.
4. Compute the net-flow between the two frames.
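A compact per-pixel sketch of this selection follows, under loudly stated assumptions: the frames are already registered into a common mosaic space, the optical-flow fields are precomputed, the net-flow is approximated by the negative divergence of the flow (a stand-in for the inflow/outflow bookkeeping above), and a short temporal window around the best time replaces the stable-intensity intervals.

```python
import numpy as np

def background_model(stack, flow_u, flow_v):
    """Net-flow based background selection, a minimal per-pixel sketch.

    stack:  T x H x W x 3 registered color frames (mosaic space).
    flow_u, flow_v: (T-1) x H x W optical-flow components between
    consecutive frames (assumed precomputed).
    """
    T, H, W, _ = stack.shape
    netflow = np.zeros((T - 1, H, W))
    for t in range(T - 1):
        # Flow converging on a pixel (negative divergence) means inflow
        # exceeds outflow, i.e. a moving object is likely arriving.
        div = np.gradient(flow_u[t], axis=1) + np.gradient(flow_v[t], axis=0)
        netflow[t] = -div
    likelihood = -np.cumsum(netflow, axis=0)   # negative accumulated net-flow
    best_t = likelihood.argmax(axis=0)         # most background-like time
    bg = np.empty((H, W, 3), float)
    for y in range(H):
        for x in range(W):
            t0 = int(best_t[y, x])
            lo, hi = max(0, t0 - 2), min(T, t0 + 3)
            bg[y, x] = np.median(stack[lo:hi, y, x], axis=0)
    return bg
```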
5. Compute the accumulated net-flow over time.
6. Compute the likelihood of the pixel being background at time t.
7. For each pixel, calculate the average likelihood over each stable interval, and choose the interval with maximum likelihood.
8. Use the mean/median and variance of each color channel over the chosen interval as the parameters of the background model.

From the algorithm, each pixel of the background image is the mean or the median over the chosen interval. Figure 71 shows the background layer extracted from the input presented in Figure 70.

B.3 Foreground layer extraction

Foreground layer extraction builds a Gaussian distribution for each pixel. Based on the extracted background property, we model each pixel with a Gaussian distribution with mean $\mu$ and standard deviation $\sigma$; the threshold value for deciding moving objects is $2\sigma^2$. Since the distribution is computed pixel by pixel, the detected residual pixels include scattered pixels due to illumination changes. To remove these, image morphological operations and connected-component processing are performed.
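A minimal sketch of this test follows, assuming per-pixel background mean and standard deviation images from the previous step; the scipy-based morphological cleanup and the min_area parameter are illustrative choices, not the thesis implementation.

```python
import numpy as np
from scipy import ndimage

def foreground_mask(frame, bg_mean, bg_std, min_area=50):
    """Per-pixel Gaussian test against the background model.

    A pixel is declared foreground when its squared deviation from the
    background mean exceeds 2*sigma^2 in any color channel (the threshold
    quoted above). frame and bg_mean are H x W x 3, bg_std is H x W x 3;
    min_area is an illustrative cutoff for connected-component filtering.
    """
    dev2 = (frame.astype(float) - bg_mean) ** 2
    mask = (dev2 > 2.0 * bg_std ** 2).any(axis=2)
    mask = ndimage.binary_opening(mask)          # drop scattered pixels
    labels, _ = ndimage.label(mask)              # group into components
    sizes = np.bincount(labels.ravel())
    keep = sizes[labels] >= min_area             # small blobs are noise
    keep[labels == 0] = False
    return keep
```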
In Figure 72, the extracted foreground layers are presented. Each person at each time t represents a foreground layer; all layers are shown in one image for easier viewing. Figure 73 shows the synopsis mosaic, which displays all the events represented by the foreground layers on top of the background layer. Figure 74 shows the mosaic video: each foreground layer is superimposed on the background at its time t. Such a sequence can help surveillance applications locate the moving objects in a large environment.

Figure 70 Input "Girl Walking" sequence: three frames from 158 input frames.

Figure 71 Background layer extracted from the "Girl Walking" sequence.

Figure 72 Extracted foreground layers.

Figure 73 Synopsis mosaic: foreground layers superimposed on the background layer.

Figure 74 Mosaic video displaying the superimposed foreground layers on top of the background layer at each time t.

B.4 Application to compression

Compression involves 2D motion estimation. Figure 75 shows the motion estimation method applied to compression. The sequence is captured while the camera zooms in. In the zoom case, the output resolution must be taken into account. In our method, we apply the transformation inversely for the mosaic construction, based on the zooming factor: in the zoom-in case, if we started the mosaic construction at the input resolution, each successive frame would be represented by a part of the input frame, and the resolution of the successive frames would become coarser than the original resolution. We therefore generate the mosaic inversely, based on the recovered zoom parameter, to keep the full input resolution.

Figure 75 Compression example: (a) input sequence, three frames out of 52 frames with zoom-in; (b) output sequence, three frames out of 52 frames with zoom-in.

We computed the approximate compression ratio for this sequence:

• Number of bytes of the input sequence: about 12 MB.
  - a = 52 frames
  - b = 320x240 resolution for each frame
  - c = RGB color channels, 8 bits (1 byte) per channel
  - Total number of bytes = abc = 11,980,800 bytes (95,846,400 bits), i.e. about 12 MB.
• The AVI video sequence with full-quality compression by IV32 Intel Indeo Video R3.2: 1.51 MB.
• Number of bytes of the sequence re-generated from the mosaic and the residual images: less than 1 MB.
  - a = background mosaic image size: 335x263 pixels
  - b = number of residual pixels in the 52 frames: 52,800
  - c = RGB color channels, 8 bits (1 byte) per channel
  - d = compression overhead, such as indicating the locations of the residual pixels
  - Total number of bytes = (a + b)c + d = 422,715 bytes + overhead, i.e. about 423 KB plus overhead.

Even if we assume that the overhead offsets part of the per-frame compression gain, the expected compression ratio is still high. Our motion estimation method shares the same idea as MPEG-4 encoding; based on this rough computation, we showed that parametric global encoding provides higher compression.
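For reference, the arithmetic above can be checked in a few lines (the quantities are those quoted in the list; "28x" is the raw-to-mosaic ratio before overhead):

```python
# Back-of-the-envelope check of the compression figures quoted above.
frames, w, h, ch = 52, 320, 240, 3
raw_bytes = frames * w * h * ch                  # 11,980,800 bytes (~12 MB)
mosaic_bytes = 335 * 263 * ch                    # background mosaic: 264,315
residual_bytes = 52_800 * ch                     # residual pixels: 158,400
total = mosaic_bytes + residual_bytes
print(raw_bytes, total)                          # 11980800 422715
print(round(raw_bytes / total, 1))               # roughly 28x before overhead
```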