MOTION SEGMENTATION AND DENSE RECONSTRUCTION OF SCENES CONTAINING MOVING OBJECTS OBSERVED BY A MOVING CAMERA

by Chang Yuan

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

August 2007

Copyright 2007 Chang Yuan

Acknowledgments

My first thanks go to my advisor, Professor Gérard Medioni, who has guided me through every aspect of academic research and helped me build a high-level perspective of the computer vision field. My PhD research would not have been accomplished without his enlightening ideas, insightful comments, and warm support.

I appreciate the constructive advice and precious experience shared by Professor Ramakant Nevatia and Professor Isaac Cohen. I enjoyed the casual "corridor" conversations with Ram (and wish I could have had more) and the pleasant cooperation with Isaac. I would also like to thank Professor Gaurav Sukhatme, Professor Paul Debevec, Dr. Zhengyou Zhang, Professor Boris Rozovsky, Professor Alexander Tartakovsky, and Professor James Moore II for spending their time on my defense and/or qualifying exam committees.

My years at USC would not have been so enjoyable without the help and friendship of my talented colleagues in the USC Computer Vision Group, especially Jinman Kang, Douglas Fidaleo, Qian Yu, Tae Eun Choe, Changki Min, Fengjun Lv, Xuefeng Song, and Bo Wu.

My deepest love goes to my wife Lan, who has always been my inspiration and has sacrificed a lot for me. Lastly, I am grateful to my family and Lan's family for their care, encouragement, and support. I dedicate this dissertation to them.

Table of Contents

Acknowledgments
List of Tables
List of Figures
Abstract
1 Introduction
  1.1 Problem Definition
  1.2 Overview of the Approach
  1.3 Outline
2 Multiple Image Registration
  2.1 Introduction
  2.2 Feature Extraction and Matching
  2.3 Camera Projection Models
  2.4 Geometric Constraints
    2.4.1 Homography
    2.4.2 Essential and Fundamental Matrix
    2.4.3 Trifocal Tensor
  2.5 Robust Estimation of Geometric Constraints
  2.6 Registration Process within Temporal Windows
  2.7 Experimental Results
  2.8 Summary and Discussion
3 Motion Segmentation
  3.1 Introduction
  3.2 Related Work
  3.3 Multi-view Motion Detection
    3.3.1 Formal definition of motion detection
    3.3.2 Limitation of the epipolar constraint
  3.4 Parallax Rigidity Constraints
    3.4.1 Plane+Parallax representation
    3.4.2 Structure consistency constraint
    3.4.3 Derivation of Structure Consistency Constraint
    3.4.4 Degenerate cases for three-view constraints
  3.5 Robust Estimation of Structure Consistency Constraint
    3.5.1 Linear solution
    3.5.2 Non-linear refinement
    3.5.3 Stability analysis
  3.6 Implementation Details
    3.6.1 Geometric constraint estimation
    3.6.2 Homography-based motion detection
    3.6.3 Parallax filtering
    3.6.4 Spatial-temporal JPDAF based object tracking
  3.7 Experimental Results
    3.7.1 Qualitative evaluation
    3.7.2 Quantitative evaluation
    3.7.3 Parameter selection
  3.8 Summary and Discussion
4 Sparse 3D Reconstruction
  4.1 Introduction
  4.2 Related Work
  4.3 Overview of Our Approach
    4.3.1 Motion segmentation, object tracking, and frame decimation
    4.3.2 Preliminaries for 3D reconstruction
    4.3.3 3D Reconstruction of the static background
    4.3.4 Scale ambiguity
  4.4 3D Shape Recovery of Moving Objects
    4.4.1 Reconstruction process for moving objects
    4.4.2 Shape inference from relative motion
  4.5 3D Alignment of Moving Objects
    4.5.1 An integrated solution to object scale and motion
    4.5.2 Planar-translation assumption
    4.5.3 Solving object motion with a known plane
    4.5.4 Solving object motion based on an unknown plane
  4.6 Experimental Results
    4.6.1 Qualitative results of indoor image sequences
    4.6.2 Qualitative results of outdoor image sequences
    4.6.3 Quantitative results of the sequences
  4.7 Summary and Discussion
5 Dense Volumetric 3D Reconstruction
  5.1 Introduction
  5.2 Related Work
  5.3 Initial Processes of the Approach
  5.4 Volumetric Decomposition
    5.4.1 Bounding volume estimation
    5.4.2 Adaptive volume tessellation
    5.4.3 Photo-motion variance
  5.5 Voxel Labeling Process
    5.5.1 Initialization with surface points
    5.5.2 Deterministic labeling method
    5.5.3 Graph cuts based optimization
    5.5.4 Summary of the labeling process
  5.6 Experimental Results
  5.7 Summary and Discussion
6 Conclusions
  6.1 Summary of Contributions
  6.2 Future Directions
Bibliography

List of Tables

2.1 Minimal number of features required for geometric constraint estimation
3.1 Multi-view geometric constraints for motion detection
3.2 Quantitative evaluation results of motion detection
4.1 Average 2D reprojection errors (in pixels) for all the sequences
5.1 Examples of voxel label sequences at different time instants: EMP (empty), BAK (background), OBJ_k (the k-th moving object)
6.1 Camera-object categorization of dynamic video scenes

List of Figures

1.1 Example sequences to be analyzed
1.2 The pipeline of our approach
2.1 Extracting and matching Harris corners from video frames
2.2 The RANSAC scheme for robust estimation of geometric constraints
2.3 Image registration results of the "uvs" sequence
3.1 A decision-tree based scheme for motion detection by multi-view geometric constraints
3.2 Applying the epipolar constraint to motion detection
3.3 "Plane+Parallax" decomposition with respect to two reference planes
3.4 Degenerate cases for motion detection with three-view constraints
3.5 Stability of the estimated G matrix w.r.t. pixel noise and inlier ratio
3.6 Overview of the motion detection and object tracking system
3.7 Distribution of disparity values with respect to the structure consistency constraint
3.8 Detection and tracking results of the "road" sequence
3.9 Disparity maps with respect to three different constraints. Darker pixels correspond to larger disparity values
3.10 Motion detection results of the "seq29" sequence
3.11 Motion detection results of the "Tutor Hall" sequence
3.12 Ground-truth data for frame 55 in the "road" sequence
3.13 Detection results of frame 45 in "Tutor Hall" with wrong parameters
4.1 Our approach to 3D Euclidean reconstruction of dynamic scenes
4.2 Global scale ambiguity in 3D reconstruction of static scenes. The camera is depicted by a triangle facing the object point
4.3 Reconstruction of moving objects with different detail levels
4.4 The conversion of "real camera + moving object" to "virtual camera + static object"
4.5 Object motion estimation with the planar-translation constraint
4.6 Results of the "toy-car" sequence
4.7 Results of the "ball" sequence
4.8 Results of the "forest" sequence
5.1 Overview of our approach
5.2 3D volumetric decomposition of the scene
5.3 Examples of oriented image patches projected from the same voxel into two images
5.4 Results of frame 90 (left) and 110 (right) from the "toy-car" sequence
5.5 Results of frame 1200 (left) and 1232 (right) from the "forest" sequence

Abstract

We investigate two fundamental issues in computer vision: 2D motion segmentation and 3D dense shape reconstruction of a dynamic scene observed from a moving camera. The scene contains multiple rigid objects moving in a static background, while the camera undergoes general 3D rotation and translation. Our goal is to segment the video frames into 2D motion regions and static background areas, and then to reconstruct the dense 3D shape of both parts of the scene.

Motion segmentation of image sequences shot by a moving camera is inherently difficult, as the camera motion induces a displacement for all the image pixels. This camera motion is compensated for by a number of geometric constraints estimated between video frames. The pixels that cannot be compensated for by these constraints are classified as motion regions. A novel 3-view constraint is proposed to handle the cases where existing ones do not work well. The geometric constraints are combined in a decision-tree based method for segmenting the motion regions from the background area in each video frame.

After motion segmentation, the sparse 3D structure of the static background and the 3D camera motion are estimated by the well-developed "Structure and Motion" (SaM) methods. The same SaM methods are applied to recover the 3D shape of moving objects from a moving camera, based on their relative motion. The object scale and motion, however, can only be solved up to an unknown scale, unless additional assumptions are available.
In our scenario, a planar-motion assumption is introduced: the object motion trajectory must be parallel to a plane. With the aid of the planar-motion assumption, the 3D object motion trajectory can be uniquely determined.

The sparse 3D reconstruction of the dynamic scene is extended to a dense volumetric one. The whole scene is divided into a set of volume elements, termed voxels. Each voxel is assigned an object label which may change over time. The task of dense reconstruction is then accomplished by a novel voxel coloring method that finds the optimal label assignment for each voxel to minimize photo-motion variance measures between the voxels and the original images.

Chapter 1

Introduction

1.1 Problem Definition

In this dissertation, we analyze image sequences containing multiple rigid objects moving in a static background, shot by a camera undergoing general 3D rotation and translation. Typical examples are airborne or handheld cameras following moving objects in a cluttered background, as shown in Figure 1.1.

Figure 1.1: Example sequences to be analyzed: (a) airborne cameras; (b) handheld cameras.

Our goal is to perform the following two tasks on such image sequences:

1. to segment the video frames into motion regions and static background areas in 2D image space;

2. to reconstruct the dense shape of both moving objects and static background in 3D object space.

Indeed, the latter task depends on the former, as the 3D object shape is inferred from 2D image cues. In order to ensure the generality of our approach, we do not assume any model information of the scene to be known, such as objects' appearances, dimensions, and identities.
Instead, the main sources of information for our approach are cues from the original images, including: (1) photometric cues, namely the intensity values of image pixels, which are utilized by image matching methods [82, 54]; (2) geometric cues, namely the positions of image pixels, which are exploited by multi-view geometry algorithms [21]. In short, our approach can be considered a generic geometric analysis of "moving objects + moving camera" scenarios.

Our work can be applied to a broad range of areas, including:

• Video surveillance. The moving camera spans a larger portion of the surveillance area and generates a richer description of the objects to be tracked, including their dimensions, distance, and 3D shape.

• Image synthesis. The 3D reconstruction results can be used to generate a virtual fly-around of the whole dynamic scene from new viewpoints. Synthetic objects can be inserted into the original scene as well.

• Augmented reality. Annotative text can be drawn onto the reconstructed surface of moving objects and updated dynamically.

• Robot navigation. The visual sensors mounted on robots can be enhanced by the ability to track and localize moving objects such as walking humans.

• Content-based video coding. The segmented motion regions can be encoded at higher bit-rates while the static background is encoded at lower ones, which eventually reduces the length of video streams.

1.2 Overview of the Approach

Both of the two primary tasks depend on certain initial tasks. The pixel-level motion segmentation (or detection) task requires that the video frames be registered by the geometric constraints estimated between sets of sparse feature points. The dense 3D shape of both moving objects and the static background is extended from the sparse shape of the scene, namely the set of 3D points and lines on the object surface.
The primary and initial tasks are integrated into a four-stage process organized in a pipeline, as shown in Figure 1.2, where each task depends on the tasks before it. We briefly describe the method for each task as follows. In an abstract view, the whole process goes from 2D image space to 3D object space, and also transitions from sparse to dense correspondences.

Task 1: Multiple image registration (Chapter 2).

The camera motion induces a number of geometric constraints that can relate the pixels in static background areas across multiple frames, including the 2D homography, the epipolar constraint, and the trilinear constraint [21]. The geometric constraints are robustly estimated from feature points extracted and matched among the original video frames. Then any pixel in one video frame is registered to its corresponding pixel positions in adjacent frames within a temporal window [82].

Task 2: Motion segmentation (Chapter 3).

Motion segmentation of scenes observed by a moving camera is inherently difficult, as the camera motion induces a displacement for all the image pixels. Furthermore, there may exist strong parallax due to static 3D structures. In this task, each image pixel is classified into planar background, parallax, or motion regions by a decision-tree based scheme which sequentially applies a number of existing geometric constraints as well as a novel one, called the "structure consistency constraint" [80]. Throughout this dissertation, this task is referred to interchangeably as motion segmentation or motion detection.

As the main contribution of this method, the structure consistency constraint is derived from the relative camera poses among three frames, as one of the family of so-called "Plane+Parallax" constraints [26, 49]. Unlike existing planar-parallax constraints, the structure consistency constraint does not require the reference plane to be constant across multiple views.
It directly measures the inconsistency between the parallax vectors from the same point despite the camera motion and the variation of the reference plane. The structure consistency constraint is capable of detecting moving objects followed by a moving camera in the same direction, a so-called degenerate configuration that existing constraints cannot handle.

Figure 1.2: The pipeline of our approach: the original image sequences pass through multi-image registration, 2D motion segmentation, sparse 3D reconstruction, and dense 3D reconstruction, producing geometric constraints, 2D motion label images, 3D structure and motion, and a 3D volumetric shape, respectively.

Task 3: Sparse 3D reconstruction (Chapter 4).

In order to reconstruct, in 3D Euclidean space, dynamic scenes observed from a moving camera, we apply an integrated process which segments the moving objects from the static background (the first two tasks) and produces a 3D reconstruction for both parts of the scene. Classic "Structure and Motion" (SaM) methods are applied to reconstruct the static background first [21, 47, 42]. Then, a novel method is proposed to reconstruct each moving object independently, namely inferring the 3D object shape and motion trajectory.

The 3D shape of a moving object is recovered based on the relative motion between the moving camera and the moving object. Therefore, the original configuration of a moving object and a moving camera is converted to that of a static object seen by a hypothetical moving camera, called a virtual camera. This virtual camera shares the same intrinsic parameters as the real one, but moves differently due to the object motion.

The object motion trajectory computed from the relative motion and the real camera motion, however, is influenced by the unknown scale of the object shape, also known as the "scale ambiguity" [46]. As a key contribution of our method, we prove that the 3D object motion can be solved uniquely, despite the unknown scale, under additional conditions on the object's motion.
We prove that a sufficient condition is for the object's trajectory to be planar. Then a 3D motion trajectory is established by stretching the object shape to the correct scale and moving the object by the estimated motion.

Task 4: Dense 3D reconstruction (Chapter 5).

After the sparse reconstruction task is done, we divide the 3D scene into a set of volume elements, termed voxels, organized in an adaptive octree structure [65]. Each voxel is assigned a label at each time instant: empty, belonging to the background structure, or belonging to a moving object. The task of dense shape reconstruction is then formulated as assigning each voxel a dynamic label which minimizes the photo and motion variance between the voxels and the original sequence [56].

We propose a three-step voxel labeling method based on a robust photo-motion variance measure. First, a sparse set of surface points is utilized to initialize a subset of voxels. Then, a deterministic voxel coloring scheme carves away the voxels with large variance. Finally, the labeling results are refined by a Graph Cuts based optimization method to enforce global smoothness between voxels adjacent in both the temporal and spatial domains [7].

1.3 Outline

The rest of the dissertation is organized as follows. In Chapters 2–5, we describe each of the four tasks of our approach in detail. As these chapters are extended from loosely connected papers published (or to be published) in academic conferences and journals, they are presented in a logical flow similar to their original versions: problem definition, related work, the method, experimental results, and summary and discussion. Finally, Chapter 6 concludes the dissertation by summarizing the contributions of our work and discussing directions for future research.

The papers that serve as the basis for Chapters 2–5 are listed as follows.
• Chapter 2 (multi-image registration) and Chapter 3 (motion segmentation):

– Chang Yuan, Gérard Medioni, Jinman Kang, and Isaac Cohen, "Detecting Motion Regions in Presence of Strong Parallax from a Moving Camera by Multi-view Geometric Constraints", to appear in IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI)

– Jinman Kang, Isaac Cohen, Gérard Medioni, and Chang Yuan, "Detection and Tracking of Moving Objects from a Moving Platform in Presence of Strong Parallax", in Proc. of Intl. Conf. on Computer Vision (ICCV), pp. 10-17, 2005

• Chapter 4 (sparse 3D reconstruction):

– Chang Yuan and Gérard Medioni, "3D Reconstruction of Scenes with Rigid Moving Objects Observed from a Moving Camera", submitted to IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI)

– Chang Yuan and Gérard Medioni, "3D Reconstruction of Background and Objects Moving on Ground Plane Viewed from a Moving Camera", in Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 2261-2268, 2006

• Chapter 5 (dense 3D reconstruction):

– Chang Yuan and Gérard Medioni, "Inferring 3D Volumetric Shape of Both Moving Objects and Static Background Observed by a Moving Camera", to appear in IEEE Intl. Conf. on Computer Vision and Pattern Recognition (CVPR), 2007

Chapter 2

Multiple Image Registration

2.1 Introduction

In this chapter, we describe a method for estimating the geometric relationships (or constraints) between the pixels across multiple video frames and aligning these frames together, which is called a multi-image registration process [82]. For pixels in one image, the multi-image geometric relationships can predict their corresponding positions in the other images. The image registration process serves as the basis for motion segmentation and video stabilization, and also provides cues for the 3D reconstruction steps. We first describe the process of extracting and matching feature points from multiple images.
Then, the camera projection models are briefly introduced, as well as a number of multi-view geometric constraints. A general robust scheme is applied for the estimation of these geometric constraints. We show how multiple video frames can be registered by homography/affine motion models. Finally, the image registration results on real-world sequences are shown.

2.2 Feature Extraction and Matching

Image feature (or interest) points refer to a subset of image pixels that appear more salient than the other pixels. These salient pixels lie on object surfaces in the scene and carry useful semantic information. Most state-of-the-art feature detectors follow a common scheme to extract feature points up to sub-pixel accuracy along multiple scales [16, 54, 35], briefly described as follows:

1. Build a multi-scale representation (e.g. a Gaussian image pyramid with 3-5 scale levels) for the original image;

2. Compute the saliency measure for each pixel in every level of the image pyramid;

3. Label the pixels that are local extrema in their neighborhood as initial feature points;

4. Remove the feature points whose saliency measures are not large enough compared to adjacent ones (non-maximal suppression).

One commonly used feature detector is the Harris corner detector [16]. The saliency measure for Harris corners, called the "cornerness" measure, is computed from the eigenvalues of a normal matrix formed as the outer product of horizontal and vertical image gradients. Pixels that have the maximum cornerness compared to their 8 adjacent pixels in the same scale are classified as corners. As its name indicates, the Harris corner detector responds accurately to man-made objects with sharp edges and apparent corners, e.g. buildings and vehicles [54].

Figure 2.1: Extracting and matching Harris corners from video frames: (a) original frame 21; (b) original frame 22; (c) corners in frame 21; (d) corners in frame 22; (e) matched corners between frames 21 and 22.
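The single-scale core of this scheme (steps 2–4) can be sketched in NumPy. This is a minimal illustration, not the detector used in our experiments: the 3×3 box window standing in for Gaussian weighting, the Harris parameter k = 0.04, and the cornerness threshold are all illustrative choices.

```python
import numpy as np

def harris_cornerness(img, k=0.04):
    """Cornerness from the gradient outer-product (normal) matrix.
    R = det(M) - k * trace(M)^2 avoids an explicit eigen-decomposition."""
    gy, gx = np.gradient(img.astype(float))
    # Accumulate gradient outer-products over a small window
    # (a 3x3 box filter stands in for the usual Gaussian weighting).
    def box(a):
        p = np.pad(a, 1, mode='edge')
        h, w = a.shape
        return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0
    sxx, syy, sxy = box(gx * gx), box(gy * gy), box(gx * gy)
    det = sxx * syy - sxy * sxy
    trace = sxx + syy
    return det - k * trace ** 2

def local_maxima(r, thresh):
    """Keep pixels that beat all 8 neighbours and the threshold (steps 3-4)."""
    p = np.pad(r, 1, mode='constant', constant_values=-np.inf)
    h, w = r.shape
    neigh = np.stack([p[i:i + h, j:j + w]
                      for i in range(3) for j in range(3) if (i, j) != (1, 1)])
    return (r > neigh.max(axis=0)) & (r > thresh)

# A synthetic image with one bright square: the detector should fire
# at the square's vertices, not along its edges.
img = np.zeros((32, 32))
img[8:24, 8:24] = 255.0
corners = local_maxima(harris_cornerness(img), thresh=1e6)
print(corners.sum())  # detections only at the square's corners
```

Edge pixels receive a negative cornerness (one large and one near-zero eigenvalue), so only the corner responses survive the threshold and non-maximal suppression.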
Another commonly used kind of feature point is the Scale-Invariant Feature Transform (SIFT) point [35]. The saliency measure is computed for each pixel as the response to Difference-of-Gaussian (DoG) filters with various kernel sizes. The pixels with maximal DoG response compared to their neighborhood of 26 pixels across three pyramid levels are classified as feature points. The SIFT detector is sensitive to complex-shaped objects with rich texture, e.g. natural landscape scenes [54].

Feature detectors are selected based on the characteristics of the image sequences to be processed. In all image sequences used in our work, man-made objects occupy a large portion of the scene, and therefore the Harris corner detector is selected. Figure 2.1(c) and (d) show the Harris corners extracted from the two video frames shown in (a) and (b). Note that feature points with different scales (shown by circles) may be extracted from the same location.

The 2D feature points that correspond to the same part of a 3D object surface need to be automatically matched across multiple images. A feature descriptor is extracted from the local image region around each feature point to facilitate this matching process. Only points with similar feature descriptors are treated as correct correspondences. As illumination and viewpoint changes may occur in the sequence, the feature descriptors must be robust enough to accommodate geometric and photometric variations.

Orientation angles are computed for each feature point to represent the 3D orientation of the object surface observed from the camera [35]. The gradient magnitude values in the local region around a point are accumulated into a 1D histogram whose bins are divided by different angles. The bins that contain the largest gradient magnitudes are treated as possible candidates for the orientation angles. As one 2D angle may not depict the 3D orientation accurately, more than one bin may be selected.
There may thus be multiple descriptors extracted from the same feature point, each from an image region rotated by the corresponding orientation angle.

One commonly used feature descriptor is a square image window extracted from the local region of the feature point. The intensity values in each channel are subtracted by their mean value to remove the effect of affine illumination change. The similarity between image windows is measured by pixel-level cross-correlation [13]. The image window-based descriptor is sensitive to 3D viewpoint change and photometric variation, and is therefore better suited for consecutive video frames with smaller changes.

The other feature descriptor used in our work is a normalized 2D gradient histogram accumulated from a local square window in the gradient images along the orientation angles [35]. The similarity between gradient histograms is measured by computing the angle between the two histogram vectors. The gradient histogram is scale-invariant in theory and resistant to 2D affine transforms in practice as well. It is better suited for images with larger viewpoint and illumination changes.

The feature matching process starts with two initial images [47]. Each point in one image is compared to all the points in the other image, and the one with the largest similarity is kept as a candidate. Two points are treated as matched only if they are mutually the best candidate for each other. All other one-way matches are disregarded to ensure high matching quality. The process continues by adding more images sequentially and repeating the two-image matching process between the new image and each existing image. In practice, the feature points for each image are organized in a 2D k-d tree for faster query speed [5]. The search region for each feature point is restricted to half of the image size or even smaller. Figure 2.1(e) shows the matching results between the two video frames, where the point pairs (in red) are connected by segments (in yellow).
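The mutual-best-candidate rule, combined with the zero-mean cross-correlation similarity described above, can be sketched as follows. The 5×5 "patches" are made-up toy data (two descriptors per image, swapped and perturbed with noise); a real system would compare descriptors extracted at detected feature points and use a k-d tree instead of the exhaustive similarity matrix.

```python
import numpy as np

def ncc(a, b):
    """Zero-mean normalized cross-correlation between two patches."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a.ravel() @ b.ravel() / denom) if denom > 0 else 0.0

def mutual_matches(desc1, desc2):
    """Keep (i, j) only if i's best candidate is j AND j's best candidate is i;
    all other one-way matches are discarded."""
    sim = np.array([[ncc(d1, d2) for d2 in desc2] for d1 in desc1])
    best12 = sim.argmax(axis=1)   # best candidate in image 2 for each point of image 1
    best21 = sim.argmax(axis=0)   # best candidate in image 1 for each point of image 2
    return [(i, int(j)) for i, j in enumerate(best12) if best21[j] == i]

# Toy patches: descriptors 0/1 of image 1 reappear as noisy 1/0 in image 2.
rng = np.random.default_rng(0)
d1 = [rng.normal(size=(5, 5)) for _ in range(2)]
d2 = [d1[1] + 0.05 * rng.normal(size=(5, 5)),   # noisy copy of d1[1]
      d1[0] + 0.05 * rng.normal(size=(5, 5))]   # noisy copy of d1[0]
print(mutual_matches(d1, d2))  # [(0, 1), (1, 0)]
```

The two-sided check is what filters ambiguous correspondences: a point whose best candidate prefers some other point yields no match at all rather than a wrong one.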
2.3 Camera Projection Models

Let us introduce the notation for describing a moving pinhole camera which shoots an image sequence. Let $K$ denote the number of video frames and $P$ the number of points in 3D Euclidean space. Let $p_{kj} = (u, v, 1)^T$ denote the 2D homogeneous coordinates of the point projected from the 3D point $P_j = (x, y, z, 1)^T$ onto the image plane of camera $k$, where $k = 1, \ldots, K$ and $j = 1, \ldots, P$. We do not differentiate the point coordinates from their inhomogeneous counterparts if no confusion is caused.

The camera projection process can be decomposed into two steps. The first step is to obtain the camera coordinates of a world point $P_j$ relative to camera $k$. Let the camera pose be denoted by a rotation matrix $R_k$ and the camera center $C_k$ relative to the world coordinate system. The point coordinates $P_{kj}$ relative to camera $k$ are then obtained as

\[ P_{kj} = R_k (P_j - C_k) = R_k P_j + T_k \quad (2.1) \]

where $T_k = -R_k C_k$ is the translation vector for camera $k$. For simplicity, we assume that the world coordinate system is the same as that of camera 1, namely $R_1 = I$ and $C_1 = 0$.

The second step is to project the camera coordinates of the point onto the image plane. The perspective projection model is as follows:

\[ \begin{pmatrix} u \\ v \end{pmatrix} = \frac{f_k}{z} \begin{pmatrix} x \\ y \end{pmatrix} \quad (2.2) \]

where $f_k$ is the focal length of camera $k$. The internal parameter matrix for camera $k$ is simplified as $A_k = \mathrm{diag}(f_k, f_k, 1)$, with the assumptions that pixels are square and that the image center corresponds to the principal point.

Combining both steps, we have the camera matrix for the perspective projection model:

\[ M_k = A_k \begin{bmatrix} R_k & T_k \end{bmatrix} \quad (2.3) \]

Then the perspective projection can be represented as

\[ p_{kj} \sim M_k P_j \quad (2.4) \]

There exists a family of affine cameras with projection matrices of the form

\[ M_{\mathrm{aff}} = \begin{pmatrix} m_{11} & m_{12} & m_{13} & m_{14} \\ m_{21} & m_{22} & m_{23} & m_{24} \\ 0 & 0 & 0 & m_{34} \end{pmatrix} \quad (2.5) \]

The important property of affine cameras is that the 2D projections do not depend on the depth of individual points.
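The two-step model of Eqs. (2.1)–(2.5) can be checked numerically. In this sketch the focal length f = 500 and the affine matrix entries are arbitrary values chosen for illustration.

```python
import numpy as np

def camera_matrix(f, R, C):
    """Perspective camera M = A [R | T] with T = -R C (Eqs. 2.1-2.3);
    A = diag(f, f, 1) assumes square pixels and a centered principal point."""
    A = np.diag([f, f, 1.0])
    T = (-R @ C).reshape(3, 1)
    return A @ np.hstack([R, T])

def project(M, P):
    """Project an inhomogeneous 3D point and divide by the third coordinate (Eq. 2.4)."""
    p = M @ np.append(P, 1.0)
    return p[:2] / p[2]

# Camera 1 defines the world frame: R1 = I, C1 = 0.
f = 500.0
M1 = camera_matrix(f, np.eye(3), np.zeros(3))
print(project(M1, np.array([0.0, 0.0, 4.0])))  # [0. 0.]   (point on the optical axis)
print(project(M1, np.array([1.0, 0.0, 4.0])))  # [125. 0.] (u = f * x / z)

# Affine camera (Eq. 2.5): the third row (0, 0, 0, m34) makes the projective
# division a constant, so depth drops out of the 2D coordinates and two
# parallel 3D lines project to parallel 2D directions.
M_aff = np.array([[2.0, 0.5, 1.0, 3.0],
                  [0.3, 1.5, 0.7, 1.0],
                  [0.0, 0.0, 0.0, 2.0]])
d = np.array([1.0, 2.0, -1.0])  # shared direction of two parallel 3D lines
v1 = project(M_aff, np.array([0.0, 0.0, 5.0]) + d) - project(M_aff, np.array([0.0, 0.0, 5.0]))
v2 = project(M_aff, np.array([3.0, -2.0, 9.0]) + d) - project(M_aff, np.array([3.0, -2.0, 9.0]))
print(v1[0] * v2[1] - v1[1] * v2[0])  # 0.0: the projected directions are parallel
```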
Therefore, parallelism is preserved in the projection: the 2D projections of 3D parallel lines are still parallel.

Both the perspective camera and the affine camera are special cases of a 3D projective camera,

    M_proj = [ m_11  m_12  m_13  m_14 ;  m_21  m_22  m_23  m_24 ;  m_31  m_32  m_33  m_34 ]    (2.6)

Different projection models can be converted to one another by a 4x4 non-singular matrix T such that

    M~ = M T^{-1},    P~ = T P    (2.7)

The transformation from a projective camera to an affine one is called an affine upgrade, and the one from an affine to a perspective camera is called a Euclidean upgrade.

2.4 Geometric Constraints

Due to the camera motion, there exist a number of geometric constraints which relate the 2D pixels projected from the same 3D point in multiple views, as summarized in [21]. They include the 2-view collineation (homography), the 2-view epipolar constraint (fundamental/essential matrix), and the 3-view trilinear constraint (trifocal tensor). The geometric constraints over 4 or more views are omitted here, as they are of more theoretical interest. In Table 2.1, we summarize the minimal number of feature correspondences needed for computing the multi-view geometric constraints, along with the linearity of each method and the number of solutions.
Table 2.1: Minimal number of features required for geometric constraint estimation

  Homography:          4 point pairs, 1 linear solution
  Affine:              3 point pairs, 1 linear solution
  Essential matrix:    5 point pairs, 10 non-linear solutions [43];
                       8 point pairs, 1 linear solution [21]
  Fundamental matrix:  2 point pairs (given the homography), 1 linear solution;
                       7 point pairs, 1 or 3 solutions [36];
                       8 point pairs, 1 linear solution [17];
                       6 point pairs, 15 non-linear solutions [63]
  Trifocal tensor:     13 line or 7 point triplets, 1 linear solution [4, 18];
                       6 point triplets, 1 or 3 solutions [70]

2.4.1 Homography

Suppose there exists a plane π in 3D space, such that the points on the plane satisfy N^T P_1 = d, where N is the normal vector and d a scalar value, both w.r.t. camera 1. Substituting the plane equation into (2.1), we have the collineation relationship between two 3D points,

    P_2 = (R_2 + T_2 N^T / d) P_1    (2.8)

After projecting the 3D points to the image planes, a 2D homography is obtained as follows,

    p_2 ~ H_21 p_1    (2.9)

where

    H_21 = K_2 (R_2 + T_2 N^T / d) K_1^{-1}    (2.10)

H_21 is a 3x3 non-singular matrix. The homography H_21 defines a one-to-one pixel mapping between two views. The projections of static points on the reference plane are constrained by the homography. A special case of the 2D homography is the 2D affine motion model, which has a simpler form:

    A_21 = [ s_1  s_2  t_x ;  s_3  s_4  t_y ;  0  0  1 ]    (2.11)

The homography can be estimated from the linear equations generated by taking the cross product of p_2 with the transferred point H_21 p_1:

    (H_21 p_1) x p_2 = 0    (2.12)

Since each pixel pair (p_1j, p_2j) provides two independent linear equations, the eight unknowns of the homography are solved from four pairs of pixels which do not lie on the same 2D line (i.e., are not collinear). For the affine model, only three pairs are needed for the six unknowns.
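The one-to-one pixel mapping of Eq. (2.9) can be sketched as a plain transfer-and-dehomogenize step; H below is any 3x3 non-singular matrix:

```python
# Sketch of the homography pixel mapping p2 ~ H21 p1 from Eq. (2.9),
# followed by dehomogenization of the transferred point.

def apply_homography(H, p):
    """Map an inhomogeneous 2D point p = (u, v) through homography H."""
    u, v = p
    x = [sum(H[i][j] * c for j, c in enumerate((u, v, 1.0))) for i in range(3)]
    return (x[0] / x[2], x[1] / x[2])
```

The division by the third homogeneous coordinate is what distinguishes a general homography from the affine model of Eq. (2.11), whose third row fixes that coordinate to 1.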
However, we must be aware of the degenerate configuration of homography/affine estimation: if three or more of the selected points are collinear, the estimated homography is unreliable and loses its full-rank property. A residual error with respect to the homography is then defined to evaluate how consistent a pixel pair is with a homography,

    e_homo = || (H_21 p_1) / (h_3 p_1) - p_2 ||    (2.13)

where h_3 is the third row of H_21. The residual error is indeed a 2D Euclidean distance between the original pixel position and the predicted position.

2.4.2 Essential and Fundamental Matrix

The plane formed by the two camera centers C_1 and C_2 and the object point P is called the epipolar plane Π_e. The coplanarity of C_1, C_2 and P is expressed as follows,

    C_1P · (C_1C_2 x C_2P) = 0    (2.14)

Expressing the coordinate-independent terms in the individual coordinate systems, we have

    p̄_2^T E_21 p̄_1 = 0    (2.15)

where

    E_21 = [T_2]_x R_2    (2.16)

and p̄_i = K_i^{-1} p_i are the normalized image coordinates. [v]_x denotes the skew-symmetric matrix generated by a vector v = (v_1, v_2, v_3)^T:

    [v]_x = [ 0  -v_3  v_2 ;  v_3  0  -v_1 ;  -v_2  v_1  0 ]

If the original image coordinates of the 2D pixels are used instead of the normalized coordinates, we obtain the fundamental matrix, which is also a bilinear relationship:

    p_2^T F_21 p_1 = 0    (2.17)

where

    F_21 = K_2^{-T} E_21 K_1^{-1}    (2.18)

Moreover, since the rank of both matrices is 2, we have |E_21| = 0 and |F_21| = 0.

Both the essential and fundamental matrices lead to a point-to-line transfer relationship. For any pixel p_1 in view 1, an epipolar line l_2 = F_21 p_1 is induced, on which the corresponding pixel p_2 should lie. Similarly, p_2 also induces an epipolar line l_1 on which p_1 should lie. For a moving point, its projections are inconsistent with the epipolar constraint. The residual error is defined as the average pixel-to-line distance,

    e_epi = (|p_1 · l_1| + |p_2 · l_2|) / 2    (2.19)

There exist various methods for estimating the essential/fundamental matrix.
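The two-view machinery of Eqs. (2.16)-(2.19) can be sketched as follows, assuming calibrated cameras (K = I) so that E and F coincide; the skew matrix, the essential matrix for a pure translation, and the symmetric residual are all small pieces of arithmetic:

```python
import math

# Sketch of Eqs. (2.16)-(2.19) with identity intrinsics (an assumption):
# [v]_x, E = [T2]_x R2 for R2 = I, and the symmetric residual e_epi.

def skew(v):
    """Skew-symmetric matrix [v]_x such that [v]_x w = v x w."""
    return [[0.0, -v[2], v[1]],
            [v[2], 0.0, -v[0]],
            [-v[1], v[0], 0.0]]

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]

def epipolar_residual(F, p1, p2):
    """Average distance of p1, p2 to the epipolar lines they induce,
    with each line normalized so that l_u^2 + l_v^2 = 1."""
    def dist(p, l):
        return abs(sum(a * b for a, b in zip(p, l))) / math.hypot(l[0], l[1])
    l2 = matvec(F, p1)                               # line in view 2
    l1 = matvec([list(r) for r in zip(*F)], p2)      # F^T p2: line in view 1
    return 0.5 * (dist(p1, l1) + dist(p2, l2))

# Example: camera 2 translated along x, so T2 = (-1, 0, 0) and E = [T2]_x.
E = skew((-1.0, 0.0, 0.0))
```

For a static point such as p1 = (0, 0, 1), p2 = (-0.2, 0, 1) under this geometry, the residual is zero; perturbing p2 off its epipolar line makes it positive.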
The essential matrix can be solved efficiently from at least 5 pixel pairs [43]. The fundamental matrix requires more pixel pairs: 7 pairs (1 or 3 solutions) or 8 pairs (1 linear solution) [21]. Recently, a non-linear method was proposed which obtains up to 15 solutions as well as the focal length [63]. If a planar homography between two frames is known in advance, the fundamental matrix can be solved efficiently with two more points which are inconsistent with the homography, a typical "Plane+Parallax" scenario [21, 26]. The two points generate a pair of parallax vectors which intersect at the epipole e_2. The fundamental matrix is then solved as F_21 = [e_2]_x H_21.

The degenerate configuration for estimating the epipolar constraint occurs when all the points lie in a 3D plane. In this case, the recovered matrix is a rank-3 matrix equivalent to the planar homography, which is a one-to-one mapping.

2.4.3 Trifocal Tensor

In the case of three views, we can obtain a trilinear constraint which relates both points and lines. The constraint is encoded in a 3x3x3 array T, called the "trifocal tensor". Given the points (p_1, p_2), the trifocal tensor defines a unique point transfer to view 3 as the point p_3:

    p_3l = p_2i Σ_{k=1}^{3} p_1k T_kjl - p_2j Σ_{k=1}^{3} p_1k T_kil    (2.20)

where i, j, l = 1, ..., 3.

The trifocal tensor has 18 degrees of freedom. Therefore, estimating it requires at least 6 point triplets or 13 line triplets [18]. If there are 6 point triplets, there may be one or three solutions. If there are 7 point triplets, there is a unique linear solution [4]. A non-linear parametrization of the trifocal tensor based on 6 point triplets is also possible [70].

2.5 Robust Estimation of Geometric Constraints

In this section, we discuss how the geometric constraints are robustly estimated from the matched feature points. There are usually several hundred pairs of feature points matched between video frames.
The geometric constraints could be estimated by applying a linear least-squares method to all these point matches. However, the least-squares result may be erroneous, as the feature points may belong to different moving objects. Furthermore, the feature locations may not be accurate due to image noise, illumination changes, or shadows. Therefore, we propose to first normalize the positions of the feature points and then utilize Random Sample Consensus (RANSAC) [11] to alleviate these errors and obtain an accurate estimation of the geometric constraints.

Before computing the geometric constraints, the feature points are pre-conditioned by a normalization process [21]: 1) shifting: move all the data points such that their centroid is at the origin; 2) scaling: scale the shifted points such that the coordinates of each point fall into [-1, 1]. This pre-conditioning avoids the numerical errors due to large scale differences.

1. Randomly select a sample containing a minimal number of feature matches;
2. If the sample is degenerate and cannot lead to a valid solution, go to 1;
3. Compute a tentative geometric model from the selected matches;
4. Classify each feature match as either an inlier, whose residual error w.r.t. the tentative model is smaller than a threshold, or an outlier, with a larger error;
5. If the number of inliers is the maximal one so far, store the sample and re-estimate the expected number of samples to take;
6. If the number of selected samples is smaller than the expected number of samples, go to 1;
7. Recompute the geometric model based on all the inliers.

Figure 2.2: The RANSAC scheme for robust estimation of geometric constraints

Then, the normalized feature matches are fed into a RANSAC-based robust estimation framework. RANSAC is an iterative model-fitting scheme, summarized in Figure 2.2. Note that, in the process of selecting the minimal number of features, we perform a degeneracy test over the sample (step 2).
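The pre-conditioning step above can be sketched directly from its two-part description (shift the centroid to the origin, then scale every coordinate into [-1, 1]); this is a variant of the normalization in [21]:

```python
# Sketch of the shift-and-scale pre-conditioning described above.
# Points are inhomogeneous (u, v) pairs.

def normalize_points(points):
    """Shift centroid to origin, then scale all coordinates into [-1, 1]."""
    n = len(points)
    cx = sum(u for u, v in points) / n
    cy = sum(v for u, v in points) / n
    shifted = [(u - cx, v - cy) for u, v in points]
    s = max(max(abs(u), abs(v)) for u, v in shifted) or 1.0
    return [(u / s, v / s) for u, v in shifted]
```

After this step, coordinates entering the linear systems of Section 2.4 are of comparable magnitude, avoiding ill-conditioning from large scale differences.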
If a sample contains a degenerate configuration, it is discarded immediately and a new sample is taken. The expected number of samples is computed based on the assumption that, after many selection trials, at least one sample will contain only inlier data points [11]. Let z denote the probability of the event that such an error-free sample is selected and w the predicted fraction of inliers over the whole set. Let N be the expected number of samples and n the minimal number of matches needed for the model estimation. Under the uniform sampling assumption, it is easy to derive the following equality:

    (1 - w^n)^N = 1 - z    (2.21)

The expected number of samples N is then estimated as

    N = log(1 - z) / log(1 - w^n)    (2.22)

Intuitively, the more samples are taken, or the larger the fraction of inliers in the data set, the more likely it is that a correct model will be selected.

In our scenario, the underlying assumption of the RANSAC scheme is that the feature points belonging to the static background outnumber any other set of points belonging to an individual moving object. In other words, the static background is the dominant object in the scene, a condition which is usually satisfied in the image sequences utilized in our work.

The number of inlier matches, used as the indicative measure in the RANSAC scheme, cannot impose enough constraints on the inliers if the selected threshold is too large. Therefore, the measure may be replaced by a more restrictive one: the summed residual error of the inliers. Only the sample that leads to the minimal residual error is then selected as the final model.

One disadvantage of the RANSAC method is that a threshold value needs to be selected for separating inliers from outliers in the whole set of feature points. In some cases, this threshold is hard to pick.
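The adaptive sample count of Eq. (2.22) is a one-liner, with z, w and n as defined above:

```python
import math

# Sketch of Eq. (2.22): the expected number of RANSAC samples for a
# desired confidence z, inlier fraction w, and minimal sample size n.

def expected_samples(z, w, n):
    return math.ceil(math.log(1.0 - z) / math.log(1.0 - w ** n))
```

For example, a 99% confidence with half the matches being inliers and a 4-point homography sample requires 72 iterations; the count grows quickly as the inlier fraction drops or the sample size increases.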
There are other robust estimation schemes, such as the Least Median of Squares (LMedS) method [81], which takes the median residual error as the indicative measure and thereby avoids the threshold value.

2.6 Registration Process within Temporal Windows

The goal of the image registration process is to align together images shot at different times over the same scene [82]. For a comprehensive review of image registration techniques, interested readers are referred to [82]. A large amount of related work computes a dense optical flow field between images to find the relationship between image pixels [2]. However, it is difficult to obtain an accurate estimation of optical flow if the scene contains a uniform background and multiple moving objects, as in our case. Instead, we follow the methods in [28] to register the images based on parametric inter-frame motion models.

In our scenario, the scene is usually distant and can be well approximated by a plane, observed by a smoothly moving camera. Based on the planar scene and smooth inter-frame motion assumptions, the most appropriate motion model is the 2D homography transform, which can predict most image pixels accurately. Between any pair of consecutive frames t and t+1, a homography model H_t^{t+1} is computed by the RANSAC scheme (Section 2.5) based on the matched feature points (Section 2.2). For any pixel p_t in frame t, its corresponding pixel p_{t+1} in frame t+1 is predicted by the homography: p_{t+1} = H_t^{t+1} p_t.

Similar to the two-frame registration process, we propose to use homography models to register frame t to the frames within a temporal window W(t) = {s : |s - t| <= Δt}, where Δt is the half size of the temporal window. Here, we assume that all the frames within this temporal window satisfy the same assumptions made between two consecutive frames.
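Registering frame t to a later frame s in W(t) amounts to chaining the consecutive-frame homographies; a minimal sketch for the forward case (t <= s), with pure-Python 3x3 matrices and the consecutive models stored in a list indexed by frame (an assumption of this sketch):

```python
# Sketch of window registration by concatenating consecutive homographies.
# consecutive[k] maps frame k to frame k+1; the result maps frame t to s.

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def chain_homographies(consecutive, t, s):
    """Concatenate H_{k -> k+1} for k = t, ..., s-1 (requires t <= s)."""
    H = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]   # identity
    for k in range(t, s):
        H = matmul(consecutive[k], H)     # left-multiply the next step
    return H
```

The backward case (t > s) follows by inverting the forward chain.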
The homography H_t^s from frame t to frame s ∈ W(t) is obtained by concatenating the homographies between consecutive frames:

    H_t^s = H_{s-1}^s H_{s-2}^{s-1} ··· H_{t+1}^{t+2} H_t^{t+1}              if t <= s,
    H_t^s = (H_{t-1}^t H_{t-2}^{t-1} ··· H_{s+1}^{s+2} H_s^{s+1})^{-1}       if t > s.    (2.23)

Similarly, any pixel p_t in frame t is registered to a pixel in frame s by H_t^s: p_s = H_t^s p_t.

In order to overcome the accumulated errors of the concatenated homographies, these homographies are further refined by a Bundle Adjustment method similar to [8], which minimizes the prediction errors with the Levenberg-Marquardt algorithm over the whole set of inlier feature points within the temporal window. In practice, the 2D affine motion model, as a special case of the homography, is utilized for multi-image registration, as it can be estimated more reliably while sacrificing little accuracy.

2.7 Experimental Results

Here we present an example of image registration results on the video sequence shown in Figure 2.1. Six out of 201 video frames are shown in Figure 2.3(a) and (b). A mosaic image is then generated by stitching these 201 video frames based on the affine motion models, as shown in Figure 2.3(c). The visually satisfactory mosaic image demonstrates that the planar scene and smooth camera motion assumptions work very well for this sequence. One can also observe that the moving vehicles are eliminated in the mosaic image, which is the basis of the motion segmentation process in Chapter 3.

Figure 2.3: Image registration results of the "uvs" sequence. (a) original frames 0, 40, and 80; (b) original frames 120, 160, and 200; (c) mosaic image from frame 0 to frame 200.

2.8 Summary and Discussion

In this chapter, we described the method for registering multiple video frames based on the inter-frame motion models estimated from the matched feature points. The method depends on two assumptions: (1) the scene is distant, such that it can be well approximated by a plane; (2) the camera is moving smoothly, such that the inter-frame pixel motion is small.
With the aid of the RANSAC scheme, the inter-frame motion model can be reliably fitted to the feature points in the static background, despite the presence of multiple moving objects.

There are several directions for future research. We are interested in reducing the influence of one poorly registered frame over the whole set of frames within the temporal window. This may be accomplished by evaluating the residual errors with respect to the motion models between each pair of frames, such that the selected frames yield the least sum of residual errors. This will help improve the overall quality of image registration over a large number of frames.

Chapter 3 Motion Segmentation

3.1 Introduction

We investigate a fundamental issue in computer vision: detecting and tracking motion regions in video sequences captured by a moving camera. The scene may contain multiple objects moving rigidly in a static background. The background may also contain strong parallax produced by large 3D structures. The camera views the scene through perspective projection while undergoing general 3D motion. Our goal is to utilize various geometric constraints to segment the video frames into a static background and a number of motion regions. A large number of applications could benefit from the proposed methodology, including video surveillance, object tracking, robot navigation and video compression.

Motion segmentation in dynamic video scenes is inherently difficult, as the moving camera induces 2D motion for each pixel. Consequently, the apparent pixel motion for points on moving objects is generated by both the independent object motion and the camera motion. In contrast, the apparent motion of points in the static background is strictly due to the camera motion.
The camera motion gives rise to a number of multi-view geometric constraints, estimated as described in Chapter 2, which are applied to the motion detection task: pixels consistent with the constraints are classified as background, while the inconsistent ones are classified as independently moving regions.

The first geometric constraint used in our approach is the 2D planar homography [21]. The homography is used as a global motion model to compensate for the camera motion between consecutive video frames [28]. Pixels consistent with the homography are classified as belonging to the static planar part of the scene. The inconsistent ones, called residual pixels, may correspond to moving objects (motion regions) or to static 3D structure with large depth variance (parallax pixels). Additional geometric constraints are then needed to separate the parallax pixels from the motion regions.

The epipolar constraint is a commonly used constraint for motion detection between two views [36]: if a pixel in view 1 does not lie on the epipolar line induced by its matched pixel in view 2, the corresponding 3D point is determined to be moving. However, the epipolar constraint is not sufficient to detect all kinds of 3D motion. Indeed, there exists a special kind of 3D motion, called degenerate motion: the 3D point moves along the epipolar plane formed by the two camera centers and the point itself, while its 2D projections move along the epipolar lines. In this case, such a moving point cannot be detected by the epipolar constraint. The degenerate case often happens when the moving camera follows objects moving in the same direction.

In order to overcome the limitations of the epipolar constraint, geometric constraints over more than two views need to be imposed on the matched pixels. The trilinear constraint (trifocal tensor) can be applied to detect moving points across three views [21]. However, estimating the trifocal tensor is a non-trivial task, requiring accurate point correspondences and large camera motion.
The trilinear constraint is usually applied to the restricted case of segmenting sparse feature points across three distinct views [68, 20].

We propose a novel three-view constraint as an alternative to the trilinear constraint, called the "structure consistency constraint". The proposed constraint is capable of detecting the degenerate motion which the epipolar constraint cannot detect. Implemented within the "Plane+Parallax" framework [30, 57], the proposed approach is more reliable than the trifocal tensor in the presence of image noise, false matches and small camera motion.

Given the homography and the epipoles between two views, the scene is decomposed into a reference plane and off-plane structures (corresponding to 2D parallax) in the 3D projective space [57]. The structure consistency constraint is a bilinear relationship between a pair of 3D projective structures corresponding to the same static point. It is indeed a parallax rigidity constraint across three views.

The structure consistency constraint differs from previous planar-parallax constraints [26, 50, 34] in several aspects. First, previous approaches assume that the reference plane remains constant across two or more views. This assumption may not be valid, as the inter-frame homographies are automatically estimated and the reference planes may correspond to different parts of the scene. This happens often when there exists more than one planar patch in the scene. The proposed constraint does not depend on this assumption. Secondly, the structure consistency constraint does not need a static reference point [26]. It works simply on the 2D images of a single 3D point in three views, instead of point pairs. Furthermore, the structure consistency constraint does not rely on the assumption of small motion between consecutive frames [50].

The structure consistency constraint is represented by a 4x4 matrix that encapsulates the camera motion, the variation of the reference planes and the camera internal parameters.
Derived in the general uncalibrated case, it can be further simplified to a combination of the normal vectors of the two reference planes if the camera parameters are known. The structure consistency constraint is estimated in a way similar to that of estimating the fundamental matrix [81]. The Least Median of Squares (LMedS) scheme is applied to robustly find the solution with the least median residual error.

We argue that the structure consistency constraint is geometrically equivalent to the trilinear constraint, as both of them utilize the relative camera poses from three views. Furthermore, the proposed bilinear constraint can be extended to four views, as long as these four views share the same part of the scene. Nonetheless, there exist some degenerate cases in three or more views where neither the trilinear constraint nor the structure consistency constraint can detect the moving objects, as explained in Section 3.4.

With the three geometric constraints mentioned above, the motion detection scheme is implemented as a decision tree, as shown in Figure 3.1. After utilizing the homography for initial detection, the original image is initially segmented into the planar pixels, which are consistent with the homography, and the residual pixels. Then a "parallax filtering" step combines the epipolar constraint and the structure consistency constraint to separate parallax pixels from the independently moving regions. A practical motion detection and tracking system is built upon this scheme and is described in Section 3.6.

This chapter is extended from earlier versions which appeared in [29] and [80]. The rest of this chapter is organized as follows. In Section 3.2, we briefly review the existing
approaches related to our work. Section 3.3 formally defines the motion detection problem and identifies the degenerate configuration which the epipolar constraint is unable to handle. We introduce the structure consistency constraint in Section 3.4 and discuss how to estimate the constraint robustly in Section 3.5. Our practical system is explained in Section 3.6 with implementation details. The experimental results are shown and discussed in Section 3.7. Section 3.8 summarizes this chapter with conclusions and discussions.

Figure 3.1: A decision-tree based scheme for motion detection by multi-view geometric constraints. The tests "homography consistent?", "epipolar consistent?" and "structure consistent?" successively label pixels as planar pixels (initial detection), and then as motion regions or parallax pixels (parallax filtering).

3.2 Related Work

The existing multi-view geometric constraints used in motion detection methods can be roughly divided into two categories. The first category is derived merely from the relative camera poses in multiple views, called "relative-pose" constraints, without any additional assumption about the scene. The epipolar constraint is such a constraint between two views [36]. The trilinear constraint, as a relative-pose constraint over three views [21], can infer the exact pixel position in the third view given a pair of corresponding pixels from the first two views. Any deviation from the estimated position is used to decide if the point is moving or not. [68] presents a motion segmentation method which selects the most suitable motion model over feature correspondences and clusters them into different motion groups in a joint statistical framework. However, that motion segmentation method was only applied to sparsely matched features.

A family of multi-body constraints [74, 20] clusters the sparse feature correspondences into multiple moving objects in two or three views, without knowing the number of moving objects.
The constraints, including a multi-body epipolar constraint in two views [74] and a multi-body trilinear constraint in three views [20], segment the whole point set into a number of motion groups, where the group count is estimated automatically. However, these methods require at least six accurate point or line correspondences from each motion region, which is not easily satisfied in practical situations.

The second category, called "planar-parallax" constraints, depends on a 3D reference plane in the scene. If the scene can be approximated by a dominant plane, there exists a homography between two views [26] or a homography tensor among three views [58]. These can be used for motion detection: any pixel inconsistent with the homography is classified as belonging to a motion region. However, if the static scene also contains 3D structure with significant depth variance, the parallax effect must be considered.

In [26], the authors proposed a pair-wise rigidity constraint in two or three views which relates the projective depth and the parallax displacements. The method requires finding a reference point which is assumed to be static. However, if this reference point happens to be moving, all the static points will be erroneously detected as moving regions. Therefore, the constraint is not suited for a fully automatic detection system.

Table 3.1: Multi-view geometric constraints for motion detection

                     Two Views                           Three or More Views
  Relative Pose:     Epipolar;                           Trilinear;
                     Multi-body Epipolar [74]            Multi-body Trilinear [20]
  Planar Parallax:   Affine; Homography;                 Homography Tensor [58];
                     Pair-wise Parallax Rigidity [26]    Shape Constancy [50];
                                                         Parallax Flow Field [34];
                                                         Structure Consistency (this paper)

In [50], a shape constancy constraint is proposed to relate matched pixels across three or more views. The method makes further assumptions, such as the inter-frame motion being small enough.
[34] computes a residual planar-parallax normal flow field and then segments the scene by combining two normal flow fields in a linear model related to the camera ego-motion. All these methods assume that the reference plane in the scene is constant, such that the distances of 3D points to the plane remain the same. However, this assumption is not always valid, as the reference planes are automatically selected in different frames.

The constraints discussed above are summarized in Table 3.1 (there are more constraints not covered here, of course). Generally, the relative-pose constraints work well on sparsely matched feature points. However, they require accurate feature correspondences and large camera baselines between consecutive frames [21]. The planar-parallax constraints are less sensitive to image noise and more suitable for dense correspondences under small camera motion. However, they become ineffective if the scene cannot be approximated by a plane.

3.3 Multi-view Motion Detection

In this section, we formally define the problem of motion detection in multiple views and then describe the degenerate cases in which the epipolar constraint fails.

3.3.1 Formal definition of motion detection

Hereafter, by "point" we mean a 3D point, while "pixel" refers to the point's 2D image or projection. Suppose that a static point P = (x, y, z)^T in the world coordinate system is viewed by a moving camera at different time instants i = 1, 2, 3, ... Let P_i = (x_i, y_i, z_i)^T denote its 3D camera coordinates at time i. Let the camera pose relative to the world coordinate system be denoted by a 3x3 rotation matrix R_i and a 3x1 translation vector T_i. Then the transformation from world to camera coordinates is:

    P_i = R_i P + T_i    (3.1)

For simplicity, we assume that the coordinate system of camera 1 is also the world coordinate system, that is, R_1 = I and T_1 = 0.
Let p_i = (u_i, v_i, 1)^T denote the 2D image of P in view i under a perspective projection model,

    p_i = K_i P_i / z_i    (3.2)

or alternatively,

    z_i p_i = K_i P_i    (3.3)

where K_i holds the camera intrinsic parameters for view i.

If the point P is moving, let P' = P + ΔP denote its new position in the world coordinate system at time 2 and p'_2 its projection in view 2. From (3.1) and (3.3), we can relate p_2 and p'_2 as follows,

    z'_2 p'_2 = z_2 p_2 + K_2 R_2 ΔP    (3.4)

Similarly, let P'' = P' + ΔP' denote the new point position in the world coordinate system at time 3 and p''_3 its projection in view 3. We have

    z''_3 p''_3 = z_3 p_3 + K_3 R_3 (ΔP + ΔP')    (3.5)

In other words, the matched 2D pixels in multiple views may be the projections of a moving 3D point, i.e. (p_1, p'_2, p''_3) instead of (p_1, p_2, p_3). The motion detection problem in two or three views from a moving camera is then formulated as validating whether the matched pixels (p_1, p'_2) or (p_1, p'_2, p''_3) correspond to the same static 3D point P, or equivalently whether ΔP = 0 or ΔP = ΔP' = 0.

3.3.2 Limitation of the epipolar constraint

The epipolar constraint is a commonly used geometric constraint for motion detection in two views, which encapsulates the relative orientation between two cameras [36]. In the uncalibrated case, it is represented by the fundamental matrix F_21, which relates the 2D images of the same 3D point as follows [36],

    p_2^T F_21 p_1 = 0    (3.6)

where the subscripts ij are short for i ← j: view i is the target view while view j is the reference (source) view.

Figure 3.2: Applying the epipolar constraint to motion detection. (a) moving pixels deviate from the epipolar lines; (b) degenerate case: pixels move along the epipolar lines.

Let l_2 = F_21 p_1 denote the epipolar line in view 2 induced by p_1. If the point is static, p'_2 should ideally lie on l_2.
Similarly, p_1 should ideally lie on the epipolar line l'_1 = F_21^T p'_2. A pixel-to-line distance d_epi is defined to measure how much the pixel pair deviates from the epipolar lines,

    d_epi = (|l'_1 · p_1| + |l_2 · p'_2|) / 2    (3.7)

where |l'_1 · p_1| and |l_2 · p'_2| are the perpendicular distances from p_1 to l'_1 and from p'_2 to l_2, respectively. Each line vector l = (l_u, l_v, l_w)^T is normalized such that l_u^2 + l_v^2 = 1.

d_epi is used to detect whether the point P is moving or not. Ideally, if d_epi > 0, the pixels deviate from the epipolar lines and therefore P is moving, as shown in Figure 3.2(a). In the figure, the two images are taken by a moving camera at times 1 and 2, with C_1, C_2 as the 3D camera centers and e_12, e_21 as the 2D epipoles. P is the position of a 3D point at time 1, which moves to P' at time 2. P is projected into the two images as p_1 and p_2, while P' corresponds to p'_1 and p'_2. Π_e is the 3D epipolar plane determined by C_1, C_2 and P.

There exists a special kind of configuration, called the degenerate configuration, in which moving points cannot be detected by the epipolar constraint. This configuration occurs when the camera follows objects moving in the same direction [21, 50, 67], as illustrated in Figure 3.2(b). In the 3D Euclidean space, the point P moves to P', which remains in the epipolar plane established by the camera centers C_1, C_2, and P itself. In the 2D images, the pixel p'_2 moves along l_2. In this situation, the point is moving and yet d_epi = 0.

Therefore, the epipolar constraint is not sufficient to determine whether a point is static or moving in two views. If the pixels move along the epipolar lines, their motion cannot be detected by the epipolar constraint. In order to resolve this degeneracy, multi-view constraints need to be introduced. For example, the trilinear constraint can be utilized to detect moving points across three views [68].
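The degeneracy can be reproduced numerically. The sketch below makes simplifying assumptions (identity intrinsics and rotation, camera 2 at C2 = (1, 0, 0), so that F = [(-1, 0, 0)]_x): a point translating inside the epipolar plane y = 0 yields a zero residual and escapes detection, while an out-of-plane displacement is caught:

```python
# Illustration of the degenerate configuration, under the assumptions
# stated in the lead-in. The point P = (0, 0, 5) is seen from camera 1
# at the origin; camera 2 sits at (1, 0, 0).

def transfer_distance(F, p1, p2):
    """|l2 . p2| with l2 = F p1, normalized so that l_u^2 + l_v^2 = 1."""
    l2 = [sum(F[i][j] * p1[j] for j in range(3)) for i in range(3)]
    norm = (l2[0] ** 2 + l2[1] ** 2) ** 0.5
    return abs(sum(a * b for a, b in zip(l2, p2))) / norm

F = [[0.0, 0.0, 0.0],        # skew((-1, 0, 0)) = [T2]_x with T2 = -C2
     [0.0, 0.0, 1.0],
     [0.0, -1.0, 0.0]]

p1 = (0.0, 0.0, 1.0)                 # image of P = (0, 0, 5) in view 1
p2_in_plane = (-0.5 / 6, 0.0, 1.0)   # P moved to (0.5, 0, 6): undetected
p2_off_plane = (-0.5 / 6, 0.05, 1.0) # P moved out of the plane: detected
```

The in-plane displacement slides p2 along its epipolar line, so the residual stays at zero even though the point has moved.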
However, estimating the trifocal tensor is not a trivial task, and it requires accurate feature matches and large camera baselines. In the next section we propose a novel three-view constraint that can be estimated more reliably.

3.4 Parallax Rigidity Constraints

The "Plane+Parallax" representation of 3D scenes has been used in previous approaches such as [26, 34, 50, 57]. Here, we propose a novel constraint within the "Plane+Parallax" framework for detecting moving points across three views.

3.4.1 Plane+Parallax representation

Let Π denote a so-called reference plane in 3D space. The in-plane points satisfy the plane equation N · P = d, where N = (N_x, N_y, N_z)^T is the normal vector of Π and d is the distance of the origin from Π. The distance of a point P from Π is defined as its "height" H, such that H = N · P - d. The projective depth of the point P [26] is then defined as the ratio of the point's height to the point's depth,

    γ = H / z    (3.8)

If P ∈ Π, then γ = 0; conversely, if γ ≠ 0, then P ∉ Π. Given a static off-plane point P_0 with projective depth γ_0, the relative affine structure of P is defined as [57]:

    k = γ / γ_0 = (z_0 / H_0)(H / z)    (3.9)

This is where the two previous approaches, [57] and [26], are connected. Since γ_0 is constant for each pair of views, k is hereafter viewed as a projective depth up to a scale factor.

Let Π_12 denote a reference plane in the scene which is selected between views 1 and 2. It induces a homography transformation between the two views, H_12, such that the images of any in-plane points satisfy p_1 ~ p_1w = H_12 p_2 [26]. Otherwise, the 2D image of a static off-plane point is decomposed into the warped position (planar warping) plus the parallax vector [57] as follows,

    p_1 ~ H_12 p_2 + k_12 e_12    (3.10)

where the epipole e_12 is the projection of the camera center of view 2 into view 1. As shown in (3.10), p_1, p_1w and e_12 are collinear.
By taking the cross product of p_1 with both sides of (3.10), followed by vector normalization, we obtain the projective depth k_12 as follows,

k_12 = (H_12 p_2 × p_1)^T (p_1 × e_12) / ‖p_1 × e_12‖^2    (3.11)

where the scale of H_12 determines the global scale of k_12.

The "Plane+Parallax" representation in (3.10) is indeed a projective reconstruction of the scene. Let P̃_12 = (p_1; k_12) = [u_1 v_1 1 k_12]^T denote the projective structure of the 3D point P. If the camera parameters and plane positions are known, P̃_12 can be converted into its Euclidean counterpart P.

3.4.2 Structure consistency constraint

A number of methods for detecting independent motion in 3D scenes were proposed by previous "Plane+Parallax" approaches [50, 57, 26]. They assume that the reference plane remains constant across multiple views and therefore that the distances of the points to the reference plane remain unchanged. In practice, however, this assumption is not always valid.

Given a set of matching features between two views, a 2D homography is automatically computed by robust schemes [11]. The reference plane defined by the homography may correspond to different parts of the scene. For example, the reference plane might correspond to the ground plane in views (1, 2) and to a building facade in views (2, 3). Therefore, the previous constraints are not directly applicable.

In order to solve this problem, we propose a novel three-view constraint which works with varying reference planes. Suppose that the "Plane+Parallax" decomposition is obtained between views (1, 2) and views (2, 3), as shown in Figure 3.3. The homography H_12 induces the projective structure of a static point, P̃_12 = (p_1, k_12), from views 1 and 2. From P̃_12 alone, it is hard to tell whether the point is moving. However, if another projective structure, P̃_23 = (p_2, k_23), is obtained from views 2 and 3, there exists a relationship between the pair of projective structures.
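The projective depth extraction in (3.11) amounts to two cross products and a ratio. A minimal sketch (Python/NumPy; all values illustrative) synthesizes p_1 from a known depth and recovers it:

```python
import numpy as np

def projective_depth(H12, e12, p1, p2):
    # k12 of Eq. (3.11): p1 ~ H12 p2 + k12 e12, all homogeneous 3-vectors.
    # The ratio is invariant to the scale of p1; the scale of H12 fixes
    # the global scale of k12.
    p1w = H12 @ p2                       # planar warping of p2 into view 1
    c = np.cross(p1, e12)
    return float(np.cross(p1w, p1) @ c) / float(c @ c)

# Illustrative check: build p1 exactly from Eq. (3.10) with a known k12.
H12 = np.eye(3)
e12 = np.array([1.0, 0.0, 0.0])
p2 = np.array([0.2, 0.1, 1.0])
k_true = 0.5
p1 = H12 @ p2 + k_true * e12
```

Running `projective_depth(H12, e12, p1, p2)` returns k_true, since the formula projects the parallax vector p_1 − p_1w onto the direction p_1 × e_12 along which all three collinear quantities agree.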
As derived in the next subsection, a bilinear relationship exists between the pair of projective structures corresponding to the same static point:

P̃_23^T G P̃_12 = 0    (3.12)

where G is a 4×4 matrix.

The G matrix represents a bilinear constraint over the 3D projective structures of the same point, similar to the 2D case of the fundamental matrix. Intuitively, G P̃_12 creates a 3D plane in which P̃_23 should lie. However, such a plane has no geometric meaning in the 3D projective space.

Figure 3.3: "Plane+Parallax" decomposition with respect to two reference planes.

We define an algebraic error function to measure the consistency of a pair of projective structures using the proposed constraint:

d_G(P̃_12, P̃_23) = P̃_23^T G P̃_12    (3.13)

If d_G = 0, the two projective structures are consistent with the G matrix and the corresponding point is static. Otherwise, the 3D point is moving.

The matrix G encapsulates the normal vectors of the two reference planes, the camera's relative orientation, and unknown scale factors. It directly relates the pair of projective structures from views (1, 2) and (2, 3) without knowledge of the camera configuration and plane positions. If the two reference planes are identical, G is still valid and can be applied to the cases considered in [50, 57, 26].

G is essentially a three-view geometric constraint, as each projective structure relates a pair of corresponding pixels. In this sense, the structure consistency constraint can be considered as the combination of one trilinear constraint and two planar homographies. Furthermore, it can be extended to four views (i, i+1), (j, j+1), i ≠ j. The projective structures of the same point in the scene are obtained as P̃_{i,i+1} from views (i, i+1) and P̃_{j,j+1} from views (j, j+1). As long as the two pairs of views share the same scene, there exists a bilinear relationship G_{i,j} between P̃_{i,i+1} and P̃_{j,j+1}.
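The error function (3.13) is a single bilinear form. A minimal sketch (Python/NumPy; the G matrix and projective structures below are illustrative placeholders):

```python
import numpy as np

def d_G(G, Pt12, Pt23):
    # Algebraic consistency error of Eq. (3.13) between the projective
    # structures Pt12 = [u1, v1, 1, k12] and Pt23 = [u2, v2, 1, k23].
    return float(Pt23 @ G @ Pt12)

# Example with a sparse G whose only non-zero entries couple the two
# projective depths, so the residual reduces to a1*k12 - a2*k23
# (this is the shape that Eq. (3.25) takes later in the chapter).
a1, a2 = 2.0, 1.0
G = np.zeros((4, 4))
G[2, 3] = a1
G[3, 2] = -a2
Pt12 = np.array([0.1, 0.2, 1.0, 0.5])
Pt23 = np.array([0.3, 0.1, 1.0, 1.0])
```

Here `d_G(G, Pt12, Pt23)` equals a1·k_12 − a2·k_23; a near-zero value marks the pair as consistent (static point), a large value as moving.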
3.4.3 Derivation of the Structure Consistency Constraint

Let Π_1 be the reference plane selected between views 1 and 2, with plane equation N_1 · P_1 = d_1. The projective depth of P_1 between views 1 and 2, k_12, is rewritten as

k_12 = (z_0 / H_0) (N_1^T P_1 − d_1) / z_1 = (v_1^T P_1 − 1) / (α_1 z_1)    (3.14)

where v_1 is the normal vector of Π_1 scaled by 1/d_1, and α_1 = H_0 / (d_1 z_0) is a constant scale factor between views 1 and 2, determined by the unknown off-plane point P_0. The projective structure of P_1 between views 1 and 2 is obtained as P̃_12 = (p_1, k_12).

Similarly, suppose Π_2 is the reference plane selected between views 2 and 3, with plane equation N_2 · P_2 = d_2. The projective depth k_23 between views 2 and 3 is obtained as

k_23 = (v_2^T P_2 − 1) / (α_2 z_2)    (3.15)

where v_2 is the scaled normal vector of Π_2 and α_2 is the scale factor between views 2 and 3. The projective structure of P_2 between views 2 and 3 is obtained as P̃_23 = (p_2, k_23).

For simplicity, we assume the camera intrinsic matrices K_i to be I, which does not affect the relative projective structure. Therefore, z_i p_i = K_i P_i simplifies to z_i p_i = P_i. By substituting this into (3.14) and (3.15), we express the reciprocals of z_1 and z_2 as inner products of the projective structures with the scaled plane normal vectors:

z_1^{−1} = v_1^T p_1 − α_1 k_12 = [v_1^T  −α_1] P̃_12    (3.16)
z_2^{−1} = [v_2^T  −α_2] P̃_23    (3.17)

Let r_z^T denote the third row of the rotation matrix R_2 and t_z the third component of the translation vector T_2.
The Euclidean depth of point P_2 can be related to that of P_1 by extracting the third row of P_2 = R_2 P_1 + T_2:

z_2 = z_1 (r_z^T p_1) + t_z    (3.18)

By substituting (3.16) and (3.17) into (3.18), we have:

v_1^T p_1 − α_1 k_12 = (v_2^T p_2 − α_2 k_23) [ (r_z^T + t_z v_1^T) p_1 − t_z α_1 k_12 ]    (3.19)

Rewriting each side of (3.19) as inner products,

LHS = P̃_23^T [0 0 1 0]^T [v_1^T  −α_1] P̃_12
RHS = P̃_23^T [v_2^T  −α_2]^T [(r_z^T + t_z v_1^T)  −α_1 t_z] P̃_12

After moving the left-hand side to the right, we have

P̃_23^T G P̃_12 = 0    (3.20)

where G is a 4×4 matrix relating the two projective structures of the same 3D point. G can be expressed in detail as follows,

G = [ v_2x w^T              −v_2x t_z α_1
      v_2y w^T              −v_2y t_z α_1
      v_2z w^T − v_1^T      −(v_2z t_z − 1) α_1
      −α_2 w^T               α_2 t_z α_1 ]    (3.21)

where v_1 = (v_1x, v_1y, v_1z)^T and v_2 = (v_2x, v_2y, v_2z)^T are the scaled normal vectors of the two reference planes and w = r_z + t_z v_1.

G is a rank-2 matrix, since α_i ≠ 0 and v_i ≠ 0, i = 1, 2. It absorbs the two plane normal vectors (v_1, v_2: 3 unknowns each), the third row of the camera rotation (r_z: 2 unknowns), the third component of the camera translation (t_z: 1 unknown), as well as two unknown scale factors (α_1, α_2): 11 unknowns in total.

More knowledge about the camera motion or the plane positions can further simplify the G matrix. For instance, if t_z = 0, G reduces to

G = [ v_2x r_z^T            0
      v_2y r_z^T            0
      v_2z r_z^T − v_1^T    α_1
      −α_2 r_z^T            0 ]    (3.22)

Since r_z^T is a row of a rotation matrix, there exists an additional constraint: ‖r_z‖ = 1. Then r_z, α_2, α_1, v_2x, and v_2y can be obtained, which reduces the number of unknowns in G to 4.

It is not always feasible to recover the parameters, such as the reference plane positions and relative poses, from a given G matrix, as the intrinsic parameters might not be available. Instead, we seek a compact parametrization as follows, which fully exploits the rank-2 property of the G matrix and removes the linear redundancy.
It is always possible to find two columns of G, g^(1) and g^(2), that are linearly independent and have the largest non-zero norms. The original matrix G is then converted into a 12-dimensional parameter vector, [g^(1) g^(2) λ_31 λ_32 λ_41 λ_42]^T, where λ_k1 g^(1) + λ_k2 g^(2) = g^(k), k = 3, 4, reconstruct the remaining two columns of the original G. If the condition ‖G‖ = 1 is also enforced, one of the four coefficients, for instance λ_42, can be solved from the other parameters. The final number of parameters is thus reduced to 11, which is identical to the number of unknown parameters.

3.4.4 Degenerate cases for three-view constraints

The trilinear constraint and the structure consistency constraint exploit information from three views to detect moving objects. They are capable of detecting most of the degenerate cases mentioned in the previous section. However, there still exists a subset of degenerate cases that cannot be detected, should both of the following conditions be satisfied [50, 67]: i) the camera and the objects move in the same direction; ii) their velocities satisfy a constant proportional relationship. Fortunately, this subset occurs much less frequently than the whole set of degenerate motion.

Figure 3.4: Degenerate cases for motion detection with three-view constraints.

Figure 3.4 shows an example of a camera tracking a moving point across three views. We assume that both the camera and the point are located in the same 3D vertical plane, y = 0, in the world coordinate system, and that both move in a direction parallel to the plane z = 0, which is an extreme case of degenerate motion. Let C_i = (x_ci, 0, z_c)^T denote the camera position in view i (i = 1, 2, 3). Let P(x_p, 0, z_p), P′(x′_p, 0, z_p), and P″(x″_p, 0, z_p) denote the point position in each view, respectively.
A virtual point P̃(x̃, 0, z̃) is obtained by intersecting the 3D rays C_1P, C_2P′, and C_3P″, which is indeed a 3D point triangulation process [21]. There exists a proportional relationship from the two similar triangles ΔP̃PP′ and ΔP̃C_1C_2:

(z_c − z̃) / (z_p − z̃) = (x_c2 − x_c1) / (x′_p − x_p)    (3.23)

If the point motion between views 2 and 3, x″_p − x′_p, also satisfies this proportional relationship:

(z_c − z̃) / (z_p − z̃) = (x_c3 − x_c2) / (x″_p − x′_p)    (3.24)

then the static virtual point P̃ has the same projections as the moving point P″. In this situation, the trilinear constraint cannot determine whether the pixel triplet comes from a static point or a moving one.

We show that the structure consistency constraint is unable to handle this case as well. The plane z = 0 is assumed to be the reference plane across the three views, so the plane normal vector becomes v = [0, 0, 1]^T. The G matrix simplifies to one related only to the projective depths:

G = [ 0 0 0   0
      0 0 0   0
      0 0 0   α_1
      0 0 −α_2 0 ]    (3.25)

Meanwhile, the projective depth values are obtained as k_12 = (z_p − 1)/(α_1 z_p) and k_23 = (z_p − 1)/(α_2 z_p). The residual error is then d_G = α_1 k_12 − α_2 k_23 = 0, indicating that the pair of projective structures is consistent with the structure consistency constraint, which contradicts the ground truth.

The above example shows that there exist some degenerate cases that neither the trilinear constraint nor the structure consistency constraint can handle. The example can be extended to more than one point, as long as the motion of the points considered satisfies the proportional relationships defined in (3.23) and (3.24). Similarly, it can be extended to an arbitrary number of views. Fortunately, these cases occur much less frequently in practice, as the proportional relationship is not easily satisfied.
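Before turning to estimation, the closed form (3.21) can be checked numerically: for a static point, the two projective structures built from (3.14)–(3.15) satisfy (3.12) by construction. A minimal sketch (Python/NumPy, identity intrinsics as assumed in the derivation; all scene values are illustrative):

```python
import numpy as np

def G_matrix(v1, v2, a1, a2, r_z, t_z):
    # Closed form of Eq. (3.21):
    # G = [v2; -a2] [w^T, -t_z a1] - e_3 [v1^T, -a1], with w = r_z + t_z v1.
    w = r_z + t_z * v1
    G = np.outer(np.append(v2, -a2), np.append(w, -t_z * a1))
    G[2, :3] -= v1          # subtract the LHS term from the third row
    G[2, 3] += a1
    return G

# Illustrative scene: camera 2 rotated about y and translated w.r.t. camera 1.
th = 0.1
R2 = np.array([[np.cos(th), 0.0, np.sin(th)],
               [0.0, 1.0, 0.0],
               [-np.sin(th), 0.0, np.cos(th)]])
T2 = np.array([0.2, -0.1, 0.3])
v1 = np.array([0.1, 0.2, 0.5])        # plane normals scaled by 1/d
v2 = np.array([-0.2, 0.1, 0.45])
a1, a2 = 0.8, 1.3                     # scale factors alpha_1, alpha_2

P1 = np.array([0.4, -0.3, 2.0])       # a static 3D point, camera-1 frame
P2 = R2 @ P1 + T2                     # the same point, camera-2 frame
k12 = (v1 @ P1 - 1.0) / (a1 * P1[2])  # projective depths, Eqs. (3.14)-(3.15)
k23 = (v2 @ P2 - 1.0) / (a2 * P2[2])
Pt12 = np.append(P1 / P1[2], k12)     # projective structures [u, v, 1, k]
Pt23 = np.append(P2 / P2[2], k23)

G = G_matrix(v1, v2, a1, a2, R2[2], T2[2])
residual = Pt23 @ G @ Pt12            # ~0 for a static point, per Eq. (3.12)
```

The residual vanishes to machine precision, and the constructed G has rank 2, as it is the sum of two rank-1 outer products.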
3.5 Robust Estimation of the Structure Consistency Constraint

In this section, we describe how the structure consistency constraint is robustly estimated from pairs of projective structures (P̃^j_12, P̃^j_23) across three views. The estimation of the G matrix consists of two steps: first obtain a linear solution from a set of noisy points by a robust LMedS scheme [81], and then refine it with non-linear optimization techniques over the inlier points.

Before solving for G, we apply a data normalization to the pairs of projective structures, such that the pixel coordinates and projective depth values are normalized to [−1, 1] and [0, 1], respectively [17]. This normalization step helps reduce numerical errors and increases the robustness of the estimation. Furthermore, G itself is normalized such that ‖G‖ = 1.

3.5.1 Linear solution

By reshaping the G matrix into a 16×1 vector g, we convert the bilinear multiplication in (3.12) to

q_j g = 0    (3.26)

where the elements of q_j are computed from those of P̃^j_12 and P̃^j_23. g is obtained by SVD decomposition and reshaped into G.

Given ‖G‖ = 1, the linear solution to G requires at least 15 pairs of projective structures (P̃^j_12, P̃^j_23), j = 1, ..., 15. In the presence of image noise and erroneous matches, however, using the whole point set introduces unnecessary errors and greatly affects the accuracy of the estimated G matrix. Therefore, a robust estimation scheme is needed to find the correct G matrix from a set of noisy points.

The Random Sample Consensus (RANSAC) scheme is a common choice, which finds the solution with the largest inlier support [11]. However, RANSAC requires a pre-determined threshold to find the inliers. In our case, this threshold is hard to select, since the residual error with respect to the G matrix is not a geometric distance. Instead, we employ the Least Median of Squares (LMedS) estimator, which does not depend on a threshold for finding inliers [81].
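The linear step (3.26) stacks one row q_j per pair and takes the null vector of the stacked system. A minimal sketch (Python/NumPy; it assumes the pairs are already normalized as described above, and the synthetic data in the check are illustrative):

```python
import numpy as np

def solve_G_linear(P12s, P23s):
    # One row q_j per pair (Eq. 3.26): q_j . g = P23_j^T G P12_j = 0.
    # The right singular vector of the smallest singular value gives g
    # with ||g|| = ||G|| = 1.
    Q = np.einsum('ni,nj->nij', P23s, P12s).reshape(len(P12s), 16)
    _, _, Vt = np.linalg.svd(Q)
    return Vt[-1].reshape(4, 4)

# Synthetic check: draw random pairs, then project each P23 so that it
# satisfies P23^T G_true P12 = 0 exactly, and recover G_true up to sign.
rng = np.random.default_rng(3)
G_true = rng.standard_normal((4, 4))
P12s = rng.standard_normal((20, 4))
P23s = rng.standard_normal((20, 4))
v = P12s @ G_true.T                   # v[n] = G_true @ P12s[n]
P23s -= (np.sum(P23s * v, axis=1) / np.sum(v * v, axis=1))[:, None] * v
G_est = solve_G_linear(P12s, P23s)
```

With exact correspondences, G_est matches G_true up to the unit-norm scale and a sign ambiguity; noise and outliers degrade this, which is why the LMedS loop described next wraps this solver around minimal 15-pair samples.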
The LMedS estimator randomly selects 15 pairs of projective structures to compute the G matrix as in (3.26), and computes the median of the squared residual errors over the whole set of projective structures:

median_j ( (P̃^j_23)^T G P̃^j_12 )^2    (3.27)

This process is repeated for a large number of iterations. The G matrix which minimizes the median residual error is taken as the correct solution. Any point with an error smaller than the median error is classified as an inlier, while the rest are outliers. An implicit assumption made by the LMedS estimator is that the outliers make up less than 50% of the whole point set, such that the median error reaches its minimum when a correct solution from the inlier points is obtained.

3.5.2 Non-linear refinement

The linear method does not exploit the fact that there are only 11 unknown parameters in G (see appendix). In theory, only 11 pairs of projective structures are sufficient for obtaining a non-linear solution for G, while the linear solution is affected by the redundancy in the 15 point pairs. In order to address this, we convert the linear solution into a compact 11-dimensional parameterization and then refine the parameters by non-linear optimization methods.

First, an SVD decomposition is applied to the G matrix to enforce its rank-2 property. G is decomposed as U diag(s_1, s_2, s_3, s_4) V^T, where s_i (i = 1, ..., 4) are the singular values of G, listed in non-ascending order. The rectified G is given as U diag(s_1, s_2, 0, 0) V^T.

G is then converted into an 11-parameter vector ḡ as described in the appendix. This 11-dimensional parameter vector provides a compact representation of G and removes the linear redundancy. It is refined by the Levenberg-Marquardt (LM) algorithm [44] such that the reconstructed G matrix minimizes the following Mean Squared Error (MSE):

Σ_j ( (P̃^j_23)^T G P̃^j_12 )^2    (3.28)

subject to ‖G‖ = 1.
This error function is computed over the inlier points only, instead of over the whole point set.

3.5.3 Stability analysis

We perform two experiments on synthetic data to test the stability of the estimation process for the G matrix. The estimated matrix, denoted by G̃, is compared to the ground-truth G matrix by an error measure ε_G ∈ [0, 2] defined as follows,

ε_G = min( ‖ G̃/‖G̃‖ − G/‖G‖ ‖ , ‖ G̃/‖G̃‖ + G/‖G‖ ‖ )    (3.29)

Three methods are evaluated in the presence of pixel noise and outliers. The first is the LMedS-based method of Section 3.5.1, which identifies a set of inlier points. The second computes a new G̃ by least-squares fitting to the inlier points. The third refines G̃ by non-linear fitting (Section 3.5.2) based on the same set of inliers. The three methods, referred to as "LMedS", "LMedS+Linear", and "LMedS+Nonlinear" respectively, are all used in each experiment.

We first briefly describe how the synthetic data for this test were generated. First, we synthesize a moving camera with constant intrinsic parameters and three different poses, with both 3D rotation and translation. Second, two different reference planes are randomly selected, both ensured to be visible to all the cameras. The inter-frame homographies and epipolar constraints are obtained from the known 3D configuration. Last, two sets of 3D points are randomly generated. The first set contains 500 static off-plane 3D points which project to 2D parallax pixels. The second set consists of points moving in 3D epipolar planes, projected to pixels moving along the epipolar lines, called degenerate motion pixels. The coordinates of all 2D pixels are generated within the range [−1, 1].

The first experiment tests how sensitive the G matrix is with respect to noise in the pixel coordinates. Here, only the parallax pixels are used. We add i.i.d. Gaussian noise n to the parallax pixels, where n ∼ N(0, σ) and σ is the noise level (standard deviation).
The noise level σ is gradually increased from 0 to 0.2, which is 20% of the original range of the pixel coordinates.

The second experiment evaluates the sensitivity of the estimation process to the ratio of inliers over the whole data set. Here the inliers are parallax pixels and the outliers are degenerate motion pixels. The number of motion pixels is gradually increased from 0, while the number of parallax pixels is kept constant. The inlier ratio thus decreases from 100% to as low as 30%. Each pixel is contaminated with 3% noise as well.

Both experiments were repeated 100 times to increase generality, and the results are shown in Figure 3.5.

Figure 3.5: Stability of the estimated G matrix w.r.t. pixel noise and inlier ratio.

The estimation error ε_G rises as the noise level σ increases, following an approximately exponential curve, as shown in Figure 3.5(a). Both "LMedS+Linear" and "LMedS+Nonlinear" generate larger errors than "LMedS", as they fit G more closely to the noisy data. In Figure 3.5(b), the estimation error drops as the ratio of inlier points increases up to 100%. In the presence of outliers, "LMedS+Nonlinear" leads to the smallest errors, again due to its ability to fit more closely to the inlier points.

Based on the experimental results obtained from synthetic data, we conclude that the estimation of G is reliable as long as the pixel noise is below 5% and the inlier ratio is above 70%. Another conclusion is that "LMedS+Nonlinear" outperforms the other two methods in the presence of noise and outliers. These two conclusions assume perfect inter-frame homographies and epipolar constraints, with the synthesized pixels drawn uniformly from [−1, 1].
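For reference, the error measure (3.29) used throughout these experiments amounts to a sign-invariant distance between unit-normalized matrices; a minimal sketch (Python/NumPy):

```python
import numpy as np

def g_error(G_est, G_true):
    # Error measure of Eq. (3.29): both matrices are scaled to unit
    # Frobenius norm and compared up to the overall sign ambiguity of the
    # homogeneous solution, so the value always falls in [0, 2].
    A = G_est / np.linalg.norm(G_est)
    B = G_true / np.linalg.norm(G_true)
    return min(np.linalg.norm(A - B), np.linalg.norm(A + B))

G = np.arange(16.0).reshape(4, 4) + 1.0
```

Any rescaling or sign flip of the estimate leaves the measure at zero, which is the desired invariance since G is only ever recovered up to scale.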
Higher estimation errors are expected if the inter-frame constraints are noisy and the pixels are anisotropically distributed.

3.6 Implementation Details

In this section, we present some implementation details of an automatic system for detecting and tracking moving objects in video sequences from moving cameras. As shown in Figure 3.6, the system is built as a pipeline of four stages: geometric constraint estimation, homography-based motion detection, parallax filtering, and spatio-temporal tracking.

Figure 3.6: Overview of the motion detection and object tracking system.

The system starts with the robust estimation of multi-view geometric constraints based on feature point correspondences extracted from the video sequences. Then, a background model image is computed for each frame by homography-based background modeling. The pixels inconsistent with the background model are classified as residual pixels. The parallax pixels are filtered out of the residual pixels by applying robust outlier detection methods to the disparity values with respect to each geometric constraint. Finally, the 2D motion regions obtained from each frame are linked into motion trajectories by a spatio-temporal tracking algorithm.

3.6.1 Geometric constraint estimation

The detailed process of estimating the geometric constraints is described in Chapter 2. The homography between any two consecutive frames, H_{t+1,t}, is computed between frames t and t+1 and corresponds to a reference plane in the scene. In practice, the 2D affine motion model is used as a good approximation of the homography and requires only three point pairs for estimation.

The fundamental matrix is estimated between frames with a larger temporal interval, namely t and t+δ_t, where δ_t is set to 5 frames to allow a longer camera baseline.
δ_t may vary with the speed of the camera motion. The fundamental matrix is computed by a plane+parallax approach [21] based on the homography concatenated from frame t to t+δ_t. Similarly, the structure consistency constraint is estimated based on the off-plane points matched across frames t, t+δ_t, and t+2δ_t.

3.6.2 Homography-based motion detection

Let us consider a video frame t_0, called the reference frame. The background model is obtained by registering the images within a sliding temporal window W_detect(t_0) = [t_0 − Δ_detect, t_0 + Δ_detect] to the reference frame t_0, where Δ_detect is the temporal window size.

For any pixel p_{t_0} in frame t_0, let p_t denote its corresponding pixel in frame t ∈ W_detect(t_0). A background model for p_{t_0} in frame t_0 is obtained by fitting a model to the set of corresponding pixels {I(p_{t_0−Δ_detect}), ..., I(p_{t_0}), ..., I(p_{t_0+Δ_detect})}, where I(p_t) denotes the image intensity of pixel p in frame t. In our system, the mode of the pixel intensities is used to represent the background model [28]. There are, of course, other background models, such as the Gaussian mixture model [62].

The inter-frame motion caused by the moving camera is compensated by the homographies from any frame t ∈ W_detect(t_0) to the reference frame t_0, p_t ∼ H_{t,t_0} p_{t_0}. This is a typical image registration process [82] which warps all the frames within the window to the reference frame. The homography-based detection avoids finding dense optical flows between consecutive frames, which are more time-consuming and sensitive to image noise [2].

The homography H_{t_0,t} from any frame t ∈ W_detect(t_0) to frame t_0 is obtained by concatenating the homographies between consecutive frames as follows:

H_{t_0,t} = H_{t_0,t_0+1} ··· H_{t−1,t}   if t ≥ t_0,
H_{t_0,t} = H_{t_0,t_0−1} ··· H_{t+1,t}   if t < t_0.
(3.30)

In order to overcome the accumulated errors of these concatenated homographies, they are further refined by a Bundle Adjustment method similar to [8], which minimizes the prediction errors by the Levenberg-Marquardt algorithm over the whole set of inlier points within the temporal window.

After the background model is estimated, we obtain a pixel-level binary detection mask by comparing the background model to the original frame: pixels with intensity differences larger than a threshold σ_detect are residual pixels, while those below the threshold belong to the static background. σ_detect controls how large an intensity difference between the background model and the original image is tolerated. Obviously, the smaller the threshold, the more residual pixels are identified.

3.6.3 Parallax filtering

Before filtering the parallax pixels, dense pixel correspondences, i.e. optical flow [2], are established between the residual pixels in frames t, t+δ_t, and t+2δ_t. The number of residual pixels is much smaller than that of the whole image, which reduces the
We assume that the image measurement noise of feature points in both x- and y- directions satisfy a 1D Gaussian distribution N(0,σ) where σ is the standard deviation. The disparity values are then assumed to satisfy a χ 2 (k,σ) distribution [25], as they are computed from a quadratic form p T 2 Ap 1 , where k is the degree of freedom. Figure 3.7 shows a typical histogram of the disparity values w.r.t. the structure consistency constraint (solid line) and the probabilitydensityfunctionvaluesofthefittedχ 2 distribution(dashedline). Parameters of the χ 2 distribution, namely σ and k, are obtained by maximum likelihood estimation (MLE) based on prediction errors over the feature points [25]. With the estimated distribution parameters, any pixels whose disparity is greater than 3σ are treated as outliers [25], i.e. motion pixels. This threshold can be smaller if 58 0 0.2 0.4 0.6 0.8 1 x 10 −3 0 0.01 0.02 0.03 0.04 disparity probability density Original Fitted Figure 3.7: Distribution of disparity values with respect to the structure consistency constraint. we want to include more motion pixels. However, the number of mis-detected parallax pixels might also increase. Duetoestimationerrorsandnoise,thefilteredpixelsmaystillcontainasmallamount ofparallaxpixels. Inordertodiscriminatethesetwokindsofpixels,wedefinealikelihood function for each motion pixel based on the disparity of pixelp t : L(p t )=1− 1 2 (e −λ epi d epi +e −λ G d G ) (3.31) where λ epi and λ G control how much the likelihood changes with respect to disparity. In this manner, the true-positive motion pixels are assigned with higher likelihood, while those false-positive parallax pixels correspond to lower likelihood. 59 The filtered motion mask images are further refined by standard morphological op- erations, such as erosion and dilation [13]. Connected pixels are grouped into compact motion regions, while scattered pixels are removed. 
3.6.4 Spatio-temporal JPDAF-based object tracking

Object tracking is formulated as a Joint Probability Data Association Filtering (JPDAF) process in the 2D+t domain [28]. Here, we present a novel optimization method for extracting the optimal path by collecting discriminating evidence from past observations within a temporal window of multiple frames.

The parallax filtering step removes a large number of parallax pixels from the original residual pixels. The remaining residual pixels are assigned a pixel-level motion likelihood, which is utilized as one of the cues for the JPDAF-based tracking. In the proposed JPDAF, the optimal position of the moving object at each time step depends on several cues: the current appearance observation, the 2D velocity estimation, and the object region's motion likelihood. Each cue is associated with a probability measure. We propose to define a joint probability reflecting current and past observations, and to determine the appropriate data association (i.e. tracking) by maximizing this joint probability, collecting discriminating evidence in a sliding temporal buffer (>60 frames). The large number of frames helps remove the parallax regions and merge split motion regions.

A simple view of this problem is that the detected motion regions are treated as nodes in a unidirectional graph, and tracking an object is formulated as finding a path linking the motion regions that belong to the same object. The Viterbi algorithm
Both qualitative and quantitative results demonstrate the effectiveness and robustness of our approach. 3.7.1 Qualitative evaluation In figure 3.8, we show the detection and tracking results of a video sequence, called the “road” sequence. A number of vehicles on the road are followed by an airborne camera moving in the same direction, as shown in three original frames in Figure 3.8(a)-(c). This is a typical example of the degenerate motion discussed in Section 3.3.2. The insufficiency of the epipolar constraint is illustrated in Figure 3.8(d) and (e). As one can observe from the figures, the measured pixels (marked with white squares) on the moving vehicle lie exactly on the epipolar lines (dark lines). This degenerate motion cannot be effectively detected by the epipolar constraint. Also shown in Figure 3.8(d) and (e) is the variation of the reference plane. In both (d)and(e), thefeaturepointsthataremostconsistentwiththehomographyaremarked withwhitecircles,whichroughlyindicatethepositionofthereferenceplaneinthescene. One can observe thatthe reference plane changes over different frames. InFigure 3.8(d), 61 (a) original frame 55 (b) original frame 60 (c) original frame 65 (d) epipolar line and reference plane (frame 55) (e) epipolar line and reference plane (frame 60) (f) initialdetectionmaskofframe 55 (g) filtered detection result of frame 55 (h) filtered detection result of frame 60 (i) filtered detection result of frame 65 (j) tracking result of frame 55 (k) tracking result of frame 60 (l) tracking result of frame 65 Figure 3.8: Detection and tracking results of the “road” sequence. 62 (a) epipolar constraint (b) trilinear constraint (c) structure consistency constraint Figure 3.9: Disparity maps with respect to three different constraints. Darker pixels correspond to larger disparity values. thereferenceplanebetweenframe55and60isalmostoverlappedwiththegroundplane. 
However, the reference plane between frames 60 and 65 in Figure 3.8(e) is skewed from the ground plane, containing points on the traffic signs and the bushes on the roadside.

The original detection mask for frame 55 is shown in Figure 3.8(f), where the residual pixels are marked black. The residual pixels include both the moving vehicles and parallax pixels, such as traffic signs and houses on the roadside. The parallax pixels are filtered by the decision tree-based process described in Section 3.6.3. The final motion mask images are shown in Figure 3.8(g)-(i). Notice that more than 90% of the moving vehicles are successfully detected, as well as some parallax regions. These parallax regions are effectively removed in the tracking step, and only the moving objects are grouped into object trajectories (Figure 3.8(j)-(l)).

Figure 3.9 compares the disparity maps with respect to three different geometric constraints: the epipolar constraint, the structure consistency constraint, and the trilinear constraint. The disparity maps are converted into gray-scale images where brighter pixels indicate smaller motion likelihood. One can observe that the disparities for moving objects w.r.t. the epipolar constraint are even smaller than those for the parallax pixels. Therefore, the epipolar constraint disparity cannot provide useful cues for detecting degenerate motion.

Figure 3.10: Motion detection results of the "seq29" sequence.

The structure consistency constraint, however, is not affected by the degenerate motion. The disparities for moving objects w.r.t. the structure consistency constraint are much larger than those for the parallax pixels.

We also compute the disparity map with respect to the trilinear constraint estimated by the code from [40], as shown in Figure 3.9(c). Note that the average inlier error for the estimated trilinear constraint is as large as 5.0 pixels.
The corresponding disparity map does not present a useful pattern for parallax filtering, due to the unreliable estimation of the trifocal tensor.

Another video sequence, shot by an infra-red airborne sensor and called “seq29”, is shown in Figure 3.10. In the sequence, two vehicles make turns beside large buildings and trees. The results for the three frames 700, 703 and 706 are respectively shown in three rows (from top to bottom). Each row contains, from left to right, the original image, the initially detected residual pixels, the motion regions after parallax filtering, and the final detection results after morphological refinement. The initial residual pixels contain a large number of parallax pixels, including building edges, trees, and road curbs. Most of these parallax pixels are effectively removed by parallax filtering; the filtered motion mask images are much crisper than the original ones. After morphological refinement, only two motion regions are consistently detected in the sequence, which largely facilitates the tracking step.

Figure 3.11: Motion detection results of the “Tutor Hall” sequence. [(a)-(c) results of frames 45, 50, 55.]

Finally, we show the experimental results on a video sequence shot by a hand-held camera, called “Tutor Hall”, in Figure 3.11. Similarly, each row in the figure contains, from left to right, the original image, residual pixels, motion regions after parallax filtering, and final detection results after morphological refinement. The camera pans around a building while following a moving vehicle. Among a number of planar surfaces in the scene, the ground plane is automatically selected as the reference plane between consecutive frames most of the time. As shown in the residual pixel maps, the building facades are effectively suppressed by multi-frame image registration, although they are indeed parallax pixels.
Most of the detected parallax pixels belong to the edges of the tall building and the road curbs. The parallax filtering step successfully removes a large number of these parallax pixels, despite the fact that many of them lie on the epipolar lines. As a side effect, some parts of the motion regions are diminished as well. Finally, the moving vehicle is successfully detected with compact regions after morphological refinement.

It is worthwhile to mention that in this sequence, both the epipolar constraint and the structure consistency constraint are estimated with large inlier errors. This may be because the motion of the hand-held camera is not large enough compared to the dimension of objects in the scene, especially the tall building.

3.7.2 Quantitative evaluation

In order to quantitatively evaluate the performance of our system, we have manually labeled ground-truth data on the above video sequences. The ground-truth data refer to a number of 2D polygons in each video frame which approximate the contours of the motion regions. The labeled polygons include the shadow regions as well, since our method does not remove object shadows. For each video sequence, more than 60 frames, or 25% of the whole sequence, are labeled. Figure 3.12 shows the ground-truth object regions and motion mask image for frame 55 in the “road” sequence.

Figure 3.12: Ground-truth data for frame 55 in the “road” sequence. [(a) labeled motion regions; (b) mask image.]

Based on the ground-truth and detected motion mask images, we define two area-based metrics to evaluate our method, which are similar to those in [39]. Let Ψ_g^t denote the set of pixels which belong to ground-truth motion regions in frame t, and Ψ_d^t the set of actually detected pixels in frame t.
A recall measure (detection rate) is defined to evaluate how many detected pixels lie in the ground-truth motion regions:

    Rec(t) = |Ψ_d^t ∩ Ψ_g^t| / |Ψ_g^t|    (3.32)

and a precision measure (related to the false alarm rate) evaluates how many detected pixels are indeed motion pixels:

    Prec(t) = 1 − |Ψ_d^t ∩ Ψ̄_g^t| / |Ψ_d^t|    (3.33)

where |Ψ| is the number of pixels within Ψ and Ψ̄ is the complement set of Ψ. Both measures range between 0 and 1; the higher both measures are, the better the performance of motion detection.

Average recall and precision measures are computed over the labeled video frames to evaluate the performance of our motion detection method. The quantitative evaluation results are shown in Table 3.2. For each sequence, four different sets of recall/precision measures are listed as the results of different steps: initial homography-based motion detection (“Initial Detection” in the table), parallax filtering with both the epipolar and structure consistency constraints (“Epipolar+Structure”), parallax filtering with the epipolar and trilinear constraints (“Epipolar+Trilinear”), and the final morphology-based refinement before and after parallax filtering (“Morph. Refinement”).

Let us first compare the effects of the different processing steps on the measures. Initial homography-based detection generates the baseline recall and precision. The recall measure is then decreased by the parallax filtering step, either by “Epipolar+Structure” or by “Epipolar+Trilinear”, as a number of motion pixels are rejected as parallax pixels. In contrast, the precision measure is increased, since the parallax pixels are gradually removed.

The morphological refinement steps help obtain more compact motion regions and boost both measures, as the parallax pixels are more scattered than the motion pixels. The results before and after parallax filtering demonstrate the advantage of removing the parallax pixels.
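The two measures can be computed directly from binary mask images. A minimal sketch (the function name and the toy masks are illustrative, not from the thesis implementation):

```python
import numpy as np

def recall_precision(detected, ground_truth):
    """Per-frame area-based measures of Eqs. (3.32)-(3.33).
    Both inputs are boolean H x W masks; True marks a motion pixel."""
    inter = np.logical_and(detected, ground_truth).sum()
    rec = inter / ground_truth.sum()                    # |D ∩ G| / |G|
    false_alarms = np.logical_and(detected, ~ground_truth).sum()
    prec = 1.0 - false_alarms / detected.sum()          # = |D ∩ G| / |D|
    return float(rec), float(prec)

# toy 4x4 frame: 3 of the 4 ground-truth pixels detected, plus 1 false alarm
gt = np.zeros((4, 4), bool); gt[1:3, 1:3] = True
det = np.zeros((4, 4), bool); det[1:3, 1] = True; det[1, 2] = True; det[3, 3] = True
print(recall_precision(det, gt))   # (0.75, 0.75)
```

Averaging these per-frame scores over the labeled frames yields sequence-level numbers like those in Table 3.2.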
By comparing the measures in the second and third rows of the table, we find that the measures obtained by “Epipolar+Trilinear” are almost always lower than those obtained by “Epipolar+Structure”. This provides quantitative evidence that the structure consistency constraint performs better than the trilinear constraint.

Table 3.2: Quantitative evaluation results of motion detection.

    Rec/Prec (%)                                 “road”        “seq29”       “Tutor Hall”
    Initial Detection                            87.9 / 34.1   64.4 / 16.9   61.5 / 49.6
    Epipolar+Structure                           83.5 / 45.1   52.2 / 42.5   52.6 / 67.7
    Epipolar+Trilinear                           76.9 / 38.3   35.2 / 46.0   51.7 / 54.5
    Morph. Refinement (before parallax filter)   91.2 / 36.4   78.1 / 30.4   80.3 / 65.0
    Morph. Refinement (after parallax filter)    96.9 / 47.9   89.6 / 84.4   92.1 / 99.3

The recall for all the video sequences is generally satisfactory, close to or above 90%. The precision score depends on how much 3D structure is contained in the scene. The “road” sequence achieves the lowest precision score, as it contains much more parallax than the other two.

3.7.3 Parameter selection

There are a few parameters that are found to be critical for system performance. Other parameters are either less influential or can be estimated from the input data (for example, the standard deviation of the χ² distribution in the parallax filtering process).

The first important parameter, the temporal window size Δ_detect used by homography-based image registration, needs to be adjusted for various scene configurations, corresponding to different camera motion and object motion. In our experiments, we find that Δ_detect = 45 works well for a large number of sequences. In other words, a total of 90 frames (3 seconds at a frame rate of 30 frames per second) is adequate for most sequences, leading to satisfactory recall and precision measures.

Figure 3.13: Detection results of frame 45 in “Tutor Hall” with wrong parameters. [(a) temporal window too small; (b) detection threshold too high.]
Figure 3.13(a) shows the motion mask image for frame 45 in the “Tutor Hall” sequence, detected with a window size as small as 15 frames. Compared to the original residual pixels in Figure 3.11, the moving vehicle is not correctly segmented, as the image registration process fails to identify the background pixels within the shorter temporal window.

The second parameter, the image difference threshold σ_detect, is set at a low value, e.g. 30 out of 255. The threshold needs to be adjusted to different scene configurations in order to include all the possible motion pixels and enough parallax pixels as well. If the threshold is set too high, the motion regions may not be fully detected, as shown in Figure 3.13(b).

The third parameter is the temporal interval δ_t used for the estimation of the epipolar constraint and the structure consistency constraint, which is currently set to 5. If the camera motion is rather small, this interval needs to be increased for stable estimation of the geometric constraints.

3.8 Summary and Discussion

In this chapter, we have presented a novel method for detecting moving objects in video sequences viewed from moving cameras. It uses multiple geometric constraints for motion detection in two or three views. We analyzed in detail the well-known fact that a special kind of 3D motion cannot be detected by the epipolar constraint. Furthermore, we proposed a novel structure consistency constraint to relate the projective structures of a static point among three views within the “Plane+Parallax” framework, in the presence of camera motion and variation of the reference plane. The geometric constraints were integrated into a practical system for detecting and tracking moving objects observed by a moving camera. The encouraging experimental results demonstrate the effectiveness and robustness of our approach.
We argue that our approach is well suited to motion detection and tracking tasks on video sequences with the following characteristics: 1) the scene contains enough textured areas for extracting feature points; 2) the inter-frame camera motion can be well approximated by a homography mapping, i.e., either distant scenes or scenes with a dominant plane; 3) the scene contains either no parallax (perfectly planar scene) or strong parallax (a fairly large amount of parallax is needed for reliable estimation of the geometric constraints); 4) neither the camera nor the objects move abruptly. These assumptions are similar to those made in previous motion detection approaches [26, 50].

There are several directions for future research. We are interested in automatic estimation of the parameters (such as the temporal window sizes and the intensity difference threshold) for different video sequences, or even for different segments in the same sequence. This could be done by integrating the motion segmentation approaches on sparse feature points [20, 70, 74] before performing the pixel-level motion detection. Furthermore, the geometric “motion vs. parallax” analysis of image sequences can be combined with machine learning-based object recognition/classification methods to improve the accuracy and robustness of our approach.

Chapter 4

Sparse 3D Reconstruction

4.1 Introduction

We study image sequences recorded by a perspective camera undergoing general motion, where a number of rigid objects move on a ground plane. There might also exist static 3D structures not in the ground plane. A typical scenario is that of vehicles moving on the ground observed by an airborne camera, with buildings and other 3D structures present. Our goal is to obtain a 3D Euclidean reconstruction of both the static background and the moving objects.
This dynamic scene reconstruction method can be applied to a broad range of applications, including: (1) video surveillance: not only can the moving camera span a large field of view, but more semantic information about the moving objects can also be extracted, e.g. how far away and how large an object is; (2) image synthesis: the complete reconstruction of both the static background and the moving objects provides an integrated framework for building a replica of the scene and rendering it from arbitrary viewpoints; (3) augmented reality: annotative information can be drawn onto the surfaces of the moving objects, which are reliably tracked in 3D Euclidean space.

We define dynamic scene reconstruction from a moving camera as the inference of the following geometric entities from the original image sequence: (1) the 3D structure of the static background; (2) the 3D camera motion trajectory (poses) across the frames; (3) the 3D shape of each moving object (due to rigid object motion); (4) the 3D object motion trajectory across the frames. Here, the background structure or object shape refers to sparse 3D point clouds on the surface of the static background or a moving object. As both the moving objects and the static background are rigid, their 3D shape does not change over time. A motion trajectory refers to the 3D segments connecting either camera centers or object points.

The problem of reconstructing a static rigid scene observed by a moving camera (related to the first two entities), called “Structure and Motion” (SaM), has been thoroughly studied in the past decades [21, 41, 47]. Recently, the reconstruction of moving objects from a moving camera has also gained much attention [1, 12, 15, 74, 3, 64, 53, 46, 45]. However, these methods either do not infer the object trajectory, or allow only a limited class of 3D motion trajectories (e.g. lines), or require a large number of frames. In other words, they do not provide a general solution for the last two geometric entities from a small number of frames.
We propose a novel approach to obtain a complete 3D reconstruction of both the static background and the moving objects, including all four geometric entities. Before building the 3D reconstruction, the approach starts with motion segmentation and object tracking in image space, as described in Chapter 3. Each video frame is segmented into a background area and a number of tracked motion regions.

Classical SaM techniques are then applied to obtain a Euclidean reconstruction of the static background. A plane-based self-calibration method [38] is applied to estimate both the intrinsic and extrinsic parameters (poses) of the camera. Once the cameras are calibrated, the structure of the static background is computed by triangulation methods [19, 43]. The whole set of camera poses and scene structure is further refined by a “Bundle Adjustment” process [72]. All the structure and motion are obtained up to a global scale in 3D Euclidean space, known as a global “scale ambiguity”.

The reconstruction of a moving object is accomplished in a two-step method. The first step is to recover the 3D shape of the moving object based on the relative motion between the moving camera and the moving object. A rigid moving object seen by a moving camera can be treated as if a static object were observed by a virtual camera. The virtual camera shares the same intrinsic parameters as the real camera, but moves differently so as to take the object motion into account. The same SaM techniques are applied to estimate the shape of the moving object as well as the poses of the virtual camera. However, the recovered object shape is again only determined up to an unknown scale.

The second step, called the alignment step, is to estimate the correct object scale and the object motion with respect to the static background, in order to establish the 3D motion trajectory. We formally prove that the object scale and motion are interdependent: once the object motion is solved, the object scale is known, and vice versa.
We propose a novel method to infer 3D object motion from the poses of both the real and virtual cameras. We prove that if an additional linear constraint on the object translation is introduced, the object motion can be solved uniquely and linearly. From an abstract point of view, the relative camera-object motion is the addition of the camera motion and the object motion. Intuitively, the object motion can be obtained by subtracting the camera motion from the relative motion. However, due to the scale ambiguity introduced in the SaM process, this subtraction is not directly applicable unless additional assumptions are introduced.

In our approach, this assumption is formulated as a planar-trajectory constraint: the 3D trajectory of a moving object must be planar. In other words, the object's translation must be parallel to a certain plane. This assumption is less restrictive than the assumption of linear trajectories [1, 15]. We also show that degenerate cases may occur when these conditions are not satisfied, which rarely happens. If the plane is known, the object translation at each frame can be solved independently and uniquely by linear equations, as long as certain conditions are satisfied. Otherwise, the object motion is solved by searching for an optimal object scale within a valid range, subject to a rank-2 condition on the object translation.

The ground plane, which exists naturally in real-world sequences, plays an important role in our approach. It induces the 2D inter-frame homographies for motion segmentation and camera self-calibration. It also provides the constraint for 3D object alignment. If the ground plane is dominant in the scene, it will be automatically selected by robust estimation schemes such as RANSAC [11]. Otherwise, it can be interactively selected by marking at least four pairs of matched points in the first two frames, which are then automatically tracked across the frames. More feature points consistent with the planar homography are added and tracked as well.
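The rank-2 condition can be checked numerically: if every inter-frame object translation is parallel to one plane, the stacked K×3 translation matrix loses rank. A small sketch under the planar-trajectory assumption (the function name, tolerance, and sample trajectories are our own, not from the thesis):

```python
import numpy as np

def is_planar_trajectory(translations, tol=1e-8):
    """Rank-2 condition: translations all parallel to one plane
    <=> the stacked K x 3 matrix has rank <= 2 (smallest singular value ~ 0)."""
    T = np.asarray(translations, float)          # K x 3
    s = np.linalg.svd(T, compute_uv=False)
    return bool(s[-1] < tol * max(s[0], 1.0))

# translations confined to the plane z = const (normal N = (0, 0, 1))
planar = [[1.0, 0.2, 0.0], [0.8, -0.5, 0.0], [1.1, 0.9, 0.0]]
# a trajectory that climbs out of the plane
nonplanar = [[1.0, 0.2, 0.3], [0.8, -0.5, 0.7], [1.1, 0.9, -0.4]]
print(is_planar_trajectory(planar), is_planar_trajectory(nonplanar))  # True False
```

In the method described above, this condition is what the search over candidate object scales is tested against when the plane is unknown.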
This chapter is extended from an earlier version that appeared in [79]. The rest of the chapter is organized as follows. In Section 4.2, we relate our work to previous methods. Section 4.3 presents an overview of our approach and introduces the steps of motion segmentation and reconstruction of the static background. Section 4.4 presents the techniques for recovering the 3D shape of the moving objects. The problem of 3D object alignment is addressed in Section 4.5. Experimental results are given in Section 4.6. We conclude this chapter and discuss directions for future work in Section 4.7.

4.2 Related Work

The problem of estimating the structure of a static scene from a moving camera, termed “Structure and Motion”, has received much attention in recent years. The reader is referred to [21, 10] for a thorough review and to [47, 42] for details of practical systems. Here, we focus on approaches to the 3D reconstruction of dynamic scenes with independently moving objects.

The first category of methods utilizes the concept of relative camera-object motion, as in our approach, and reconstructs multiple objects at the same time. In these methods, the image feature correspondences are mapped into high-dimensional algebraic entities that embed different motion groups. Costeira [9] proposed a multi-body factorization method for affine cameras, permuting the stacked image measurement matrix into disjoint sub-blocks and decomposing each sub-block into the structure and motion of a moving object by SVD. Han [15] proposed a similar SVD-based method to estimate linear constant object motion under weak-perspective cameras.

Torr [69] proposed a geometric robust information criterion (GRIC) in a Bayesian formulation to select the optimal motion model for a group of features. Schindler [52] improved this method with robust error analysis and enhanced information selection criteria in the presence of outliers.
A family of multi-body constraints has been proposed to embed multi-body motion under perspective cameras into a high-order variety [12, 20, 3, 74]. For instance, Vidal [74] proposed a multi-body fundamental matrix embedding the motion of each object between two views. The fundamental matrix for each object is extracted from the multi-body constraint by the GPCA method.

The advantage of these methods is that the number of moving objects is automatically estimated and all the SaM tasks are completed simultaneously. However, they only recover the 3D shape of the moving objects up to unknown scales and do not estimate the 3D object motion trajectories. Furthermore, they are sensitive to image noise and matching errors: a wrong match from one object might invalidate the entire reconstruction for all objects.

The second category of methods first reconstructs the static background and then infers the 3D shape and motion of each moving object independently. Avidan [1] proposed a “trajectory triangulation” method to infer the 3D position of a point moving on a 3D line from a moving camera. The linear trajectory was extended to 2D conic sections [1] and more general curves [27]. Ozden [46] proposed a number of constraints to resolve the scale ambiguity in the estimated 3D shape of moving objects, including planar-trajectory and heading constraints. However, this method requires a large number of frames (>100) for robust estimation. Our method is most similar to this approach. In contrast, however, our method provides a closed-form solution for the object scale and motion from a minimum of two frames, and analyzes the possibility of imposing more constraints onto the object translation.

Figure 4.1: Our approach to 3D Euclidean reconstruction of dynamic scenes. [Block diagram: the original image sequence undergoes 2D motion segmentation & tracking, which yields the static background (SaM → real camera poses and 3D structure of the background) and the motion regions (SaM → virtual camera poses and 3D shape of the moving objects); a 3D alignment step then recovers the 3D object scale & motion, producing the 3D structure of the static background and the 3D trajectories of the moving objects.]
4.3 Overview of Our Approach

Our approach is composed of a number of steps, as shown in Figure 4.1. The input to our approach is an image sequence with independently moving objects, recorded by a moving camera. The first step is a motion segmentation and object tracking process, which segments every original frame into a number of 2D motion regions and a background area, and tracks the motion regions over time. Then SaM methods are applied to reconstruct the background, namely the background structure and camera poses. The 3D shape of each moving object is inferred independently from the relative camera-object motion (Section 4.4). Then a 3D trajectory for each moving object is formed by aligning its shape according to the motion at different frames with respect to the 3D background (Section 4.5). The final output includes a set of 3D object motion trajectories, as well as the static background structure.

4.3.1 Motion segmentation, object tracking, and frame decimation

We apply the method presented in Chapter 3 to detect and track the motion regions in video frames. A 2D homography transformation is utilized to compensate for the global camera motion between each pair of consecutive frames. It is estimated by robustly fitting to the matched Harris corners [6] with a RANSAC-based scheme [11]. If the ground plane is dominant in the scene, it will automatically correspond to the homography. Otherwise, the homography is computed based on at least 4 pairs of points that are interactively selected to lie on the ground plane in the first two frames and tracked across subsequent frames.

Any pixels that are not compensated for by the homography are classified as residual pixels. These residual pixels, however, include both the motion regions and parallax pixels which correspond to 3D off-plane structure with large depth variance. The latter are removed by evaluating their consistency with multi-view parallax rigidity constraints [26, 80].
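A self-contained sketch of this robust homography estimation (a four-point DLT inside a RANSAC loop; this stands in for the thesis implementation, and OpenCV's cv2.findHomography with its RANSAC method provides the same service; all data below are synthetic):

```python
import numpy as np

def fit_homography(src, dst):
    """Direct linear transform on >= 4 point pairs (inhomogeneous pixels)."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(A, float))
    H = Vt[-1].reshape(3, 3)          # null vector = homography up to scale
    return H / H[2, 2]

def transfer_error(H, src, dst):
    p = np.column_stack([src, np.ones(len(src))]) @ H.T
    return np.linalg.norm(p[:, :2] / p[:, 2:] - dst, axis=1)

def ransac_homography(src, dst, thresh=2.0, iters=500, rng=None):
    """RANSAC over minimal 4-point samples; returns (H, inlier mask)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    best_H, best_in = None, np.zeros(len(src), bool)
    for _ in range(iters):
        idx = rng.choice(len(src), 4, replace=False)
        H = fit_homography(src[idx], dst[idx])
        inliers = transfer_error(H, src, dst) < thresh
        if inliers.sum() > best_in.sum():
            best_H, best_in = fit_homography(src[inliers], dst[inliers]), inliers
    return best_H, best_in

# synthetic check: 30 correspondences under a known homography + 10 outliers
rng = np.random.default_rng(1)
H_true = np.array([[1.2, 0.1, 5.0], [-0.1, 0.9, -3.0], [5e-4, 0.0, 1.0]])
src = rng.uniform(0, 200, (40, 2))
dst = np.column_stack([src, np.ones(40)]) @ H_true.T
dst = dst[:, :2] / dst[:, 2:]
dst[30:] = rng.uniform(0, 200, (10, 2))        # gross mismatches
H, inliers = ransac_homography(src, dst)
print(inliers[:30].all())                      # True: the planar matches are kept
```

In the pipeline above, the rejected points play the role of the residual pixels: points inconsistent with the dominant-plane homography are exactly the motion-region and parallax candidates.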
Then the 2D object trajectories are formed by tracking the motion regions across multiple frames [29]. Furthermore, a number of frames with larger time intervals, called key frames, are selected from the whole sequence, since longer camera baselines lead to more reliable 3D reconstruction [47]. The key frames are determined by examining whether the feature matches are better explained by the epipolar geometry model than by the 2D homographies [47]. Typically 5-10 key frames are selected out of a 100-frame sequence. Hereafter, the 3D reconstruction process deals with the key frames only.

4.3.2 Preliminaries for 3D reconstruction

Let us introduce a number of notations needed by the reconstruction process. Let K denote the number of key frames, and N the number of 3D points. The camera pose (extrinsic parameters) at frame k (∈ {1,...,K}) is represented by the 3D camera center C_k and the rotation matrix R_k. Without loss of generality, we assume that the camera coordinate system at frame 1 is the world coordinate system, namely R_1 = I and C_1 = 0. The camera intrinsic parameters are assumed to be constant across key frames, with zero skew and square pixels. The intrinsic parameter matrix is

    K = [ f  0  u_0 ]
        [ 0  f  v_0 ]
        [ 0  0   1  ]

where f is the focal length and (u_0, v_0) is the principal point.

Let X_j = (x, y, z)^T denote the 3D Euclidean coordinates of the j-th point (j ∈ {1,...,N}) and x_kj = (u, v, 1)^T the homogeneous coordinates of its 2D projection in frame k. The full perspective projection model [21] is expressed as follows:

    x_kj ∼ K R_k [I | −C_k] (X_j^T, 1)^T    (4.1)

where ∼ denotes equality up to an unknown scale and I is the 3×3 identity matrix.

Let N = (n_1, n_2, n_3)^T denote the normal vector of the ground plane and d the distance of the world origin from the plane. The points on the ground plane are constrained by the plane equation N^T X = d.

4.3.3 3D reconstruction of the static background

The reconstruction process starts by finding feature correspondences from the static background.
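As a concrete illustration of the projection model (4.1), a minimal numerical sketch with hypothetical intrinsics (f = 800, principal point (320, 240)) and the frame-1 convention R_1 = I, C_1 = 0:

```python
import numpy as np

def project(K, R, C, X):
    """Eq. (4.1): x ~ K R [I | -C] (X, 1)^T; note that R [I | -C] (X; 1) = R (X - C)."""
    x = K @ R @ (np.asarray(X, float) - C)
    return x[:2] / x[2]          # divide out the unknown scale

K = np.array([[800.0, 0, 320], [0, 800, 240], [0, 0, 1]])
print(project(K, np.eye(3), np.zeros(3), [0.5, -0.25, 10.0]))  # [360. 220.]
```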
The same set of feature points used in the motion segmentation is matched between key frames, instead of between consecutive frames. Due to the long baseline between key frames, SIFT feature descriptors are extracted and matched for each feature point [35]; these perform well under 3D viewpoint variation. Only those points lying in the 2D background areas are selected and matched across the key frames.

Classical SaM techniques are then applied to solve for the 3D background structure and camera poses [21]. If the camera intrinsic parameters are known, the relative camera pose between two key frames can be estimated from at least 5 pairs of point matches by the method in [43]. In our situation, we can take advantage of the inter-frame 2D homographies computed in the motion segmentation process. As these homographies are induced by the same ground plane, a camera self-calibration method [38] is applied to iteratively estimate the camera intrinsic and extrinsic parameters, as well as the position of the ground plane.

Given the camera poses and 2D matches, the 3D point positions in Euclidean space are estimated by triangulation methods [21]. Before triangulating, we apply the inter-frame homographies to classify the 2D point matches into two subsets: the matches that are consistent with the homographies correspond to 3D points lying in the ground plane, called planar matches, and the rest correspond to points above the ground plane, called non-planar matches.

A non-linear triangulation method [19] is applied to the non-planar matches between consecutive pairs of frames. The triangulation problem is formulated as searching for a pair of epipolar lines which minimizes the 2D distances between the reprojected points and the epipolar lines. The 3D point is then obtained by intersecting the optical rays derived from the epipolar lines.

For each planar match {x_kp} (p ∈ {1,...,N}), we apply a least-squares fitting method to estimate the corresponding 3D point X_p with the aid of the plane equation.
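This plane-aided least-squares fit can be sketched as follows: for each view we stack the linear cross-product constraints [x]_× K R_k (X − C_k) = 0 together with the plane equation N^T X = d, and solve the overdetermined system. The two-view setup, poses, and intrinsics below are hypothetical:

```python
import numpy as np

def skew(v):
    return np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0.0]])

def triangulate_on_plane(K, poses, xs, N, d):
    """Stack [x]_x K R (X - C) = 0 for every view plus N^T X = d,
    then solve for X in the least-squares sense."""
    A, b = [], []
    for (R, C), x in zip(poses, xs):
        M = skew(np.append(x, 1.0)) @ K @ R   # rows multiplying X
        A.append(M)
        b.append(M @ C)                       # move the C term to the right-hand side
    A.append(np.asarray(N, float)[None, :])   # plane constraint row
    b.append(np.array([float(d)]))
    X, *_ = np.linalg.lstsq(np.vstack(A), np.concatenate(b), rcond=None)
    return X

K = np.array([[800.0, 0, 320], [0, 800, 240], [0, 0, 1]])
poses = [(np.eye(3), np.array([0.0, 0, -10])), (np.eye(3), np.array([2.0, 0, -10]))]
X_true = np.array([1.0, 2, 0])                # lies on the plane z = 0

def proj(R, C):                                # noise-free observations
    p = K @ R @ (X_true - C)
    return p[:2] / p[2]

xs = [proj(R, C) for R, C in poses]
print(triangulate_on_plane(K, poses, xs, N=[0.0, 0, 1], d=0.0))  # ~ [1. 2. 0.]
```

The extra plane row is what keeps the solution from drifting below the ground plane when the views are nearly degenerate.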
A number of linear equations are derived from (4.1) for the frames spanned by the point match, as follows:

    (x_kp)_[×] K R_k (X_p − C_k) = 0    (4.2)

where (v)_[×] denotes the 3×3 skew-symmetric matrix derived from the 3-vector v [21]. The linear system of equations in (4.2) is augmented by the plane equation as follows:

    N^T X_p = d    (4.3)

Intuitively, (4.3) states that the intersection of all the optical rays must also lie in the ground plane. This additional constraint helps remove irregular solutions of points below the ground plane and improves the numerical stability of the triangulation method.

Starting from a pair of key frames, the reconstruction process continues with each subsequent key frame by sequentially solving the camera poses and updating the set of 3D points [47]. After all key frames are processed, both the camera poses and the point positions are refined by the “Bundle Adjustment” method [72, 33], which minimizes the following objective function:

    ε_static = Σ_k Σ_j ||x_kj − K R_k (X_j − C_k)||² + Σ_p |N^T X_p − d|²    (4.4)

The observations in the objective function are the 2D points x_kj lying in the static background area. The parameters to be refined include the camera intrinsic parameters K, the camera poses (R_k, C_k), the point positions X_j, and the plane equation (N, d). The first term in the objective function is the commonly used sum of squared reprojection errors. The second term imposes constraints onto the points on the ground plane {X_p} by minimizing their distances to the ground plane.

4.3.4 Scale ambiguity

It is known that there exists a global scale ambiguity in the 3D reconstruction of a static scene [21]. If both the object points and the camera centers are scaled by a non-zero factor σ, the 2D projections in (4.1) do not change: K R_k (σX_j − σC_k) ∼ σ x_kj ∼ x_kj. The scale ambiguity exists because the camera translation can be solved only up to an unknown scale during camera pose estimation. In contrast, the camera rotation is uniquely determined.
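The ambiguity is easy to verify numerically: scaling every camera center and every scene point by the same σ reproduces identical pixels (all values below are illustrative):

```python
import numpy as np

K = np.array([[800.0, 0, 320], [0, 800, 240], [0, 0, 1]])
R = np.eye(3)
C = np.array([1.0, -2, 0])
X = np.array([0.5, 0.3, 8.0])

def pixel(sigma):
    p = K @ R @ (sigma * X - sigma * C)   # K R_k (sigma X_j - sigma C_k)
    return p[:2] / p[2]                   # the common factor sigma cancels

for sigma in (1.0, 2.0, 17.3):
    print(pixel(sigma))                   # [270. 470.] every time
```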
Figure 4.2 illustrates the effect of the scale ambiguity, where an object point X is observed by a moving camera with its center denoted by C_k (k = 1, 2, 3). C_1 is assumed to be at the world origin, namely C_1 = 0. The scaled camera centers σC_k and the scaled point σX generate the same 2D projections. There exists only one scale σ* for which the object scale and camera centers are identical to the real-world situation, and it cannot be recovered unless additional physical information about the real world is introduced.

Figure 4.2: Global scale ambiguity in the 3D reconstruction of static scenes. [An object point X and its scaled copy σX, observed from camera centers C_1, C_2, C_3 and their scaled copies σC_2, σC_3; each camera is depicted by a triangle facing the object point.]

Due to the scale ambiguity, the same set of 2D points corresponds to an infinite set of 3D reconstruction results. We show how this ambiguity causes difficulties for object motion estimation in Section 4.5.

4.4 3D Shape Recovery of Moving Objects

Now that the Euclidean reconstruction of the static background is obtained, we turn to the 3D reconstruction of the moving objects, each of which is treated independently with the same method. In this section, we first present an overview of the process of reconstructing a moving object's shape, scale and motion. Then the inference of the 3D object shape is discussed. The estimation of the object scale and motion is discussed in the next section.

4.4.1 Reconstruction process for moving objects

Similar to the case of the static background, the reconstruction process starts by extracting SIFT feature matches that lie in the motion regions across the key frames. Before applying the SaM algorithms, we first evaluate how much object detail is available, and then determine which reconstruction method is more appropriate.

It is known that the performance of SaM algorithms is influenced by a number of factors, including the number of image features and the spatial distribution of the features in the image [21]. These factors are termed detail levels of the considered object shape in this chapter.
Intuitively, high-detail objects lead to reconstruction results that embed more detail about the true 3D object surface, while low-detail ones may correspond to only a small portion of the surface. Reconstructing a moving object in the scene is influenced by its detail level: the number of feature points tracked across multiple views is usually small, and the features lie within the motion regions, which correspond to a narrow field of view. Therefore, we need to determine whether the moving object contains enough detail or not.

A moving object is called a high-detail one if it satisfies the following criteria; otherwise it is a low-detail one:

• At least 5 points are tracked within the motion regions across the key frames;
• The average area spanned by the points in each image is larger than a 2D W×W window (e.g. W = 20 pixels).

Figure 4.3: Reconstruction of moving objects with different detail levels. [Flow chart: motion regions → feature extraction & matching → feature matches → “enough detail?”; if yes, model selection chooses a motion model and SaM recovers the virtual & real camera poses and the 3D object shape, followed by general motion estimation of the 3D object scale & motion; if no, trajectory triangulation is applied.]

The first criterion ensures that the camera motion between key frames can be reliably estimated [21], while the second one requires the spatial distribution of the feature matches to be large enough.

Low-detail reconstruction. When the image does not contain enough detail on the moving object, we need to introduce further information for reconstruction. That is, the object's motion trajectory is assumed to belong to a known class, including 3D lines, 2D conic sections, and more general trajectories [1, 27]. We first test if the object is moving on straight lines and then try 2D conic sections. A RANSAC-based algorithm [11] is applied to select the minimal set of points for a trajectory model and to test all the other points for consistency with the trial motion model.
Then the trajectory triangulation method is applied to estimate the 3D trajectories of object points. High-detail reconstruction. When there are enough image cues for applying the classical SaM methods, a robust motion selection method [48] is applied to decide which motion model best fits the feature matches. The method for high-detail shape reconstruction is presented in the following sections. After the object shape is recovered, we also need to infer the scale and motion of the moving object, with respect to the static background. Figure 4.3 shows the work flow of the process of reconstructing moving objects from a moving camera. The extracted feature matches are tested against the two criteria for determiningthedetaillevelofthemovingobject. Thendifferentreconstructionmethods are selected based on different detail levels: trajectory triangulation methods for low- leveldetailsandclassicalSaMmethodsforhigh-leveldetails. Oneadvantageoflow-detail trajectory triangulation is that the scale and motion of object points are uniquely solved as well, and avoids the need for an additional alignment process. However, as indicated by its name, it does not extract as much amount of details about the object surface as high-level SaM methods. Hereafter, we discuss the cases of high-level details only. 4.4.2 Shape inference from relative motion As 2D motion regions are the projection of moving objects into a moving camera, it is not feasible to apply the standard SaM methods to infer the object shape from motion regions. Instead, the relative motion between the moving camera and the moving object is utilized to provide 3D cues of the object shape, as is consistent with real-life human 88 C 1 v C 2 v C 3 v X 3 r X 2 r X 1 r C 1 r C 2 r C 3 r X v Figure 4.4: The conversion of “real camera+moving object” to “virtual camera+static object”. perception. 
The configuration of a moving object seen by a moving camera is then converted into another one of that the object remains static while being observed by a hypotheticalmovingcamera, calledthe“virtual”camera. The camerawhichrecordsthe actual sequence is called the “real camera”. As shown in Figure 4.4, the virtual camera shares the same intrinsic parameters with the real one and overlaps with it at frame 1. However, the virtual camera undergoes a different motion, due to the combination of motions of both the real camera and the moving object. Let us introduce a superscript “v”, for virtual cameras, to the notations of camera poses and object positions. Then each 2D point ¯ x kj in motion regions is generated by the virtual camera pose (R v k ,C v k ) and object pointX v j as follows, x kj ∼KR v k X v j −C v k (4.5) Similarly, a superscript “r” is added to the configuration of real cameras, resulting in the notations of (R r k ,C r k ,X r j ). 89 The pose of the virtual camera (R v k ,C v k ) in frame 1 is identical to that of the real camera, R v 1 =R r 1 =I, T v 1 =T r 1 =0 (4.6) Insubsequentframes,thevirtualcamerahasdifferentrelativeposesthantherealcamera due to object motion. Standard SaM methods are then applied to obtain the 3D object shape and virtual camera motion. If 2D homographies (or collineation in 3D Euclidean space [38]) are se- lected as the best motion model, the reconstructed object surface can be approximated by a planar patch. The camera poses and object shape (planar surface) are solved by the plane-based self-calibration method [38]. Otherwise, the essential matrices [21] are selected as the basis for recovering general 3D object shape. The camera poses are ob- tained by decomposing the essential matrices estimated between consecutive pairs of key frames into 3D rotation and translation and the object shape is solved by triangulation methods [43]. 
Similar to the situation of real cameras, the final 3D reconstruction is refined by Bundle Adjustment, minimizing

    E_virtual = \sum_k \sum_j \| x_{kj} - K R^v_k (X^v_j - C^v_k) \|^2    (4.7)

where x_{kj} are the observed 2D points. The fixed parameters are the camera intrinsic parameters K shared by both the real and virtual cameras. The parameters to be refined include the virtual camera poses (R^v_k, C^v_k) and the object points X^v_j.

The inferred object shape is also affected by the scale ambiguity, as in the case of the static background. The unknown scale cannot be recovered until some a priori knowledge of the object shape is available. This scale ambiguity can be removed by imposing additional constraints, as introduced in the next section.

4.5 3D Alignment of Moving Objects

After inferring the 3D shape of each moving object, we are interested in aligning the inferred shape into the static background. In other words, a 3D trajectory is established for each moving object with respect to the static background in 3D Euclidean space. In order to establish 3D trajectories, we need to resolve the ambiguous scale of the inferred 3D object shape so that the object matches the static background at the correct scale. We also need to determine the 3D object motion at each frame, in order to animate the object points. As shown in this section, the object scale and motion are interdependent, and can be solved by imposing additional constraints onto the object translation.

4.5.1 An integrated solution to object scale and motion

As is also pointed out in [46], the object motion cannot be uniquely determined in the presence of the scale ambiguity. Let us investigate the projection process of moving object points. In the imaginary configuration of "virtual camera + static object", the 3D reconstruction is computed from the 2D object points as follows:

    x_{kj} \sim K R^v_k [I | -\sigma C^v_k] (\sigma X^v_j, 1)^T    (4.8)

where an unknown scale σ is introduced for the ease of later analysis.

Now, let us return to the real-life situation of "real camera + moving object".
As the object is moving between frames, let X^r_{kj} denote the 3D position of the jth object point in frame k relative to the real camera, where the superscript "r" is short for real. By construction, the 3D poses of the real and virtual cameras are identical in frame 1. Therefore, the initial 3D positions of the object points are also the same, X^r_{1j} = X^v_j. Object points in subsequent frames k are transformed from their original positions in frame 1 by rigid object motion as follows:

    \sigma X^r_{kj} = \sigma R^b_k X^v_j + T^b_k    (4.9)

where (R^b_k, T^b_k) denotes the object's rotation and translation with respect to frame 1, and the superscript "b" indicates object motion. Indeed, the translation vector T^b_k is the 3D displacement of the object centroid across frames, although the object centroid is generally unknown. It is obvious that R^b_1 = I and T^b_1 = 0.

Alternatively, the object motion can be represented incrementally between frames k−1 and k, where k = 2, ..., K:

    \sigma X^r_{kj} = \sigma \Delta R^b_k X^r_{k-1,j} + \Delta T^b_k    (4.10)

where \Delta R^b_k = R^b_k (R^b_{k-1})^{-1} and \Delta T^b_k = T^b_k - \Delta R^b_k T^b_{k-1}. Once we know the object motion at frame k−1 and the incremental motion (\Delta R^b_k, \Delta T^b_k), the position of an object point at frame k can be inferred from (4.10).

The 2D object points x_{kj} are then generated by projecting the moving object points into the real camera:

    x_{kj} \sim K R^r_k (\sigma R^b_k X^v_j + T^b_k - C^r_k)    (4.11)

Note that static objects can be considered as a special case with R^b_k = I and T^b_k = 0, which simplifies (4.11) to (4.1). The two equations, (4.8) and (4.11), are converted into matrix form and compared as below:

    x_{kj} \sim K R^v_k [I | -\sigma C^v_k] (\sigma X^v_j, 1)^T \sim K R^r_k [R^b_k | T^b_k - C^r_k] (\sigma X^v_j, 1)^T    (4.12)

Comparing the equivalent terms in (4.12), we have R^v_k \sim R^r_k R^b_k. Because the rotation matrices are orthonormal, the equality holds: R^v_k = R^r_k R^b_k.
Therefore the object rotation can be solved from the rotation matrices of the real and virtual cameras:

    R^b_k = (R^r_k)^{-1} R^v_k    (4.13)

The corresponding incremental object rotation is derived as:

    \Delta R^b_k = (R^r_k)^{-1} R^v_k (R^v_{k-1})^{-1} R^r_{k-1}    (4.14)

On the other hand, the constraint on the object translation is derived from (4.12) as:

    T^b_k - C^r_k = -\sigma (R^r_k)^{-1} R^v_k C^v_k = -\sigma R^b_k C^v_k    (4.15)

or equivalently

    T^b_k = C^r_k - \sigma R^b_k C^v_k    (4.16)

This equation shows that the object translation is a linear combination of the real camera motion and the virtual camera motion, weighted by the unknown object scale σ. Once the object scale is solved, the object translation T^b_k can be estimated uniquely as well. Inversely, once the object translation at frame k is solved, the object scale is fixed as follows:

    \sigma = - \frac{(T^b_k - C^r_k) \cdot (R^b_k C^v_k)}{\| R^b_k C^v_k \|^2}    (4.17)

We can prove the uniqueness of the solved object translation and scale by contradiction. Suppose that there are two non-zero scales σ₁ and σ₂, σ₁ ≠ σ₂, such that T^b_k - C^r_k = -σ₁ R^b_k C^v_k = -σ₂ R^b_k C^v_k for frame k. Then R^b_k C^v_k = 0, since otherwise the equality does not hold. As the rotation matrix R^b_k is non-singular, C^v_k = 0 (∀k). In other words, all the cameras overlap at the origin of the coordinate system, which contradicts the fact that the virtual camera is moving. Therefore, the uniqueness of the solution to the object scale is proved. The uniqueness of the solved object translation can be proved in a similar way.

Equations (4.15) and (4.17) show the interdependence between the object scale and motion. In practice, it is usually difficult to obtain an accurate estimate of the object scale, due to the lack of physical information about the real world. Therefore, we need to introduce additional constraints to solve the object translation first, which are discussed in the rest of this section, and solve the object scale subsequently.
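The closed-form relations (4.13), (4.16), and (4.17) are easy to sanity-check numerically. Below is a small numpy sketch with made-up poses: a configuration is synthesized from a known object rotation and scale, and both are then recovered:

```python
import numpy as np

def rot_z(theta):
    """Rotation by theta about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

# Hypothetical ground-truth configuration for one frame k
R_r = rot_z(0.1)                  # real camera rotation
R_b_true = rot_z(0.3)             # object rotation w.r.t. frame 1
sigma_true = 2.5                  # global object scale
C_r = np.array([1.0, 2.0, 0.5])   # real camera center
C_v = np.array([0.4, -0.3, 1.2])  # virtual camera center

# Forward model: R^v_k = R^r_k R^b_k  and  T^b_k = C^r_k - sigma R^b_k C^v_k (4.16)
R_v = R_r @ R_b_true
T_b = C_r - sigma_true * R_b_true @ C_v

# (4.13): recover the object rotation from the real and virtual camera rotations
R_b = np.linalg.inv(R_r) @ R_v

# (4.17): recover the scale once the translation is known
d = R_b @ C_v
sigma = -np.dot(T_b - C_r, d) / np.dot(d, d)

print(np.allclose(R_b, R_b_true), np.isclose(sigma, sigma_true))  # True True
```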
As residual errors may occur in the process of solving object motion and scale, we apply another Bundle Adjustment process to refine the object shape, motion, and scale together, by minimizing the objective function

    E_moving = \sum_{k,j} \| x_{kj} - K R^r_k (\sigma R^b_k X^v_j + T^b_k - C^r_k) \|^2    (4.18)

where the observations are the 2D object points x_{kj}, the fixed parameters include the camera intrinsic parameters K and the real camera poses (R^r_k, C^r_k), and the parameters to be refined are the object shape X^v_j, the global object scale σ, and the object motion (R^b_k, T^b_k) (except frame 1, as R^b_1 = I and T^b_1 = 0).

4.5.2 Planar-translation assumption

Due to the presence of the unknown scale factor σ in (4.15), the three unknowns in the translation vector T^b_k cannot be uniquely solved. Let us remove σ by taking the cross product of R^b_k C^v_k with both sides of (4.15):

    (R^b_k C^v_k) \times (T^b_k - C^r_k) = 0    (4.19)

The cross product by R^b_k C^v_k is embedded into a skew-symmetric matrix Q_k = (R^b_k C^v_k)_{[\times]} [21]. Then (4.19) is re-written as

    Q_k T^b_k = Q_k C^r_k    (4.20)

Figure 4.5: Object motion estimation with the planar-translation constraint. (a) unconstrained object scale and motion; (b) constrained object scale and motion; (c) degenerate cases for the planar-motion constraint.

As it is easy to prove that Q_k is rank 2, (4.20) provides only two independent linear equations for the three unknowns in T^b_k. As T^b_1 = 0, the number of unknowns in the object translations from frame 2 to K is 3(K−1), while the number of independent equations in (4.20) is 2(K−1). Therefore, we need to introduce at least K−1 additional constraints in total, or one constraint per frame k (>1).

In order to resolve the lack of constraints on the object translation, an additional constraint on the object translation is introduced, called the planar-translation assumption: the object translation must be parallel to a plane. In other words, the object motion trajectory must be planar, which happens frequently in real-life applications.
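The skew-symmetric embedding Q_k = (R^b_k C^v_k)_{[\times]} used in (4.20) can be sketched as follows; its rank-2 property is exactly what limits (4.20) to two independent equations per frame. The vectors below are arbitrary stand-ins, not values from the text:

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix [v]_x such that skew(v) @ u == np.cross(v, u)."""
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

v = np.array([1.0, -2.0, 0.5])    # stand-in for R^b_k C^v_k
u = np.array([0.3, 0.7, -1.1])    # stand-in for T^b_k - C^r_k

# The matrix reproduces the cross product ...
print(np.allclose(skew(v) @ u, np.cross(v, u)))   # True
# ... and has rank 2: only two independent linear equations per frame
print(np.linalg.matrix_rank(skew(v)))             # 2
```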
Our real-life experience supports this assumption: it is difficult for an airplane passenger to determine the size and speed of another plane flying below his own plane, yet passengers can easily tell the distance and motion of a vehicle moving on the ground plane.

This planar-translation constraint is illustrated in Figure 4.5. Without additional constraints, the object scale and motion cannot be uniquely estimated, as in (a). However, with the assumption that the object moves on a plane, a unique solution to the object scale and motion can be obtained, as shown in (b).

The planar-translation constraint implies that the object translation vector is perpendicular to the normal vector of the selected plane:

    N^T T^b_k = 0    (4.21)

where N is a 3-vector. If the plane is known, the normal vector provides one more linear constraint on the object translation in each frame. Combining (4.20) and (4.21), we can solve the object translation uniquely. In our case, this plane is selected as the ground plane, whose position is estimated in the SaM process for the static background. Indeed, any known plane will suffice, with more details given in Section 4.5.3. Even if the plane is unknown, we can obtain a solution to the object translation by searching for the optimal object scale, as explained in Section 4.5.4.

We need to point out that there exist degenerate cases in which the planar-motion constraint cannot generate a unique and non-zero solution to the object translation. The degenerate cases happen when the combination of object rotation and virtual camera translation, R^b_k C^v_k, is collinear with the real camera translation C^r_k, i.e., R^b_k C^v_k ∼ C^r_k. If these two vectors are collinear, their cross product becomes zero: R^b_k C^v_k \times C^r_k = 0. Therefore, the RHS of (4.20) becomes zero, the number of non-zero solutions to the object translation T^b_k is either zero or indefinite, and the scale ambiguity is not resolved.
The geometric meaning of the degenerate cases is that the real camera moves along a trajectory parallel to the object motion trajectory, as shown in Figure 4.5(c). We believe that this situation does not happen frequently in practice, and therefore the object translation and scale can be uniquely estimated most of the time.

If the object does not rotate while moving, namely R^b_k = I, the planar-motion trajectories become linear ones. Then the degenerate cases for planar trajectories become exactly those for linear trajectories, namely C^r_k ∼ C^v_k or T^b_k ∼ T^b_{k-1} [1]. In other words, the degenerate cases for planar-motion trajectories are a superset of those for linear ones.

If the degenerate cases do happen, the object scale and motion need to be approximated by introducing additional assumptions. For example, the object scale may be approximated by assuming that at least one reconstructed object point lies on the plane. In other words, the ambiguous scale of the object shape can be determined by stretching the object shape to different scales until one point hits the ground plane. Since all the object points lie on or above the plane, we have N · σX_j − d ≥ 0. By assuming that at least one point satisfies the equality N · σ*X_j − d = 0, we obtain an approximate object scale: σ* = d / (N · X_j).

4.5.3 Solving object motion with a known plane

Before applying the planar-motion constraint based on a known plane, let us first consider a general form of constraints on the object translation. The general constraint on the object translation at frame k is expressed as φ(T^b_k) = 0. Although the translational constraints can be any closed-form functions, linear ones are easier to find and compute than higher-order ones.

We propose to introduce M (≥ 1) linear constraints on the object translation at frame k, T^b_k, as below:

    A_k T^b_k = b_k    (4.22)

where A_k is an M×3 matrix and b_k is an M×1 vector for frame k.
A_k and b_k are not necessarily constant across frames. Indeed, the constraints can be derived from various kinds of information in previous or current frames, such as camera poses, object motion, and shape.

Ideally, the additional linear equations lead to a unique solution to the object translation. However, practical numerical computation requires a clear evaluation of the existence and uniqueness of the solution, as shown in the following theorem.

Theorem 1. A linear, unique, and non-zero solution to the object translation T^b_k exists if and only if the following conditions hold:

1. rank-3 condition:

    rank [Q_k; A_k] = rank [Q_k, Q_k C^r_k; A_k, b_k] = 3    (4.23)

2. non-zero condition: ||Q_k C^r_k|| + ||b_k|| > 0.

Proof: The object translation T^b_k needs to be solved from an inhomogeneous system of M+3 linear equations:

    [Q_k; A_k] T^b_k = [Q_k C^r_k; b_k]    (4.24)

where Q_k and A_k provide 3 and M equations, respectively. The existence and uniqueness of a solution to T^b_k depend on two rank-3 conditions [66]: (1) the columns of the LHS coefficient matrix in (4.24) must span a 3-dimensional subspace (uniqueness), which is equivalent to requiring that the rank of the LHS coefficient matrix be 3; (2) the RHS vector must lie in the spanned space (existence), which translates to the rank-3-ness of the augmented coefficient matrix, composed of the LHS coefficient matrix and the RHS vector.

Furthermore, as the object translation is in general non-zero, one more condition is imposed: the RHS vector in (4.24) must not equal zero, i.e., its magnitude must be larger than zero. If the RHS vector becomes zero, then no matter whether the rank-3 conditions are satisfied or not, no unique non-zero solution to the object translation can be found. In practice, the magnitude of the RHS vector is compared to a very small positive number, for example, 1e−10.
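Theorem 1's conditions map directly onto a linear solve. Below is a hedged numpy sketch with hypothetical per-frame quantities; `solve_translation` is an illustrative helper, not code from the dissertation. It stacks (4.20) with one planar constraint of the form (4.21), checks the rank-3 and non-zero conditions, and solves for T^b_k:

```python
import numpy as np

def skew(v):
    return np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])

def solve_translation(Q, Cr, A, b, eps=1e-10):
    """Solve the stacked system [Q; A] T = [Q Cr; b] under Theorem 1's conditions.

    Q    : 3x3 skew matrix (R^b_k C^v_k)_[x]
    Cr   : real camera center C^r_k
    A, b : M x 3 matrix and M-vector of extra linear constraints (e.g. N^T T = 0)
    Returns T^b_k, or None when the conditions fail.
    """
    M = np.vstack([Q, A])
    rhs = np.concatenate([Q @ Cr, np.atleast_1d(b)])
    aug = np.hstack([M, rhs[:, None]])
    if np.linalg.matrix_rank(M) != 3 or np.linalg.matrix_rank(aug) != 3:
        return None                        # rank-3 condition violated
    if np.linalg.norm(rhs) <= eps:
        return None                        # non-zero condition violated
    return np.linalg.lstsq(M, rhs, rcond=None)[0]

# Hypothetical frame: translation parallel to the ground plane N = (0, 0, 1)
N = np.array([0.0, 0.0, 1.0])
T_true = np.array([0.8, -0.4, 0.0])        # satisfies N . T = 0
d = np.array([0.5, 1.0, -0.7])             # stand-in for R^b_k C^v_k
Cr = T_true + 2.0 * d                      # consistent with T - Cr = -sigma d, sigma = 2
T = solve_translation(skew(d), Cr, N[None, :], np.zeros(1))
print(np.allclose(T, T_true))              # True
```

In the degenerate case (d collinear with Cr) the right-hand side vanishes and the helper returns None, matching the non-zero condition of Theorem 1.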
Therefore, as long as the rank-3 conditions and the non-zero condition hold, T^b_k can be obtained uniquely from the equation system by Gaussian elimination or SVD.

Alternatively, the linear constraints can be imposed on the incremental object translation \Delta T^b_k = T^b_k - \Delta R^b_k T^b_{k-1}, based on the assumption that T^b_{k-1} and \Delta R^b_k are already known. The original constraints on the object translation are modified as follows:

    Q_k \Delta T^b_k = Q_k (C^r_k - \Delta R^b_k T^b_{k-1})    (4.25)

Additional linear constraints are imposed on the incremental translation, similar to (4.22):

    A'_k \Delta T^b_k = b'_k    (4.26)

where A'_k and b'_k are the counterparts of A_k and b_k in the case of incremental translation.

Theorem 1 shows that not all additional linear constraints generate valid solutions to the object translation. The object translation can be reliably solved from (4.24) only when the rank-3 conditions and the non-zero condition are satisfied. The planar-motion constraint is a special case of (4.22) where A_k = N^T and b_k = 0. As long as the plane normal vector N satisfies the two conditions, T^b_k can be solved without ambiguity. However, when the degenerate cases of the planar-motion constraint occur, the RHS vector of the augmented equation system in (4.24) becomes a zero vector and violates the non-zero condition.

After the object translation in each frame is solved, a corresponding object scale σ_k is solved from (4.17). Due to numerical errors, the scales estimated from different frames may differ. The global object scale is then selected as the average of the scales from all frames:

    \sigma = \frac{1}{K} \sum_{k=1}^{K} \sigma_k    (4.27)

4.5.4 Solving object motion based on an unknown plane

Let us first express some existing constraints on object translation in our context.
For example, the linear-trajectory constraint [1, 15] can be translated into the condition that the incremental translation at frame k is collinear with that of frame k−1, namely \Delta T^b_k \times \Delta T^b_{k-1} = 0, or, in the form of (4.22):

    (\Delta T^b_{k-1})_{[\times]} \Delta T^b_k = 0    (4.28)

Indeed, as both \Delta T^b_{k-1} and \Delta T^b_k are unknown in (4.28), 2(K−2) non-linear constraints are introduced in total. Furthermore, if the constant-speed assumption [15] also holds, we have \Delta T^b_k = \alpha \Delta T^b_{k-1}. If α = 1, the motion has constant speed; if α > 1 (< 1), the object is accelerating (decelerating) along the line. These constraints are inter-dependent with the solutions from other frames.

[45] presented a thorough study of the planar-trajectory constraint when the plane itself is unknown. In it, the true object scale is estimated by finding the minimal determinant of point covariance matrices over a large number of frames; only the true object scale leads to a reasonable planar trajectory. In our case, however, the number of key frames is much smaller (5–15 frames). Therefore, we apply a different approach to utilize the planar-trajectory constraint.

Before describing the approach, let us first define the stacked version of (4.16):

    T^b = C^r - \sigma R^b C^v    (4.29)

where the stacked 3×3K object rotation matrix and 3×K translation matrix are R^b = [R^b_1 ... R^b_K] and T^b = [T^b_1 ... T^b_K], and the stacked 3×K motion matrices of the real and virtual cameras are C^r = [C^r_1 ... C^r_K] and C^v = [C^v_1 ... C^v_K].

One can observe from (4.29) that the stacked object translation matrix is a linear combination of the real camera motion and the virtual camera motion, weighted by the object scale. If the object scale σ varies, the object translation matrix changes too.

Based on the planar-translation constraint, the object translation at frame k is constrained by (4.21). In other words, T^b_k in any frame k lies in the 2-dimensional null space of the normal vector N, despite the fact that N is unknown.
Therefore, the stacked object translation matrix is rank-2:

    rank(T^b) = 2    (4.30)

Intuitively, the rank of the stacked object translation matrix is 3 if the object translation vectors are not parallel to the unknown plane. We argue that there exists a unique object scale such that the rank of the stacked object translation matrix is 2. In order to find this object scale, we apply the Singular Value Decomposition (SVD) [66] to T^b, generating three singular values denoted by s_1, s_2, s_3 in non-ascending order. The rank-2 condition is then converted into s_1 ≥ s_2 ≥ ε_1 and s_3/s_2 < ε_2, where ε_1 and ε_2 are two thresholds controlling the accuracy of practical floating-point computation.

We first obtain an approximation of the object scale σ̄ by the method described in Section 4.5.2. Based on this initial scale, a range of possible object scales is defined as [σ̄/λ, σ̄λ], where λ (>1) controls how large the range is. Then we apply a Monte-Carlo method to generate multiple samples of the object scale within the range [14] and test each sample against the rank-2 conditions. The object scale which passes the rank-2 test and achieves the minimum s_3/s_2 is selected as the true object scale.

When the degenerate cases of the planar-motion constraint occur, the centers of either the virtual or the real cameras lie in a plane, so that both stacked matrices C^r and C^v have rank 2. The stacked object translation matrix T^b, as a linear combination of C^r and C^v, is then inherently rank-2, no matter how the object scale σ varies.

4.6 Experimental Results

Our method was tested on a number of real-world image sequences. All the sequences are shot by moving cameras recording dynamic scenes with multiple objects moving on a constant ground plane. The sequences are divided into two categories: one shot in indoor environments by handheld cameras, and the other in outdoor environments by airborne cameras.
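Before turning to the experiments, the rank-2 scale search of Section 4.5.4 can be sketched as follows. The configuration is synthetic (planar translations and randomly drawn camera centers; all names and values are illustrative), and, for simplicity, the object is assumed not to rotate:

```python
import numpy as np

def planarity_ratio(sigma, C_r, C_v, R_b):
    """s3/s2 of the stacked translation T^b_k = C^r_k - sigma R^b_k C^v_k (cf. 4.29)."""
    T = np.stack([C_r[k] - sigma * R_b[k] @ C_v[k] for k in range(len(C_r))], axis=1)
    s = np.linalg.svd(T, compute_uv=False)   # singular values, non-ascending
    return s[2] / s[1]

rng = np.random.default_rng(0)
K_frames, sigma_true = 8, 3.0
R_b = [np.eye(3) for _ in range(K_frames)]            # no object rotation, for simplicity
T_true = np.column_stack([rng.normal(size=K_frames),  # planar: zero z-component
                          rng.normal(size=K_frames),
                          np.zeros(K_frames)]).T
C_v = [rng.normal(size=3) for _ in range(K_frames)]
C_r = [T_true[:, k] + sigma_true * C_v[k] for k in range(K_frames)]

# Monte-Carlo-style search: sample candidate scales, keep the one minimizing s3/s2
candidates = rng.uniform(sigma_true / 2, sigma_true * 2, size=2000)
best = min(candidates, key=lambda s: planarity_ratio(s, C_r, C_v, R_b))
print(best)
```

At the true scale the stacked translation matrix has an exactly zero third row, so s_3/s_2 vanishes; the sampled minimizer therefore lands close to σ_true up to the sampling resolution.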
In all the sequences, the moving objects contain enough 3D non-planar structure that the essential matrices are selected as the motion models. Both qualitative and quantitative results are presented to demonstrate the effectiveness and robustness of our method.

4.6.1 Qualitative results of indoor image sequences

We first show two sequences taken by a handheld camera in office environments, with camera tilt angles from 45 to 60 degrees. Cardboard boxes were placed on the floor as off-plane structure. For the indoor sequences, more details about the object shape can be recovered, as the objects are close to the camera. However, the motion segmentation task faces larger difficulties in the indoor scenes due to the large parallax effects.

Figure 4.6: Results of the "toy-car" sequence. (a) original frames (60, 75, and 90); (b) segmented frames (60, 70, and 80); (c) inferred background structure (left two: wireframe; rightmost: textured); (d) inferred object shape (left two: wireframe; right three: textured); (e) 3D object trajectory.

The first video contains a toy-car moving along a curve on the floor. Three original video frames are shown in Figure 4.6(a). The corresponding segmented motion regions are labeled in red in Figure 4.6(b). 9 of the original 100 frames are selected as key frames. SIFT feature points [35] are extracted from the key frames as the basis for the SaM process. About 20 feature matches are extracted from the background and from the moving object, respectively.

The reconstructed 3D shape of the background and the moving toy-car is shown in Figure 4.6(c) and (d), as both wireframe and textured models. The texture image is generated by stitching the background areas from consecutive video frames for an expanded view of the ground plane. The surface mesh model is manually selected from the surface points. As one can observe, the 3D shape of the toy-car is visually satisfactory, preserving the properties of planar surfaces and right angles.
Yet the cardboard box is reconstructed as a parallelepiped instead of a cuboid. This is because the camera views it at such a high tilt angle that the perspective effect of the box is not large enough.

The 3D motion trajectory of the toy-car is shown in Figure 4.6(e). The object points run along different trajectories while the 3D centroid undergoes planar-translation motion. As one can observe from the different view angles, the car is rotating and translating parallel to the ground plane, which is what truly happens in the real world. The planar-motion constraint is sufficient to uniquely determine the size and motion of the toy-car.

In the second video, a volleyball is rolling on the floor, as shown in Figure 4.7(a)-(c). Note that due to the color similarity with the floor, the white regions on the ball are not fully detected, and some background area is included in the motion region as well, as shown in Figure 4.7(d)-(f). However, this does not affect finding good feature correspondences from the patterns on the ball surface.

Figure 4.7: Results of the ball sequence. (a)-(c) original frames 45, 54, and 65; (d)-(f) segmented frames 45, 54, and 65; (g) fitted 3D sphere; (h) real vs. virtual camera poses; (i) 3D motion trajectory.

Knowing that the ball is a sphere, its parameters (center and radius) are estimated by fitting a sphere to the reconstructed points. Then any point on the sphere, even an invisible one, can be reconstructed on the spherical surface (Figure 4.7(g)). In Figure 4.7(h), one can observe that the centers of the virtual cameras are far away from those of the real cameras. This is because the distance between the camera centers and the ball center is
Although the points on the ball follow a complex 3D trajectory, the 3D trajectory of the sphere center lies in a plane parallel to the ground plane, shown as a red segmented 3D line connecting dotted ball models in Figure 4.7(i).

4.6.2 Qualitative results of outdoor image sequences

We also show experimental results on one video sequence shot by an airborne video camera. As shown in Figure 4.8(a), two cars are making turns on the ground plane while being followed by a moving camera. This is a difficult case, as there are multiple moving objects and the objects either induce self-occlusion or are partially occluded due to the narrow field of view.

Three key frames out of the original image sequence are shown in Figure 4.8(a). In (b), the corresponding motion segmentation and object tracking results are shown with a different color mask imposed onto each moving object. The motion regions with the same color belong to the same 2D motion trajectory. SIFT feature points are extracted from the segmented motion regions and the static background, as shown in Figure 4.8(c), and matched across the key frames. Despite occlusion and image noise, we are able to extract a number of points from the visible parts of both cars, with the textured view given in Figure 4.8(d).

Based on the planar-translation constraint, two motion trajectories are constructed, one for each moving object, as shown in Figure 4.8(e). In the figures, the positions of each car across all the frames are shown as gray wireframe models, while the car's current position is highlighted with a color similar to its motion regions. By comparing multiple rendered images of the motion trajectories, one can observe that both 3D motion trajectories accurately depict the real-world motion in the original video sequence.

Figure 4.8: Results of the "forest" sequence. (a) original frames 1200, 1216, and 1232; (b) segmented motion regions; (c) extracted point matches; (d) recovered 3D shape of moving objects; (e) 3D motion trajectories of both objects.

4.6.3 Quantitative results of the sequences

The 3D reconstruction results from all the video sequences are quantitatively evaluated by measuring the average 2D reprojection errors. Indeed, there are three such reprojection errors: one generated by the real cameras and the static background in (4.4), one by the virtual cameras and the moving objects in (4.7), and one by the real cameras and the moving objects in (4.18).

Table 4.1: Average 2D reprojection errors (in pixels) for all the sequences.

    Errors (# of Points)       | Toy-car    | Ball       | Forest (red car) | Forest (blue car)
    Real Cam. + Background     | 0.794 (23) | 1.037 (36) | 2.245 (17)       | 2.245 (17)
    Virtual Cam. + Moving Obj. | 0.781 (29) | 0.651 (28) | 1.438 (13)       | 1.055 (25)
    Real Cam. + Moving Obj.    | 0.781 (29) | 2.232 (28) | 1.986 (13)       | 1.832 (25)

The average errors for all three sequences, as well as the total number of reconstructed points, are shown in Table 4.1. The results for the "forest" sequence span two columns, as there are two independently moving objects, referred to as the red and blue cars. As one can observe, the errors for reconstructing the static background are larger than those for the moving objects. This might be caused by the fact that feature correspondences from a cluttered background are less reliable than those from a single compact object. Nevertheless, good reprojection errors of around one pixel are achieved in both indoor sequences. The results of the outdoor sequence are not as good as the indoor ones, due to smaller perspective effects and larger image noise.

Another observation is that the reconstruction quality of the static background does not influence that of the moving objects, and vice versa. However, the quality of object alignment is influenced by the reconstruction results of both parts of the scene. As shown in the third row, the motion estimation error of the toy-car is not even noticeable. In contrast, the error for the ball sequence is large, probably due to the large movement of the virtual camera, which is estimated less stably.
The effect of object alignment for both cars in the forest sequence lies between those of the two other sequences.

4.7 Summary and Discussion

We have presented a novel approach to 3D reconstruction of dynamic scenes with objects moving on the ground plane, observed by a moving camera. The reconstruction process starts with the inference of 3D camera motion and background structure. Then each moving object is reconstructed independently with respect to the background. The 3D shape of a moving object is solved from the relative motion between the object and the camera. Then the object scale and motion are solved from the real camera motion and the relative camera-object motion. We have shown through detailed analysis that the object rotation can be solved uniquely, and yet there exists a scale ambiguity in the solution to the object translation. Additional linear constraints are needed to resolve the scale ambiguity in this process. In our situation, the ground plane is utilized to impose such a constraint: the translation of moving objects must be parallel to the ground plane. Both qualitative and quantitative results demonstrate that our approach is effective and robust.

There are many directions for future research. We are interested in extending the sparse representation of the object surface (point clouds) to a denser representation (surface meshes), which is described in Chapter 5. The surface mesh model can be fitted to the dense 3D point clouds obtained by multi-view stereo matching methods [47]. Another direction is to extend the approach from rigid object motion to non-rigid motion. For example, non-rigid human body or face motion can be decomposed into a number of rigid parts [52, 69]. Then the motion trajectories of all the parts can be established with inter-connection constraints between the parts, forming the final set of motion trajectories for the human body or face.
Chapter 5

Dense Volumetric 3D Reconstruction

5.1 Introduction

We study video sequences with one or more objects moving rigidly on a ground plane. There may exist static 3D structure in the scene as well. The video camera undergoes general 3D motion while imaging the scene by perspective projection. A typical example is that of an airborne camera following vehicles running on a road. The goal of our approach is to infer the dense 3D shape of the scene from video sequences. The scene, including moving objects and static background, is divided into a set of 3D volume elements, called voxels, organized in an adaptive octree structure [65]. The shape inference task is then formulated as a voxel occupancy labeling problem, namely deciding whether each voxel is occupied by an object or not.

This work has a broad range of potential applications. For instance, it can be applied to improve aerial surveillance systems with the inference of the 3D distance and motion of moving objects. It can also be used for visual modeling and image synthesis when the object contains more details.

The reason for aiming at a voxel-based method rather than a surface-based one [51] is that it is very difficult to recover the 3D shape of an object's surface from a moving camera while the surface itself is also moving. Existing solutions (e.g. [37]) work only with an affine camera, not the perspective camera in our case. In contrast, as the voxels remain static in the scene, their occupancy can be effectively determined by photo consistency tests [56, 73], regardless of whether the reconstructed object is moving or not. Object surfaces can then be inferred from the voxel occupancy results, although at a coarse level.

The main difference between our approach and existing volumetric methods [56, 31, 60, 59, 76, 71] is that our method directly handles the case where moving objects are observed by a moving camera. As the moving objects occupy different voxels at different times, the label of a voxel may change over time.
In other words, a dynamic label, instead of a constant one, is assigned to each voxel. Therefore, the dense 3D shape of the background structure and the moving objects is inferred in a uniform manner. Another difference is that we do not have as much control of the scene (e.g. turntables) as in the previous methods, as our real-world videos are captured by passive image sensors. The number of calibrated views (usually 8-10) is also smaller than in those methods. Therefore, our results may not reach the same quality level as those methods do. However, our approach can be applied not only to the classical area of photo-realistic image synthesis, but also to 3D object tracking and mensuration.

The overview of our approach is presented in Figure 5.1. A number of initial processes are required before the shape inference process. The first process (described in Chapter 3) is to segment the original images into motion regions and static background. After removing the pixels that belong to the static background, we identify a number of motion regions in each frame and track them across the sequence. The second one (described in Chapter 4) is to build a sparse reconstruction of both the background and the moving objects, namely camera poses and sparse sets of 3D points.

Figure 5.1: Overview of our approach. 2D motion segmentation and tracking, 3D structure from motion, and dense 3D shape inference turn the original video sequence (2D motion regions, background area) into 3D camera poses, sparse 3D background structure, sparse 3D shape and motion of moving objects, 3D bounding volumes, and finally a labeled 3D voxel octree.

Structure from Motion techniques [21, 47] are applied to reconstruct the background and each moving object individually. The 3D motion of the moving objects is solved by assuming that the object motion trajectory is planar [79]. The volumetric shape inference process starts by determining the bounding volume of voxels.
A slanted pyramid is established for each camera by projecting the image rectangle onto the ground plane. The bounding volume, which is the union of all the pyramids, is decomposed into an octree structure of voxels with different sizes. The voxels which correspond to image patches with more details, such as motion regions, are subdivided into smaller voxels.

The voxel labeling process is based on a robust photo-motion variance measure defined between two voxels at two time instants. The measure combines both photometric and motion cues to determine whether two voxels belong to the same surface patch of an object. The photo variance is derived from the normalized cross correlation between multiple oriented image patches projected from the voxels. The motion variance evaluates how close the moved voxel center is to the other one.

The voxel labeling process is divided into three steps. The first one utilizes the sparse set of points solved in the initial processes. Since these points are located on the object surface, the voxels which contain these points are labeled as the corresponding object. The voxels which intersect the optical rays extending to these surface points are labeled as empty.

The second step is a deterministic voxel coloring process. Each voxel is visited in ascending order of its distance to the camera centers. At each voxel, the photo-motion variance is evaluated as if the voxel belonged to each possible object. If the minimum variance is above a threshold, the voxel is carved away. Otherwise, the voxel is assigned the label of that object (the static background is also treated as an object, with zero motion). The photo-motion variance measure is evaluated between multiple pairs of frames, in order to remove possible occlusion effects. This deterministic step, however, does not impose any global constraints on the labeling results and relies on a pre-determined threshold.

The final step of the labeling process refines the labeling results by an energy-minimization-based method.
A global energy function, which integrates the photo-motion variances from all voxels, is minimized by the Graph Cuts method [7]. The energy function also includes a smoothness cost, which imposes a smoothness constraint on adjacent voxels in both the temporal and spatial domains. The deterministic and graph cuts methods are repeated a few times with different variance thresholds, which reduces the chance of getting stuck at a local minimum.

The rest of this chapter is organized as follows. Related work is reviewed in Section 5.2. Section 5.3 presents the initial processes of our approach. The volumetric decomposition and the three-step voxel labeling method are introduced in Sections 5.4 and 5.5 respectively. The experimental results are shown in Section 5.6, followed by the conclusion and discussion in Section 5.7.

5.2 Related Work

There exists a large body of work on building volumetric models from multiple calibrated images. A survey and performance evaluation of recent approaches are presented in [55]. Only those methods that are most relevant to our work are reviewed here. These methods are divided into two categories: deterministic methods and energy-minimization-based methods.

The Voxel Coloring method [56] is one of the first deterministic methods that combine image silhouette and color information to build 3D volumetric models. Since any voxel on a Lambertian surface corresponds to consistent image patterns, a photo consistency test is applied to every voxel: if the color variance is larger than a threshold, the voxel is labeled as empty; otherwise it is labeled as part of the object surface. A number of subsequent approaches, such as Space Carving [31] and Generalized Voxel Coloring [59], extend the original approach with more general camera placements and more efficient labeling methods. These deterministic methods, however, face the difficulty of finding an appropriate threshold for carving the inconsistent voxels, which is inherently varying in
Furthermore, these methods does not impose global smoothness and may make conflicting decisions in the sequential carving process. In contrast, energy minimization based methods do not have such problems. Instead of making a harsh decision for each voxel, a global energy function is defined to accumu- latethevarianceofallvoxels. Thisenergyfunctionisminimizedbydiscreteoptimization techniques, such as Graph Cuts [7]. [60] first introduced the Graph Cuts method to find the optimal visual hull. [76] and [71] extend the approach by utilizing the visual hull as a topological constraints over the scene and searching for an optimal 3D volumetric cut in a voxel based graph. Further constraints are imposed with known surface patches. Our method extends these approaches to handle dynamic scenes with moving objects. To our knowledge, the only other paper on volumetric reconstruction of dynamic video scenes is [73]. In it, the space carving scheme is extended to carving pairs of voxels between two time instants, called hexels. A stereo pair of camera is used to capture the 3D scene, thus allowing non-rigid object motion and dense scene flow. In our case, however, we use only one camera and assume that the object motion is rigid. Furthermore,ourapproachisnotlimitedtotwoframes,butalsoworksonlongsequences. 5.3 Initial Processes of the Approach A number of initial processes are needed before inferring the dense 3D shape. The first processistosegmentthemovingobjectsfromthestaticbackgroundandthentrackthem along the video sequence (Chapter 3). 2D homographies between consecutive frames, which is induced by the ground plane, are robustly computed from the matched Harris 118 corners [16]. Any pixels that are inconsistent with the inter-frame homographies are labeled as residual pixels, including motion regions and parallax pixels corresponding to static3Dstructures. Theparallaxpixelsarefilteredoutbyimposingepipolarconstraints [21] and parallax rigidity constraints [80]. 
The detected motion regions are then linked into different object trajectories across multiple frames by a spatio-temporal tracking method. Consequently, the original video sequence is segmented into the static background (plane and parallax) and a number of 2D object trajectories. Video frames with large inter-frame camera motion are preferred for a reliable reconstruction. Therefore, a number of distinct frames with long baselines, called "key frames", are automatically selected from the whole sequence [47]. Hereafter, the reconstruction processes deal only with these key frames.

The second process applies Structure from Motion (SfM) techniques to these key frames to build a sparse 3D reconstruction of both the static background and the moving objects (Chapter 4). Camera poses and the position of the ground plane are obtained by a plane-based self-calibration method [38]. Then the 3D points on static structures are computed by classical SfM techniques [21, 47]. The 3D shape of each moving object is solved by the same SfM techniques based on the relative motion between the object and the moving camera. The object scale and motion trajectory are recovered by enforcing an additional constraint that the object motion must be parallel to the ground plane [79].

The output of the two initial processes includes a set of calibrated images, in which a number of motion regions are identified and tracked. The sparse 3D shape of the background structure and the moving objects is estimated, as well as the 3D object motion trajectory.

5.4 Volumetric Decomposition

In this section, we discuss how the 3D scene is adaptively decomposed into a set of voxels. Then a robust photo-motion variance measure is defined to evaluate how well the voxels correspond to the original images.

5.4.1 Bounding volume estimation

With the 2D motion regions and 3D camera poses obtained in the previous processes, we can determine the 3D bounding volumes of the static background and the moving objects.
Most previous approaches build a visual hull by intersecting the optical rays originating from image silhouette boundaries [32]. However, this is not possible in a scene with moving objects, as the optical rays back-projected from the 2D motion regions do not intersect at the same point. In our situation, the ground plane is used to bound the object volume. Originating from each camera center, four optical rays pass through the image corners until they hit the ground plane. The resulting quadrangle is connected to the camera center and forms a 3D slanted pyramid. This pyramid serves as the visual hull in our situation, as it is easy to prove that every object voxel lies in the pyramid. Similarly, a 3D pyramid is determined for each moving object by projecting the boundaries of the motion regions in each image onto the ground plane.

The final bounding volume is the union of the pyramids generated by all cameras. One advantage of using the ground plane is that the bounding volume can be determined automatically, without knowing the scene dimensions in advance.

Figure 5.2: 3D volumetric decomposition of the scene (camera, original image, motion region, pyramid, ground plane, and subdivided voxels).

5.4.2 Adaptive volume tessellation

The 3D object volume is divided into a set of voxels adapting to the different levels of detail in the scene. Each voxel can be subdivided into eight sub-voxels of smaller size, as shown in Figure 5.2. In order to determine the subdivision level of a certain voxel, we project the bounding box of this voxel into all the original images and measure the sizes of the projected polygons. The polygon size affects the accuracy of the photo consistency tests and ultimately the reconstruction quality. Therefore, a voxel is subdivided repeatedly until its projected polygon is no larger than the maximal region size. In our experiments, it is set to 20×20 pixels for the background area and 5×5 pixels for motion regions. A hierarchical spatial structure, namely an octree [65], is utilized to organize these voxels in a coarse-to-fine order.
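The adaptive tessellation can be sketched as below. For brevity, the projected footprint of a voxel is assumed to halve with each subdivision level; in the actual method it is measured by projecting the voxel's bounding box into every image. The function names, footprint values, and depth limit are illustrative.

```python
def subdivide(center, half, footprint_px, max_px, leaves, depth=0, max_depth=6):
    """Recursively split a cubic voxel (center, half-size) into 8 children
    until its projected image footprint is no larger than max_px pixels,
    or a depth limit is reached."""
    if footprint_px <= max_px or depth == max_depth:
        leaves.append((center, half))  # a leaf node of the octree
        return
    h = half / 2.0
    for dx in (-h, h):
        for dy in (-h, h):
            for dz in (-h, h):
                child = (center[0] + dx, center[1] + dy, center[2] + dz)
                subdivide(child, h, footprint_px / 2.0, max_px, leaves,
                          depth + 1, max_depth)

# A root voxel whose footprint is 80 px with a 20 px limit is split twice,
# yielding 64 leaf voxels of a quarter of the original size.
leaves = []
subdivide((0.0, 0.0, 0.0), 1.0, 80.0, 20.0, leaves)
```

In the actual method the 20-pixel limit would apply to background voxels and a tighter 5-pixel limit to voxels covering motion regions.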
Only the leaf nodes in the octree are used in the shape inference process. Hereafter, any voxel we refer to is indeed a leaf node in the octree. Figure 5.2 illustrates the 3D volumetric decomposition of the scene, where only one camera is shown for simplicity. The bounding pyramid (in blue) is determined by projecting 3D optical rays passing through the corners of the image rectangle (in blue) and the motion regions (in red) onto the ground plane. The voxels in the pyramid may be subdivided into smaller voxels, which are organized in an octree structure.

5.4.3 Photo-motion variance

In the voxel labeling process, if a voxel belongs to the static background or any moving object, the voxel itself and its corresponding voxel at any other time must have consistent photo appearances and similar 3D motion. In order to measure this consistency, we define a robust photo-motion variance function between two voxels at two different time instants. If the two voxels belong to the static background, they are indeed identical. Otherwise, the two voxels reside at different locations related by the 3D object motion.

Let v denote a voxel in 3D space, with bounding box B(v) and center C(v). Let t ∈ T = {1,...,T} denote a time instant, where T is the number of key frames. The photo-motion variance between voxel v at time t and voxel v′ at time t′ is defined as follows:

    φ(v,t,v′,t′) = λ_p φ_p(v,t,v′,t′) + λ_m φ_m(v,t,v′,t′)    (5.1)

where φ_p and φ_m are respectively the photo and motion variance, normalized into [0,1]. λ_p, λ_m ∈ (0,1) balance the contributions of the two variances, with λ_p + λ_m = 1. The larger the variance, the less likely the two voxels belong to the same part of the object surface.

Similar to [71], the photo variance is computed based on the normalized cross correlation (NCC) between image patches. The NCC is affected by the orientation of the image patches and reaches its maximum if the 2D patch orientations are consistent with the corresponding 3D surface orientation.
Since the 3D surface orientation is unknown, we need to extract multiple oriented image patches for the same voxel to increase the probability of finding the true 3D surface orientation. A number of orientation angles (usually 3-4) are estimated at the projected voxel center by finding the maxima in 1D gradient histograms [35]. For each orientation angle θ_i, an image patch w_i(v,t) is extracted at the projection of the voxel center C(v) in image I_t. Similarly, a number of image patches w′_j(v′,t′) are extracted from image I_{t′} for voxel v′. The final photo variance is computed by selecting the maximum NCC between pairs of image patches:

    φ_p(v,t,v′,t′) = α(t,t′) (1 − max_{i,j} NCC[w_i(v,t), w′_j(v′,t′)])    (5.2)

where α(t,t′) ∈ [0,1] is a weighting factor inversely proportional to the distance between the camera centers at times t and t′. α is scaled such that it is 1 for the same camera and 0.5 for the largest distance between camera centers. Note that (5.2) does not always prefer the closest view. Instead, it selects the window orientations which maximize the NCC between any pair of views. Figure 5.3 gives an example of computing the photo variance between multiple oriented image patches. Among all the patches (labeled in green) in each image, the one that leads to the maximum NCC, or equivalently the minimum photo variance, is selected as the best patch (in red). The 2D orientation of the best patch is consistent with the 3D orientation of the ground plane.

Figure 5.3: Examples of oriented image patches projected from the same voxel into two images.

The motion variance between two voxels depends on the 3D displacement between them. Let the 3D rigid motion of object k from time t to t′ be denoted by M^k_{t,t′} = (R^k_{t,t′}, T^k_{t,t′}), where R and T are respectively the 3D rotation and translation of the object. For the static background, R^k_{t,t′} = I and T^k_{t,t′} = 0.
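Before turning to the motion term, the oriented-patch photo variance of Eq. (5.2) can be sketched as follows. Patch extraction and orientation estimation are abstracted away: patches are passed in as flat intensity lists, and the camera-distance weight alpha is supplied by the caller; all names are illustrative.

```python
import math

def ncc(w1, w2):
    """Normalized cross correlation of two equally sized patches, in [-1, 1]."""
    n = len(w1)
    m1, m2 = sum(w1) / n, sum(w2) / n
    num = sum((a - m1) * (b - m2) for a, b in zip(w1, w2))
    d1 = math.sqrt(sum((a - m1) ** 2 for a in w1))
    d2 = math.sqrt(sum((b - m2) ** 2 for b in w2))
    return num / (d1 * d2) if d1 > 0 and d2 > 0 else 0.0

def photo_variance(patches_t, patches_t2, alpha=1.0):
    """Eq. (5.2): keep the best (maximum) NCC over all pairs of oriented
    patches of the two voxels, so the most consistent orientation wins."""
    best = max(ncc(w1, w2) for w1 in patches_t for w2 in patches_t2)
    return alpha * (1.0 - best)
```

Two patches that differ only by gain and bias have NCC 1 and hence zero photo variance, which matches the Lambertian-surface assumption behind the consistency test.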
Then the motion variance is defined by evaluating how closely the center of voxel v is driven by the motion M^k_{t,t′} to the center of voxel v′:

    φ_m(v,t,v′,t′) = ‖R^k_{t,t′} C(v) + T^k_{t,t′} − C(v′)‖ / r(v′)    (5.3)

where r(v′) is the half diagonal length of the voxel box B(v′). Intuitively, when the voxel center C(v) moves to overlap with the other voxel center C(v′), the motion variance between the two voxels is 0. When the moved voxel center lies on the surface of the voxel box B(v′), the motion variance is 1.

5.5 Voxel Labeling Process

In this section, we first introduce preliminary definitions and then propose a three-step voxel labeling method.

A label l = L(v,t) is assigned to voxel v at each time instant t ∈ {1,...,T}. The label l can be EMP (empty), BAK (static background), or OBJ_k (the k-th moving object), k = 1,...,K, where K is the number of moving objects. The static background can be considered as the 0-th object, namely BAK = OBJ_0. The label of a voxel may change over time, resulting in a label sequence L(v) = {L(v,t) | t = 1,...,T}. A voxel can be invisible to a certain camera (INV), empty (EMP), or belong to the background structure (BAK). Once a voxel is labeled as background, its label remains BAK at all times. Otherwise, a voxel may transition from being empty to being occupied by a moving object, or even be occupied by different moving objects over time. Typical examples of label sequences are shown in Table 5.1.

Table 5.1: Examples of voxel label sequences at time instants 1-5: EMP (empty), BAK (background), OBJ_k (the k-th moving object)

    t:    1      2      3      4      5
        EMP    EMP    EMP    EMP    EMP
        BAK    BAK    BAK    BAK    BAK
        EMP    OBJ_1  OBJ_1  OBJ_1  EMP
        OBJ_1  EMP    OBJ_2  OBJ_2  EMP

5.5.1 Initialization with surface points

A number of 3D points have already been computed in the SfM process. These points, which lie on the surface of the background structure and the moving objects, are used to initialize a subset of the voxel labels. Let P^k_t denote a 3D point belonging to object k at time t.
A 3D line segment C_t P^k_t is established by connecting P^k_t to the camera center C_t; it is a part of the optical ray from C_t. All the voxels intersected by this line segment are labeled as empty at time t, while the voxel containing P^k_t itself is labeled as OBJ_k at time t.

More voxel labels can be initialized with further information about the scene. For example, if a 3D surface patch is known to be part of the k-th object, the voxels intersected by the patch are assigned OBJ_k and all the voxels penetrated by the corresponding line segments are labeled as empty. This is especially useful when the ground plane is known to occupy a large portion of the original scene.

5.5.2 Deterministic labeling method

We propose a deterministic voxel labeling method that assigns each voxel a label sequence based on the photo-motion variance. It is assumed that the scene does not lie in the convex hull of all the camera centers. Therefore, we can iterate through the voxels in ascending order of the distance of the voxel centers to the surface of the camera bounding box, which guarantees that the labeling decision for a voxel only affects the decisions for subsequently visited voxels, not the previous ones [56].

For each voxel v that is visible at time t, we compute its photo-motion variance as if it were assigned each label OBJ_k. If v is assigned the background label (BAK), then v′ = v. The photo variance is computed by averaging its variance scores between t and the other time instants:

    φ(v,t,BAK) = (1/|T_b(v)|) Σ_{t′ ∈ T_b(v)} φ(v,t,v,t′)    (5.4)

where T_b(v) ⊆ T is the set of time instants at which v is visible and belongs to the static background. Since φ_m(v,t,v,t′) = 0, Eq. (5.4) is similar to the photo variance functions defined in [56, 71].

If v is assigned OBJ_k (k ≥ 1) at time t, the new position of the voxel center C(v) at any other time t′ is predicted by its motion from t to t′. If the moved voxel center does not lie in the bounding volume of object k, the variance is not evaluated.
Otherwise, the voxel v′ containing the moved voxel center is searched for in the octree and is used to compute the variance φ(v,t,v′,t′). The final variance for v being assigned label OBJ_k is computed as

    φ(v,t,OBJ_k) = (1/|T_m(v)|) Σ_{t′ ∈ T_m(v)} φ(v,t,v′,t′)    (5.5)

where T_m(v) ⊆ T is the set of time instants satisfying: 1) v is visible to the camera at time t; 2) v moves to overlap with v′; 3) v′ is visible to the camera at t′ and remains in the bounding volume of object k. Eq. (5.5) is a generalized version of the photo variance defined in [73].

The overall deterministic labeling method is summarized in Algorithm 1. The voxels are iterated in ascending order of the distance of their centers to the bounding box of the camera centers. For each voxel v, the photo-motion variance φ(v,t,l) is computed at time t for every possible object label l (BAK or OBJ_k). Let l* denote the label which minimizes φ(v,t,l). If the minimum variance φ(v,t,l*) is below a threshold δ, the voxel is labeled as l* at time t: L(v,t) = l*. Otherwise, the voxel is labeled as empty at time t, or we say the voxel is "carved". Notice that we need to make T labeling decisions for each voxel v, instead of only one as in the case of static scenes.

Algorithm 1: Deterministic voxel labeling algorithm
    Sort all the voxels in ascending order of their distance to the camera convex hull;
    for every voxel v in the sorted list do
        for every time instant t ∈ {1,...,T} do
            if v is occluded by other voxels then go to the next voxel;
            for every possible object label l do
                compute the photo-motion variance φ(v,t,l);
            end
            find the label l* which minimizes φ(v,t,l);
            if φ(v,t,l*) < δ then L(v,t) = l*;
            else L(v,t) = EMP;
        end
    end

This deterministic method, however, has a few limitations. First, it relies on a pre-determined threshold, which is hard to choose across different sequences, or even across different image regions.
Second, it may make a decision for a voxel that conflicts with the one made for a voxel visited earlier, as it only considers the "local" information available to the current voxel. Third, the labeling method does not impose any global smoothness constraints on the voxel labels.

5.5.3 Graph cuts based optimization

The limitations of the deterministic labeling method can be avoided by considering all the voxels at the same time. Let the total set of labels, one per voxel per time instant, be denoted by L = {L(v,t) | v = 1,...,V, t = 1,...,T}. We define an energy function over the label set L [7] as follows:

    E(L) = Σ_{(v,v′)} Σ_{(t,t′)} D(v,t,v′,t′,l) + Σ_{(v,v′)} Σ_{(t,t′)} V(l,l′)    (5.6)

where (v,v′) denotes a pair of voxels adjacent in space and (t,t′) a pair of adjacent time instants. This energy function is expected to reach its minimum when all the voxels are given correct labels. Therefore, the voxel labeling problem is converted into a global energy minimization problem. The graph cuts algorithm is applied to efficiently solve this problem, with both α-expansion and αβ-swap iterations repeated until the global energy converges to its minimum [7].

The data cost function D(·) measures how well the voxel labels match the original images. In our implementation, it is defined as the photo-motion variance φ(v,t,v′,t′,l) with additional surface constraints. Specifically, when the voxel v is known to be on the surface of the k-th object, D(v,t,l) = +∞ if l ≠ OBJ_k. This guarantees that the resulting surface passes through the existing surface points.

The second cost, V(·), imposes smoothness constraints over voxels adjacent in both the spatial and temporal domains. For a voxel v, (v,t) is a neighbor of the same voxel at any other time instant (v,t′). For a pair of voxels v and v′, (v,t) and (v′,t′) are neighbors only if v and v′ are adjacent in 3D space and |t−t′| < δ_t, where δ_t is a constant time interval.
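The spatio-temporal neighborhood just described can be sketched as follows, with voxels given as (center, half_size) axis-aligned cubes. The touching-boxes adjacency test and the default delta_t are illustrative assumptions, not the dissertation's exact implementation.

```python
def spatially_adjacent(v1, v2, eps=1e-9):
    """Two axis-aligned cubic voxels touch or overlap iff their center
    distance along every axis is at most the sum of their half sizes."""
    (c1, h1), (c2, h2) = v1, v2
    return all(abs(a - b) <= h1 + h2 + eps for a, b in zip(c1, c2))

def are_neighbors(v1, t1, v2, t2, delta_t=2):
    """(voxel, time) pairs linked by the smoothness term: the same voxel at
    two different times, or spatially adjacent voxels with |t1 - t2| < delta_t."""
    if v1 == v2:
        return t1 != t2
    return abs(t1 - t2) < delta_t and spatially_adjacent(v1, v2)
```

Note that this handles the octree case directly: a large voxel and a small one are neighbors whenever their boxes share a face, edge, or corner, regardless of their subdivision levels.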
The smoothness term V(l,l′) is set to be the classical Potts energy model [7].

5.5.4 Summary of the labeling process

The final method combines all three steps, as summarized in Algorithm 2. The combination of the deterministic method and the Graph Cuts based method is repeated a few times with different variance thresholds drawn within the valid range of the photo-motion variance. Indeed, this changes the starting point of the optimization process and reduces the probability of getting stuck at a local minimum. The labeling which leads to the minimum energy is kept as the final result.

Algorithm 2: Voxel labeling process
    Initialize the voxels with surface points;
    repeat
        Randomly select a deterministic threshold within the range of photo-motion variances;
        Label the voxels by the deterministic method;
        Refine the labels by the graph cuts method;
        Keep the labeling results with minimum energy;
    until enough trials have been finished;

The time complexity of our algorithm is analyzed as follows. Let L, V, T denote respectively the number of possible object labels, the total number of voxels, and the number of time instants. The deterministic method makes O(LT) operations at each voxel to find the best label sequence, resulting in O(LTV) operations in total. The graph cuts method evaluates the data cost of each voxel at each time instant and the smoothness cost between voxel-time pairs, resulting in O(LTV + LT²V²) operations. Therefore, the total complexity of the voxel labeling method is O(LT²V²).

5.6 Experimental Results

We demonstrate the effectiveness and robustness of our method by testing it on two video sequences. The first video, called "toy-car", is a 150-frame sequence shot indoors by a hand-held camera. A toy car moves on the ground, while a box is placed as the background structure. The results for two frames, out of 8 key frames in total, are shown in Figure 5.4 (a) and (b).
The original images (left column in Figure 5.4 (a) and (b)) are segmented into motion regions belonging to the toy car (labeled in red) and the static background (middle column). The 3D background structure (the ground plane and the box) is shown in the right column, as well as the 3D object trajectory (in gray), in which the sparse object shape at the current frame is indicated in red. The inferred dense 3D shape of both the background and the moving object is shown in Figure 5.4(c)-(e), where each opaque voxel is rendered with the average color of its corresponding 2D image patches. Figure 5.4(c) shows the labeling results of the deterministic coloring step, which does not impose a global smoothness constraint. After refining the labeling results with the graph-cuts step, we can easily identify the 3D shape of the static box and the moving toy car in Figure 5.4(d). From the different viewpoint in Figure 5.4(e), we can also clearly see that the toy car occupies different voxels in the two frames.

Figure 5.4: Results of frame 90 (left) and 110 (right) from the "toy-car" sequence. (a) original image, segmented image, and corresponding 3D shape for frame 90; (b) original image, segmented image, and corresponding 3D shape for frame 110; (c) dense 3D shape (after deterministic coloring); (d) dense 3D shape (after graph cuts); (e) dense 3D shape (after graph cuts; another view).

The second sequence, called "forest", is a 100-frame sequence shot by an airborne camera which follows two cars making turns on the ground. The original images in Figure 5.5(a) are segmented into the motion regions (different colors indicate different object labels) and the ground plane, as shown in Figure 5.5(b). The dense 3D shape of both cars and the ground plane is correctly inferred from 9 key frames, as shown in Figure 5.5(c) and (d). The 3D object motion can be clearly identified by comparing the two images in Figure 5.5(d).
Figure 5.5: Results of frame 1200 (left) and 1232 (right) from the "forest" sequence. (a) original images; (b) motion regions; (c) dense 3D shape (after graph cuts); (d) dense 3D shape (after graph cuts; another view).

Below are a few numbers on the space and time complexity of our method. The scene in the "toy-car" sequence is decomposed into a total of 135,217 voxels, of which 32% belong to the toy car. The three processes (motion segmentation and tracking, SfM, and dense shape inference) take approximately 20 minutes, 5 minutes, and 1.5 hours respectively, in which the three labeling steps take respectively 5 minutes, 20 minutes, and 1 hour. The intensive parallel computation in these processes can be accelerated by a factor of 10 or more with a GPU implementation [24].

5.7 Summary and Discussion

We have proposed a novel approach to inferring the dense 3D shape of a scene in which objects move rigidly on a ground plane while being observed by a moving camera. The scene is decomposed into a set of voxels organized in an adaptive octree structure. A robust photo-motion variance measure was introduced to evaluate the inconsistency between the voxels and the original images. We proposed a three-step method for finding the best dynamic label for each voxel so as to minimize the total variance. The experimental results demonstrated the effectiveness and robustness of our method.

There are a few directions for future research. It is necessary to investigate how to reliably extract surface meshes, which change due to object motion, from the labeled voxels. Furthermore, it would be interesting to extend the rigid object motion to non-rigid deformations, with potential applications in 3D human modeling and motion capture.

Chapter 6

Conclusions

6.1 Summary of Contributions

In this dissertation, we have presented an integrated approach to infer the 2D and 3D shape of scenes containing rigid moving objects from a moving camera. It is a generic geometric analysis method with a broad range of applications. Encouraging experimental
Encouragingexperimental resultsonalargenumberofimagesequencedemonstratetheeffectivenessandrobustness of our approach. My contributions in each stage of the whole process are listed as follows: • Multiple image registration (Chapter 2). – Implementation of a multi-scale feature point detector and an efficient multi- orientation matching method; – ImplementationofaRANSACbasedalgorithmforrobustestimationofmulti- view geometric constraints; – Implementation of a multi-image registration method with non-linear refine- ment. 135 • Motion segmentation (Chapter 3). – A decision tree based scheme that combines the geometric constraints to clas- sify each pixel into the planar background, parallax, or moving objects (mo- tion vs. depth analysis); – A novel 3-view geometric constraint that relates the pixels belonging to the static background within the “Plane+Parallax” framework; – Apracticalandrobustsystemfordetectingandtrackingmovingobjectsfrom a moving camera in presence of strong parallax. • Sparse 3D Reconstruction (Chapter 4). – An integrated process that reconstructs the shape of both the static back- ground and the moving objects as well; – A novel method to reconstruct the 3D object motion trajectory with the planar-motion assumption; – Theoreticalanalysisoftheadditionalassumptionimposedontoobjectmotion. • Dense 3D reconstruction (Chapter 5). – A novel multi-view variance measure that evaluates the photo-motion consis- tency between labeled voxels and the original images; – A novel three-stage voxel labeling method that assigns dynamic labels to the voxels in order to minimize the global photo-motion variance. 136 6.2 Future Directions The future directions for each task are discussed in their corresponding chapters. Here, we take a perspective on the future directions of our approach at a higher level. 
• Image understanding by integrated geometric inference and object recognition

As mentioned in the introduction, our approach does not assume that any model information about the scene is available. The 2D and 3D shape is inferred solely from the photometric and geometric cues extracted from the image sequences. However, these cues are sensitive to illumination changes, occlusion, viewpoint changes, and so on [54, 21, 47]. More semantic information about the scene is, of course, preferred. We are interested in improving our approach by integrating the geometric inference with object recognition methods.

Our motion segmentation method requires the objects to be moving all the time in order to provide constant motion cues. Otherwise, an object will be treated as structure in the static background. There exists a large body of work in the areas of object detection and recognition from still images and videos. For example, the 2D patterns of moving vehicles [61] or walking humans [77] can be reliably detected in video sequences. These methods generate robust estimates of an object’s position along the sequence, even when the object stops moving. The motion segmentation method may perform much better by combining pixel motion cues with object patterns [78].

As for the 3D reconstruction methods, the semantic information of the scene inferred by object recognition methods is also very helpful. As demonstrated in [22, 23], different parts of the scene are first recognized as various kinds of surfaces, and then reconstructed based on the inferred surface properties, such as planar patches and right angles. Another example concerns vehicles again: if the 2D patterns of a vehicle are recognized in the image [61], the corresponding model in a 3D vehicle model database can be used to provide constraints for more accurate inference of the dense 3D shape in a shorter time.
• Automatic applicability determination

We do not claim that our method can handle any given image sequence, where various kinds of complex scenes may be present. Indeed, we are certain that some of our tasks cannot be applied to a given image sequence if it does not meet appropriate requirements. It is therefore necessary to design an algorithm that automatically determines whether our tasks can be applied to the sequence or not.

Image sequences are classified into four categories by different camera and object characteristics. The camera may be static, purely rotating (with zero translation), or undergoing general motion (rotation + non-zero translation). Two types of object dimensions are considered: 1) small objects that can be well approximated by one or a few points and contain little detail (texture) on the object surface; 2) large objects that correspond to a fairly detailed 3D surface and contain much texture. The corresponding applicable tasks are listed in Table 6.1. The reasons why a task is applicable in each scenario have been explained in the corresponding chapters.

Table 6.1: Camera-object categorization of dynamic video scenes

Task                                  Camera Motion            Object Dimension
Motion Segmentation                   Static/Rotation/General  Small/Large
Reconstruction of Static Background   General                  Small/Large
Reconstruction of Moving Objects      Static/Rotation/General  Large
Reconstruction of Both Parts          General                  Large

We are interested in designing a scheme that automatically selects the most appropriate tasks for a given image sequence. For example, the image sequences shown in Chapter 3 contain smaller objects, which invalidates the applicability of the 3D reconstruction methods. Another example is that an image sequence shot by a purely rotating camera is not suited for the 3D reconstruction tasks, as the required camera translation is not present, although the moving objects may still be reliably segmented from such a sequence.
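A scheme of this kind could start from Table 6.1 encoded as a simple lookup over camera-motion and object-dimension categories. The following sketch is purely illustrative; the function and category names are our own, not part of the system described in this dissertation:

```python
# Hypothetical encoding of Table 6.1: for each task, the set of camera
# motions and object dimensions under which it is applicable.
CAMERA_MOTIONS = ("static", "rotation", "general")
OBJECT_SIZES = ("small", "large")

APPLICABILITY = {
    "motion_segmentation":        (set(CAMERA_MOTIONS), set(OBJECT_SIZES)),
    "reconstruct_background":     ({"general"},         set(OBJECT_SIZES)),
    "reconstruct_moving_objects": (set(CAMERA_MOTIONS), {"large"}),
    "reconstruct_both":           ({"general"},         {"large"}),
}

def applicable_tasks(camera_motion: str, object_size: str) -> list:
    """Return the tasks from Table 6.1 applicable to a sequence with the
    given camera motion and dominant object dimension."""
    if camera_motion not in CAMERA_MOTIONS or object_size not in OBJECT_SIZES:
        raise ValueError("unknown camera motion or object dimension")
    return [task
            for task, (cameras, sizes) in APPLICABILITY.items()
            if camera_motion in cameras and object_size in sizes]
```

For instance, `applicable_tasks("rotation", "small")` returns only motion segmentation, matching the purely rotating camera example discussed above; a full automatic scheme would additionally need to estimate the camera motion type and object dimension from the sequence itself.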
• Automatic parameter estimation, error analysis and failure recovery

Our approach can be improved with automatic estimation of critical parameters, better error analysis, and the ability to recover from failure modes.

There exist a number of parameters which are critical to the performance of our method. However, these parameters are currently determined by heuristic knowledge and do not adapt well to different video sequences. Similar to the method described in Section 3.6, robust statistics techniques can be applied to estimate these parameters automatically.

Noise and errors are inevitable in the computation of vision problems. The robustness of our approach can be improved by explicitly modeling the errors of variables with appropriate distributions (e.g., Gaussian distributions), and by examining the error propagation process as well. In this manner, we can obtain more information about the reliability and accuracy of experimental results.

Our approach may fail on certain image sequences, or even on certain segments of a sequence. Although in most cases the failure is acceptable, it is necessary to detect and recover from it. For instance, when an affine motion model computed between two frames is completely wrong, the image registration process in the whole temporal window may be influenced. It is desirable to disregard any frames exceeding a failure tolerance measure [81] so that the computation of the whole sequence is not ruined by a failure on a single frame.

Bibliography

[1] S. Avidan and A. Shashua. Trajectory triangulation: 3D reconstruction of moving points from a monocular image sequence. IEEE Trans. PAMI, 22(4):348–357, 2000.
[2] J. L. Barron, D. J. Fleet, and S. S. Beauchemin. Performance of optical flow techniques. IJCV, 12(1):43–77, 1994.
[3] A. Bartoli. The geometry of dynamic scenes - on coplanar and convergent linear motions embedded in 3D static scenes. CVIU, 98(2):223–238, May 2005.
[4] P. A.
Beardsley, P. H. S. Torr, and A. Zisserman. 3D model acquisition from extended image sequences. In ECCV, 1996.
[5] J. L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, 1975.
[6] S. Birchfield. An implementation of the Kanade-Lucas-Tomasi feature tracker. http://www.ces.clemson.edu/~stb/klt/.
[7] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Trans. PAMI, 23(11):1222–1239, 2001.
[8] M. Brown and D. G. Lowe. Recognising panoramas. In Proc. of ICCV, pp. 1218–1225, 2003.
[9] J. P. Costeira and T. Kanade. A multibody factorization method for independently moving objects. IJCV, 29(3):159–179, 1998.
[10] O. Faugeras and Q.-T. Luong. The Geometry of Multiple Images. The MIT Press, 2001.
[11] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24:381–395, 1981.
[12] A. W. Fitzgibbon and A. Zisserman. Multibody structure and motion: 3-D reconstruction of independently moving objects. In Proc. of ECCV, vol. 1, pp. 891–906, 2000.
[13] R. C. Gonzalez and R. E. Woods. Digital Image Processing. Prentice Hall, 2002.
[14] J. M. Hammersley and D. C. Handscomb. Monte Carlo Methods. Methuen (London, UK), 1975.
[15] M. Han and T. Kanade. Reconstruction of a scene with multiple linearly moving objects. IJCV, 59(3):285–300, 2004.
[16] C. Harris and M. Stephens. A combined corner and edge detector. In Proc. of 4th Alvey Vision Conference, 1988.
[17] R. Hartley. In defence of the 8-point algorithm. In Proc. of ICCV, pp. 1064–1070, 1995.
[18] R. Hartley. Lines and points in three views and the trifocal tensor. IJCV, 22(2):125–140, 1997.
[19] R. Hartley and P. Sturm. Triangulation. CVIU, 68(2):146–157, 1997.
[20] R. Hartley and R. Vidal. The multibody trifocal tensor: motion segmentation from 3 perspective views. In Proc. of IEEE CVPR, vol. 1, pp. 769–775, 2004.
[21] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, second edition, 2004.
[22] D. Hoiem, A. A. Efros, and M. Hebert. Automatic photo pop-up. In ACM SIGGRAPH, pp. 577–584, 2005.
[23] D. Hoiem, A. A. Efros, and M. Hebert. Putting objects in perspective. In Proc. of IEEE CVPR, pp. 2137–2144, 2006.
[24] A. Hornung and L. Kobbelt. Robust and efficient photo-consistency estimation for volumetric 3D reconstruction. In ECCV, vol. II, pp. 179–190, 2006.
[25] P. J. Huber. Robust Statistics. Wiley-IEEE, 2003.
[26] M. Irani and P. Anandan. A unified approach to moving object detection in 2D and 3D scenes. IEEE Trans. PAMI, 20(6):577–589, June 1998.
[27] J. Y. Kaminski and M. Teicher. A general framework for trajectory triangulation. Journal of Mathematical Imaging and Vision, 21:27–41, 2004.
[28] J. Kang, I. Cohen, and G. Medioni. Continuous tracking within and across camera streams. In Proc. of CVPR, pp. 267–272, 2003.
[29] J. Kang, I. Cohen, G. Medioni, and C. Yuan. Detection and tracking of moving objects from a moving platform in presence of strong parallax. In Proc. of ICCV, pp. 10–17, 2005.
[30] R. Kumar, P. Anandan, and K. Hanna. Direct recovery of shape from multiple views: A parallax based approach. In Proc. of ICPR, vol. 1, pp. 685–688, 1994.
[31] K. N. Kutulakos and S. M. Seitz. A theory of shape by space carving. IJCV, 38(3):199–218, Jul 2000.
[32] A. Laurentini. The visual hull concept for silhouette-based image understanding. IEEE Trans. PAMI, 16(2):150–162, 1994.
[33] M. Lourakis and A. Argyros. sba: A generic sparse bundle adjustment C/C++ package based on the Levenberg-Marquardt algorithm. http://www.ics.forth.gr/~lourakis/sba.
[34] M. I. Lourakis, A. A. Argyros, and S. C. Orphanoudakis. Independent 3D motion detection using residual parallax normal flow fields. In Proc. of ICCV, pp. 1012–1017, 1998.
[35] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
[36] Q.-T. Luong and O. D. Faugeras. The fundamental matrix: Theory, algorithms, and stability analysis. IJCV, 17(1):43–75, 1996.
[37] A. Maki, M. Watanabe, and C. Wiles. Geotensity: Combining motion and lighting for 3D surface reconstruction. IJCV, 48(2):75–90, 2002.
[38] E. Malis and R. Cipolla. Camera self-calibration from unknown planar structures enforcing the multiview constraints between collineations. IEEE Trans. PAMI, 24(9):1268–1272, 2002.
[39] V. Y. Mariano, J. Min, J.-H. Park, R. Kasturi, D. Mihalcik, H. Li, D. Doermann, and T. Drayer. Performance evaluation of object detection algorithms. In Proc. of ICPR, vol. 3, pp. 965–969, 2002.
[40] B. Matei, B. Georgescu, and P. Meer. A versatile method for trifocal tensor estimation. In Proc. of ICCV, vol. II, pp. 578–585, 2001.
[41] D. Nistér. Reconstruction from uncalibrated sequences with a hierarchy of trifocal tensors. In Proc. of ECCV, vol. 1, pp. 649–663, 2000.
[42] D. Nistér. Automatic dense reconstruction from uncalibrated video sequences. PhD thesis, Royal Institute of Technology KTH, Stockholm, Sweden, 2001.
[43] D. Nistér. An efficient solution to the five-point relative pose problem. IEEE Trans. PAMI, 26(6):756–770, 2004.
[44] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, New York, 1999.
[45] K. E. Özden, K. Cornelis, L. Van Eycken, and L. Van Gool. Reconstructing 3D independent motions using non-accidentalness. In IEEE CVPR, pp. 819–825, 2004.
[46] K. E. Özden, K. Cornelis, L. Van Eycken, and L. Van Gool. Reconstructing 3D trajectories of independently moving objects using generic constraints. CVIU, 96(3):453–471, 2004.
[47] M. Pollefeys, L. Van Gool, M. Vergauwen, F. Verbiest, K. Cornelis, and J. Tops. Visual modeling with a hand-held camera. IJCV, 59(3):207–232, 2004.
[48] M. Pollefeys, F. Verbiest, and L. Van Gool. Surviving dominant planes in uncalibrated structure and motion recovery. In Proc. of ECCV, LNCS 2351, pp. 837–851, 2002.
[49] H. S. Sawhney. 3D geometry from planar parallax. In Proc. of CVPR, pp. 929–934, 1994.
[50] H. S. Sawhney, Y. Guo, J. Asmuth, and R. Kumar. Independent motion detection in 3D scenes. IEEE Trans. PAMI, 22(10):1191–1199, 2000.
[51] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. IJCV, 47:7–42, 2002.
[52] K. Schindler and D. Suter. Two-view multibody structure-and-motion with outliers through model selection. IEEE Trans. PAMI, 28(6):983–995, 2006.
[53] K. Schindler, J. U, and H. Wang. Perspective n-view multibody structure-and-motion through model selection. In Proc. of ECCV, 2006.
[54] C. Schmid, R. Mohr, and C. Bauckhage. Evaluation of interest point detectors. IJCV, 37(2):151–172, 2000.
[55] S. M. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski. A comparison and evaluation of multi-view stereo reconstruction algorithms. In Proc. of IEEE CVPR, pp. 519–526, 2006.
[56] S. M. Seitz and C. R. Dyer. Photorealistic scene reconstruction by voxel coloring. IJCV, 35(2):151–173, 1999.
[57] A. Shashua and N. Navab. Relative affine structure: Canonical model for 3D from 2D geometry and applications. IEEE Trans. PAMI, 18(9):873–883, 1996.
[58] A. Shashua and L. Wolf. Homography tensors: On algebraic entities that represent three views of static or moving planar points. In ECCV, vol. 1, pp. 507–521, 2000.
[59] G. G. Slabaugh, W. B. Culbertson, T. Malzbender, M. R. Stevens, and R. W. Schafer. Methods for volumetric reconstruction of visual scenes. IJCV, 57(3):179–199, 2004.
[60] D. Snow, P. Viola, and R. Zabih. Exact voxel occupancy with graph cuts. In Proc. of IEEE CVPR, pp. 345–352, 2000.
[61] X. Song and R. Nevatia. A model-based vehicle segmentation method for tracking. In Proc. of ICCV, pp. 1124–1131, 2005.
[62] C. Stauffer and W. Grimson. Adaptive background mixture models for real-time tracking. In Proc. of CVPR, vol. 2, pp. 246–252, 1999.
[63] H. Stewénius, F. Schaffalitzky, and D. Nistér. How hard is three-view triangulation really? In ICCV, 2005.
[64] P. Sturm. Structure and motion for dynamic scenes: the case of points moving in planes. In Proc. of ECCV, pp. 867–882, 2002.
[65] R. Szeliski. Rapid octree construction from image sequences. CVGIP: Image Understanding, 57:23–32, 1993.
[66] F. Szidarovszky and S. Molnar. Introduction to Matrix Theory. World Scientific, 2001.
[67] P. Torr. Outlier Detection and Motion Segmentation. PhD thesis, University of Oxford, Engineering Dept., 1995.
[68] P. Torr. Geometric motion segmentation and model selection. Philosophical Trans.: Mathematical, Physical and Engineering Sciences, 356(1740):1321–1340, 1998.
[69] P. Torr. Bayesian model estimation and selection for epipolar geometry and generic manifold fitting. IJCV, 50(1):35–61, 2002.
[70] P. Torr and A. Zisserman. Robust computation and parametrization of multiple view relations. In ICCV, pp. 727–732, 1998.
[71] S. Tran and L. Davis. 3D surface reconstruction using graph cuts with surface constraints. In ECCV, vol. II, pp. 219–231, 2006.
[72] B. Triggs, P. McLauchlan, R. Hartley, and A. Fitzgibbon. Bundle adjustment – a modern synthesis. In Workshop on Vision Algorithms, pp. 298–372, 1999.
[73] S. Vedula, S. Baker, S. Seitz, and T. Kanade. Shape and motion carving in 6D. In Proc. of IEEE CVPR, pp. 592–598, 2000.
[74] R. Vidal, Y. Ma, S. Soatto, and S. Sastry. Two-view multibody structure from motion. IJCV, 68(1):7–25, 2006.
[75] A. J. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inform. Theory, IT-13:260–269, 1967.
[76] G. Vogiatzis, P. H. S. Torr, and R. Cipolla. Multi-view stereo via volumetric graph-cuts. In CVPR, 2005.
[77] B. Wu and R. Nevatia. Detection and tracking of multiple, partially occluded humans by Bayesian combination of edgelet based part detectors. IJCV, to appear.
[78] Q. Yu, I. Cohen, G. Medioni, and B. Wu. Boosted Markov chain Monte Carlo data association for multiple target detection and tracking. In Proc. of ICPR, pp. 675–678, 2006.
[79] C. Yuan and G. Medioni. 3D reconstruction of background and objects moving on ground plane viewed from a moving camera. In Proc. of IEEE CVPR, pp. 2261–2268, 2006.
[80] C. Yuan, G. Medioni, J. Kang, and I. Cohen. Detecting motion regions in presence of strong parallax from a moving camera by multi-view geometric constraints. IEEE Trans. PAMI, to appear.
[81] Z. Zhang. Determining the epipolar geometry and its uncertainty: A review. IJCV, 27(2):161–195, 1998.
[82] B. Zitova and J. Flusser. Image registration methods: a survey. IVC, 21:977–1000, 2003.