Multi-view Image-Based Rendering and Modeling

Copyright 2000 by Qian Chen

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Science)

May 2000

UNIVERSITY OF SOUTHERN CALIFORNIA
THE GRADUATE SCHOOL
UNIVERSITY PARK
LOS ANGELES, CALIFORNIA 90007

This dissertation, written by Qian Chen under the direction of his Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY.

Dean of Graduate Studies

DISSERTATION COMMITTEE

Acknowledgements

I first would like to thank Prof. Gerard Medioni for being an absolutely wonderful thesis supervisor through the entire process. He started me out with great ideas and problems, provided good feedback and suggestions during the research, and made sure that the final output is of high quality. As well as teaching me computer vision, he showed me that good research can be fun. I thank him for making my time at USC a very rewarding experience. Thanks to Prof. Ulrich Neumann and to Prof.
Wlodek Proskurowski for serving on my committee and providing very useful comments on the thesis draft. Dr. Neumann is the leader of the Computer Interfaces thrust in the Integrated Media Systems Center, which ultimately makes my research possible, academically and financially. I am grateful to Prof. Proskurowski for clarifying some of the issues in numerical optimization. Thanks also to Prof. Ram Nevatia for his interest in my research and many casual discussions conducted in the IRIS hallway. I received help from several other members of the IRIS group and the CGIT group during various stages of this work. In particular, thanks to Alexandre Francois for being my lab mate during the years and answering questions whenever I have one, and to Reyes Enciso for the valuable discussions about stereo and, especially, face reconstruction. I deeply appreciate my wife and my lovely daughter for patiently supporting me throughout this long academic journey. Thanks also to my parents and the rest of my family, who are all excited about "one of them" getting a doctorate.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
LIST OF FIGURES
LIST OF TABLES
ABSTRACT

CHAPTER 1 INTRODUCTION
1.1 What is Image-Based Rendering and Modeling
1.2 Issues
1.3 Objective and Approach
1.4 Outline and Notations

CHAPTER 2 ELEMENTS OF PROJECTIVE GEOMETRY
2.1 Projective Space
2.1.1 Definitions
2.1.2 Canonical affine space embedding and the plane-at-infinity
2.1.3 Collineation
2.1.4 Duality
2.1.5 Properties of P2
2.1.6 Cross-ratio
2.2 Projection vs. Reconstruction
2.2.1 Projection—the forward process
2.2.2 Reconstruction—the inverse process
2.2.3 Canonical forms of the projection matrices
2.3 Absolute Quadric and Absolute Conic

CHAPTER 3 RELATED WORK
3.1 Structure-From-Motion
3.2 Related Work in Computer Graphics
3.2.1 Light-field rendering
3.2.2 Image mosaicing
3.2.3 Image warping
3.3 Related Work in Computer Vision
3.3.1 Epipolar geometry
3.3.2 Trifocal tensor
3.3.3 Quad-linearity
3.3.4 Projective reconstruction
3.3.5 Self-calibration
3.4 Hybrid Methods
3.5 Summary

CHAPTER 4 IBRM FROM TWO VIEWS
4.1 Review of Related Work
4.1.1 Aerial triangulation
4.1.2 Binocular stereo
4.2 Epipolar Geometry in Detail
4.3 Rectification
4.4 Image Matching
4.4.1 The u-v-d volume
4.4.2 Disparity surface extraction
4.4.3 Extraction on the "Herve's Face" stereo pair
4.4.4 Improving time efficiency
4.5 Reconstruction
4.5.1 Projective reconstruction by matrix factorization
4.5.2 Euclidean reconstruction
4.6 Results
4.6.1 Pentagon pair
4.6.2 Renault Automobile Part
4.6.3 Herve's Head
4.6.4 A full head example
4.7 A Face Reconstruction System

CHAPTER 5 MULTI-VIEW PROJECTIVE RECONSTRUCTION
5.1 Factorization-based Methods
5.1.1 Factorization-based reconstruction for affine cameras
5.1.2 Extension to perspective cameras
5.1.3 The Iterative Factorization Algorithm (IFA)
5.2 Rank Theorem Under the Common-Image-Plane Condition
5.3 A Geometric Explanation
5.4 Iterative Methods
5.4.1 Mohr's non-linear minimization method
5.4.2 Our approach
5.4.3 Simulation results
5.5 Experimental Results on Noisy Data
5.5.1 Experiments on synthetic data
5.5.2 Experiments on real data
5.6 Final Remarks

CHAPTER 6 APPLICATION OF PROJECTIVE RECONSTRUCTION—VIEW SYNTHESIS
6.1 Review of Related Work
6.1.1 View morphing
6.1.2 View synthesis
6.1.3 View synthesis in tensor space
6.2 System Overview
6.3 Warping Formulation
6.3.1 Vertex mapping
6.3.2 Triangle filling
6.3.3 Fundamental matrix computation
6.4 Addressing Partial Occlusion
6.5 Addressing Total Occlusion
6.6 Results
6.7 Some Remarks

CHAPTER 7 MULTI-VIEW EUCLIDEAN RECONSTRUCTION
7.1 PDM Estimation
7.1.1 Algorithm description
7.1.2 Relation to existing methods
7.1.3 Initialization
7.2 Self-calibration
7.3 Simulation
7.4 Results on Real Images
7.5 Application to Object Insertion
7.5.1 Implementation issues

CHAPTER 8 CONCLUSION AND FUTURE WORK

REFERENCES

APPENDIX A MATRIX DECOMPOSITION

LIST OF FIGURES

Figure 1.1 Computer graphics - the forward problem
Figure 1.2 Computer vision - the backward problem
Figure 1.3 IBRM - bridging the gap
Figure 1.4 The overall system
Figure 4.1 Aerial Triangulation and Stripe Adjustment
Figure 4.2 Bundle Adjustment in photogrammetry
Figure 4.3 Epipolar geometry
Figure 4.4 Illustration of the rectification algorithm
Figure 4.5 The original and the rectified stereo pair
Figure 4.6 Pseudo code of the tracing algorithm
Figure 4.7 Results on Herve's Face stereo pair
Figure 4.8 Output disparity map of the Herve's Face pair
Figure 4.9 Results of the Pentagon pair
Figure 4.10 Results of the Renault Auto Part pair
Figure 4.11 Results on the Herve's Head pair
Figure 4.12 Reconstructed full head model
Figure 4.13 Results from tensor-voting
Figure 4.14 Our stereo system made from two FUJI DS-300
Figure 4.15 Examples and results from the Face System
Figure 4.16 More examples and results
Figure 5.1 Illustration of IFA
Figure 5.2 Pseudo code of IFA
Figure 5.3 The common-image-plane condition
Figure 5.4 Average reconstruction error of the CIL-0001 dataset
Figure 5.5 Illustration of WIE
Figure 5.6 Pseudo code of WIE for projective reconstruction
Figure 5.7 The planar configuration
Figure 5.8 The spherical configuration
Figure 5.9 Convergence of WIE
Figure 5.10 The five largest singular values in 100 iterations
Figure 5.11 Comparison of convergence rate
Figure 5.12 Reprojection error vs. noise level
Figure 5.13 The terminal sequence
Figure 5.14 The wooden house sequence
Figure 6.1 Illustration of the view synthesis approach
Figure 6.2 Illustration of trifocal tensor
Figure 6.3 System flow-chart
Figure 6.4 Homography between two image planes
Figure 6.5 Mapping of T-junctions
Figure 6.6 Visibility compatible order
Figure 6.7 Handling occlusion
Figure 6.8 Synthesized views of a computer monitor
Figure 6.9 A fuller synthesis
Figure 6.10 Looking into a room - an extrapolation example
Figure 7.1 Locus of principal point in 20 experiments
Figure 7.2 Relative errors vs. noise level (d)
Figure 7.3 Relative errors vs. view separation (θ)
Figure 7.4 Relative errors vs. number of views (N)
Figure 7.5 Reconstruction results of the terminal sequence
Figure 7.6 Two views of the reconstructed house
Figure 7.7 Misalignment of real and virtual objects
Figure 7.8 Correct registration after error compensation

LIST OF TABLES

Table 1. Notation table
Table 2. Recovered relative camera movement (mm)
Table 3. Comparison over 4 views—the terminal sequence
Table 4. Comparison over 8 views—the terminal sequence
Table 5. Results on the house sequence
Table 6. Average Euclidean reconstruction error on real data

Abstract

While both work with images, Computer Graphics and Computer Vision are different fields. Computer graphics starts with the creation of geometric models and produces image sequences. Computer vision starts with images or image sequences and produces interpretations, including geometric models. Lately, there has been a meeting in the middle, the goal being to create photorealistic images with the help of accurate models recovered from projections of real objects. This convergence has produced a new subfield called Image-Based Rendering and Modeling (IBRM). In this research, the geometric aspects of IBRM are studied. The case of two views is studied first. New algorithms are developed that automatically align the input images, match them and reconstruct 3-D surfaces in Euclidean space. The matching algorithm is designed to cope with complex shapes such as human faces. The reconstruction algorithms are then generalized to the multi-view case, based on a stratified framework.
At the root of the stratification is a novel projective reconstruction algorithm that produces a non-metric structure of the scene. An image-based rendering system is implemented that directly uses this non-metric structure to synthesize images from novel viewpoints. On top of that, a novel algorithm is developed to upgrade the non-metric structures into metric ones. Intrinsic and extrinsic camera parameters are obtained at the same time. One potential application, showcased in this thesis, is to blend computer graphics objects into real images with correct perspective and occlusion. The proposed theory and the associated algorithms lay down the groundwork for future development in multi-view image-based rendering and modeling.

Chapter 1 Introduction

1.1 What is Image-Based Rendering and Modeling

Traditionally, computer graphics and computer vision are considered to be inverse problems. Working in the forward direction, computer graphics begins with the creation of geometric models. These models are subsequently tessellated, colored (or textured), lighted and projected onto a computer screen, producing images (Figure 1.1). Therefore, much of the effort in graphics research has been devoted to the efficient representation, storage and rendering of geometric models. The problem with graphics is that, no matter how detailed the models are and how fast rendering can be performed, model creation is still time consuming, and the final images may look artificial.

Figure 1.1 Computer graphics - the forward problem

Working in the backward direction, computer vision tries to replicate the human vision process on computers, producing high-level interpretations from real image sequences (Figure 1.2). Often, a layered approach is taken: in early vision, spatial and temporal features are detected. Features are then grouped according to some similarity measure. For example, nearby and parallel edge segments are grouped into lines; connected and perpendicular lines form rectangles. Groups are aggregated into objects. Once objects are formed, intelligent tasks such as recognition can be performed. The difficulty of such an approach (and of computer vision in general) is that these layers are interdependent. While achieving high-level objectives relies on solving low-level problems, the availability of high-level knowledge certainly makes the task of solving low-level problems easier and more efficient. Thus, a robust vision system should work both top-down and bottom-up. Such a system has yet to be successfully demonstrated.

Figure 1.2 Computer vision - the backward problem

Recently, there has been a convergence of research to bridge the gap between the two fields. Looking in the direction of computer vision, we see a tendency of aiming at the recovery of geometric models, branching off the original intent of obtaining symbolic or semantic descriptions. Looking in that of computer graphics, we see more and more applications that demand photorealistic results.
At the intersection is the emergence of a new sub-field called Image-Based Rendering and Modeling. IBRM represents the effort of using the geometric and photometric information recovered from real images to generate new images, with the hope that the synthesized ones appear photorealistic, while also reducing the time spent on model creation. It tries to leverage the ease with which photographs or videos are taken and their amazing power to communicate. In fact, our eyes are so fine-tuned to texture and color information that much of the inaccuracy present in the recovered data goes unnoticed.

Figure 1.3 IBRM - bridging the gap

1.2 Issues

There is a full spectrum of existing research tackling the IBRM problem, ranging from approaches based on structure reconstruction to model-free ones. In between are those which make use of both geometric and photometric information, and thus possess some of the advantages of the two extremes. In analyzing these different approaches, we have noticed the following issues.

Image understanding vs. structure recovery

Despite tremendous progress in vision research in the past few decades, what has been achieved in the image understanding field as the front end of an intelligent system remains limited. Part of the reason is that there is strong indication of interaction between high and low level information processing in the human vision process. Take edge detection as an example. At the lower level, edges consist of pixels with large intensity changes; at the higher level, they are the boundaries of objects. The correct way to perform edge detection therefore is to couple it with other high-level operations such as segmentation and recognition. Unfortunately, ongoing research efforts have not been successful in modeling this kind of interaction. If the purpose is to recover structures, especially for graphics usage, the situation becomes quite different. First, in most cases, it is unnecessary to understand the images. Second, when it becomes necessary, a human operator is allowed to provide the high-level information, while the computers perform all the low-level computational work.

Structure recovery vs. image warping

The traditional way of recovering structures involves two steps. First, one calibrates the cameras, meaning that both their internal parameters and external poses are determined. Second, correspondences among images of scene features are established. From these two pieces of information, it is possible to compute 3-D Euclidean coordinates of the involved entities. While this process is seemingly doing the right thing in the sense that we live in a three-dimensional Euclidean world, it has some undesired properties. For instance, calibration is a tedious procedure, often performed off-line with some mechanically measured ground truth data. In addition, it is only valid in a certain range. The accuracy degrades as the real environment departs from the original calibration setup. On the other hand, if all one wants are images rather than models, and if there are ways to warp existing images directly, why bother to create models (and to calibrate the cameras) at all?
This is exactly the idea that prompts the advent of image-based rendering. It has been demonstrated by many authors (Chen and Williams [17], Havaldar, Lee and Medioni [43], McMillan and Bishop [60], Seitz and Dyer [80]) in the graphics community that, at least in some special cases, image warping is possible using only the correspondence information.

Euclidean vs. projective reconstruction

A similar problem has also been raised in the vision community: is it overkill to try and reach Euclidean reconstruction all the time? The way computer vision scientists approach the problem is to find a less restrictive (compared to Euclidean) structure in the hope that it requires less data to compute, and still serves many useful applications. In his pioneering work in this area, Faugeras [26] showed that, from correspondence data only, he could recover the projective structure of the scene (thus projective reconstruction), which differs from the Euclidean one by an unknown 4x4 transformation. Later, Chen and Medioni [12] and Avidan and Shashua [4] independently demonstrated that such a structure is sufficient for image warping. The result is the same as if the images were projected from Euclidean structures. In [30] and [31], Faugeras et al. further showed that a projective structure can be upgraded into a Euclidean one given more information about either the scene or the cameras.

1.3 Objective and Approach

The goal of this research is to develop new algorithms for multi-view image-based rendering and modeling applications. The problems involved have dual appearances in computer vision and computer graphics, which determines that their solutions can be looked at from either viewpoint, and are likely to find applications in both fields. We approach a vision problem from a graphics perspective: we perform structure recovery rather than image understanding. Simultaneously, we address a graphics problem using computer vision techniques: image warping based on projective reconstruction, and modeling based on Euclidean reconstruction. By developing algorithms, we solve some of the fundamental problems in computer vision, in particular, structure-from-motion. At the same time, we demonstrate how these algorithms can be applied in graphics.

Figure 1.4 The overall system

The overall system is depicted in Figure 1.4, where we see a stratified structure: starting from a set of images of a (static) scene, correspondences are first established, automatically or with human assistance; projective (or non-metric) structures are then computed and subsequently transformed into Euclidean (or metric) ones. Each stratum supports certain applications, depending on the underlying mathematical problems to be addressed.

1.4 Outline and Notations

Chapter 2 introduces some preliminary concepts from projective geometry which are going to be referenced in many places throughout this thesis. They are helpful in understanding the materials in this thesis.
Chapter 3 reviews related work in both the computer graphics and computer vision arenas. In Chapter 4, a complete two-view IBRM system is presented. Then, in Chapter 5, the projective reconstruction algorithm developed previously is generalized to multiple views. Its application is demonstrated in Chapter 6 with a view synthesis system. Chapter 7 extends the Euclidean reconstruction algorithm to multiple views, and demonstrates its usage with an application which blends graphics objects into real images. Both Chapter 5 and Chapter 7 include experimental results from simulation and on real datasets. Chapter 8 concludes the thesis and discusses future work. Several forms of matrix decomposition are used in this thesis; they are summarized in the Appendix for easy reference.

Mathematical notations are used as follows (Table 1). Italic lower case characters ($s$) represent scalars. Italic upper case characters ($M$) represent matrices. Bold and italic lower case characters ($v$) represent one-dimensional arrays, which can be points, vectors or columns/rows of matrices. Unless otherwise stated, the default assumption is that coordinates are written in column format. Under this convention, the dot product of two vectors $a$ and $b$ is $a^T b$, and the tensor product is $a b^T$. When a subscript is attached, it represents the component of the original quantity indexed by the subscript. For instance, let $t$ be a 3-tuple vector; then it can be written as $t = [t_1\ t_2\ t_3]^T$. As another example, if $M$ is a 3x3 matrix, then $m_{22}$ is the center element, and $M$ can be written either in terms of its column vectors as $[M_{(1)}\ M_{(2)}\ M_{(3)}]$, or in terms of its row vectors $M^{(1)}, M^{(2)}, M^{(3)}$. Finally, bold and italic upper case characters ($C$) represent sets.

Table 1. Notation table

  $s$            scalar
  $v$            point, vector
  $a^T b$        vector dot product
  $a b^T$        vector tensor product
  $a \times b$   vector cross product
  $M$            matrix
  $M_{(i)}$      ith column of matrix M
  $M^{(i)}$      ith row of matrix M
  $S$            set

Chapter 2 Elements of Projective Geometry

This chapter introduces some elementary projective geometry that will be used in this thesis. The introduction is intended to be informal but informational. For further study, Faugeras' monograph [28] is a good place to start. More advanced topics are discussed in his paper [30]. The tutorial by Triggs and Mohr [63] is quite thorough in terms of describing general concepts. The tutorial by Hartley and Zisserman [99], on the other hand, concentrates on the computational aspects of multi-view geometry. Finally, any textbook about projective geometry should also be helpful.

2.1 Projective Space

2.1.1 Definitions

Given a coordinate system, an n-dimensional real affine space is the set of points parameterized by the set of all n-component real column vectors $[x_1, \ldots, x_n]^T \in R^n$. An n-dimensional Euclidean space, denoted as $E^n$, is a subspace of $R^n$ such that it is invariant to rigid transformation. The points of real n-dimensional projective space $P^n$ can be represented by (n+1)-component real column vectors $[x_1, \ldots, x_{n+1}]^T$, with the provisos that at least one coordinate must be non-zero and that the vectors $[x_1, \ldots, x_{n+1}]^T$ and $\lambda [x_1, \ldots, x_{n+1}]^T$ represent the same point of $P^n$ for all $\lambda \neq 0$. The $x_i$'s are called the homogeneous coordinates of the projective point. The concept of a point in projective space thus
describes an equivalence relationship: $X \sim Y$ iff $\exists \lambda \neq 0$ such that $X = \lambda Y$, where $\lambda \in R$ and $X, Y \in P^n$. This relationship, denoted by $\sim$, is referred to as equality up to a scale.

The set of all points in $P^n$ whose coordinates satisfy a linear equation $a_1 x_1 + \cdots + a_{n+1} x_{n+1} = 0$, where $a_i \in R$ for all $i = 1, \ldots, n+1$, is called a hyper-plane. Obviously, a hyper-plane in projective space can also be represented by an (n+1)-component vector $[a_1, \ldots, a_{n+1}]$. Conventionally, hyper-planes are written as row vectors. When n = 2, the hyper-plane degenerates into a projective line: $[a_1, a_2, a_3]$.

2.1.2 Canonical affine space embedding and the plane-at-infinity

An affine space $R^n$ can be embedded isomorphically in $P^n$ by the standard injection: $[x_1, \ldots, x_n] \to [x_1, \ldots, x_n, 1]$. Conversely, affine points can be regenerated from projective ones with $x_{n+1} \neq 0$ by the mapping $[x_1, \ldots, x_n, x_{n+1}] \to [x_1 / x_{n+1}, \ldots, x_n / x_{n+1}]$. Under such an embedding, any projective point with $x_{n+1} = 0$ corresponds to a vanishing point in the n-dimensional affine space. The set of all such points forms the Plane-at-Infinity, denoted by $\Pi_\infty$. It should be pointed out that the above mappings and definitions simply constitute a convention. They are only meaningful if we are told in advance that $[x_1, \ldots, x_n]$ represents a normal affine space and $x_{n+1}$ is a special homogenizing coordinate. In a projective space, any one coordinate can act as the homogenizing coordinate, and all hyper-planes are equivalent: none is specially singled out as the Plane-at-Infinity. However, the choice for $R^n$ and $\Pi_\infty$ must be consistent in a way such that $P^n \sim R^n + \Pi_\infty$ holds.

2.1.3 Collineation

A full rank $(n+1) \times (n+1)$ matrix defines a linear transformation, or collineation, from $P^n$ to itself. A collineation is also defined up to a scale: if $A = \rho B$, then $\forall X \in P^n$, $AX \sim BX$. Hence the symbol $\sim$ also applies to collineations.

2.1.4 Duality

Hyper-planes and points are dual entities which satisfy the following Duality Principle: for any projective result established using points and hyper-planes, a symmetrical result holds in which the roles of points and hyper-planes are interchanged: points become planes, and vice versa. For instance, the equation $a_1 x_1 + \cdots + a_{n+1} x_{n+1} = A^T X = 0$ has two different but symmetric interpretations: a point set where the points are coplanar (fixing $A$), or a plane set where the planes are coincident to a point (fixing $X$). The duality determines that a hyper-plane has dual representations: an equation in the $x_i$'s, if treated as a point set; or an (n+1)-tuple row vector, as a point in the dual point space. Note that under the canonical affine embedding, the dual representation of $\Pi_\infty$ has the form $[0, \ldots, 0, 1]$, which is called the canonical form of $\Pi_\infty$. Unless mentioned otherwise, canonical representations are used for various entities throughout this thesis.

Let $H$ be a collineation for points. From the previous equation, we have $A^T X = (A^T H^{-1})(H X) = 0$. Hence $H^{-T} \sim H^{*T}$ is a collineation for hyper-planes, where $H^*$ is the adjoint matrix of $H$. In other words, $H$ and $H^*$ are dual collineations.

A set of hyper-planes incident to a common line is named a pencil-of-planes. The dual structure of a pencil-of-planes is a set of co-linear points. If two of the hyper-planes are parallel, their intersection is at $\Pi_\infty$, thus all hyper-planes in the pencil ought to be parallel to one another, otherwise at least one pair would intersect within the affine space.

2.1.5 Properties of P2

In 2-D projective space $P^2$, the set of vanishing points forms the Line-at-Infinity, denoted as $L_\infty$. The canonical form of $L_\infty$ is $(0, 0, 1)$. Its dual form is $x_3 = 0$. The collineation in $P^2$ is historically called a homography. Since a homography is a 3x3 matrix defined up to a scale, there are 8 independent parameters. To compute it, four point correspondences suffice. Dually, four line correspondences suffice to determine a dual homography.
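To make the point-correspondence statement concrete, the sketch below (a minimal numpy illustration, not code or notation from this thesis) estimates a homography with the standard direct linear transformation: each correspondence contributes two linear equations in the 9 entries of H, and the solution is the null vector of the stacked system. The example correspondences are made up for demonstration only.

```python
import numpy as np

def homography_dlt(src, dst):
    """Estimate H (3x3, up to scale) such that dst ~ H @ src in homogeneous terms.

    src, dst: (N, 2) arrays of corresponding points, N >= 4.
    Minimal sketch: no point normalization or degeneracy checks.
    """
    rows = []
    for (x, y), (u, v) in zip(np.asarray(src, float), np.asarray(dst, float)):
        # Each correspondence contributes two rows of the design matrix.
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.array(rows)
    # The solution is the right singular vector for the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

# Four correspondences determine H exactly: 8 parameters, 2 equations each.
src = [(0, 0), (1, 0), (1, 1), (0, 1)]
dst = [(10, 20), (110, 25), (115, 130), (5, 125)]
H = homography_dlt(src, dst)
```

With more than four correspondences, the same construction returns the algebraic least-squares solution, which is how noisy measurements are usually handled.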
I f two o f the hyper-planes are parallel, their intersection is at /7„, thus all hyper-planes in the pencil ought to be parallel to one another, otherwise at least one pair would intersect within the affine space. 2.1.5 Properties of P2 In 2-D projective space f*2 , the set o f vanishing points form the Line-at-Infinity, denoted as L o o . The canonical form o f £« is: (0,0,1). Its dual form is: *2=0. The collineation in P2 is historically called (tomography. Since a homography is a 3x3 matrix defined up to a scale, there are 8 independent parameters. To compute it, four point correspondences suffice. Dually, four line correspondences suffice the solution o f a dual homography. 13 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. A set o f coincident lines in P2 is called a pencil-of-lines. From the Duality Principle, we know that a pencil-of-lines is a projective 1-D structure. I f two lines are parallel, all lines are parallel to one another because their common intersection is on the Line-at- Infinity. 2.1.6 Cross-ratio Let A, B, C, D be four co-linear points in a projective space, the cross-ratio is defined as cross(A, B; C, D) = W W |/4fl|/|CD| where | • | denotes the distance between two points. I f one o f the four points is a vanishing point, e.g. D, then cross(A, B\ C, D) = |/1CJ/|/1B|. Cross-ratio is invariant to projective transformations. In projective space, three points form a basis for a line: given such a basis, a fourth point is uniquely determined by a cross-ratio. The cross-ratio o f a pencil-of-planes (lines) is defined as that o f its dual structure. 2.2 Projection vs. Reconstruction 2.2.1 Projection—the forward process A perspective camera projection is represented by a 3x4 matrix P: P}->P! - K . I 0 ] R 3x3 0 3«1 1 = [^4 /? | AT\ 'X ~x' 1 =^3,4 _ 1 14 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. where ( m,v) are the image coordinates o f the Euclidean world point [x_y z]r, A represents intrinsic camera parameters, and R, T represent the rigid transformation from the world coordinate system to the camera coordinate system. To be able to write the projection process in such a nice linear form is the primary reason for treating the image plane and the world as projective spaces. Notice that P includes a 3x3 part that is related to the camera's orientation and a 3x1 part that is related to its position. When the projection center o f a camera is on the intersection line o f a pencil-of- planes, the image o f the said pencil-of-planes is a pencil-of-lines. The two have the same cross-ratio since it is preserved under camera projection. 2.2.2 Reconstruction—the inverse process Given the correspondence information among the image coordinates o f n points in m views: (w ,y,v,y) where /=1, ..., m andj= I , i t is known that the projection matrices o f the cameras P, and the homogeneous coordinates o f the points A} can be computed satisfying = P , * r This process is called projective reconstruction. And the set o f all reconstructed points forms a projective structure. Such a structure is not unique: for any non-singular 4x4 matrix H, P,X l = (PlH)(H~iXJ) = P'X'r In other words, all reconstructed structures are equivalent up to a 4x4 collineation. Since collineation only preserves cross-ratio, projective structures are non-metric. Among all the equivalents, those which exist in 15 Reproduced with permission of the copyright owner. 
Further reproduction prohibited without permission. Euclidean spaces are Euclidean structures. A ll Euclidean structures are equivalent up to a rigid transformation. Given a Euclidean structure Y, and an equivalent projective structure X, o f the same set o f points, there exists a collineation H such that for Vy e Y , Hy e X . H is called a Projective Distortion Matrix. From this point o f view, given a projective structure X , Euclidean reconstruction amounts to finding the Projective Distortion Matrix H because, by definition, f f x X must be a Euclidean structure. 2.2.3 Canonical forms of the projection matrices Let Pj, /= 1 , ..., m, be the set o f projection matrices resulted from a projection reconstruction. Write Pt into a 3x3 and a 3x1 sub-matrix [A/, | /,]. The first canonical form looks like: [I | 0], [/LI**:], ..., [/lmlflm ]- The geometrical explanation is that the (updated) world coordinate system is coincident with that o f the first camera. This form is obtained by right multiplying each (original) projection matrix by H = ~ 1 1 1 ' a/,-1 o' 'I - * l‘ 1 1 1 1 . 0 1 _ 0 1 0 1 At the same time, the projective structures must be updated by left multiplying with 7 — 1 O • 0 1 0 1 We can further rotate the coordinate system so that the projection center o f the second camera (denoted by O2) is located on the x-axis, giving the second canonical form. W riting the rotation matrix as R, then O, = RTA^a, = Rra\ . Imposing O2 on the 16 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. x-axis implies that its y and z components must vanish. One can choose R2 and R3 (the 2n d and 3r d columns o f R) as two orthogonal unit vectors from a plane perpendicular to a\ , and R\ the cross-product o f the previous two. In writing, the second canonical form appears as [#i|0], •••, [#m |6m ], which conceals the underlying geometrical significance, however. 2.3 Absolute Quadric and Absolute Conic Any symmetric 4x4 matrix Q uniquely defines a non-empty quadric: x rQx=Q. pr=xrQ is the tangent plane to Q passing through x. Obviously, p satisfies prO *p= Q when O' is non degenerate, which is the dual structure of Q. The absolute quadric Q is a degenerate quadric whose rank is 3. Under a point transform M: x->Mx, Q is transformed by MQMr. Under a camera projection J, Q is T * mapped to JfU . The canonical form o f in a Euclidean space (with canonical affine embedding) is / 0 0 0 Let R and T represent a rigid transformation, then - /• 1 o' o' 0 0 0 0 R T J I OlM? T 0 10 0 0 1 which means that Q is invariant to rigid motion. Let P = [ a 0 ] projection matrix, then R T 0 1 be a camera [a 0] R T J l O p T 0 1 0 0 0 I .[A 0] I 0 0 0 Ar 0 = AA 17 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. or, shortly, af=Pf2PT =AAr. The dual o f Q is not a set o f planes, but a conic curve on /7» called the absolute conic, denoted as f . T has the same canonical form as O : for a point X=[jt 0]r on 7 ", k o] 0 0 0 = X X = 0 which only has virtual roots. The projection o f f is described by ut. =PX=[A 0] 0 1 0 = ARx Thus A~x uc = Rx. And further, uT r A~T A~'uc = u 'ca)~x ur = x TR 1 Rx = x Tx = 0. In other words, the image o f f is a 2-D conic curve of1 which is the dual image o f Q. The implication here is that the internal camera parameter matrix is related to and can be computed from the image o f the absolute conic (or the dual image o f the absolute quadric), using Choleski decomposition (Appendix A.2). 
18 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Chapter 3 Related Work To a certain degree, the effort on image-based modeling started at least twenty years ago when the structure-from-motion (SFM) problem (i.e. obtaining structural information from the images o f a moving camera) was approached by computer vision scientists. Unfortunately, up until now, successful vision systems remain uncommon. There are many issues involved, some o f them were just discussed in section 1.2. Lately, such a question has been raised: rather than trying to address all those issues, can the requirement on structure recovery be reduced? This simple question has triggered a lot o f studies on the topic o f non-metric structure-from-motion (NSFM). In the graphics side, almost at the same time, as a way to gain photorealism, image- based-rendering has been experimented which generates new images by warping and combining exiting images. It was soon realized that non-metric structure is sufficient for image warping. Then, methods which convert non-metric structures into metric ones were also reported. Algorithms developed for attacking these problems form a theoretical foundation for Image-Based Rendering and Modeling. It is this common interest that brings both vision and graphics researchers into this newly formed field. 3.1 Structure-From-Motion The literature under this topic is quite extensive. Here, only three approaches are reviewed, which suffice to demonstrate the diversity o f research in this area. 19 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. In Szeliski et al. [87], an object is placed on a turntable whose rotation angle is read from marks on the side. Shape is recovered in term o f a sequence o f silhouette curves. The advantage is that the difficult correspondence problem is avoided. This approach, however, relies on knowing the rotation angle, which largely limits its applications. In feature-based approaches, distinguished features such as points (Pollefeys et al. [72]), edges (Beardsley et al. [8]) and regions (Ma et al. [57]) are identified in the first image and then tracked along the image sequence. Compared to matching pixels between two well-separated images, tracking features across temporally dense samples is relatively easy. The trade o ff is that the resulting point set is sparse. Consequently, these approaches are suitable to blocky objects rather than smooth ones. The approach taken by Pighin et al. [69] is to deform a generic model to fit the recovered data using scatter data interpolation. The system is tuned for human faces, in which case common knowledge tells where more points are needed to cover high curvature areas. Correspondence in this system is established interactively. In the third category, optical flow techniques are used to generate a dense correspondence map (e.g. Pollefeys et al. [71]). Optical flow computation is one o f the fundamental problems encountered in processing image sequences and has been extensively studied in computer vision. The goal is to compute an approximation to the 2- D motion field— projection o f the 3-D velocities o f surface points onto the imaging surface— from spatial-temporal patterns o f the image intensity. 
It is argued in [3] (Aloimonos and Brown) and [96] (Verri and Poggio) that estimation o f the optical flow is an ill-posed problem due to inherent differences between 2-D motion field and intensity 20 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. variations. This is reflected in the survey paper [7] by Barron et al. where none o f the techniques can consistently produce low error rate and high density in all testing cases. 3.2 Related Work in Computer Graphics 3.2.1 Light-field rendering Theoretically, the only way to recreate true photo-realism is to sample and then reconstruct the 6-D plenoptic function (Adelson et al. [1]): j[X, < p , 0, x, y, z). In other words, it is necessary and sufficient to sample lights o f all wavelengths (A) from all directions (q > . , 0) at all locations (.t, y, z). There w ill be a long time before the storage technology allows us to do this. However, it is possible to contrive an environment where some o f the parameters are constant. This is demonstrated by the work o f Gortler et al. [35] and Levoy et al. [53] where only a 3-D sub-function is considered (internally, a 4-D representation is used). The idea is to first build some form o f lookup table by taking image samples o f an object from many camera viewpoints, trying to record all outgoing lights, thus the name light field. Subsequently, the image associated with an arbitrary viewpoint is synthesized by interpolating the stored lookup table. The advantage o f this class o f methods is that, unlike all other methods, pixel correspondence is not necessary. The primary disadvantage, as indicated earlier, is the requirement o f a large set o f samples to produce the lookup table. For that reason, currently, the range o f field o f view is relatively small. 21 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 3.2.2 Image mosaicing Rather than sample all lights emitting out o f an object, methods have been proposed to capture lights coming into the observer by rotating a camera around a fixed point as close as possible to the camera center. Multiple so obtained images are stitched into one larger or higher resolution image— panoramic image— which allows views within it to be quickly generated. The mathematical foundation (see [88] by Szeliski for a formal description) for image mosaicing is that different images o f non-planar objects obtained from a fixed-location camera are related by homographies. When the camera displacement is small, the pixel correspondences, and the homographies can be solved concurrently using optical flow techniques [18]. While rectilinear panoramic images are convenient representations for a relatively small field o f view (less than 90°), they are problematic for very wide scenes. In such circumstance, either cylindrical (Chen [18], McMillan et al. [60]) or spherical representations (Szeliski et al. [89]) are more suitable. Chen’s work did not support view interpolation. A viewer has to hop from one node to the next. On the other hand, at each node, pan-tilt-zoom operations are allowed. That o f McMillan et al. extended the concept o f epipolar geometry for planar images to cylindrical images. Their method was able to synthesize virtual cylinders, thus generating panoramic images at virtual viewpoints. The drawback o f using a cylindrical panoramic image is its inability to include parts o f the scene at the top and bottom. 
This deficiency has been overcome in systems that output spherical images such as the one by Szeliski and Shum [89]. 22 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 3.2.3 Image warping This class o f techniques is characterized by the use o f a relatively small number o f images with the application o f geometric constraints to reproject image pixels appropriately at a given virtual camera viewpoint. Sometimes, it is also called image transfer. One early paper which in some sense initiated this whole subfield was by Chen and Williams [17] where only synthetic images were considered. The warping function is simply the downscaled offset vector map (a.k.a. optical flow filed) which is pre computed from the known camera pose and range data. The depth o f each pixel is cached when the frame is rendered the first time, which also helps in resolving the visibility. Holes— areas where the pixels do not have pre-images— are filled by interpolating the offset vectors o f adjacent pixels. In Seitz and Dyer [80], two real images are taken. They are pre-warped to a common virtual image plane. This step guarantees that any linear interpolation o f the pre-warped views is shape preserving, meaning the synthesized view is free o f non-linear distortion. Holes are filled using morphing techniques. The method does not extend to multiple views, however, because the pre-warping algorithm only works for two input views. In addition, the virtual camera's movement is limited to stay in between the model views—a result o f using linear interpolation. Arbitrary camera placement was considered by Havaldar et al. [43] by basing image transfer on projective invariant. However, again, only two views were dealt with. This work was broadened in Chen and Medioni [12] to allow multiple views, which in 23 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. addition addressed the occlusion problem. Another merit o f the latter work is that, unlike other warping methods, it is polygon based, thus able to take advantage o f hardware acceleration in implementation. 3.3 Related Work in Computer Vision 3.3.1 Epipolar geometry Non-metric structure from two views was studied by Longuet-Higgins a long time ago [54]. He discovered what is now known as the epipolar geometry. Consider two cameras and a point in the scene. The scene point forms a plane with the line connecting the two projection centers. As the point moves in the world, it sweeps out a pencil-of-planes. The epipolar geometry relates the two pencil-of-lines which, as recalled from section 2.2.1, are images o f the said pencil-of-planes. Algebraically, the relationship is described by a 3x3 matrix o f rank 2 called the fiindamental matrix. Intuitively, the rank 2 condition comes from the fact that a pencil is a projective 1-D structure. Due to the rank 2 condition, seven pairs o f point correspondences suffice to produce a non-linear solution to the fundamental matrix. Alternatively, there are linear solutions for eight or more points (e.g. Hartley [40]). The difficulty o f epipolar geometry estimation lies in that matching points and computing the fundamental matrix must proceed concurrently. A widely referenced algorithm is the one by Zhang [100] for its robustness. But still, it is not bulletproof. A thorough discussion about the related computational issues can be found in Deriche et al. [23] and Torr et al. [93]. 
The fundamental matrix was the tool used to achieve pre-warping for view morphing in [80]. The same procedure is used in computer vision to align two stereo images, and is referred to as rectification (e.g. Hartley et al. [39] and Robert et al. [75]). Epipolar geometry can also be thought of as a constraint that binds corresponding pixels on the two image planes, and can be used, for instance, in image matching. However, the epipolar constraint is inherently ambiguous: all points on a plane belonging to the pencil-of-planes are projected to the same line in either view. On this line, it is impossible to distinguish those points that happen to have the same intensity value. Therefore, in practice, more reference views are needed to help resolve the ambiguity.

3.3.2 Trifocal tensor

Recently, Shashua [83] discovered a trilinear constraint that involves both points and lines in three views. The constraint can be elegantly represented by a 3x3x3 tensor. The tensor contains the set of coefficients of certain trilinear relationships that vanish on any corresponding triplet (any combination of points and lines). A robust algorithm to compute the trifocal tensor is given by Torr [92]. The trifocal tensor is more general than the epipolar geometry in the sense that (i) the three fundamental matrices for the three pairs of views can be obtained from the coefficients of the tensor, and (ii) both point and line transfer are allowed. When used as a matching constraint, the trifocal tensor is less likely to incur the ambiguity problems. As a result, reconstruction algorithms based on it are expected to be more accurate. For two views, a special tensor composed of the elements of the fundamental matrix can be constructed by considering the third view as being coincident with one of the given views. A rigid transformation can be applied to the tensor. The new tensor plus the original views suffice to synthesize a novel view consistent with the transformation [4] (Avidan and Shashua '97). The advantages associated with the trifocal tensor do not come without a price: the constraint is algebraically over-parameterized; there are 27 coefficients, while three views can at most contribute 18 independent parameters. Taking into consideration the overall scale factor, eight additional constraints are needed in order to form a numerically stable solution. However, up to now, only 4 of them have been identified. Compared to estimating the fundamental matrix, the computation is more involved since there are many more coefficients to estimate. The lack of a more common notation than the tensorial one also makes it less appealing to the graphics community.

3.3.3 Quad-linearity

Investigations conducted by Faugeras et al. [29] and Triggs [94] have established the existence of a kind of quad-linear form with a total of 81 coefficients across four views, with the negative result that further views would not add any new constraints. This form of quad-linearity, however, is somewhat unsettling because the number of coefficients has risen to 81, whereas a view adds at most 12 parameters. In other words, it may be too redundant a representation of constraints over four views, and therefore may only be of theoretical interest.
3.3.4 Projective reconstruction

The dilemma in epipolar geometry or trifocal tensor based reconstruction algorithms is that, on one hand, large view separation helps to stabilize the computation process; on the other hand, points appear quite different when views are too far away from one another, which makes establishing the correspondences difficult. A natural solution is to use a dense image sequence, so that view separation and continuity are both achieved. With uncalibrated cameras, a projective reconstruction can be reached. Two types of approaches have been reported so far. One is given by Mohr et al. [62], who adopted the idea of bundle adjustment from photogrammetry [98]. Projective bundle adjustment is formulated as a non-linear minimization, just as the traditional bundle adjustment. However, two issues that either did not exist or were much less severe before become noticeable now. First, because both cameras and points need to be estimated, the problem size increases by a polynomial factor (O(m²n + mn²), where m and n are the number of cameras and points respectively). Second, and more problematic, our experiments indicate that the formulation is ill-posed, which appears as the degeneracy of the Jacobian matrix encountered in the minimization process. This problem is less severe in photogrammetry because, as its name suggests, bundle adjustment is only used as an adjustment to already good initial estimates, whilst Mohr et al. tried to apply it to very rough initial guesses. The second approach (Sturm et al. [86]) uses matrix factorization with the requirement that all points appear in all views. They first form the measurement matrix, whose rows are the image coordinates of all points in one view, and whose columns are the images of the same point in all views. They then estimate the fundamental matrix for each successive pair of views. From that, they compute the projective depths, and plug the latter into the original measurement matrix. By performing a rank-4-factorization on the resulting scaled measurement matrix, they obtain the projective reconstruction. More on this method will be described in section 5.1.2. The factorization method suggests a more general form of quad-linearity for multi-view geometry, which is characterized by the rank-4-ness of the just introduced scaled measurement matrix. It is more general because (i) it applies to an arbitrary number of views, and (ii) all previous forms of linearity (namely, epipolar geometry, trilinear tensor and quad-linearity) can be derived from the resulting projection matrices. Requiring that all points be visible in all views is quite restrictive in practice. In tracking a video sequence, it is natural to see some tracked features appearing and disappearing. Depending on a good epipolar geometry estimate is also undesirable, since the latter itself is still an open issue.

3.3.5 Self-calibration

For most applications, however, a projective structure is not sufficient. Therefore, algorithms that convert projective structures to Euclidean ones ought to be sought. Two different paths have been suggested. One of them (Faugeras [30], Pollefeys and Van Gool [71]) takes the stratified approach. It first transforms the projective structure into an affine one by identifying the plane-at-infinity, and then into a Euclidean one by making use of 3-D knowledge about the scene, such as the existence of a trihedral vertex.
The other path, referred to as self-calibration, reaches a Euclidean structure in one step. All existing algorithms (Faugeras et al. [27], Hartley [41], Pollefeys et al. [72], and Triggs [95]) are based on the invariance of the absolute conic (or quadric) under rigid motion. The self-calibration approach by Faugeras et al. [27] uses the Kruppa constraints, which, again, requires that the fundamental matrix be computed a priori. Hartley's approach [41] assumes a fixed rotating camera, and computes the image of the absolute conic. Triggs [95] took a different approach by computing the dual image of the absolute conic. Neither algorithm allows a varying focal length. Pollefeys et al. [72] presented a more general algorithm based on similar ideas but allowing a variable focal length. It should be pointed out that research on self-calibration has just started. Existing approaches are still immature, which draws split opinions about the subject. The supporting side believes this is a great idea, since it totally bypasses the need for calibration. The opposing side argues that self-calibration is inherently unstable because too many parameters need to be estimated, and that traditional calibration should be used whenever possible. Our own experiments suggest that while self-calibration is not a good candidate for calibration purposes, it certainly provides sufficient accuracy for reconstruction-oriented applications. Such an application will be demonstrated in section 7.5.

3.4 Hybrid Methods

The hybrid modeling approach is characterized by (i) making use of vision techniques to recover the object geometry, and (ii) rendering the recovered model via the standard graphics pipeline. Once a complete 3-D reconstruction has been assembled, the visibility issue (which appears in image warping methods) becomes a non-issue. Rather than address the general structure-from-motion problem as computer vision researchers do, Debevec et al. [22] cleverly restrict their problem domain to architecture, and use primitives such as boxes, prisms and surfaces of revolution as building blocks for architectures. Correspondences of primitives among different views are given manually. By utilizing the geometric constraints inherent in the primitives, they are able to reduce the problem space significantly, changing the ill-posed problem into a well-posed one. Source images are "pasted" onto the reconstructed models using texture mapping, a standard graphics technique. The method presented by Seitz et al. [81][82] avoids the image correspondence problem by working in a discretized scene space whose voxels are traversed in a fixed visibility order. The boundary voxels (of an object) are found by checking the color consistency of their projections in multiple views. The voxels are subsequently colored with the average color. The success of hybrid methods relies on the use of vision techniques to reconstruct a full 3-D model of the scene (as compared to warping methods, which stay in the 2-D image space), while at the same time avoiding the correspondence problem in its most general form. The disadvantage of Debevec's method is that it only works in its designated domain.
Seitz's method is slow due to the use of volume-rendering techniques to generate the final polygonal model. Both methods require that precise camera calibration data be known beforehand.

3.5 Summary

Among all the approaches surveyed so far, only the light-field approach is non-geometric, and therefore does not need the establishment of correspondence. The trade-off is the requirement of extremely large storage space. It is not clear if the reported techniques are feasible for widely spread scenes, where the 3-D sub-function becomes insufficient. The rest of the approaches all depend, one way or the other, on recovering some kind of geometric structure of the scene. Despite the original attempt to bypass explicit structure recovery (and thus calibration too), successful examples have only been reported for special cases where the cameras are either not moving or moving in a planar fashion. In general cases, due to the inability to resolve occlusion in 2-D image space, one has to resort to pixel-based rendering methods, departing from the ongoing trend, which is already polygon-based. As a result, in order to derive general solutions that are widely applicable and use only a small set of input images, one needs to develop new algorithms that output explicit Euclidean structures from uncalibrated images.

Chapter 4 IBRM from Two Views

This chapter presents our work on the simplest case of IBRM, that is, with only two images. The binocular stereo technique is used: the two pictures are shot from close-by locations. Briefly, stereo works as follows. For a given pixel in, say, the left image, one searches for its correspondence in the right image along the corresponding epipolar line based on some local similarity measure. Then, by intersecting the two rays emanating from the two pixels, one recovers the 3-D location of the point. All recovered 3-D points are connected to form a polygonal mesh representing the shape of the object. The input images can be texture-mapped onto the mesh, generating new photo-realistic images at arbitrary viewpoints. A divide-and-conquer approach is taken, i.e. the overall problem is divided into several sub-problems: image rectification, matching, projective reconstruction and Euclidean reconstruction, which are tackled separately. The stereo problem is often treated as a matching problem only, with the assumption that the cameras are calibrated. Under uncalibrated scenarios, however, the first and the third problems have to be addressed. A face reconstruction system is then developed using off-the-shelf digital cameras. One important application of such a system is telepresence, in which several geographically separated participants are brought into one virtual environment. The idea is to capture distilled information about an individual, such as the head position, orientation and, possibly, facial expression, and then use this information to drive the face model of an avatar at the remote site.

4.1 Review of Related Work

As mentioned earlier, the stereo problem is traditionally treated as a matching problem, that is, finding all pairs of pixels that are images of common 3-D points.
The reason is that when the cameras are fully calibrated (their intrinsic and extrinsic parameters are known), rectification and reconstruction become trivial issues. In this section, work on stereo matching is reviewed. For a comprehensive survey, the reader is referred to Dhond et al. [24].

4.1.1 Aerial triangulation

Shape recovery from aerial images was first introduced in photogrammetry [98] under the name aerial triangulation. Here the word triangulation comes from the fact that any ground point and its two projections onto the two cameras form a triangle. The cameras must be calibrated. To produce accurate results, the plane must fly along a predetermined path. Pass points on the ground are marked and surveyed, and are used to connect multiple areas in a linear fashion, hence the name stripe adjustment. This is depicted in Figure 4.1. Sometimes, multiple stripes are further tied together via tie points. This operation is called bundle adjustment, as shown in Figure 4.2. In general, aerial triangulation is labor-intensive work. Special machines are built for human operators to match the aerial images manually. All pass points and tie points must be surveyed by sending a team to the fields, and many fly paths are necessary to decide the final fly path.

Figure 4.1 Aerial Triangulation and Stripe Adjustment

Figure 4.2 Bundle Adjustment in photogrammetry

4.1.2 Binocular stereo

The same problem was studied in computer vision under the name binocular stereo, with the goal of automatic matching. The central issue here is how to use image information, which is quite limited, rather than high-level knowledge to match the corresponding pixels in the two images. This is where the epipolar constraint comes into play: it reduces a two-dimensional search into a one-dimensional search. Cox et al. [21] and Ohta et al. [65] are representatives of this type of method. These methods in addition incorporate the ordering constraint, using dynamic programming. The disadvantage is that inter-scanline coherence is ignored, leading to artifacts in the output. Including global constraints in the search process naturally makes stereo matching a constrained functional optimization problem. The solution function d(u,v) defines a regionally smooth surface over one of the image planes (e.g. the left one). The surface is normally referred to as the disparity surface. Therefore, what is actually achieved by the optimization scheme is to locate, in an abstract 3-D space, the disparity surface. The approach by Wei et al. [97] is the latest example along this line of work. There, the disparity surface is parameterized as a function with Gaussian bases to reduce the number of parameters, a typical technique used in variational methods. The result, not surprisingly, looks overly smoothed. The work by Hoff et al. [44] explicitly removes outliers using the Hough transform before fitting the surface. Since the Hough transform only applies when the algebraic form of the surface is known, they divide the surface into multiple planar patches. It is exactly this division that limits the robustness of Hoff's method, because global constraints are broken at the boundary between patches.
A more appealing solution is provided by Lee and Medioni [52], where each potential match is represented by a tensor (or ellipsoid) that captures location, normal and uncertainty. During tensor voting, consistent matches enhance one another, increasing the confidence in the normal estimation, while inconsistent matches interfere with one another, increasing the uncertainty. The voting process results in a volumetric field inside which the disparity surface is the most salient surface. The surface is extracted in a subsequent step using volume-rendering techniques (Tang and Medioni [90]). The problem with Lee's work lies in the scale issue (refer to her dissertation [51]): the range of each vote is uniformly preset. Another disadvantage is that the voting process is time-consuming. The reason comes from the fact that the convolution kernel is 3-dimensional and that it has to be aligned with the normal at each data site by applying a 4x4 transformation. Recently, Ishikawa et al. [45] and Roy et al. [77] have proposed a global optimization formulation for the general multi-baseline stereo problem. The formulation is quite powerful because it incorporates both inter- and intra-scanline constraints, and is able to model occlusions and discontinuities. The optimization problem is solved by transforming it into a maximum-flow problem. Then, the minimum cut associated with the maximum flow yields the disparity surface. Both works, however, failed to demonstrate examples involving more than two cameras, probably because doing that requires a calibrated dataset. The theoretical bound and actual execution time reported by the authors suggest that their methods are computationally demanding.

4.2 Epipolar Geometry in Detail

Here, we introduce the epipolar geometry, which is the basic mathematical tool for our stereo system. More conceptual details can be found in Faugeras [28]. The articles by Zhang ([100] and [101]) deal with the computational issues. Zhang also made his software available for public access.

Figure 4.3 Epipolar geometry

Graphically, epipolar geometry is depicted in Figure 4.3, where P, P' are 3-D scene points; p1, p2 are P's images; O1, O2 are the camera projection centers; and the line O1O2 is called the baseline. Notice that the two triangles O1O2P and O1O2P' come from a pencil-of-planes, which is projected to the pencils-of-lines in the image planes. The latter (e.g. l1 and l2) form epipolar lines. The intersection of each pencil-of-lines is called the epipole (o1, o2). An epipole plays several roles simultaneously. It is the intersection of all the epipolar lines. It is the intersection of the baseline with the image plane. It is also the projection of a camera projection center onto the counterpart image plane. Algebraically, epipolar geometry is described by the following equation:

    p2^T F p1 = 0

where F is a 3x3 rank-2 matrix called the fundamental matrix, and p1 and p2 are the 3-tuple homogeneous coordinates of corresponding pixel points. The equation reveals the fact that p1 is located on the epipolar line defined by p2^T F. The relationship is symmetric: p2 is on the line defined by p1^T F^T.
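As a quick numerical check of this relation, the following Python/NumPy sketch computes the epipolar line l2 = F p1 of a point in the first image and verifies that a point chosen on that line satisfies the constraint; the fundamental matrix and the pixel coordinates used here are purely hypothetical toy values, not data from our system.

```python
import numpy as np

def epipolar_line(F, p1):
    """Epipolar line in image 2 of the point p1 of image 1 (homogeneous 3-vectors).

    With the convention p2^T F p1 = 0, the line is l2 = F p1; a corresponding
    point p2 must satisfy p2 . l2 = 0."""
    return F @ p1

def epipolar_residual(F, p1, p2):
    """Algebraic residual p2^T F p1; it is (close to) zero for a correct match."""
    return float(p2 @ F @ p1)

# Toy example: a hypothetical rank-2 fundamental matrix (skew-symmetric, hence rank 2)
# and a hypothetical point in image 1.
F = np.array([[ 0.0,  -1e-4,  0.02],
              [ 1e-4,  0.0,  -0.03],
              [-0.02,  0.03,  0.0 ]])
p1 = np.array([120.0, 80.0, 1.0])
l2 = epipolar_line(F, p1)
# Pick a point lying on l2 (solve l2 . p2 = 0 for its y coordinate at x = 130).
p2 = np.array([130.0, (-l2[0] * 130.0 - l2[2]) / l2[1], 1.0])
print(epipolar_residual(F, p1, p2))   # ~0 up to numerical error
```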
Since o1^T F^T p2 = 0 for any p2, we have F o1 = 0. Thus o1 is the null vector of F, which reflects the fact that F is of rank 2. Similarly, o2 is the null vector of F^T. It is observable from Figure 4.3 that if an image plane is parallel to the baseline, then its epipole is at infinity. In that case, all epipolar lines on that image plane are parallel.

4.3 Rectification

In general, the two cameras used to take the stereo images are not parallel. The task of rectification is to achieve this effect numerically, so that they become coplanar and parallel to the baseline. The result is that the scanlines in each image plane are parallel to one another, and the corresponding ones from both images are horizontally coincident. Traditionally, this step requires fully calibrated cameras. Hartley et al. [39] gave an algorithm based on finding a special factorization of the fundamental matrix. Here, we present a more intuitive algorithm that is also based on the fundamental matrix. It works well for two close-by images, which is the case of stereo. In Figure 4.4, l1, r1 and l2, r2 are two pairs of corresponding epipolar lines; v1 is the average of the y coordinates of the four end points (two from l1, the other two from r1), and v2 is that of l2 and r2. The rectification transformation, one for each image, maps the trapezoid (black) to the rectangle (gray). The process is demonstrated in Figure 4.5. The software provided by Zhang [100] is used for the fundamental matrix computation. Below, we briefly explain why this algorithm works. First, recall from section 2.2.1 that the cross-ratio between the two pencils-of-lines in the two image planes is unchanged, and from 2.1.5 that the cross-ratio is invariant to homography. Second, within each pencil, the line that is "at infinity" is the one that passes through the image origin, because it possesses the canonical form [a, b, 0]. Third, after the rectification, all three bases are aligned: two of them are the top and bottom edges of the trapezoid, and the special one mentioned above is mapped to the x-axis. This fact, plus the invariance of the cross-ratio, makes the alignment of the corresponding epipolar lines (or scanlines after rectification) necessary.

Figure 4.4 Illustration of the rectification algorithm

(a) original input superimposed with matched points (b) original input superimposed with epipolar lines (c) rectification results
Figure 4.5 The original and the rectified stereo pair

4.4 Image Matching

In this section, it is assumed that the images have been aligned. We can thus encode the correspondence information by a function d(u,v) defined over the left image plane (denoted I1) such that (u,v) and (u+d(u,v), v) are a pair of corresponding pixels. Geometrically, d(u,v) defines the previously mentioned disparity surface. Assuming corresponding pixels have similar intensities (color), and letting Φ denote a similarity function such that larger values mean more similar pixels, matching can be formulated as a variational problem:

    max over d of  ∫∫ Φ(u, v, d(u,v)) du dv .    (4.1)

One simple solution to (4.1) is to sample over all possible values of u, v, and d, followed by an exhaustive search in the resulting volume.
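To make this brute-force baseline concrete, here is a minimal Python sketch that samples a given similarity function Φ over all pixels and disparities and then takes, at each pixel, the best disparity. The function names and array layout are illustrative assumptions; the concrete choice of Φ (the windowed correlation of the next subsection) is simply passed in as a parameter.

```python
import numpy as np

def build_uvd_volume(phi, width, height, d_range):
    """Sample phi(u, v, d) for every pixel and every disparity in d_range.

    d_range = (d_min, d_max), inclusive.  The result is the u-v-d volume of
    section 4.4.1: vol[v, u, k] = phi(u, v, d_min + k)."""
    d_min, d_max = d_range
    vol = np.empty((height, width, d_max - d_min + 1))
    for v in range(height):
        for u in range(width):
            for k, d in enumerate(range(d_min, d_max + 1)):
                vol[v, u, k] = phi(u, v, d)
    return vol

def exhaustive_disparity(vol, d_min):
    """Brute-force solution of (4.1): keep, at each pixel, the best disparity."""
    return d_min + np.argmax(vol, axis=2)

# Hypothetical usage, with phi the window correlation of equation (4.2):
# vol = build_uvd_volume(phi, width=512, height=384, d_range=(-59, 31))
# disparity = exhaustive_disparity(vol, d_min=-59)
```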
There are two issues: one is efficiency (how to perform the search in a time-efficient way); the other is robustness (how to avoid local extrema). The method presented below addresses both issues. The core idea is to treat d(u,v) geometrically as a surface in a volume, instead of an algebraic function, and to extract the surface by propagating from seed voxels which have a high probability of being correct matches.

4.4.1 The u-v-d volume

We use the normalized cross-correlation over a window as the similarity function:

    Φ(u, v, d) = Cov(Wl(u,v), Wr(u+d,v)) / ( Std(Wl(u,v)) · Std(Wr(u+d,v)) )    (4.2)

where Wl and Wr are the intensity vectors of the left and right windows of size ω centered at (u,v) and (u+d,v) respectively, d is the disparity, "Cov" stands for covariance and "Std" for standard deviation. The width and height of the (left) image together with the range of d form the u-v-d volume. The range of Φ is [0,1]. When Φ is close to 1, the two pixels are well correlated, and hence have a high probability of being a match. When Φ is close to zero, that probability is low. In implementation, a threshold needs to be set. We discuss how to choose its value in the next subsection.

4.4.2 Disparity surface extraction

The fact that Φ is a local maximum when (u,v,d) is a correct match means that the disparity surface is composed of voxels with peak correlation values. Matching two images is therefore equivalent to extracting the maximal surface from the volume. Since the u-v-d volume is very noisy, simply applying the Marching Cubes algorithm [55] would easily fall into the trap of local maxima. We thus implemented a propagation algorithm [46]. In addition, we make use of the disparity gradient limit [11], which states that |Δd|/|Δu| < 1. Use of this constraint in the scanline direction is equivalent to the ordering constraint often used in scanline-based algorithms (e.g. [21] by Cox et al.). Using it in the direction perpendicular to the scanlines enforces smoothness across scanlines, which is only partially enforced in inter-scanline based algorithms such as the one presented by Ohta and Kanade [65].

4.4.2.1 Algorithm description

The output of our matching algorithm is the disparity map, which corresponds to the voxels that comprise the disparity surface. This is where it differentiates itself from volume rendering, or from other matching methods that model the disparity surface as a continuous function. The algorithm undergoes two steps.

Step 1. Seed voxel selection. A voxel (u,v,d) is a seed voxel if (i) it is unique, meaning that for the pixel (u,v) there is only one local maximum at d along the scanline v, and (ii) Φ(u, v, d) is greater than a threshold t1. A seed necessarily resides on the disparity surface; otherwise, the true surface point (u,v,d'), for which d' ≠ d, would be a second local maximum. To find seeds, the image is divided into a number of buckets. Inside each bucket, pixels are checked randomly until either one seed is found, or all pixels have been searched without success. During the search, the voxel values are cached to save computation time for the next step. The value of t1 determines the confidence of the seed points and is set close to 1.
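For illustration, the correlation of (4.2) and the seed test just described can be sketched as follows in Python; border handling, bucketing and the caching of voxel values are omitted, and the array indexing conventions are assumptions rather than the exact implementation.

```python
import numpy as np

def ncc(left, right, u, v, d, w):
    """Normalized cross-correlation of equation (4.2) over a (2w+1)x(2w+1) window."""
    Wl = left[v - w:v + w + 1, u - w:u + w + 1].astype(float).ravel()
    Wr = right[v - w:v + w + 1, u + d - w:u + d + w + 1].astype(float).ravel()
    Wl -= Wl.mean()
    Wr -= Wr.mean()
    denom = np.linalg.norm(Wl) * np.linalg.norm(Wr)
    return float(Wl @ Wr) / denom if denom > 0 else 0.0

def is_seed(left, right, u, v, d_range, w, t1):
    """Return the seed disparity at pixel (u, v), or None.

    The voxel is a seed if the correlation profile along the scanline has a
    single local maximum and that maximum exceeds the threshold t1."""
    d_min, d_max = d_range
    scores = np.array([ncc(left, right, u, v, d, w)
                       for d in range(d_min, d_max + 1)])
    peaks = [k for k in range(1, len(scores) - 1)
             if scores[k] >= scores[k - 1] and scores[k] >= scores[k + 1]]
    if len(peaks) != 1 or scores[peaks[0]] <= t1:
        return None
    return d_min + peaks[0]
```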
In our experiments, we start from 0.995, trying to find at least 10 seeds. If too few seeds are found, the value is decreased. In all the examples tried so far, we have found the range of t1 to be between 0.993 and 0.996.

Step 2. Surface tracing. Simultaneously from all seed voxels, the disparity surface is traced by following the local maximal voxels whose correlation values are greater than a second threshold t2. The |Δd|/|Δu| < 1 constraint determines that when moving to a neighboring voxel, only those at d, d−1, d+1 need to be checked. Initially, we store the seed voxels in a FIFO queue. After tracing starts, we pop the head of the queue every time, and check the 4-neighbors of the corresponding pixel (border pixels need special treatment). When two surface fronts meet, the one with the greater correlation value prevails. If any new voxels are generated, they are pushed to the end of the queue. This process continues until the queue becomes empty. To enforce smoothness, we assign (u', v', d) higher priority than (u', v', d−1) and (u', v', d+1). To obtain sub-pixel accuracy, a quadratic function is fitted at (u', v', d−1), (u', v', d), and (u', v', d+1), where (u', v', d) is the newly generated voxel. t2 determines the probability that the current voxel is on the same surface that is being traced. It turns out that the value of t2 is not critical. In all the examples tried so far, the value 0.6 is used. Pseudo code of the tracing algorithm is given below.

4.4.2.2 Complexity

The worst-case complexity of the seed selection part is bounded by O(WHDω), where W and H are respectively the width and height of the image, D is the range of the disparity, and ω is the size of the correlation window. That of the tracing part is bounded by O(WHω). Since some voxels have already been computed during the first step, this limit is never reached. Actual running times on examples are given later. Note that, in any case, each image plane is expected to be traversed at least once; thus the lower bound of the complexity is O(WHω).

Algorithm 1. Disparity Surface Tracing
    Initialize Q with the seed voxels;
    While (not empty Q) {
        Set (u, v, d) = pop Q;
        For each 4-neighbor of (u, v) {
            Call it (u', v');
            Choose among (u', v', d-1), (u', v', d), (u', v', d+1) the one with the
                maximum correlation value and call it (u', v', d);
            if (u', v') already has a disparity d'
                disparity(u', v') = Φ(u', v', d) > Φ(u', v', d') ? d : d';
            else if Φ(u', v', d) > t2 {
                disparity(u', v') = d;
                push (u', v', d) to the end of Q;
            }
        }
    }
Figure 4.6 Pseudo code of the tracing algorithm
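A direct, runnable transcription of the pseudo code above is given below (Python); it assumes a precomputed correlation function phi(u, v, d) and a list of seed voxels, and it omits the caching, border treatment and sub-pixel refinement discussed earlier.

```python
from collections import deque

def trace_disparity_surface(phi, seeds, width, height, d_range, t2):
    """Trace the disparity surface from the seed voxels (Algorithm 1, Figure 4.6).

    phi(u, v, d) is the correlation of (4.2), seeds is a list of (u, v, d)
    triples, and t2 is the tracing threshold.  Returns {(u, v): disparity}."""
    d_min, d_max = d_range
    disparity = {(u, v): d for (u, v, d) in seeds}
    queue = deque(seeds)
    while queue:
        u, v, d = queue.popleft()
        for du, dv in ((1, 0), (-1, 0), (0, 1), (0, -1)):          # 4-neighbors
            un, vn = u + du, v + dv
            if not (0 <= un < width and 0 <= vn < height):
                continue
            # Disparity gradient limit: only d, d-1, d+1 need to be checked;
            # listing d first gives it priority in case of ties.
            candidates = [dc for dc in (d, d - 1, d + 1) if d_min <= dc <= d_max]
            dn = max(candidates, key=lambda dc: phi(un, vn, dc))
            if (un, vn) in disparity:
                # When two fronts meet, the better-correlated one prevails.
                if phi(un, vn, dn) > phi(un, vn, disparity[(un, vn)]):
                    disparity[(un, vn)] = dn
            elif phi(un, vn, dn) > t2:
                disparity[(un, vn)] = dn
                queue.append((un, vn, dn))
    return disparity
```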
In this example, W= 512, H=384, c j=9, and the 45 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. disparity range is [-59, 31]. Running on a SGI 02 with default values for /I and /2, step 1 takes 2696 seconds, which results in 468 seeds; step 2 takes 24 seconds. (a) two cuts through the original u-v-d volume (b) after non-maxima suppression (c) after tracing Figure 4.7 Results on Herve’s Face stereo pair 46 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Figure 4.8 Output disparity map of the Herve’s Face pair 4.4.4 Improving time efficiency Obviously, the bottleneck o f the previous extraction algorithm lies in the seed selection part. To improve time efficiency, we modify the algorithm so that it proceeds in a multiresolution fashion: only at the coarsest level is the full volume computed; at all subsequent levels, seeds are inherited from the previous level. To guarantee the existence o f seeds at the coarsest level, we replace the uniqueness condition (in step 1 o f 4.4.2.1) by a winner-take-all strategy, that is, at each ( m,v), we compute all voxels (u,v,d) where d& [-WqI2, W q!2] and choose the one that has the maximum correlation value. Under this relaxed condition, some seeds may represent incorrect matches. To deal with this, we assign the seeds randomly to five different layers. As a result, five disparity maps are generated at the end o f tracing. This lets us identify and remove wrong matches. I f no 47 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. agreement can be reached, we leave that point unmatched. At each level, extraction is performed for both the left and right images. Crosschecking is then conducted. Those pixels whose left and right disparities differ by more than one pixel are eliminated and recorded as unmatched. At the finest level, small holes are filled starting from the borders. Figure 4.8 shows the final disparity map resulting from the improved algorithm. The execution time is reduced to 429 seconds, about 1/6 o f the previous version. Assume the reduction rate is 4 and the size o f the correlation window is constant over all resolutions, the time complexity is reduced to 0{WHco). Another merit o f the m ulti resolution version is that there is no need to prescribe a value for D. 4.5 Reconstruction In the reconstruction stage, the correspondence information is transformed into 3-D Euclidean coordinates o f the matched points. A two-step approach (projective reconstruction followed by Euclidean reconstruction) is taken which eliminates the necessity o f camera calibration. Both steps are extended in Chapter 5 and Chapter 7 respectively to handle multiple views. 4.5.1 Projective reconstruction by matrix factorization In [70] [91], Kanade et al. described a reconstruction algorithm using matrix factorization. Denote the projections o f n points in two views as [un, v,; ]r where /= 1,2 and y = l,..., n. The following measurement matrix is defined: 48 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Ml M /i _ U ; "“ 2 1 " LV2I . M » “2, , _V 2« . The authors observed that, under orthographic or para-perspective projection, the aforementioned matrix is o f rank 3. Then, a rank-3-factoriztion o f the measurement matrix gives the affine reconstruction. One advantage o f their algorithm is that all points are used concurrently and uniformly. 
The authors observed that, under orthographic or para-perspective projection, the aforementioned matrix is of rank 3. Then, a rank-3-factorization of the measurement matrix gives the affine reconstruction. One advantage of their algorithm is that all points are used concurrently and uniformly. In applying the idea to perspective projection models, Chen and Medioni [14] show that the following modified measurement matrix is of rank 4:

    W = [ u'_11 ... u'_1n ]
        [ v'_11 ... v'_1n ]
        [   1   ...   1   ]
        [ u'_21 ... u'_2n ]
        [ v'_21 ... v'_2n ]
        [   1   ...   1   ]

where each column denotes a pair of corresponding pixels after rectification. Thus a rank-4-factorization produces a projective reconstruction (section 2.2.2):

    W = [ P1 ] [ Q1 ... Qn ]    (4.3)
        [ P2 ]

where P1 and P2 are the 3x4 matrices of the two cameras, and the Qj's are the homogeneous coordinates of the points. Appendix A.1 gives a method of computing such a factorization using Singular Value Decomposition (SVD). Next, we transform the projective reconstruction obtained so far into the first canonical form (section 2.2.3), which is a prerequisite of our Euclidean reconstruction algorithm. Let P1 = [P̄1 p1], where P̄1 is the left 3x3 block and p1 the last column. It is known from [28] that C1 = −P̄1⁻¹ p1 is the first projection center. The stereo rig can be translated so that C1 is coincident with the world origin. Let the translation matrix be

    T = [ I  C1 ]
        [ 0   1 ]

then

    W = [ P1 T ] [ T⁻¹ Q1 ... T⁻¹ Qn ] = [ P̄1  0  ] [ Q'1 ... Q'n ]    (4.4)
        [ P2 T ]                         [ P̄2  p2 ]

where, with a slight abuse of notation, P̄2 and p2 now denote the blocks of P2 T, and Q'j = T⁻¹ Qj. Thus (4.4) is the desired canonical form for stereo projective reconstruction.
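The SVD-based rank-4-factorization referred to above (Appendix A.1) can be sketched in a few lines of Python; how the singular values are split between the two factors is only one of several equally valid choices, since the factorization is defined only up to a 4x4 collineation.

```python
import numpy as np

def rank4_factorization(W):
    """Factor a (3m x n) scaled measurement matrix as W ~ P @ Q, with P (3m x 4)
    and Q (4 x n), using the closest rank-4 approximation given by the SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    sqrt_s = np.sqrt(s[:4])
    P = U[:, :4] * sqrt_s            # stacked 3x4 projection matrices
    Q = (Vt[:4, :].T * sqrt_s).T     # homogeneous point coordinates
    return P, Q

# Hypothetical two-view usage on a rectified pair
# (rows of W: u1, v1, 1, u2, v2, 1 for each point):
# P, Q = rank4_factorization(W)
# P1, P2 = P[0:3, :], P[3:6, :]
```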
(a) input image pair (b) disparity map (c) reconstruction Figure 4.9 Results o f the Pentagon pair Notice the T shape building inside the circle. Reproduced with permission of the copyright owner. Further reproduction prohibited without permission (a) input (b) tracing result (d) two textured views o f the reconstruction Figure 4.10 Results on the Renault Auto Part pair 54 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 4.6.2 Renault Automobile Part Results on this example are shown in Figure 4.10 where (a) is the input pair, and (b) shows the disparity surface obtained from the tracing procedure. For comparison, (c) shows all local maxima in the u-v-d volume which is obviously quite noisy. It is clear that a brute-force method would produce disastrous results. Since we only have the rectified images, but not the rectification homography, only an approximate Euclidean structure can be obtained. However, since the vergence angle is very small, the distortion is also small, as evidenced in (d). 4.6.3 Herve’s Head Figure 4.11 illustrates results on the "Herve's Head" pair. The pair is taken by a single 16mm camera with four mirrors to guarantee that there is no motion o f the subject. It can be seen in the figure that even the highlight pixels on the nose have been correctly matched. It is also worth mentioning the shape o f both upper and lower lips has been captured as well, as can be seen in the right-most picture. Figure 4.11 Results on the Herve’s Head pair 55 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 4.6.4 A full head example A styrofoam head is used as the testing object. Six stereo images are obtained by placing it on a rotary table while keeping the cameras fixed. The rotation angles are 0°, 60°, 90°, 180°, 270° and 300°. The two extra views at 60° and 300° help to increase accuracy near the frontal face. Each stereo pair contributes a slice o f the head (point cluster). A ll slices are then fused in a common coordinate system by rotating them to the known degree. The result is shown in Figure 4.12 where (a) is one o f the images at the 60° angle, (b) is the disparity map and (c) shows two views o f the fused data set. Spurious points exist in the reconstruction. Some o f them are sparse and random, therefore relatively easy to be removed. There are others that tend to form small surface patches. These points mainly come from the boundary areas o f the stereo pairs where the foreground “ melts” into the background, a known problem in stereo matching. The noise removal issue is not addressed in our work. Instead, the software written by Tang [90] is used. As seen in Figure 4.13, not only are all the spurious points eliminated, the large empty area on the top o f the head is also closed. The theoretical foundation o f the software is called Tensor-Voting, proposed by Medioni and Lee [52][90]. Thorough descriptions about the theory and the implementation algorithms supporting Tang's software can be found in the book by Medioni, Lee and Tang [59]. 56 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. (a) one o f the 60° pair (b) the disparity map o f this pair 1 (c) different views o f the fused reconstruction Figure 4.12 Reconstructed fu ll head model Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 
Figure 4.13 Results from tensor-voting 4.7 A Face Reconstruction System To further explore the potential o f our algorithm, we have built a Face System using two FUJI DS-300 digital cameras and a synchronizing device (Figure 4.14). The cameras output uncompressed color images in TIFF format with 640x480 resolution. The color information is used for texture mapping only. We decide to work on faces for two primary reasons. First, faces are hard to reconstruct due to their smooth shapes and lack o f prominent features in most areas. Second, there are plenty o f applications that may make use o f such a system, for instance, teleconferencing and animation. We are collecting a growing gallery o f face sets, some o f them, together with the acquired models, are demonstrated in Figure 4.15 and Figure 4.16. In each example, the 58 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. first two images are the input pair; the remaining ones are the reconstructed models. In general, we are pleased with these results. Figure 4.14 Our stereo system made from two FUJI DS-300 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Stefan Hinz Figure 4.15 Examples and results from the Face System Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Prof. Gerard Medioni Qian Chen Dr. Zhengyou Zhang Figure 4.16 More examples and results 61 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Chapter § Multi-view Projective Reconstruction This chapter begins the effort o f generalizing the reconstruction algorithms developed in the previous chapter to the multi-view case, that is, multi-view projective reconstruction. Later, in Chapter 7, the algorithm for Euclidean reconstruction w ill be presented. As said in the previous two chapters, research on the topic o f structure-from-motion started a long time ago. The approach taken by the photogrammetrists (see the monograph by W olf [98]) was burtdle-adjustment, which requires a priori the knowledge o f both the extrinsic parameters (location and orientation) o f the cameras and the intrinsic parameters (including focal length). Later, it was shown that the requirement on extrinsic parameters could be relaxed (e.g. Ramesh Jain [48]), leaving an unresolved global scale which in many cases is not needed anyway. Recently, Faugeras [26] proved that when neither the extrinsic nor the intrinsic camera parameters are known, in other words, when the cameras are totally uncalibrated, then, under the perspective camera model, a projective reconstruction can be reached which differs from the Euclidean one by a 4x4 transformation. In the meantime, Kanade et al. [70] [91] were experimenting with affine camera models. They developed a method using matrix factorization that computes extrinsic and intrinsic parameters simultaneously under the orthographic or paraperspective model (see 4.5.1 for a brief introduction). Both bundle-adjustment and matrix factorization methods have been extended to the perspective model, with limited success. Detailed analysis o f these algorithms w ill be 62 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. given in sections that follow. In conclusion, a formulation for general multi-view structure-from-motion with uncalibrated cameras is still an ongoing research issue. 
5.1 Factorization-based Methods 5.1.1 Factorization-based reconstruction for affine cameras This problem was tackled by Toraasi and Kanade [91], and Poelman and Kanade [70]. Here, we adopt the derivation o f Poelman et al. with slight modifications to be consistent with our presentation for perspective cameras in later sections. A point whose 3-D location is Xj Q=\, ..., n) is observed in an affine camera whose projection matrix is P, ( / - l , ..., m) at image coordinates (wy. vy) such that K , v J r = PIX I + [.r,,y (]r where (r„ y,j is the image o f the world origin. W riting in a matrix form, we have W = ’“ l l ' _V„ . _VI" . “ m l " .V-» ! . _V n,n . [X, - X,]* yy .y. [i - i] (5.1) where W is called the measurement matrix. Without loss o f generality, it is assumed that the world center is the center o f mass o f the points, thus ^ X t In = 0. Adding all / / columns in both sides o f (5.1) and divide each sum by n, we havex, = jnand y, = ]T v7 In for all /= 1 ,..., m. Therefore, we can compute x, and>>/ immediately from the 63 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. image data. Once they are known, we subtract them from W, giving the registered m easurem ent m atrix 'p, ] : j t .L ( 5 . 2 ) P L " J 2 # ix 3 where w '; = un - x, and v'; = v( / - .y,. This can be summarized into the following rank theorem for affine cameras: T heorem 1. Under affine projection, the rank o f the registered measurement matrix is at most 3. Actually, except for the degenerate cases where all points are either colinear or coplanar, W’ is o f rank 3. Thus, in general cases, a R ank-3-Factorization o f IT'gives the cameras’ projection matrices and the points’ coordinates in one step. Such a factorization can be implemented using Singular Value Decomposition (see Appendix A .l). Since it is not our focus to study the degenerate cases, in the following, we w ill assume that all the points are neither colinear nor coplanar in space. 5.1.2 Extension to perspective cameras To extend the previous results to perspective cameras, one naturally defines the measurement matrix as: 64 u u _v Jt. -vin- ’ k « | ' ^ m u _ V m I . v' _ r mn _ Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. and then asks the question: is W o f rank 4? It turns out that this is true only under a special situation. But we delay the discussion till section 5.2. For now, let us first look at the generic case. Suppose we have a perspective camera whose projection matrix is P, (3x4), and a point in some projective space whose homogeneous coordinates are in Xs (4x1). Using the symbol to denote equality up to a scale (see section 2 . 1. 1), we have [ujj, V jj, 1 ]T~P,Xj, or Xj[Uij, V y, 1 ]T = PjXj, for /= 1 ,..., m andy = 1 ,..., n, where A ,y is a scale factor which, under normal circumstances, depends on both the camera and the point. The equivalent matrix form is: > .1 ; I* . - * . L (s-«) P L "03/11*4 where W s is called the scaled measurement matrix. A similar conclusion thus holds for perspective cameras which is expressed as: “ ll “ n, 4 . Vl« 1 ^ ■ l/i vi„ 1 . “ «l “ n„, 1 V«l 1 ^"mn V n,n 1 65 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Theorem 2. There exists a set of scale factors, all non-zero, such that the corresponding scaled measurement matrix is of rank 4 under non-degenerate cases. 
Consequently, for points in general positions, a Rank-4-Factorization o f the scaled measurement matrix produces a projective reconstruction o f the points. The theorem also suggests that the key step in the perspective case is to recover the appropriate scale factors. Sturm et al. [8 6 ] proposed a progressive method by computing the fundamental matrix between each successive pair o f views. Starting from the first view with X\\ = X\i = ... = X\„ =1, the method finds /l,i, Xa X ,„ for /=2, ..., m progressively using the . r .1, fundamental matrix Fu.\ between the (/-l) and the / views: ( « ,,- ! X ? ,.,)• , II i-l.J °,.»-i X<M where ou. i is the epipole, and v,y, l ] r. The algorithm thus requires the epipolar geometry be known or estimated at the beginning. 5.1.3 The Iterative Factorization Algorithm (IFA) Sturm's method requires the fundamental matrix, which causes two problems in practice. First, there exist special camera configurations where the fundamental matrix becomes degenerate, e.g. all cameras are parallel and co-planar. Second, estimation o f the fundamental matrix is still experimental, normally cast as a constrained non-linear 66 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. minimization problem which needs initialization and may incur numerical instability. The Iterative Factorization Algorithm that we are proposing avoids all these problems. Let c t u c f i, c r$ , c r 4 , 0 5 be the five largest singular values o f fVs, and the rank-4-ness o f W s be measured by < 75. The solution to (5.4) can be obtained by: min (cr5 ). (5.5) It is quite difficult to find an analytical solution to (5.6). Instead, a heuristic algorithm is presented here which, verified experimentally, does minimizes < 7 5. The major two steps o f the algorithm are (i) to perform the rank-4-factorization on W (Singular Value Decomposition Appendix A .l), obtaining fV4; and (ii) to rescale IF with the coefficients o f W4 to make it closer to “ rank-4” . We repeat this factorizaion-rescaling process, trying to bring W closer and closer to rank 4 until (T 5 reaches a predefined small value. This process is depicted in Figure 5.1. For reference, we call it the Iterative Factorization Algorithm (IFA). The pseudo code o f the algorithm is summarized in Figure 5.2. Factorization W s Updating scale factors Figure 5.1 Illustration o f IFA 67 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Algorithm 2. Iterative Factorization Algorithm (IFA) Repeat { Perform SVD on W : W=U^Vr- , Obtain 2 4 from jb y setting all singular values except the 4 largest ones to zero; Let W ^U & V7- , Let the 3-tupple corresponding to the i-f*' element (as defined in (4.3)) o f W and IV A be a# and by respectively; Compute the /- 7 th scale factor for W as: s. . )/(*,[*,,) Update W with the new scale factors. } Until ( 0 5 < f) Figure 5.2 Pseudo code o f IFA It is interesting to see the difference between IFA and a similar one proposed by 7* Berthilsson et al. [9]. Let W=UZV be the singular value decomposition o f W. Denote the first four columns o f U an V respectively by U4 and V 4, then P=U[ diag(a\,0 5 ,< 3 ,0 *4 ) gives the camera matrices and X=V4 gives an approximate projective reconstruction. Notice that here X consists o f four orthonormal columns. The updating formula o f IFA is: = P X = P X X 'X V (*♦1) • 1 1 1(*+l) ■'' Am ** 1 1 r 'u,\ v,\ •](* ) m 1 1 (X rX ) (5.6) where k is the iteration index. 
On the other hand, according to [9], Berthilsson's updating formula is: 68 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. which equals to (X'X). (5.8) In some sense, (5.8) concatenates two steps o f (5.6). Therefore Berthilsson's algorithm converges faster. However, while the time complexity o f solving for the scale factors from (5.6) is O(n) because each o f them is computed individually, that o f solving (5.7) is 0 («3 ) because all factors are solved simultaneously (which amounts to solving a homogeneous 3nxn equation). This analysis makes IFA practically more applicable. 5.2 Rank Theorem Under The Common-Image-Plane Condition In the previous section, algorithms that compute the scale factors using matrix factorization are discussed. An unanswered question is if there exist cases where the computation o f the scale factors is unnecessary. Asked differently: are there cases where the measurement matrix is o f rank 4. To answer that question, some preparations are necessary. First is the following lemma. Lemma. Let W s be a scaled measurement matrix with the scale factors A ,y as defined in Theorem 2, for /=1, ..., m andj=l, ...,« . Then the corresponding measurement matrix W 69 v,i i(*+n Am V ,„ = V,\ ;(*+!) ^7 / 1 v„ , _i _ 1 1 _ 1 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. is rank 4 if and only if there are m+n independent coefficients pt, l< /< m, and Xj, 1< j< n, where pi is related to the i'h camera and x} is related to the f h point, such that ky = Pi X j Proof. Sufficient condition. I f we pre-multiply W s by the diagonal matrix rJ_ Pi J _ Pi and post-multiply it by x -, we have ■ 1 " 1 Pi 1 I w = Pi 1 1 Pm. which indicates that W is o f rank 4. I-* 1 Pi ’ * L . K 1 --- J - c 70 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Necessary Condition. Let the Rank-4-Factorization o f W be W = p; P' I* ; ••• * : ] Since X t ’s and X ’ t ’s are two projective reconstruction o f the same set o f points, they are equivalent up to a 4x4 non-singular collineation H (section 2.2.2). This means that there exist n non-zero factors: x i, ..., jc„, such that [xx X[ Therefore w ■ ■ x„X'n] = H[X, *,]■ p; p ’ m _ ' p; p p," p: h h " [ x ; ... . v ; ] H :L X „ •V, -V, Consider the two projection matrices o f the /lh camera. The original definition says that V ij, 1 ]r = PXj. Hence P, and P," both project X \ , X „ to (uihv ,/),.... (u in , v ,„). As a result, they should only differ by a scale factor. Denote the factor and write P, = PiP". Knowing this is true for all 1< /< m, we have Aij[u0, V ij, l]r=piPi"Xj=piXj[ui j, V jj, l ] r for l< i<m, 1 <j<n, 71 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. which implies Ay = pjj, for 1 < i< m, 1< j< n. It follows from the derivation that p, is related to the z th camera, and xy is related to the f ' point. 
♦ As the second preparation, we need some results on the geometrical interpretation o f the rigid transformation (represented by a rotation matrix R and a translation vector 7) that relates the camera coordinate system (Xc ) and the world coordinate system (X w ): R T x w x w or = W 0 1 w w Rr - RrT 0 1 X e \v W riting R in its row vectors R = *<o * < 3 > we have 'O' 'O' ‘ RT - R rT 0 Rr 0 ^(3) 0 1 1 _0 C _1 i 0 Since [0 0 1 0]r represents the camera's principal axis in Xc , Rq> — the third row o f R- represents the same axis in Xw . Furthermore, since ' O ' " r ’R T 0 _i 0 1 y 0 1 72 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. T represents the world origin in the camera coordinate system. In particular, the third component o f T is the translation in the direction o f the principal axis (the Z axis o f Xc ). As the final preparation, we introduce the common-image-plane condition. In Figure 5.3, Oi indicates the origin o f a camera coordinate system, O indicates world origin, and /? ,( 3) represents a camera's principal axis. It can be observed that the common-image-plane condition is equivalent to i). A ll the camera principal axes are parallel, and ii). The common image plane is perpendicular to the common principal axis. Violating one o f them w ill violate the common-image-plane condition. Y U 3 ) Figure 5.3 The common-image-plane condition Xc -Camera Coordinate System, Xw -World Coordinate System R 2( 3), /? 3( 3> - the parallel principal axes, Ou 0 2 , 0 2 - the coplanar projection centers. 73 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Now, we are ready to answer the question: under what situation is W itself o f rank 4? T heorem 3. The measurement matrix is of rank 4 if and only if the cameras satisfy the common-image-plane condition. Proof. Let Ai represent the internal camera parameters, R, and T, be the external parameters. The projection process is -P . X . - [ a. 0 ] R. I 0 1 au, k, w O , 0 0 avt v O , 0 0 0 M f 0 < - > *>(1) ^ i( 2 ) X ' < 3 , ^ /(3 ) _ 1 0 1 Pni)X, + f,(3) /, (5.9) From the second preparation, we know that R,q) is actually the camera's principal axis in the world coordinate system, and rl( 3) is the translation o f the world origin in this direction. I f the scale factor A/, is chosen to be (Rj0)X t +tl0))/f , then (5.9) is an indication that the resulting scaled measurement matrix is o f rank 4. From the lemma, it can be inferred that the sufficient and necessary condition for the corresponding 74 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. measurement matrix to be rank 4 is that Xtj = (RjmX l +t,0))/f, can be decomposed into two terms, such that one o f them is related to the 1th camera, and the other is related to the / h point. This is true if and only if (i) R\Q )= R.2Qj=...=RmO)=V, and (ii) 'i( 3)=' 2( 3)=----fm ( 3)=/, so that Aij=( 1 //j)(Vr Xj+()-( 1 !fi) Z j. The first condition implies all principal axes are parallel (with the common direction V). The second one implies that the world origin has the same distance t to all cameras along V. It is not hard to verify that these two conditions in fact amount to the common-image-plane condition. ♦ Notice that in a general camera configuration, the dot product appearing in Xy prevents it from being decomposed into two terms related respectively to a camera and a point. 
On the other hand, the above special configuration is not merely of theoretical interest. Any two or three views can be rectified into such a configuration. We have seen the two-view case in section 4.3 as part of our stereo reconstruction algorithm, where this step is called rectification.

To confirm the theory, an experiment was conducted on the CIL-0001 dataset provided by the Calibrated Imaging Laboratory of Carnegie Mellon University, where one camera was mounted on a robot arm. We took seven coplanar views (numbers 1, 2, 3, 4, 5, 10, 11) from the set. The first five views are on the same baseline, that is, the cameras are collinear. The last two are on another baseline, which is above the first one. Camera calibration data is available. The average reconstruction error (in mm) vs. the number of views is plotted in Figure 5.4, where the solid curve corresponds to the view sequence 1, ..., 5, 10, 11 and the dashed curve corresponds to 10, 11, 1, ..., 5. It is observed that the error decreases as the number of views increases, which indicates that some random errors cancel each other. We also computed the camera positions. Since there is no ground truth for them, we only list in Table 2 the relative camera movement with respect to the first camera, knowing that each of them moves either horizontally or vertically in a 20 mm interval.

Figure 5.4 Average reconstruction error of the CIL-0001 dataset

Table 2. Recovered relative camera movement (mm)

              ΔX        ΔY        ΔZ
  camera 1    0         0         0
  camera 2    19.2671   1.04254   -0.0557293
  camera 3    40.209    -1.80103  -0.252869
  camera 4    60.0891   -1.62216  -0.583654
  camera 5    80.0262   -1.56739  -1.05636
  camera 10   40.205    18.951    -0.274677
  camera 11   60.3754   18.0331   -0.616324

5.3 A Geometric Explanation

In [92], Triggs introduced what he called the Joint Image Theory, which gives an insight into the projective reconstruction problem in a geometrical sense. Let us look at equation (5.4) and denote the 4-dimensional space spanned by the X_j's as Σ. The equation can be interpreted in two ways: n 4-D points in Σ are projected to m algebraically 3-D spaces, or the same points are projected to an abstract 3m-D joint image space, which we denote Ω. Accordingly, the m 3×4 projection matrices form one 3m×4 joint projection matrix φ: Σ→Ω. The range of φ only occupies a 4-dimensional subspace Ω₄ in Ω. In fact, a more precise definition for φ should be φ: Σ→Ω₄. From this point of view, W_s is merely a sample of Ω₄. Projective reconstruction is nothing but choosing a basis for Ω₄ (which defines φ) and projecting W_s onto this new basis, which generates the homogeneous coordinates of the points. This corresponds to a standard engineering problem called Principal Component Analysis (PCA) and is normally solved using Singular Value Decomposition, which is exactly what has been proposed in [90]. The problem, however, is that the location of Ω₄ in Ω is unknown. What is known is the subspace spanned by W, which we denote by Ω_x.
It is thus necessary to warp W from Ω_x into Ω₄, which can be accomplished by finding the appropriate scale factors. The following duality between the spaces just mentioned and the matrices appearing in our earlier derivation is easily observable:

  Ω₄ ↔ W_s    Ω_x ↔ W    φ ↔ P

The dimension of Ω₄ is a reflection of the rank-4-ness of W_s. Arbitrary camera placement generates a "joint camera" whose projection is a map φ': Σ→Ω_x. The dimension of the range space of φ' is unknown. Restricting the cameras to have a common image plane results in a map Σ→Ω₄. In this sense, the rectification step in binocular stereo in effect accomplishes the same goal as warping the joint image.

5.4 Iterative Methods

5.4.1 Mohr's non-linear minimization method

Adapting the bundle adjustment method to the fully uncalibrated case, Mohr et al. [62] proposed to formulate projective reconstruction as a minimization problem:

$$\min \sum_{i=1}^{m}\sum_{j=1}^{n}\left[\left(u_{ij}-\frac{P_{i(1)}X_j}{P_{i(3)}X_j}\right)^{2}+\left(v_{ij}-\frac{P_{i(2)}X_j}{P_{i(3)}X_j}\right)^{2}\right] \tag{5.10}$$

where P_i(1), P_i(2), P_i(3) are the row vectors of P_i, with the constraints ||P_i|| = 1 and ||X_j|| = 1. The formula can be deduced from (5.4) by noticing that u_ij = P_i(1)X_j / P_i(3)X_j and v_ij = P_i(2)X_j / P_i(3)X_j. To determine the initial values, five non-coplanar points are assigned the standard coordinates [0,0,0,1], [1,0,0,1], [0,1,0,1], [0,0,1,1], [1,1,1,1]. It is claimed that the initial values for the remaining data are not critical. For example, the projection matrices are all set to the identity matrix and the points are set to [0.5,0.5,0.5,0.5].

Mohr's algorithm is not computationally efficient due to the need to invert the associated Jacobian matrix, whose size is polynomially proportional to the number of cameras and points (∝ m²n + mn², where m and n are the numbers of cameras and points respectively). More severely, our implementation showed that the Jacobian matrix was constantly rank deficient, which indicates that some inherent constraints were not taken into account. Consider the following matrix formed by accumulating column-wise the homogeneous coordinates of l points:

$$A=\begin{bmatrix} x_1 & x_2 & \cdots & x_l\\ y_1 & y_2 & \cdots & y_l\\ z_1 & z_2 & \cdots & z_l\\ w_1 & w_2 & \cdots & w_l \end{bmatrix}$$

If all these points are collinear, only two of the columns are independent, thus A is of rank 2. In other words, the determinant of any order-3 or order-4 minor of A is zero. If the points are coplanar, the determinant of any order-4 minor is zero. Incorporating these extra conditions into (5.10) is not always desirable. For example, although it is easy to check the collinearity condition, to check for coplanarity one has to compute the fundamental matrix. Furthermore, the above constraints amount to degree-3 or degree-4 polynomials, making the non-linearity of the problem even worse.

5.4.2 Our approach

One common drawback among the factorization-based approaches is that all points must appear in all views. Otherwise, the matrix to be factorized contains "holes" (the missing data problem). While addressing this problem, projective bundle adjustment would be an ill-posed problem if it were formulated as in (5.10). Our experiments showed that the source of the degeneracy lies in the attempt to minimize over the projections and the points at the same time. In fact, one can treat (5.10) as a separable problem [25] by alternately holding the P_i's and the X_j's constant.
Since the terms involving the points are independent of each other (the corresponding Jacobian matrix is block-diagonal), each term (thus each point) can be minimized separately. This is true for the projections as well. Hence, we have a novel way of minimizing (5.10): starting from the initial guess, we repeatedly perform intersection for each point followed by resection for each camera. Furthermore, both sub-problems have least-squares solutions, which makes the implementation easy.

Eliminating the scale factors in (5.4), we have the following basic equations:

u_ij − P_i(1)X_j / P_i(3)X_j = 0  and  v_ij − P_i(2)X_j / P_i(3)X_j = 0

for i = 1, ..., m and j = 1, ..., n, which can also be written in linear form. Stacking the equations of all m views for a point X_j gives

$$\begin{bmatrix} P_{1(1)}-u_{1j}P_{1(3)}\\ P_{1(2)}-v_{1j}P_{1(3)}\\ \vdots\\ P_{m(1)}-u_{mj}P_{m(3)}\\ P_{m(2)}-v_{mj}P_{m(3)} \end{bmatrix}X_j = A\,X_j = 0 \tag{5.11a}$$

for all points X_1, ..., X_n, and stacking the equations of all n points for a projection P_i gives

$$\begin{bmatrix} X_j^T & 0 & -u_{ij}X_j^T\\ 0 & X_j^T & -v_{ij}X_j^T \end{bmatrix}_{j=1,\dots,n} \begin{bmatrix} P_{i(1)}^T\\ P_{i(2)}^T\\ P_{i(3)}^T \end{bmatrix} = B\,p_i = 0 \tag{5.11b}$$

for projections P_1, ..., P_m. The solution is the null vector of the corresponding coefficient matrix, which in turn can be thought of as minimizing

$$\sum_{i=1}^{m}\Big[(P_{i(1)}X_j-u_{ij}P_{i(3)}X_j)^2+(P_{i(2)}X_j-v_{ij}P_{i(3)}X_j)^2\Big] \tag{5.12a}$$

for all X_j's, and the analogous sum over j (5.12b) for all P_i's. The difference between (5.10) and (5.12) is that algebraic errors are minimized in the latter. Therefore the solution from (5.11) may be biased. Hartley [42] suggested a correction for this type of problem: repeatedly reweight each equation in (5.11) by 1/P_i(3)X_j until the change of weights becomes negligible (we observed that 5 loops were sufficient in practice). This scheme is implemented in the Weighted Iterative Eigen (WIE) algorithm (Figure 5.5). It is so named because of the way (5.11) is solved: the solution is the eigenvector of AᵀA (or BᵀB) corresponding to the smallest eigenvalue.

Figure 5.5 Illustration of the WIE algorithm: intersection (for each point: initialize weights, solve (5.11a), update weights) alternated with resection (for each camera: initialize weights, solve (5.11b), update weights).

Algorithm 3. Weighted Iterative Eigen (WIE)
  Initialization;
  Repeat {
    For each point X_j do {
      Assign all weights = 1;
      Repeat {
        Reweight each equation in A;
        Solve (5.11a);
        Update weights;
      } Until weights are stable;
    }
    For each projection P_i do {
      Assign all weights = 1;
      Repeat {
        Reweight each equation in B;
        Solve (5.11b);
        Update weights;
      } Until weights are stable;
    }
  } Until the solution converges.

Figure 5.6 Pseudo code of WIE for projective reconstruction

Besides addressing the rank-deficiency problem mentioned earlier, this algorithm has another advantage: the sizes of the involved matrices are drastically reduced. For example, in (5.4), the size of W_s is 3m×n. In (5.10), the size of the Jacobian matrix is 2mn×(12m+4n). But in WIE, as shown in (5.11), A is of size 2m×4 and B is of size 2n×12.
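The two linear sub-steps translate into a few lines of numpy. The sketch below is illustrative only (it is not the dissertation's implementation, and it solves (5.11) through the SVD rather than through an explicit eigen-decomposition of AᵀA or BᵀB, which is equivalent):

```python
import numpy as np

def smallest_right_singular_vector(M):
    """Null-space estimate: the right singular vector for the smallest singular value
    (equivalently, the eigenvector of M^T M with the smallest eigenvalue)."""
    return np.linalg.svd(M)[2][-1]

def intersect_point(Ps, u, v, n_loops=5):
    """Weighted DLT intersection of one point (5.11a). Ps: m projection matrices (3x4);
    u, v: the point's image coordinates in each of the m views."""
    w = np.ones(len(Ps))                                   # one weight per view
    for _ in range(n_loops):
        rows = []
        for i, P in enumerate(Ps):
            rows.append((P[0] - u[i] * P[2]) / w[i])
            rows.append((P[1] - v[i] * P[2]) / w[i])
        X = smallest_right_singular_vector(np.array(rows))  # 2m x 4 system
        w = np.array([P[2] @ X for P in Ps])                 # reweight by P_i(3) X
    return X

def resect_camera(Xs, u, v, n_loops=5):
    """Weighted DLT resection of one camera (5.11b). Xs: (n, 4) homogeneous points;
    u, v: their image coordinates in this view."""
    w = np.ones(len(Xs))                                   # one weight per point
    zero = np.zeros(4)
    for _ in range(n_loops):
        rows = []
        for j, Xj in enumerate(Xs):
            rows.append(np.hstack([Xj, zero, -u[j] * Xj]) / w[j])
            rows.append(np.hstack([zero, Xj, -v[j] * Xj]) / w[j])
        P = smallest_right_singular_vector(np.array(rows)).reshape(3, 4)  # 2n x 12 system
        w = Xs @ P[2]                                                     # reweight by P(3) X_j
    return P
```

The outer WIE loop is then a straight alternation: run intersect_point for every point, resect_camera for every camera, and repeat until the change in the solution becomes negligible.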
5.4.3 Simulation results

Several simulations have been carried out to show that WIE actually converges. The synthetic dataset involves 9 cameras and 100 points. The points are randomly distributed over a sphere of radius R. The cameras, whose internal parameters are set to simulate real cameras with an image size of 400×400, are arranged in two kinds of configurations. In the planar configuration (Figure 5.7), they are perturbed from an initial coplanar-and-parallel configuration (defined in section 5.2) in the direction of the common principal axis. The distance between the plane and the sphere center is Z. The perturbation level is a random variable uniformly distributed between −δZ and δZ, and δ is an input.

Figure 5.7 The planar configuration (R = 20 cm, Z = 200 cm)

In the spherical configuration (Figure 5.8), the orientations of the cameras are specified in Euler angles, randomly chosen between −θ and θ (also a user input). Each camera is translated from the centroid of the points in the direction of its principal axis by the distance Z (Z > R). The effect is to place the cameras on a spherical patch of solid angle θ. Afterwards, the cameras are perturbed along their principal axes by some amount between −δZ and δZ, similar to the planar case.

Figure 5.8 The spherical configuration (R = 20 cm)

The convergence of WIE is illustrated in Figure 5.9, where the residual of (5.11) is plotted against the number of iterations. For the spherical configuration, we set δ = 10%. To demonstrate that W can be made arbitrarily close to rank 4, the changes of the five largest singular values of W over 100 iterations are plotted in Figure 5.10. Notice that the first four of them were almost unchanged while the fifth one decreased by several orders of magnitude, confirming the rank-4-ness of the updated W.

Figure 5.9 Convergence of WIE

Figure 5.10 The five largest singular values in 100 iterations

Figure 5.11 Comparison of convergence rate

Although in our work IFA is used only to generate initial values, nothing prevents it from being used independently for projective reconstruction (in the full measurement matrix case). Therefore it is interesting to compare the convergence rate between WIE and IFA. Figure 5.11 is a logarithmic plot of the 2-D reprojection error against the number of iterations. In this experiment, the initial values for WIE are generated from a single step of SVD, so that the starting point is the same for both algorithms.
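For concreteness, such a "single step of SVD" amounts to one rank-4 truncation of the measurement matrix (all scale factors set to 1); a minimal sketch, not the dissertation's code:

```python
import numpy as np

def rank4_factorization(Ws):
    """One SVD step: project the (scaled) measurement matrix Ws (3m x n) onto the closest
    rank-4 subspace and split it into stacked projections (3m x 4) and points (4 x n)."""
    U, s, Vt = np.linalg.svd(Ws, full_matrices=False)
    return U[:, :4] * s[:4], Vt[:4]

# P0, X0 = rank4_factorization(W)   # initial cameras and points handed to WIE (or to IFA)
```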
The following observations are made: 1) Both IFA and WIE converge linearly, except during the first couple of iterations, where the rate is slightly super-linear; 2) WIE converges faster than IFA, roughly by a factor of 10 in the planar case and by a factor of 4 in the spherical case; 3) IFA converges more slowly in the planar configuration than in the spherical configuration; 4) The general trends of both algorithms are quite consistent under different parameter settings.

5.5 Experimental Results on Noisy Data

This section compares the performance of different projective reconstruction methods on noisy data. In addition to IFA and WIE, we have also implemented a non-linear minimization method (MIN) using the Levenberg-Marquardt algorithm. The comparison is made in terms of the 2-D reprojection error (in pixels). In the experiments described below, except when specifically stated, the initial values result from 10 IFA steps.

5.5.1 Experiments on synthetic data

Gaussian noise of zero mean and standard deviation σ was injected into the generated image data. The 2-D error vs. the noise level is shown in Figure 5.12. It is observed that all three algorithms behave similarly in both configurations. They all exhibit linear degradation against noise. WIE and MIN are almost indistinguishable, while IFA is slightly worse (the difference is about one-tenth of a pixel when σ = 3). This can be explained by the fact that WIE (5.12) emulates the objective function of MIN (5.5), which is the aggregated reprojection error, while IFA minimizes the rank-4 measure (5.7), which is a purely algebraic quantity.

We have also performed tests with initial values generated from a single step of SVD and found that, in some cases, MIN fails to converge. Therefore, good initial values are necessary for MIN. On the other hand, WIE converges in all cases, but sometimes the error is greater than when the initial values come from IFA. This indicates that rough initial values might lead to local minima. Thus, as a precaution, IFA (about a dozen SVD steps) should always be used. Finally, the average computation time (recorded on an SGI O2 with an R5000 processor) of IFA, WIE and MIN is respectively 1.39, 5.42 and 2655.39 seconds. This supports our earlier claim about the computational efficiency of the proposed algorithm.

Figure 5.12 Reprojection error vs. noise level

5.5.2 Experiments on real data

The first experiment is conducted on an eight-frame image sequence of a computer terminal (Figure 5.13). Corners are detected using the Harris corner detector [37][64], from which 20 are interactively selected and the correspondences established manually. Among the 20 selected points, 17 appear in all views. Table 3 (4 views) and Table 4 (all 8 views) compare the numerical results of the different projective reconstruction methods.
The results are consistent with those obtained from synthetic data: both WIE and MIN make small but necessary improvements over IFA, but the running time of WIE is significantly lower than that of MIN. In the second experiment, 73 points are tracked over 9 images of a wooden house (Figure 5.14, courtesy of Dr. Boubakeur Boufama). Almost half of the points are present in any two successive views. As a result, only WIE and MIN have been tested.

5.6 Final Remarks

From the experimental results on both synthetic and real data, we make the following remarks: 1) The accuracy of WIE is only marginally better than that of MIN, but in some cases MIN fails to converge; 2) Generally, WIE outperforms IFA, and, when it converges, MIN also outperforms IFA; 3) Increasing the variation in camera locations (δZ and θ) enhances the reconstruction quality; 4) Among the three algorithms tested, WIE provides the best trade-off between accuracy and efficiency.

Figure 5.13 The terminal sequence

Table 3. Comparison over 4 views, the terminal sequence

  Items               SVD      IFA     WIE     MIN
  2D Error (pixel)    2.6      0.773   0.728   0.728
  # iterations        1        80      26      6
  Run time (second)   0.014s   0.34s   2.45s   17.16s

Table 4. Comparison over 8 views, the terminal sequence

  Items               SVD      IFA     WIE     MIN
  2D Error (pixel)    2.662    0.923   0.875   0.873
  # iterations        1        45      30      8
  Run time (second)   0.036s   0.42s   5.21s   76.69s

Figure 5.14 The wooden house sequence

Table 5. Results on the house sequence

  Items               WIE      MIN
  2D Error (pixel)    0.762    0.764
  # iterations        33       10
  Run time (second)   23.58s   343.83s

Chapter 6 Application of Projective Reconstruction: View Synthesis

This chapter presents an image-based rendering system for view synthesis. We define the problem as follows: given images of a static scene taken from different angles, reproject and integrate them into a new image as if it were obtained from a viewpoint different from any of the source views. Any image-based rendering system must address the following two fundamental issues:
• how to reproject (or warp) existing frames to novel viewpoints;
• how to resolve occlusion in the synthesized frame as a result of the changing viewpoint.
We formulate the warping problem as projective-reconstruction-based image transfer, which is made easy by the algorithms developed in Chapter 5. To deal with the second issue, we divide the images into triangles and reproject them in what we call the visibility compatible order, such that occluded triangles are traversed first. The proposed view synthesis system comprises three steps: triangulate the input images, sort the triangles in the visibility compatible order, and reproject them one by one. Compared to the approaches surveyed in Chapter 3, ours has the following characteristics:
• It imposes no restrictions on the virtual camera placement.
• It accepts multiple source views at sparse angles.
• It uses triangles rather than pixels as rendering primitives.
• It handles occlusion by rendering the triangles in a visibility compatible order.
• It is a geometrical method, and therefore uses a relatively small number of input images.

6.1 Review of Related Work

6.1.1 View morphing

The view morphing approach proposed by Seitz and Dyer [80] draws its origin from image morphing, first introduced in the special effects industry. In image morphing, the goal is to produce continuous transformations from a source image to a destination image. For example, a gentleman is transformed into a lady, or a human face into an animal face. What is critical here is the continuity of the transformation, to generate the desired special effect. Whether or not it makes "physical" sense is not the issue. As a matter of fact, in their paper, Seitz and Dyer show that linearly interpolating two perspective views causes a non-linear bending effect in the in-between images. Their observation, then, is that pre-warping the input views to positions parallel to the line connecting the projection centers (frequently called the baseline) makes linear interpolation shape-preserving. They presented a lengthy pre-warping algorithm which achieves the same goal as what we call image rectification in section 4.3.

There are two main limitations in Seitz's method. First, the virtual camera is only allowed to move on the baseline, a direct result of using linear interpolation. Second, the virtual image plane has to be parallel to the baseline. To overcome the latter limitation, they proposed a post-warping step. However, there is no guarantee that the final image is physically valid, that is, reproducible by a conventional camera.

Figure 6.1 Illustration of the view synthesis approach

6.1.2 View synthesis

Arbitrary virtual view placement is allowed in the view synthesis approach (Havaldar et al. [43]). Recall from section 2.1.6 that four collinear points determine a unique cross-ratio, invariant to projective transformation. Conversely, given three points and the cross-ratio, the location of the fourth point can be predicted. The essence of the view synthesis approach is to compute, from the correspondence information between views, a cross-ratio for each scene point. Then, given the images of four base points in a third view, all points can be reprojected, and thus the new image can be synthesized.

In Figure 6.1, A, B, C, D are four non-coplanar points, or base points. O_1 and O_2 are the projection centers. P is an unknown 3-D point whose image is to be reprojected to a novel view. Here, it is assumed that the epipolar geometry has been recovered already, and that o_1 and o_2 are the epipoles. The cross-ratio of P, denoted cross(O_1, P_1, P_2, P), can be computed from the images of O_1, P_1, P_2, P in the right image plane, that is, cross(o_1, p_1, p_2, p''). The magic here is that one does not need to know the 3-D coordinates of A, B, C, D, or P_1, P_2 in order to compute the cross-ratio. In fact, one can compute p_1 and p_2 based on the epipolar geometry only. Then, all that is required for point transfer are the images of the base points.

A shortcoming of Havaldar's approach is that it relies heavily on the accurate detection of the four base points; otherwise, there is noticeable distortion in the final synthesis. The selection of these points is empirical in their system.
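To make the cross-ratio machinery concrete, the following small sketch (not from the dissertation; the coordinates are made-up numbers) computes the cross-ratio of four collinear points and then predicts the position of the fourth point from the first three and a known cross-ratio, which is the elementary transfer step the method is built on. The signed-distance parameterisation used here is one common convention.

```python
import numpy as np

def line_params(points):
    """Signed positions of collinear 2-D points along their common line."""
    p0, direction = points[0], points[1] - points[0]
    direction = direction / np.linalg.norm(direction)
    return np.array([(p - p0) @ direction for p in points])

def cross_ratio(a, b, c, d):
    return ((c - a) * (d - b)) / ((c - b) * (d - a))

def fourth_point_param(a, b, c, k):
    """Solve cross_ratio(a, b, c, d) == k for d."""
    return (k * (c - b) * a - (c - a) * b) / (k * (c - b) - (c - a))

pts = np.array([[0.0, 1.0], [1.0, 3.0], [2.0, 5.0], [4.0, 9.0]])   # collinear points on y = 2x + 1
a, b, c, d = line_params(pts)
k = cross_ratio(a, b, c, d)
print(k, fourth_point_param(a, b, c, k), d)   # the recovered parameter equals the true one
```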
6.1.3 View synthesis in tensor space

The approach given by Avidan and Shashua [4] is based on the trifocal tensor concept (the tensor approach in the following), and works for three views. The trifocal tensor was first introduced by Shashua in [84]. It describes an algebraic relationship between a point p in the first view, some line s passing through the matching point p' in the second view, and some line r passing through the matching point p'' in the third view. In space, this constraint is a meeting between a ray and two planes, as shown in Figure 6.2.

Figure 6.2 Illustration of the trifocal tensor

The theory behind the tensor approach is that, given two views in full correspondence and the tensor, the entire third view can be synthesized: from every matching pair p, p' we can obtain p'' and copy the appropriate color value (for example, the average from the two model images). One nice property of the trifocal tensor is that a rigid transformation can be directly applied to it to generate a new tensor, with which the view represented by the transformation can be synthesized. Physical validity and view placement flexibility are achieved at the same time, without recovering 3-D models. Compared to the view synthesis approach of Havaldar et al., the tensor approach is more stable because it uses three views and does not rely on any particular set of points.

When there are only two views, a special tensor can be constructed from the elements of the fundamental matrix. In other words, the synthesis process can start with two model views and their fundamental matrix, as in Havaldar's method, but the later steps follow the tensor machinery. Something that was not explicitly mentioned in the original paper is that rigid transformation only applies to tensors computed from normalized views, those whose internal matrices are the identity matrix. This implies that the cameras must be pre-calibrated.

6.2 System Overview

Figure 6.3 shows the high-level flow-chart of the system, which consists of the following stages.

Figure 6.3 System flow-chart (corner extraction and matching, edge detection and labeling, constrained Delaunay triangulation, triangle sorting, triangle warping)

Stage 1 - Corner extraction and matching. Corner points are extracted and matched semi-automatically. Projective reconstruction is then performed, from which the reprojection error of each point is calculated and displayed. This helps the operator refine those with large errors.

Stage 2 - Edge linking and labeling. Again, in the current implementation, this step is done interactively by simply connecting points and marking the T-junctions.

Stage 3 - Constrained Delaunay Triangulation. A Constrained Delaunay Triangulation (CDT, de Floriani [32]) is then conducted in each source image, which divides the image into triangles. Later, each triangle will be warped to the novel view. A CDT is a triangulation that preserves the existing edges. There are two reasons to perform the triangulation.
Firstly, this avoids identifying and matching faces, another tedious and not-so-easy task. Secondly, this establishes the potential to implement warping as a texture mapping process, making use of graphics hardware capabilities.

Stage 4 - Triangle sorting. As will be detailed in section 6.4, the triangles have to be filled in a certain order so that occlusion is handled correctly in the synthesized view.

Stage 5 - Triangle warping. Warping a triangular facet of a source image includes two steps: the mapping of the triangle's vertices and the filling of the pixels within each triangle. In our system, mapping is based on projective reconstruction. Filling is established upon the concept of homography (refer to 2.1.5) and is implemented as projective texture mapping, which is supported in OpenGL, the de facto standard for graphics programming.

In the following sections, we present our solutions to the two fundamental issues in view synthesis, namely, warping formulation and visibility resolution.

6.3 Warping Formulation

6.3.1 Vertex mapping

In Havaldar et al. [43], vertex mapping is performed through the use of projective depth, which appeared originally as a tool for object recognition (Shashua [83]). One shortcoming of this formulation is that it requires five points to serve as a projective basis. The accuracy of these five basis points has a great impact on the quality of the synthesis result. Another drawback is that, in choosing the basis, it is necessary to make sure that none of the corner points are close to the implicitly defined plane-at-infinity, which, in practice, is rather difficult to maintain.

Mapping using projective reconstruction, on the other hand, avoids both problems. Given n correspondences across m views, a projective reconstruction can be obtained using one of the algorithms from Chapter 5. If the images of at least 6 points in the novel view are given (each point provides 2 equations and the projection matrix has 11 independent elements), that view's projection matrix can be estimated by resection, which is then used to reproject all remaining points. An additional advantage of this formulation is that it uses all image data simultaneously, and is therefore less biased than the projective depth formulation, which uses two views at a time as a consequence of being based on the epipolar geometry.

6.3.2 Triangle filling

Once the vertices are mapped, all pixels inside the triangles need to be transferred. This is illustrated in Figure 6.4, where the image of triangle ABC in the left view is transferred to the right. The transitive property of projection states that any spatial plane (e.g. the one determined by ABC) induces a homography between two image planes. To compute the homography, at least 4 pairs of corresponding points are required. So how can a triangle be transferred? The answer is that, in this case, the fourth pair of matching points is supplied by the epipoles (section 4.2).

Figure 6.4 Homography between two image planes. Images of a planar facet are related by a homography. D is the intersection of ABC and the segment O_1O_2. The images of D are the epipoles.

In projective space, the virtual plane spanned by ABC always has an intersection, denoted D, with the line connecting the two camera centers (O_1 and O_2). The epipoles are the images of D, which, together with the images of the three vertices, suffice to compute the homography.
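As a sketch of that computation (not the authors' code; the coordinates below, including the "epipole" rows, are made-up numbers for illustration), the homography can be estimated from the four correspondences with the standard Direct Linear Transform and then used to transfer any pixel inside the triangle:

```python
import numpy as np

def homography_from_points(src, dst):
    """Direct Linear Transform: 3x3 H with dst ~ H @ src, from k >= 4 matched 2-D points."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    return np.linalg.svd(np.array(rows))[2][-1].reshape(3, 3)

def transfer(H, pts):
    """Map (k, 2) pixel coordinates through H and dehomogenise."""
    q = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return q[:, :2] / q[:, 2:3]

# three matched triangle vertices plus the epipole pair (last row) as the 4th correspondence
src = np.array([[10.0, 20.0], [200.0, 40.0], [120.0, 180.0], [420.0, 240.0]])
dst = np.array([[14.0, 25.0], [190.0, 52.0], [115.0, 170.0], [400.0, 260.0]])
H = homography_from_points(src, dst)
print(np.abs(transfer(H, src) - dst).max())   # ~0: exact fit at the defining correspondences
```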
Knowing that the epipole is the null vector of the fundamental matrix, we give in the next section a method of computing the fundamental matrix from the projective reconstruction.

6.3.3 Fundamental matrix computation

In computing the projective reconstruction, we also obtain the projection matrices of the cameras, from which the fundamental matrix relating each pair of views can be recovered. For example, denote the two matrices by P_1 = [M_1 | t_1] and P_2 = [M_2 | t_2], where M_1, M_2 are 3×3 and t_1, t_2 are 3×1. Then

F = [t_2 − M_2 M_1^{-1} t_1]_× M_2 M_1^{-1},

where [t]_× is the skew-symmetric matrix such that [t]_× a = t × a for an arbitrary vector a. Obviously, computing F this way naturally takes the rank-2 constraint into consideration (section 3.3.1). Furthermore, since the projection matrices are computed from correspondences in all views, the solution is much less biased than in the traditional two-view method.
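The formula maps directly to code; a small sketch (not the authors' implementation) that also extracts the epipoles as the null vectors of F, assuming the epipoles are finite:

```python
import numpy as np

def skew(t):
    """Cross-product matrix: skew(t) @ a == np.cross(t, a)."""
    return np.array([[0, -t[2], t[1]],
                     [t[2], 0, -t[0]],
                     [-t[1], t[0], 0]])

def fundamental_from_projections(P1, P2):
    """F with x2^T F x1 = 0, from two 3x4 projection matrices P1 = [M1|t1], P2 = [M2|t2]
    of the same (projective) reconstruction; rank 2 by construction."""
    M1, t1 = P1[:, :3], P1[:, 3]
    M2, t2 = P2[:, :3], P2[:, 3]
    M = M2 @ np.linalg.inv(M1)
    return skew(t2 - M @ t1) @ M

def epipoles(F):
    """Right and left null vectors of F, i.e. the epipoles in image 1 and image 2
    (normalised assuming they are finite)."""
    e1 = np.linalg.svd(F)[2][-1]
    e2 = np.linalg.svd(F.T)[2][-1]
    return e1 / e1[2], e2 / e2[2]
```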
Figure 6.5 Mapping of T-junctions. T_1'' forms a gap with the top, while T_2'' goes beyond it.

6.4 Addressing Partial Occlusion

Partial occlusion is characterized by the appearance of T-junctions: an edge (the intersection of two facets) occluded by a third facet. The difficulty caused by a T-junction is that it is not the image of a real 3-D point. To overcome this problem, we propose a 2-step solution, referring to Figure 6.5.

Step 1 - Establishing correspondences for T-junctions. In Figure 6.5, T_1 and T_2 are the previously mentioned "matched" T-junctions. Let us name the two edges forming a T-junction the top and the support, as labeled in the figure. Denote the corresponding point of T_1 in the right image as T_1'. Since T_1' also lies on the epipolar line of T_1 in the right image plane, T_1' is the intersection of the right support and the said epipolar line. Once we have the pair (T_1, T_1'), we can perform bundle adjustment on it and subsequently reproject the resulting point onto the new view. Denote the reprojection by T_1''. If the new view is between the two existing views, there is a gap between the projected support and the top in the new view. The same thing can be done for T_2, but there the projected support intersects the top, forming an overlap. This suggests that warping be done in a certain order, consistent with the occlusion sequence in the new view.

Step 2 - Ordering triangular facets. If Euclidean information were available, the order would be easily determined by comparing depth values. How do we do it in projective space, where the concept of distance no longer applies? We know that occlusion is naturally resolved in all source images. So the question becomes whether the order information is somehow encoded in the images and, if yes, how to extract it. In Figure 6.6, the to-be-synthesized view is on the left, while the right one represents one of the source views. F_1 and F_2 are 3-D facets, f_1 and f_2 are their projections. The darkened line emitted from O_1 is an optical ray. The plain line starting from o_2 is its projection in the right image plane; it is also an epipolar line. If F_1 occludes F_2 in the left view, there exists at least one point on F_1 which occludes a point on F_2, but not vice versa. Otherwise they would intersect, resulting in an edge which separates each of them into two smaller facets. Consequently, if in the right image plane a point belonging to f_1 can be found closer to the epipole than a point belonging to f_2, it can be concluded that the occlusion sequence in the left image is f_1 before f_2; thus f_1 should be warped after f_2. All triangles are sorted based on this relationship and are warped accordingly. We say the order determined this way is visibility compatible, the same terminology used by McMillan et al. in [60]. Notice that it is only valid with respect to the given views. If any of them is changed, the epipole changes, and hence so does the order. However, a perturbation of an existing configuration is merely a perturbation of the existing order, which can be quickly rearranged. As a result, this sorting algorithm is very efficient for walk-through types of applications. In the implementation, we only compare triangles which share a vertex or an edge. If two triangles share an edge, we compare the opposite vertices. If they share only a vertex, the remaining vertices are checked.

Figure 6.6 Visibility compatible order

6.5 Addressing Total Occlusion

Total occlusion happens when a facet that is in front in a source view goes to the back in the destination view, completely occluded by other facets. Although the partial ordering algorithm is applicable here, we want to be able to mark those totally occluded facets so that they are ignored during warping, since they are not observable in the new view anyway. It turns out that these facets can be easily identified: when projected to the new view, the orientation of their boundaries is reversed. We can assign an arbitrary vertex sequence (clockwise or counter-clockwise) to each facet and check whether it is preserved after the projection. If it is not, then we know the facet faces away when looking from the new camera position. We would like to mention that the same observation has been made by Chung and Nevatia in [19], where total occlusion is further categorized into orientation-discontinuity occlusion and limb occlusion.
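The orientation test reduces to a sign check on the signed area of the projected triangle; a minimal sketch (not from the dissertation), assuming each facet's vertices are stored with a consistent winding in the source view:

```python
import numpy as np

def signed_area(tri):
    """Twice the signed area of a 2-D triangle given as a (3, 2) array of vertices."""
    a, b, c = tri
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def is_totally_occluded(tri_source, tri_novel):
    """A facet whose vertex winding flips between the source view and its reprojection
    into the novel view has turned its back to the new camera and can be skipped."""
    return signed_area(tri_source) * signed_area(tri_novel) < 0

front = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])   # counter-clockwise in the source view
back = np.array([[0.0, 0.0], [0.0, 10.0], [10.0, 0.0]])    # same vertices, reversed winding
print(is_totally_occluded(front, front), is_totally_occluded(front, back))   # False True
```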
6.6 Results

Figure 6.7 is an example involving two source views, (a) and (b), whose corners, edges and T-junctions are shown in (c) and (d) respectively (circled in solid gray); (e) is a synthesized image. If (a) or (b) were warped individually, there would be gaps (dashed) and overlaps (dotted), as shown in (f) and (g). Figure 6.8 shows four views generated by linearly interpolating three sources (top row of Figure 5.13). Although the synthesized views are not physically valid, they do not show any visual abnormality. Figure 6.9 shows a fuller synthesis using more source views. Figure 6.10 is an example of extrapolation: the left and the center images are the input, the right is the output. The projective reconstruction formulation works just as well in this case, and no special treatment is necessary.

6.7 Some Remarks

Many restrictions associated with current image synthesis approaches have been removed in ours. For instance, Chen and Williams only experimented on synthetic images where depth information is known. The systems by Seitz and Dyer, and by Havaldar et al., only work for two views; moreover, the latter is biased depending on which four points are chosen as the basis points. The tensor approach requires calibrated cameras. In addition, all existing systems are pixel-based, and are therefore unable to utilize hardware rendering acceleration, which only accepts polygons as primitives.

Still, our view synthesis system can be improved in at least two respects. First of all, it needs the capability to establish correspondences automatically. The manual process is only suitable for blocky objects, which are represented by a small number of polygonal shapes whose vertices are easily detectable. Secondly, it needs a conventional way to specify the viewing parameters, such as rotation and translation. Our system can generate fly-bys, but it cannot generate a view at a specific location and viewing direction. To be able to do this, there is no way around recovering some Euclidean information, which is the topic of the next chapter.

Figure 6.7 Handling occlusion
Figure 6.8 Synthesized views of a computer monitor
Figure 6.9 A fuller synthesis
Figure 6.10 Looking into a room - an extrapolation example

Chapter 7 Multi-view Euclidean Reconstruction

In many applications, projective structures are insufficient. For example, in urban planning, one wants to know the three-dimensional positions of major city buildings. In the special effects industry, computer graphics artists want seamless integration of virtual objects into real film footage. To generate correct visual effects such as lighting and shadows, Euclidean information is a must. This chapter discusses how to upgrade a projective structure into a Euclidean one.

This problem was previously tackled by several authors (e.g. Faugeras et al. [27], Hartley [41], Maybank et al. [58], and Pollefeys et al. [71]) using the self-calibration method, that is, calibrating the cameras from image data rather than from ground-truth data as in the traditional method; Euclidean reconstruction then follows. The idea of self-calibration was first proposed by Maybank et al. [58] based on the Kruppa equations, each of which describes a quadratic constraint on the dual image of the absolute conic (see section 2.3). A working algorithm was reported by Faugeras et al. in [27]. The major problem of their method is that it is non-linear and lacks a good way to generate the initial guesses. Hartley's approach [41] assumes a fixed rotating camera and computes the image of the absolute conic. Triggs [95] took a different approach by computing the image of the absolute quadric (recall from 2.3 that it is the dual image of the absolute conic) using quadratic programming (Gill et al. [34]). Neither algorithm allows varying focal length. Variable focal length is allowed in the method proposed by Pollefeys et al. [71], which is also based on estimating the image of the absolute quadric. More details about some of these methods are discussed in section 7.1.2.
The approach we take is to estimate the Projective Distortion Matrix (PDM, section 2.2.2) directly, by explicitly making use of the constraints embedded in the pinhole camera model. Camera calibration is achieved by performing QR decomposition (Appendix A.3) on the recovered Euclidean projection matrix. The approach is a generalization of what has been presented in Chapter 4 for two views. Additionally, a closed-form solution is possible if there are at least three views. By avoiding dealing with the absolute conic/quadric, our approach is conceptually simpler. On the other hand, it can be shown that the PDM in fact contains all the elements of the absolute quadric, which relates our method to the existing ones just mentioned.

7.1 PDM Estimation

7.1.1 Algorithm description

The first canonical form (section 2.2.3) of a projective reconstruction is defined such that the fourth column of the projection matrix of the first view vanishes. Assuming the world coordinate system is coincident with that of the first camera, Euclidean reconstruction is equivalent to finding the Projective Distortion Matrix H such that

P_1 H = [A_1  0]  (7.1a)

and

P_i H = μ_i A_i [R_i  T_i],  i = 2, ..., m,  (7.1b)

where μ_i compensates for the relative scaling between the views. Since H is defined up to a scale factor, we can set one of its elements to 1:

$$H=\begin{bmatrix} H_1 & h_1\\ h^T & 1\end{bmatrix}.$$

Then, writing P_1 = [P̄_1  0], (7.1a) becomes

$$[\,\bar P_1 \;\; 0\,]\begin{bmatrix} H_1 & h_1\\ h^T & 1\end{bmatrix}=[\,\bar P_1 H_1 \;\; \bar P_1 h_1\,]=[\,A_1 \;\; 0\,],$$

which implies h_1 = 0 and H_1 = P̄_1^{-1} A_1. Thus

$$H=\begin{bmatrix} \bar P_1^{-1}A_1 & 0\\ h^T & 1\end{bmatrix}. \tag{7.2}$$

Plugging (7.2) into (7.1b), with P_i = [P̄_i  p_i],

$$[\,\bar P_i \;\; p_i\,]\begin{bmatrix} \bar P_1^{-1}A_1 & 0\\ h^T & 1\end{bmatrix}=[\,\bar P_i \bar P_1^{-1}A_1+p_i h^T \;\;\; p_i\,],$$

which generates

P̄_i P̄_1^{-1} A_1 + p_i h^T = μ_i A_i R_i  (7.3a)

and

p_i = μ_i A_i T_i.  (7.3b)

Thus

(P̄_i P̄_1^{-1} A_1 + p_i h^T)(A_1^T P̄_1^{-T} P̄_i^T + h p_i^T) = μ_i² A_i A_i^T,  (7.4)

which implies 5 equations for each view other than the first one. The total number of equations is thus 5(m−1). Let us now do some counting based on the different forms of A_i.

i) Zero skew only. In this case, the camera model is

$$A_i=\begin{bmatrix}\alpha_i f_i & 0 & u_{0i}\\ 0 & f_i & v_{0i}\\ 0 & 0 & 1\end{bmatrix},\qquad A_iA_i^T=\begin{bmatrix}\alpha_i^2 f_i^2+u_{0i}^2 & u_{0i}v_{0i} & u_{0i}\\ u_{0i}v_{0i} & f_i^2+v_{0i}^2 & v_{0i}\\ u_{0i} & v_{0i} & 1\end{bmatrix}. \tag{7.5}$$

The number of unknowns is 4m+3: four for each camera, and 3 for h. Therefore 8 views suffice for a solution.

ii) Known aspect ratio and unknown but fixed principal point (in addition to zero skew). In this case,

$$A_i=\begin{bmatrix}\alpha_i f_i & 0 & u_0\\ 0 & f_i & v_0\\ 0 & 0 & 1\end{bmatrix},\qquad A_iA_i^T=\begin{bmatrix}\alpha_i^2 f_i^2+u_0^2 & u_0 v_0 & u_0\\ u_0 v_0 & f_i^2+v_0^2 & v_0\\ u_0 & v_0 & 1\end{bmatrix}. \tag{7.6}$$

There are m+5 unknowns now: one for each camera, plus u_0, v_0 and h. As a result, 3 views suffice for a solution.

iii) Known principal points (in addition to the previous two conditions). In this case, each image can be pre-translated so that the principal point becomes (0,0). A_i thus has a diagonal form:

$$A_i=\begin{bmatrix}\alpha_i f_i & 0 & 0\\ 0 & f_i & 0\\ 0 & 0 & 1\end{bmatrix},\qquad A_iA_i^T=\begin{bmatrix}\alpha_i^2 f_i^2 & 0 & 0\\ 0 & f_i^2 & 0\\ 0 & 0 & 1\end{bmatrix}. \tag{7.7}$$

Now there are only m+3 unknowns, which requires a minimum of 2 views for a solution. For binocular stereo, it is necessary to know the principal points. The condition of known aspect ratio can be replaced by equal focal length. If a priori knowledge says both conditions hold, then both should be used to help stabilize the computation.

iv) Single fixed camera. This is the case where a single camera is used. While the camera is moving, its intrinsic parameters are fixed:

$$A_1=\cdots=A_m=A=\begin{bmatrix}\alpha f & \beta & u_0\\ 0 & f & v_0\\ 0 & 0 & 1\end{bmatrix},\qquad AA^T=\begin{bmatrix}\alpha^2 f^2+\beta^2+u_0^2 & u_0 v_0+\beta f & u_0\\ u_0 v_0+\beta f & f^2+v_0^2 & v_0\\ u_0 & v_0 & 1\end{bmatrix}. \tag{7.8}$$

Three is the minimum number of views for a solution. If zero skew can be further assumed (i.e. β = 0), then two views are enough.
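A small synthetic sanity check of (7.4) (illustration only, not the authors' code; the intrinsics, motions and distortion below are made up): generate first-canonical-form projective cameras from known A_i, R_i, T_i and a distortion H of the derived form, then verify that the left-hand side of (7.4) is proportional to A_iA_iᵀ for every view.

```python
import numpy as np

rng = np.random.default_rng(3)

def random_rotation():
    Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    return Q * np.sign(np.linalg.det(Q))

m = 4
A = [np.array([[700.0 + 30 * i, 0, 320], [0, 700.0 + 30 * i, 240], [0, 0, 1]]) for i in range(m)]
R = [np.eye(3)] + [random_rotation() for _ in range(m - 1)]
T = [np.zeros(3)] + [rng.uniform(-1, 1, 3) for _ in range(m - 1)]
mu = np.r_[1.0, rng.uniform(0.5, 2.0, m - 1)]        # per-view scale factors

# Projective distortion of the derived form (7.2): h1 = 0, lower-right element 1.
h = rng.uniform(-0.2, 0.2, 3)
H = np.eye(4)
H[:3, :3] = rng.standard_normal((3, 3))
H[3, :3] = h

# First-canonical-form projective cameras, defined through P_i H = mu_i A_i [R_i | T_i].
P = [mu[i] * A[i] @ np.hstack([R[i], T[i][:, None]]) @ np.linalg.inv(H) for i in range(m)]

for i in range(m):
    Pb, p = P[i][:, :3], P[i][:, 3]                   # P_i = [Pbar_i | p_i]
    L = Pb @ np.linalg.inv(P[0][:, :3]) @ A[0] + np.outer(p, h)
    S = L @ L.T                                        # left-hand side of (7.4)
    print(np.allclose(S / (A[i] @ A[i].T), mu[i] ** 2))   # True: proportional, factor mu_i^2
```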
7.1.2 Relation to existing methods

In section 2.3, it was shown that ω = A Aᵀ is the dual image of the absolute conic Γ. As a result, the intrinsic camera parameters are related to, and can be computed from, this image using Cholesky decomposition (Appendix A.2). Following this line of thought, existing methods have concentrated on recovering ω one way or another (Hartley [41], Pollefeys et al. [72], Triggs [95]). We now relate our formulation of 7.1.1 to these existing methods. Notice that (7.3a) can be rewritten as

μ_i A_i R_i = P_i [P̄_1^{-1} A_1 ; h^T] = P_i H_c,

where H_c denotes the first three columns of H. Thus

μ_i² A_i A_iᵀ = P_i Ω' P_iᵀ,  with Ω' = H_c H_cᵀ.  (7.9)

Claim. In (7.9), Ω' is the absolute quadric.

Proof. Recall that under a Euclidean framework, the absolute quadric has the canonical form Q = diag(1, 1, 1, 0). With the projective distortion H, Q is transformed to

H Q Hᵀ = (H [I₃ ; 0]) (H [I₃ ; 0])ᵀ,  (7.10)

which can easily be verified to be equal to Ω', since the first three columns of H are exactly H_c. ♦

Equation (7.10) indicates that the rank-3 condition on Q has been naturally incorporated into our formulation, reducing the constrained minimization (of Pollefeys [72] and Triggs [95]) to a regular minimization problem. In case the camera is stationary, T_i = 0. It follows from (7.3b) that p_i = 0. Thus (7.4) becomes

(P̄_i P̄_1^{-1} A_1)(A_1ᵀ P̄_1^{-T} P̄_iᵀ) = μ_i² A_i A_iᵀ,  (7.11)

which is the basis of Hartley's method with a rotating camera [41]. In this case, however, since h cannot be estimated, Euclidean reconstruction is impossible. A more general form of (7.11) is obtained by replacing P̄_i P̄_1^{-1} with H_i(∞), the homography induced by Π_∞ (the plane-at-infinity, section 2.1.2) between the first and the i-th view. When the camera is stationary, as in Hartley's case, H_i(∞) can be computed from correspondences of affine points (in front of the camera). Otherwise, correspondences of vanishing points (at infinity) are required. This suggests a stratified approach: affine reconstruction (determining Π_∞) followed by Euclidean reconstruction (determining the absolute conic Γ).

7.1.3 Initialization

We present in this section a closed-form solution when there are at least three views, which supplies the initial guesses to the previous non-linear method. The known-principal-point condition is assumed. From (7.4) and (7.7), four equations per view,

β_11 = α_i² β_22,  β_12 = β_13 = β_23 = 0,

where β_kl is the (k,l)-th element of the left-hand side of (7.4), can be deduced in the 5 unknowns

a = f_1²,  b = f_1 h_1,  c = f_1 h_2,  d = h_3,  e = h_1² + h_2² + h_3².

With at least three views, the solution is the null vector of the coefficient matrix. Often, it is assumed that the null space is of rank 1, so that the solution is the eigenvector corresponding to the smallest eigenvalue. This, however, is not true in our case, due to the nonlinear relationship among the unknowns:

b²/a + c²/a + d² = e.  (7.12)

In fact, the dimension of the null space is 2, meaning the true solution vector is a linear combination of the two basis null vectors, e = e_1 + λ e_2. Plugging e into (7.12) results in a cubic polynomial in λ, for which a closed-form solution exists. It has been observed that the cubic equation always has the special solution structure λ_1, λ_2 = λ_3, all real, and the wanted one is λ_1, which is easy to identify.
When there are only two views, a closed-form solution does not exist. However, in the special case of binocular stereo, where it can be assumed that A_1 = A_2 = A, α = 1, and μ = 1, an interactive method such as the one mentioned in 4.5.2 is sufficient.

7.2 Self-calibration

Here we present our self-calibration method, which is aimed at recovering both the intrinsic and the extrinsic camera parameters given the Euclidean structure. Complete knowledge of the camera parameters forms the basis for object insertion, i.e., inserting computer graphics objects into real images, which we cover in section 7.5. Our formulation is based on (7.1b). Note that at this point H has already been computed. The task is to decompose the left-hand 3×4 matrix into the special form represented by the right-hand side. Let us rewrite (7.1b) as

P_i H = [P̃_i  p̃_i] = μ_i A_i [R_i  T_i],

which expands into

P̃_i = μ_i A_i R_i  (7.13a)

p̃_i = μ_i A_i T_i.  (7.13b)

From (7.13a), we obtain A_i and R_i by performing QR decomposition on P̃_i. Since A_i(3,3) = 1 while the decomposition returns Ã_i with Ã_i(3,3) = μ_i, we set A_i = Ã_i / Ã_i(3,3). T_i is then given by (μ_i A_i)^{-1} p̃_i.

7.3 Simulation

Experimental results on a synthetic dataset are given in this section. The spherical configuration of Figure 5.8 is used again. The reconstruction quality is measured by the average 3-D alignment error, obtained by aligning the reconstructed points to the ground truth. The calibration error is measured by the average error in focal length estimation. Only relative errors are recorded: for reconstruction, the absolute error is divided by the radius of the sphere on which the points are located; for focal length, the absolute error is divided by the actual focal length of each camera. Each experiment is run 100 times and the average result is recorded. The parameters to be considered are the principal point (PP), the noise level (σ), the view separation (θ), and the number of views (N).

First, let us take a look at how the principal point varies given random point locations. In Figure 7.1, the image size is 400×400. Ideally, the principal point should be at (200, 200). But in fact, it varies between 170 and 230 in both directions. The inaccuracy of principal point estimation is well known [2][10], but has little effect on the reconstruction/self-calibration quality, as will be demonstrated later. In the next several figures, the relative reconstruction error and the relative focal length error are plotted against σ, θ, and N respectively. In each case, PP is either unknown but fixed (case ii of 7.1), or known (case iii).

Figure 7.1 Locus of the principal point in 20 experiments

Figure 7.2 Relative errors vs. noise level (σ)
Figure 7.3 Relative errors vs. view separation (θ)

Figure 7.4 Relative errors vs. number of views (N)

The following conclusions can be drawn. 1) The estimation of PP is almost meaningless, as it does not affect the 3-D reconstruction errors. 2) The estimation of the focal length is also quite inaccurate. 3) When the noise level is low (< 2 pixels), it is better to assume a known PP (e.g. simply let PP be the image center), since fewer parameters are estimated and the quality difference is negligible. This is especially true when one camera is used and sub-pixel feature detection can be achieved. 4) The accuracy of the focal length estimation is worse than that of the reconstruction. This indicates that the proposed method is only applicable in areas where an accurate focal length is not required, such as the one presented next.

7.4 Results on Real Images

The two sequences previously used for projective reconstruction, the terminal (Figure 5.13) and the house (Figure 5.14), are used here. The average 3-D alignment errors are listed in Table 6. For the terminal sequence, 8 ground-truth points were measured with a Microscribe 3-D digitizer. The house sequence did not include any ground-truth points except the 3-D coordinates recovered by the method of Mohr et al. [62]. Consequently, it is expected that the alignment error computed using these coordinates (Table 6) is larger than that reported in [62] (which was 0.15 cm).

Figure 7.5 shows three synthetic views of the reconstructed terminal. The boxes indicate the locations of the camera projection centers and the lines the principal axes. Notice that, at this resolution, Euclidean characteristics such as orthogonality and parallelism are clearly observable. Figure 7.6 shows the reconstruction results for the house sequence.

Table 6. Average Euclidean reconstruction error on real data

  Items        SVD     IFA     WIE      MIN
  Terminal 4   0.3     0.119   0.118    0.121
  Terminal 8   0.288   0.116   0.0976   0.0979
  House        -       -       0.176    0.184

Figure 7.5 Reconstruction results of the terminal sequence

Figure 7.6 Two views of the reconstructed house

7.5 Application to Object Insertion

One important virtual reality application is the insertion of synthetic objects into real images. This is the technique used to generate stunning visual effects in blockbuster movies such as Titanic and Star Wars: Episode I The Phantom Menace. For this purpose, the camera pose must first be estimated. Typical approaches rely on 3-D position tracking devices, precise calibration, fiducials, or painstaking manual work. Recently, affine invariants have been used by Kutulakos et al. [50] to provide calibration-free augmented reality. However, the affine representation makes it difficult to describe the synthetic objects in a traditional way (i.e. using positions, distances, angles, etc.). Another aspect of the same problem, which has been largely ignored, is the mutual occlusion (or other geometric interaction) between the virtual (inserted) and the real (existing) objects. To make this happen, accurate Euclidean models of the real objects must also be recovered. Using image features such as corners of objects, rather than fiducials, as the matching tokens, the proposed approach addresses camera estimation and model reconstruction at the same time.
Implementation-wise, several advanced computer graphics techniques are also used. Additionally, algorithms have been developed to compensate for the misalignment caused by errors in principal point estimation, which was demonstrated in Figure 7.1.

The overall workflow of our object insertion system runs in four steps. First, the intrinsic camera parameters, the camera pose and the scene (real) object models are recovered using the previous Euclidean reconstruction and self-calibration methods. Second, computer graphics (synthetic) objects are placed into the world coordinate system, and virtual cameras are constructed using the recovered intrinsic camera parameters. Third, with the stencil buffer turned on, the real objects are rendered in black and the synthetic ones in some chosen color other than black. This draws the synthetic objects onto the screen with correct occlusion and, at the same time, prepares the stencil buffer. Fourth, the depth buffer is turned off and the original image is redrawn. By choosing appropriate logical operations on the stencil buffer, only pixels that have not been drawn so far receive values from the corresponding image. The result shows the synthetic objects occluded by the real objects just as if the former were in the original scene (in the sense that the correct geometric relationship is maintained).

7.5.1 Implementation issues

There are two implementation methods. The straightforward one is to use ray tracing. From each pixel in the image plane of an input view, a ray is emitted. If it hits the surface of a synthetic object, the pixel color is replaced by that of the object surface (after the lighting calculation). Otherwise, the image pixel is unaltered. This method, however, runs slowly due to the ray tracing process; real-time animation is impossible. The method that we have chosen makes use of the rendering pipeline provided by OpenGL, so that hardware acceleration is possible if the machine allows (which is in fact true for most graphics workstations, Windows NT or Unix based). However, the camera parameters obtained from our self-calibration method create severe misalignments between the virtual and real objects, as shown in Figure 7.7.

Figure 7.7 Misalignment of real and virtual objects

The reason is that, in OpenGL, the image center and the principal point must be the same, which unfortunately is not the case when image noise exists. In Figure 7.1, the estimated principal points appear not only different from the image center, but also randomly distributed. To solve the problem, a 3-D affine distortion matrix is introduced just before the camera-to-image projection, as follows. Assume A, R, T are the estimated camera internal and external parameters. Let A_I be the ideal camera that has the same focal length as A but has the image center as its principal point. Then

$$[\,A \;\; 0\,]\begin{bmatrix} R & T\\ 0 & 1\end{bmatrix}=[\,A_I \;\; 0\,]\left(\begin{bmatrix} A_I^{-1}A & 0\\ 0 & 1\end{bmatrix}\begin{bmatrix} R & T\\ 0 & 1\end{bmatrix}\right)=[\,A_I \;\; 0\,]\,M.$$

We thus construct the virtual camera using A_I, which guarantees correct alignment, and set the world-to-camera projection matrix to M, which keeps the overall projection equal to the original. The final result is shown in Figure 7.8: the virtual and real objects are now perfectly registered. Essentially, a 3-D affine distortion is intentionally introduced to compensate for the 2-D error.
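A sketch of the compensation just described (synthetic placeholder numbers, not the production code): build the ideal camera A_I and the distorted model-view matrix M, and check that the composite projection is unchanged.

```python
import numpy as np

A = np.array([[812.0, 0.0, 187.0],       # estimated intrinsics: principal point off the image center
              [0.0, 806.0, 214.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                             # estimated pose (placeholder values)
T = np.array([0.1, -0.2, 3.0])

A_I = A.copy()
A_I[0, 2] = A_I[1, 2] = 200.0             # ideal camera: same focal lengths, principal point at the center

Rt = np.eye(4)
Rt[:3, :3], Rt[:3, 3] = R, T

D = np.eye(4)
D[:3, :3] = np.linalg.inv(A_I) @ A        # 3-D affine distortion absorbing the 2-D difference
M = D @ Rt                                # distorted world-to-camera matrix handed to the renderer

P_original = np.hstack([A, np.zeros((3, 1))]) @ Rt
P_composite = np.hstack([A_I, np.zeros((3, 1))]) @ M
print(np.allclose(P_original, P_composite))   # True: the overall projection is unchanged
```

In an OpenGL setting, A_I plays the role of the projection parameters while M is loaded as the model-view matrix, so the rendered synthetic objects line up with the image despite the off-center principal point estimate.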
Figure 7.8 Correct registration after error compensation

Chapter 8 Conclusion and Future Work

We have addressed a new research problem called Image-Based Rendering and Modeling (IBRM) that has recently attracted attention from both the Computer Vision and Computer Graphics communities. Our goal is to show that useful graphics applications can be built on top of mathematically sound results from Computer Vision, in particular from structure-from-motion. We branch off from the traditional computer vision research that aims at producing high-level interpretations from images. On the other hand, we focus on studying the geometric aspects of IBRM, leaving the photometric part, which is equally important and challenging, largely untackled. To synthesize photorealistic images, we rely on a graphics technique called texture mapping. Experimental results have confirmed that, for the purpose of photorealistic rendering, texture mapping suffices in most cases, although the synthesized images are only approximations of the true imaging physics underneath.

We set out to study the two-view case first, and propose a complete computational framework to reconstruct 3-D structure from two images. The framework is stratified in the sense that it proceeds in two steps: projective reconstruction, followed by Euclidean reconstruction. Based on it, we build a prototype face system and use it to create high-fidelity human face models from pairs of stereo pictures. Compared to existing authoring methods in computer graphics, ours requires less time and labor, while producing models of higher quality.

This framework is then generalized to the multi-view case. The multi-view projective reconstruction algorithm, although iterative in nature, runs much faster than the existing non-linear minimization formulations. Additionally, an initialization algorithm is provided that always outputs good initial guesses. The Euclidean reconstruction algorithm is based on directly estimating the Projective Distortion Matrix. One advantage of it, among others, is that a closed-form solution exists. To our knowledge, this is the first time a closed-form solution has been reported for the general Euclidean upgrade problem. To showcase the above algorithms, two applications have been developed: one for photorealistic image synthesis, the other for blending virtual objects into real image sequences. These applications demonstrate the validity of the proposed approach.

One major issue that needs to be explored further is automatic matching across multiple views. Currently, we rely on manual matching, which certainly is not the best choice in many applications. There are two kinds of input in the multi-view case. For the first kind, imagine someone taking pictures with a still camera, walking a couple of steps between shots. For the second, a video recorder is used, so continuity between successive frames can be assumed. Developing an automatic matching method in the first case is challenging, for two reasons. One of them is that images of the same scene may appear quite different as the person moves.
A significant portion of the scene that is observable in one view may not be seen in other views. A change of viewpoint may also cause the colors of corresponding pixels to change. Secondly, there is no simple multi-view matching constraint to help select the right match from multiple possible choices. The epipolar constraint for two views is inherently ambiguous, as discussed in Section 3.3.1. Whether or not our method given in Section 4.4 is applicable beyond the binocular stereo case is not yet known. Recently, Zisserman et al. [99] have experimented with the trifocal tensor for extended sequences and obtained some nice results. However, we view this as a transitional technique. What is really needed is a method that analyzes many views concurrently in some kind of spatio-temporal space.

Matching in a continuous image sequence is relatively easy because of the small movement between successive frames. This problem is frequently termed tracking, as feature points identified in the first frame are tracked across the whole sequence. Since the purpose of tracking is to estimate the camera parameters (both internal and external; once they become known, the positions of the tracked points follow), Azarbayejani et al. [5] formulated this problem as parameter estimation using the Extended Kalman Filter (EKF). Under this formulation, the temporal information is absorbed into the statistical descriptions of the parameters. In my opinion, the optical flow method (see Barron et al. [7] for a complete survey) provides a better way of utilizing the temporal information. Potentially, the spatio-temporal space is a place where global constraints can be enforced. Unfortunately, current research only makes use of two views at a time. The central issue is, just as in the simple two-view case, how to devise a solution that is both capable and efficient, robust but not at the cost of losing local shape features. A method that satisfies these criteria will undoubtedly bring a lot of new development to the field of IBRM.

References

[1] E. H. Adelson and J. R. Bergen, "The plenoptic function and the elements of early vision," Computational Models of Visual Processing, M. Landy and J. A. Movshon ed., Cambridge, MA: MIT Press, 1991.
[2] L. Agapito, R. I. Hartley, E. Hayman, "Linear Self-Calibration of a Rotating and Zooming Camera," in Proc. IEEE Computer Vision and Pattern Recognition, pp. 15-21, 1999.
[3] J. Aloimonos and C. M. Brown, "Direct processing of curvilinear sensor motion from a sequence of perspective images," in Proc. 2nd Workshop on Computer Vision, pp. 72-77, 1984.
[4] S. Avidan and A. Shashua, "Novel View Synthesis in Tensor Space," in Proc. IEEE Computer Vision and Pattern Recognition, pp. 1034-1040, 1997.
[5] A. Azarbayejani and A. P. Pentland, "Recursive Estimation of Motion, Structure, and Focal Length," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 17, no. 6, pp. 562-575, June 1995.
[6] T. J. Baker, "Automatic Mesh Generation for Complex Three Dimensional Regions Using a Constrained Delaunay Triangulation," Engineering with Computers, vol. 5, pp. 161-175, 1989.
[7] J. L. Barron, D. J. Fleet, S. S. Beauchemin, "Systems and Experiment - Performance of Optical Flow Techniques," Int. J. Computer Vision, vol. 12, no. 1, pp. 43-77, 1994.
[8] P. Beardsley, P. Torr and A. Zisserman, "3D Model Acquisition from Extended Image Sequences," in Proc. European Conf. Computer Vision, pp. 683-695, 1996.
[9] R. Berthilsson, A. Heyden and G. Sparr, "Recursive Structure and Motion from Image Sequences Using Shape and Depth Spaces," in Proc. IEEE Computer Vision and Pattern Recognition, pp. 444-449, 1997.
[10] S. Bougnoux, "From projective to Euclidean space under any practical situation, a criticism of self-calibration," in Proc. Int. Conf. Computer Vision, pp. 790-795, 1998.
[11] P. Burt and B. Julesz, "A disparity gradient limit for binocular fusion," Perception, vol. 9, pp. 671-682, 1980.
[12] Q. Chen and G. Medioni, "Image Synthesis From a Sparse Set of Views," in Proc. IEEE Visualization, pp. 269-275, 1997.
[13] Q. Chen and G. Medioni, "A Semi-automatic System to Infer Complex 3-D Shapes from Photographs," in Proc. Int. Conf. Multimedia Computing and Systems, vol. II, pp. 798-805, 1999.
[14] Q. Chen and G. Medioni, "Efficient, Iterative Solution to M-View Projective Reconstruction Problem," in Proc. IEEE Computer Vision and Pattern Recognition, vol. II, pp. 55-61, 1999.
[15] Q. Chen and G. Medioni, "A Volumetric Stereo Matching Method: Application to Image-Based Modeling," in Proc. IEEE Computer Vision and Pattern Recognition, vol. I, pp. 29-34, 1999.
[16] S. Christy and R. Horaud, "Euclidean Shape and Motion from Multiple Perspective Views by Affine Iterations," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 18, no. 11, pp. 1098-1104, 1996.
[17] S. E. Chen and L. Williams, "View Interpolation for Image Synthesis," in Proc. ACM SIGGRAPH, pp. 279-288, 1993.
[18] S. E. Chen, "QuickTime VR - An Image-Based Approach to Virtual Environment Navigation," in Proc. ACM SIGGRAPH, pp. 29-38, 1995.
[19] R. C.-K. Chung and R. Nevatia, "Recovering LSHGCs and SHGCs from Stereo," Int. J. Computer Vision, vol. 20, no. 1/2, pp. 43-58, 1996.
[20] S. Cochran and G. Medioni, "3-D Surface Description from Binocular Stereo," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 14, no. 10, pp. 981-994, 1992.
[21] I. Cox, S. Hingorani, S. Rao, B. Maggs, "A Maximum-Likelihood Stereo Algorithm," Computer Vision and Image Understanding, vol. 63, no. 3, pp. 542-567, 1996.
[22] P. E. Debevec, C. J. Taylor, and J. Malik, "Modeling and Rendering Architecture from Photographs," in Proc. ACM SIGGRAPH, pp. 11-20, 1996.
[23] R. Deriche, Z. Zhang, Q.-T. Luong, and O. Faugeras, "Robust recovery of the epipolar geometry for an uncalibrated stereo rig," in Proc. European Conf. Computer Vision, pp. 567-576, 1994.
[24] U. Dhond and J. Aggarwal, "Structure from stereo - a review," IEEE Trans. Systems, Man and Cybernetics, vol. 19, no. 6, pp. 1489-1510, 1989.
[25] G. Dahlquist and Ake Bjorck, translated by N. Anderson, Numerical Methods, Englewood Cliffs, NJ: Prentice-Hall, 1974.
[26] O. Faugeras, "What Can be Seen in Three Dimensions with an Uncalibrated Stereo Rig," in Proc. European Conf. Computer Vision, pp. 563-578, 1992.
[27] O. Faugeras, Q. T. Luong, and S. J. Maybank, "Camera Self-Calibration: Theory and Experiments," in Proc. European Conf. Computer Vision, pp. 321-334, 1992.
[28] O. Faugeras, Three-Dimensional Computer Vision - A Geometric Viewpoint, Cambridge, MA: MIT Press, 1993.
[29] O. D. Faugeras and B. Mourrain, "On the geometry and algebra of the point and line correspondences between N images," in Proc. Int. Conf. Computer Vision, pp. 951-956, 1995.
Mourrain, "On the geometry and algebra o f the point and line correspondences between N images," in Proc. Int. Conf. Computer Vision, pp. 951-956, 1995. [30] O. Faugeras, "Stratification o f 3-D vision: projective, affine, and metric representations," Journal o f the Optical Society o f America A, vol. 3, pp. 465-484, 1994. [31] O. Faugeras, S. Laveau, L. Robert, "3-D Reconstruction o f Urban Scenes from Sequences o f Images," in Res. Rep. INRIA 2572, Sophia-Antipolis, France, 1995. [32] L. de Floriani, "An On-Line Algorithm for Constrained Delaunay Triangulation," Computer Vision, Graphics and Image Processing: Graphical Models and Image Processing, vol. 54, no. 3, pp. 290-300, 1992. [33] P. Fua, "Reconstructing Complex Surfaces from Multiple Stereo Views," in Proc. Int. Conf. Computer Vision, pp. 1078-1085, 1995. [34] P. E. G ill, W. Muray and M. H. Wright, Practical Optimization, IIth Printing, Boston, MA: Academic Press, 1997. [35] S. J. Gortler, R. Grzeszczuk, R. Szeliski, M. F. Cohen, "Lumigraph," in Proc. ACMSIGGRAPH, pp. 43-54,1996. [36] B. Guenter, C. Grimm, D. Wood, H Malvar and F. Pighin, "Making Faces," in Proc. ACMSIGGRAPH, pp. 55-60,1998. 134 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. [37] C. G. Harris, C. and M. J. Stephens, "A Combined Comer and Edge Detector," in Proc. Alvey Vision Conference, pp. 147-152, Cambridge, UK 1988. [38] R. Hartley, R. Gupta and T. Chang, "Stereo from uncalibrated cameras," in Proc. IEEE Computer Vision and Pattern Recognition, pp. 761-764, 1992. [39] R. Hartley and R. Gupta, "Computing Matched-epipolar Projections," in Proc. IEEE Computer Vision and Pattern Recognition, pp. 549-555, 1993. [40] R. I. Hartley, "In Defense o f the Eight-point Algorithm," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 6 , pp. 580-593, 1997. [41] R. Hartley, "Self-Calibration o f Stationary Cameras," Int. J. Computer Vision, vol. 22, no. 1 , pp. 5-23,1997. [42] R. Hartley and P. Sturm, "Triangulation," Computer Vision and Image Understanding, vol. 6 8 , no. 2, pp. 146-157,1997. [43] P. Havaldar, M. Lee, and G. Medioni, "Synthesizing Novel Views from Unregistered 2-D Images," Computer Graphics Forum, vol. 16, no. 1, pp. 65-73, 1997. [44] W. H off and N. Ahuja, "Surfaces from stereo: integrating feature matching, disparity estimation, and contour detection," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 11, no. 2, pp. 121-136, 1989. [45] H. Ishikawa, D. Geiger, "Occlusions, discontinuities, and epipolar lines in stereo," in Proc. European Conf. on Computer Vision, pp. 89-96, 1998 [46] T. Itoh and K. Koyamada, "Automatic Isosurface Propagation Using an Extrema Graph and Sorted Boundary Cell Lists," IEEE Trans. Visualization and Computer Graphics, vol. 1 , no. 4, pp. 319-327,1995. [47] D. Jacobs, "Linear Fitting with Missing Data: Applications to Structure-from- Motion and to Characterizing Intensity Images," in Proc. Computer Vision and Pattern Recognition, pp. 206-212, 1997. [48] Ramesh Jain, Machine Vision, New York, NY: McGraw-Hill, 1995. [49] E. Kruppa, "Zur Ermittlung eines Objektes aus zwei Perspektiven mit innerer Orientierung," Sitzungsber. Akad. Wiss Wien math. Natunviss. Kl. Abt., 2A 122: pp. 1939-1948, 1913. 135 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. [50] K. N. Kutulakos and S. M. Seitz, "Calibration free Augmented Reality," IEEE Trans. Visualization and Computer Graphics, vol. 4, no. 1, pp. 1-20, 1998. 
[51] M. S. Lee, Tensor Voting for Salient Feature Inference in Computer Vision, Ph.D. dissertation, Dept. of Computer Science, University of Southern California, Los Angeles, CA, 1998.
[52] M. S. Lee and G. Medioni, "Inferring Segmented Surface Description from Stereo Data," in Proc. IEEE Computer Vision and Pattern Recognition, pp. 346-352, 1998.
[53] M. Levoy and P. Hanrahan, "Light Field Rendering," in Proc. ACM SIGGRAPH, pp. 31-42, 1996.
[54] H. C. Longuet-Higgins, "A computer algorithm for reconstructing a scene from two projections," Nature, vol. 293, pp. 133-135, 1981.
[55] W. Lorensen and H. Cline, "Marching cubes: A high resolution 3-D surface construction algorithm," in Proc. ACM SIGGRAPH, pp. 163-169, 1987.
[56] Q. Luong and T. Vieville, "Canonic Representations for the Geometries of Multiple Projective Views," in Proc. European Conf. Computer Vision, pp. 589-596, 1994.
[57] J. Ma and N. Ahuja, "Dense Shape and Motion from Region Correspondences by Factorization," in Proc. IEEE Computer Vision and Pattern Recognition, pp. 219-224, 1998.
[58] S. J. Maybank and O. D. Faugeras, "A Theory of Self-Calibration of a Moving Camera," Int. J. Computer Vision, vol. 8, no. 2, pp. 123-151, 1992.
[59] G. Medioni, M.-S. Lee, and C.-K. Tang, A Computational Framework for Segmentation and Grouping, New York, NY: Elsevier Science, 2000.
[60] L. McMillan and G. Bishop, "Plenoptic Modeling: An Image-Based Rendering System," in Proc. ACM SIGGRAPH, pp. 39-46, 1995.
[61] S. Moezzi, A. Katkere, D. Y. Kuramura, and R. Jain, "Reality Modeling and Visualization from Multiple Video Sequences," IEEE Computer Graphics & Applications, vol. 16, no. 6, pp. 58-63, 1996.
[62] R. Mohr, F. Veillon and L. Quan, "Relative 3D Reconstruction Using Multiple Uncalibrated Images," in Proc. IEEE Computer Vision and Pattern Recognition, pp. 543-548, 1993.
[63] R. Mohr, B. Triggs, "Projective Geometry for Image Analysis," http://www.inrialpes.fr/movi, 1996.
[64] A. Noble, "Finding Corners," Image Vision Computing, vol. 6, pp. 121-128, 1988.
[65] Y. Ohta and T. Kanade, "Stereo by Intra- and Inter-scanline Search Using Dynamic Programming," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 7, no. 2, pp. 139-154, 1985.
[66] J. Oliensis, "Multiframe Structure from Motion in Perspective," in Proc. Workshop on the Representations of Visual Scenes, pp. 77-84, 1995.
[67] J. Oliensis, "An Experimental Study of Projective Structure from Motion," NECI Technical Report, Princeton, NJ, October 1995 (revised 1997).
[68] J. Oliensis, "A Critique of Structure from Motion Algorithms," NECI Technical Report, Princeton, NJ, April 1997 (updated April 1999).
[69] F. Pighin, J. Hecker, D. Lischinski, R. Szeliski and D. H. Salesin, "Synthesizing Realistic Facial Expressions from Photographs," in Proc. ACM SIGGRAPH, pp. 75-84, 1998.
[70] C. J. Poelman and T. Kanade, "A Paraperspective Factorization Method for Shape and Motion Recovery," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 3, pp. 206-218, 1997.
[71] M. Pollefeys and L. V. Gool, "A Stratified Approach to Metric Self-Calibration," in Proc. IEEE Computer Vision and Pattern Recognition, pp. 407-412, 1997.
[72] M. Pollefeys, R. Koch and L. V. Gool, "Self-Calibration and Metric Reconstruction in Spite of Varying and Unknown Internal Camera Parameters," in Proc. Int. Conf. Computer Vision, pp. 90-95, 1998.
[73] W. H. Press and S. A. Teukolsky, Numerical Recipes in C, 2nd Ed., Cambridge, UK: Cambridge University Press, reprinted 1996.
[74] P. Rander, P. J. Narayanan and T. Kanade, "Virtualized Reality: Constructing Time-Varying Virtual Worlds From Real World Events," in Proc. IEEE Visualization, pp. 277-283, 1997.
[75] L. Robert, M. Buffa, and M. Hebert, "Weakly-Calibrated Stereo Perception for Rover Navigation," in Proc. Int. Conf. Computer Vision, pp. 1317-1324, 1995.
[76] C. Rothwell, G. Csurka, and O. Faugeras, "A Comparison of Projective Reconstruction Methods for Pairs of Views," in Proc. Int. Conf. Computer Vision, pp. 932-937, 1995.
[77] S. Roy and I. J. Cox, "A Maximum-Flow Formulation of the N-camera Stereo Correspondence Problem," in Proc. Int. Conf. Computer Vision, pp. 492-499, 1998.
[78] R. Sara and R. Bajcsy, "Fish-Scales: Representing Fuzzy Manifolds," in Proc. Int. Conf. Computer Vision, pp. 811-817, 1998.
[79] R. Szeliski and P. Golland, "Stereo Matching with Transparency and Matting," in Proc. Int. Conf. Computer Vision, pp. 517-524, 1998.
[80] S. M. Seitz and C. R. Dyer, "View Morphing," in Proc. ACM SIGGRAPH, pp. 21-30, 1996.
[81] S. M. Seitz and C. R. Dyer, "Photorealistic Scene Reconstruction by Voxel Coloring," in Proc. IEEE Computer Vision and Pattern Recognition, pp. 1067-1073, 1997.
[82] S. M. Seitz and K. N. Kutulakos, "Plenoptic Image Editing," in Proc. Int. Conf. Computer Vision, pp. 17-24, 1998.
[83] A. Shashua, "Projective Depth: A Geometric Invariant for 3D Reconstruction from Two Perspective/Orthographic Views and for Visual Recognition," in Proc. Int. Conf. Computer Vision, pp. 583-590, 1993.
[84] A. Shashua, "Trilinearity of Three Perspective Views and Its Associated Tensor," in Proc. Int. Conf. Computer Vision, pp. 920-925, 1995.
[85] A. Stewart, R. Flatland and K. Bubna, "Geometric constraints and stereo disparity computation," Int. J. Computer Vision, vol. 20, no. 3, pp. 143-168, 1996.
[86] P. Sturm and B. Triggs, "A Factorization Based Algorithm for Multi-Image Projective Structure and Motion," in Proc. European Conf. Computer Vision, pp. 709-720, 1996.
[87] R. Szeliski and R. Weiss, "Robust Shape Recovery from Occluding Contours Using a Linear Smoother," Real-time Computer Vision, C. M. Brown and D. Terzopoulos ed., Cambridge, UK: Cambridge University Press, 1994.
[88] R. Szeliski, "Video Mosaics for Virtual Environments," IEEE Computer Graphics and Applications, vol. 16, no. 2, pp. 22-30, March 1996.
[89] R. Szeliski and H.-Y. Shum, "Creating full view panoramic image mosaics and environment maps," in Proc. ACM SIGGRAPH, pp. 251-258, 1997.
[90] C.-K. Tang and G. Medioni, "Inference of Integrated Surface, Curve, and Junction Descriptions from Sparse 3-D Data," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1206-1223, 1998.
[91] C. Tomasi and T. Kanade, "Shape and motion from image streams under orthography: A factorization method," Int. J. Computer Vision, vol. 9, no. 2, pp. 137-154, 1992.
[92] P. H. S. Torr and A. Zisserman, "Robust Parametrization and Computation of the Trifocal Tensor," Image and Vision Computing, vol. 15, no. 8, pp. 591-605, 1997.
[93] P. H. S. Torr and D. W. Murray, "The Development and Comparison of Robust Methods for Estimating the Fundamental Matrix," Int. J. Computer Vision, vol. 24, no. 3, pp. 271-300, 1997.
[94] B. Triggs, "Matching Constraints and the Joint Image," in Proc. Int. Conf. Computer Vision, pp. 338-343, 1995.
[95] B. Triggs, "Autocalibration and the Absolute Quadric," in Proc. IEEE Computer Vision and Pattern Recognition, pp. 609-614, 1997.
[96] A. Verri and T. Poggio, "Against quantitative optical flow," in Proc. Int. Conf. Computer Vision, pp. 171-180, 1987.
[97] G.-Q. Wei, W. Brauer and G. Hirzinger, "Intensity- and Gradient-Based Stereo Matching Using Hierarchical Gaussian Basis Functions," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1143-1160, 1998.
[98] P. R. Wolf, Elements of Photogrammetry, Chapter 14, New York, NY: McGraw-Hill, 1983.
[99] A. Zisserman and R. Hartley, "Multiple View Geometry," in Course Notes, IEEE Computer Vision and Pattern Recognition, 1999.
[100] Z. Zhang, "A New Multistage Approach to Motion and Structure Estimation: From Essential Parameters to Euclidean Motion Via Fundamental Matrix," Research Report INRIA 2910, Sophia-Antipolis, France, 1996.
[101] Z. Zhang, "Determining the Epipolar Geometry and Its Uncertainty: A Review," Research Report INRIA 2927, Sophia-Antipolis, France, 1996.
[102] Z. Zhang, O. Faugeras and R. Deriche, "An Effective Technique for Calibrating a Binocular Stereo Through Projective Reconstruction Using Both a Calibration Object and the Environment," Videre - Journal of Computer Vision Research, vol. 1, no. 1, pp. 58-68, 1997. On-line version at http://mitpress.mit.edu/e-journals/Videre.

Appendix A Matrix Decomposition

The following three forms of matrix decomposition are excerpted from [25].

A.1 Singular Value Decomposition

Every matrix $A \in \mathbb{R}^{m \times n}$ ($m \ge n$) of rank $r$ can be written

$$A = U \Sigma V^T, \qquad \Sigma = \begin{bmatrix} \Sigma_r & 0 \\ 0 & 0 \end{bmatrix} \in \mathbb{R}^{n \times n},$$

where $U \in \mathbb{R}^{m \times n}$ and $V \in \mathbb{R}^{n \times n}$ are orthonormal, $\Sigma_r = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_r)$, and $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_r > 0$. The $\sigma_i$'s are called the singular values of $A$. If we write $U_r = [u_1, u_2, \ldots, u_r]$ and $V_r = [v_1, v_2, \ldots, v_r]$, we obtain a rank-$r$ factorization of $A$ as follows:

$$A = U_r \Sigma_r V_r^T = [\sqrt{\sigma_1}\,u_1, \sqrt{\sigma_2}\,u_2, \ldots, \sqrt{\sigma_r}\,u_r]\,[\sqrt{\sigma_1}\,v_1, \sqrt{\sigma_2}\,v_2, \ldots, \sqrt{\sigma_r}\,v_r]^T.$$

A.2 Cholesky Decomposition

Let $A$ be a symmetric positive definite matrix; then $A$ can be uniquely decomposed into the product of a lower triangular matrix $L$ and its transpose: $A = L L^T$. $L$ is called the Cholesky factor of $A$.

A.3 QR Decomposition

Let $A \in \mathbb{R}^{m \times n}$ ($m \ge n$) with $\mathrm{rank}(A) = n$. Then there is an orthonormal matrix $Q \in \mathbb{R}^{m \times m}$ and an upper triangular matrix $R$ with positive diagonal elements such that

$$Q^T A = \begin{bmatrix} R \\ 0 \end{bmatrix}.$$

It is easy to verify that $A^T A = R^T R$, which is the Cholesky decomposition of $A^T A$.
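As an illustrative aside (not part of the original appendix), all three decompositions are available directly in NumPy, and the short Python sketch below verifies the stated properties numerically on a random full-column-rank matrix; the sizes and tolerances are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))            # A in R^{m x n}, m >= n, full column rank

# A.1  Singular Value Decomposition: A = U Sigma V^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # U: 5x3, s: (3,), Vt: 3x3
assert np.allclose(A, U @ np.diag(s) @ Vt)
# rank-r factorization, splitting each singular value evenly between the factors
r = int(np.sum(s > 1e-12))
F = U[:, :r] * np.sqrt(s[:r])          # columns sqrt(sigma_i) * u_i
G = Vt[:r, :].T * np.sqrt(s[:r])       # columns sqrt(sigma_i) * v_i
assert np.allclose(A, F @ G.T)

# A.2  Cholesky decomposition of the symmetric positive definite matrix A^T A
S = A.T @ A
L = np.linalg.cholesky(S)              # lower triangular Cholesky factor
assert np.allclose(S, L @ L.T)

# A.3  QR decomposition (reduced form A = Q R, equivalent to Q^T A = [R; 0])
Q, R = np.linalg.qr(A)                 # Q: 5x3 orthonormal, R: 3x3 upper triangular
assert np.allclose(A, Q @ R)
# R^T R is the Cholesky decomposition of A^T A (up to the signs of R's rows)
assert np.allclose(S, R.T @ R)
```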