INTERACTIVE RAPID PART-BASED 3D MODELING
FROM A SINGLE IMAGE AND ITS APPLICATIONS
by
Ismail Oner Sebe
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
May 2008
Copyright 2008 Ismail Oner Sebe
Acknowledgements
I would like to thank more than a few people for their help throughout my studies
at USC. At the top of the list, I would like to thank my advisors Prof. Ulrich
Neumann and Suya You for their guidance, support and wisdom. Without the
light they shine on my path, I would be lost. I would also like to mention my
collaborators and lab mates for their insightful ideas: Ilya Eckstein, Jinhui Hu,
Jonathan Mooser, Zhigang Deng, Zhengyao Mo and Taehyun Rhee.
Thanks to my defense committee members Prof. C.-C. Jay Kuo and Prof. Ramakant
Nevatia and my qualifying exam committee members Prof. Karen Liu and Prof. Krishna
Nayak for their comments on my research. I would also like to acknowledge the
whole staff of IMSC and the EE department. I am very grateful to both IMSC and
CISOFT for their financial support throughout my studies. I would like to mention
Ms. Diane Demetras for the academic guidance she gave me.
Of course this thesis would not have been possible without the support of my
close friends: Lorena Bravo, Belma Dogdas and Mary Scanlon. Above all I would
like to thank my family; without their belief and support I would not be pursuing
my life goals 10,000 miles away from home.
Table of Contents
Acknowledgements ii
List Of Tables v
List Of Figures vi
Abstract x
Chapter 1: Introduction 1
Chapter 2: Rapid Part-Based 3D Modeling 10
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4 Part-Based Example Model . . . . . . . . . . . . . . . . . . . . . . 20
2.5 Camera Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.6 2D to 3D Space Mapping . . . . . . . . . . . . . . . . . . . . . . . . 30
2.7 3D Modeling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.7.1 Applying Modifications . . . . . . . . . . . . . . . . . . . . . 36
2.7.2 Modification Modes . . . . . . . . . . . . . . . . . . . . . . . 41
2.8 Texture Mapping and Visualization . . . . . . . . . . . . . . . . . . 43
2.8.1 Texture Synthesis . . . . . . . . . . . . . . . . . . . . . . . . 43
2.8.2 Smart-Copy . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.9 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.10 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . 56
Chapter 3: Application I: Model-Driven Video Based Rendering 58
3.1 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.2 Model-Aided Tracking . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.3 Model-Aided Pose Estimation . . . . . . . . . . . . . . . . . . . . . 68
3.4 Environment Modeling . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.5 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Chapter 4: Application II
Semi-Automatic Vehicle Modeling 77
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.2 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.3 Part-Based Vehicle Detection . . . . . . . . . . . . . . . . . . . . . 82
4.3.1 Vehicle Parts . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.3.2 Part Detectors . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.3.3 Learning Algorithm: Reduced Histograms . . . . . . . . . . 90
4.4 3D Modeling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.4.1 User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.4.2 Camera Parameters . . . . . . . . . . . . . . . . . . . . . . . 97
4.4.3 3D Modeling from 2D Part Detections . . . . . . . . . . . . 98
4.5 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Chapter 5: Conclusion and Future Work 109
5.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
References 115
List Of Tables
2.1 Number of parts, triangles and points for different object classes . . 48
2.2 Number and type of user interaction (number of clicks not counting
the camera calibration) and total execution time (in minutes) for the
examples shown in figure 2.16, 2.17, 2.18, 2.19 and 2.20 . . . . . . . 48
2.3 Number of parts, triangles and points for object classes . . . . . . . 56
4.1 Parts and corresponding features. n, k† and k are parameters from
Table 4.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.2 Pseudo-code for edge-based part detectors . . . . . . . . . . . . . . 88
5.1 Comparison of our modeler to other modeling approaches . . . . . . 110
List Of Figures
1.1 Rapid Part-Based 3D Modeler input images and sample output ren-
derings. All objects are modeled and textured from a single uncali-
brated image in a couple of minutes. . . . . . . . . . . . . . . . . . . 4
1.2 A collage of sample renderings from different 3D models created by
our system. All 3D models are created from a single uncalibrated
image in a couple of minutes. . . . . . . . . . . . . . . . . . . . . . 9
2.1 Example models of high-end 3D modeling software packages: Maya
and 3D Studio Max . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Top row: buildings modeled using the Sketch-Up program. Bottom
row: example models of Yang et al. . . . . . . . . . . . . . . . . . . 14
2.3 3D models created by multi-view techniques are often crude and have
many artifacts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 3D models can be created by fusing parts from other 3D models . . 16
2.5 A part-based human model is created from silhouettes estimated
from multiple cameras . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.6 Comparison of time vs. quality of different modelers . . . . . . . . . 18
2.7 Example models of the object classes. Each part is shown in a differ-
ent color. (a) Vehicles (b) House with roof (c) Love seat . . . . . . . 20
2.8 Man-made objects often have features that create parallel lines. Par-
allel lines of three main axes are drawn on the images . . . . . . . . 25
2.9 Projection of the generic model after calibration is shown in (a). The
3D model after the modeling matches the image since the image is
used as a guide (b). . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.10 Projection causes ambiguity. Given 2D vector, there are infinitely
many possible 3D vectors. . . . . . . . . . . . . . . . . . . . . . . . 32
2.11 Estimation of the 3D vector V in normal-mode . . . . . . . . . . . 35
2.12 Modification by whole part vs. part boundary. Red = (User con-
trolled), Green = (Algorithm controlled), Gray = (Do not move).
(a) Before whole part modification (b) After whole part modifica-
tion (c) Before part boundary modification (d) After part boundary
modification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.13 Sample texture synthesis inputs . . . . . . . . . . . . . . . . . . . . 43
2.14 Sample texture copy inputs . . . . . . . . . . . . . . . . . . . . . . 44
2.15 (a) Input image, Textured 3D model with (b) projective texturing
(c) projective texturing with symmetry (d) Smart Copy . . . . . . . 46
2.16 A Honda Accord is modeled from a single image. The input image
(a), final 3D model overlay (b), textured novel view renderings (c,d,e)
and 3D mesh of the resulting 3D models (f). . . . . . . . . . . . . . 49
2.17 An SUV is modeled from a single image. The input image (a), final
3D model overlay (b), textured novel view renderings (c,d,e) and 3D
mesh of the resulting 3D models (f). . . . . . . . . . . . . . . . . . 52
2.18 Couches with two and three cushions are modeled. (a) and (e) are in-
put images, (b) is the final 2D display of the modeler, (c)(d)(f)(g)(h)
are the 3D renderings of the final result. . . . . . . . . . . . . . . . . 53
2.19 Input images and resulting 3D renderings of a building on USC Campus 54
2.20 Input images and resulting 3D renderings of a building on a street . 55
2.21 Distance of two models of the same vehicle from two different views,
blue is low and red is high error. Errors are concentrated on the
parts that are invisible to the other view. . . . . . . . . . . . . . . . 57
3.1 Screen shots of the "bullet-time effect" from the major motion picture
"The Matrix". The first popular use of video-based rendering in the
movies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.2 Camera set-up of 3D Cage system (a) from CMU and Stanford Multi-
Camera Array (b) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.3 Novel view rendering of outdoor scenes is attempted by research
groups in universities: (a) USC (b) Berkeley . . . . . . . . . . . . . 61
3.4 Flow diagram of the model-driven video-based rendering . . . . . . . 63
3.5 Our model-based 3D patches have less "background leakage" than
the regular 2D patches. (a) Regular 2D patches (b) Projection of 3D
patches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.6 2D tracks of the model-driven tracker (a) before and (b) after filtering 70
3.7 (a) Background image created by subtraction of foreground and av-
eraging. (b) 3D wireframe model of the environment . . . . . . . . 71
3.8 (a) Rendering of the vehicle without the shadows (depth ambiguity)
(b) Rendering of the vehicle with shadows (clear depth perception) 72
3.9 3D model is projected on the input video (stationary camera). (a)
Frame 1 (b) Frame 25 (c) Frame 110 (d) Frame 290 . . . . . . . . . 73
3.10 Video-Based Rendering from arbitrary views for the same frames as
in Figure 3.9. (a) Frame 1 (b) Frame 25 (c) Frame 110 (d) Frame 290 75
4.1 Hierarchy of vehicle parts . . . . . . . . . . . . . . . . . . . . . . . 85
4.2 Hierarchy of vehicle parts . . . . . . . . . . . . . . . . . . . . . . . 87
4.3 Creating multiple hypotheses from the response curve. The weighted
response curve is shown in blue, marginal PDF is shown in green.
Peaks selected by non-maxima suppression are shown in red. . . . . 90
4.4 Comparison of our learning algorithm to others in terms of cross-
correlation and joint optimization. . . . . . . . . . . . . . . . . . . . 93
4.5 Cross-correlation between parts: front bumper, rear bumper, middle,
roof, front and rear windshield locations and angles. White is high
correlation and dark is low correlation. . . . . . . . . . . . . . . . . 95
4.6 Two sample input images . . . . . . . . . . . . . . . . . . . . . . . 100
4.7 (a) Generic model projected, before modeling. (b) Level I hypothe-
ses, the best hypothesis combination is in red. (c) Level I updated best
selection after a single user interaction. (d) Level II hypotheses. (e)
Image with projected final 3D model . . . . . . . . . . . . . . . . . 102
4.8 (a) 3D model with estimated vehicle color (b) 3D model with tex-
ture from the original camera position (c) 3D model from arbitrary
viewing angle (not all texture is available) . . . . . . . . . . . . . . 104
4.9 (a) 3D model with estimated vehicle color (b) 3D model with tex-
ture from the original camera position (c) 3D model from arbitrary
viewing angle (not all texture is available) . . . . . . . . . . . . . . 106
4.10 (a) Level I Hypothesis Match Percentage, (b) Level II Hypothesis
Match Percentage, (c) Level II Pixel Error, (d) Level I Percentage
Error, (e) Level II Pixel Error, (f) Level II Angular Error . . . . . . 108
Abstract
Commercially available 3D modeling software is often designed for professional
artists and engineers. In this thesis, we present a novel image-based modeling frame-
work to rapidly create 3D models from a single un-calibrated photograph taken by
an ordinary camera. Our target users are novice computer users, thus a strictly 2D
user interface (rather than a 3D interface similar to Maya) is chosen. However,
given a 2D vector in the image, there are infinitely many 3D vectors with the
same 2D projection. Our modeler combines 2D user mouse drags with the exam-
ple model to create unique part-based 3D modifications. In our framework, 3D
models are created by modifying a part-based example model of the object class.
Furthermore, our part-based modification algorithm automatically distributes the
user inputs to the whole model. Our smart-copy texture synthesis algorithm auto-
matically creates a complete texture map of the 3D model. This texture synthesis
creates seamless textures by combining the image with 3D model properties
such as visibility, symmetry and distance to the camera. Many common objects such
as buildings, furniture and vehicles are modeled by our technique in mere minutes.
In this thesis, two extensions to our modeler are presented: model-driven video-
based rendering (ModVBR) and semi-automatic vehicle modeling. ModVBR is
able to create novel photo-realistic renderings of a dynamic scene from a single
video stream. The availability of a 3D model simplifies tracking, pose estimation and
the addition of post-effects such as shadows. Our interactive modeler often requires tens
of inputs from the user, which can be entered in a couple of minutes. Semi-automatic
vehicle modeling, on the other hand, requires only a couple of user inputs and is
performed in tens of seconds. A learning-based part detector is fused with our
rapid part-based modeler to semi-automatically create 3D models of vehicles from
a single side-view image.
Chapter 1
Introduction
Commercially available 3D modeling software is often designed to be used by
engineers or professional artists. In this arena, 3D modelers, such as 3D Studio
Max and Alias Maya, have been successful in creating various detailed 3D models
with thousands of polygons. Complex, coherent and detailed 3D models are often
referred to as high-quality 3D models. These software programs dominate the
entertainment and gaming industries. On the other hand, few novice computer
users use these types of software, since high-quality models require hours of manual
work and expertise in modeling. The main goal of our modeling research is
creating a rapid modeling scheme that is able to create quality 3D models with a
simple user interface. Our rapid part-based 3D modeler is used as a stand-alone
program to model every-day objects, such as vehicles, furniture and buildings.
With this thesis, non-engineer/artist users will be able to rapidly create 3D
models. Instead of teaching users how to use a modeling software program, our goal
is to reverse-engineer their attempts to do modeling. It is a fact that people see 3D
objects even when they are looking at images [31]. It is assumed/suggested that
our recognition of objects triggers a recall of their corresponding 3D models, such
as depth relations. In our framework, users enter their inputs via mouse clicks and
drags using the image as a guide. However, there are infinitely many 3D movements
that satisfy this 2D line segment [14]. This ambiguity can be resolved either by
using multiple images [36,39,44] or using a 3D input device [28,50]. We present
a procedure to uniquely convert these 2D inputs to proper 3D movements from a
single image. This is important since novice computer users feel more comfortable
with 2D rather than 3D user interfaces.
Our rapid part-based 3D modeler requires a minimal number of user interactions (i.e.
10 - 30), which attracts novice users since they are generally less likely to spend
hours or days on a single 3D model. In other words, the "Time vs. Quality" paradigm
does not apply to our scheme; typically modeling time with our system is two to
three minutes. One of the reasons behind this is that our modeler is a model-based
system, thus there exists a prior underlying structure to the object being modeled.
Model-based systems can be parametric models where a set of rules is forced onto
the inputs; however, generalization of this technique is non-trivial. Good examples
of model-based approaches are the revolution of surfaces [52,60] and architectural
modeling [9,29]. In architectural modeling, parallelism and orthogonality are en-
forced over the user inputs to remove the arbitrariness. In our framework, the
prior knowledge of an object class is gathered by taking an instance 3D model of
this class as an example. All the information about the object class is passed to
the program by the use of this generic model. In particular, the user modifies the
generic model by the 2D mouse drags until the model matches the input image.
Modification of an existing template is faster but more approximate than creating
models from scratch.
Although modifying a generic 3D model in order to get the final model is a major
speed up, without an effective way of entering and distributing inputs to the whole
model the overall system still suffers from the amount of user interaction required. First,
we choose part-based modifications rather than vertex-based modifications. The
rationale behind this is that realistic renderings often require detailed 3D models,
i.e. a large number of vertices. For example, our generic Honda Civic model has
14,000 vertices. Although not all vertices need to be modified at every input, even
a 5% change still requires changes in 700 vertices. Thus a vertex-based approach would
clearly hinder the system's performance. Secondly, the user inputs are distributed
from the source part to the other parts via the connectivity and symmetry of the model.
This distribution is automatically performed for every input and it not only speeds
up the process but also prevents artifacts due to inconsistent user inputs. Although
an exact model is often unachievable, part-based modifications enable rapid creation
of 3D models that are visually close to the real model in mere minutes. Furthermore,
our rapid part-based 3D modeler is able to texture this new 3D model using the
image and the symmetry of the object, where the gaps in the texture map are
Figure 1.1: Rapid Part-Based 3D Modeler input images and sample output ren-
derings. All objects are modeled and textured from a single uncalibrated image in
a couple of minutes.
filled by the estimated mean body color of the object. Figure 1.1 shows various 3D
models created by our system from single images in a couple of minutes.
The main contributions of this thesis to 3D modeling can be summarized under
three topics. First, a generic modeling framework to model a range of objects
from examples is shown. Examples are modified to create new instances rather
than creating them from scratch. This not only speeds up the overall modeling
procedure but also keeps the consistency within the class intact. Second, a novel
2D to 3D space mapping scheme is presented. This mapping exploits the users'
understanding of 3D objects from 2D images. Third, a unique part-based (as
opposed to vertex-based) modeling approach for arbitrary objects is presented.
Part-based modifications are far faster and more suitable for modeling many
man-made objects than vertex-based modifications.
Our 3D modeler requires few 2D user interactions to create 3D models. This
has crucial implications beyond 3D modeling. Several extensions to our research
are made by integrating our modeler with other known techniques, such as model-
aided tracking and object detection. This thesis focuses on two of these extensions:
model-driven video-based rendering and semi-automatic vehicle modeling.
Video-Based Rendering (VBR) [3,34,40] extends image-based rendering tech-
niques [9,24,49,51]. The main idea behind VBR is the use of multiple frames (from
a single or multiple camera) in order to model both static and dynamic objects in
the scene. In general either the camera or the scene is assumed to be stationary,
since these two have mathematically equivalent representation. Our main focus is
on the stationary camera, dynamic scene case. VBR systems require tedious man-
ual work and/or computation times ranging from days to weeks. In our work, we
present a pipeline that is able to create close to photo-realistic renderings in tens
of minutes. At the core of our VBR system lies our part-based modeler. Although
one may suggest using the part-based modeler at every frame rather than VBR,
this is neither efficient nor necessary. Furthermore, tracking of the object ensures
the consistency of the vehicle’s 3D model over the frames.
VBR is generally composed of two major steps: tracking and pose estimation.
Our model-driven VBR (ModVBR) takes advantage of the 3D model to perform
both. We present a technique that automatically creates 2D patches by aligning
user inputs with the 3D model. The model-aided tracking not only
enables the system to mitigate the "background leakage" problem but also exhibits less
drifting due to the update performed after pose estimation. In particular, our tracker updates
the location of 2D points after pose estimation by projecting the corresponding
3D points back onto the image, which reduces drift and corrects overall error
propagation.
The ModVBR process starts by modeling the dynamic object in the scene, i.e. the vehi-
cle. At this stage a mapping between the 2D image and the 3D model is known. Numerous
points are tracked over the video using the image and the 3D model. Later, these
points are used to update the camera parameters. The static environment is also
modeled and rendered by our system to increase realism of our renderings. This
model-driven VBR technique is presented as a possible application of our modeler.
To our best knowledge, ModVBR is the only dynamic scene VBR technique from
a single un-calibrated camera.
The second extension to our modeler is made in the area of automatic vehicle mod-
eling. Our approach, although presented for vehicles, can be applied to other object
classes. Similar to our 3D modeler, our object detection system is also part-based.
Automatic part detection not only decreases the modeling related user interaction
(from an average of fifteen inputs to an average of three) but also allows automatic
camera calibration (from sixteen image clicks to zero). Our object detection system
takes advantage of the fact that parts of an object are often related to each other.
For example, sports cars have elongated fronts and shallow frame.
Our vehicle modeler first creates multiple hypotheses for every part using the
image, predefined part filters and marginal probability distribution of part proper-
ties. Later, a best combination of the part hypotheses is estimated using our novel
reduced histograms learning algorithm. The reduced histograms algorithm is able to
learn a JPDF from a small set of training samples (e.g. hundreds) regardless of the
dimensionality. The best hypotheses are chosen by maximizing the log-likelihood
of the JPDF approximated by the reduced histograms. However, the selection is not
error-free and the user is allowed to correct possible false selections. Once all the
part hypotheses are selected, part location estimates are passed to our part-based
3D modeler. Since our 3D modeler only requires 2D mouse drags, by combining
the current location of the part with this estimate, an input per part is achieved.
We model many vehicles ranging from SUVs to hatchbacks from a single side-view
picture in around ten seconds with an average of three user inputs.
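As an illustration of the selection step described above, the following Python sketch (hypothetical code written for this summary, not the detector of Chapter 4) enumerates the per-part hypotheses and keeps the combination with the highest joint log-likelihood; the scoring callable stands in for the JPDF approximated by reduced histograms, and all names are assumptions made here.

import itertools
import math

def best_combination(hypotheses, log_likelihood):
    """hypotheses: dict part_name -> list of candidate detections.
    log_likelihood: callable scoring one {part_name: candidate} assignment."""
    parts = sorted(hypotheses)
    best, best_score = None, -math.inf
    # Exhaustive search is feasible because each part keeps only a handful
    # of hypotheses in our setting.
    for combo in itertools.product(*(hypotheses[p] for p in parts)):
        assignment = dict(zip(parts, combo))
        score = log_likelihood(assignment)
        if score > best_score:
            best, best_score = assignment, score
    return best, best_score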
Our main contributions to object detection are the introduction of a new learn-
ing algorithm and interactive learning-aided multiple hypotheses optimization. Re-
duced histograms is a general algorithm for learning an arbitrary JPDF as a
combination of cross-correlation weighted 2D histograms. Furthermore, we presented
a suitable multiple hypotheses maximization technique that
leverages the strengths of reduced histograms. Although our algorithm is tested
on vehicles only, we expect that our assumptions would hold for any part-based
representation of objects.
In this thesis, our part-based 3D modeler and its extensions are presented.
Simple 2D only user interface not only enables novice computer users to create 3D
models but also allows many other extensions similar to our ModVBR and vehicle
modeler.
Figure 1.2: A collage of sample renderings from different 3D models created by our
system. All 3D models are created from a single uncalibrated image in a couple of
minutes.
Chapter 2
Rapid Part-Based 3D Modeling
3D modeling of arbitrary objects often requires either extensive user interaction
or expensive specialized hardware. In this chapter, we present a novel part-based
modeling framework to rapidly create 3D models of objects from a single image
taken by an ordinary camera. Object models are created by altering a part-based
example model (of the same class). Commonalities of the object class, e.g. similar
shape and hierarchy, speed up the modeling procedure. Part-based 3D
modifications are estimated by combining 2D user inputs (mouse drags over the
image) with the example model properties. Many object classes, such as buildings,
couches and vehicles, are modeled by our technique from a single un-calibrated
image in a couple of minutes.
2.1 Introduction
Creating rapid but accurate 3D models of common objects is a valuable tool for
many applications such as virtual/augmented reality, digital entertainment and
computer games. Having 3D models of personal items such as your house, furniture
or car might seem unnecessary; however, the same argument can be made for pictures of
such items. The popularity of photography is due to its accurate representation and
ease of capture. Thus, if creating 3D models of objects were simple and inexpensive,
their usage might dramatically increase.
So far there has not been a general-purpose modeler that can create accurate
models in minutes. On the other hand, a variety of systems exists for modeling
specific classes such as human body [3], face [2] and buildings [29]. These systems
typically take advantage of the prior knowledge of the object class. However, to our
best knowledge, a framework to generalize these algorithms to arbitrary objects has
not been attempted.
In this chapter, we present a general-purpose part-based modeler that is both
fast and accurate. Our modeler is designed to modify a part-based 3D example
model of a given object class to create different 3D instances. The user interface, unlike
other general-purpose modelers, is 2D, i.e. the user gives all the inputs by mouse drags
over the image. Our procedure distributes the user inputs to the whole model in
order to speed up the process and prevent inconsistencies between parts due to
different inputs. We are able to create 3D models from a single un-calibrated image
in a couple of minutes. We applied our algorithm to objects that we encounter on a
daily basis such as buildings, furniture and vehicles.
2.2 Related Work
Maya and 3D Studio Max are general purpose high-end 3D modeling software
packages. These programs can be used to model objects ranging from prehistoric
dinosaurs to coffee tables, see figure 2.1 for some example models. Their generality
comes from the fact that they work on the smallest possible element of a 3D model:
the vertex. However, generality comes with a price; they often require intensive manual
labor. For example, a typical vehicle model is composed of a couple of thousand vertices.
Modeling objects is often a multi step procedure that requires creation of the 3D
model, assigning surface properties and texturing. Furthermore, they are designed
for creating models, thus modification to already created 3D models is as time
consuming as creating them.
On the other end of the spectrum, sketch-based systems attempt to model
objects with simple approximate user interactions. The TEDDY software program
is considered to be the principal starting point of many sketch-based methods [57].
TEDDY is a simple to use program to create approximate 3D models as a union of
ellipsoids. SKETCH rapidly creates non-photorealistic renderings of the 3D models
created by the user from few simple inputs [65]. This method is one of the first
of its kind and inspired the research on converting simple 2D user interactions to
Figure 2.1: Example models of high-end 3D modeling software packages: Maya and
3D Studio Max
3D actions. Sketching methods are designed to approximate rather than to model, and
the created models lack realism. Yang et al. fit the user inputs to pre-made part
templates and combine them to create 3D models [64], see figure 2.2. They can
rapidly create 3D models of a variety of objects such as cups, airplanes and fish.
However, their approach requires carefully created 2D object templates, where these
2D templates need to be matched to a 3D model template. In other words, adding
an extra object class to their modeler requires a large amount of work. Furthermore,
objects cannot be in an arbitrary position, i.e. the object pose is pre-determined. Google
Inc. has created the Sketch-Up software program to model architectural structures
using sketching framework [55], see figure 2.2. Sketch-up only allows straight lines
and certain angle combinations as inputs, thus user inputs are interpreted to fit
certain structure. Their system, similar to sketch-based systems, is easy to use and
able to create low complexity models in minutes.
Figure 2.2: Top row: buildings modeled using the Sketch-Up program. Bottom row:
example models of Yang et al.
Prior knowledge of the objects can dramatically increase the quality and the
speed of the modeler. For instance, architectural modeling approaches use the
structure of a building to constrain and convert user inputs to appropriate 3D
entries [9,29,54]. Although the models are rapidly created in minutes, they tend
to lack details. However, these techniques do not require an additional texturing
step due to their 2D interface. Hoeim et al. presented an automatic architectural
modeler based on pop-up book effect [17]. The scene is segmented into three parts:
ground, middle and sky. The middle section is "popped-up" over the ground and
the sky is omitted in the 3D visualization. Another example of this family is the
revolution of surfaces. Revolution of surfaces is commonly used to model objects
such as spheres and cylinders, which are useful primitives for vases, boxes etc. [52,60].
Figure 2.3: 3D models created by multi-view techniques are often crude and have
many artifacts
Considerable work has been done in the area of multiple-view image-based mod-
eling. Image-based modelers use the correspondences between the images to esti-
mate 3D structures [26,44]. Multi-view techniques do not require a predetermined
shape of the objects. Their main limitation is the Lambertian surface requirement
for the automatic detection, i.e. the color of the surface is independent of the
viewing angle. Some objects, however, are non-Lambertian which complicates the
matching procedure, e.g. shiny surfaces of vehicles. Furthermore, multiple images
of an object might not always be available. Another multi-view system often used
is the silhouette-based reconstruction [36]. However these techniques create crude
3D models and have many artifacts: bloated surfaces with many sharp edges, see
figure 2.3.
A variety of projects has focused on creating 3D models by combining previously
created 3D models. These techniques are generally faster but more approximate
compared to other modeling techniques. The focus is on splitting models into parts
Figure 2.4: 3D models can be created by fusing parts from other 3D models
and fusing them together. Funkhouser et al. split 3D models by intelligent scissors
method and combine them by automatic alignment estimation and interpolation
[16], see figure 2.4. Their method is limited to the size of the database and retrieval
of desired shapes seems to be the bottleneck of their algorithm. Although they
present a powerful technique to analyze objects ahead of time and learn many
parts, not all possible parts are retrieved. However, one can argue that enough is
retrieved. We think that combining our part-based modification framework with
their part-based fusing framework is a step toward a complete modeler. Modeling
by examples is not limited to 3D modeling [6]. Cohen et al. presented the idea of
example interpolation and its applications to face modeling and body animation.
Figure 2.5: A part-based human model is created from silhouettes estimated from
multiple cameras
Modifying an existing 3D model is similar to estimating variables of a parametric
representation. Carranza et al. assigned certain DOF (such as location and size)
to a part-based human model [3] and the parameters of this model are estimated
using the image silhouettes from multiple views, see figure 2.5. Their technique,
although automatic at run time, requires intensive pre-processing of the example
human model, thus not really suitable for general-purpose modelers. Sebe et al.
presented a part-based vehicle modeler [45,46] that creates different instances of
vehicles by translating the parts of a generic car model. Their technique, although
limited to vehicles, is the closest to our approach of any other method. However,
their part decomposition requirement is too strict, and parts can only be translated,
which is too limited for general object modelers.
Liu and Huang presented an image-based interactive modeler to create 3D models
of relatively smooth surfaces from a single image [30]. Their technique is a com-
bination of automatic image segmentation (which requires plain background) and
[Figure 2.6 is a plot of "Amount of User interaction" versus "Quality" ("Time vs
Quality plot of different modeling approaches"), placing Maya, SKETCH, Yang,
Hoeim, Sketch-Up, Carranza, Sebe, Architectural, Huang, Matusik and Funkhouser
between the "Best Modeler" and "Worst Modeler" extremes.]
Figure 2.6: Comparison of time vs. quality of different modelers
user interaction. Their approach requires intensive user interaction and suffers from
numerous assumptions made about the images and the 3D models of the objects.
Figure 2.6 shows our "subjective" comparison of different modelers and their
expected quality and time requirements. The left-top corner represents the holy
grail of modelers, i.e. complicated, detailed and coherent 3D models are created
automatically. On the other extreme, right-bottom corner represents the worst
possible modeler, i.e. crude and incoherent 3D models are created in days or weeks.
Our modeler, as can be seen, is closer to the former goal.
2.3 System Overview
Our modeler uses a single uncalibrated image and the user starts by selecting the
object class, e.g. vehicles, couches etc. Currently our system supports the
following object classes: ottoman seats, couches (two and three seats), buildings
(with and without roof) and vehicles. We have chosen a variety of classes with
different complexities to present the generality of our modeler. Each object class is
represented by a part-based example model.
The first step is the labeling of the 3 major axes by finding two parallel lines for each
direction. Second, anchor points (specific to the object class) are labeled by the
user. These two steps are used to align the example model with the instance in the
image. From these 16 image points (where the requirements on the selection of these
points are detailed in section 2.5), a projection and scaling matrix are estimated. At
this stage, the example model is projected onto the image as a guide for modeling.
The user selects and modifies parts by mouse drags (over the image). Although the
alteration inputs are in 2D, 3D modifications are estimated and applied to the 3D
example model. The user is allowed to switch between two alternative visualizations:
the 2D image with a 3D model overlay, and the 3D model textured with the image. Both
have their merits when trying to see the artifacts and the next appropriate modification.
This procedure is repeated until the user is satisfied with the quality. The final result
is a part-based textured 3D model, which can be saved with or without the texture
as a VRML file.
In the next section, the choice and requirements of example models are ex-
plained. Section 2.5 has the details of the camera parameter estimation. Visual-
ization and texturing of our model is explained in section 2.8. Mapping of 2D user
Figure 2.7: Example models of the object classes. Each part is shown in a different
color. (a) Vehicles (b) House with roof (c) Love seat
inputs to 3D is a crucial part of our framework, see section 2.6. Section 2.7 gives
the details of how user inputs are converted to 3D modifications and applied to
the example model. The 3D models created by our system and a discussion of the
performance and limitation of our modeler can be found in section 2.9.
2.4 Part-Based Example Model
Our approach of model alteration (rather than creation) takes advantage of the fact
that the user knows (or can identify) the class of object in the images. Prior knowledge
of an object can be used in a parametric or a model-based fashion. Parametric models have been
successfully used for architectural structure modeling. Buildings are composed of
dominant parallel and orthogonal lines. By selecting either one of these modes
for every line, many building models can be rapidly created [29,56]. Revolution
of surfaces can model objects such as vases, cylinders and spheres [52,60]. As
can be seen from the type of objects in these examples, parametric models are
used to model simple objects, since parameterization of complex objects is often
impractical (if not impossible).
In order to model arbitrary objects, we choose a model-based approach. The
underlying assumption is that the objects from the same class share many com-
monalities, such as hierarchy and parts. For example, humans have two legs, two
arms, a torso and a head, where the relative size and shape of these parts vary
from person to person. This observation was leveraged by Carranza et al. to create
human models [3], see figure 2.5. Modifying an example model saves time since
parts are already created and only need to be placed. The 3D model of an object class
should be chosen in a way that minimizes the average number of manipulations;
the 3D model corresponding to the mean of the object class is the best choice. For
example, if there are three human models available: skinny, overweight and regular
weight, choosing regular one minimizes the average alteration made to the model.
Representation of the 3D model mainly depends on how the modifications to the
example model will be performed. Although mesh-based modification techniques
such as free-form deformations [47] can be used, we choose to use a part-based
alteration technique due to two main reasons. First, most 3D models today are
composed of parts (in our Internet search for 3D models, close to 80% of all models
were part-based). Second, part-based techniques are often faster than general-
purpose modelers and allow control over accuracy via varying the part number [16,20,
45,46].
Part-level alterations, in comparison to vertex-level alterations, lead to faster
but more approximate modelers. The level of detail and the amount of user inter-
action is a trade-off, where changing the number of parts in a model can move the
modeler closer to either one of these extremes. For example, a human arm can be
represented with a 3-part model: hand, forearm and biceps; however, this represen-
tation is unable to make any changes to the fingers of the hand. The best approach
in creating a fast (couple minutes per model) and high level-of-detail capable (tens
of thousands of coherent polygons) modeler is by having a hierarchical represen-
tation for parts. The example model can be created in a way that each part can
be decomposed into smaller parts upon user’s request. One such method, although
compatible with our current framework, is not a part of our current implementation.
Introducing an object class to our framework requires some preprocessing (the total
amount depends on the example model at hand). If a suitable part-based 3D
example is not available but a (non-part-based) example is available, then a mesh-
partitioning approach is taken.
Mesh partitioning, or mesh charting, is the process of labeling the mesh faces
as different parts. In our modeler, each part (partition) must be unique, bounded,
closed, connected, and compact. Although each partition can be of any genus, a
higher genus may cause unexpected behaviors during modeling. All parts created
in our system are of genus 0, except the tires, where they are of genus 1. Genus 0
and 1 partitions are homeomorphic to discs and toroids, respectively. Furthermore,
parts can have holes, as long as there exists a path connecting the holes to the
outside boundary (implied from connectedness). Although mesh partitioning does
not have to go through the vertices of the model, we choose not to cut through
faces for simplicity. Our partitions do not share faces but do share vertices
(and edges), where these form the boundaries of the partitions. The boundaries
of the partitions are crucial since they transfer information (in our case 3D mesh
movement) between parts. Shared boundaries satisfy the C^0 continuity requirement
of the partitions. Boundaries are also regularized, i.e. there is no isolated path
of edges on the boundary. Further details about mesh partitioning can be found
in [42].
The mesh partitioning procedure has two steps: boundary creation and face labeling.
The user selects certain mesh vertices as the boundary, and these vertices are linked to
create a compact and regularized path using Dijkstra's algorithm [7]. Once the
boundary is known, the user labels the inside to resolve the inside/outside ambiguity. A
flood-fill algorithm is used to extract all the faces in the partition [7]. This procedure
is repeated until the model is totally partitioned. Although this step is time-
consuming and requires tedious work, it needs to be done only once per class of
objects. Mesh partitioning is a time-consuming process and should be avoided if
possible; if a part-based 3D example model is already available, the mesh-partitioning
step is skipped.
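As an illustration of the face-labeling step, the following hedged sketch (assumed code, not the thesis implementation) grows a partition from a user-chosen seed face by breadth-first traversal over shared edges, never crossing the user-drawn boundary path:

from collections import deque

def flood_fill_partition(faces, seed_face, boundary_edges):
    """faces: list of vertex-index triples; boundary_edges: set of frozensets
    of two vertex indices marking the partition boundary."""
    # Map each non-boundary edge to the faces that share it.
    edge_to_faces = {}
    for fi, face in enumerate(faces):
        for a, b in ((face[0], face[1]), (face[1], face[2]), (face[2], face[0])):
            edge = frozenset((a, b))
            if edge not in boundary_edges:
                edge_to_faces.setdefault(edge, []).append(fi)
    # Breadth-first traversal starting from the seed face.
    inside, queue = {seed_face}, deque([seed_face])
    while queue:
        fi = queue.popleft()
        face = faces[fi]
        for a, b in ((face[0], face[1]), (face[1], face[2]), (face[2], face[0])):
            for nb in edge_to_faces.get(frozenset((a, b)), []):
                if nb not in inside:
                    inside.add(nb)
                    queue.append(nb)
    return inside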
A graph representation of parts is extracted for the modeler to perform part-based
actions. Sebe et al. automatically created a similar representation by analyzing
shared vertices between parts [45]. However, after analyzing many part-based 3D
models, the assumption of shared vertices is found to be a rare case. We suspect
that this is due to the fact that artists model parts separately and one at a time.
Thus a relaxed connectivity criterion, pseudo-connectivity, is used in our framework.
In particular, parts that are closer than a threshold distance are assumed to be neighbors.
Although this threshold can be estimated by analyzing the example model, we allow the user to
enter this value. In our experiments, we observe that the value does not have to
be precise and can generally be guessed in a single try. Once the value is given
during pre-processing, a connectivity graph of parts is automatically estimated at
run-time.
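A sketch of the pseudo-connectivity estimation follows (illustrative code; a brute-force distance test is assumed to be sufficient for the part counts involved, and a k-d tree could replace it for large meshes):

import numpy as np

def pseudo_connectivity(part_vertices, threshold):
    """part_vertices: dict part_name -> (n_i, 3) vertex array.
    Two parts are neighbors if their closest vertices are within threshold."""
    names = list(part_vertices)
    graph = {name: set() for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            va, vb = part_vertices[a], part_vertices[b]
            # Smallest vertex-to-vertex distance between the two parts.
            d = np.linalg.norm(va[:, None, :] - vb[None, :, :], axis=-1).min()
            if d < threshold:
                graph[a].add(b)
                graph[b].add(a)
    return graph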
Examples created by mesh partitioning or directly downloaded from the Internet
are hard to distinguish from each other. Figure 2.7 (a,b) show mesh partitioning
models and (c) shows a downloaded 3D model.
2.5 Camera Parameters
Man-made objects often have features that create parallel lines, e.g. buildings,
vehicles, chairs; see figure 2.8. Thus, we use vanishing points approach to estimate
the coupling between the image and the example 3D model. Specifically, a 3x4
Figure 2.8: Man-made objects often have features that create parallel lines. Parallel
lines of three main axes are drawn on the images
projection matrix P (equation 2.1) and a 3x3 scaling matrix are estimated from 16 image
clicks.
\[
P = K \, [\, R \;\; T \,]
\tag{2.1}
\]
The vanishing points approach requires the user to label 2 parallel lines for each of the X,
Y and Z directions. From these 12 2D points, a projection matrix is estimated [5].
The projection matrix is composed of three major matrices: the internal camera matrix,
the rotation matrix and the translation matrix (equation 2.1). Vanishing points are assumed to be at
infinity in 3D, and their projection can be written as in 2.2 (this is a special case
for three orthogonal vanishing points).
\[
\begin{bmatrix}
\lambda_1 p^1_x & \lambda_2 p^2_x & \lambda_3 p^3_x \\
\lambda_1 p^1_y & \lambda_2 p^2_y & \lambda_3 p^3_y \\
\lambda_1 & \lambda_2 & \lambda_3
\end{bmatrix}
= P
\begin{bmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1 \\
0 & 0 & 0
\end{bmatrix}
\tag{2.2}
\]
, where p^i denotes the vanishing points on the image plane and λ_i denotes the
projective scaling. Each vanishing point can be found by intersecting the user-labeled
parallel lines, see figure 2.8 (the lines are parallel in 3D but not in 2D, thus
they have a finite intersection point). Formula 2.3 can be obtained by substituting
P from equation 2.1 into 2.2.
\[
\begin{bmatrix}
p^1_x & p^2_x & p^3_x \\
p^1_y & p^2_y & p^3_y \\
1 & 1 & 1
\end{bmatrix}
\begin{bmatrix}
\lambda_1 & 0 & 0 \\
0 & \lambda_2 & 0 \\
0 & 0 & \lambda_3
\end{bmatrix}
= K R
\tag{2.3}
\]
If the right-hand side of equality 2.3 is multiplied by its transpose, we obtain
2.4. This form is quite useful for estimating the internal camera parameters. This
equality can be converted into a form that solves for the 6 parameters (the three λ's
have two DOF). However, an explicit solution for the camera parameters can also be
written, e.g. equations 2.7 and 2.11.
\[
(KR)(KR)^T = K (R R^T) K^T = K (I) K^T = K K^T
\tag{2.4}
\]
\[
K K^T =
\begin{bmatrix}
p^1_x & p^2_x & p^3_x \\
p^1_y & p^2_y & p^3_y \\
1 & 1 & 1
\end{bmatrix}
\begin{bmatrix}
\lambda_1^2 & 0 & 0 \\
0 & \lambda_2^2 & 0 \\
0 & 0 & \lambda_3^2
\end{bmatrix}
\begin{bmatrix}
p^1_x & p^2_x & p^3_x \\
p^1_y & p^2_y & p^3_y \\
1 & 1 & 1
\end{bmatrix}^T
\tag{2.5}
\]
The internal camera matrix (K), under zero-skew and fixed aspect ratio assump-
tions, can be written as in 2.6.
\[
K =
\begin{bmatrix}
f & 0 & c_x \\
0 & f & c_y \\
0 & 0 & 1
\end{bmatrix}
\tag{2.6}
\]
, where the camera focal length is denoted as f and the camera center is shown as
[c_x, c_y]. By substituting K from equation 2.6 into equation 2.4, the camera center
and the focal length can be estimated as follows:
\[
A \, [\, c_x \;\; c_y \,]^T = b
\tag{2.7}
\]
\[
[\, c_x \;\; c_y \,]^T = (A^T A)^{-1} A^T b
\tag{2.8}
\]
\[
A =
\begin{bmatrix}
(p^3_x - p^2_x) & (p^3_y - p^2_y) \\
(p^3_x - p^1_x) & (p^3_y - p^1_y) \\
(p^2_x - p^1_x) & (p^2_y - p^1_y)
\end{bmatrix}
\tag{2.9}
\]
\[
b =
\begin{bmatrix}
p^1_x (p^3_x - p^2_x) + p^1_y (p^3_y - p^2_y) \\
p^2_x (p^3_x - p^1_x) + p^2_y (p^3_y - p^1_y) \\
p^3_x (p^2_x - p^1_x) + p^3_y (p^2_y - p^1_y)
\end{bmatrix}
\tag{2.10}
\]
\[
f = \sqrt{ (p^1_x - c_x)(p^2_x - c_x) + (p^1_y - c_y)(p^2_y - c_y) }
\tag{2.11}
\]
Equation 2.3 with the known K from 2.6 can be re-written. The only unknowns
in equation 2.12 are the projective scaling parameters λ_i.
\[
R =
\begin{bmatrix}
\lambda_1 (p^1_x - c_x)/f & \lambda_2 (p^2_x - c_x)/f & \lambda_3 (p^3_x - c_x)/f \\
\lambda_1 (p^1_y - c_y)/f & \lambda_2 (p^2_y - c_y)/f & \lambda_3 (p^3_y - c_y)/f \\
\lambda_1 & \lambda_2 & \lambda_3
\end{bmatrix}
\tag{2.12}
\]
Using the orthogonality property of the rotation matrix we can estimate the λ_i as in
equations 2.13-2.15.
\[
\lambda_1 = \sqrt{ \frac{ (c_y - p^3_y)(p^2_x - p^3_x) - (c_x - p^3_x)(p^2_y - p^3_y) }{ (p^2_x - p^3_x)(p^1_y - p^3_y) - (p^2_y - p^3_y)(p^1_x - p^3_x) } }
\tag{2.13}
\]
\[
\lambda_2 = \sqrt{ \frac{ (c_y - p^3_y)(p^1_x - p^3_x) - (c_x - p^3_x)(p^1_y - p^3_y) }{ (p^1_y - p^3_y)(p^2_x - p^3_x) - (p^2_y - p^3_y)(p^1_x - p^3_x) } }
\tag{2.14}
\]
\[
\lambda_3 = \sqrt{ \frac{ (c_y - p^1_y)(p^2_x - p^1_x) - (c_x - p^1_x)(p^2_y - p^1_y) }{ (p^2_x - p^3_x)(p^1_y - p^3_y) - (p^2_y - p^3_y)(p^1_x - p^3_x) } }
\tag{2.15}
\]
In a single view with no prior knowledge of the scene, the translation vector is under-
determined. The arbitrary scaling value can be chosen freely and in our imple-
mentation is chosen as λ_4 = 1. Using the user-given world center [u_x, u_y], we can
estimate the translation vector T as follows:
\[
[\, u_x \;\; u_y \;\; 1 \,]^T \propto P \, [\, 0 \;\; 0 \;\; 0 \;\; 1 \,]^T = K [\, R \;\; T \,] [\, 0 \;\; 0 \;\; 0 \;\; 1 \,]^T = K T
\tag{2.16}
\]
\[
T \propto K^{-1} [\, u_x \;\; u_y \;\; 1 \,]^T
\tag{2.17}
\]
At this point, we have estimated a projection matrix that maps the 3D world
to the 2D image. The scaling of our generic model is still unknown. For example, our
3D model can range from -100 to 100 or from -1 to 1; since there are no units, the scaling
information is required. We need 4 known points on the example 3D model and 4
user-labeled 2D image points to estimate this matrix. The four points correspond to
the X-Y-Z scale and the center of the model. For example, the tires are used for estimating
Figure 2.9: Projection of the generic model after calibration is shown in (a). The
3D model after the modeling matches the image since the image is used as a guide
(b).
the scale matrix for vehicles. The selection of the 3D points needs to be done
(once) during the introduction of the additional object class to the modeler, i.e.
during pre-processing. During modeling, user clicks a total of 16 image points for
the camera calibration (12 for projection matrix and 4 for scaling matrix).
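The projection part of this calibration can be sketched numerically. The following Python fragment is an illustration written against equations 2.7-2.15, not the thesis implementation; the two-segments-per-axis interface, the function names and the abs() guards are assumptions made here, and the scaling-matrix estimation from the 4 anchor points is not covered.

import numpy as np

def vanishing_point(seg_a, seg_b):
    """Intersect two image line segments, each given as ((x1, y1), (x2, y2))."""
    def line(seg):
        p, q = np.array([*seg[0], 1.0]), np.array([*seg[1], 1.0])
        return np.cross(p, q)               # homogeneous line through p and q
    v = np.cross(line(seg_a), line(seg_b))  # homogeneous intersection point
    return v[:2] / v[2]

def calibrate(p1, p2, p3):
    """p1, p2, p3: vanishing points of the X, Y and Z parallel-line pairs."""
    # Camera center from the least-squares system of equations 2.7-2.10.
    A = np.array([[p3[0]-p2[0], p3[1]-p2[1]],
                  [p3[0]-p1[0], p3[1]-p1[1]],
                  [p2[0]-p1[0], p2[1]-p1[1]]])
    b = np.array([p1[0]*(p3[0]-p2[0]) + p1[1]*(p3[1]-p2[1]),
                  p2[0]*(p3[0]-p1[0]) + p2[1]*(p3[1]-p1[1]),
                  p3[0]*(p2[0]-p1[0]) + p3[1]*(p2[1]-p1[1])])
    cx, cy = np.linalg.lstsq(A, b, rcond=None)[0]
    # Focal length, equation 2.11 (abs() is our own guard against a sign flip).
    f = np.sqrt(abs((p1[0]-cx)*(p2[0]-cx) + (p1[1]-cy)*(p2[1]-cy)))
    K = np.array([[f, 0, cx], [0, f, cy], [0, 0, 1.0]])
    # Projective scalings lambda_i (equations 2.13-2.15); the three
    # expressions share the same denominator.
    den = (p2[0]-p3[0])*(p1[1]-p3[1]) - (p2[1]-p3[1])*(p1[0]-p3[0])
    l1 = np.sqrt(abs(((cy-p3[1])*(p2[0]-p3[0]) - (cx-p3[0])*(p2[1]-p3[1])) / den))
    l2 = np.sqrt(abs(((cy-p3[1])*(p1[0]-p3[0]) - (cx-p3[0])*(p1[1]-p3[1])) / den))
    l3 = np.sqrt(abs(((cy-p1[1])*(p2[0]-p1[0]) - (cx-p1[0])*(p2[1]-p1[1])) / den))
    # Rotation matrix, equation 2.12.
    R = np.array([[l1*(p1[0]-cx)/f, l2*(p2[0]-cx)/f, l3*(p3[0]-cx)/f],
                  [l1*(p1[1]-cy)/f, l2*(p2[1]-cy)/f, l3*(p3[1]-cy)/f],
                  [l1, l2, l3]])
    return K, R

Once K is known, the translation follows from equations 2.16-2.17 using the clicked world center.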
A sample result of the calibration is shown in figure 2.9 (a). As can be seen, the generic
model does not match the input image, and the user can easily identify which parts
need correction. After the modeling, the 3D model matches the image, as
shown in figure 2.9 (b).
2.6 2D to 3D Space Mapping
An important part of any modeler design is the mapping of user inputs into final
modeling inputs. Our goal is to create an intuitive and easy-to-use interface for the
user. Several systems have inspired our approach. Google’s Sketch-Up program
allows the user to create and modify (often simple) 3D models [55]. Specifically, their
automatic direction selection and extrusion capabilities are of interest to us. Sketch-
Up, in its default mode, allows movement only in a set of directions and highlights the
probable (suggested) direction during the user's mouse drag. They also allow the user to
draw and extrude parts on other parts. For example, once the wall of a building is
created, user is allowed to draw a window and pull that window in/out. Extrusion
direction is only in the direction of the surface normal. Another inspiration to our
modeler comes from architectural structure modeling. These systems constrain
the user input by the implicit model of a building (parallel or orthogonal lines).
The program tries to satisfy the user input without violating any of the constraints
(user inputs are not totally satisfied). This is often preferred since the coherency
of the model is kept intact. In our informal user studies, we find out that the
novice users feel more comfortable with interactions made in 2D rather than 3D.
Furthermore, humans still recognize and see 3D structures while looking at a 2D
image. Thus, the 2D vectors in an image are often perceived as 3D vectors by the
user (same cannot be said for computers). The advantage of our framework is that
the 3D surface information is coupled with the 2D image with the use of example
3D model overlay (see figure 2.16 b,2.17 b and 2.18b).
In our framework, user inputs are mouse drags over the image. User first clicks
on the image, to select a part and then drags it to move, see figure 2.10. This 2D
image vector v_2D needs to be converted into a 3D movement V_3D in order to make
Figure 2.10: Projection causes ambiguity. Given 2D vector, there are infinitely
many possible 3D vectors.
modifications to the example 3D model. The starting point of the 2D vector p is
back-projected onto the example 3D model; this gives a unique 3D point P for
the starting point of the movement. However, the end point of this 3D vector P' can
not be directly estimated, i.e. there are infinitely many 3D vectors that satisfy this
2D vector even with a known starting point.
Using a technique similar to Sketch-Up, extra logical constraints are added. The
main assumption made in our system is the approximation of the projective scaling
via linear scaling. In other words, we assume that scaling a vector in 3D causes
the same scaling in 2D, see inequalities 2.18 and 2.19. This is a false assumption in
general, however under certain conditions this gives a good approximation. In our
case since the parts are assumed to be connected and bounded (the parts often do
not span a large area), the projective division causes only minor differences on the
different points in the part, i.e. the projective effects are approximately same for
the points belonging to the same part.
\[
V_i \leftrightarrow v_i
\tag{2.18}
\]
\[
c \cdot V_i \leftrightarrow c \cdot v_i
\tag{2.19}
\]
Combining the estimation of the starting point P and 2.19, a 3D vector can be
estimated in 2D. There are two different constraint modes: smart-direction mode
and normal mode. Smart-direction chooses the best axis that aligns with the user
vector (the projections v_i of the 3D major axes are compared to the user vector v_2D). Since
a single direction is chosen, the user input is not totally satisfied, see projective
scaling approximation formula 2.21.
\[
\max_{i \in \{x,y,z\}} \left| v_{2D} \cdot v_i \right|
\tag{2.20}
\]
\[
V = V_i \, \frac{v_{2D} \cdot v_i}{\| v_{2D} \| \, \| v_i \|}
\tag{2.21}
\]
In our experiments users rarely attempt to move in arbitrary off axis directions
(visualizing off axis movements from 2D is hard). However, there is one off axis
input often given to the modeler and that is the normal of the surface. For example,
windshields are slanted and they are allowed to move in the normal direction. This
option is handled by our second mode: normal mode. The final 3D vector (V)
is composed of V_N and V_R. First the amount of movement in the direction of N
(the part’s normal) is estimated using formula 2.23, where n is the projection of N.
Contrary to the smart-direction mode, the residual of the movement V_R is satisfied
as a combination of the three major axes V_i, see formula 2.24. In particular least-
squares estimation is performed to get the smallest 3D movement, see formula
2.25. Thus, in this mode, user input is satisfied 100%. All 2D user inputs are
automatically converted into 3D movements of a selected part by either one of
these modes.
\[
V_{3D} = V_N + V_R
\tag{2.22}
\]
\[
V_N = N \, \frac{v_{2D} \cdot n}{\| n \| \, \| v_{2D} \|}
\tag{2.23}
\]
\[
V_R = \sum_{i \in \{x,y,z\}} w_i \, V_i
\tag{2.24}
\]
\[
v_R = \min_{\| w \|} \, \Big\| \, v_{2D} - \sum_{i \in \{x,y,z\}} w_i \, v_i \, \Big\|
\tag{2.25}
\]
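A compact numerical sketch of the two mapping modes is given below (an assumed illustration, not the thesis code). The 3D major axes V_i, their image projections v_i, the part normal N and its projection n are taken as given; the exact definition of the 2D residual used in normal mode is our own reading of equations 2.24-2.25.

import numpy as np

def smart_direction(v2d, axes3d, axes2d):
    """Pick the 3D axis whose projection best aligns with the drag (eq. 2.20),
    scaled by the normalized agreement (eq. 2.21)."""
    i = max(range(3), key=lambda k: abs(np.dot(v2d, axes2d[k])))
    scale = np.dot(v2d, axes2d[i]) / (np.linalg.norm(v2d) * np.linalg.norm(axes2d[i]))
    return scale * axes3d[i]

def normal_mode(v2d, axes3d, axes2d, N, n):
    """Movement along the surface normal (eq. 2.23) plus a residual expressed
    in the three major axes (eqs. 2.24-2.25)."""
    vN = N * np.dot(v2d, n) / (np.linalg.norm(n) * np.linalg.norm(v2d))
    # One reasonable choice for the 2D residual: the drag minus its
    # component along n (the thesis does not spell this step out).
    residual = v2d - n * np.dot(v2d, n) / np.dot(n, n)
    # Under-determined least squares: lstsq returns the minimum-norm weights
    # w_i reproducing the residual, matching the "smallest 3D movement" idea.
    A = np.stack(axes2d, axis=1)                          # 2 x 3
    w, *_ = np.linalg.lstsq(A, residual, rcond=None)
    vR = sum(w[k] * axes3d[k] for k in range(3))
    return vN + vR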
Man-made objects are often composed of related/symmetric parts, e.g. cushions
of a couch. The user often desires to perform the same action to multiple parts.
Applying same action to multiple parts at the same time not only speeds up the
process but also keeps the model coherent and aligned. Thus the user is allowed
to select multiple parts by holding down the ctrl key and clicking on the parts in
the image. With every additional part addition (or deletion), the enclosing box of the
parts is estimated. This enclosing box is later used for the direction selection process in
Figure 2.11: Estimation of the 3D vector V in normal-mode
the smart-direction mode. Multiple part selection is simply handled by creating a
virtual part that is the union of all selected parts.
2.7 3D Modeling
In our framework, the user inputs are applied not only to the selected parts but
also automatically distributed to the whole model. The main shortcoming of many
alternative modelers is that the user needs to enter many inputs to do a simple
global change. For example, in the 3D software package Maya, to move the bumper of
a car using such modelers, the user needs to highlight all the vertices of the
part first, then enter the appropriate movement and interpolate all the in-between
points manually. In our framework, this can be performed by a single mouse drag
since part-level modifications, rather than vertex-level modifications, are used to
ease and speed-up the modeling procedure.
2.7.1 Applying Modifications
Modifying parts independently is not only slow but causes inconsistencies (artifacts)
in the model. Thus, we choose to distribute the input given to a part to the whole
model by analyzing the connectivity of parts. A pseudo-connectivity graph of parts
is estimated at start-up (see the definition of pseudo-connectivity in section 2.4).
In this graph representation, each part is a node in the graph and the edges denote
the connectivity of parts. Given the source parts (multiple parts can be selected),
graph nodes are labeled as source, neighbors or sinks (non-neighbors). The goal
of our modeler is to estimate a movement value for every vertex in the model
using the graph as a guide. As the first step, the action (translate, resize etc.) is applied
to all source parts (user-selected parts) with the estimated 3D movement.
Second, all sink parts (non-neighbors to the source parts) are kept still, i.e. all
vertex movements in these parts are set to zero. Lastly, for all the neighbor parts,
vertex movements are estimated using scattered data interpolation (SDI).
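A possible sketch of this labeling step over the pseudo-connectivity graph (assumed code, not the thesis implementation); neighbors are deferred to the scattered data interpolation described next:

def label_parts(graph, sources):
    """graph: dict part -> set of pseudo-connected neighbor parts.
    sources: set of user-selected parts receiving the full 3D movement."""
    labels = {}
    for part in graph:
        if part in sources:
            labels[part] = "source"
        elif any(s in graph[part] for s in sources):
            labels[part] = "neighbor"      # interpolated via SDI
        else:
            labels[part] = "sink"          # kept still (zero movement)
    return labels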
The user can choose between mainly two different SDI techniques: thin-plate
spline and linear triangular SDI. SDI in our context: the set of 3D points X_1, ..., X_n
with known movement values V_1, ..., V_n (where V_i is a 3D vector) are going to be used
to estimate the movement values W_1, ..., W_m of points Y_1, ..., Y_m. There is no absolute or
relative restriction on the values of n and m (they can "independently" be any non-zero
number).
Our first method, linear triangular interpolation, is a 2D method, thus we first
project all the points to a 2D plane. Although this is an approximation, 2D in-
terpolation is computationally more efficient. The user should choose this option
if parts are close to planar (e.g. our generic vehicle model). Given a part, 2D
projection plane is estimated by the average surface normal. This is an average of
each triangle normal weighted by their corresponding areas. Once the 2D plane is
estimated and points are projected onto this surface, our problem becomes similar
to the problem of color interpolation on a plane (i.e. estimating the color of each
point on a plane from few known color points). First, a Delaunay triangulation
of the points with known movement values is performed. Later, every point with-
out a movement value is assigned to a triangle by in-triangle test (the assignment
is unique). The unknown movement value is estimated using barycentric interpo-
lation, see equation 2.26. If the triangle corners (x_i, y_i) have values \bar{v}_i, then the unknown value \bar{w} for the point (p_x, p_y) can be estimated as follows:

\bar{w} = a_1 \bar{v}_1 + a_2 \bar{v}_2 + a_3 \bar{v}_3    (2.26)

\begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix} =
\begin{bmatrix} x_1 & x_2 & x_3 \\ y_1 & y_2 & y_3 \\ 1 & 1 & 1 \end{bmatrix}^{-1}
\begin{bmatrix} p_x \\ p_y \\ 1 \end{bmatrix}    (2.27)
However, not all unknown points are enclosed in the convex hull of the known points, i.e. some unknown points might lie in none of the triangles. In this case, the movement value is estimated by projecting the point onto the convex hull and then performing linear triangular interpolation. Estimation of the 2D plane is O(N), projection is O(N), triangulation is O(N log N), the in-triangle tests are O(N), barycentric interpolation is O(1); the overall algorithm is O(N log N). Linear triangulation also has a nice locality property, in contrast to the thin-plate spline.
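To make the linear triangular interpolation concrete, the following is a minimal Python sketch (using NumPy and SciPy); the function name, the use of scipy.spatial.Delaunay, and the nearest-known-point fallback for points outside the convex hull are illustrative assumptions rather than the exact implementation of our modeler.

import numpy as np
from scipy.spatial import Delaunay

def linear_triangular_sdi(known_xy, known_vals, query_xy):
    """Interpolate 3D movement vectors on a 2D projection plane.

    known_xy:   (n, 2) projected locations of points with known movements
    known_vals: (n, 3) movement vectors at those points
    query_xy:   (m, 2) projected locations whose movements are estimated
    """
    tri = Delaunay(known_xy)
    simplex = tri.find_simplex(query_xy)            # -1 if outside the convex hull
    out = np.zeros((len(query_xy), known_vals.shape[1]))

    inside = simplex >= 0
    # Barycentric interpolation (equations 2.26/2.27) for enclosed points.
    trans = tri.transform[simplex[inside]]          # per-triangle affine maps
    bary2 = np.einsum('nij,nj->ni', trans[:, :2, :],
                      query_xy[inside] - trans[:, 2, :])
    bary = np.hstack([bary2, 1.0 - bary2.sum(axis=1, keepdims=True)])
    verts = tri.simplices[simplex[inside]]
    out[inside] = np.einsum('ni,nij->nj', bary, known_vals[verts])

    # Points outside the convex hull: fall back to the nearest known point,
    # standing in for the "project onto the convex hull" step in the text.
    if (~inside).any():
        d = np.linalg.norm(query_xy[~inside][:, None, :] - known_xy[None], axis=2)
        out[~inside] = known_vals[d.argmin(axis=1)]
    return out

Here known_xy and known_vals would come from the vertices whose movements are already fixed (source and sink parts), projected onto the part's 2D plane as described above.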
The thin-plate spline is a global and generic SDI technique that uses radial basis functions (RBF). The thin-plate spline (TPS) minimizes the integrated squared second derivative of a function, which is an approximation of the curvature. The distance measure in TPS is shown in 2.28 (it is a radial distance measure). X is the N known points with values v; the estimation parameters, the weights a and constant scaling c, can be solved for with 2.29. The matrix Φ is composed of the distance measures of the known points to each other; the diagonal is filled with the weighting parameter λ (without λ, the diagonal would have been all zeros). Parameter λ controls how far the influence spreads; however, in our experiments this value showed little difference in the final estimation and is fixed to 0.1.
φ(r) = r^2 \log(r)    (2.28)

\begin{bmatrix} Φ & [1\,X] \\ [1\,X]^T & 0 \end{bmatrix}
\begin{bmatrix} a \\ c \end{bmatrix} =
\begin{bmatrix} v \\ 0 \end{bmatrix}    (2.29)

Φ = \begin{bmatrix}
λ & φ(‖X_1 − X_2‖) & \cdots & φ(‖X_1 − X_n‖) \\
φ(‖X_2 − X_1‖) & λ & \cdots & φ(‖X_2 − X_n‖) \\
\vdots & \vdots & \ddots & \vdots \\
φ(‖X_n − X_1‖) & \cdots & \cdots & λ
\end{bmatrix}    (2.30)
By solving 2.29, the effect of each point is estimated. As can be seen from the formalization of the TPS, the influence of points extends to infinity (though it decreases with distance), unlike the linear triangulation method. Solving for the TPS parameters requires a least-squares solution of an (n+4)×(n+4) system, which can be done in O(N^3) (for arbitrary matrices). Thus this method is computationally more complex than the previous method. On the other hand, estimates for points outside the convex hull are often more accurate. Once the TPS parameters are estimated, given an unknown point Y, the value W at this point can be estimated using 2.31.
W = \sum_{k=1}^{N} a_k φ(‖Y − X_k‖) + c_1 + c_x Y_x + c_y Y_y + c_z Y_z    (2.31)
Although the SDI techniques differ, they have numerous common behaviors. For example, interpolation values are more accurate when points with known values are plentiful (the percentage of known values is equal to or greater than that of unknown points) and well distributed (known points encapsulate the unknown points). In general there are multiple parts labeled as neighbors, and an order among them needs to be chosen in a way that satisfies a good and abundant distribution of points. However, this should be transparent to the user, since estimating an optimal order is quite complex. Instead, our algorithm orders the changes with the following method. For every neighbor part, the percentage of known points relative to the total number of points of that part is calculated. The part with the highest percentage is interpolated first, and this procedure is repeated until no part is left. This can be viewed as performing the best possible interpolation first, thus maximizing the overall performance and accuracy of the interpolation.
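This greedy ordering can be sketched as follows (Python); the data structures are hypothetical and chosen only for illustration.

def interpolation_order(neighbor_parts, known):
    """Order neighbor parts by their fraction of already-known vertices.

    neighbor_parts: dict part_id -> set of vertex ids
    known:          set of vertex ids whose movement is already fixed"""
    order, remaining = [], dict(neighbor_parts)
    while remaining:
        # Pick the part with the highest percentage of known points.
        best = max(remaining,
                   key=lambda p: len(remaining[p] & known) / len(remaining[p]))
        order.append(best)
        known = known | remaining.pop(best)   # its vertices become known for later parts
    return order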
The above procedure is performed if the connectivity option is selected. The user is allowed to disable the connectivity of parts; this is often desired if a change to a part should not be distributed. The downside of disabling connectivity is that parts can pierce (penetrate) each other, since there is no interaction between them. In order to prevent this, a collision detection test is performed while the movements are applied to the vertices. Collision detection limits the amount a part can move in a certain direction without penetrating another part. From the user's point of view, this option often generates a snap-on effect. Enforcing this limitation becomes a crucial piece of modifications under the no-connectivity mode, since part penetrations are often hard to see (realize) from the projection of the 3D model on the image and are rarely desired.
2.7.2 Modification Modes
Estimation of the 3D movement from the 2D user input is explained in section 2.6
and distribution of this input to the rest of the parts is presented in the previous
subsection. However, the user movements can be interpreted differently depend-
ing on the modification modes. In our framework, three alternative modifications
are possible: part translation, boundary translation and resizing. Although other modifications, such as twist, skew, etc., can be added, we limit the system to three modes only in order to keep the interface simple. The modification mode not only changes the 3D movement applied to the selected part but also affects the distribution of this movement to other parts.
In part translation mode, the selected part(s) are moved by the estimated 3D movement, figure 2.12 (a,b). In our experiments, three quarters of all the movements were made in this mode. Boundary translation first estimates the neighboring points between the selected parts, figure 2.12 (c), then applies the estimated movement, figure 2.12 (d). This option is useful in changing the slope of certain parts. Lastly, resizing is often required to make relative size changes of certain parts. In all three modes, once the movement is applied to the desired parts or points, the proper distribution is achieved using the technique in the previous subsection.
Figure 2.12: Modification by whole part vs. part boundary. Red = user controlled, Green = algorithm controlled, Gray = do not move. (a) Before whole part modification (b) After whole part modification (c) Before part boundary modification (d) After part boundary modification
As can be seen in figure 2.12, a single user input can drastically change the 3D model while still keeping it smooth and artifact-free. The user-controlled parts/points are shown in red, the neighboring points that are estimated automatically by our algorithm are shown in green, and the parts that are not close (not neighbors) are shown in gray.
Figure 2.13: Sample texture synthesis inputs
2.8 Texture Mapping and Visualization
2.8.1 Texture Synthesis
In our framework, only a single image is used, thus a full texture capture is impossible. Part textures that are not visible from the camera need to be "synthesized". Texture synthesis can be performed by fitting a parametric representation of the texture and later synthesizing a new texture by manipulating this representation [41]. Non-parametric sampling [11,63] and tree-structured vector quantization [58] methods are also used to perform texture synthesis. Although these techniques are able to create seamless textures, the input textures are often repetitive and/or synthetic, see figure 2.13. Therefore these techniques have little promise in quality texture synthesis for structured but non-repetitive real world objects, such as vehicles, couches, humans, etc.
Synthesis techniques often require repetition of simple artificial patterns. However, real life images often have large scale repetitive patches. One alternative is directly copying and pasting these repetitive patches [12,18,27].
Figure 2.14: Sample texture copy inputs
Kwatra proposed an example-based method to fill in the holes in the image [27]. Hu combined a system to first undo the effects of projective scaling and then copy the patches from this non-perspective image [18]. These techniques generally can recover complicated scenes that have many human-made objects, see figure 2.14. On the other hand, these techniques are typically interactive.
2.8.2 Smart-Copy
Our texture estimation approach is similar to texture copy techniques and com-
bines the visibility information, symmetry of the object and user defined part spec-
ifications. The texture decision for the 3D model is made at the triangle level, thus an automatic algorithm is preferred. The texture of a triangle can be estimated from three different sources: directly from the image projection, from symmetry, or from the
estimated color of the object.
Our procedure starts by projecting the example model onto the image as a
wireframe using the calibration results from section 2.5. In order to reduce clutter,
hidden-line removal is performed. During the pre-processing of the example model,
the user is allowed to specify further parameters such as symmetry and color source parts. Color source parts are parts where a mean color of the object can be estimated. For example, doors and fenders for vehicles and seat cushions for couches are used as color source parts. If such parts are specified by the user, the mean color of these parts is used to estimate the main body color of the object. If no such part is specified, the mean color of all visible parts is used for the body color. The body color is used to fill the gaps in the texture map. The color estimation is performed in three steps. First, the parts that are more than 80% visible are selected as candidate parts (80% is chosen empirically). If the user specified color source parts, an additional elimination is performed. Secondly, the average color of all the parts that passed the elimination from step 1 is estimated. The averaging is performed in the image domain rather than in 3D for simplicity (although estimation in 3D is more accurate, this is often unnecessary). The same color value is perceived differently in images due to surface properties, viewing angle and self shadows (the objects are non-Lambertian). For each part, the observed color value O is assumed to follow the simplified reflectance model in formula 2.32.
O = s C cos(θ)    (2.32)

where s is the shadow scaling, C is the actual color, and θ is the angle between the camera and the surface normal. In this model, shadows are assumed to have a linear scaling effect and the surface has no specular component. For every part, the average observed color is divided by cos(θ) to remove the effect of the viewing angle. To eliminate the shadow factor, a line fitting is performed over the part color estimates. Once a line is fitted, the middle of the line is chosen as the main color. In our experiments the body color is often estimated correctly without any user interaction.
Figure 2.15: (a) Input image. Textured 3D model with (b) projective texturing, (c) projective texturing with symmetry, (d) Smart Copy
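For illustration, the three color-estimation steps above can be sketched in Python as follows; the visibility-based candidate selection is assumed to have been done already, and the use of an SVD-based total-least-squares fit is only one possible realization of the line fitting mentioned in the text.

import numpy as np

def estimate_body_color(part_colors, part_cos_theta):
    """Estimate the main body color from per-part average colors.

    part_colors:    (k, 3) average observed RGB color of each candidate part
    part_cos_theta: (k,) cosine of the angle between the view direction and the
                    part's average surface normal"""
    # Undo the viewing-angle factor of the reflectance model O = s * C * cos(theta).
    corrected = part_colors / np.clip(part_cos_theta, 1e-3, None)[:, None]

    # The corrected colors s*C differ mainly by the shadow scaling s, so they lie
    # approximately on a line in RGB space; fit that line with an SVD.
    mean = corrected.mean(axis=0)
    _, _, vt = np.linalg.svd(corrected - mean)
    direction = vt[0]

    # Take the middle of the fitted segment as the body color.
    t = (corrected - mean) @ direction
    return mean + 0.5 * (t.min() + t.max()) * direction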
The second source of texture is texture copy using symmetry. Man-made objects tend to be symmetric in at least one axis, e.g. the left and right sides of a vehicle are symmetric. The user can specify such symmetry by selecting one of the axes (during pre-processing). Selecting a symmetry (when one exists) allows the parts that are not visible to have textures. These textures, although not the real textures of these parts (the real textures are not visible), create a coherent visualization of the 3D model. Projective texturing is often performed to texture 3D models when only visibility information is available [8,19]. Using projective texturing of figure 2.15 (a), the 3D model can be textured, see figure 2.15 (b). The right side of the vehicle was not visible in the photo and is thus textured with the estimated body color. A straightforward extension to projective texturing is the addition of the symmetry information. If a point in the 3D model is not visible but its symmetric counterpart is visible, then the texture information is copied. Figure 2.15 (c) is the result of this projective texturing with symmetry method. However, this method often causes artifacts at the boundary of the texture decisions, see the right back corner of the vehicle in figure 2.15 (c). Our smart copy texturing combines visibility, symmetry and distance to the camera to estimate a coherent texture map.
Smart copy texturing is a multi-pass algorithm. In the first pass, all visible parts
are textured using projective texturing. In the second pass, all the unassigned
(invisible) points are textured using projective texturing with symmetry. In the
third pass, all textured points from the first step are compared to their symmetric counterparts in terms of their distance to the camera. This step simply assumes that if a point is closer to the camera, then its texture is more reliable. Lastly, all the points without a texture are assigned the estimated body color of the object. As can be seen from figure 2.15(d), smart copy texturing creates a seamless transition of the texture and a coherent visualization of the 3D model.

                     # parts   # triangles   # points
Vehicle                 36        14380         7608
House                    6         1072          734
House with roof          9         1496          929
Couch with 1 seat        5         3520         1770
Couch with 2 seats       6         4224         2124
Couch with 3 seats       7         4928         2478
Table 2.1: Number of parts, triangles and points for different object classes

           Number of inputs             Time
           Move   Resize   Boundary
SUV         11      1         2          3
Accord       6      0         2          2
House1       3      0         0         <1
House2       3      0         0         <1
Couch1       1      4         0          2
Couch2       1      5         0          2
Table 2.2: Number and type of user interactions (number of clicks, not counting camera calibration) and total execution time (in minutes) for the examples shown in figures 2.16, 2.17, 2.18, 2.19 and 2.20
2.9 Results and Discussion
We focus on the modeling results of three major object classes (although other
classes are possible, we limit our attention to these for analysis purposes). These classes are buildings, couches and vehicles.
Figure 2.16: A Honda Accord is modeled from a single image. The input image (a), final 3D model overlay (b), textured novel view renderings (c,d,e) and 3D mesh of the resulting 3D model (f).
Buildings are chosen due to their emblematic part structures and high utility in many practical applications. Couches
have medium complexity and many examples of them can be found in every house.
Vehicles, in comparison to other object types, are complex and diverse in terms of
their models. The number of parts, points and triangles per object class is shown in Table 2.1. One might notice that the number of triangles seems far too high for houses and couches. We applied an iterative triangle subdivision technique to increase
the texture rendering quality. Increasing the number of triangles of a part does not
increase the complexity of the modeler since we perform part-level modifications.
Due to the alteration rather than creation framework, all created models have the
same number of parts, points and triangles as the example model of that class.
The number and type of user interaction and the total execution time for the
examples we present in this chapter are shown in table 2.2. In general, total exe-
cution time of our modeler depends on the similarity of the example model to the
instance in the picture and the number of parts of the model (moving fewer parts
requires less time but creates less detailed models). All experiments were performed on a desktop computer with a 3 GHz CPU and 1 GB RAM. Each user interaction is processed in under a second (on average): 3D model modification, texture estimation and rendering take around 500 ms, 250 ms and 50 ms respectively. Thus,
our system runs at interactive rates. The choice of the modification modes, see
section 2.7.2, often depends on the object class. For example, vehicles and buildings were often modeled by moving parts, whereas for the class of couches resizing actions were generally taken, see table 2.2. Each object class has its own characteristic modifications, which depend on how the manufacturer creates the different instances.
The example model for the vehicles was a Honda Civic model (all example models were acquired from free internet sites, where many are available). An old model
Honda Accord and a new model Chevy SUV are modeled using our modeler, see
figure 2.16 and 2.17 respectively. The resulting 3D models are quite different from
each other and the example model. Mostly artifact free realistic novel views can
be rendered, see figure 2.16 and 2.17 (c-e). As can be seen from the vehicle exam-
ples, images do not have to be taken from a predetermined (or limited) direction.
The arbitrary location of the 3D model is handled by the camera calibration step,
section 2.5. Our modeler allows visualizations without a texture similar to Maya
and 3D Studio Max, figure 2.16 and 2.17 (f). In addition, a complete texture is
automatically extracted from the image, thus creating novel realistic-looking 3D
renderings of the objects is possible.
The main limitation of our framework comes from the assumption of hierarchy
similarity, i.e. the object class hierarchy is fixed. For example, couches with different numbers of cushions are represented with different example models, see table 2.1.
One way to mitigate this problem is to keep only the couch with three seats, remove the other seats via the "select and delete" option, and resize the remaining seat(s). This might be a good option for couches; however, other object classes might not have such a deducible hierarchy. As mentioned in section 2.2, our system can benefit from a framework that selects and fuses parts from different objects, such as [16], and modifies them later with our framework.
Figure 2.17: An SUV is modeled from a single image. The input image (a), final 3D model overlay (b), textured novel view renderings (c,d,e) and 3D mesh of the resulting 3D model (f).
Figure 2.18: Couches with two and three cushions are modeled. (a) and (e) are input images, (b) is the final 2D display of the modeler, (c)(d)(f)(g)(h) are the 3D renderings of the final result.
Figure 2.19: Input images and resulting 3D renderings of a building on USC Campus
Couches often differ in the relative sizes of the parts, i.e. parts are in a similar
place with a different size. Thus, the majority of the user inputs were in resize mode, see table 2.2. Example couch models and their inputs can be found in figure 2.18. The example of the couch with two cushions is taken from an Internet furniture catalog. Buildings, on the other hand, are often created by moving the faces of the building (each face is a part), see figures 2.19 and 2.20. The surfaces of the building are flat; however, the texture from the image creates the illusion of a more detailed 3D model (as opposed to a color-only representation).
Figure 2.20: Input images and resulting 3D renderings of a building on a street
One important question arises from the repeatability (consistency) of the pro-
cess. In order to test this, we took pictures of the same vehicle from two different
views and performed the modeling independently. We compared the resulting 3D models
with absolute and bounding box percentage error measures (absolute error divided
by the diagonal distance), see table 2.3. The RMS error is less than 1% and the
max error is 3.8% (at the rear bumper).

        Absolute     BBOX %
Min     0            0
Max     0.0816677    3.78909
Mean    0.0106783    0.495435
RMS     0.0206283    0.957081
Table 2.3: Absolute and bounding-box percentage errors between the two models of the same vehicle

The error is small enough to be ignored
but numerical comparison does not always guarantee visual quality, e.g. artifacts
often cause small but quite visible errors. We compute the distance of each vertex
to the other model and the error is used as the color of the vertex, see fig 2.21.
As expected, the differences come from the parts that were not visible from the
other view. For example, front of the vehicle is different in the models since one
picture sees the front and the other does not. This suggests that if multiple views
are available, one should mainly model the parts that are visible from that view. However, parts seen in both images are estimated quite similarly, see figure 2.21.
2.10 Conclusion and Future Work
In this chapter, we have presented a novel modeling framework to create 3D mod-
els by modifying an example model. Alteration of 3D models rather than creating
them from scratch is shown to be a faster way to acquire 3D models. Our frame-
work requires little pre-processing of the example models. Furthermore, arbitrary
objects can be modeled from single un-calibrated images. Our user interface is
completely in 2D for user friendliness; however these 2D user inputs are automat-
ically converted into appropriate 3D modifications by our modeler. There are two underlying assumptions made: man-made objects are composed of parts, and the same object class often has the same hierarchy of parts. Many common objects such as buildings, furniture and vehicles are modeled by our technique from a single picture.
Figure 2.21: Distance between two models of the same vehicle from two different views, blue is low and red is high error. Errors are concentrated on the parts that are invisible to the other view.
Although 3D modeling is possible from a single view, extracting a complete texture from a single view is often impossible. One option is analyzing and creating the texture via copy or synthesis; however, texture synthesis restricts the type of objects, and often the user wants a real duplicate rather than a realistic-looking one. Another method is allowing the user to add extra images and seamlessly combining these textures to create a complete texture map. This second option is being investigated as a part of our future work.
Chapter 3
Application I: Model-Driven Video Based
Rendering
Video-based rendering (VBR) is a logical extension of image-based rendering (IBR). Especially in the area of digital cinematography, VBR is used to create photorealistic renderings of the real world from arbitrary viewing angles. Although the realism of computer generated renderings is getting better every day, VBR techniques are still a crucial alternative for creating photorealistic renderings. VBR allows fast creation of content (tens of minutes) and novel view rendering is close to real-time. However, the camera set-up limits the novel shots that can be created. On the other hand, computer generated worlds have unlimited flexibility in the choice of novel views. These techniques, however, suffer from a long creation stage and final renderings often take a relatively long time to create.
Figure 3.1: Screen shots of ”bullet-time effect” from the major motion picture
”The Matrix”. The first popular use of video-based rendering in the movies.
VBR is not a new topic, and there have been numerous VBR systems of interest over the last decade. Since the introduction of the "bullet time effect" in the major motion-picture "The Matrix", see figure 3.1, IBR has been an important research area.
Another successful use of VBR was made in the Superbowl XXXV. The Eye
Vision system created close to real-time renderings of the pre-game show and the actual football game. Both the Eye Vision system and the system used in "The Matrix" capture the scene with multiple closely placed cameras. These systems are often referred to as narrow-baseline techniques and use interpolation in order to create intermediate camera positions. Interpolative methods do not allow the camera to be placed outside the initial locations of the cameras. Narrow-baseline techniques often create quality renderings at real-time speeds but the virtual camera location is limited. In order to create unrestricted virtual camera movement, a large number
of cameras can be placed in a sparse set-up. A closed room version of the ”Eye
Vision” system, called 3D Cage, is created by Carnegie Mellon University [23]. In
3D Cage, a total of 48 cameras are distributed on a grid on all walls of a room, see
figure 3.2 (a). Novel views of arbitrary objects can be created using this system.
Figure 3.2: Camera set-up of 3D Cage system (a) from CMU and Stanford Multi-
Camera Array (b)
A similar approach is taken by the Stanford Multi-Camera Array Group [59]. A
frontal baseline of 128 cameras is used to manipulate video in different modes in-
cluding novel view creation, video-rate conversion and super resolution, see figure 3.2 (b). Both the Stanford multi-camera array and the 3D Cage require specialized hardware that costs around a hundred thousand dollars.
The VBR techniques can be grouped by the camera set-ups used. Wide-baseline VBR systems often use fewer cameras and the cameras are placed in a sparse fashion [35,36]. These approaches are more affordable but also more approximate. They often cannot create the same quality renderings as the narrow-baseline systems. These techniques create an explicit 3D model of the scene and texture it using the acquired images. The main artifacts are due to low quality 3D models and faulty texture fusion.
Outdoor versions of VBR have been previously attempted. The AVE system created at USC [37] and the system developed at UC Berkeley [8] create an approximate 3D model of the scene and texture this model using projective texture mapping, see figure 3.3 (a) and (b) respectively.
Figure 3.3: Novel view rendering of outdoor scenes has been attempted by research groups in universities: (a) USC (b) Berkeley
The main drawbacks of these systems are that not all renderings are photorealistic and dynamic objects in the scene are not always
handled properly. Dynamic objects need to be modeled and tracked throughout
the video in order to create a successful fly-through.
Model-based VBR techniques often have realistic renderings and are still computationally inexpensive to perform. Prior knowledge of the object class is leveraged, e.g. humans. Vision-based fitting of a human model is investigated by Carranza et al. [3] and Cheung et al. [4]. Cheung et al. used a skeletal representation of the human body and estimated the pose of the skeleton by matching various points
between the images. They presented their system as a marker-free motion capture
system rather than a model-based VBR system.
Carranza et al. fit a part-based human model from multiple calibrated cam-
eras [3]. The movement and surface properties of humans are automatically ex-
tracted from the video in order to create a free-viewpoint video. First, a silhouette
per camera is estimated and a part-based human model is fit. Second, the model
is tracked over the video sequence. Lastly, the model is rendered using the mul-
tiple camera streams. Their technique is able to create good quality models and
renderings of dynamic human body. Their approach of part-based modeling and
rendering is one of the inspirations of our research. However, their technique re-
quires multiple calibrated cameras and long computation times ranging from hours
to days.
Our approach to VBR is to interactively model the object class using the model-based modeler similar to architectural modelers, see Chapter 2 for details. Our model-driven video-based rendering (ModVBR) system uses this part-based modeler to rapidly create renderings [46]. The model created from a single image/frame is tracked in the video and the pose of the dynamic object is updated. A free-viewpoint image is rendered using the model, pose information and the input video. Our model-aided tracker and pose estimation technique are also more robust than regular non-model-aided trackers. We demonstrate the modeling and rendering capabilities of our technique on vehicles. ModVBR may also be useful for applications such as surveillance, games and special effects.
Figure 3.4: Flow diagram of the model-driven video-based rendering
3.1 System Overview
Model-driven VBR (ModVBR) uses our part-based modeler (Chapter 2) for rapid
creation of the vehicle model. In particular, video is paused at a particular frame
and the vehicle is interactively modeled. The user clicks various points on the vehicle for the tracker to track and certain parts of the background for the environment modeler. These steps need to be performed only once for every video sequence. After this step, the program runs automatically.
First an environment model is created using the current location of the vehicle
and the user clicks. The environment model is similar to a pop-up book effect [17].
At every frame, the vehicle is tracked using a model-driven tracker and the new
location of the tracking points are passed to our pose estimation technique. At
the frame that the modeling is performed, the vehicle’s pose with respect to the
camera is known. Pose estimator updates this pose as well as the tracking points
themselves. Finally at the rendering step, the vehicle is textured with the current
frame and the environment is modeled with the estimated background image. Fur-
ther effects such as lighting changes and shadows are added as requested by the
user. A flow diagram of the whole procedure is shown in figure 3.4. The steps
that require interaction are shown in light blue. The automatic steps, performed
at every frame, are shown in yellow. The user controls the dynamic rendering by changing the virtual camera position; this control is shown in green in the flow diagram.
The main focus of our research is on the modeling, tracking and rendering of
dynamic objects and not of the architectural structures. Modeling and rendering of architectural structures from image/video has been previously investigated by [29] in detail. The rest of this chapter outlines the steps of this procedure. Section 3.2 presents the use of the 3D model to aid tracking. Pose estimation/update is explained in section 3.3. Lastly, environment modeling and extra rendering options
are presented in section 3.4.
3.2 Model-Aided Tracking
In order to use the model-aided pose estimation procedure, several 2D image point/3D
mesh vertex pairs are needed. At the frame where the modeling is performed, the
correspondence between the 2D image and the 3D model is known. In particular,
user clicked 2D tracking points are back-projected onto the 3D model to create a
set of 2D/3D pairs. The purpose of the model-aided tracking is to propagate this
pairing information to the rest of the frames in the video.
Model-aided tracking can be formally stated as follows: given a set of 2D/3D pairs at frame t, find a similar match at frame t+1. Since the 3D model of the object is assumed to be unchanged between frames, by matching the 2D points between frames t and t+1, 2D/3D pairs for frame t+1 are created. This procedure is similar to finding motion vectors of some given points in 2D. Motion vector estimation generally uses 2D square windows for matching. The model-aided vehicle tracker, on the other hand, uses the relation between the 2D image and the 3D model to increase the accuracy of the tracking. Instead of searching for a square-shaped window in 2D, a surface patch on the model in 3D is created and the search in 2D is performed for the projection of this patch. Figure 3.5 compares the projection of these 3D patches and regular 2D patches. Patches in 3D are superior to patches in 2D since they have less crosstalk between foreground and background (compare the patches on the tires in Figure 3.5). This is also referred to as "background leakage" and is one of the main sources of drifting in tracking. In our experiments, 3D patches outperformed 2D patches in terms of tracking accuracy and the ability to recover from drift.
The user controls the selection and creation of these patches through 2D image
clicks. The orientation of the square patches is still undetermined; we choose to
orient the patches parallel to three major axes. This can be seen in Figure 3.5 (b),
where each square box is aligned with the surface and is parallel to the major axes.
Figure 3.5: Our model-based 3D patches have less ”background leakage” than the
regular 2D patches. (a) Regular 2D patches (b) Projection of 3D patches
Tracking these patches is computationally comparable to regular motion vector
estimation. The next frame is exhaustively searched with the following error function:

\operatorname*{argmin}_{Δx, Δy} ‖ F^*_t(x_0, y_0) − F^*_{t+1}(x_0 + Δx, y_0 + Δy) ‖    (3.1)

F^*_t(x_0, y_0) = (F_t(x_0, y_0) − \bar{F}_t(x_0, y_0)) / ‖ F_t(x_0, y_0) − \bar{F}_t(x_0, y_0) ‖    (3.2)

where F_i is the i-th frame, (x_0, y_0) is the current location of the patch, and (Δx, Δy) is the displacement between frames. F^* represents the normalized image patch (zero mean, unit norm). Equation 3.1 is also called normalized cross correlation and effectively handles the illumination changes made by the camera.
The perspective effect between consecutive frames is assumed to be negligible.
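A minimal Python sketch of this exhaustive normalized-cross-correlation search is given below; it uses a square window and a fixed search radius for simplicity, whereas the actual tracker matches the projection of a 3D surface patch, and the parameter values are assumptions.

import numpy as np

def track_patch(frame_t, frame_t1, x0, y0, half=8, search=12):
    """Find the displacement minimizing equation 3.1 with the zero-mean,
    unit-norm normalization of equation 3.2. Frames are 2D grayscale arrays;
    (x0, y0) is the patch center at frame t; windows are assumed to stay
    inside the image."""
    def normalized(img, cx, cy):
        p = img[cy - half:cy + half + 1, cx - half:cx + half + 1].astype(float)
        p -= p.mean()
        n = np.linalg.norm(p)
        return p / n if n > 0 else p

    ref = normalized(frame_t, x0, y0)
    best, best_err = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            err = np.linalg.norm(ref - normalized(frame_t1, x0 + dx, y0 + dy))
            if err < best_err:
                best_err, best = err, (dx, dy)
    return best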
We implemented additional drift prevention: time distance filtering. In general,
drift increases proportionally to the time distance (frame difference) of the current
frame to the initialization frame. In order to mitigate this, we post process the
outputs of our tracker. At every frame, tracking is performed between the current
frame and the previous frame first; later we search for another previous frame (closer to the initialization frame) that has tracking points closer to the current frame. If such a frame is found, model-aided tracking is performed between the current frame and this new frame. This update of the tracking points diminishes the drift, since the points from this new frame have less drift. In our experiments, the addition of re-tracking eliminated many errors.
Currently, selection of the 2D points to track is left to the user and the points
are not changed among the frames. Neither of these is required for the model-
aided tracker. First, tracking points can be selected automatically by a saliency criterion, such as corners. One such popular technique, "Good Features to Track", is presented by Shi and Tomasi [48]. Furthermore, the tracked points do not need to be the same for all frames. This is important since vehicles may turn drastically over the period of the video and some points may become invisible to the camera. If 2D candidate tracking points are automatically and robustly found, then the system can handle any kind of vehicle movement. One possible addition to our system is to randomly select points from the "Good Features to Track" output by RANSAC and iterate until a proper estimation is achieved. Although the addition of these techniques is straightforward, they are not implemented due to time constraints.
3.3 Model-Aided Pose Estimation
Model-aided pose estimation is not a new topic. Koller et al. showed how line
features in a car could be used to track and recover the vehicle’s pose informa-
tion [25]. This technique assumes the 3D model of the car is created in terms of
predetermined lines. This prevents the system from tracking arbitrary object types or different types of vehicles. We find point features to be more common, thus suitable for a general pose estimator. However, the best approach would be a hybrid system that uses both lines and points.
Pollefeys et al. use corner features for tracking the movement of the camera between frames [40]. Although their goal is to estimate the movement of the camera, this is mathematically equivalent to estimating the movement of a rigid object from a fixed camera. This can be explained with the relativity of motion: a moving camera with a stationary scene creates the same effect as a stationary camera with a moving scene. Thus, we use a similar formula to theirs.
Camera calibration from section 2.5 requires the estimation of all 9 camera parameters. However, under the assumption of a stationary camera without zoom between frames, and the tracking of vehicles, only three parameters need to be estimated. Two of these parameters are for translation: T_X, T_Z (since translation on a plane is a 2D vector). The third parameter is the rotation of the object, R_Y, on the plane. If enough 2D/3D matches are available, the pose is estimated by minimizing the Euclidean distance between the projection of the 3D point and the tracked 2D point:
\operatorname*{argmin}_{R_Y, T_X, T_Z} ‖ [u, v] − P(K R [I\;T] X) ‖    (3.3)

where X is the 3D point, T is the translation vector, R is the rotation of the vehicle, K is the internal camera parameters and (u, v) is the tracked 2D point. The function P() is the projective division. The 2D/3D coupling is only available for systems, like ours, that explicitly construct a 3D model. Formula 3.3 is nonlinear in terms of T_X, T_Z, and R_Y. Although a nonlinear estimation can be performed,
estimation by bootstrapping is computationally less complex and is less likely to get stuck in local minima, since the movement of the vehicle between frames is small. In particular, we first fix the rotation and estimate the translation, and then we fix the translation and estimate the rotation. This procedure is repeated until convergence; in our experiments the average number of iterations was four. In the case of no noise, two points are enough to solve for these three parameters, but with noise, more points are required. A stationary camera and planar movement of the car are not necessary assumptions, but they improve the accuracy and robustness of the estimation procedure.
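The alternating (bootstrapping) minimization can be sketched as follows (Python); the projection follows formula 3.3, but the coarse grid searches used for the translation and rotation steps, and their search ranges, are illustrative assumptions rather than the solvers used in our system.

import numpy as np

def rot_y(ry):
    """Rotation about the Y (up) axis."""
    c, s = np.cos(ry), np.sin(ry)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def project(K, R, T, X):
    """Project 3D points X (n, 3) as P(K R [I T] X), formula 3.3."""
    cam = (R @ (X + T).T).T
    uvw = (K @ cam.T).T
    return uvw[:, :2] / uvw[:, 2:3]

def estimate_pose(K, X, uv, pose0, iters=4):
    """Alternate between a translation step (rotation fixed) and a rotation
    step (translation fixed); pose0 = (ry, tx, tz, ty), with ty held fixed."""
    ry, tx, tz, ty = pose0
    def err(ry_, tx_, tz_):
        return np.linalg.norm(project(K, rot_y(ry_), np.array([tx_, ty, tz_]), X) - uv)
    for _ in range(iters):
        cand = [(err(ry, tx + dx, tz + dz), tx + dx, tz + dz)
                for dx in np.linspace(-0.5, 0.5, 21)
                for dz in np.linspace(-0.5, 0.5, 21)]
        _, tx, tz = min(cand)                   # translation update
        cand = [(err(r, tx, tz), r) for r in ry + np.linspace(-0.2, 0.2, 41)]
        _, ry = min(cand)                       # rotation update
    return ry, tx, tz

After convergence, re-projecting the 3D points with the estimated pose yields the corrected 2D tracks used by the anti-drifting post process described below.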
A novel addition to our pose estimation is the anti-drifting post processing.
Tracking systems often suffer from drifting (error accumulation) over time. Al-
though it is a known problem, not too many solutions exist. The main reason is
that there are no 3D global criteria to correct these small errors, since the minimization is performed over the 2D error.
Figure 3.6: 2D tracks of the model-driven tracker (a) before and (b) after filtering
However, formula 3.3 creates a global constraint over the tracking points. As post processing, we re-project the 3D points (X) onto the image with the new camera parameters (T_X, T_Z, R_Y). This eliminates
small random errors each 2D point has and creates a more robust overall tracking
algorithm (assumption is that all drifting errors are independent of each other). In
our tests, this post process eliminated a big part of the error accumulation.
There is an additional set of filters used in our pose estimation. In particular, the input 2D tracking points and the output camera parameters are filtered with a low-pass
filter. Since we re-project the 3D points after pose estimation, we can compare the
input points and a combined result of all the filtering and back-projection. This
comparison is shown in figure 3.6. Figure 3.6 (a) shows the results of the model-
aided tracker and (b) shows the points after the pose estimation procedure.
Figure 3.7: (a) Background image created by subtraction of foreground and aver-
aging. (b) 3D wireframe model of the environment
3.4 Environment Modeling
Visualization of only dynamic objects without a background model would result in
objects flying in the air. In order to increase the realism of the VBR, we create
an environment 3D model, relit models and recast shadows. The environment is
modeled relative to the vehicle at the frame where the vehicle model is estimated.
A background image is estimated using an averaging approach where the parts of
the image that the vehicle occupied are discarded. This background image is used
to texture the environment and contrary to the main object, this texture is kept
stationary, see Figure 3.7 (a). Environment is approximated as a ground plane
and a set of walls. First, ground plane is automatically inferred from the lowest
point of the vehicle, i.e. tires. Second, we incorporate the user labeled points
with the ground plane estimation. The user is expected to label bottom part of
71
(a) (b)
Figure 3.8: (a) Rendering of the vehicle without the shadows (depth ambiguity)
(b) Rendering of the vehicle with shadows (clear depth perception)
the walls in the scene, where walls are created with the assumption of standing
upwards. Thiscreatesapop-upcardeffect, similartotheworkofHoeimetal.[17].
An environment model with few triangles often renders low quality images, thus
the scene model is iteratively subdivided (without smoothing) to increase texture
quality, see Figure 3.7 (b).
At this point, a free viewpoint video of the ground, walls and the vehicle can be
created, where virtual camera can be moved arbitrarily. Shadows are recast on the
scene (optionally) by the user manually entering the suns location. Experiments
show that lack of shadows causes depth confusion and hinders the quality of the
renderings. A comparison of the screen shots of our renderings with and without
shadows is shown in Figure 3.8.
Figure 3.9: 3D model is projected on the input video (stationary camera). (a)
Frame 1 (b) Frame 25 (c) Frame 110 (d) Frame 290
3.5 Results and Discussion
The example in Figure 3.9 and Figure 3.10 is created from a video sequence (400
frames / 13 seconds) captured with a stationary DV camera (a low-end video camera). The video is composed of a car driving forward, stopping, driving back and then turning left. The vehicle is modeled at Frame 1 with our part-based 3D modeler and a
total of seven points are tracked using the technique explained in section 3.2.
Figure 3.9 shows the input video with the 3D model projected onto the image.
As can be seen from Figure 3.9, the vehicle model is quite accurate and our stable
tracking keeps the 3D model on the vehicle at every frame. Figure 3.10 shows screenshots of the video-based rendering for the same frames as in Figure 3.9. For this scene, five image clicks were used to build a piecewise-planar scene model. In addition to environment modeling, the shadow of the vehicle is recast on the floor. As can be seen, novel renderings still maintain realism under different viewing conditions. VBR gives the user 4D control over the renderings: the user can move the virtual camera anywhere in the three dimensional world and also play the video forward and backwards in time.
Although creating the video-based rendering is automatic, it is not error-free. Erroneous tracks might occur under bad illumination or when the assumption of a fixed 3D model is violated. For example, vehicles use the front wheels to steer, and under radical appearance change, tracking may sometimes diverge. In order to have robust
tracking, outlier detection needs to be added to the model-aided tracker. Other-
wise, faulty tracking causes inaccuracies in pose estimation, which creates larger
errors due to accumulation in the subsequent frames.
VBR can be used to increase the understanding of a scene. Novel renderings of
surveillance videos are shown to be an important visualization tool [37]. However,
VBR's main use is in the special effects and entertainment areas. For example, once the 3D model of the car is created and tracked over the video, a virtual human (or any virtual object) can interact with the car.
Figure 3.10: Video-Based Rendering from arbitrary views for the same frames as
in Figure 3.9. (a) Frame 1 (b) Frame 25 (c) Frame 110 (d) Frame 290
3.6 Conclusion
In this chapter, we present a model-driven video based rendering (ModVBR) scheme
to rapidly render novel views of dynamic scenes. Our system models, tracks, and
renders an instance of a specific object class, vehicles, from a single un-calibrated
video stream. First, the 3D model of the vehicle is interactively created by modify-
ing a part-based generic model (as discussed in Chapter 2). Model-aided tracking and pose estimation techniques are developed to account for the dynamic behaviors
in the scene. Model-aided tracker uniquely creates 3D patches from 2D tracking
points and searches for these 3D patches rather than conventional 2D patches.
Furthermore, tracking points are updated after pose estimation by re-projecting
these 3D patches onto the image to mitigate the drifting problem. Lastly, a piece-wise
planar scene model is created to increase realism and shadows are recast. With
this pipeline, a 4D video can be created in mere minutes using only a single un-
calibrated video camera. This framework can be invaluable to many virtual reality
and special-effect systems.
ModVBR is created as an application of our part-based 3D modeler. Without
the addition of our rapid modeler, it is difficult to create novel photo-realistic ren-
derings from a single camera. To our best knowledge, our system is the only rapid
video-based rendering of dynamic scenes.
Chapter 4
Application II
Semi-Automatic Vehicle Modeling
3D modeling of complex objects, such as vehicles, is rarely attempted by image-
based modeling techniques. Our rapid part-based vehicle modeling framework, see
Chapter 2, is able to create quality 3D models from single un-calibrated images in a couple of minutes. Although this approach is faster than the alternative modeling systems, it has an interactive framework that prevents automation. Our approach to solve this problem is composed of two main stages: detection and modeling. First, multiple hypotheses are generated for every part. Second, a best combination of these hypotheses is estimated using our reduced histograms learning algorithm and occasional user interaction. Lastly, these 2D part detections are converted into 3D modifications with a technique similar to section 2.6 in chapter 2. Our framework is able to create quality 3D models of sedans, SUVs, minivans and many more with an average of three image clicks. In this chapter, semi-automatic vehicle modeling is presented. Although our current implementation is not fully automated, with accurate detectors and effective learning algorithms, the whole procedure might run without the need for human interaction.
4.1 Introduction
Digital re-creation has been successful in modeling the human body, face and ar-
chitectural structures. However, there have been only a few attempts to model
vehicles. Due to their crucial role in our lives, we focus this chapter on rapidly
creating 3D models of vehicles even faster and more easily (in terms of total time spent and total number of user interactions) than the technique presented in chapter 2.
Numerous alternative 3D modeling techniques are reviewed in section 2.2. In
this section, we will focus on the detection systems suitable for modeling complex
objects, i.e. vehicles. Methods such as stereo and laser-scanner based reconstruc-
tion often require expensive specialized equipment. Therefore, we limit our atten-
tion to techniques that require nothing more than a digital camera, i.e. image-based
modelers.
One way to speed up modeling is the automatic detection of objects in the
image. Computer vision research has had some success in detecting humans [61], faces [21,43] and vehicles [33,43] in images. Viola and Jones first presented a face detection system that is able to find frontal-facing faces [53]. Later, they extended their system to multiple views and arbitrary rotations [21]. Different poses are handled by the use of a decision tree. Each pose is detected by a separately trained face detector. However, with the use of the decision tree, applying all detectors to all candidates is not required. Although this lowers computational complexity, some detection accuracy is lost. Schneiderman and Kanade presented a multi-
view vehicle detection system similar to Viola and Jones’s face detector [43]. They
train a set of appearance-based binary classifiers that can handle various street
scene visual appearance variations such as shadows and lighting changes. Multi-
view detection systems often train multiple single-view detection systems and run
them in parallel [13,33,43]. Fergus et al. presented an unsupervised classification
algorithm that is capable of detecting a variety of object classes: motorcycles, cars,
faces, airplanes, etc. [13]. First, a set of features is extracted from each positive and negative sample. The classification of objects is performed by learning a binary classifier per object class. However, the localization of objects in the picture is often not accurate enough for modeling purposes. Furthermore, all these techniques are holistic approaches, i.e. they detect objects as a whole. These approaches, although valuable in detecting objects, are not suitable for modeling objects. Thus we turn our attention to part-based systems.
Part-based techniques are often used to detect objects with variable parts (due to design or movement) and possibly under occlusion. Wu and Nevatia detect people by separately finding human parts and later combining them with MAP estimation [61]. Although their system is applied only to humans, the framework is capable of detecting other object classes with straightforward changes. Their system is one of the inspirations of our system. Lueng similarly uses a part-based vehicle model to detect vehicles in street scene images [33]. Lueng's approach is specific to vehicles and is able to detect parts with close to 80% accuracy; however, the localization accuracy of the parts is not reported in his publication. Wu and Nevatia later presented another work to detect and segment both humans and vehicles from a single image [62]. Their method learns a combination of classifiers to both detect and then later segment the image. Their detection result is close to perfect (95.7% for vehicles) and their segmentation is quite accurate (95% pixel overlap). One drawback of their algorithm is that it requires hand-segmented positive samples for training; for vehicles, around 500 images were hand segmented, which is a considerable job. On the other hand, their system is automatic and the detection rate
is 95%. Nonetheless, previous research has shown that part-based approaches are
suitable for both modeling and detection. Furthermore, part-based techniques are
superior in localization when compared to holistic object detection systems.
In this chapter, we present a learning-aided vehicle modeler that is able to
rapidly create textured 3D models from a single image in around ten seconds. A
part-based representation of the vehicle is used during both modeling and detec-
tion steps. In the detection step, each vehicle part is estimated by multiple hypotheses. Vehicle parts are, by design, correlated and this dependence is leveraged in our modeler to speed up the process. We introduce a novel learning algorithm, reduced histograms, that is able to learn a joint PDF (JPDF) as a weighted combination of smaller JPDFs. Reduced histograms guide the selection of the best part hypothesis com-
bination, where these hypotheses are created by part-specific detectors. 2D part
detection results are converted to 3D modifications and distributed over a part-
based generic model with a technique similar to chapter 2. A variety of different
vehicles, SUVs, minivans, sedans and hatchbacks, can be modeled from pictures
taken in an ordinary street scene with an average of three user clicks.
4.2 System Overview
Our vehicle modeler creates 3D models from a single image, however the image is
assumed to be taken from a pre-determined direction. Currently only side-view
images are supported, however a multi-view version is currently being investigated
with a pipeline similar to multi-view detection algorithms [13,21,33,43].
Our system starts with the detection of the tires. This is followed by automatic
camera calibration. Later, first level parts (front/rear bumper, middle and roof)
are detected. To increase the overall detection rate, multiple hypotheses per part are created. Later, a best combination of these hypotheses is estimated using our reduced histograms algorithm. Reduced histograms enable learning of a JPDF with a relatively small number of training samples, handle fixed/missing dimensions
and can be maximized quite efficiently. If the best combination of the hypotheses
is not totally accurate, the user is allowed to choose another hypothesis simply
by clicking on it. This triggers maximization with this selection as an additional
constraint. Once all parts are correctly estimated, a generic part-based vehicle
model is modified with the modeling technique from Chapter 2. After modeling,
the texture of the vehicle is estimated using the image, camera parameters and the
symmetry of the vehicle. However, not all parts are visible, thus these gaps are
filled with the automatically extracted color of the vehicle.
Section 4.3 has the details of the part selection, part detection and learning of
parts. The user interface, camera calibration and modeling using the detection is
explained in section 4.4. An analysis of our learning algorithm, detection accuracy
and modeling results can be found in section 4.5.
4.3 Part-Based Vehicle Detection
In our early tests, three alternative vehicle detection techniques were investigated:
color segmentation, silhouette estimation and part-based detection. The body color of a vehicle in street scenes varies dramatically due to the reflectance property of the body material, thus color segmentation often failed (see figure 4.6a). Silhouette estimates of the vehicles are also often broken in many places due to cluttered backgrounds, reflective body material and transparent windows (see figure 4.6). Part-based detectors, unlike silhouette estimation or segmentation, do not require continuity over the whole vehicle. Furthermore, there has been a considerable amount of part-based object detection research for similar purposes.
Parts           Features   Types        n    k†   k
Front           Edge       Vertical     1    3    3
Rear            Edge       Vertical     1    3    3
Middle          Edge       Horizontal   1    3    3
Roof            Edge       Horizontal   1    3    3
F. Windshield   Edge       Variable     10   2    3
R. Windshield   Edge       Variable     10   2    3
Tires           Adaboost   Haar         -    -    -
Table 4.1: Parts and corresponding features. n, k† and k are parameters from Table 4.2
Object (part) detection is often performed in three steps: (i) find salient (interest) points, (ii) extract a feature vector, (iii) learn a classifier. The entropy detector [22] or Forstner detector [1] can be used for step (i). Feature vectors generally include both shape features such as edges and corners and saliency features such as luminance crest/valley and entropy. Methods such as SIFT [32] and Haar wavelets [38] combine the first two steps. Lastly, from a set of positive and negative samples, a binary classifier is learned using techniques such as SVM [38], MAP [13], SNoW [1] and Adaboost [15]. These techniques often have good classification results; however, the locations of the objects are not exact or accurate enough for modeling purposes. The main reason is the implicit object representation that is used, e.g. saliency feature vectors. Using pre-determined explicit features for parts enhances the location estimates, with the drawback of a lower detection rate. The selection of the features and the detection of parts are discussed in sections 4.3.1 and 4.3.2 respectively.
4.3.1 Vehicle Parts
A vehicle can be decomposed into parts in many different ways. Sebe et al. [45,46]
used 32 parts to model their vehicles, whereas Huber et al. [20] divide vehicles into three parts for classification purposes. Although it is good to have all the parts detected, this is often unrealistically ambitious. However, having only three parts (Huber et al.) is not enough for modeling many vehicles. In this section, we will
analyze the choice of how many parts to detect and which ones.
For vehicles, we made several observations: (i) certain parts are more crucial
than others, (ii) certain parts are easier to find/detect in an image, (iii) some parts
can be inferred from others. In the light of these observations, we choose to detect a total of 8 parts: front/rear tires, front/rear bumpers, front/rear windshields, middle and roof. Furthermore, a hierarchy of parts is used in both the detection and learning steps. This hierarchy of parts is not in terms of their size/scale but of their priority of detection.
The hierarchical framework gives additional control to the modeler and creates tighter constraints on the search spaces of the higher levels in the hierarchy. However, this approach has an inherent error propagation problem, which is mitigated by the occasional user interaction. In our modeler, the user verifies the hypothesis selection and is allowed to modify it if necessary. The algorithm repeats the selection process with the user input as a constraint, thus error does not propagate to further
levels.
Figure 4.1: Hierarchy of vehicle parts
In the image, the vehicle can be anywhere and of any size. However, the location of the tires gives a good indication of the location and size of the vehicle. Furthermore, the shape of the tires is the same among different vehicles and they are quite easy to find in images (in comparison to other parts). Thus, tires are treated as the anchors of the car model. For these reasons, our first step is the detection of tires.
After analyzing numerous vehicles, we observe that the major changes in vehicles come from 4 parts: front bumper, rear bumper, roof and the middle of the vehicle (the boundary of the door to the window). The location of these four parts gives us a rough model of the vehicle. Windshields are of secondary importance and often make the distinction between SUVs and cars. Each vehicle is modeled with three levels and a total of eight parts, see figure 4.1. FrontBumper and RearBumper are the middle of each bumper; Middle is the line connecting the doors to the side windows; FrontTire and RearTire are the centers of each tire; Roof is the middle of the roof; FrontWindshield and RearWindshield are extracted from any two points on the windshields. All points are marked on a side-view vehicle image.
P_1 = (FrontBumper_x − FrontTire_x) / L
P_2 = (RearBumper_x − RearTire_x) / L
P_3 = (Middle_y − (FrontTire_y + RearTire_y)/2) / L
P_4 = (Roof_y − (FrontTire_y + RearTire_y)/2) / L
P_5 = FrontWindshield_θ
P_6 = RearWindshield_θ
P_7 = (FrontWindshield_x − FrontTire_x) / L
P_8 = (RearWindshield_x − RearTire_x) / L
L = ‖FrontTire − RearTire‖                                        (4.1)
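As an illustration only (this is not the author's code), the normalized parameters of equation 4.1 could be computed from labeled 2D points along the following lines; the point names and dictionary layout are hypothetical:

```python
import math

def vehicle_parameters(pts, fw_angle, rw_angle):
    """A minimal sketch of equation 4.1. `pts` maps part names (hypothetical keys)
    to (x, y) image coordinates on a side-view image; the two windshield angles
    are supplied separately in radians."""
    L = math.dist(pts["front_tire"], pts["rear_tire"])           # tire separation (scale term L)
    tire_y = (pts["front_tire"][1] + pts["rear_tire"][1]) / 2.0   # mean tire height
    return {
        "P1": (pts["front_bumper"][0] - pts["front_tire"][0]) / L,
        "P2": (pts["rear_bumper"][0] - pts["rear_tire"][0]) / L,
        "P3": (pts["middle"][1] - tire_y) / L,
        "P4": (pts["roof"][1] - tire_y) / L,
        "P5": fw_angle,                                           # front windshield slope
        "P6": rw_angle,                                           # rear windshield slope
        "P7": (pts["front_windshield"][0] - pts["front_tire"][0]) / L,
        "P8": (pts["rear_windshield"][0] - pts["rear_tire"][0]) / L,
    }
```

Dividing by L makes all positional parameters scale-invariant, so vehicles photographed at different distances yield comparable values.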
As mentioned before, the size and location of the vehicle come from the tires and
these parameters are arbitrary, thus no learning is required for the tires (uniform
distribution). Second-level parts (front/rear bumper, middle and top) are parameterized
by their scaled distance to the tires (see P_1, P_2, P_3, P_4 in formula 4.1). Each
windshield is modeled by two parameters: the location and the slope (see P_5, P_6, P_7,
P_8 in formula 4.1). L is the distance between the tires and enables scale-invariance.
These 8 parameters are learned to aid our modeling procedure; the details of the
learning algorithm are discussed in section 4.3.3. Figure 4.2 shows the marginal
distributions of the variables defined by equation 4.1.
Figure 4.2: Marginal distributions of the parameters of equation 4.1 (front, rear,
middle, top, front/rear windshield angle and location).
4.3.2 Part Detectors
Each part defined in the previous section is detected by a detector that is tuned
to the specific characteristics of that part. During the offline part of our modeler,
a set of side-view vehicle images is labeled to extract the parameters in formula
4.1. This information is used to estimate the search boundaries of the parts. The
main purpose of the part detectors is to estimate one or more of the parameters in
formula 4.1. In order to increase the location accuracy of the detections, each part
is assigned a fixed feature (e.g. the roof is a horizontal line). Although fixing the
feature increases the location accuracy, it decreases the detection rate. To achieve
higher detection rates, multiple hypotheses for parts are estimated. For example,
even with a 60% accurate part detector, the system is 93.6% likely to find the part
with only 3 hypotheses, since the chance that all three independent hypotheses miss
is 0.4^3 = 6.4%. The parts and their corresponding features are shown in Table 4.1.

0  Undo tire rotation
   For every part
1    Estimate search bound from marginal PDFs
2    Find ∠∇I and ‖∇I‖
     For every angle θ_i, i = 1..n
3      Filter ∠∇I and ‖∇I‖ with θ_i
       For every line location L_j, j = 1..m
4        Find response of line L_j for angle θ_i
5      Find k† local maxima
6    Find best k out of n·k† local maxima

Table 4.2: Pseudo-code for edge-based part detectors
The majority of the parts are estimated using edge-based detectors (all except the tires).
Tires are detected by a scale-invariant appearance-based detector, in particular an
Adaboost Haar classifier [15]. If more than two tires are detected, a best pair is
selected by estimating each patch's grayness C(i,j), comparing sizes S(i,j) and the
relative angle of the tires A(i,j) (see formula 4.2).
argmax_{i,j}  S(i,j) C(i,j) A(i,j)

S(i,j) = (1 − 2|w_i − w_j| / (w_i + w_j)) (1 − 2|h_i − h_j| / (h_i + h_j))
C(i,j) = (min(r_i, g_i, b_i) / max(r_i, g_i, b_i)) (min(r_j, g_j, b_j) / max(r_j, g_j, b_j))
A(i,j) = 1 − |tan^{-1}(Δy/Δx)| / (π/2)                                        (4.2)
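The pairing criterion of equation 4.2 could be implemented roughly as follows; this is a hedged sketch, and the candidate representation (bounding-box width/height, mean RGB, center coordinates) is an assumption rather than the author's data structure:

```python
import math
import itertools

def pair_score(a, b):
    """Score one candidate tire pair following equation 4.2 (sketch)."""
    # S: similar bounding-box sizes
    S = (1 - 2 * abs(a["w"] - b["w"]) / (a["w"] + b["w"])) * \
        (1 - 2 * abs(a["h"] - b["h"]) / (a["h"] + b["h"]))
    # C: both patches should be gray (min/max of mean RGB close to 1)
    C = (min(a["rgb"]) / max(a["rgb"])) * (min(b["rgb"]) / max(b["rgb"]))
    # A: the line joining the two tire centers should be close to horizontal
    dy, dx = abs(b["cy"] - a["cy"]), abs(b["cx"] - a["cx"])
    A = 1 - math.atan2(dy, dx) / (math.pi / 2)
    return S * C * A

def best_tire_pair(candidates):
    """Return the indices (i, j) of the highest-scoring tire pair."""
    pairs = itertools.combinations(range(len(candidates)), 2)
    return max(pairs, key=lambda ij: pair_score(candidates[ij[0]], candidates[ij[1]]))
```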
The rest of the parts are detected using edge-based detectors. A pseudo-code
for these detectors is shown in Table 4.2. The part detector uses both the angle and
the magnitude of the edges to create hypotheses. The part detector first rotates
the image by the angle between the tires, which makes the searches computationally
efficient. For every part, a pre-learned marginal PDF is available and the search
limits for the parts are extracted from these marginal PDFs. The gradient angle and
magnitude images of the patch (search space) are extracted using formula 4.3. The
gradient angle image is thresholded with the desired angle θ_i (see Table 4.1) and
converted to binary. This binary image is used as a mask on the gradient magnitude
image, i.e. the gradient image is filtered with angle θ_i. The filtered image is
converted to a response curve, see figure 4.3, by summing the pixel values on line
L_j (which has a slope of θ_i). The specific values of n, k† and k from Table 4.2 are
listed in Table 4.1.
∇I_x = ∂( I ∗ e^{−x²/(2σ²)} / (√2 σ) ) / ∂x
∇I_y = ∂( I ∗ e^{−y²/(2σ²)} / (√2 σ) ) / ∂y
‖∇I‖ = √(∇I_x² + ∇I_y²)
∠∇I = tan^{-1}(∇I_y / ∇I_x)                                               (4.3)
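A rough sketch of how the oriented response of Table 4.2 and formula 4.3 could be computed is shown below. It assumes the patch has already been rotated by the tire angle so that candidate lines are axis-aligned, and it uses standard Gaussian-derivative filtering in place of the exact smoothing constant of formula 4.3; function and parameter names are mine, not the author's:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def oriented_response(patch, theta, sigma=1.5, tol=np.deg2rad(10)):
    """Response curve for edges whose gradient angle is close to `theta`.

    patch : 2D float array (already rotated so candidate lines are vertical)
    theta : desired gradient angle in radians
    Returns one response value per column (line location)."""
    gx = gaussian_filter1d(patch, sigma, axis=1, order=1)   # derivative of Gaussian, x
    gy = gaussian_filter1d(patch, sigma, axis=0, order=1)   # derivative of Gaussian, y
    mag = np.hypot(gx, gy)                                   # gradient magnitude
    ang = np.arctan2(gy, gx)                                 # gradient angle
    # binary mask: keep only pixels whose gradient angle is within `tol` of theta
    diff = np.angle(np.exp(1j * (ang - theta)))
    mask = np.abs(diff) < tol
    return (mag * mask).sum(axis=0)                          # sum along each candidate line
```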
The result of the part detectors is the edge-oriented response curve. The edge-oriented
response curve (R() in formula 4.7) is weighted with the marginal PDF of the part
(P() in formula 4.7) for biased hypothesis generation. A non-maxima suppression
(NMS) technique is used to extract the largest k local maxima of this weighted
response curve. NMS not only prevents multiple responses to the same part (double
peaks in Figure 4.3) but is also able to find local maxima accurately. Figure 4.3
shows a sample result of this procedure; the selected hypotheses are shown as red
dots on top of the peaks.
Figure 4.3: Creating multiple hypotheses from the response curve (weighted response
versus image location). The weighted response curve is shown in blue, the marginal
PDF in green. Peaks selected by non-maxima suppression are shown in red.
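A minimal sketch of this weighting and peak selection (the names and the suppression radius are assumptions, not part of the thesis implementation):

```python
import numpy as np

def top_k_hypotheses(response, marginal_pdf, k=3, radius=5):
    """Pick the k strongest, mutually separated peaks of the weighted response."""
    curve = np.asarray(response, dtype=float) * np.asarray(marginal_pdf, dtype=float)
    hypotheses = []
    for _ in range(k):
        i = int(np.argmax(curve))
        if curve[i] <= 0:
            break                                   # no further usable peaks
        hypotheses.append(i)
        lo, hi = max(0, i - radius), min(len(curve), i + radius + 1)
        curve[lo:hi] = 0                             # suppress the neighborhood (NMS)
    return hypotheses
```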
4.3.3 Learning Algorithm: Reduced Histograms
Estimating each part independently, although possible, is inefficient and error-prone.
Vehicle parts are, by design, related to each other, i.e. given one part, certain
properties of another part can be expected. For instance, sports cars have elongated
fronts and shallower frames, and SUVs have their rear windshield steep and attached
directly to the rear bumper (no flat trunk). Although finding similar rules for every
type of vehicle is plausible, this would likely create an incomplete and conflicting
decision process. We think that joint probability distribution functions (JPDFs) are
the proper way to handle the complex interactions between the parts.
In particular, a JPDF of the eight parameters from formula 4.1 will be used to
estimate the best combination of part hypotheses. As mentioned in the previous
section, each part is estimated with multiple hypotheses, i.e. multiple possible
locations of each part. The JPDF guides the selection among these hypotheses and
increases the overall accuracy of the estimates by leveraging the dependencies between
the parts.

There have been numerous successful methods for learning and maximizing
JPDFs. One commonly used method is fitting a pre-defined probability distribution,
such as a jointly Gaussian or independently Gaussian model; however, it is presumptuous
to assume that vehicle parts are dependent in this way (in our experiments, no
such fixed pattern is observed, see figure 4.2). Another straightforward method
is the use of an n-dimensional histogram. This method, however, suffers from the
"curse of dimensionality": in our case, 5 bins for each of the 8 dimensions require
around 5^8 ≈ 4 × 10^5 total bins. Filling such a large number of bins with supervised
learning is impractical. Thus, we focus on learning methods that require few
learning samples.

Support Vector Machines (SVM) [38] can successfully work with a small set
of training samples and effectively reduce the number of samples required at run
time. On the other hand, they are not suitable for fixed/missing dimension cases
(user interaction in our framework creates such cases). Maximum a posteriori (MAP)
estimation, due to its probabilistic framework, is not only able to learn JPDFs but
can also handle missing data cases. However, it either requires a predefined PDF
(which is unsuitable for vehicles, as mentioned before) or EM coupled with particle
filtering.
A relatively new technique for approximating arbitrary JPDFs with lower dimensional
histograms is studied by Deshpande et al. [10]. Dependency-based histograms
treat each parameter as a node of a graph, where the edges denote the correlation
between the parameters. This graph is decomposed into small cliques by trying to
have the best overall correlation with few connections. Each clique can be learned
and maximized separately, which enables their system to work with few training
samples (hundreds of samples).
Our learning algorithm, reduced histograms, is a hybrid between dependency-
based histograms and the MAP framework. As in dependency-based histograms, lower
dimensional histograms are used to approximate the JPDF. However, rather than
clustering parameters by their cross-correlation, lower dimensional histograms are
weighted by the cross-correlation between the parts. Cross-correlation, although
a weaker condition than independence, can still effectively quantify the level of
dependency between parameters. In dependency-based histograms, all the chosen
dependencies have equal weight. In reduced histograms, interactions between parts
that are correlated have more weight than the ones between independent parts.
Figure 4.4: Comparison of our learning algorithm to others in terms of cross-correlation
and joint optimization. Each panel (Full Correlation, Reduced Histograms,
Dependency-Based, Independence) shows the eight part parameters (f, r, m, t, fwx,
rwx, fwt, rwt) as nodes of a graph.
This not only enables stronger interactions to force appropriate choices but also
allows weaker interactions to have some effect on the decision-making process (in
dependency-based histograms, weaker interactions have no effect since they are
eliminated).
A comparison of our technique to its alternatives can be made from the cross-correlation
point of view. If each variable is taken as a node and their relation is shown with a
connection (edge), then a graph representation can be drawn. In the case of full
correlation, all variables are correlated, thus we have a fully connected graph, see
figure 4.4. Our approach is also fully connected; however, the level of connectedness
is a function of correlation. Hence, some connections are more important than others
(darker lines indicate higher correlation). A dependency-based histogram approach can
be taken and an expected graph from this approach is also shown in figure 4.4. As
mentioned earlier, the dependency-based histograms eliminate the lower correlation
cases. On the other end of the spectrum lies full independence. In this form, all the
variables are disjoint.
As can be seen from figure 4.5, certain parts are correlated (e.g. middle and roof)
and no two parts are totally uncorrelated (correlation implies dependence). In our
framework, all reduced histograms are two-dimensional (although this is not
necessary), thus a total of (n choose 2) = n(n−1)/2 histograms and an n×n
cross-correlation matrix are pre-calculated at the learning stage. Unlike n-dimensional
histograms, 2D histograms can be trained with few samples (e.g. a couple hundred
samples). Since the histograms are separate, the same vehicle examples can be used
for training different histograms (i.e. samples are re-usable).
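For illustration, the offline learning stage could be sketched as follows (the array shapes, bin count and zero-probability floor are my assumptions, not prescribed by the thesis):

```python
import numpy as np

def train_reduced_histograms(samples, bins=5):
    """samples: (num_vehicles, n) array of the parameters from equation 4.1 (n = 8).

    Returns the n x n cross-correlation matrix, one 2D histogram per parameter
    pair, and the bin edges needed to look probabilities up later."""
    n = samples.shape[1]
    corr = np.abs(np.corrcoef(samples, rowvar=False))        # normalized cross-correlation
    hists, edges = {}, {}
    for i in range(n):
        for j in range(i + 1, n):                             # n(n-1)/2 pairwise histograms
            h, ex, ey = np.histogram2d(samples[:, i], samples[:, j],
                                       bins=bins, density=True)
            hists[(i, j)] = h + 1e-6                           # small floor avoids log(0) later
            edges[(i, j)] = (ex, ey)
    return corr, hists, edges
```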
Using the MAP framework over reduced histograms, we can also handle missing/fixed
dimension cases (user interactions). The general MAP framework for n variables is
shown in formula 4.4. Each hypothesis H_i is first estimated independently and later
the JPDF P_J(H_i) (the probability of the i-th hypothesis of part j) is used to evaluate
the probability of the combination of hypotheses. Thus formula 4.4 is rewritten as
formula 4.5.
Figure 4.5: Cross-correlation between parts: front bumper, rear bumper, middle,
roof, front and rear windshield locations and angles. White is high correlation and
dark is low correlation.
argmax_{X_1..X_n}  E( f(X_1 ... X_n | θ_1 ... θ_n) )                              (4.4)

≡ argmax_{X_1..X_n}  ∏_{i=1}^{n}  P_M(H_i) P_J(H_i | Θ)                            (4.5)
Since each hypothesis is estimated from a weighted response curve, it has a response
probability R(X_i) calculated directly from the image and a marginal probability
P(X_i) estimated from the learning data set. The image-based probability P_M() is
estimated by the combination of these two terms, see equation 4.6. The joint
probability distribution is approximated by cross-correlation weighted joint
distributions.
P_M(H_i) ≈ R(X_i) P(X_i)
P_J(H_i | Θ) ≈ ∏_{j≠i}  P(X_i, X_j | θ_i, θ_j)^{C(i,j)}                             (4.6)
X_i is the set of part i hypotheses and has discrete values (in our experiments, 3-6
hypotheses per part). θ_i is the model for part i. C_ij is the normalized cross-correlation
between parts i and j. Combining formula 4.6 with 4.5 gives us formula 4.7.
argmax_{X_1..X_n}  ∏_i  R(X_i) P(X_i)  ∏_{j≠i}  P(X_i, X_j | θ_i, θ_j)^{C_ij}        (4.7)
A close look at formula 4.7 suggests that thresholding C_ij to binary implements
the dependency-based histograms (thus their method can be treated as a special
case of our framework). Since each X_i has k possible values (k = number of
hypotheses), the maximization tries to find the best hypothesis for each part such
that the combination has the highest joint probability. In our experiments, the total
number of combinations was low (3^6 = 729), thus an optimization or approximation
of the search space, such as Monte Carlo sampling or particle filtering, was not
required. User interaction is handled by fixing the value of the user-selected part
(rather than letting it range over its discrete hypotheses), i.e. X_i is fixed.
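Because the number of hypothesis combinations is small, the maximization of formula 4.7 can be done by brute force. The sketch below works in log space, so the exponents C_ij become cross-correlation weights on pairwise log-likelihoods; the data structures and function names are illustrative assumptions, not the thesis implementation:

```python
import itertools
import math

def best_combination(hypotheses, image_prob, pair_prob, corr, fixed=None):
    """hypotheses[i]        : list of candidate values for part i
       image_prob[i][a]      : R(X_i) * P(X_i) for hypothesis a of part i
       pair_prob(i, j, xi, xj): P(X_i, X_j | theta_i, theta_j) looked up in a 2D histogram
       corr[i][j]            : normalized cross-correlation C_ij
       fixed                 : optional {part_index: hypothesis_index} from user clicks"""
    fixed = fixed or {}
    n = len(hypotheses)
    choices = [[fixed[i]] if i in fixed else range(len(hypotheses[i])) for i in range(n)]
    best, best_score = None, float("-inf")
    for combo in itertools.product(*choices):
        score = sum(math.log(image_prob[i][combo[i]]) for i in range(n))
        for i in range(n):
            for j in range(n):
                if i != j:
                    xi, xj = hypotheses[i][combo[i]], hypotheses[j][combo[j]]
                    score += corr[i][j] * math.log(pair_prob(i, j, xi, xj))
        if score > best_score:
            best, best_score = combo, score
    return best
```

The `fixed` argument mirrors the user interaction described above: once the user picks a hypothesis for one part, the search is repeated with that part held constant.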
4.4 3D Modeling
4.4.1 User Interface
The main goal of our research is to create quality 3D models of vehicles from a
single image with only a few simple user interactions. Our previous user interaction
technique involves precise mouse clicks and drags. Our current framework is even
simpler and involves only approximate mouse clicks. Verification is one of the
simplest forms of user interaction. In our modeler, if a part hypothesis is not
selected correctly, the user simply clicks on the image to choose another one of the
hypotheses of this part (the other hypotheses of a part are made visible when the user
moves the mouse onto the part). These image clicks can be approximate since they
are only used to choose the closest hypothesis. Once such an interaction is performed,
the best hypothesis combination estimation (formula 4.7) is repeated with the user
input as one of the constraints. Thus, the only user interaction ever needed by our
modeler is a few approximate image clicks.
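A sketch of this interaction (the click simply snaps to the nearest stored hypothesis; variable names are assumptions):

```python
import math

def select_hypothesis(click_xy, part_hypotheses):
    """Return the index of the hypothesis closest to the user's (approximate) click.

    part_hypotheses: list of (x, y) candidate locations for the part under the cursor."""
    return min(range(len(part_hypotheses)),
               key=lambda k: math.dist(click_xy, part_hypotheses[k]))
```

After this selection, the combination search of formula 4.7 is simply re-run with the chosen hypothesis held fixed, as in the `fixed` argument of the earlier sketch.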
4.4.2 Camera Parameters
The expected input to our modeler is a side-view vehicle image. Although this
eliminates the need for estimating a number of camera parameters, one rotation
angle, the camera center and two scale values still need to be estimated. Given the
locations of both tires in the image, these parameters can be easily calculated. The
camera center, in the single-view case, is arbitrary and the back tire is chosen to be
the center of the world in our implementation. In side-view images, the rotation
matrix is a function of only the angle between the tires. The X and Y scales are
estimated from the tire separation and the size of each tire, respectively. However,
not all tires are of the same size; popular tire sizes are 14", 15" and 19". This causes
a problem in the Y scaling and in particular in the values of P_3 and P_4 in formula
4.1 (middle and roof variables). On the other hand, one can argue that the error is
relatively small and it is not always possible to have correct length estimates from a
single view.
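As a hedged illustration of this calibration step (the nominal wheel diameter and wheel base below are assumptions introduced only to give the scales physical units; the thesis does not prescribe these constants):

```python
import math

def side_view_camera(front_tire, rear_tire, tire_px_diameter,
                     tire_diameter_m=0.65, wheel_base_m=2.7):
    """Rough single-view calibration from the two detected tires.

    front_tire, rear_tire : (x, y) tire centers in the image
    tire_px_diameter      : detected tire diameter in pixels"""
    dx = front_tire[0] - rear_tire[0]
    dy = front_tire[1] - rear_tire[1]
    rotation = math.atan2(dy, dx)                      # in-plane angle between the tires
    scale_x = math.hypot(dx, dy) / wheel_base_m        # pixels per meter along the vehicle
    scale_y = tire_px_diameter / tire_diameter_m       # pixels per meter vertically
    return {"world_origin": rear_tire, "rotation": rotation,
            "scale_x": scale_x, "scale_y": scale_y}
```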
4.4.3 3D Modeling from 2D Part Detections
Generating 3D models with only 8 known points is a challenging task. A straightforward
connection of these points would create a crude 3D model. Instead, an approach
similar to Sebe et al. [45,46] is used: the modeling-by-modification technique. Their
method allows the user to manipulate an already existing part-based 3D model.
Modifying an existing 3D model is a more effective way to model a known object
class than creating the model from scratch. The 2D part detections are converted
to 3D modifications by combining the camera parameters, the normal of the surface
in 3D, the projected location of the unaltered model on the image and the newly
entered user input. This 3D movement is distributed to the whole model by the
connectivity of parts. For example, if the roof is moved upwards, the windshields
and windows are stretched (since the doors and the rest of the model are not
neighbors of the roof, they are kept still). These interactions are automatically
extracted from the example 3D model. The details of this procedure can be found in
chapter 2. This allows the modeler to create quality 3D models from tens of user
inputs.
We combine the modification framework with our learning-aided part detection
and hypothesis selection to overcome the need for camera calibration and accurate
user inputs. In our framework, camera calibration is performed automatically from
the detection of the tires, see section 4.4.2. Furthermore, user inputs do not need to
be accurate since they are used for selecting the hypotheses (the closest hypothesis to
the user click is chosen), see section 4.4.1. In the previous framework, the user takes
around 10 seconds for each input, whereas inputs with part detectors can be given
in a couple of seconds. Moreover, the best hypothesis selection maximization (section
4.3.3) decreases the number of false assignments (on average 2-3 user corrections are
needed, in comparison to 10-15 inputs). Once all of the part-based detections and
user interactions are finalized, the results are passed to the modeler to make changes
to the 3D model. The analysis and results of this procedure are discussed in the next
section.
4.5 Results and Discussion
Training a JPDF often requires large datasets. However, our learning algorithm,
with its use of reduced histograms, can be trained with comparably smaller datasets.
Our dataset is composed of 125 hand-labeled side-view vehicle images; eight parts
are labeled in every image (a total of 1000 labels). This labeling procedure took a
total of three hours. Marginal PDFs are estimated by 1D histograms with 10 bins.
2D histograms have 25 bins (5 bins in each dimension). The cross-correlation matrix
is also estimated from these 125 images, see figure 4.5. The Adaboost Haar classifier,
which is used as the tire detector, is trained with 250 positive samples (2 tire samples
from each of the 125 images) and 6000 negative samples. Negative samples are
automatically extracted from the parts of the training images that the tires do not
occupy.

Figure 4.6: Two sample input images, (a) and (b).
Although there are several techniques to analyze learning algorithms, there is
no standard technique for the analysis of learning algorithms that have an interactive
component. We analyze the performance of our learning algorithm by comparing
it to other learning algorithms over a test set of 125 hand-labeled vehicle images
(totally different from the training set). This comparison is performed under three
criteria: regular error, percentage error and hypothesis match percentage. Regular
(pixel/degree) error is measured on images of size 1024x768 (see figure 4.6) and the
percentage error is the regular error divided by the total search space of the variable.
Search spaces are automatically estimated from the limits of the marginal PDFs. For
example, a 5° front windshield error would correspond to 17% if the search range is
30°. During the test of a particular level, the information of the lower (previous)
levels is assumed to be known. A crucial part of our framework is the user interaction,
thus the analyses are a function of the number of user interactions (which varies
between 0 [no interaction] and the number of parts). Level I and II test results are
shown in figure 4.10.
As can be seen from figure 4.10, our learning algorithm outperforms the other
learning algorithms in all cases. "Min" is the result of the hand-labeled system, i.e.
the best possible solution with this framework. The "Mean" technique optimizes each
parameter separately, i.e. diagonal cross-correlation and total independence. The
"NN" technique is a standard nearest neighbor hypothesis testing algorithm. "RH" is
our reduced histograms JPDF learning algorithm.

For level I, our learning algorithm with no user interaction has a 55% correct
hypothesis match (Figure 4.10a) with an average error of 12 pixels (Figure 4.10c).
With only a single user interaction, the hypothesis match jumps to 83% with an
average error of 6 pixels, where the minimum achievable error with this framework is
4 pixels. The total error becomes negligible with the second user input for level I.
Figure 4.7: (a) Generic model projected, before modeling. (b) Level I hypotheses;
the best hypothesis combination is in red. (c) Level I updated best selection after a
single user interaction. (d) Level II hypotheses. (e) Image with projected final 3D
model.
For level II, a similar conclusion can be made; the error disappears after the first
user interaction. This shows that the behavior of the different levels, although they
have different numbers and types of parts, is the same. An average of two user inputs
for level I and one for level II achieves an average hypothesis match of 89%, an
average pixel error of 7 pixels and an average degree error of 1.5°. These precise part
estimates are quite promising for creating quality 3D models.
The steps of an example modeling procedure are shown in figure 4.7. The
program starts with the detection of the tires and automatic camera calibration. After
this step the generic 3D car model can be projected onto the image, see figure 4.7 (a);
as can be seen, the 3D model and the image do not match before the modeling.
Later, multiple hypotheses for the level I parts are estimated and displayed (for
convenience, all part hypotheses are shown here; in the actual framework only one
part's hypotheses are shown at a time to ease interaction), see Figure 4.7 (b). The
best combination of these hypotheses is highlighted in red as a guide for user
interaction. For this particular vehicle, the first change is the correction of the middle
part. As mentioned in section 4.3.3, a change to one part can cause other parts to
change their hypotheses. In this case, the change to the middle causes the roof
hypothesis to change (for the better), see figure 4.7 (c). A second change is made
to the rear end, see figure 4.7 (d). Reflections and transparent windows are the
main reason for the faulty combination in the initial estimate. This procedure is
repeated for level II with different parts. No changes were needed for level II, see
figure 4.7 (c). Once the hypothesis selection step is completed, the generic 3D model
is modified to match the hypotheses. Figure 4.7 (e) shows how the 3D model matches
the image.

Figure 4.8: (a) 3D model with estimated vehicle color. (b) 3D model with texture
from the original camera position. (c) 3D model from an arbitrary viewing angle (not
all texture is available).
The 3D model of the minivan with the estimated color is shown in figure 4.8a.
Figure 4.8b shows the 3D model with the texture from the image (notice that the
background is clipped out). An arbitrary view of the 3D model with texture is shown
in figure 4.8c. For a complete texture map of the vehicle, a multi-view detection
system is being investigated as a part of our future work. The sample input image
from figure 4.6b is modeled and the results are shown in figure 4.9. This model
required only a single user interaction and was created in less than ten seconds.
The 3D model and the color of this vehicle were accurately estimated. Figure 4.9c
shows the vehicle from the opposite side of the original camera position, where the
texture is estimated using the symmetry of the vehicle. The use of symmetry decreases
the total number of pictures required for a full texture map.

Figure 4.9: (a) 3D model with estimated vehicle color. (b) 3D model with texture
from the original camera position. (c) 3D model from an arbitrary viewing angle (not
all texture is available).
4.6 Conclusion
In this chapter, a part-based vehicle modeler with a simple and easy-to-use interface
is presented. Our method is able to create quality 3D models from a single side-view
street scene image of a vehicle in around ten seconds. The use of part-specific detectors
and the combination of our reduced histograms learning algorithm with occasional
user interaction generate precise part estimates. Reduced histograms is a novel JPDF
learning algorithm that can work with few training samples, handle missing/fixed
dimensions and is computationally efficient. Part detections are converted into 3D
modifications over a part-based generic vehicle model. A variety of vehicles (SUVs,
minivans, sedans) are modeled with a few mouse clicks. As future work, a multi-view
version of our modeler is being investigated.
Figure 4.10: (a) Level I hypothesis match percentage, (b) Level II hypothesis match
percentage, (c) Level I pixel error, (d) Level I percentage error, (e) Level II pixel
error, (f) Level II angular error. Each plot compares RH (our reduced histograms),
NN, Mean and Min as a function of the number of user interactions.
Chapter 5
Conclusion and Future Work
Computer generated 3D models are extensively used in computer games, motion-picture
films, product design and many other industry areas. These models are often
created by engineers or professional artists. Although 3D models are commonly used
in industry, their use by novice computer users is non-existent.

In this thesis, we present an interactive 3D modeler that is easy to use and creates
quality 3D models in mere minutes. Most of the design choices of our modeler are
made by taking the following observations into account. Novice computer users
prefer 2D user interfaces to 3D user interfaces (e.g. 3D Studio Max and Maya), thus
our interface is strictly 2D, i.e. image-based. Photographs are an easy, inexpensive
and non-intrusive way of capturing and representing common objects. Alternative
methods, such as laser scanning, multi-camera capture and manual measurements,
do not satisfy one or more of these requirements. Our modeling approach requires
only a single image of the object. One of the main bottlenecks of many 3D modeling
approaches is the total amount of time spent to create 3D models. We propose a
technique that requires only a few minutes for arbitrary objects. In addition to these
choices, our approach creates 3D models by modifying a part-based example model
of the object class. Thus, our framework requires that the instances of the same
object class share a similar hierarchy and resemblance of parts, e.g. vehicles have
four wheels. A comparison of our modeler to alternative modeling approaches is
shown in table 5.1. As can be seen, alternative approaches are not suitable for our
ultimate goal in one or more areas.

Table 5.1: Comparison of our modeler to other modeling approaches

Criterion        | Our Modeler    | 3D Studio Max & Maya | Laser Based | Sketch Based | Arch. Modelers  | Multi View
3D Model Quality | Medium to High | High                 | Various     | Low          | Low             | Low to Medium
Texture Capable  | Yes            | Yes                  | No          | No           | Yes             | Yes
User Interface   | 2D             | 3D                   | 3D          | 2D           | 2D              | N/A
Input Level      | Part           | Vertex               | Vertex      | Stroke       | Vertex and Line | Vertex and Line
Manual Input     | Low            | High                 | N/A         | Low          | Medium          | N/A
Ease of Capture  | Easy           | N/A                  | Hard        | N/A          | Easy            | Medium
Object Type      | Several        | All                  | All         | All          | Buildings       | All
Although novice computer users prefer image-space inputs, back-projection of 2D
(image-space) vectors is under-constrained in 3D, i.e. there are infinitely many 3D
vectors that project to the same 2D vector. Resolving this ambiguity from a single
image is a challenging task. We present a novel technique to combine 2D user inputs,
camera parameters and the 3D example model to estimate a unique 3D vector. Our
conversion takes advantage of the human's ability to see 3D objects/vectors in 2D
images.
A general part-based modification technique is also presented in this thesis. Our
technique is applicable to any 3D model that is composed of parts. Inputs given
through one part are automatically distributed to the whole model using the
connectivity of the parts and the symmetry of the object. This distribution not only
speeds up the modeling procedure but also prevents artifacts due to inconsistent
inputs.
The present implementation of our part-based modeler is able to model vehicles
(SUVs, minivans, sedans, hatchbacks, etc.), furniture (couches, sofas, love seats)
and buildings (with and without a roof). Although the total modeling time is a
function of the amount of variance between the instance in the image and the example
model, objects are often modeled in two to three minutes. The main limitation is
that our modeler can only modify parts. This has two drawbacks: first, the output
model is detailed only up to the size of the parts, i.e. smaller details cannot be
modeled. However, increasing the number of parts in the example models will
increase the level-of-detail. Secondly, not every class of objects shares a common
hierarchy, thus such a generic model may not exist, e.g. vegetation and mountains.
Although our modeler has various limitations and is not applicable to every type
of object, it is fast, requires little user interaction and outputs quality 3D models
from a single un-calibrated image. Our main contributions to modeling are the
introduction of the part-level modeling paradigm, a novel technique for converting
2D inputs to unique 3D movements and the use of model-based modification rather
than creating models from scratch.
We have extended research in two major areas: video-based rendering and
automatic modeling. We use these extensions to show the versatility of our modeler.
Our first extension is called model-driven video-based rendering (ModVBR). ModVBR
creates photo-realistic 4D videos from a single stream by combining our rapid 3D
modeler with a model-aided tracker and pose estimation technique. VBR techniques
generally require long and tedious manual work. Our approach to VBR is to
interactively model the object class using the model-based modeler at one frame.
After modeling, a set of 2D image/3D model pairs are interactively extracted and
tracked over the video. Both the tracking and pose estimation techniques take
advantage of the existence of a 3D model. This creates a robust and accurate
processing of the video. This robustness comes from two sources. First, drifting
is prevented by re-projection of our 3D patches back onto the image after pose
estimation. Second, the use of 3D patches rather than regular 2D patches for tracking
removes the "background leakage" problem. The final rendering includes a simple 3D
environment model, the dynamic object's 3D model and extra rendering effects such
as shadows. ModVBR is used to increase the understanding of a scene by enabling
free-viewpoint visualization. ModVBR can also be used as a stand-alone program
for various kinds of special effects. For example, once the 3D model of the object is
created and tracked over the video, a virtual human (or any virtual object) can
interact with this object. To the best of our knowledge, ModVBR is the only system
that uses a single un-calibrated camera and is able to render novel views of dynamic
scenes in tens of minutes.
Our second extension is the integration of our part-based 3D modeler with our
learning-aided part detector. This part of our research focuses on the automatic
modeling of vehicles from a single image. First, multiple part hypotheses are created
by analyzing the input image. Later, the best combination of these hypotheses is
estimated using our reduced histograms learning algorithm. The reduced histograms
algorithm uses the cross-correlation weighted log-likelihood sum of lower dimensional
distributions rather than the full JPDF. Reduced histograms, in comparison to a full
JPDF, can be trained using a small set of training samples (hundreds). Furthermore,
the storage and computational complexity of reduced histograms is O(N^2), where N
is the number of parts. Part detection results are used not only for 3D modeling but
also for automatic camera calibration. Since our part-based modeler only requires 2D
vectors as inputs, the part detection results combined with the current locations of the
example model parts are enough for modeling. A variety of vehicles are modeled in
around ten seconds from a single un-calibrated image.
5.1 Future Work
Our part-based modeler, although capable as a standalone program, can benefit
from an automatic part split/merge capability. As mentioned earlier in the thesis,
the level-of-detail is a function of the number of parts. Thus, by allowing the user to
control the number of parts, the level-of-detail can also be controlled. Secondly, a
framework similar to Funkhouser et al. [16] can be incorporated into our framework
to add and remove parts of the model. This would remove one of the major
requirements of our modeler: the same-hierarchy requirement. Without this
requirement, a diverse set of object classes could be modeled using our approach.

Our part detectors are appearance-based, thus only side-view vehicles are allowed
in our current automatic modeling framework. An approach similar to multi-view
face detection can be added to our framework [21]. This will require the training
of the part detectors for the other views as well as an automatic pose detection
algorithm.
References
[1] S. Agarwal and D. Roth. Learning a sparse representation for object detection.
In European Conference on Computer Vision, pages 113–130, 2002.
[2] V. Blanz and T. Vetter. A morphable model for synthesis of 3d faces. In ACM
SIGGRAPH, volume 08, pages 187–194, 1999.
[3] J. Carranza, C. Theobalt, M. Magnor, and H. Seidel. Free viewpoint video of
human actors. In ACM SIGGRAPH, pages 569–577, 2003.
[4] K.M. Cheung, S. Baker, and T. Kanade. Shape-from-silhouette across time:
Part ii: Applications to human modeling and markerless motion tracking. In
International Journal of Computer Vision, volume 53/3, pages 225–245, 2005.
[5] R. Cipolla, T. Drummond, and D. Robertson. Camera calibration from van-
ishing points in images of architectural scenes. In British Machine Vision
Conference, volume 2, pages 382–391, 1999.
[6] M. Cohen. Everything by example. In Keynote Paper in Chinagraphics, 2000.
[7] T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein. Introduction to Al-
gorithms. The Art of Computer Programming. McGraw-Hill Book Company,
second edition, 2001.
[8] P. Debevec, G. Borshukov, and Y. You. Efficient view-dependent image-based
rendering with projective texture-mapping. In 9th Eurographics Rendering
Workshop, June 1998.
[9] P.E. Debevec, C.J. Taylor, and J. Malik. Modeling and rendering architecture
from photographs: A hybrid geometry- and image-based approach. Journal of
Computer Graphics, 30:11–20, 1996.
[10] A. Deshpande, M. Garofalakis, and R. Rastogi. Independence is good: Dependency-
based histogram synopses for high-dimensional data. In SIGMOD, pages
199–210, 2001.
[11] A.A. Efros and T.K. Leung. Texture synthesis by non-parametric sampling.
In International Conference on Computer Vision, pages 1033–1038, 1999.
[12] Alexei A. Efros and William T. Freeman. Image quilting for texture synthesis
and transfer. In ACM SIGGRAPH, pages 341–346, 2001.
[13] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsuper-
vised scale-invariant learning. In IEEE Conference on Computer Vision and
Pattern Recognition, volume 2, pages 264–271, 2003.
[14] D.A. Forsyth and J. Ponce. Computer Vision: A Modern Approach. Prentice
Hall, 2002.
[15] Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line
learning and an application to boosting. In European Conference on Compu-
tational Learning Theory, pages 23–37, 1995.
[16] T. Funkhouser, M. Kazhdan, P. Shilane, P. Min, W. Kiefer, A. Tal,
S. Rusinkiewicz, and D. Dobkin. Modeling by example. In ACM SIGGRAPH,
volume 23, pages 652–663, 2004.
[17] D. Hoiem, A.A. Efros, and M. Hebert. Automatic photo pop-up. In ACM
Transactions on Computer Graphics, volume 24, pages 577–584, 2005.
[18] J. Hu. Integrating complementary information for photorealistic representation
of large-scale environments. In USC PhD. Thesis, 2007.
[19] J. Hu, S. You, and U. Neumann. Texture painting from video. In WSCG: Computer
puter Graphics, Visualization and Computer Vision (Journal Papers), pages
119–125, 2005.
[20] D. Huber, A. Kapuria, R. Donamukkala, and M. Hebert. Part-based 3d object
classification. In IEEE Conference on Computer Vision and Pattern Recogni-
tion, volume 2, pages 143 –146, 2004.
[21] M.J. Jones and P. Viola. Fast multi-view face detection. In IEEE Conference
on Computer Vision and Pattern Recognition, 2003.
[22] T. Kadir and M. Brady. Saliency, scale and image description. In International
Journal of Computer Vision, volume 45, pages 83–105, 2001.
[23] T. Kanade, H. Saito, and S. Vedula. The 3d room: Digitizing time-varying 3d
events by synchronized multiple video streams. In Technical Report CMU-RI-
TR-98-34, 1998.
[24] S.B. Kang and H.Y. Shum. A review of image-based rendering techniques.
In IEEE/SPIE Visual Communications and Image Processing (VCIP), pages
2–13, 2000.
[25] D. Koller, K. Daniilidis, and H.H. Nagel. Model-based object tracking in
monocular image sequences of road traffic scenes. International Journal of
Computer Vision, 10(3):257–281, 1993.
[26] K. Kutulakos and S. Seitz. A theory of shape by space carving. International
journal of computer vision, 38(3):199–218, 1999.
[27] V. Kwatra, A. Schodl, I. Essa, and A. Bobick. Graphcut textures: Image and
video synthesis using graph cuts. ACM Transactions on Graphics, SIGGRAPH
2003, 22(3):277–286, July 2003.
[28] J. Lee, Y. Hu, and T. Selker. isphere: A proximity-based 3d input interface.
In CAAD Futures, 2005.
[29] D. Liebowitz, A. Criminisi, and A. Zisserman. Creating architectural models
from images. In Eurographics, pages 39–50, 1999.
[30] S. Liu and Z. Huang. Interactive 3d modeling using only one image. In Virtual
Reality Software and Technology, pages 49–54, 2000.
[31] D.G. Lowe. Three-dimensional object recognition from single two-dimensional
images. In Artificial Intelligence, volume 31, pages 355–395, 1987.
[32] D.G. Lowe. Distinctive image features from scale-invariant keypoints. In In-
ternational Journal of Computer Vision, volume 60, pages 91–110, 2004.
[33] B. Lueng. Component-based car detection in street scene images. In Mas-
sachusetts Institute of Technology Master Thesis, 2004.
[34] M. Magnor. Video Based Rendering. AK Peters, 2005.
[35] W. Matusik, C. Buehler, and L. McMillan. Polyhedral visual hulls for Real-
Time rendering. In Eurographics Workshop on Rendering, pages 115–126,
2001.
[36] W. Matusik, C. Buehler, R. Raskar, L. McMillan, and S.J. Gortler. Image-
based visual hulls. In ACM SIGGRAPH, pages 369–374, 2000.
[37] U. Neumann, S. You, J. Hu, B. Jiang, and I.O. Sebe. Visualizing reality
in an augmented virtual environment. Presence: Teleoperators and Virtual
Environments Journal, 13(2):222–233, 2004.
[38] C.P. Papageorgiou and T. Poggio. A trainable object detection system: Car
detection in static images. Technical Report AI-Memo-1673, CBCL-180, Mas-
sachusetts Institute of Technology Technical Paper, 1999.
[39] M. Pollefeys. 3d modeling from images. In Tutorial at European Conference
on Computer Vision, 2000.
[40] M. Pollefeys, L. Van Gool, M. Vergauwen, F. Verbiest, K. Cornelis, J. Tops,
and R. Koch. Visual modeling with a hand-held camera. International Journal
of Computer Vision, 59(3):207–232, 2004.
[41] J. Portilla and E.P. Simoncelli. A parametric texture model based on joint
statistics of complex wavelet coefficients. In International Journal of Computer
Vision, volume 40-1, pages 49–71, 2000.
[42] P.V. Sander, J. Snyder, S. Gortler, and H. Hoppe. Texture mapping progres-
sive meshes. In ACM SIGGRAPH, pages 409–416, 2001.
[43] H. Schneiderman and T. Kanade. A statistical method for 3d object detec-
tion applied to faces and cars. In IEEE Conference on Computer Vision and
Pattern Recognition, 2000.
[44] I.O. Sebe, P. Ramanathan, and B. Girod. Multi-view geometry estimation for
light field compression. In Visualization, Modeling and Vision, pages 265–272,
2002.
[45] I.O. Sebe, S. You, and U. Neumann. Rapid part-based 3d modeling. In ACM
Symposium on Virtual Reality Software and Technology (VRST), pages 143–
146, 2005.
[46] I.O. Sebe, S. You, and U. Neumann. Model-driven video-based rendering.
In IEEE CVPR Workshop on Three-Dimensional Cinematography, page 166,
2006.
[47] T.W. Sederberg and S.R. Parry. Free-form deformation of solid geometric mod-
els. In ACM SIGGRAPH, pages 151–159, 1986.
[48] J. Shi and C. Tomasi. Good features to track. In IEEE Conference on Com-
puter Vision and Pattern Recognition, 1994.
[49] H.Y. Shum, S.C. Chan, and S.B. Kang. Image Based Rendering. Springer,
2006.
[50] A. Stork and M. Maidhof. Efficient and precise solid modelling using a 3d input
device. In ACM symposium on Solid and Physical Modeling, pages 181–194,
1997.
[51] R. Szeliski. From images to models (and beyond): a personal retrospective.
In Vision Interface, pages 126–137, 1997.
[52] F. Ulupinar and R. Nevatia. Shape from contour: Straight homogeneous gen-
eralized cylinders and constant cross-section generalized cylinders. In IEEE
Trans. on Pattern Analysis and Machine Intelligence, volume 17, pages 120–
135, 1995.
[53] P. Viola and M. Jones. Robust real-time object detection. International Jour-
nal of Computer Vision, 2002.
[54] http://www.canoma.com/.
[55] http://sketchup.google.com.
[56] http://imagemodeler.realviz.com/.
[57] http://www-ui.is.s.u-tokyo.ac.jp/~takeo/teddy/teddy.htm.
[58] L.Y. Wei and M. Levoy. Texture synthesis over arbitrary manifold surfaces.
In ACM SIGGRAPH, pages 355–360, 2001.
[59] B. Wilburn, M. Smulski, H.H.K. Lee, and M. Horowitz. The light field video
camera. In SPIE Electronic Imaging, 2002.
[60] K.Y.K. Wong, P.R.S. Mendonca, and R. Cipolla. Reconstruction of surfaces
of revolution from single uncalibrated views. In British Machine Vision Con-
ference, 2002.
[61] B. Wu and R. Nevatia. Detection of multiple, partially occluded humans in
a single image by bayesian combination of edgelet part detectors. In Interna-
tional Conference on Computer Vision, volume 1, pages 90–97, 2005.
[62] B. Wu and R. Nevatia. Simultaneous object detection and segmentation by
boosting local shape feature based classifier. In IEEE Conference on Computer
Vision and Pattern Recognition, pages 1–8, 2007.
[63] Y. Xu, S.C. Zhu, B. Guo, and H.Y. Shum. Asymptotically admissible texture
synthesis. In Second International Workshop of Statistical and Computational
Theories of Vision, 2001.
[64] C. Yang, D. Sharon, and M. van de Panne. Sketch-based modeling of param-
eterized objects. In ACM SIGGRAPH Sketch, page 89, 2005.
[65] R.C. Zeleznik, K.P. Herndon, and J.F. Hughes. Sketch: An interface for sketch-
ing 3d scenes. In ACM SIGGRAPH, pages 163–170, 1996.