Rapid Creation of Photorealistic Virtual Reality Content
with Consumer Depth Cameras
Chih-Fan Chen
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
May 2019
Copyright © 2018 Chih-Fan Chen
PUBLISHED BY UNIVERSITY OF SOUTHERN CALIFORNIA
http://www.usc.edu
First Printing, October 2018
ABSTRACT
High-fidelity virtual content is essential for the creation of compelling and
effective virtual reality (VR) experiences. However, creating photorealistic
content is not easy, and handcrafting detailed 3D models can be time and la-
bor intensive. Structured camera arrays, such as light-stages, can scan and
reconstruct high-fidelity virtual models, but the expense makes this technol-
ogy impractical for most users.
In this dissertation, we first present a complete end-to-end pipeline for the
capture, processing, and rendering of view-dependent 3D models in virtual
reality from a single consumer-grade depth camera. To achieve photorealistic results, a view-dependent texture mapping (VDTM) method is used for real-time rendering to preserve specular reflections and light-burst effects. To further improve the overall visual quality, we then present a new VDTM approach, dynamic omnidirectional texture synthesis (DOTS), that synthesizes a high-resolution view-dependent texture map for any virtual camera location.
Synthetic textures are generated by uniformly sampling a spherical virtual
camera set surrounding the virtual object, thereby enabling efficient real-time
rendering for all potential viewing directions. The results showed that DOTS was rated as having superior visual quality compared to fixed textures and traditional VDTM, with DOTS producing smoother dynamic transitions between viewpoints. Although DOTS can produce promising textures at all viewpoints, post-editing such as relighting is still not possible within a VDTM framework, because all the lighting effects are baked into the texture. We present a novel
approach for estimating the illumination and reflectance properties of virtual
objects captured using consumer-grade RGB-D cameras. This method is im-
plemented within a fully automatic content creation pipeline that can generate
photorealistic objects suitable for integration with virtual reality scenes with
dynamic lighting. The geometry of the target object is first reconstructed
from depth images captured using a handheld camera. To get nearly drift-
free texture maps of the virtual object, a set of selected images from the orig-
inal color stream is used for camera pose optimization. Our approach further separates these images into a diffuse (i.e., view-independent) component and a specular (i.e., view-dependent) component using low-rank decomposition.
The lighting conditions during capture and reflectance properties of the vir-
tual object are subsequently estimated from the computed specular maps. By
combining these parameters with the diffuse texture, the reconstructed object
can then be rendered in a real-time scene that plausibly replicates the real
world illumination or with arbitrary lighting that varies in direction, intensity,
and color.
CONTENTS
I THE PREAMBLE 12
1 INTRODUCTION 13
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2.1 Rapid Photorealistic Content Creation (Offline Stage) 15
1.2.2 Good Virtual Reality Experience (Online Stage) . . . 16
1.3 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.4 Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . 18
2 BACKGROUND 20
2.1 Camera Model . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2 3D Graphics and Rendering . . . . . . . . . . . . . . . . . . 21
2.3 3D Reconstruction for RGBD sensor . . . . . . . . . . . . . 22
2.4 Geometry Reconstruction . . . . . . . . . . . . . . . . . . . 22
2.5 Texture Reconstruction . . . . . . . . . . . . . . . . . . . . 22
2.5.1 Image-Based methods . . . . . . . . . . . . . . . . 23
2.5.2 Hybrid methods . . . . . . . . . . . . . . . . . . . . 24
2.6 Reflectance Reconstruction . . . . . . . . . . . . . . . . . . 24
2.7 Position of this thesis . . . . . . . . . . . . . . . . . . . . . 25
II CREATION OF PHOTOREALISTIC VIRTUAL REALITY CONTENT 27
3 VIEW-DEPENDENT VIRTUAL REALITY CONTENT FROM RGB-D IMAGES 28
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2 System Overview . . . . . . . . . . . . . . . . . . . . . . . 28
3.3 3D Reconstruction and Camera Trajectory Optimization . . 30
3.3.1 Geometry Reconstruction . . . . . . . . . . . . . . 30
3.3.2 Key Frame Selection . . . . . . . . . . . . . . . . . 30
3.3.3 Camera Trajectory Optimization . . . . . . . . . . . 32
3.4 Real-time Texture Rendering . . . . . . . . . . . . . . . . . 33
3.4.1 Pre-processing . . . . . . . . . . . . . . . . . . . . 33
3.4.2 Render Image Selection . . . . . . . . . . . . . . . 33
3.4.3 View-Dependent Texture Mapping . . . . . . . . . . 33
3.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . 34
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4 DYNAMIC OMNIDIRECTIONAL TEXTURE SYNTHESIS FOR PHOTOREALISTIC VIRTUAL CONTENT CREATION 39
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2.1 Open Problems of VDTM . . . . . . . . . . . . . . 40
4.2.2 Overall Process . . . . . . . . . . . . . . . . . . . . 40
4.2.3 Geometric Reconstruction . . . . . . . . . . . . . . 41
4.2.4 Global Texture Generation . . . . . . . . . . . . . . 41
4.3 Synthetic Texture Map Generation and Rendering . . . . . . 42
4.3.1 Parameters Used in Image Uniqueness Selection . . 43
4.3.2 Objective and Optimization . . . . . . . . . . . . . 43
4.3.3 Real-time Image-based Rendering . . . . . . . . . . 46
4.4 User Study and Experiment Results . . . . . . . . . . . . . 46
4.4.1 Data Sets . . . . . . . . . . . . . . . . . . . . . . . 46
4.4.2 Study Design . . . . . . . . . . . . . . . . . . . . . 47
4.4.3 Individual Rating Results . . . . . . . . . . . . . . . 49
4.4.4 Pair-wise Ranking Results . . . . . . . . . . . . . . 50
4.4.5 Visual Analysis . . . . . . . . . . . . . . . . . . . . 51
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5 RECONSTRUCT VIRTUAL CONTENT REFLECTANCE WITH CONSUMER-GRADE DEPTH CAMERAS 56
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.2 Overview and Preprocessing . . . . . . . . . . . . . . . . . 57
5.2.1 Overall Process . . . . . . . . . . . . . . . . . . . . 57
5.2.2 Geometry Reconstruction . . . . . . . . . . . . . . 58
5.2.3 Key Frame Selection . . . . . . . . . . . . . . . . . 58
5.3 Diffuse and Specular Map Separation . . . . . . . . . . . . 59
5.3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . 59
5.3.2 Camera Pose Optimization . . . . . . . . . . . . . . 59
5.3.3 Low-Rank Decomposition . . . . . . . . . . . . . . 60
5.3.4 Texture Separation and Pose Estimation . . . . . . . 60
5.4 Material and Lighting Estimation . . . . . . . . . . . . . . . 64
5.4.1 Phong Reflection Model . . . . . . . . . . . . . . . 64
5.4.2 Objective Function . . . . . . . . . . . . . . . . . . 64
5.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.5.1 Visual Analysis . . . . . . . . . . . . . . . . . . . . 70
5.5.2 Dynamic Relighting . . . . . . . . . . . . . . . . . 74
5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 78
III FURTHER ADVANCEMENTS 79
6 RESEARCH AND DEVELOPMENT GUIDELINES FOR FUTURE WORK 80
6.1 Capture Improvements . . . . . . . . . . . . . . . . . . . . 80
6.1.1 Geometry Reconstruction . . . . . . . . . . . . . . 82
6.1.2 Texture Improvement . . . . . . . . . . . . . . . . . 82
6.1.3 More Control of the Capture Environment . . . . . . 83
6.1.4 Guidance of Capturing . . . . . . . . . . . . . . . . 83
6.1.5 More Challenging Cases . . . . . . . . . . . . . . . 84
6.2 Content Creation Pipeline Advancement . . . . . . . . . . . 85
6.2.1 Joint Optimization of Geometry and Texture . . . . 85
6.2.2 Self-inter-reflections Assumption Relaxation . . . . 85
6.2.3 Source of Lights Estimation . . . . . . . . . . . . . 85
6.2.4 Categorize Material of Object . . . . . . . . . . . . 86
6.2.5 Adding More Parameters (Freedom) in the Color Model 86
6.3 Virtual Reality Experience Enhancement . . . . . . . . . . . 86
6.3.1 Real-time Editing . . . . . . . . . . . . . . . . . . . 86
6.3.2 Replaced by Texture Map . . . . . . . . . . . . . . 87
6.3.3 User Study . . . . . . . . . . . . . . . . . . . . . . 87
6.4 More than Virtual Reality . . . . . . . . . . . . . . . . . . . 88
6.4.1 Extended our method to AR/MR . . . . . . . . . . . 88
6.4.2 Build A Database for Virtual Content . . . . . . . . 88
IV APPENDIX 94
Appendix A COLOR MAPPING OPTIMIZATION 95
A.1 Objective Function . . . . . . . . . . . . . . . . . . . . . . 95
A.2 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 95
Appendix B LOW-RANK MATRIX DECOMPOSITION 98
B.1 Objective Function . . . . . . . . . . . . . . . . . . . . . . 98
Appendix C MATERIAL ESTIMATION 99
C.1 Objective Function . . . . . . . . . . . . . . . . . . . . . . 99
C.1.1 Inputs . . . . . . . . . . . . . . . . . . . . . . . . . 99
C.1.2 Outputs . . . . . . . . . . . . . . . . . . . . . . . . 99
C.1.3 Material Property and Light Color Estimation . . . . 99
LIST OF FIGURES
Figure 1 Reconstructing the virtual reality content of a target physical object . . . . . . . . . . . . . . . . 14
Figure 2 The rendering problem in VDTM. The image in the middle is interpolated from the left and the right images. The rendering result has visual artifacts. . . . 15
Figure 3 Thesis overview . . . . . . . . . . . . . . . . . . . 19
Figure 4 The pinhole camera . . . . . . . . . . . . . . . . . 20
Figure 5 Three different rendering method . . . . . . . . . . 21
Figure 6 (a) A 3D model reconstructed from images captured
with a single consumer depth camera. (b) The tex-
tured model generated using the traditional approach
of blending color images. (c) Our approach is ca-
pable of rendering view-dependent textures in real-
time based on the user’s head position, thereby pro-
viding finer texture detail and better-replicating il-
lumination changes and specular reflections that be-
come especially noticeable when observed from dif-
ferent viewpoints in virtual reality. . . . . . . . . . 29
Figure 7 An overview of the complete pipeline, consisting of
two phases: (1) offline processing/optimization, and
(2) online view-dependent texture rendering. . . . . 30
Figure 8 Comparison of the 3D geometry reconstructed using
structure from motion (SfM) with color images vs.
KinectFusion with depth images. . . . . . . . . . . 31
Figure 9 An example of the textured model from (a) captured
camera view and (b) from a previously un-captured
viewing direction (optimized camera pose in blue
rectangle), (c) results of iteration 0, 10, 20, 30 re-
spectively. Note that vertices shown in red are invisible in the image. . . . . . . . . . . . . . . . . 32
Figure 10 A view-dependent texture rendered using three source
images (left) compared with a texture generated us-
ing a single image (right). . . . . . . . . . . . . . 34
Figure 11 (top left) An image of a real object captured using a
Kinect v1. (middle left) The untextured 3D model.
(bottom left) The model with a fixed texture gener-
ated from blending the source images. (right) The
model with view-dependent texture rendered from
several viewpoints. Note the flame from the candle
within the object. . . . . . . . . . . . . . . . . . . 35
Figure 12 The results of our pipeline using the same object
captured with two consumer RGB-D sensors. . . . 35
Figure 13 (1st column) Untextured 3D models. (2nd column)
Fixed textures generated without optimization. (3rd
column) Fixed textures generated with optimization.
(last 3 columns) The final output from our pipeline
with view-dependent textures shown from three dif-
ferent viewpoints. . . . . . . . . . . . . . . . . . . 37
Figure 14 Overview of the DOTS content creation pipeline.
Color and depth image streams are captured from
a single RGB-D camera. The geometry is recon-
structed from depth information and is then used to
uniformly sample a set of virtual camera poses sur-
rounding the virtual object. For each camera pose,
a synthetic texture map is blended from the global
texture and local color images captured near the cur-
rent location. The synthetic texture maps are then
used to dynamically render the object in real-time
based on the user’s current viewpoint in virtual reality. 40
Figure 15 Visualization of a typical hand-held RGB-D capture
sequence showing non-uniform spatial and tempo-
ral distribution. The blue line shows the estimated
trajectory from the input depth stream. The green
line shows the trajectory projected from the blue
line onto a sphere surrounding the object. Red cir-
cles indicate the position of the selected keyframes
for VDTM. The spatial regions covered by the cam-
era’s trajectory are shown in the upper right image.
Time spent in each region is displayed in the lower
left image with brighter shades indicating longer du-
ration. . . . . . . . . . . . . . . . . . . . . . . . . 41
Figure 16 (a) The synthetic texture map is blended from the lo-
cal texture (green rectangle) and the global texture
(red rectangle). (b) It was not possible to completely
cover entire the 3D model using only the local tex-
ture information. (c) By blending with the global
texture, complete coverage is obtained. . . . . . . . 45
Figure 17 Comparison of an input image captured from the
RGB-D camera (left) with a high-resolution syn-
thetic texture map generated using the proposed method
(right). . . . . . . . . . . . . . . . . . . . . . . . . 46
Figure 18 (a) Four virtual objects used in our study. From
left to right respectively: a female sculpture, The
Kiss, Jean d’Aire, and an antique leather chair. (b)
The virtual environment and questionnaire user in-
terfaces (individual ratings on the left, pair-wise com-
parisons on the right). Participants could use the ro-
tation slider to rotate the model around the vertical
axis. The bottom two sliders were used to gather
feedback from participants. . . . . . . . . . . . . . 48
Figure 19 Results for the individual ratings of visual quality
and transition smoothness. The graph indicates the
median and interquartile range for each experimen-
tal condition. . . . . . . . . . . . . . . . . . . . . 50
Figure 20 Summary of pair-wise rankings comparing DOTS
with the other two methods according to visual tex-
ture quality and transition smoothness between view-
points. . . . . . . . . . . . . . . . . . . . . . . . . 51
Figure 21 (a) The key frames selected by VDTM are not uni-
formly distributed around the 3D model because they
are dependent upon the camera trajectory during ob-
ject capture. Thus, this leads to an irregular trian-
gulation (red) and undesirable visual artifacts. (b)
In contrast, the synthetic maps generated by DOTS
cover all potential viewing directions and the tri-
angulation is uniform, resulting in seamless view-
dependent textures. . . . . . . . . . . . . . . . . . 52
Figure 22 Rendering results from viewpoints that were not cov-
ered during object capture (e.g. a top-down view
of the virtual objects). (left) The model is rendered
incorrectly by VDTM. For example, the specular highlights on the female sculpture's shoulder and chest conflict with each other. Blending textures from distant camera positions results in noticeable color discontinuities on the base of The Kiss and the back
of the chair. (right) The model rendered by DOTS
still presents reasonable visual quality even though
the viewpoint was never observed in the capture se-
quence. . . . . . . . . . . . . . . . . . . . . . . . 53
Figure 23 (top) A 3D model rendered using per-vertex color.
(bottom) A reduced polygon mesh rendered using a
UV map. From left to right: the untextured geomet-
ric model, fixed texture mapping [Zhou and Koltun,
2014], VDTM [Chen et al., 2017], and DOTS. . . . 54
Figure 24 Appearance comparison results for each virtual ob-
ject used in our study. The top row shows three
example images from the original video. The second row shows the geometry model and the results from VDTM [Chen et al., 2017] (in blue rectangle); note that VDTM can achieve good visual quality (leftmost). The third row shows the fixed texture [Zhou and Koltun, 2014] (in red rectangle) and the results of DOTS (in green rectangle). . . . . . . . . 55
Figure 25 (Left) Several key frames are first selected from a
given RGB-D sequence. The diffuse components
and the specular components can be obtained by si-
multaneously optimizing the camera pose of each
frame and low-rank decomposition. The material
property of the virtual object and the lighting condi-
tions are estimated from the specular components.
At run-time, the specularity of a virtual object is
computed based on the user viewpoint and com-
bined with the diffuse map (i.e., view-independent
component) for final results. (Right) The compari-
son for an unselected frame and the corresponding
rendering result. Note that our approach is able to
generate plausible results for unseen views that were
not present in the captured dataset. . . . . . . . . . 57
Figure 26 Overview of our content creation pipeline. Color
and depth image streams are captured from a sin-
gle RGB-D camera. The geometry is reconstructed
from depth information. In section 4, each selected
frame is separated into diffuse and specular maps by
low-rank decomposition and camera pose optimiza-
tion. In section 5, lighting condition and material
property are obtained from the specular maps. The
texture of the dynamically relightable reconstructed
object is rendered in real-time based on the user’s
current viewpoint. . . . . . . . . . . . . . . . . . . 58
Figure 27 Separation of selected frames into a diffuse map and
a specular map. The original texture of selected
frames (top row) are projected to the camera pose
of the left frame (second row) to form a data matrix
A. Low-rank decomposition is able to separate A
into a diffuse matrix (third row) and a specular map
(fourth row). . . . . . . . . . . . . . . . . . . . . 61
Figure 28 An example of the texture synthesis function Y. (a)
and (b) show a reference image with a known cam-
era pose (red). This image is used to render the
model and the other two camera views (blue and
green) are projected to its camera view. (c) The syn-
thesized results of the three camera views. Note that
the occluded area (i.e., not visible in (a)) is painted
in gray for better visualization. . . . . . . . . . . . 62
Figure 29 (Left) An example of a selected frame and its sur-
face normal. (Middle) The diffuse map is unable to
recover the lighting condition. (Right) In contrast,
the specular map is sparse and the lighting direction can be roughly assumed to come from the top. Using this assumption across all selected images, we can approx-
imately calculate the number of lights and their di-
rections for reflectance estimation. . . . . . . . . . 65
Figure 30 The results of possible incident light direction. Note
that each highlight pixel votes for a possible direc-
tion. Mean-shift algorithm is applied to get the num-
ber of clustering and their center. . . . . . . . . . . 66
Figure 31 Visualization of the estimated K_s and a. Note that a darker area means a smaller estimated value of each coefficient. . . . . . . . . . . . . . . . . . . . 67
Figure 32 Comparison of the rendered virtual object (top row)
and original capture frames (bottom row) for two
different objects. The highlights on the virtual sphere visualize the direction and color of the virtual lights estimated from the real-world scene. . . . . . . . . 68
Figure 33 Representative real world images for three captured
objects. From left to right: Torso of Elevation, The
Kiss, and an antique leather chair. . . . . . . . . . 69
Figure 34 Comparison of the diffuse appearance using the av-
erage vertex color from the original images (i.e.,
w/o low-rank decomposition) and the derived dif-
fuse maps (i.e., w/ low-rank decomposition). Note
that our approach can remove most of the specular
highlights such as the shoulder area on the Torso
and the self inter-reflection on the seat of the an-
tique chair. Moreover, by removing the highlights,
the details from the original image can be correctly
preserved in the diffuse map. . . . . . . . . . . . . 70
Figure 35 (a) Without camera pose optimization, the rendered
texture is incorrect due to the inaccurate initial tra-
jectory obtained from Kinect Fusion. (b) (Left) The
incorrect rendering will decrease the quality of dif-
fuse and specular maps. Moreover, the generated
specular maps failed to recover the material prop-
erty and lighting estimation. (Right) With camera
pose optimization, the proposed texture separation
method correctly divided the image into diffuse and
specular components. . . . . . . . . . . . . . . . . 71
Figure 36 Comparison of our method with fixed textures [Zhou
and Koltun, 2014] and DOTS [Chen and Suma Rosen-
berg, 2018]. Note that the fixed texture results in
lower fidelity (blurriness) due to averaging the ob-
served images. DOTS or other view-dependent tex-
ture mapping methods are able to generate photo-
realistic rendering results. However, the specular
highlights (bottom row) of an unseen view cannot
be correctly interpolated from the source images. In
contrast, our method estimated the light sources and
the specular reflectance properties of the object and
is able to synthesize the highlights of unseen views. 73
Figure 35 Demonstration of two reconstructed virtual objects
(The Kiss and Torso of Elevation) in a virtual scene
with dynamic illumination. The proposed reflectance
estimation method can provide plausible results with
varying virtual light direction, color, and intensity.
Furthermore, the specular highlights on the virtual
object will smoothly change in real-time as the user
moves between different viewpoints. A virtual sphere
is also displayed to more easily visualize the color
and the direction of virtual lights. . . . . . . . . . . 77
Figure 36 Thesis Overview Revisit . . . . . . . . . . . . . . 81
Figure 37 One example of giving guidance to capture a good
RGBD video as an input of our content creation
pipeline. . . . . . . . . . . . . . . . . . . . . . . . 84
LIST OF TABLES
Table 1 Information about the 3D models and images used
in different examples. . . . . . . . . . . . . . . . . 36
Table 2 Information about the 3D models and images used
in different examples. . . . . . . . . . . . . . . . . 47
Part I
THE PREAMBLE
1 INTRODUCTION
1.1 MOTIVATION
With the recent proliferation of high-fidelity head-mounted displays (HMDs),
there is an increasing demand for realistic 3D content that can be integrated
into virtual reality environments. However, creating photorealistic models is
not only difficult but also time-consuming. A simpler alternative involves
scanning objects in the real world and rendering their digitized counterpart
in the virtual world. Capturing objects can be achieved by performing a 3D
scan using widely available consumer-grade RGB-D cameras. This process
involves reconstructing the geometric model from depth images generated us-
ing a structured light or time-of-flight sensor (Figure 1 (a)). Reconstructing the geometry of objects with consumer-grade RGB-D cameras has been an extensive research topic, and many techniques have been developed [Izadi et al., 2011, Whelan et al., 2012, Newcombe et al., 2015] with promising results. The color map is determined by fusing data from multiple color images captured during the scan. However, fusing data from color frames to replicate the appearance of reconstructed objects remains an open problem.
methods (e.g., [Zhou and Koltun, 2014, Bi et al., 2017]) compute the color
of each vertex by averaging the colors from all captured images. Blending
colors in this manner results in lower fidelity textures that appear blurry espe-
cially for objects with non-Lambertian surfaces. Other methods (e.g., [Choi et al., 2016, Waechter et al., 2014]) select a single view per face to achieve higher color fidelity. However, these approaches yield textures with fixed lighting baked onto the model. This limitation becomes especially apparent in head-tracked virtual reality: because the surface illumination (e.g., specular reflections) does not change appropriately with the user's viewpoint and physical movements, the reconstructed model looks noticeably wrong when viewed from a different direction.
To improve color fidelity, techniques such as View-Dependent Texture Map-
ping (VDTM) have been introduced in [Debevec et al., 1998, Nakashima
et al., 2015] and also in our approach [Chen et al., 2017] in Chapter 3. In this
approach, the texture is dynamically updated in real-time using a subset of im-
ages closest to the current virtual camera position (Figure 1 (b)). Although
these methods typically result in improved visual quality, the dynamic transi-
tion between viewpoints is potentially problematic, especially for objects cap-
tured using consumer RGB-D cameras. This is because the input sequences often cover only a limited range of viewing directions, and some frames may only partially capture the target object. In Chapter 4, we propose dynamic omnidirectional texture synthesis (DOTS) in order to improve
the smoothness of viewpoint transitions while maintaining the visual quality
provided by VDTM techniques. Given a target virtual camera pose, DOTS
is able to synthesize a high-resolution texture map from the input stream of
color images. The proposed objective function simultaneously optimizes the
synthetic texture map, the camera poses of selected frames with respect to the
virtual camera, and the pre-computed global texture. Furthermore, instead of
using traditional spatial/temporal selection, DOTS uniformly samples a spher-
ical set of virtual camera poses surrounding the reconstructed object. This re-
sults in a well-structured triangulation of synthetic texture maps that provide
omnidirectional coverage of the virtual object, thereby leading to improved
visual quality and smoother transitions between viewpoints.
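The dissertation does not specify here how the spherical set of virtual camera poses is sampled; purely as an illustration, the Python sketch below uses a Fibonacci lattice, one common way to place roughly uniform camera positions on a sphere around an object. The function name, radius, and count are illustrative assumptions, not values from this work.

```python
import numpy as np

def fibonacci_sphere_positions(n: int, radius: float = 1.0) -> np.ndarray:
    """Return n roughly uniform points on a sphere of the given radius
    (Fibonacci lattice). Each point can serve as a virtual camera position
    looking toward the sphere center."""
    i = np.arange(n)
    golden = (1.0 + 5.0 ** 0.5) / 2.0
    z = 1.0 - 2.0 * (i + 0.5) / n             # uniform spacing in z
    theta = 2.0 * np.pi * i / golden           # golden-angle spacing in azimuth
    r_xy = np.sqrt(np.clip(1.0 - z * z, 0.0, 1.0))
    pts = np.stack([r_xy * np.cos(theta), r_xy * np.sin(theta), z], axis=1)
    return radius * pts

# Example: 100 virtual camera positions on a sphere of radius 1.5
cams = fibonacci_sphere_positions(100, radius=1.5)
print(cams.shape)  # (100, 3)
```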
Figure 1.: Reconstructing the virtual reality content of a target physical object. (a) Reconstructed model. (b) Appearance from different viewing directions.
Although VDTM approaches typically result in improved visual quality, the dynamic transition of specular reflections between viewpoints is potentially problematic. Since specular reflections have a strong non-local effect and depend strongly on the geometry and lighting conditions, the synthetic views created by interpolating between color images may exhibit visual artifacts, as shown in Figure 2. In Chapter 5, we propose a capture-to-rendering content
creation pipeline to estimate optimized diffuse and specular reflectance and the lighting conditions. Our novel low-rank decomposition simultaneously optimizes the camera pose of each frame and separates the frame into a diffuse map and a specular map. Based on the specular maps from different viewing directions, the material properties and the lighting conditions can be estimated. At run-time, the estimated diffuse and specular reflectance are used to synthesize the captured object from an unseen view.
Figure 2.: The rendering problem in VDTM. The image in the middle is interpolated from the left and the right images. The rendering result has visual artifacts.
In this dissertation, we first propose an end-to-end pipeline that can create photorealistic virtual content from an RGB-D sequence. A novel method for smoother transitions between viewpoints is then proposed for a better virtual reality experience. Finally, the material of the virtual object is estimated for accurate specularity replication.
1.2 GOALS
Our main objective is to provide an end-to-end pipeline for virtual content replication. The pipeline can be separated into two parts: an offline process and an online rendering stage. For the offline process, the goal is to rapidly create a photorealistic virtual replica of a target physical object; for the online stage, our objective is to provide a satisfying virtual reality experience of the reconstructed model. To this end, the designed method has to meet the following goals:
1.2.1 Rapid Photorealistic Content Creation (Offline Stage)
To create virtual content rapidly, the processing time for reconstructing a single object needs to be fast, but the repeatability of the whole reconstruction process is an even more important factor. Requiring extensive expert knowledge or additional devices lengthens the setup time and reduces how rapidly the method can be applied.
CONSUMER-GRADE PORTABLE DEVICE To obtain high-fidelity models, [Einarsson et al., 2006] used a light stage to capture the target object under different lighting conditions, and [Bolas et al., 2015] used a turntable to capture a ring of images at 1-degree increments. However, these specialized devices are not ideal for rapid creation because they are expensive and impractical for most users. Instead, we use only a consumer-grade RGB-D sensor (i.e., Kinect) for virtual content creation. Because the required device is low-cost and easily obtained, virtual content creation becomes accessible to every user.
MINIMIZED USER INPUT To improve texture quality, some works require additional devices alongside the RGB-D sensor. [Hedman et al., 2016] used a DSLR camera to capture high-resolution images for rendering, and [Wu et al., 2016b] requires additional infrared LEDs and an IR camera to estimate the material and lighting. However, calibration between sensors is tedious and complicated for users. Thus, our approach takes only an RGB-D video sequence as input, minimizing the need for expert knowledge and enabling rapid creation.
PHOTOREALISTIC QUALITY [Zhou and Koltun, 2014, Jeon et al., 2016, Bi et al., 2017] compute the color of each vertex by averaging the colors from all captured images. Blending colors in this manner results in lower-fidelity textures that appear blurry for objects with non-Lambertian surfaces. Furthermore, this approach also yields textures with fixed lighting baked onto the model. To improve color fidelity, we use a View-Dependent Texture Mapping (VDTM) approach to achieve photorealistic texture rendering.
1.2.2 Good Virtual Reality Experience (Online Stage)
REAL-TIME VIRTUAL REALITY EXPERIENCE VDTM [Debevec et al., 1998, Bastian et al., 2010, Nakashima et al., 2015, Hedman et al., 2016] is a well-known technique for photorealistic rendering. However, its reported frame rate is still under 10 frames per second (FPS), far from the needs of a smooth cinematic virtual reality (VR) experience (i.e., at least 60 FPS). Using the depth information for per-vertex visibility checks, the performance of our system is usually around 90-100 FPS, which is sufficient for a smooth VR experience.
FREE EXPLORATION IN VR Although existing VDTM methods typically result in improved visual quality, the dynamic transition between viewpoints is potentially problematic, especially for objects captured using consumer RGB-D cameras. This is because the input sequences often cover only a limited range of viewing directions, and some frames may only partially capture the target object.
POST-EDITING OR RELIGHTING Due to the nature of image-based rendering techniques, real-world illumination is baked into the texture map. Placing the reconstructed object into a different lighting environment therefore looks incongruous, so the virtual object should adapt to the new lighting.
1.3 CONTRIBUTION
Creating virtual content from scratch is time-consuming and usually requires expert knowledge of computer-aided design software. To ease the creation process, we establish, step by step, a capture-to-rendering pipeline that achieves the goals listed in the previous section. In Chapter 3, we first build a system that automatically generates view-dependent texture mapping for high-fidelity rendering. The frame rate of our method is more than 100 FPS, which is fast enough for virtual reality applications. To further improve the texture quality of reconstructed virtual content, in Chapter 4 we propose dynamic omnidirectional texture synthesis (DOTS). DOTS overcomes the limitations of missing data and partial observations in RGB-D sequences captured by non-experts. As suggested by a user study, our proposed method has comparable texture quality while providing smoother transitions. However, to accurately replicate non-Lambertian objects, specularity interpolated from multiple images is not ideal, because specular reflection depends strongly on the incident lights and the object's material properties. In Chapter 5, we propose a texture separation method that divides a color image into a diffuse map and a specular map. The material properties can then be derived from the specular maps. Using this approach, the virtual object can fit into any virtual environment with dynamic lighting. By combining the three methods proposed in Chapters 3-5, we are able to rapidly generate virtual objects with photorealistic fidelity from a single consumer-grade camera. Without imposing constraints or assumptions on the RGB-D sequence and without any additional devices, our system further reduces the need for expert knowledge, making it a practical tool for real-world 3D scanning.
In summary, the contributions are:
• A real-time view-dependent texture mapping method is first proposed
for virtual reality experiences.
• A capture-to-rendering pipeline is established to minimize user inputs.
The input data is captured by non-expert users with a single consumer-
grade RGB-D sensor.
• A novel method for computing an optimized high-resolution synthetic
texture map for a particular target viewpoint.
• A novel view-dependent texture mapping approach that supports omni-
directional real-time rendering and remains reasonably robust to uncon-
strained capture conditions such as non-uniform camera trajectories,
missing coverage, and partial object views.
• A novel texture separation approach that decomposes each captured
image into a diffuse component and a specular component and simulta-
neously optimizes its camera pose.
• A novel method to estimate the light sources in the capture environment
using only the observed specular reflections. The optimized specular
reflectance of the object is derived from the estimated lighting condi-
tions.
• A fully automatic virtual content creation pipeline that does not require
expert knowledge or manual human effort. The reconstructed mod-
els are suitable for integration with industry-standard virtual environ-
ments.
1.4 THESIS OVERVIEW
The outline of this thesis is shown in Figure 3. Chapter 3 establishes the entire pipeline for real-time VDTM, Chapter 4 focuses on enabling free exploration in VR, and Chapter 5 proposes a novel method to enable post-editing. In more detail:
Chapter 2 surveys the literature on view-dependent texture mapping.
Chapter 3 introduces the end-to-end systematic pipeline for real-time VDTM.
Chapter 4 develops DOTS for free-viewpoint exploration.
Chapter 5 proposes a novel algorithm to achieve post-editing and relighting.
Chapter 6 concludes the work in this thesis, discusses its limitations, and gives directions for future research.
Figure 3.: Thesis overview
2 BACKGROUND
2.1 CAMERA MODEL
A camera model contains intrinsic properties, such as focal length and principal point, and extrinsic properties, such as rotation and translation. Since the techniques in this dissertation rely heavily on the camera model, we briefly explain its important concepts.
INTRINSIC The intrinsic matrix is used to project a 3D point (X, Y, Z) onto the image plane. As shown in Figure 4, the light from any 3D point passes through the pinhole and is projected onto the image plane.

Figure 4.: The pinhole camera

The intrinsic camera calibration matrix $K \in \mathbb{R}^{3 \times 3}$ is:

$$K = \begin{pmatrix} f_x & t & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix}$$

where $f_x$ and $f_y$ are the focal lengths of the lens, $c_x$ and $c_y$ are the coordinates of the principal point in the image plane, and $t$ is the skew coefficient between the x- and y-axes, which is often 0. Nonlinear intrinsic parameters such as lens distortion are also important, although they cannot be included in the linear camera model described by the intrinsic parameter matrix. In our case, the target object is usually at the center of the image and the distortion is negligible.
EXTRINSIC The extrinsic matrix is used to set the camera pose in world space. The most commonly used representation contains the rotation matrix $R \in \mathbb{R}^{3 \times 3}$ and the translation vector $t \in \mathbb{R}^{3 \times 1}$. Combining $R$ and $t$ into a transformation matrix $T \in \mathbb{R}^{4 \times 4}$ is often used to represent the extrinsic matrix:

$$T = \begin{pmatrix} R & t \\ \mathbf{0}^\top & 1 \end{pmatrix}$$
GENERAL PROJECTION MATRIX The projection matrix describes the entire process of projecting 3D points in world space onto the 2D image plane. First, the extrinsic matrix is used to transform a 3D point from world space to camera space. Then the intrinsic matrix is applied to project the transformed point:

$$P = K\,[R \mid t] = K'\,T, \qquad \text{where } K' = \begin{pmatrix} K & \mathbf{0} \\ \mathbf{0}^\top & 1 \end{pmatrix}$$
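To make the projection pipeline above concrete, the following Python sketch applies the extrinsic transform and then the intrinsic matrix to map 3D world points to pixel coordinates. The numeric values of K, R, and t are placeholders, not calibration results from this work.

```python
import numpy as np

def project_points(X_world: np.ndarray, K: np.ndarray,
                   R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Project Nx3 world points to Nx2 pixel coordinates with P = K [R | t]."""
    X_cam = R @ X_world.T + t.reshape(3, 1)   # world space -> camera space
    x = K @ X_cam                             # camera space -> image plane
    return (x[:2] / x[2]).T                   # perspective divide

# Placeholder intrinsics/extrinsics (illustrative values only)
K = np.array([[525.0, 0.0, 319.5],
              [0.0, 525.0, 239.5],
              [0.0,   0.0,   1.0]])
R = np.eye(3)
t = np.zeros(3)
pts = np.array([[0.1, -0.2, 1.5], [0.0, 0.0, 2.0]])
print(project_points(pts, K, R, t))
```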
2.2 3D GRAPHICS AND RENDERING
Rendering is the inverse process of projection. Given the projection matrix from the camera position to the model, the corresponding texture map is used to color the model. As shown in Figure 5, there are three well-known methods to render the model: per-face rendering, per-vertex rendering, and per-pixel rendering.

Figure 5.: Three different rendering methods
PER-FACE RENDERING Given a 3D model, per-face rendering assigns a different color to each face of the mesh. Using this method to render the model usually leads to low-fidelity results.
PER-VERTEX RENDERING Instead of assigning a color to each face, per-vertex rendering retrieves the color of each vertex by projecting the vertex onto the image plane. The color of each face is then interpolated from its vertex colors. Note that this interpolation gives the model a higher-fidelity appearance than per-face rendering.
PER-PIXEL RENDERING (UV MAPPING) Per-pixel rendering projects each vertex of a mesh onto the image plane and retrieves its UV value. For each pixel in screen space, a UV value is interpolated from the UVs of the surrounding vertices. The UV value is then used to retrieve the color from the texture map.
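To make the per-pixel (UV mapping) path concrete, here is a minimal Python sketch that interpolates a pixel's UV from the triangle's vertex UVs using barycentric coordinates and then samples the texture bilinearly. It is an illustrative sketch under assumed inputs, not the renderer used in this dissertation.

```python
import numpy as np

def sample_texture_bilinear(tex: np.ndarray, uv: np.ndarray) -> np.ndarray:
    """Bilinearly sample an HxWx3 texture at a single UV in [0,1]^2."""
    h, w = tex.shape[:2]
    x, y = uv[0] * (w - 1), uv[1] * (h - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * tex[y0, x0] + fx * tex[y0, x1]
    bot = (1 - fx) * tex[y1, x0] + fx * tex[y1, x1]
    return (1 - fy) * top + fy * bot

def shade_pixel(bary: np.ndarray, vertex_uvs: np.ndarray, tex: np.ndarray) -> np.ndarray:
    """Per-pixel rendering: interpolate the triangle's vertex UVs with the
    pixel's barycentric coordinates, then look up the texture color."""
    uv = bary @ vertex_uvs          # (3,) . (3,2) -> (2,)
    return sample_texture_bilinear(tex, uv)

# Example with a random 64x64 texture and one pixel inside a triangle
tex = np.random.rand(64, 64, 3)
vertex_uvs = np.array([[0.1, 0.1], [0.9, 0.2], [0.5, 0.8]])
bary = np.array([0.2, 0.3, 0.5])    # barycentric weights sum to 1
print(shade_pixel(bary, vertex_uvs, tex))
```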
2.3 3D RECONSTRUCTION FOR RGBD SENSOR
The geometry and the texture of an object are both essential in a virtual environment, so the reconstruction of both has gained a lot of attention. Since we focus on texture rather than geometry, we briefly introduce some geometry reconstruction methods and provide a fuller literature review of texture reconstruction.
2.4 GEOMETRY RECONSTRUCTION
Reconstructing the geometry of a target object requires several scans from
different viewpoints. For each viewpoint, depth information can be obtained
from laser scanners or depth sensors. However, the devices usually do not
provide a precise 6-degree-of-freedom (DoF) camera pose for each scan, so directly combining all scans into one mesh would not work. The Iterative Closest Point (ICP) algorithm [Besl and McKay, 1992, Zhang, 1994] was proposed to solve this registration problem: ICP iteratively finds the closest correspondences between point clouds and updates the 6-DoF camera poses to minimize the point-to-point distances. Chen and Medioni [Chen and Medioni, 1992] further extended the point-to-point distance metric to a more robust point-to-plane metric, and [Rusinkiewicz and Levoy, 2001] discussed and evaluated several variants of the basic ICP algorithm. More recently, the advent of consumer-grade depth sensors such as the Microsoft Kinect has led to widespread interest in 3D scanning. Variants of the KinectFusion algorithm [Izadi et al., 2011] have gained popularity due to their real-time reconstruction ability, and many papers have proposed revised versions of KinectFusion: Whelan et al. [Whelan et al., 2012] proposed Kintinuous to enlarge the reconstruction area, and Newcombe et al. [Newcombe et al., 2015] proposed DynamicFusion to reconstruct non-static objects.
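For reference, the sketch below shows a minimal point-to-point ICP loop in Python (nearest-neighbor correspondences plus a closed-form rigid alignment). It illustrates only the basic algorithm discussed above, not the point-to-plane variants or the KinectFusion tracker used later in this work.

```python
import numpy as np
from scipy.spatial import cKDTree

def best_rigid_transform(src: np.ndarray, dst: np.ndarray):
    """Least-squares rotation/translation aligning src to dst (Kabsch)."""
    cs, cd = src.mean(axis=0), dst.mean(axis=0)
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:               # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, cd - R @ cs

def icp_point_to_point(src: np.ndarray, dst: np.ndarray, iters: int = 20):
    """Basic point-to-point ICP: alternate nearest-neighbor matching and
    rigid alignment. Returns the accumulated 4x4 transform mapping src to dst."""
    T = np.eye(4)
    cur = src.copy()
    tree = cKDTree(dst)
    for _ in range(iters):
        _, idx = tree.query(cur)           # closest correspondences in dst
        R, t = best_rigid_transform(cur, dst[idx])
        cur = cur @ R.T + t
        Ti = np.eye(4)
        Ti[:3, :3], Ti[:3, 3] = R, t
        T = Ti @ T
    return T

# Toy example: recover a small known rotation/translation
rng = np.random.default_rng(0)
dst = rng.random((500, 3))
a = np.deg2rad(10)
Rz = np.array([[np.cos(a), -np.sin(a), 0], [np.sin(a), np.cos(a), 0], [0, 0, 1]])
src = dst @ Rz.T + np.array([0.05, 0.0, 0.0])
print(icp_point_to_point(src, dst)[:3, 3])
```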
2.5 TEXTURE RECONSTRUCTION
A wide variety of techniques [Shum and Kang, 2000, Remondino and El-
Hakim, 2006] have been proposed to incorporate 3D representations of phys-
ical objects in virtual environments. These methods can be categorized as
image-based, model-based, or a hybrid of the two based on the way unob-
served viewpoints are represented and visually reproduced.
2.5.1 Image-Based methods
Image-Based Rendering (IBR) renders directly from photographs without utilizing the geometry of an object. Light field rendering (LFR) [Levoy and Hanrahan, 1996] can synthesize unseen views if the number of images is sufficiently large to capture all the light reflected from an object. LFR can achieve photorealistic quality by ray-tracing to find each pixel color in an image dataset captured by a camera array. Bolas et al. [Bolas et al., 2015] used a turntable to capture the target object at 1-degree increments, so that novel views at the same height can be synthesized. However, to obtain such an input dataset, LFR requires a well-designed camera array or a programmable turntable, and these specialized devices are very expensive and not practical for most users. Gortler et al. [Gortler et al., 1996] proposed the Lumigraph, which uses a rendering technique similar to LFR to synthesize novel views from many images captured with the same camera. Although it does not require a camera array, the target object must be placed in a pre-designed environment with special markers to recover the camera pose. Davis et al. [Davis et al., 2012] proposed the unstructured light field, which uses a Simultaneous Localization and Mapping technique to obtain the camera pose in real time without markers. However, the drawback of these methods is that they render directly to the screen without a 3D model; without the geometry, it is complicated to edit or interact with the content after capturing.
2.5.1.1 Model-Based Rendering
On the other hand, model-based methods use geometric models with materials such as albedo, specular map, normal map, etc., to represent an object. Color mapping optimization [Zhou and Koltun, 2014] techniques have been developed to maximize the color (i.e., albedo) agreement between multiple input images. Waechter et al. [Waechter et al., 2014] proposed a similar method that solves a conditional random field energy equation consisting of a data term, which prefers views that are closer to the mesh and less blurry, and a smoothness term, which penalizes inconsistencies between adjacent faces. Jeon et al. [Jeon et al., 2016] proposed using textures instead of vertex colors for better visual reconstruction. Bi et al. [Bi et al., 2017] proposed jointly optimizing the texture projection and the camera poses, which can remove unwanted artifacts while recovering the texture. Whelan et al. [Whelan et al., 2016] estimate the positions and directions of light sources in the scene to reject samples that only contain specular highlights. These methods produce higher visual quality for models with Lambertian surfaces. However, averaging all the observed colors of a non-Lambertian object can result in lower-fidelity textures that appear blurry. Moreover, fixed textures are not ideal for representing the dynamic illumination effects of non-Lambertian surfaces.
2.5.2 Hybrid methods
View-dependent texture mapping (VDTM) [Debevec et al., 1998, Porquet
et al., 2005, Buehler et al., 2001, Rongsirigul et al., 2017, Hedman et al.,
2016, Chen et al., 2017] is a hybrid method that combines aspects of model-
based and image-based rendering. Given a set of selected images, it dynam-
ically blends the color maps from different viewing directions to render the
model’s texture at run-time. Similar to image-based methods, VDTM can
achieve more realistic surface color and illumination effects, while simultane-
ously maintaining the flexibility and interactivity of model-based approaches.
However, VDTM is very sensitive to errors such as missing data or inaccu-
rate camera pose estimation. Furthermore, earlier methods required manual
overhead for adjusting all camera positions relative to the model [Debevec
et al., 1998, Porquet et al., 2005]. In later work, structure-from-motion (SfM)
and multi-view stereo were utilized to automate this process [Buehler et al.,
2001, Rongsirigul et al., 2017]. SfM estimates the camera poses and recon-
structs the point clouds from visual odometry. This photometric reconstruc-
tion highly depends on matching the visual appearance between images, and
as a result, it tends to work poorly on objects with high specularity or repeated
patterns. In contrast, RGB-D sensors (e.g., Kinect) can be used to integrate
the depth data into a voxel volume which does not require color information
[Izadi et al., 2011]. However, the camera trajectories computed while scan-
ning objects are typically prone to error that accumulates over time, resulting
in a texture that appears blurry when mapped onto the reconstructed 3D mesh.
Several approaches have been proposed to overcome this problem. For exam-
ple, Hedman et al. [Hedman et al., 2016] used additional high-dynamic range
images for registration and mapping, and Chen et al. [Chen et al., 2017] used
a color mapping optimization method to further refine the computed trajecto-
ries and improve photometric quality.
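As an illustration of the view-dependent blending idea, the sketch below weights the k source cameras whose viewing directions are closest to the current virtual view by inverse angular distance. This is only one plausible weighting scheme; the cited VDTM systems and the method in Chapter 3 define their own selection and blending rules.

```python
import numpy as np

def vdtm_blend_weights(view_dir: np.ndarray, cam_dirs: np.ndarray, k: int = 3):
    """Pick the k source cameras whose viewing directions are most aligned
    with the current virtual view direction, and return their indices plus
    normalized blending weights (smaller angle -> larger weight)."""
    v = view_dir / np.linalg.norm(view_dir)
    c = cam_dirs / np.linalg.norm(cam_dirs, axis=1, keepdims=True)
    cos_sim = c @ v
    nearest = np.argsort(-cos_sim)[:k]                  # k most aligned cameras
    ang = np.arccos(np.clip(cos_sim[nearest], -1.0, 1.0))
    w = 1.0 / (ang + 1e-6)                              # inverse-angle weighting
    return nearest, w / w.sum()

# Example: blend 3 of 50 captured views for the current head direction
cam_dirs = np.random.randn(50, 3)
idx, weights = vdtm_blend_weights(np.array([0.0, 0.0, 1.0]), cam_dirs)
print(idx, weights)
```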
2.6 REFLECTANCE RECONSTRUCTION
While reconstructing consistent color textures is itself a challenging task, it still cannot fully explain the visual appearance of objects. In general, the observed colors in an image depend not only on the specific material properties but also on the surrounding scene-specific illumination. The process of reconstructing the material reflectances out of the final rendered image is therefore a highly ill-posed problem. The first work to model specularity in 3D reconstruction from RGB-D data was done by Wu and Zhou [Wu and Zhou, 2015]: after reconstructing the object shape using KinectFusion [Izadi et al., 2011] in a first pass, its appearance is estimated afterward. In a subsequent
work [Wu et al., 2016a], they changed the philosophy of their design from an interactive system to a more accurate offline approach. Wu et al. [Wu et al., 2016b] attempted to estimate surface reflectance in uncontrolled environments, but their method requires specially designed IR devices, which are also not practical for users of consumer-grade cameras. Shi et al. [Shi et al., 2016] reviewed several methods for recovering the BRDF model of a target object under different lighting conditions. However, these methods require the target to be captured under known illumination and are thus not suitable for portable scanning. Recently, some papers [Richter-Trummer et al., 2016, Jiddi et al., 2016] proposed methods that use only an RGB-D sensor to estimate material properties, but to get good-quality results they impose strict constraints on the object's material. The following two papers are most closely related to our work: Park et al. [Park et al., 2018] use an IR sequence to estimate the specular map first and compute the diffuse map in real time, and Wei et al. [Wei et al., 2018] used low-rank decomposition for highlight removal. Both methods assume that the camera poses obtained from KinectFusion are accurate enough for estimating surface reflectance. However, accumulated drift errors cause these methods to fail to recover the specular properties, so they only work when the input RGB-D sequence is considerably short.
2.7 POSITION OF THIS THESIS
In this chapter, we summarized the related work. However, several open problems have not been sufficiently addressed in previous work:
• Rendering Quality in VR Fixed textures have low color fidelity, which is not ideal for virtual experiences. The frame rate of view-dependent texture mapping reported in previous papers is much less than 60 FPS, whereas the minimum requirement for a virtual reality application is 90 FPS.
• Constrained Input Covering all potential viewing directions during object capture using a hand-held RGB-D camera is non-trivial. Furthermore, the spatial and temporal distributions of camera trajectories are non-uniform and often inconsistent.
• Transition Smoothness The unconstrained camera trajectory typically
results in sub-optimal triangulation, which in turn can lead to artifacts
such as sharp color discontinuities or erroneous surface illumination.
• Missing Data for Rendering The texture generated for a particular
viewpoint can have large regions of missing data when the closest im-
ages in the input sequence contain only partial views of the captured
object.
• Impossible Relighting Due to the nature of image-based rendering techniques, real-world illumination is baked into the texture. During object capture, a controlled lighting environment and fixed white balance and exposure of the camera are expected.
• Specular Reflection Replication Since specular reflections have a strong
non-local effect and are strongly related to the geometry and light-
ing conditions, the synthetic views created from interpolating between
color images may have visual artifacts.
In this dissertation, we aim to build a systematic framework and propose the associated methods to solve the above problems.
Part II
CREATION OF PHOTOREALISTIC VIRTUAL REALITY CONTENT
3 VIEW-DEPENDENT VIRTUAL REALITY CONTENT FROM RGB-D IMAGES
3.1 INTRODUCTION
To improve color fidelity, techniques such as View-Dependent Texture Map-
ping (VDTM) have been introduced [Debevec et al., 1998, Bastian et al.,
2010, Nakashima et al., 2015]. In this approach, the system finds the observed camera poses closest to the viewpoint and uses the corresponding color images to texture the model. Previous work has used Structure-from-Motion
and Stereo Matching to automatically generate the model and the camera tra-
jectory. Although these methods typically result in higher color fidelity, the
reconstructed geometric model is often less detailed and more prone to error
than depth-based approaches. In this chapter, we leverage the strengths of
both methods to create a novel view-dependent rendering pipeline (Figure 7).
In our method, the 3D model is reconstructed from the depth stream using
KinectFusion. The camera trajectory computed during reconstruction is then
refined using the images from the color camera to improve photometric coher-
ence. The color of the 3D model is then determined at runtime using a subset
of color images that best match the viewpoint of the observing user (Figure 6
(c)).
3.2 SYSTEM OVERVIEW
In this chapter, we propose a system pipeline (Figure 7) to generate a view-dependent 3D model from a consumer-grade RGB-D sensor through an offline stage and an online stage. In the offline stage (Sec. 3.3), an RGB-D sensor is first used to record a sequence of color and depth images. The proposed method uses the depth data to reconstruct a geometric model (Figure 6 (a)) and generate a corresponding camera trajectory. To maximize the texture quality and the covered area, we select color images based on their blurriness and their distribution in physical space. Color trajectory optimization is then used to eliminate the noise from the color-depth transformation and improve color consistency. In the online stage (Sec. 3.4), the 3D model is projected onto each color image to obtain visibility information, and the color of each vertex is precomputed by averaging over all images. Note that this procedure is only performed once at the beginning. Based on the user's viewpoint, the closest images are then fused to generate the texture at run-time.
(a) Reconstructed Model. (b) Fixed Texture. (c) View-Dependent Textures.
Figure 6.: (a) A 3D model reconstructed from images captured with a sin-
gle consumer depth camera. (b) The textured model generated
using the traditional approach of blending color images. (c) Our
approach is capable of rendering view-dependent textures in real-
time based on the user’s head position, thereby providing finer tex-
ture detail and better-replicating illumination changes and specular
reflections that become especially noticeable when observed from
different viewpoints in virtual reality.
Figure 7.: An overview of the complete pipeline, consisting of two phases:
(1) offline processing/optimization, and (2) online view-dependent
texture rendering.
3.3 3D RECONSTRUCTION AND CAMERA TRAJECTORY OPTIMIZATION
3.3.1 Geometry Reconstruction
The depth information comes directly from the RGB-D sensor, so visual odometry techniques such as SfM and stereo matching are not needed to generate models. Instead, the KinectFusion system [Izadi et al., 2011] is used to reconstruct a single 3D model by integrating the depth data into a voxel volume over time. The camera pose of the latest frame is computed by comparing its depth image with the model reconstructed from previous frames. If the camera pose is tracked, the depth information is fused into the voxel volume to refine the model. Because of this fusion mechanism, the resulting 3D model is usually smooth and the computation time of camera pose tracking does not increase over time. Finally, the 3D model $M$ and the camera trajectory $T = \{T_1, T_2, \dots, T_n\}$, where $T_n$ is the extrinsic matrix of the n-th tracked camera pose, are generated. Compared to a model reconstructed from pure color images (Figure 8 (b)), the model created using the depth sensor has better quality (Figure 8 (c)).
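To make the voxel-volume fusion step concrete, the following simplified Python sketch integrates one depth image into a truncated signed distance function (TSDF) volume with a running weighted average, in the spirit of KinectFusion. The volume size, truncation distance, and intrinsics are illustrative assumptions, and real implementations run this on the GPU rather than with dense NumPy arrays.

```python
import numpy as np

def integrate_depth_into_tsdf(tsdf, weight, origin, voxel_size, depth, K, T_cam,
                              trunc=0.03):
    """Fuse one depth image (meters) into a TSDF voxel volume with a running
    weighted average. tsdf/weight: (D,H,W) arrays, origin: world position of
    voxel (0,0,0), K: 3x3 intrinsics, T_cam: 4x4 world-to-camera pose."""
    D, H, W = tsdf.shape
    ii, jj, kk = np.meshgrid(np.arange(D), np.arange(H), np.arange(W), indexing="ij")
    pts_w = origin + voxel_size * np.stack([ii, jj, kk], axis=-1).reshape(-1, 3)
    pts_c = (T_cam[:3, :3] @ pts_w.T + T_cam[:3, 3:4]).T     # world -> camera
    z = pts_c[:, 2]
    z_safe = np.where(np.abs(z) > 1e-6, z, 1e-6)
    uvw = (K @ pts_c.T).T
    u = np.round(uvw[:, 0] / z_safe).astype(int)             # pixel column
    v = np.round(uvw[:, 1] / z_safe).astype(int)             # pixel row
    h, w = depth.shape
    valid = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    d_obs = np.zeros_like(z)
    d_obs[valid] = depth[v[valid], u[valid]]
    sdf = d_obs - z                                          # positive in front of surface
    keep = valid & (d_obs > 0) & (sdf > -trunc)
    tsdf_new = np.clip(sdf / trunc, -1.0, 1.0)
    f, wgt = tsdf.reshape(-1).copy(), weight.reshape(-1).copy()
    f[keep] = (f[keep] * wgt[keep] + tsdf_new[keep]) / (wgt[keep] + 1.0)
    wgt[keep] += 1.0
    return f.reshape(D, H, W), wgt.reshape(D, H, W)

# Toy example: a 32^3 volume, identity pose, flat depth of 1 m
K = np.array([[525.0, 0.0, 160.0], [0.0, 525.0, 120.0], [0.0, 0.0, 1.0]])
vol, wts = np.zeros((32, 32, 32)), np.zeros((32, 32, 32))
depth = np.full((240, 320), 1.0)
vol, wts = integrate_depth_into_tsdf(vol, wts, origin=np.array([-0.5, -0.5, 0.5]),
                                     voxel_size=1.0 / 32, depth=depth, K=K,
                                     T_cam=np.eye(4))
```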
3.3.2 Key Frame Selection
Since the data is captured with a handheld camera, some of the color images are not ideal for rendering due to motion blur. Using those blurry images will not improve the quality of texture rendering but will increase memory usage at run-time. Instead of using every image from the trajectory, we select a set of color images $I_{sel}$ with good image quality. Unlike Zhou et al. [Zhou and Koltun, 2014], who select frames from different time intervals, we aim to maximize the covered area of viewpoints for our online-stage view-dependent rendering. To achieve this, the color images are first ranked by blurriness [Crete et al., 2007]; we then walk through the ranked images and add an image only if its distance to every image already in $I_{sel}$ is larger than d cm, until we have selected K frames (see Algorithm 1). In our experiments, K is around 50-120 and d is 5-10, depending on the total number of images and the area of the object covered while capturing the data.

Figure 8.: Comparison of the 3D geometry reconstructed using structure from motion (SfM) with color images vs. KinectFusion with depth images. (a) Example scene from different viewpoints. (b) Model from SfM. (c) Model from KinectFusion.
Data: Color sequence I
Result: Selected color image sequence I_sel
Ranked color sequence I_R = sort(blurriness(I));
I_sel = ∅;
for each i ∈ I_R do
    if (‖P_i − P_i'‖ > d, ∀ i' ∈ I_sel) then
        I_sel = I_sel ∪ {i}
    end
    if (length(I_sel) == K) then
        break;
    end
end
Algorithm 1: Key Frame Selection
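The following short Python sketch mirrors Algorithm 1. The helper names, the camera-position array, and the default thresholds are illustrative assumptions, not part of the original implementation.

```python
import numpy as np

def select_keyframes(blurriness, cam_positions, K=100, d=0.08):
    """Pick up to K sharp, spatially well-separated frames (Algorithm 1 sketch).
    `blurriness` holds one score per frame (lower = sharper) and
    `cam_positions` the camera centers in meters."""
    order = np.argsort(blurriness)            # sharpest frames first
    selected = []
    for i in order:
        p_i = cam_positions[i]
        # Accept the frame only if it is farther than d from every selected one.
        if all(np.linalg.norm(p_i - cam_positions[j]) > d for j in selected):
            selected.append(i)
        if len(selected) == K:
            break
    return selected
```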
Figure 9.: An example of the textured model from (a) a captured camera view and (b) a previously un-captured viewing direction (optimized camera pose in the blue rectangle); (c) results after 0, 10, 20, and 30 iterations, respectively. Note that vertices shown in red are invisible in the image.
3.3.3 Camera Trajectory Optimization
Although we obtain the selected frames T_sel = {T_1, T_2, ..., T_K} from KinectFusion, the initial camera trajectory is not sufficiently accurate for texture mapping because it is based purely on geometry information. This is particularly noticeable at the boundaries of objects. The textured models look similar from a previously captured viewpoint (Figure 9 (a)). However, when rendering the model from a new viewpoint that was not originally captured, the optimized camera poses (Figure 9 (b), left) provide more accurate texture (e.g., the turtle head) than the model textured without optimization (Figure 9 (b), right).
To maximize the agreement between color and geometry, we first calibrate and align the color and IR cameras. We then apply Color Mapping Optimization [Zhou and Koltun, 2014] to obtain more accurate camera poses. The objective function minimizes the difference between the color of the vertices and their corresponding colors in each frame. Note that we only use the alternating optimization method, as suggested by Narayan et al. [Narayan and Abbeel, 2015], because the deformation grid optimization sometimes leads to divergence. The optimization iteratively solves the problem from the initial state T_sel and converges to T_opt using the Gauss-Newton method (Figure 9 (c)). The details of solving the objective function can be found in Appendix A. In our experience, the procedure takes 20-30 iterations (approximately 2-3 hours) to reach the solution.
3.4 REAL-TIME TEXTURE RENDERING
3.4.1 Pre-processing
To update the vertex colors, visibility information should be used to avoid incorrect texture mapping. For example, in Figure 9, the portion of the desk occluded by the turtle should not be rendered. We project the 3D model into each image space and generate the associated depth images for a visibility check. If the distance from a vertex to the camera is larger than the corresponding value in the depth image, the vertex is considered invisible. Using this visibility information, the vertex colors can be updated in parallel without requiring the geometry information, which allows real-time performance. We also pre-compute a basic color for each vertex by averaging the RGB values from all images that pass the visibility check. A vertex keeps this color if it is not covered by any image. These two procedures only need to be applied once unless images are added to or removed from the database, so they do not affect processing time at run-time.
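For concreteness, the following is a hedged Python sketch of the per-image visibility check: a vertex is visible if its distance to the camera does not exceed the stored depth at its projection, within a small tolerance. Names and the world-to-camera convention are illustrative assumptions.

```python
import numpy as np

def visible_vertices(vertices, depth_map, K, T_cam, eps=0.01):
    """Occlusion test sketch: compare each vertex's camera-space depth with the
    rendered depth map of the same view (simplified, assumed conventions)."""
    R, t = T_cam[:3, :3], T_cam[:3, 3]
    pts = vertices @ R.T + t                          # vertices in camera frame
    z = pts[:, 2]
    z_safe = np.where(z > 0, z, 1.0)
    u = np.round(K[0, 0] * pts[:, 0] / z_safe + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * pts[:, 1] / z_safe + K[1, 2]).astype(int)
    h, w = depth_map.shape
    inside = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    vis = np.zeros(len(vertices), dtype=bool)
    # A vertex farther away than the stored depth is occluded.
    vis[inside] = z[inside] <= depth_map[v[inside], u[inside]] + eps
    return vis
```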
3.4.2 Render Image Selection
At the online stage, we sample images based on the Euclidean distance between the user's HMD position and all the camera positions in our database. The HMD position p_u is provided by the Oculus Rift DK2 and its position-tracking camera. The camera position of each image can be computed from the transformation matrix T_opt obtained in Sec. 3.3.3. Each transformation matrix T_i ∈ T_opt can be decomposed into a rotation matrix R_i ∈ R^{3×3} and a translation vector t_i ∈ R^3. The camera position is then obtained by p_i = −R_i^T t_i.

I_r = argmin_i ‖p_i − p_u‖_2,  0 ≤ i ≤ K    (1)
Using only the closest image to render the model creates a sudden transition from one image to another. It also produces a sharp edge between the updated vertices and the rest of the model (Figure 10, right). Selecting more images yields smoother transitions during head movement; however, details such as specularities and light bursts are lost (e.g., the fixed-texture model is colored by all images). In our experiments, we select three images, which both preserves detail and allows smooth switching between viewpoints (Figure 10, left).
3.4.3 View-Dependent Texture Mapping
Each vertex is mapped onto the image planes to retrieve its corresponding RGB values. We compute the vector from the model center to the HMD position p_u; if it intersects the triangle formed by the three selected camera positions {p_1, p_2, p_3} at a point p, we use the barycentric coordinates to compute the weights.

p = w_1 p_1 + w_2 p_2 + w_3 p_3    (2)
Figure 10.: A view-dependent texture rendered using three source images
(left) compared with a texture generated using a single image
(right).
If the ray from the HMD position to the model center does not intersect the triangle (e.g., the user position is outside the covered area), we instead use the inverse of the Euclidean distance as the weight.

w_i = ‖p − p_i‖_2^{−1},  i ∈ {1, 2, 3}    (3)
Before combining these values, we must perform a visibility check to detect occlusion, as described in Section 3.4.1. RGB values that fail the visibility check are discarded (i.e., their weights are set to zero). The remaining weights are normalized and the vertex color is updated with the blended RGB value.

C(v) = w'_1 c_1 + w'_2 c_2 + w'_3 c_3    (4)

where C(v) represents the color of vertex v, w'_i = w_i / (w_1 + w_2 + w_3), i ∈ {1, 2, 3}, and c_1, c_2, and c_3 are the pixel colors retrieved by projecting the vertex onto the chosen images.
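The per-vertex blend of Eqs. (2)-(4) reduces to a small weighted average once the barycentric weights and visibility flags are known. The sketch below illustrates that step; all names are illustrative and the weights are assumed to be precomputed.

```python
import numpy as np

def vdtm_vertex_color(colors, visible, w_bary):
    """Blend three RGB samples with barycentric (or inverse-distance) weights,
    discarding samples that failed the visibility check (sketch of Eqs. 2-4)."""
    w = np.where(visible, w_bary, 0.0)     # occluded samples get weight 0
    if w.sum() == 0:
        return None                        # vertex keeps its precomputed color
    w = w / w.sum()                        # normalize the remaining weights
    return (w[:, None] * colors).sum(axis=0)

# Example: the middle image fails the visibility check for this vertex.
colors = np.array([[0.8, 0.2, 0.2], [0.1, 0.9, 0.1], [0.7, 0.3, 0.2]])
print(vdtm_vertex_color(colors, np.array([True, False, True]),
                        np.array([0.5, 0.3, 0.2])))
```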
3.5 EXPERIMENTAL RESULTS
We use a Microsoft Kinect, which streams VGA (640×480) resolution depth and color images at 30 frames per second. The color images are captured with fixed exposure and white balancing. The results can be seen in Figure 11. Note that the position of the specular reflection varies by viewpoint. The color images captured the light-burst effect, which is accurately reproduced at run-time.
The visual quality is limited by the VGA resolution. To increase the quality, we also use our system with an Intel RealSense F200 RGB-D camera, which provides the same VGA resolution in depth but 1080p (1920×1080) resolution in color. Furthermore, the F200 has a closer range (0.2-1.2 m) than the Kinect (0.8-4 m), so the object captured with the Intel depth camera usually has approximately 4× larger texture. As shown in Figure 12, the "L" on the belt is still clear. Our system can still perform VDTM at these higher resolutions in real-time.
Figure 11.: (top left) An image of a real object captured using a Kinect v1. (middle left) The untextured 3D model. (bottom left) The model with a fixed texture generated from blending the source images. (right) The model with view-dependent texture rendered from several viewpoints. Note the flame from the candle within the object.

(a) Kinect v1 (b) Intel RealSense F200

Figure 12.: The results of our pipeline using the same object captured with two consumer RGB-D sensors.

In addition, we also used the data from Choi et al. [Choi et al., 2016] to test our system. It contains thousands of RGB-D sequences. The raw color and depth are not synchronized, so we assigned each color image to the depth image with the smallest time-stamp difference. Since both streams run at 30 fps, the shifting error can also be handled by the color mapping optimization. We tested our system with different models, such as a female body sculpture (3887), a round table (5648), an antique leather chair (5989), a dragon toy (9634), and an old man figurine (09933), shown in Figure 13, where the number in parentheses is the index of the sequence in [Choi et al., 2016]. The first column shows the 3D model generated from KinectFusion, and the second column is textured by the volumetric blending method [Izadi et al., 2011]. Although the color mapping optimization method [Zhou and Koltun, 2014] (the third column in Figure 13) is able to generate detailed 3D models, our method (columns 4-6) preserves the specularity of the object from different viewpoints. As shown in Figure 13, our system is able to handle all of the following conditions:
• various lighting conditions
• various object materials (plastic, ceramic, leather, wood, and metal)
• various object sizes (table, sculpture, toy)
• full 360-degree coverage (old man figurine)
• multiple objects (statues)
It is also worth noting that the sequences in [Choi et al., 2016] were captured by operators who are not experts in computer vision. This confirms that using our system does not require knowledge of 3D vision or any engineering background. The details of each model are shown in Table 1.

Object      Vertices   Surfaces   Keyframes   Color/depth frames
Sculpture   208K       406K       108         3210 / 3225
Table       390K       763K       98          2906 / 2918
Chair       255K       495K       111         3299 / 3313
Toy         49K        97K        65          1936 / 1944
Figurine    99K        197K       96          2843 / 2856

Table 1.: Information about the 3D models and images used in different examples.
All the experiments were performed in Unity 5.3 on a MacBook Pro with
an Intel i7-4850HQ CPU, Nvidia GeForce GT750M GPU and 16 GB of
RAM. Per-vertex rendering is parallelized in our design by passing the depth
map to the shader for the visibility check. Our method can render an entire
model in only 10-15 milliseconds (i.e., 70-90 fps), which is sufficient for real-time, high-frequency rendering.
Figure 13.: (1st column) Untextured 3D models. (2nd column) Fixed textures generated without optimization. (3rd column) Fixed textures generated with optimization. (last 3 columns) The final output from our pipeline with view-dependent textures shown from three different viewpoints.

3.6 CONCLUSION

We proposed a novel pipeline for rapidly creating photorealistic virtual reality content with only one consumer-grade RGB-D sensor. Our system automatically generates the geometry model and the camera trajectory of selected color images from an RGB-D sequence. It also generates a texture for the 3D model that changes based on the HMD position in real-time, even for viewpoints that were not originally captured. By fusing weighted vertex colors from multiple images, we can smoothly transition the texture from one viewpoint to another. Our experimental results show that the system correctly reproduces an object's appearance even when the object is captured without expert knowledge, making it a useful application for real-world 3D scanning.
4
DYNAMIC OMNIDIRECTIONAL TEXTURE SYNTHESIS FOR PHOTOREALISTIC VIRTUAL CONTENT CREATION
4.1 INTRODUCTION
In this chapter, we propose dynamic omnidirectional texture synthesis (DOTS) in order to improve the smoothness of viewpoint transitions while maintaining the visual quality provided by VDTM techniques. Given a target virtual camera pose, DOTS is able to synthesize a high-resolution texture map
from the input stream of color images. The proposed objective function si-
multaneously optimizes the synthetic texture map, the camera poses of se-
lected frames with respect to the virtual camera, and the pre-computed global
texture. Furthermore, instead of using traditional spatial/temporal selection,
DOTS uniformly samples a spherical set of virtual camera poses surround-
ing the reconstructed object. This results in a well-structured triangulation of
synthetic texture maps that provide omnidirectional coverage of the virtual
object, thereby leading to improved visual quality and smoother transitions
between viewpoints.
DOTS is a novel capture-to-rendering content generation pipeline that pro-
vides several advantages over previous approaches (e.g. [Chen et al., 2017,
Nakashima et al., 2015, Rongsirigul et al., 2017]). The following outlines the
major contributions of this chapter:
• A method for computing an optimized high-resolution synthetic texture
map for a particular target viewpoint.
• A novel view-dependent texture mapping approach that supports omni-
directional real-time rendering and remains reasonably robust to uncon-
strained capture conditions such as non-uniform camera trajectories,
missing coverage, and partial object views.
• A user study that empirically evaluated the subjective quality of objects
rendered using three different texture mapping methods. The results
demonstrated that DOTS provides superior visual quality over fixed tex-
tures, while simultaneously providing smoother viewpoint transitions
compared to a previously proposed view-dependent technique.
Figure 14.: Overview of the DOTS content creation pipeline. Color and depth
image streams are captured from a single RGB-D camera. The ge-
ometry is reconstructed from depth information and is then used
to uniformly sample a set of virtual camera poses surrounding the
virtual object. For each camera pose, a synthetic texture map is
blended from the global texture and local color images captured
near the current location. The synthetic texture maps are then
used to dynamically render the object in real-time based on the
user’s current viewpoint in virtual reality.
4.2 OVERVIEW
4.2.1 Open Problems of VDTM
• Covering all potential view directions during object capture using a
hand-held RGB-D camera is non-trivial. Furthermore, the spatial and
temporal distributions of camera trajectories are non-uniform and often
inconsistent (Figure 15).
• The unconstrained camera trajectory typically results in sub-optimal
triangulation, which in turn can lead to artifacts such as sharp color
discontinuities or erroneous surface illumination (Figures 21 and 22).
• The texture generated for a particular viewpoint can have large regions
of missing data when the closest images in the input sequence contain
only partial views of the captured object.
We developed DOTS in order to overcome all three of the above limitations.
The proposed approach can generate a set of complete synthetic texture maps
for omnidirectional viewing and is reasonably robust to inconsistent camera
trajectories, partial views, and missing coverage during object capture.
4.2.2 Overall Process
The system pipeline is shown in Figure 14. Given an RGB-D video sequence, the geometry is first reconstructed from the original depth stream. A set of keyframes is selected from the entire color stream, and a global texture is generated from those keyframes. Next, a virtual sphere is defined to cover the entire 3D model, and the virtual camera poses are uniformly sampled and triangulated on the sphere's surface. For each virtual camera pose, the corresponding texture is synthesized from several frames and the pre-generated global texture maps. At run-time, the user viewpoint provided by a head-tracked virtual reality display is used to select the synthetic maps that render the model in real-time.

Figure 15.: Visualization of a typical hand-held RGB-D capture sequence showing non-uniform spatial and temporal distribution. The blue line shows the estimated trajectory from the input depth stream. The green line shows the trajectory projected from the blue line onto a sphere surrounding the object. Red circles indicate the position of the selected keyframes for VDTM. The spatial regions covered by the camera's trajectory are shown in the upper right image. Time spent in each region is displayed in the lower left image, with brighter shades indicating longer duration.
4.2.3 Geometric Reconstruction
Any 3D reconstruction method can be used to obtain the geometry model with
a triangular mesh representation. Similar to other work [Zhou and Koltun, 2014, Jeon et al., 2016, Bi et al., 2017, Chen et al., 2017] that focuses on texture mapping for objects captured using handheld RGB-D cameras, we
use Kinect Fusion [Izadi et al., 2011] to construct the 3D model M from the
depth sequences. The camera trajectory of the sequence is roughly estimated
(the blue line in Figure 15) and can be used for the texture map generation.
4.2.4 Global Texture Generation
Using all frames I of the input video to generate the global texture is inefficient. Instead, temporal [Zhou and Koltun, 2014, Bi et al., 2017] and spatial [Jeon et al., 2016, Richter-Trummer et al., 2016, Chen et al., 2017] key frame selection methods have been proposed that use either the distribution in time or in space to obtain a set of the most representative frames from the original color sequence. In our DOTS framework, we chose spatial key frame selection to maximize the variation of viewing angles of the 3D model (red circles in Figure 15). The selected N key frames, G = {g_1, g_2, ..., g_N} ⊂ I, with initial estimated camera poses T_G = {t_{g_1}, t_{g_2}, ..., t_{g_N}}, are used for global texture generation. The objective function iteratively solves for the vertex colors C(v) of the reconstructed model M and the camera poses T_G. The details of solving the following optimization are explained in [Zhou and Koltun, 2014] and Appendix A.

E(C, T_G) = Σ_{g∈G} Σ_{v∈M} ( C(v) − Γ(v, g, t_g) )²    (5)

where Γ is the color retrieved by projecting vertex v onto image g using the estimated pose t_g.

The images G and their optimized camera poses T_G are used as anchor images for our synthetic image generation. We discard the vertex colors C(v) in our DOTS framework because the texture is dynamically updated at run-time. The number of keyframes N varies because it depends on the trajectory of the original video sequence. The actual N used for each virtual object is reported in Table 2.
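To illustrate the alternating structure behind Eq. 5, the following is a simplified Python sketch: with the poses fixed, the vertex colors have a closed-form (mean) update, and with the colors fixed each pose is refined against the photometric residuals. It uses a generic least-squares solver rather than the Gauss-Newton scheme of Appendix A, and it omits visibility handling; all helper names are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def project(vertices, pose6, K):
    # pose6 = 3 rotation-vector + 3 translation components (world -> camera).
    R = Rotation.from_rotvec(pose6[:3]).as_matrix()
    pts = vertices @ R.T + pose6[3:]
    return pts[:, :2] / pts[:, 2:3] * np.array([K[0, 0], K[1, 1]]) + K[:2, 2]

def sample_bilinear(img, uv):
    u, v = uv[:, 0], uv[:, 1]
    u0 = np.clip(np.floor(u).astype(int), 0, img.shape[1] - 2)
    v0 = np.clip(np.floor(v).astype(int), 0, img.shape[0] - 2)
    du, dv = u - u0, v - v0
    return ((1 - du) * (1 - dv) * img[v0, u0] + du * (1 - dv) * img[v0, u0 + 1]
            + (1 - du) * dv * img[v0 + 1, u0] + du * dv * img[v0 + 1, u0 + 1])

def color_map_optimization(vertices, gray_images, poses6, K, iters=10):
    """Alternating sketch of Eq. 5: vertex colors C, then camera poses."""
    for _ in range(iters):
        # Step 1 (poses fixed): vertex colors = mean of their observations.
        C = np.mean([sample_bilinear(img, project(vertices, p, K))
                     for img, p in zip(gray_images, poses6)], axis=0)
        # Step 2 (C fixed): refine each camera against the current colors.
        for i, img in enumerate(gray_images):
            residual = lambda p, img=img: sample_bilinear(img, project(vertices, p, K)) - C
            poses6[i] = least_squares(residual, poses6[i]).x
    return C, poses6
```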
4.3 SYNTHETIC TEXTURE MAP GENERATION AND RENDERING
Our objective is to replicate high-fidelity models with smooth transitions between viewpoints. To achieve this goal, the camera poses need to be accurate and the texture maps must have high quality. Instead of selecting images directly from the original video, we uniformly sample virtual cameras on a sphere surrounding the reconstructed geometry and set the size of the sphere large enough to cover the entire object in each synthesized virtual view. To avoid confusion, in subsequent sections we refer to the virtual cameras and real cameras as VCam and RCam, respectively. Note that, unlike key frame selection, the number of synthesized textures is independent of the length of the input video and the camera trajectory. As shown in Figure 15, we sampled 162 synthetic textures S = {s_1, s_2, ..., s_162} with known VCam poses T_S = {t_{s_1}, t_{s_2}, ..., t_{s_162}}, forming a total of 320 triangles, which guarantees that the angle between any viewing direction and its closest VCam is always smaller than 15°. Although the number of synthesized textures is about 1.5 times larger than the number of keyframes used in VDTM (see Table 2), DOTS covers all viewing directions, while VDTM covers only 18%-20% of the entire sphere (e.g., Figure 15). As shown in Figure 22, when the viewing direction is far from the captured trajectory, the model rendered with DOTS is better than with the traditional VDTM method.
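One standard way to obtain such a uniform spherical sampling is to subdivide an icosahedron: two subdivision steps yield exactly 162 vertices and 320 triangles. The sketch below is an assumption about how such a sphere can be constructed, not necessarily the exact procedure used here.

```python
import numpy as np

def icosphere(subdivisions=2):
    """Build an icosphere by repeatedly splitting each triangle into four."""
    phi = (1 + 5 ** 0.5) / 2
    verts = [(-1, phi, 0), (1, phi, 0), (-1, -phi, 0), (1, -phi, 0),
             (0, -1, phi), (0, 1, phi), (0, -1, -phi), (0, 1, -phi),
             (phi, 0, -1), (phi, 0, 1), (-phi, 0, -1), (-phi, 0, 1)]
    faces = [(0, 11, 5), (0, 5, 1), (0, 1, 7), (0, 7, 10), (0, 10, 11),
             (1, 5, 9), (5, 11, 4), (11, 10, 2), (10, 7, 6), (7, 1, 8),
             (3, 9, 4), (3, 4, 2), (3, 2, 6), (3, 6, 8), (3, 8, 9),
             (4, 9, 5), (2, 4, 11), (6, 2, 10), (8, 6, 7), (9, 8, 1)]
    verts = [np.array(v) / np.linalg.norm(v) for v in verts]
    cache = {}

    def midpoint(a, b):
        key = (min(a, b), max(a, b))
        if key not in cache:                      # reuse shared-edge midpoints
            m = (verts[a] + verts[b]) / 2
            verts.append(m / np.linalg.norm(m))   # project back onto the sphere
            cache[key] = len(verts) - 1
        return cache[key]

    for _ in range(subdivisions):
        new_faces = []
        for a, b, c in faces:
            ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
            new_faces += [(a, ab, ca), (b, bc, ab), (c, ca, bc), (ab, bc, ca)]
        faces = new_faces
    return np.array(verts), np.array(faces)

verts, faces = icosphere(2)
print(len(verts), len(faces))   # 162 320
```

The unit-sphere vertices can then be scaled and translated so the sphere encloses the reconstructed object, giving the VCam positions and their triangulation.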
To generate the synthetic view s_i, all frames I are weighted and sorted based on their uniqueness with respect to the VCam pose t_{s_i}. Images with higher weights are chosen for optimization. Our objective function simultaneously optimizes the local texture map, the RCam poses of the chosen images with respect to t_{s_i}, and the global texture map I_G. Only the derived texture maps S and the corresponding VCam poses T_S are required for real-time image-based rendering.
4.3.1 Parameters Used in Image Uniqueness Selection
When choosing the best representative frames for a given VCam, both the texture quality of each frame from the original video and the similarity of the corresponding RCam to the VCam are considered. The blurriness metric from Crete et al. [Crete et al., 2007] is used to evaluate image quality, and the distance and angle between the RCams and the VCam are used for the similarity evaluation. The overall weighting function is defined as follows:

w_i = max( cos(θ) b_i / d², δ ),  ∀ i ∈ I    (6)

where θ and d are the angle and the Euclidean distance between the VCam and the RCam of frame i, and b_i is the image-quality score of frame i derived from the blurriness metric. δ is a small constant that prevents non-positive values when θ is larger than 90°. In our case, we set δ = 10^{−3}.
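The weighting of Eq. 6 is a one-line computation per frame; the short sketch below shows it directly. The direction-vector inputs and the name of the quality score are illustrative assumptions (higher `quality` is assumed to mean a sharper frame).

```python
import numpy as np

def frame_weight(vcam_dir, rcam_dir, dist, quality, delta=1e-3):
    """Sketch of Eq. 6: favor frames seen from a similar direction, close to
    the VCam, and with good image quality."""
    cos_theta = np.dot(vcam_dir, rcam_dir) / (
        np.linalg.norm(vcam_dir) * np.linalg.norm(rcam_dir))
    return max(cos_theta * quality / dist ** 2, delta)
```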
4.3.2 Objective and Optimization
For each synthetic texture map s_j, we first sort the weights and select the K images with the highest weights, L = {l_1, l_2, ..., l_K} ⊂ I, with initial estimated camera poses T_L = {t_1, t_2, ..., t_K}. In our work, we use K = 20. We aim to simultaneously optimize the RCam poses T_L and the synthetic texture s_j by minimizing the error between s_j and every texture map rendered from an image l_k ∈ L. Given the geometry model M and the camera pose t_k of image l_k, Ψ renders the model M using image l_k and projects it to the VCam view with known camera pose t_{s_j} (e.g., the two green rectangles in Figure 16 (a)). Thus, the error function of the synthetic texture s_j is defined as follows:
E(T_L, s_j) = Σ_{k=1}^{K} w'_k Σ_{x∈X_k} ( s_j(x) − Ψ(l_k, t_k, M, t_{s_j}, x) )²    (7)

Note that X_k are the vertices visible in both the texture map s_j and the map generated by Ψ(l_k, t_k, M, t_{s_j}, x), and w'_k = w_k / Σ_{k=1}^{K} w_k are the normalized weights of the images in L.
As shown in Figure 16 (b), using only the local texture is insufficient to cover the entire model, because the closest images might contain only a partial view of the object. Thus, we introduce a weighted global texture term into our objective function but keep T_G unchanged, since it was already optimized in the previous section. Unlike Eq. 5, a weight is applied to each global texture with respect to the VCam. Two examples of the global texture rendered by Ψ are shown in the red rectangles of Figure 16 (a). To combine the two terms, λ defines the weight of the global texture term. The error function can then be rewritten as follows:

E(T_L, s_j) = E_L + λ E_G
            = Σ_{k=1}^{K} w'_k Σ_{x∈X_k} ( s_j(x) − Ψ(l_k, t_k, M, t_{s_j}, x) )² + λ Σ_{n=1}^{N} w'_n Σ_{x∈X_n} ( s_j(x) − h_n(x) )²    (8)

where w'_k = w_k / ( Σ_{k=1}^{K} w_k + λ Σ_{n=1}^{N} w_n ), w'_n = w_n / ( Σ_{k=1}^{K} w_k + λ Σ_{n=1}^{N} w_n ), and h_n is the global texture rendered from image g_n ∈ G.

The optimization iteratively minimizes the error function with respect to t_k ∈ T_L and s_j.
Optimizing T_L: To find the optimized RCam pose t_k ∈ T_L, s_j and all other poses t_y ∈ T_L, y ≠ k, are held fixed. The error function with respect to t_k simplifies to:

E(t_k) = Σ_{x∈X_k} ( s_j(x) − Ψ(l_k, t_k, M, t_{s_j}, x) )²    (9)

In Eq. 9, E(t_k) is a non-linear least-squares function and can be minimized using the Gauss-Newton algorithm.
Optimizing s_j: To find the optimized texture map s_j while T_L is fixed, the texture maps generated from T_L are also fixed. We replace Ψ(l_k, t_k, M, t_{s_j}, x) with f_k(x). Thus, the nonlinear least-squares problem in Eq. 8 turns into the linear least-squares problem in Eq. 10, which has a closed-form solution.

E(s_j) = Σ_{k=1}^{K} w'_k Σ_{x∈X_k} ( s_j(x) − f_k(x) )² + λ Σ_{n=1}^{N} w'_n Σ_{x∈X_n} ( s_j(x) − h_n(x) )²    (10)

Note that the weights of the global texture G are comparatively smaller than the weights of the locally selected images L. We further decrease them by setting λ to 0.1. The regions covered by L are therefore not affected much, but the uncovered areas with missing texture are filled in with the weighted global texture. As shown in Figure 16 (c), the optimized texture map s_j not only maximizes the color agreement of the locally selected images L but also seamlessly blends in the global texture G.
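Because Eq. 10 decouples per texel, the update of s_j is simply a weighted average of the local renderings f_k and the global textures h_n that observe each texel. The sketch below illustrates that closed-form step under the assumption that all renderings are already resampled into the target texture space; names and shapes are illustrative.

```python
import numpy as np

def update_synthetic_texture(f_local, w_local, h_global, w_global, lam=0.1):
    """Closed-form per-texel minimizer of an objective like Eq. 10.
    f_local: (K, H, W, 3) local renderings; h_global: (N, H, W, 3) global
    textures; w_*: per-texel validity weights of shape (K/N, H, W),
    zero where a rendering does not observe the texel."""
    num = ((w_local[..., None] * f_local).sum(0)
           + lam * (w_global[..., None] * h_global).sum(0))
    den = w_local.sum(0) + lam * w_global.sum(0)
    return num / np.maximum(den, 1e-8)[..., None]   # avoid division by zero
```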
(a) Overview of texture blending (b) Local texture only (c) Local and global textures
Figure 16.: (a) The synthetic texture map is blended from the local texture (green rectangle) and the global texture (red rectangle). (b) It was not possible to completely cover the entire 3D model using only the local texture information. (c) By blending with the global texture, complete coverage is obtained.
Moreover, the resolution of the synthetic texture can be set to an arbitrary size. We set the resolution to 2048×2048, which is preferred by most game engines without compression. By blending several images of VGA resolution (i.e., 640×480), our method achieves a higher output resolution. The same region of the virtual object is shown in Figure 17: the resolution of the blue rectangle in the synthetic texture is approximately two times that of the red rectangle in the original image.

Figure 17.: Comparison of an input image captured from the RGB-D camera (left) with a high-resolution synthetic texture map generated using the proposed method (right).
4.3.3 Real-time Image-based Rendering
At run-time, the HMD pose is provided by the Oculus Rift CV1 and two external Oculus Sensors. The traditional VDTM method computes the Euclidean distance between the user's head position and all RCam poses and then selects the closest images for rendering. In contrast, DOTS systematically samples all spherical VCam poses surrounding the virtual object. Thus, the vector from the model center to the HMD position p_u intersects exactly one triangle of the sphere mesh at a point p. The barycentric coordinates are used to compute the weights and blend the colors from the three synthesized textures {s_1, s_2, s_3} of the intersected triangle to generate the novel view.

p = w_1 p_{s_1} + w_2 p_{s_2} + w_3 p_{s_3}    (11)
Before combining these values, a visibility check is performed to detect occlusions. RGB values that fail the visibility check are then discarded (i.e., their weights are set to zero). The remaining weights are normalized and used to update the color c.

c = w'_1 c_1 + w'_2 c_2 + w'_3 c_3    (12)

where w'_i = w_i / (w_1 + w_2 + w_3), i ∈ {1, 2, 3}, and c_1, c_2, and c_3 are the colors retrieved from the three synthetic texture maps {s_1, s_2, s_3}.
4.4 USER STUDY AND EXPERIMENT RESULTS
4.4.1 Data Sets
The virtual object dataset used in the user study contains thousands of RGB-D sequences captured with a PrimeSense sensor [Choi et al., 2016]. The raw color and depth are not synchronized, so we assigned each color image to the depth image with the smallest time-stamp difference. Since both streams run at 30 fps, the shifting error is small and can also be handled by the optimization. We tested our system with four different models: a female sculpture, The Kiss by Rodin, Jean d'Aire by Rodin, and an antique leather chair (corresponding to IDs 3887, 4252, 5137, and 5989 in the database, respectively). The details of each object are shown in Table 2, and some example images are shown in Figure 18 (a). The vertices and surfaces are reconstructed using KinectFusion [Izadi et al., 2011], and the keyframes are selected from the color/depth stream. The decimated model of the sculpture is used in the second row of Figure 23.

Object        Vertices   Surfaces   Keyframes   Color/depth frames
Sculpture     208K       406K       101         3210 / 3225
(decimated)   1.3K       2.5K
The Kiss      280K       544K       116         3989 / 4007
Jean d'Aire   74K        144K       94          3609 / 3624
Chair         255K       495K       98          3299 / 3313

Table 2.: Information about the 3D models and images used in different examples.
4.4.2 Study Design
PARTICIPANTS Twenty-nine participants (15 male and 14 female) were recruited for our study. Participants were between 22 and 52 years old and were required to have normal or corrected-to-normal vision. Recruitment and experimental procedures were approved by the local Institutional Review Board (IRB).
APPARATUS AND IMPLEMENTATION Participants wore an Oculus Rift CV1 head-mounted display (HMD). The HMD has a 1080×1200 per-eye resolution, a 90 Hz refresh rate, a 110-degree nominal field of view, and six-degree-of-freedom tracking from two external Oculus Sensors. The participants were tracked in a 2 m × 2 m space, and the virtual reality experience was rendered using Unity 2017.2.0f3 on an MSI GS73VR with an Intel i7-7820HK CPU, Nvidia GeForce GTX1070 GPU, and 16 GB of RAM. Per-vertex rendering is parallelized in our design by also passing the depth map of each synthetic texture to a shader for a visibility check. Although the number of images and the number of vertices vary between virtual objects, our method can render each model in under 8 milliseconds (i.e., more than 125 fps), which is sufficient for real-time virtual reality applications.
CONDITIONS There were three within-subjects texture rendering conditions used in our study: a single fixed texture with color map optimization [Zhou and Koltun, 2014], simple view-dependent texture mapping (VDTM) [Chen et al., 2017], and dynamic omnidirectional texture synthesis (DOTS).
QUESTIONNAIRE The evaluation was designed to gauge both the visual quality and the smoothness of transitions between viewpoints for each method. For the purposes of this experiment, visual quality was defined as the degree to which the appearance of the virtual object resembled the source video of the real object from the capture session, which was displayed as a reference during each trial. This was achieved using two different approaches: (1) individual ratings, where participants were asked to rate each method on a 1 to 7 rating scale, and (2) pair-wise comparisons, where users would compare methods A and B by selecting one of the following options: {A ≪ B, A < B, A = B, A > B, A ≫ B}. For instance, A < B indicated that method B is preferable to method A, and A ≪ B indicated that method B is greatly preferred over method A. There were a total of six question sets: three for individual ratings and three for pair-wise rankings. Figure 18 shows the user interfaces used for the individual ratings and pair-wise comparisons.
(a) Virtual objects (b) Virtual environment and questionnaire user interfaces
Figure 18.: (a) Four virtual objects used in our study. From left to right respec-
tively: a female sculpture, The Kiss, Jean d’Aire, and an antique
leather chair. (b) The virtual environment and questionnaire user
interfaces (individual ratings on the left, pair-wise comparisons
on the right). Participants could use the rotation slider to rotate
the model around the vertical axis. The bottom two sliders were
used to gather feedback from participants.
PROCEDURE The procedure of the user study was explained to participants using a printout of the user interface (i.e., Figure 18 (b)) before beginning the experiment. Participants were then introduced to the hardware used. After putting on the head-mounted display, they were loaded into an open-ended virtual environment with gray flooring and a blue sky. Users verbally verified that they could clearly see the visuals and then proceeded with the tutorial for using the Oculus Touch. Participants were then asked to verbally confirm that they had understood the tutorial and the questionnaire before immediately beginning the user study. In this study, the four different virtual objects in Table 2 were shown in random order. Participants were shown the original video of the capture sequence for 10 seconds to ensure they saw the reference object before the experimental task. The reconstructed object was rendered using the three texture methods in randomized order in both the individual ratings and the pair-wise comparisons to avoid response bias. Participants could switch between the two selected texture conditions at any time during the pair-wise comparison task using the slider in the user interface.
4.4.3 Individual Rating Results
Initial examination of the individual rating scores suggested that the data were not normally distributed, and Mauchly's test indicated a possible violation of sphericity. Therefore, visual quality and transition smoothness ratings were analyzed using nonparametric statistical tests. Post-hoc tests were conducted using a Bonferroni adjustment of α = .017 to control for error in multiple comparisons. A visual comparison of these results is shown in Figure 19.
VISUAL QUALITY A Friedman test analyzing visual quality scores was significant, χ²(29) = 40.35, p < .001. Post-hoc comparisons using Wilcoxon signed-rank tests indicated that objects rendered using a fixed texture (Mdn = 4.00, IQR = 3.38-4.88) were rated as poorer quality compared to both VDTM (Mdn = 5.50, IQR = 4.75-6.00), p < .001, and DOTS (Mdn = 5.75, IQR = 5.13-6.13), p < .001. The difference between the two dynamic texture mapping conditions was not significant.
TRANSITION SMOOTHNESS A Friedman test analyzing transition smoothness scores was significant, χ²(29) = 18.40, p < .001. Post-hoc comparisons using Wilcoxon signed-rank tests indicated that the viewpoint transitions were rated as less smooth in the VDTM condition (Mdn = 3.75, IQR = 3.00-4.88) compared to both fixed textures (Mdn = 5.50, IQR = 4.00-6.13), p = .015, and DOTS (Mdn = 5.25, IQR = 4.88-6.00), p < .001. The difference between the fixed texture condition and DOTS was not significant.
Figure 19.: Results for the individual ratings of visual quality and transition
smoothness. The graph indicates the median and interquartile
range for each experimental condition.
4.4.4 Pair-wise Ranking Results
For the pair-wise ranking, we summarized the feedback into pie-charts to vi-
sualize participant preferences for each method. Note that we also performed
the comparison between VDTM and the fixed texture condition during the
study to anonymize all three methods. However, since this comparison is ex-
traneous for the purposes of this study, we report only the results between
DOTS and the other two methods in Figure 20.
With regards to texture quality, DOTS was preferred to the fixed texture
method 94% of the time. Differentiating between DOTS and VDTM was
less conclusive, with only 48% of cases ranking DOTS over VDTM for this
specific criterion. However, for transition smoothness, DOTS was preferred
over VDTM in 78% of the comparisons. Taken together, these results sug-
gest that DOTS is the generally preferable technique when attempting to si-
multaneously maximize both visual quality and smooth transitions between
viewpoints.
Figure 20.: Summary of pair-wise rankings comparing DOTS with the other
two methods according to visual texture quality and transition
smoothness between viewpoints.
4.4.5 Visual Analysis
In Figure 21, we present a visual comparison of DOTS and VDTM [Chen et al., 2017]. DOTS systematically samples a spherical virtual camera set surrounding the target object, while the selected keyframes in VDTM are unstructured. For both methods, three images from the highlighted triangles are used to render the model. Because the selected virtual cameras for DOTS have similar viewing directions, the viewpoint generated by blending the three corresponding synthetic textures exhibits fewer artifacts compared to VDTM, which uses the three closest images from the original capture sequence. Moreover, the synthetic texture maps provide omnidirectional coverage of the virtual object, while the selected keyframes in VDTM sometimes only partially capture the target object. Because of this missing texture information, VDTM can only synthesize a partial unobserved view, resulting in sharp texture discontinuities (i.e., seams), whereas DOTS can render the model without such artifacts. Similar observations can be made across all four virtual objects. As shown in Figure 24, the fixed texture method (red rectangle) results in a lower-fidelity (blurry) appearance. Although the model rendered using VDTM can achieve a more photorealistic visual appearance (leftmost in the blue rectangle), the region of high-fidelity viewpoints is limited.
Due to the unconstrained capture process, this region varies between differ-
ent objects and is highly dependent on the specific motions of the hand-held
RGB-D camera. Texture defects and unnatural changes between viewpoints
are not pleasant during virtual reality experiences where user freedom is en-
couraged and the view direction cannot be predicted in advance. However, in
contrast to VDTM, DOTS generates omnidirectional synthetic texture maps
that produce visually reasonable results even under such conditions.
(a) VDTM (b) DOTS
Figure 21.: (a) The key frames selected by VDTM are not uniformly dis-
tributed around the 3D model because they are dependent upon
the camera trajectory during object capture. Thus, this leads to an
irregular triangulation (red) and undesirable visual artifacts. (b) In
contrast, the synthetic maps generated by DOTS cover all poten-
tial viewing directions and the triangulation is uniform, resulting
in seamless view-dependent textures.
It is worth noting that DOTS can generate a good appearance even if the HMD position deviates significantly from the captured trajectory, while VDTM results in unnatural specularity and noticeable artifacts. Lacking images around the HMD position, the closest images in VDTM are selected from far away in both distance and viewing angle. For example, the top views of three different objects rendered by VDTM (Figure 22, left), which are far from the area covered by the captured data, have severe artifacts. In particular, the texture used to render the female sculpture comes from both front views and side views, resulting in inconsistent specularity on the shoulder and the chest. Several texture seams on the base of The Kiss and on the back of the chair are observed because of the unstructured triangulation.
Figure 22.: Rendering results from viewpoints that were not covered during object capture (e.g., a top-down view of the virtual objects). (left) The model is rendered incorrectly by VDTM. For example, the specular highlights on the female sculpture's shoulder and chest conflict with each other. Blending textures from distant camera positions results in noticeable color discontinuities on the base of The Kiss and the back of the chair. (right) The model rendered by DOTS still presents reasonable visual quality even though the viewpoint was never observed in the capture sequence.
UV MAPPING In Figure 23, we decimated the reconstructed female sculpture from 208K vertices to only 1.3K vertices (i.e., 0.6% of the original model). The visual fidelity of the model decreased dramatically when rendered using the fixed texture. The texture mapping failed for VDTM, resulting in extremely undesirable artifacts. However, when rendered using DOTS, the reduced polygon mesh retained a high-fidelity appearance that was nearly indistinguishable from the original model.
Figure 23.: (top) A 3D model rendered using per-vertex color. (bottom) A re-
duced polygon mesh rendered using a UV map. From left to right:
the untextured geometric model, fixed texture mapping [Zhou and
Koltun, 2014], VDTM [Chen et al., 2017], and DOTS.
Figure 24.: Appearance comparison results for each virtual object used in our study. The top row shows three example images from the original video. The second row shows the geometry model and the results from VDTM [Chen et al., 2017] (in the blue rectangle); note that although VDTM can achieve good visual quality (leftmost), the region of high-fidelity viewpoints is limited. The third row shows the fixed texture [Zhou and Koltun, 2014] (in the red rectangle) and the results of DOTS (in the green rectangle).
4.5 CONCLUSION
We present dynamic omnidirectional texture synthesis, a novel approach for
rendering virtual reality content captured using a consumer-grade RGB-D
camera. The proposed objective function is used to generate optimized syn-
thetic texture maps for real-time free-viewpoint rendering. The user study
demonstrated that DOTS achieves comparable visual quality to a previously
proposed VDTM technique, while simultaneously producing smoother tran-
sitions between viewpoints. Visual comparison further confirmed that DOTS
can handle more extreme cases such as viewing perspectives that were not
directly covered during object capture. Due to the nature of image-based
rendering techniques, real-world illumination is baked into the texture map. During object capture, a controlled lighting environment and fixed camera white balance and exposure are therefore expected.
5
RECONSTRUCT VIRTUAL CONTENT REFLECTANCE WITH CONSUMER-GRADE DEPTH CAMERAS
5.1 INTRODUCTION
In the previous chapter, using the view-dependent texture mapping (VDTM) approach, the texture was dynamically updated in real-time using a subset of images closest to the current virtual camera position. Although these methods typically result in improved visual quality, the dynamic transition between viewpoints is potentially problematic, especially for objects captured using consumer RGB-D cameras. This is due to the fact that the input sequences often cover only a limited range of viewing directions, and some frames may only partially capture the target object. Another limitation is synthesizing accurate specular reflections, since the reflectance of an object is strongly related to its geometry, surface material properties, and environmental lighting conditions. The synthetic views created by interpolating between color images may have visual artifacts or be inconsistent with the geometry, especially for capture data with sparse camera trajectories. Moreover, the reconstructed objects are incompatible with virtual environments that include dynamic lighting. In this chapter, we propose a capture-to-rendering content creation pipeline that estimates the diffuse and specular reflectance of the object and the lighting conditions during capture (Figure 25, left). Given an RGB-D sequence, the geometry of the object is first reconstructed and several color frames are selected from the original stream. We introduce a novel low-rank decomposition method that simultaneously optimizes the camera poses of each frame and separates each color frame into a diffuse map and a specular map. Using the specular maps from many different viewing directions, the surface material properties and the lighting conditions can be derived. At run-time, the optimized diffuse textures and specular reflectance parameters can then be used to synthesize the captured object's appearance from an arbitrary viewpoint. As shown in Figure 25 (right) and Figure 32, our method can generate plausible results even for unseen views that were not present in the captured dataset. Moreover, the reconstructed virtual content can be readily integrated with virtual scenes and dynamically relit with different lights of varying direction, color, and intensity.
The following outlines the major contributions of this chapter:
Figure 25.: (Left) Several key frames are first selected from a given RGB-D
sequence. The diffuse components and the specular components
can be obtained by simultaneously optimizing the camera pose of
each frame and low-rank decomposition. The material property of
the virtual object and the lighting conditions are estimated from
the specular components. At run-time, the specularity of a virtual
object is computed based on the user viewpoint and combined
with the diffuse map (i.e., view-independent component) for final
results. (Right) A comparison between an unselected frame and the corresponding rendering result. Note that our approach is able to
generate plausible results for unseen views that were not present
in the captured dataset.
• A novel texture separation approach that decomposes each captured
image into a diffuse component and a specular component and simulta-
neously optimizes its camera pose.
• Estimation of the light sources in the capture environment using only
the observed specular reflections. The optimized specular reflectance
of the object is derived from the estimated lighting conditions.
• With estimated specular and diffuse reflectance, our method can not
only replicate the appearance of the object during capture but can also
synthesize unseen views of the reconstructed model.
• A fully automatic virtual content creation pipeline that does not require
expert knowledge or manual human effort. The reconstructed mod-
els are suitable for integration with industry-standard virtual environ-
ments.
5.2 OVERVIEW AND PREPROCESSING
5.2.1 Overall Process
An overview of the system pipeline is shown in Figure 26. The object's geometry is first reconstructed from the depth stream captured using a consumer-grade RGB-D camera. A set of keyframes is selected from the entire color stream, and the camera poses of those frames and the diffuse/specular maps are optimized using the low-rank objective function. The lighting conditions and material properties are then estimated from the specular components. At run-time, the user viewpoint provided by a head-tracked virtual reality display is used to generate the appearance of the reconstructed 3D model in real-time.

Figure 26.: Overview of our content creation pipeline. Color and depth image streams are captured from a single RGB-D camera. The geometry is reconstructed from the depth information. In Section 5.3, each selected frame is separated into diffuse and specular maps by low-rank decomposition and camera pose optimization. In Section 5.4, the lighting conditions and material properties are obtained from the specular maps. The texture of the dynamically relightable reconstructed object is rendered in real-time based on the user's current viewpoint.
5.2.2 Geometry Reconstruction
Any 3D reconstruction method can be used to obtain the geometry model with
a triangular mesh representation. Similar to other work [Zhou and Koltun, 2014, Jeon et al., 2016, Bi et al., 2017, Chen and Suma Rosenberg, 2018] that focuses on texture mapping for objects captured using hand-held RGB-
D cameras, we use Kinect Fusion [Izadi et al., 2011] to construct the 3D
model M from the depth sequences. The camera trajectory of the sequence
is roughly estimated and can be used for the texture generation.
5.2.3 Key Frame Selection
Using all frames I of the input video for generating the global texture is inefficient. Instead, temporal [Zhou and Koltun, 2014, Bi et al., 2017] and spatial [Jeon et al., 2016, Richter-Trummer et al., 2016, Chen and Suma Rosenberg, 2018] keyframe selection methods have been proposed that use either the distribution in time or in space to obtain a set of the most representative frames from the original color sequence. We chose spatial keyframe selection to maximize the variation of viewing angles of the 3D model. The selected n keyframes, G = {g_1, g_2, ..., g_n} ⊂ I, with initial estimated camera poses T_G = {t_{g_1}, t_{g_2}, ..., t_{g_n}}, are used for material property estimation. The blurriness [Crete et al., 2007] of each frame is used to sort all frames, and images are selected from the lowest to the highest blurriness subject to the constraint ‖p_i − p_j‖² > d, ∀ i, j ∈ G, where p is the camera position and d is the minimum distance between any two selected images. The number of keyframes n varies because it depends on the trajectory of the original video sequence.
5.3 DIFFUSE AND SPECULAR MAP SEPARATION
5.3.1 Motivation
The appearance of an object is a linear combination of two components: diffuse reflection and specular reflection. Directly using the selected frames G from the original color stream to render the object can achieve a photorealistic texture. The reflectance can be accurately replicated when the viewing direction is perfectly aligned with one of the camera poses in T_G. However, for objects captured using handheld consumer-grade RGB-D cameras, the trajectories are sparse and unstructured, especially when captured by non-expert users. Therefore, rendering an unseen view (i.e., a viewing direction that cannot be found in T_G) by interpolating between the closest frames often leads to poor results. This is due to the fact that specular reflections are highly dependent on the viewing direction. Thus, our objective is to separate each original image g_i ∈ G into a diffuse map d_i and a specular map s_i, where g_i = d_i + s_i. Separating a single image into two images is an ill-posed problem. Moreover, the unknown camera pose of each image makes this problem even more challenging.
5.3.2 Camera Pose Optimization
From Section 5.2, the reconstructed model M, the n selected images G, and their initial camera pose estimates T_G are obtained. However, the raw camera poses are not accurate enough to map the texture onto the model correctly. Therefore, Zhou et al. [Zhou and Koltun, 2014] proposed a color mapping optimization to obtain the optimized camera poses T_G and the vertex colors C(v) of the reconstructed model M. The objective function is defined as follows:

E(C, T_G) = Σ_{g∈G} Σ_{v∈M} ( C(v) − Γ(v, g, t_g) )²    (13)

where Γ retrieves the color by projecting vertex v onto image g using the estimated pose t_g. The objective function iteratively solves for C(v) and T_G to find the optimal solution.
5.3.3 Low-Rank Decomposition
The appearance of an object from a certain viewpoint can be divided into two parts: a view-independent component (i.e., the diffuse map) and a view-dependent component (i.e., the specular map). Thus, estimating the camera pose of each frame is not enough for reflectance reconstruction. In our approach, we aim to use the isotropic property of diffuse reflection to solve the problem.
According to the Lambertian reflectance model, the diffuse texture obeys Lambert's cosine law and is not determined by the viewing direction. Assuming the selected images G are well aligned and the illumination environment is fixed, the diffuse layers are strongly correlated. As shown in Figure 27, with known camera poses, the textures of all selected frames can be projected into a certain viewpoint and the resulting synthetic images can be vectorized into an image matrix A. Our objective is to separate A into a diffuse matrix L and a specular matrix S. The rank of the diffuse matrix L is considered low, since the observed intensity of a Lambertian surface is the same regardless of the observer's angle of view. On the other hand, the specular matrix S, containing only the highlights, forms a sparse matrix. We use low-rank matrix recovery [Wright et al., 2009] to separate each frame into a diffuse and a specular map. The details of low-rank decomposition are explained in Appendix B. Low-rank decomposition minimizes the rank of L while reducing the 0-norm of S:

min_{L,S} rank(L) + λ‖S‖_0   s.t.  A = L + S    (14)

From the above formulation, we note that ‖S‖_0 counts the number of non-zero elements in S. Since solving Eq. 14 involves both low-rank matrix completion and 0-norm minimization, it is NP-hard and thus not easy to solve. To convert Eq. 14 into a more tractable optimization problem, Candès et al. [Candès et al., 2011] relax Eq. 14 by replacing rank(L) with its nuclear norm ‖L‖_* (i.e., the sum of the singular values of L). Instead of minimizing the 0-norm ‖S‖_0, the 1-norm ‖S‖_1 is considered (i.e., the sum of the absolute values of each entry of S). Consequently, the convex relaxation of Eq. 14 has the following form:

min_{L,S} ‖L‖_* + λ‖S‖_1   s.t.  A = L + S    (15)

In our approach, using Eq. 15, we can obtain the diffuse map and the specular map of each selected frame.
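A common way to solve a principal component pursuit problem of the form of Eq. 15 is an augmented Lagrangian / ADMM scheme that alternates singular value thresholding for the low-rank part and soft thresholding for the sparse part. The sketch below illustrates this under standard default parameter choices; it does not include the non-negativity constraints added later in Eq. 18 and is not necessarily the exact solver used in this work (see Appendix B).

```python
import numpy as np

def rpca(A, lam=None, mu=None, tol=1e-7, max_iter=500):
    """Sketch of principal component pursuit (Eq. 15) via a basic ADMM scheme:
    L collects the low-rank (diffuse) part, S the sparse (specular) part."""
    m, n = A.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    mu = mu if mu is not None else 0.25 * m * n / np.abs(A).sum()
    S = np.zeros_like(A)
    Y = np.zeros_like(A)
    soft = lambda X, t: np.sign(X) * np.maximum(np.abs(X) - t, 0)
    for _ in range(max_iter):
        # Low-rank update: singular value thresholding of (A - S + Y/mu).
        U, sig, Vt = np.linalg.svd(A - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(sig - 1 / mu, 0)) @ Vt
        # Sparse update: elementwise soft thresholding.
        S = soft(A - L + Y / mu, lam / mu)
        # Dual update on the constraint A = L + S.
        R = A - L - S
        Y += mu * R
        if np.linalg.norm(R) <= tol * np.linalg.norm(A):
            break
    return L, S
```

Applied to an aligned image matrix such as A'_i below, the columns of L give the diffuse observations and the columns of S the view-dependent highlights.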
5.3.4 Texture Separation and Pose Estimation
In our approach, we aim to simultaneously estimate the camera poses and separate the frames. Each image g_i ∈ G can be divided into a diffuse image d_i and a specular image s_i using low-rank decomposition. The diffuse component d_i is used to refine the camera pose by projecting all the vertices v of the model M onto image d_i using the estimated transformation matrix T_{g_i}. Note that vertices that do not pass the visibility check (i.e., that are either outside the image view or occluded by other vertices) are discarded. The objective function is defined as follows:

E(C, D, T_G) = Σ_{d_i∈D} Σ_{v∈M} ( C(v) − Γ(v, d_i, T_{g_i}) )²    (16)

where D = {d_1, d_2, ..., d_n} is the set of all diffuse maps. Because Γ and the low-rank decomposition are both non-linear, Eq. 16 is solved by iteratively updating the diffuse maps D, the vertex colors C, and the camera poses T_G.

Figure 27.: Separation of selected frames into a diffuse map and a specular map. The original textures of the selected frames (top row) are projected to the camera pose of the left frame (second row) to form a data matrix A. Low-rank decomposition is able to separate A into a diffuse matrix (third row) and a specular matrix (fourth row).
TEXTURE SYNTHESIS FUNCTION Ψ: The low-rank decomposition requires an image matrix A in which each column represents an observation of the target object and each row is highly correlated. Since the keyframes (Figure 27, top row) are captured from different viewpoints, directly using those images cannot form a reasonable image matrix A. Thus, to utilize those images in our texture decomposition setting, they first need to be aligned. We introduce the texture synthesis function Ψ for synthesizing the texture from each image at a given camera view. Given a model M and the camera pose t_i of an image i, Ψ renders the model M with image i and projects the rendered model to a known camera pose t_j. The texture synthesis function is thus defined as Ψ(i, t_i, M, t_j). For example, to synthesize the texture of Figure 28 (a) in other camera views, the model is rendered from the image with its corresponding camera pose (i.e., the red camera in Figure 28 (b)). The rendered model is then projected to another known camera pose (e.g., the blue or the green camera in Figure 28 (b)) to generate the synthetic texture. Figure 28 (c) shows the synthesized results for the three camera views. As shown in Figure 28 (a) and the left of (c), the synthetic texture of the model generated by Ψ is the same as the texture from the original image. Note that the occluded area on the model is shown in gray for better visualization and is set to black (i.e., zero) in our experiments.

(a) Reference image (b) Relationship between cameras (c) Projected results compared to the closest camera view

Figure 28.: An example of the texture synthesis function Ψ. (a) and (b) show a reference image with a known camera pose (red). This image is used to render the model, and the rendered model is projected to the other two camera views (blue and green). (c) The synthesized results of the three camera views. Note that the occluded area (i.e., not visible in (a)) is painted in gray for better visualization.
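For concreteness, the following is a minimal image-space sketch of a reprojection operator in the spirit of Ψ: it warps the colors of image i into the view of camera j using the model's depth map rendered at view j. The depth-map-based formulation, the world-to-camera pose convention, and all names are illustrative assumptions; the actual implementation renders the textured mesh, and an occlusion check against image i is omitted here for brevity.

```python
import numpy as np

def reproject(color_i, pose_i, depth_j, pose_j, K):
    """Warp image i into camera view j (a stand-in for Psi(i, t_i, M, t_j)).
    depth_j is the model's depth map rendered at view j."""
    h, w = depth_j.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth_j.ravel()
    valid = z > 0
    # Back-project view-j pixels to 3D points in the world frame.
    x = (u.ravel() - K[0, 2]) / K[0, 0] * z
    y = (v.ravel() - K[1, 2]) / K[1, 1] * z
    pts_j = np.stack([x, y, z], axis=1)
    Rj, tj = pose_j[:3, :3], pose_j[:3, 3]
    pts_w = (pts_j - tj) @ Rj                  # inverse of the world->camera-j map
    # Project the world points into camera i and sample its colors.
    Ri, ti = pose_i[:3, :3], pose_i[:3, 3]
    pts_i = pts_w @ Ri.T + ti
    zi = np.where(pts_i[:, 2] > 0, pts_i[:, 2], 1.0)
    ui = np.round(K[0, 0] * pts_i[:, 0] / zi + K[0, 2]).astype(int)
    vi = np.round(K[1, 1] * pts_i[:, 1] / zi + K[1, 2]).astype(int)
    ok = (valid & (pts_i[:, 2] > 0) & (ui >= 0) & (ui < color_i.shape[1])
          & (vi >= 0) & (vi < color_i.shape[0]))
    out = np.zeros((h * w, 3), dtype=color_i.dtype)  # unseen pixels stay black (zero)
    out[ok] = color_i[vi[ok], ui[ok]]
    return out.reshape(h, w, 3)
```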
UPDATING D   To update the diffuse map d_i, the selected frames G = {g_1, g_2, ..., g_n} and their current camera poses T_K are fixed. We first apply Ψ(i, t_i, M, t_j), where t_j ∈ T_K, to synthesize the textures from all keyframes (Figure 27, second row). We also generate the depth map of the model from the camera pose. A binary mask derived from the depth map marks the region where the model appears in the synthetic texture. The mask is used to select the region of interest of each projected synthetic view, which is then stacked as a vector. Since the mask is based on the geometry rather than the appearance, the length of each vector is the same. The image matrix A_i = {a_1, a_2, ..., a_n} is formed by concatenating all of these image vectors. A prerequisite for low-rank decomposition is that the columns of the low-rank matrix are strongly correlated, so using the synthetic texture of an image captured far from the reference camera position will increase the sparse errors. The ratio of non-zero pixels between a synthetic image a_j, j ≠ i, and the rendered image a_i is therefore used as a constraint to discard unqualified synthetic textures:
A'_i = \{ a_1, a_2, a_3, \dots, a_m \}, \quad m \leq n
\forall a_j \in A'_i, \quad NZ(a_j) / NZ(a_i) \geq \gamma        (17)
where NZ(·) is the function that counts the non-zero pixels of an image. In our experiments, γ = 0.8.
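A small sketch of how the filtered image matrix A'_i of Eq. 17 might be assembled is shown below; the view list, mask, and function names are assumptions for illustration only.

```python
import numpy as np

def build_matrix(synthetic_views, reference_index, mask, gamma=0.8):
    """Stack masked synthetic views as columns, dropping views whose
    non-zero-pixel ratio against the reference view is below gamma (Eq. 17)."""
    nz = lambda img: np.count_nonzero(img[mask])
    ref = synthetic_views[reference_index]
    cols, kept = [], []
    for j, view in enumerate(synthetic_views):
        if nz(view) / max(nz(ref), 1) >= gamma:
            cols.append(view[mask].reshape(-1))   # region of interest as a vector
            kept.append(j)
    return np.stack(cols, axis=1), kept           # columns of A'_i, kept indices
```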
The objective function aims to separate the data A'_i into a low-rank diffuse matrix L_i and a sparse specular matrix S_i. Because illumination intensity should not be negative, we add non-negativity constraints to the original low-rank decomposition. The function can be written as follows:

\min_{L_i, S_i} \; \|L_i\|_* + \lambda \|S_i\|_1
\text{s.t.} \quad A'_i = L_i + S_i, \quad L_i \geq 0, \quad S_i \geq 0        (18)
Although the specular highlights are incorrectly represented for the reference viewpoint when using Ψ (the second row of Figure 27), they are treated as sparse errors and can be separated out by the low-rank decomposition, as shown in the fourth row of Figure 27. After the low-rank decomposition, the corresponding diffuse map and specular map are obtained. The intermediate diffuse and specular maps are saved as D = {d_1, d_2, ..., d_n} and S = {s_1, s_2, ..., s_n}. The third and fourth rows of Figure 27 show the final separation results of the diffuse and specular maps. Note that we only keep the diffuse and specular maps related to the reference view (i.e., the left-most) for reflectance estimation. The other maps are used only for texture separation and are discarded after the decomposition process.
UPDATING T AND C   With D fixed, we use the original color mapping optimization to solve for the camera pose of each frame. For estimating the camera pose of d_i ∈ D, the error function simplifies to:
E(C, T_{d_i}) = \sum_{v \in M} \left( C(v) - \Gamma(v, d_i, T_{d_i}) \right)^2        (19)
Note that replacing the original image with the diffuse map d_i in the color mapping optimization can further improve the accuracy of the camera pose estimation due to the removal of specular highlights.
5.4 Material and Lighting Estimation
5.4.1 Phong Reflection Model
The colors observed by our eyes or captured in images are a combination of the illumination and the object's color. In order to approximate these light reflections and surface properties, several empirical models have been proposed to describe the appearance changes. In this paper, we use the well-known Phong reflection model [Phong, 1975] as our reflectance model. The illumination of each surface point I_p is computed by combining I_d and I_s:
I_p = I_d + I_s
I_p = \sum_{j \in J} k_d (L_j \cdot N) \, i_j + \sum_{j \in J} k_s (R_j \cdot V)^{\alpha} \, i_j        (20)
where k_d and k_s are the diffuse and specular reflection constants, respectively; J is the set of all lights; i_j is the intensity of the j-th light source; L_j and R_j are the corresponding incident and reflection directions at the surface; N is the surface normal; V is the direction toward the viewer; and α is the shininess parameter.
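To make the shading model concrete, the following is a small NumPy sketch of Eq. 20 for a single surface point; the function name and argument layout are illustrative and not part of the rendering pipeline.

```python
import numpy as np

def phong_illumination(normal, view_dir, lights, k_d, k_s, alpha):
    """Evaluate Eq. 20 at one surface point. Inputs are unit vectors;
    lights is a list of (light_dir, intensity) pairs."""
    intensity = 0.0
    for light_dir, i_j in lights:
        n_dot_l = max(np.dot(normal, light_dir), 0.0)
        # Reflection of the incident light about the surface normal.
        reflect = 2.0 * np.dot(light_dir, normal) * normal - light_dir
        r_dot_v = max(np.dot(reflect, view_dir), 0.0)
        intensity += k_d * n_dot_l * i_j + k_s * (r_dot_v ** alpha) * i_j
    return intensity
```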
5.4.2 Objective Function
MATERIAL ESTIMATION FROM SPECULAR MAP   In Section 5.3, we have already separated each selected image into a specular map and a diffuse map. Without knowledge of the number of lights and their directions, I_d = \sum_{j \in J} k_d (L_j \cdot N) i_j has infinitely many solutions that minimize the objective function. Thus, it is impossible to recover k_d of the virtual model from the diffuse maps alone. However, as shown in Figure 29, the specularity is sparse and depends strongly on both the viewing and lighting directions. Using both the specular maps S = {S_1, S_2, ..., S_n} and the surface normals, we are able to derive the lighting directions first and then estimate the lighting condition and the material properties of the highlighted regions of the object.
ERROR FUNCTION   We introduce the function ρ(S_a, v) to retrieve the value of vertex v from the specular map S_a using the corresponding transformation matrix T_a (i.e., the extrinsics) and the camera intrinsic parameters. To simplify notation, we use k instead of k_s in our equations. For any vertex v ∈ M, the error function is defined as follows:

\{k, \alpha, I_J\} = \arg\min \sum_{S_a \in S'} \left( \rho(S_a, v) - k \sum_{j \in J} (R_j \cdot V_j)^{\alpha} i_j \right)^2        (21)
Figure 29.: (Left) An example of a selected frame and its surface normal. (Middle) The diffuse map is unable to recover the lighting condition. (Right) In contrast, the specular map is sparse, and the lighting direction can be roughly assumed to come from the top. Using this assumption across all selected images, we can approximately calculate the number of lights and their directions for reflectance estimation.
where S' is the set of frames in which v is not occluded and i_j is the intensity of each light. To find the optimal lighting condition and material properties for all vertices, the objective function is defined as follows:

\{K, A, I_J\} = \arg\min \sum_{S_a \in S} \sum_{v \in v_a} \left( \rho(S_a, v) - k_v \sum_{j \in J} (R_{vj} \cdot V_{av})^{\alpha_v} i_j \right)^2        (22)

where v_a is the set of visible vertices in S_a. For each visible vertex v, k_v is the specular coefficient, R_{vj} is the reflection direction with respect to incident light j, and V_{av} is the viewing direction from the camera pose of S_a to the vertex v. In this objective function, we aim to find the optimal specular coefficients K = {k_1, k_2, ..., k_m}, shininess parameters A = {α_1, α_2, ..., α_m}, and light intensities I_J = {i_1, i_2, ..., i_J}.
LIGHTING DIRECTION ESTIMATION   To find the optimal solution for Eq. 22, we first need to determine the number of lights and their directions. Since the specular reflection in the Phong shading model is defined as (R · V)^α, specularity only appears when the viewing direction v and the reflection direction r are similar. Based on this property, we can compute the approximate incident direction l using l = 2(v · n)n − r, where n is the surface normal.
For each pixel in the highlight region of the specular map, an approximated direction is added as a candidate. We assume the incident lights are all directional lights, so each candidate can be converted from Cartesian coordinates to the spherical coordinate system (radius = 1). Figure 30 shows the voting results for the incident light.
The number of clusters and their centers are obtained by mean-shift [Comaniciu and Meer, 2002]. We discard clusters that receive fewer than 0.2 times the votes of the largest cluster.
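As an illustration of this clustering step, the following sketch uses scikit-learn's MeanShift on the spherical coordinates of the candidate directions. The 0.2 vote threshold follows the text, but the bandwidth value and all names are assumptions; this is not the original implementation.

```python
import numpy as np
from sklearn.cluster import MeanShift

def cluster_light_directions(candidate_dirs, bandwidth=0.2, min_ratio=0.2):
    """candidate_dirs: (N, 3) unit vectors voted by highlight pixels."""
    theta = np.arccos(np.clip(candidate_dirs[:, 2], -1.0, 1.0))   # polar angle
    phi = np.arctan2(candidate_dirs[:, 1], candidate_dirs[:, 0])  # azimuth
    votes = np.stack([theta, phi], axis=1)                        # unit-radius sphere coords
    ms = MeanShift(bandwidth=bandwidth).fit(votes)
    labels, centers = ms.labels_, ms.cluster_centers_
    counts = np.bincount(labels)
    keep = counts >= min_ratio * counts.max()     # drop weakly supported clusters
    return centers[keep]                          # (theta, phi) per estimated light
```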
Figure 30.: Voting results for possible incident light directions. Each highlight pixel votes for a possible direction. The mean-shift algorithm is applied to obtain the number of clusters and their centers.
OPTIMIZATION   Given the known camera poses from the previous low-rank step, the lighting directions from the clustering, and the pre-computed vertex normals, the above equation can be rewritten as follows:

\{k, \alpha, i\} = \arg\min \sum_{a=1}^{n} \sum_{b \in v_a} \left( y_{ab} - k_b \sum_{c=1}^{l} (x_{abc})^{\alpha_b} i_c \right)^2
\text{s.t.} \quad \sum_{c=1}^{l} k_b (x_{abc})^{\alpha_b} i_c < y_{ab}, \quad 1 \geq k_b \geq 0, \quad \alpha_{max} \geq \alpha \geq 0, \quad 1 \geq i \geq 0        (23)
where y_{ab} = ρ(S_a, b) is a scalar for camera view a and vertex b, and x_{abc} = (R_{bc} · V_{ab}) is a scalar based on camera view a, vertex b, and incident light c. α_max is a constraint that bounds the value of α to avoid over-fitting. The details of how to solve the objective function can be found in Appendix C.
The optimized specular reflectance k and the shininess parameter α can be used in real-time rendering. In Figure 31, the estimated K_s and α are shown. Note that for better visualization, we scale the K_s value range from 0–1 to 0–255 and α from 0–α_max to 0–255. The highlight regions in the original image have higher K_s and α, while regions without specular effects have lower values.
Figure 31.: Visualization of the estimated K_s and α. Note that darker areas correspond to smaller estimated values of each coefficient.
In Figure 32, we compare our rendering results with the original images. Note that the appearance rendered with our optimized material properties preserves and replicates the highlights seen in the selected frames. A virtual sphere is added to the scene to illustrate the lighting direction and color.
(a) Torso of Elevation
(b) Antique Leather Chair
Figure 32.: Comparison of the rendered virtual objects (top row) and the original captured frames (bottom row) for two different objects. The highlights on the virtual sphere visualize the direction and color of the virtual lights estimated from the real-world scene.
5.5 Results
DATASETS   The virtual object dataset contains thousands of RGB-D sequences captured by non-experts using a Primesense camera [Choi et al., 2016]. The raw color and depth streams are not synchronized, so we assigned each color image to the depth image with the smallest time-stamp difference. Since both streams run at 30 fps, the shifting error is small and can also be handled by the optimization. We tested our system with three different models: Torso of Elevation, The Kiss by Rodin, and an antique leather chair (corresponding to IDs 3887, 4252, and 5989 in the database, respectively). The details of each object are shown in Table 2, and some example images are shown in Figure 33. The vertices and surfaces are reconstructed using Kinect Fusion [Izadi et al., 2011], and the keyframes are selected from the color/depth stream.
Figure 33.: Representative real world images for three captured objects. From
left to right: Torso of Elevation, The Kiss, and an antique leather
chair.
APPARATUS AND IMPLEMENTATION   All results were generated in Unity 2017.2.0f3 on a MacBook Pro with an Intel i7-4850HQ CPU, an Nvidia GeForce GT750M GPU, and 16 GB of RAM. Our method can render the models in 10-15 milliseconds (i.e., 70-90 fps), which is sufficient for real-time high-frequency rendering in virtual environments.
5.5.1 Visual Analysis
LOW-RANK DECOMPOSITION   In a previous approach [Zhou and Koltun, 2014], the albedo of the model is estimated by averaging the colors from all observed frames to represent the diffuse map of the model. In Figure 34, we compare the diffuse map computed from the original images with the diffuse map obtained from the low-rank texture separation. The specular regions are high-intensity outliers; thus, averaging the colors leaves some highlight effects in the diffuse map. In contrast, the diffuse map generated from the low-rank decomposition removes most of the highlights from the virtual content and moves them to the specular maps.
Figure 34.: Comparison of the diffuse appearance using the average vertex
color from the original images (i.e., w/o low-rank decomposition)
and the derived diffuse maps (i.e., w/ low-rank decomposition).
Note that our approach can remove most of the specular high-
lights such as the shoulder area on the Torso and the self inter-
reflection on the seat of the antique chair. Moreover, by removing
the highlights, the details from the original image can be correctly
preserved in the diffuse map.
CAMERA POSE OPTIMIZATION   Although the core of our texture map separation is the low-rank decomposition, the camera pose optimization is also essential for obtaining good results. In Figure 35 (a), the camera poses obtained from Kinect Fusion are not accurate enough to correctly render the texture onto the model. Note that the low-rank decomposition framework assumes that the columns of the low-rank matrix L (the projections from the keyframes) are strongly correlated. Thus, using inaccurately projected images leads to erroneous map separation. As shown in Figure 35 (b), without camera pose optimization the diffuse maps have severe artifacts, such as isopleth-like contours around the edges. The same issue also appears in the specular maps, which causes the optimization of the specular reflection to fail.
(a) Incorrect texture projection
(b) Low-rank decomposition with and without camera pose optimization
Figure 35.: (a) Without camera pose optimization, the rendered texture is
incorrect due to the inaccurate initial trajectory obtained from
Kinect Fusion. (b) (Left) The incorrect rendering will decrease
the quality of diffuse and specular maps. Moreover, the generated
specular maps failed to recover the material property and lighting
estimation. (Right) With camera pose optimization, the proposed
texture separation method correctly divided the image into diffuse
and specular components.
METHOD COMPARISON   To compare model fidelity, we compare our method with fixed textures [Zhou and Koltun, 2014] and Dynamic Omnidirectional Texture Synthesis (DOTS) [Chen and Suma Rosenberg, 2018], an improved view-dependent texture mapping method. As shown in Figure 36 (rightmost), three original images are selected for the comparison. The first and third images are chosen from the selected keyframes, while the second image is chosen from the unselected frames (i.e., an unseen view) whose camera pose is closest to the midpoint between the camera poses of the first and third images. Note that the appearance does not change in the fixed texture method, and its fidelity is recognizably lower than the other two methods. DOTS can generate photorealistic results if the user viewpoint is close to the selected frames, since it directly renders the texture onto the model. However, the synthesized appearance is inaccurate if the user viewpoint moves further away from the camera trajectories in the original dataset, which occurs quite commonly for objects captured using handheld cameras. Since the specular effect is a local phenomenon that depends not only on the user viewpoint but also on the lighting direction, it is not possible to simply interpolate it from the textures of other views. In our approach, the specular reflection is optimized based on the texture, lighting, and geometry; thus, the rendering results remain plausible for unseen views that are further away from the images in the captured dataset.
Figure 36.: Comparison of our method with fixed textures [Zhou and Koltun,
2014] and DOTS [Chen and Suma Rosenberg, 2018]. Note that
the fixed texture results in lower fidelity (blurriness) due to averag-
ing the observed images. DOTS or other view-dependent texture
mapping methods are able to generate photo-realistic rendering re-
sults. However, the specular highlights (bottom row) of an unseen
view cannot be correctly interpolated from the source images. In
contrast, our method estimated the light sources and the specular
reflectance properties of the object and is able to synthesize the
highlights of unseen views.
5.5.2 Dynamic Relighting
Virtual objects created with fixed textures or VDTM are difficult to integrate
with arbitrary scenes because the illumination during capture is baked into
the texture. In contrast, with the reflectance properties estimated using our
method, we can more readily adapt the reconstructed models into virtual en-
vironments with varying lighting conditions. In Figure 35 (a), we demon-
strate the integration of two reconstructed virtual objects into a scene with dy-
namic illumination. Note that the rendering results are updated in real-time
by changing the color and the direction of lights. In Figure 35 (b), we can
increase or decrease the intensity of the lights, and the appearance changes
of both models are consistent with other objects in the original scene, such as
the scale. To further illustrate the view-dependent specular reflectance prop-
erty, we fixed the lighting conditions in Figure 35 (c) and moved the camera
to three different view positions. Note that the highlight areas on the recon-
structed objects are dynamically changing based on the camera’s viewpoint.
The users can freely explore in the environments without any constraints, and
because objects are illuminated using virtual lights instead of image interpo-
lation, the transitions between viewpoints will remain smooth.
(a) Rendering with different light direction and color
(b) Rendering with different light intensities
(c) Rendering with different viewing angles
Figure 35.: Demonstration of two reconstructed virtual objects (The Kiss and
Torso of Elevation) in a virtual scene with dynamic illumination.
The proposed reflectance estimation method can provide plausi-
ble results with varying virtual light direction, color, and intensity.
Furthermore, the specular highlights on the virtual object will
smoothly change in real-time as the user moves between differ-
ent viewpoints. A virtual sphere is also displayed to more easily
visualize the color and the direction of virtual lights.
5.6 Conclusion
In this section, we presented an end-to-end content creation pipeline to cre-
ate dynamically relightable virtual objects from a single RGB-D sequence.
Our approach first separated each image from the original color stream into
a diffuse map and a specular map using low-rank decomposition. The illumi-
nation and reflectance properties are then estimated from the specular maps.
The objects can then be integrated with virtual scenes and rendered in real-
time under arbitrary lighting conditions.
LIMITATIONS AND FUTURE WORK   Our method maximizes the agreement between all of the observed images and the estimated reflection of the virtual object. However, because we focus on overcoming the challenges of unconstrained capture using handheld consumer-grade cameras, our approach assumes fixed real-world lighting conditions, camera exposure, and white balance. We also assume that the object remains stationary during capture. Although several methods have been proposed for dynamic geometry reconstruction (e.g., [Newcombe et al., 2015, Kozlov et al., 2018]), to the best of our knowledge, texture/reflectance reconstruction for dynamically moving objects remains an unsolved problem. In the future, we would like to extend our approach in several directions. First, we would like to introduce a greater variety of real-world lights (e.g., area and point sources) into our shading
model. Second, we would like to segment the vertices of the object into sev-
eral categories for faster modeling and rendering results. Finally, although we
focused on virtual reality content creation in this paper, the proposed methods
can also be applied to augmented reality. In the future, we are particularly ex-
cited about applying this work to create dynamically relightable AR objects
that can realistically adapt to illumination in the user’s surrounding environ-
ment in real-time.
Part III
Further Advancements
6
Research and Development Guidelines for Future Work
In this dissertation, we developed a complete end-to-end pipeline for the vir-
tual content creation process and introduced three novel approaches for gener-
ating immersive view-dependent textures. The input of our pipeline is a single
RGBD video sequence captured by a consumer-grade depth sensor. Without
introducing any specialized capture devices, our system can minimize the
user input and make the system a portable solution for reconstructing vir-
tual objects from real world scans. The pipeline can be separated into two
parts: an offline process and online real-time rendering. For the offline pro-
cess, using our approach in Chapter 3, users can rapidly create a photoreal-
istic virtual replica from a captured physical object without requiring expert
knowledge or additional devices. Furthermore, DOTS, a novel texture synthesis method proposed in Chapter 4, can further improve the texture quality
and smoothness of transitions between virtual viewpoints. User study results
showed that, in the online stage, we are able to provide a satisfying virtual
reality experience of the reconstructed model at real-time framerates. Additionally, in order to integrate the reconstructed content into an existing virtual environment, the computed appearance should be adapted to match different lighting conditions. Our proposed method in Chapter 5 can obtain
the optimized material properties from the original RGBD sequence, thereby
enabling post-editing and relighting in arbitrary virtual environments with
dynamic illumination.
Despite the advancements presented in this dissertation, there are still lim-
itations and open problems that remain to be solved in the area of automatic
virtual reality content creation. Referring to the capture framework in Fig-
ure 36, we can categorize the possible future research directions into three parts: Capture Improvements, Content Creation Pipeline Advancement, and Virtual Experience Enhancement.
6.1 Capture Improvements
In this section, we consider improvements to the input sources, such as more accurate geometry, higher-resolution color images, a more controllable capture environment, and guidance for users while capturing the data.
Figure 36.: Thesis Overview Revisited
6.1.1 Geometry Reconstruction
In this dissertation, we directly used the geometry generated by Kinect Fusion
[Izadi et al., 2011]. However, this method usually results in reduced detail due
to the use of a truncated signed distance function to integrate depth images
into voxels. Another issue is that the drifting error will accumulate over time.
There are many recent papers [Dai et al., 2017, Whelan et al., 2016] that
have proposed new methods to improve the accuracy of reconstructing the
geometry of objects captured using depth sensors. All of the methods intro-
duced in this dissertation will benefit from superior quality input geometry. In
the VDTM framework, the geometric model is rendered using different tex-
tures. With more accurate geometry, the texture projected from the images to
the geometry will be improved. Better rendering results will also improve the
color mapping optimization [Zhou and Koltun, 2014] because the optimiza-
tion process would converge faster and have better texture alignment. Thus,
a better corresponding camera trajectory can be obtained to generate better
synthetic textures using DOTS. Furthermore, in our material estimation objective function [Phong, 1975], the appearance depends not only on the geometry but also on the surface normals. Thus, more accurate geometry with improved detail can further improve the material estimation.
6.1.2 Texture Improvement
The textures we used in this dissertation all came from a consumer-grade RGBD sensor, which only provides VGA resolution for both the depth and
color streams. Rendering the model with low-resolution images results in
visual artifacts, especially when a user views the virtual object at a distance
that is considerably shorter than the distance during capture. To solve this is-
sue, we can simply increase the resolution of the input color stream for better
rendering results. For example, an additional device that can capture higher-
resolution images, such as mobile phones or DSLR cameras, may be utilized.
However, although it will certainly increase the texture quality, it will also
introduce new challenges. Several examples are listed below:
1) Attached to the depth sensor
The advantage of attaching the camera to the depth sensor is its simplicity
because both devices can be treated as a rigid body. The spatial relationship (i.e., the extrinsic matrix) between the newly added camera and the camera on the RGBD sensor is always the same and only needs to be calibrated once. A set of images of a checkerboard captured by the two color cameras could be used to calibrate the extrinsic matrix (a minimal calibration sketch is given after these examples). However, when using an additional camera as the texture source, the capturing system needs to handle the synchronization of the two cameras while capturing the target object. The placement of the additional device is also a potential issue: the further it is from the depth sensor, the more projection error is expected.
2) Standalone Cameras
One possible method is to capture several high-resolution images offline with-
out attaching the color camera to the depth sensor. This approach has some
advantages: 1) The camera positions of those images do not need to follow
the same trajectory of the depth sensor. Thus, they could provide additional
viewpoints. 2) Due to the offline capture, we could add more color images to
further improve the texture if the initial results do not meet the expected per-
formance. However, this method will also introduce several disadvantages:
1) Since it is completely independent of the RGBD video sequence, the chal-
lenge becomes finding the correspondence between the original video and the
high-resolution images. 2) Since those images are captured at a different time,
the lighting conditions are not guaranteed to be the same.
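As a concrete illustration of the one-time calibration mentioned in case 1), the following OpenCV sketch estimates the fixed rotation and translation between the two color cameras from paired checkerboard images. The 9x6 pattern size, 25 mm square size, and the assumption that both sets of intrinsics (K1, d1, K2, d2) are already known are illustrative choices, not values used in this dissertation.

```python
import cv2
import numpy as np

def calibrate_extrinsics(pairs, K1, d1, K2, d2, image_size,
                         pattern=(9, 6), square=0.025):
    """pairs: list of (rgbd_color_image, hires_image) views of the same checkerboard."""
    objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square
    obj_pts, pts1, pts2 = [], [], []
    for img_rgbd, img_hires in pairs:
        ok1, c1 = cv2.findChessboardCorners(img_rgbd, pattern)
        ok2, c2 = cv2.findChessboardCorners(img_hires, pattern)
        if ok1 and ok2:                       # keep only pairs where both cameras see the board
            obj_pts.append(objp); pts1.append(c1); pts2.append(c2)
    # Fix the known intrinsics and solve only for the relative pose (R, T).
    flags = cv2.CALIB_FIX_INTRINSIC
    _, _, _, _, _, R, T, _, _ = cv2.stereoCalibrate(
        obj_pts, pts1, pts2, K1, d1, K2, d2, image_size, flags=flags)
    return R, T
```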
6.1.3 More Control of the Capture Environment
There are several capture methods that provide more control over the environment, such as a light-stage [Debevec et al., 2000] or a turntable [Bolas et al., 2015]. In these examples, the object is placed at the center of a specialized device for capture. Although the intended goals of these types of methods are far from our consumer-grade, non-expert content creation pipeline, we believe they are still worth discussing for future research ideas.
The advantage of these methods is that the lighting condition is well-controlled and the camera array is known beforehand. Thus, they can further estimate the BRDF of the object or produce a good light-field rendering. However, building such a device is not only expensive but also time-consuming. The other drawback is that it is not portable for casual scanning. One could imagine a hybrid method using several depth sensors that remains portable but provides a wider view and more observations; devices such as Matterport cameras are already available.
6.1.4 Guidance of Capturing
In this dissertation, we assume the user can provide a "good" RGBD video. However, we think it is worth spending some time working on the capturing system itself. In our framework, the coverage of the viewing area throughout the entire RGBD video is a very important factor for replicating the appearance from different viewing directions. An application that not only shows the currently reconstructed geometry but also provides hints about unseen viewing areas could help users generate a "good" video sequence. In Figure 37, we demonstrate one possibility: a sub-window overlaid onto the camera capture screen that shows the areas already covered by the user and suggests a direction to maximize coverage of the uncovered regions.
Figure 37.: One example of giving guidance to capture a good RGBD video
as an input of our content creation pipeline.
6.1.5 More Challenging Cases
In this dissertation, we assumed that the object is static and captured under constant lighting conditions. However, there are more advanced cases that are worthy of future research and investigation.
1) Moving Object
Recently, many researchers have worked on reconstructing the geometry of an object from a dynamic scene. However, only a few methods have been proposed for reconstructing the material of a moving object, so we believe there are many research opportunities in this area. To generate a dynamic view-dependent model, temporal information such as vertex movement and pixel tracks in the color stream can be added to the objective function for optimization. For material estimation, since the mesh and vertices of the model are moving, it is possible to obtain more observations, which is also very useful for recovering the material properties.
2) Dynamic Lighting Conditions
If the lighting conditions keep changing without any prior knowledge throughout the entire RGBD video, it is impossible to recover the material from those images because the problem becomes ill-posed. However, it is still possible to recover the BRDF model if the lighting condition is known for each image. As discussed in Chapter 2, the light-stage [Debevec et al., 2000] is designed to recover the BRDF model, but it comes with additional constraints, such as the lack of portable capture, limits on the target object size, and the need for a well-controlled environment.
6.2 Content Creation Pipeline Advancement
In this section, we aim to discuss how to further improve our content creation
pipeline. We provide several ideas below:
6.2.1 Joint Optimization of Geometry and Texture
In this dissertation, we mainly focused on replicating the texture of an existing geometry model. However, it is possible to consider both the RGB and depth information in the objective function when reconstructing the virtual content.
To introduce geometry optimization into our current objective function, variables that are currently treated as known, such as the vertex positions and surface normals, would become unknown parameters, and additional steps to optimize them would be needed. Although this approach may accumulate errors from both the color and depth streams, we believe that with suitable constraints or assumptions the results would be more convincing.
6.2.2 Self-inter-reflections Assumption Relaxation
In our setting, we ignored self-inter-reflections since most of our objects are convex. If we want to target concave objects, self-inter-reflections should also be taken into account, and the objective functions should be changed accordingly.
To relax this assumption, the Phong reflectance model would need to be revised to handle light arriving from other vertices or objects. However, determining whether incident light comes from a light source or from the object's own reflection is not easy, because the material properties are themselves parameters that need to be estimated. To the best of our knowledge, no existing work addresses this problem.
6.2.3 Source of Lights Estimation
Our lighting estimation assumes directional lights. However, there are many types of lights: directional lights, point lights, area lights, etc. Inaccurate lighting directions and types will decrease the performance of the material estimation.
There are several ways to improve our current material estimation method to handle various light sources. In our current work, the proposed method first clusters the lighting directions and then estimates the lighting condition and material properties. One possibility is to add the clustering step into the objective function, so that the optimization iteratively determines the type of lights and their directions and then estimates the material and the color of the lights. With this approach, our existing framework could handle various types of lights and estimate improved material properties.
6.2.4 Categorize Material of Object
In this dissertation, we estimated optimized material properties for each vertex. In the real world, however, artificial objects are usually made from only a few materials.
The assumption that vertices in the same mesh triangle share a material could be used to extend our method, for example by adding an error term between each vertex and its neighbors to our objective function. Furthermore, if we could categorize the materials into several segments, it could further reduce rendering time and memory usage. Another option is to apply super-pixel segmentation to either the color stream or the depth stream, and prior knowledge could also be taken into consideration.
6.2.5 Adding More Parameters (Freedom) to the Color Model
In Chapter 5, we treated the surface model and vertex positions as known parameters. However, we could also estimate these parameters during the optimization. Although this would not guarantee better results, because more parameters may cause the objective function to diverge more easily, we think it is still worth trying by adding constraints on each parameter to avoid divergence.
6.3 Virtual Reality Experience Enhancement
Beyond reconstructing virtual objects with our content creation pipeline, there are several research topics and applications that could help create a better VR experience.
6.3.1 Real-time Editing
In this dissertation, we only provide the reconstructed virtual object with high-fidelity textures or estimated material properties, and users can only view the results and compare them to the original video. However, we think real-time editing is a promising direction for further improvement. User-provided information such as the type of material, the number of materials, or the number of vertices of the object could make the virtual objects even more customized and help improve the rendering results.
To gain the capability of real-time editing, it would be an interesting topic to design a user interface that can dynamically change the material or lighting, or even add a texture map onto the model. With this ability, we could reduce the effort required for 3D artists to create content from scratch.
6.3.2 Replaced by Texture Map
In Chapter 5, we estimated the material properties and the lighting condition per vertex. Thus, at run-time, the appearance is generated by a per-vertex rendering shader. In our results, this does not matter much for small virtual objects. However, we expect it to decrease real-time performance if the number of vertices grows considerably large, which typically happens when reconstructing virtual objects with denser meshes or when the virtual object is duplicated several times in the virtual environment. One solution to this problem is to build texture maps for the diffuse and specular components, so that only two texture maps need to be pre-loaded for the virtual object regardless of the number of duplications or the number of vertices.
6.3.3 User Study
We are able to reconstruct the material properties of virtual objects using the approach in Chapter 5. However, there are several user study topics that could be extended into potential research directions:
1) The quality of estimated material properties
Visual appearance is difficult to evaluate quantitatively. However, a user study similar to the one in Chapter 4, with rating and ranking results, would be useful for verifying how well our algorithm compares to other methods.
2) Determine the best parameters
It would be interesting to determine the best parameters of our objective function, such as how the number of virtual lights affects replicating the appearance of the original video sequence, the trade-off between the number of vertices and the appearance, and the number of keyframes needed to achieve the desired quality.
3) Evaluate the importance of virtual content
Although we have shown in Chapter 4 that DOTS is better than VDTM and fixed textures when considering both appearance and transition quality, we do not yet know how important such virtual content is within virtual environments. For example, does it really make a difference if we replace some of the content with view-dependent or material-estimated virtual objects? When or where are the best scenarios for using the virtual content created by our framework?
6.4 More Than Virtual Reality
6.4.1 Extending Our Method to AR/MR
In our framework, we focus primarily on virtual reality applications. However, the reconstructed virtual objects could also be used in augmented reality (AR) or mixed reality (MR). Although the VDTM method is not ideal for AR, it would be interesting to use the virtual objects with estimated materials in AR/MR. The only missing component is real-time lighting estimation from the current camera, which could be obtained from existing research [Chen et al., 2012, Lalonde and Matthews, 2014] as well as the AR toolkits provided by Android phones and the iPhone.
6.4.2 Build A Database for Virtual Content
To the best of our knowledge, there is no database or system for creating virtual content automatically. Based on our proposed method, we could build a website where users upload an RGBD sequence and receive the virtual content reconstructed from the video. This would not only build up a database for our method but also make it more widely accessible. Such a database would also be useful for applications such as virtual museum tours and virtual product shopping (e.g., furniture).
Bibliography
[Bastian et al., 2010] Bastian, J., Ward, B., Hill, R., van den Hengel, A., and
Dick, A. (2010). Interactive modelling for ar applications. In IEEE In-
ternational Symposium on Mixed and Augmented Reality, pages 199–205.
IEEE.
[Besl and McKay, 1992] Besl, P. J. and McKay, N. D. (1992). A method
for registration of 3-d shapes. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 14(2):239–256.
[Bi et al., 2017] Bi, S., Kalantari, N. K., and Ramamoorthi, R. (2017). Patch-
based optimization for image-based texture mapping. ACM Trans. Graph.,
36(4):106:1–106:11.
[Bolas et al., 2015] Bolas, M., Kuruvilla, A., Chintalapudi, S., Rabelo, F.,
Lympouridis, V ., Barron, C., Suma, E., Matamoros, C., Brous, C., Jasina,
A., Zheng, Y ., Jones, A., Debevec, P., and Krum, D. (2015). Creating near-
field vr using stop motion characters and a touch of light-field rendering. In
ACM SIGGRAPH 2015 Posters, SIGGRAPH ’15, pages 19:1–19:1, New
York, NY , USA. ACM.
[Bouguet, 2008] Bouguet, J. Y . (2008). Camera calibration toolbox for Mat-
lab.
[Buehler et al., 2001] Buehler, C., Bosse, M., McMillan, L., Gortler, S., and
Cohen, M. (2001). Unstructured lumigraph rendering. In Proceedings of
the 28th Annual Conference on Computer Graphics and Interactive Tech-
niques, SIGGRAPH ’01, pages 425–432, New York, NY , USA. ACM.
[Candès et al., 2011] Candès, E. J., Li, X., Ma, Y., and Wright, J. (2011). Robust principal component analysis? J. ACM, 58(3):11:1–11:37.
[Chen et al., 2017] Chen, C., Bolas, M., and Rosenberg, E. S. (2017). View-
dependent virtual reality content from RGB-D images. In IEEE Interna-
tional Conference on Image Processing.
[Chen and Suma Rosenberg, 2018] Chen, C.-F. and Suma Rosenberg, E.
(2018). Dynamic omnidirectional texture synthesis for photorealistic vir-
tual content creation. In Adjunct Proceedings of the IEEE International
Symposium for Mixed and Augmented Reality 2018 (To appear).
[Chen et al., 2012] Chen, C. F., Wei, C. P., and Wang, Y . C. F. (2012). Low-
rank matrix recovery with structural incoherence for robust face recogni-
tion. In 2012 IEEE Conference on Computer Vision and Pattern Recogni-
tion, pages 2618–2625.
[Chen and Medioni, 1992] Chen, Y . and Medioni, G. (1992). Object mod-
elling by registration of multiple range images. Image and Vision Comput-
ing, 10(3):145 – 155.
[Choi et al., 2016] Choi, S., Zhou, Q.-Y ., Miller, S., and Koltun, V . (2016).
A large dataset of object scans. arXiv:1602.02481.
[Comaniciu and Meer, 2002] Comaniciu, D. and Meer, P. (2002). Mean shift:
A robust approach toward feature space analysis. IEEE Trans. Pattern
Anal. Mach. Intell., 24(5):603–619.
[Crete et al., 2007] Crete, F., Dolmiere, T., Ladret, P., and Nicolas, M.
(2007). The blur effect: perception and estimation with a new no-reference
perceptual blur metric. Proc. SPIE, 6492:64920I–64920I–11.
[Dai et al., 2017] Dai, A., Nießner, M., Zollhöfer, M., Izadi, S., and Theobalt, C. (2017). Bundlefusion: Real-time globally consistent 3d reconstruction
using on-the-fly surface re-integration. ACM Transactions on Graphics
2017 (TOG).
[Davis et al., 2012] Davis, A., Levoy, M., and Durand, F. (2012). Unstruc-
tured light fields. Comput. Graph. Forum, 31(2pt1):305–314.
[Debevec et al., 2000] Debevec, P., Hawkins, T., Tchou, C., Duiker, H.-P.,
Sarokin, W., and Sagar, M. (2000). Acquiring the reflectance field of a
human face. In Proceedings of the 27th Annual Conference on Computer
Graphics and Interactive Techniques, SIGGRAPH ’00, pages 145–156,
New York, NY , USA. ACM Press/Addison-Wesley Publishing Co.
[Debevec et al., 1998] Debevec, P., Yu, Y ., and Boshokov, G. (1998). Ef-
ficient view-dependent image-based rendering with projective texture-
mapping. Technical report, University of California at Berkeley, Berkeley,
CA, USA.
[Einarsson et al., 2006] Einarsson, P., Chabert, C.-F., Jones, A., Ma, W.-C.,
Lamond, B., Hawkins, T., Bolas, M., Sylwan, S., and Debevec, P. (2006).
Relighting human locomotion with flowed reflectance fields. In Pro-
ceedings of the 17th Eurographics Conference on Rendering Techniques,
EGSR ’06, pages 183–194, Aire-la-Ville, Switzerland, Switzerland. Euro-
graphics Association.
[Gortler et al., 1996] Gortler, S. J., Grzeszczuk, R., Szeliski, R., and Cohen,
M. F. (1996). The lumigraph. In Proceedings of the 23rd Annual Confer-
ence on Computer Graphics and Interactive Techniques, SIGGRAPH ’96,
pages 43–54, New York, NY , USA. ACM.
[Hedman et al., 2016] Hedman, P., Ritschel, T., Drettakis, G., and Bros-
tow, G. (2016). Scalable inside-out image-based rendering. ACM Trans.
Graph., 35(6):231:1–231:11.
[Izadi et al., 2011] Izadi, S., Kim, D., Hilliges, O., Molyneaux, D., New-
combe, R., Kohli, P., Shotton, J., Hodges, S., Freeman, D., Davison, A.,
and Fitzgibbon, A. (2011). Kinectfusion: Real-time 3d reconstruction and
interaction using a moving depth camera. In Proceedings of the 24th An-
nual ACM Symposium on User Interface Software and Technology, UIST
’11, pages 559–568, New York, NY , USA. ACM.
[Jeon et al., 2016] Jeon, J., Jung, Y ., Kim, H., and Lee, S. (2016). Tex-
ture map generation for 3d reconstructed scenes. The Visual Computer,
32(6):955–965.
[Jiddi et al., 2016] Jiddi, S., Robert, P., and Marchand, E. (2016). Re-
flectance and illumination estimation for realistic augmentations of real
scenes. In 2016 IEEE International Symposium on Mixed and Augmented
Reality (ISMAR-Adjunct), pages 244–249.
[Kozlov et al., 2018] Kozlov, C., Slavcheva, M., and Ilic, S. (2018). Patch-
based non-rigid 3d reconstruction from a single depth stream. In 2018
International Conference on 3D Vision (3DV), pages 42–51.
[Lalonde and Matthews, 2014] Lalonde, J.-F. and Matthews, I. (2014). Light-
ing estimation in outdoor image collections. In International Conference
on 3D Vision.
[Levoy and Hanrahan, 1996] Levoy, M. and Hanrahan, P. (1996). Light field
rendering. In Proceedings of the 23rd Annual Conference on Computer
Graphics and Interactive Techniques, SIGGRAPH ’96, pages 31–42, New
York, NY , USA. ACM.
[Lin et al., 2011] Lin, Z., Liu, R., and Su, Z. (2011). Linearized alternating
direction method with adaptive penalty for low-rank representation. In
Shawe-Taylor, J., Zemel, R. S., Bartlett, P. L., Pereira, F., and Weinberger,
K. Q., editors, Advances in Neural Information Processing Systems 24,
pages 612–620. Curran Associates, Inc.
[Nakashima et al., 2015] Nakashima, Y ., Uno, Y ., Kawai, N., Sato, T., and
Yokoya, N. (2015). Ar image generation using view-dependent geometry
modification and texture mapping. Virtual Reality, 19(2):83–94.
[Narayan and Abbeel, 2015] Narayan, K. S. and Abbeel, P. (2015). Opti-
mized color models for high-quality 3d scanning. In IEEE/RSJ Interna-
tional Conference on Intelligent Robots and Systems (IROS), pages 2503–
2510.
[Newcombe et al., 2015] Newcombe, R. A., Fox, D., and Seitz, S. M. (2015).
Dynamicfusion: Reconstruction and tracking of non-rigid scenes in real-
time. In 2015 IEEE Conference on Computer Vision and Pattern Recogni-
tion (CVPR), pages 343–352.
[Park et al., 2018] Park, J. J., Newcombe, R., and Seitz, S. (2018). Surface
light field fusion. ArXiv e-prints.
[Phong, 1975] Phong, B. T. (1975). Illumination for computer generated
pictures. Commun. ACM, 18(6):311–317.
[Porquet et al., 2005] Porquet, D., Dischler, J.-M., and Ghazanfarpour, D.
(2005). Real-time high-quality view-dependent texture mapping using per-
pixel visibility. In Proceedings of the 3rd International Conference on
Computer Graphics and Interactive Techniques in Australasia and South
East Asia, GRAPHITE ’05, pages 213–220, New York, NY , USA. ACM.
[Remondino and El-Hakim, 2006] Remondino, F. and El-Hakim, S. (2006).
Image-based 3d modelling: A review. The Photogrammetric Record,
21(115):269–291.
[Richter-Trummer et al., 2016] Richter-Trummer, T., Kalkofen, D., Park, J.,
and Schmalstieg, D. (2016). Instant mixed reality lighting from casual
scanning. In IEEE International Symposium on Mixed and Augmented
Reality (ISMAR), pages 27–36.
[Rongsirigul et al., 2017] Rongsirigul, T., Nakashima, Y ., Sato, T., and
Yokoya, N. (2017). Novel view synthesis with light-weight view-
dependent texture mapping for a stereoscopic hmd. In 2017 IEEE Interna-
tional Conference on Multimedia and Expo (ICME), pages 703–708.
[Rusinkiewicz and Levoy, 2001] Rusinkiewicz, S. and Levoy, M. (2001). Ef-
ficient variants of the icp algorithm. In Proceedings Third International
Conference on 3-D Digital Imaging and Modeling, pages 145–152.
[Shi et al., 2016] Shi, B., Wu, Z., Mo, Z., Duan, D., Yeung, S., and Tan,
P. (2016). A benchmark dataset and evaluation for non-lambertian and
uncalibrated photometric stereo. In 2016 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pages 3707–3716.
[Shum and Kang, 2000] Shum, H. and Kang, S. B. (2000). A review of
image-based rendering techniques. Proceedings of IEEE/SPIE Visual
Communications and Image Processing (VCIP), 4067:2–13.
[Waechter et al., 2014] Waechter, M., Moehrle, N., and Goesele, M. (2014).
Let there be color! large-scale texturing of 3d reconstructions. In Com-
puter Vision – ECCV 2014, pages 836–850, Cham. Springer International
Publishing.
[Wei et al., 2018] Wei, X., Xu, X., Zhang, J., and Gong, Y . (2018). Spec-
ular highlight reduction with known surface geometry. Computer Vision
and Image Understanding, 168:132 – 144. Special Issue on Vision and
Computational Photography and Graphics.
[Whelan et al., 2012] Whelan, T., Kaess, M., Fallon, M., Johannsson, H.,
Leonard, J., and McDonald, J. (2012). Kintinuous: Spatially extended
KinectFusion. In RSS Workshop on RGB-D: Advanced Reasoning with
Depth Cameras, Sydney, Australia.
[Whelan et al., 2016] Whelan, T., Salas-Moreno, R. F., Glocker, B., Davison,
A. J., and Leutenegger, S. (2016). Elasticfusion: Real-time dense slam and
light source estimation. The International Journal of Robotics Research,
35(14):1697–1716.
[Wright et al., 2009] Wright, J., Ganesh, A., Rao, S., Peng, Y ., and Ma, Y .
(2009). Robust principal component analysis: Exact recovery of corrupted
low-rank matrices via convex optimization. In Bengio, Y ., Schuurmans,
D., Lafferty, J. D., Williams, C. K. I., and Culotta, A., editors, Advances
in Neural Information Processing Systems 22, pages 2080–2088. Curran
Associates, Inc.
[Wu et al., 2016a] Wu, H., Wang, Z., and Zhou, K. (2016a). Simultaneous lo-
calization and appearance estimation with a consumer rgb-d camera. IEEE
Transactions on Visualization and Computer Graphics, 22(8):2012–2023.
[Wu and Zhou, 2015] Wu, H. and Zhou, K. (2015). Appfusion: Interactive
appearance acquisition using a kinect sensor. Computer Graphics Forum,
34(6):289–298.
[Wu et al., 2016b] Wu, Z., Yeung, S., and Tan, P. (2016b). Towards building
an RGBD-M scanner. CoRR, abs/1603.03875.
[Zhang, 1994] Zhang, Z. (1994). Iterative point matching for registration of
free-form curves and surfaces. International Journal of Computer Vision,
13(2):119–152.
[Zhou and Koltun, 2014] Zhou, Q.-Y . and Koltun, V . (2014). Color map opti-
mization for 3d reconstruction with consumer depth cameras. ACM Trans.
Graph., 33(4):155:1–155:10.
Part IV
Appendix
A
Color Mapping Optimization

A.1 Objective Function

The 3D model M generated from KinectFusion consists of a vertex set P, where each p ∈ P. We want to estimate C = C(p), where C(p) represents the color of vertex p, and the optimized transformation matrices T_opt = {T'_1, T'_2, ..., T'_K}.
E(C, T) = \sum_{i} \sum_{p \in P_i} \left( C(p) - \Gamma_i(p, T_i) \right)^2        (24)
Γ_i(p, T_i) is produced by projecting p onto the uv map and retrieving the color. It is a composite of a rigid transformation, a projection, and a color evaluation, which can be expressed as Γ_i(p, T_i) = Γ_i(u(g(p, T_i))).
g is the rigid transformation function that projects p from the world coordinate system to the camera coordinate system:

g = g(p, T_i) = T_i \, p        (25)
u projects g onto the image space using the intrinsic matrix K of the RGB camera:

u(g_x, g_y, g_z, g_w) = \left( \frac{g_x f_x}{g_z} + c_x, \; \frac{g_y f_y}{g_z} + c_y \right)^T        (26)
where (f_x, f_y) is the focal length and (c_x, c_y) is the principal point from K, which is usually provided by the camera manufacturer. It can also be computed by calibrating the camera with a checkerboard [Bouguet, 2008] for more accurate parameters.
Γ_i(u_x, u_y) is the color evaluation that returns the bilinearly interpolated color value at coordinate (u_x, u_y) in image I_i.
A.2 Optimization

E(C, T) is a nonlinear least-squares objective and can be minimized using the Gauss-Newton method. The basic idea is to alternate between optimizing C and T.

OPTIMIZING C   When T is fixed, C(p) is the average intensity of p over the associated images:

C(p) = \frac{1}{n_p} \sum_{I_i \in I_p} \Gamma_i(p, T_i)        (27)

where n_p is the number of images associated with p.
OPTIMIZING T_i   When C is fixed, the objective function is independent for each T_i. Assume there are m points projected onto the image I_i:

E(T) = \sum_{p \in P_i} r_{i,p}^2 = \sum_{i=1}^{m} r_i^2        (28)
This is a non-linear least-squares problem and can be solved by the Gauss-Newton method. We parameterize T_i by a vector x = (α, β, γ, a, b, c), where (a, b, c) is the translation and (α, β, γ) are the rotation angles. Each iteration of the algorithm updates x as follows:

x^{k+1} = x^k + \Delta x        (29)
where \Delta x = -(J_r^T J_r)^{-1} J_r^T r, r is the residual vector, and J_r = J_r(x) is the Jacobian of r.

r = [\, r_i(x)|_{x=x^k} \,] = \{ r_1, r_2, \dots, r_m \}

J_r(x) = [\, \nabla r_i(x)|_{x=x^k} \,] =
\begin{pmatrix}
\frac{\partial r_1}{\partial \alpha} & \cdots & \frac{\partial r_1}{\partial c} \\
\frac{\partial r_2}{\partial \alpha} & \cdots & \frac{\partial r_2}{\partial c} \\
\vdots & & \vdots \\
\frac{\partial r_m}{\partial \alpha} & \cdots & \frac{\partial r_m}{\partial c}
\end{pmatrix}        (30)
The partial derivative of r_{i,p} with respect to T is obtained using the chain rule:

\nabla r_i(x)\big|_{x=x^k} = \frac{\partial}{\partial x} (\Gamma \circ u \circ g) \Big|_{x=x^k} = \nabla\Gamma(u) \, J_u(g) \, J_g(x) \Big|_{x=x^k}        (31)

where \nabla\Gamma(u) = \left( \frac{\partial \Gamma}{\partial u_x}, \frac{\partial \Gamma}{\partial u_y} \right) is the gradient of Γ. It can be pre-computed by applying a normalized Scharr kernel over the grayscale image.
J_u(g) is the Jacobian of u:

J_u(g) =
\begin{pmatrix}
\frac{\partial u_x}{\partial g_x} & \frac{\partial u_x}{\partial g_y} & \frac{\partial u_x}{\partial g_z} & \frac{\partial u_x}{\partial g_w} \\
\frac{\partial u_y}{\partial g_x} & \frac{\partial u_y}{\partial g_y} & \frac{\partial u_y}{\partial g_z} & \frac{\partial u_y}{\partial g_w}
\end{pmatrix}
=
\begin{pmatrix}
f_x g_z^{-1} & 0 & -g_x f_x g_z^{-2} & 0 \\
0 & f_y g_z^{-1} & -g_y f_y g_z^{-2} & 0
\end{pmatrix}        (32)
J_g(x) is the Jacobian of g. To compute J_g(x), we locally linearize T around T^k:

T =
\begin{pmatrix}
1 & -\gamma & \beta & a \\
\gamma & 1 & -\alpha & b \\
-\beta & \alpha & 1 & c \\
0 & 0 & 0 & 1
\end{pmatrix} T^k

g =
\begin{pmatrix}
1 & -\gamma & \beta & a \\
\gamma & 1 & -\alpha & b \\
-\beta & \alpha & 1 & c \\
0 & 0 & 0 & 1
\end{pmatrix} T^k p
=
\begin{pmatrix}
1 & -\gamma & \beta & a \\
\gamma & 1 & -\alpha & b \\
-\beta & \alpha & 1 & c \\
0 & 0 & 0 & 1
\end{pmatrix}
\begin{pmatrix}
p'_x \\ p'_y \\ p'_z \\ p'_w
\end{pmatrix}
=
\begin{pmatrix}
g_x \\ g_y \\ g_z \\ g_w
\end{pmatrix}        (33)
Thus,

J_g(x) =
\begin{pmatrix}
\frac{\partial g_x}{\partial \alpha} & \frac{\partial g_x}{\partial \beta} & \frac{\partial g_x}{\partial \gamma} & \frac{\partial g_x}{\partial a} & \frac{\partial g_x}{\partial b} & \frac{\partial g_x}{\partial c} \\
\frac{\partial g_y}{\partial \alpha} & \frac{\partial g_y}{\partial \beta} & \frac{\partial g_y}{\partial \gamma} & \frac{\partial g_y}{\partial a} & \frac{\partial g_y}{\partial b} & \frac{\partial g_y}{\partial c} \\
\frac{\partial g_z}{\partial \alpha} & \frac{\partial g_z}{\partial \beta} & \frac{\partial g_z}{\partial \gamma} & \frac{\partial g_z}{\partial a} & \frac{\partial g_z}{\partial b} & \frac{\partial g_z}{\partial c} \\
\frac{\partial g_w}{\partial \alpha} & \frac{\partial g_w}{\partial \beta} & \frac{\partial g_w}{\partial \gamma} & \frac{\partial g_w}{\partial a} & \frac{\partial g_w}{\partial b} & \frac{\partial g_w}{\partial c}
\end{pmatrix}
=
\begin{pmatrix}
0 & p'_z & -p'_y & p'_w & 0 & 0 \\
-p'_z & 0 & p'_x & 0 & p'_w & 0 \\
p'_y & -p'_x & 0 & 0 & 0 & p'_w \\
0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}        (34)
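For reference, the following NumPy sketch performs a single Gauss-Newton update of x for one frame, following Eqs. 28-34. Bilinear sampling, visibility checks, and bounds handling are omitted, and the function and variable names are illustrative assumptions rather than the original implementation.

```python
import numpy as np

def gauss_newton_step(x, T_k, verts_h, C, gray, grad_x, grad_y, fx, fy, cx, cy):
    """One update of x = (alpha, beta, gamma, a, b, c) linearized around T_k.
    verts_h: (N, 4) homogeneous vertices; C: (N,) target vertex intensities;
    gray/grad_x/grad_y: grayscale image and its Scharr gradients."""
    a_, b_, g_, ta, tb, tc = x
    dT = np.array([[1, -g_, b_, ta],          # locally linearized transform (Eq. 33)
                   [g_, 1, -a_, tb],
                   [-b_, a_, 1, tc],
                   [0, 0, 0, 1.0]])
    p_prime = (T_k @ verts_h.T).T             # p' = T^k p
    g = (dT @ p_prime.T).T                    # camera-space points
    gx, gy, gz = g[:, 0], g[:, 1], g[:, 2]
    u = np.stack([gx * fx / gz + cx, gy * fy / gz + cy], axis=1)   # Eq. 26
    ui = np.round(u).astype(int)              # assumes all projections land inside the image
    r = C - gray[ui[:, 1], ui[:, 0]]          # residuals r_i = C(p) - Gamma_i(...)
    dG = np.stack([grad_x[ui[:, 1], ui[:, 0]],
                   grad_y[ui[:, 1], ui[:, 0]]], axis=1)            # gradient of Gamma

    n = len(gz)
    Ju = np.zeros((n, 2, 4))                  # Eq. 32
    Ju[:, 0, 0] = fx / gz;  Ju[:, 0, 2] = -gx * fx / gz**2
    Ju[:, 1, 1] = fy / gz;  Ju[:, 1, 2] = -gy * fy / gz**2
    Jg = np.zeros((n, 4, 6))                  # Eq. 34
    px, py, pz, pw = p_prime[:, 0], p_prime[:, 1], p_prime[:, 2], p_prime[:, 3]
    Jg[:, 0, 1] = pz;  Jg[:, 0, 2] = -py; Jg[:, 0, 3] = pw
    Jg[:, 1, 0] = -pz; Jg[:, 1, 2] = px;  Jg[:, 1, 4] = pw
    Jg[:, 2, 0] = py;  Jg[:, 2, 1] = -px; Jg[:, 2, 5] = pw

    # Chain rule (Eq. 31); r = C - Gamma, so dr/dx = -dGamma/dx.
    Jr = -np.einsum('nk,nkl,nlm->nm', dG, Ju, Jg)
    dx = -np.linalg.solve(Jr.T @ Jr + 1e-9 * np.eye(6), Jr.T @ r)  # Gauss-Newton step
    return x + dx                             # in practice, re-linearize around the new pose
```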
B
Low-Rank Matrix Decomposition

B.1 Objective Function

LOW-RANK DECOMPOSITION   Low-rank matrix recovery seeks to decompose a data matrix D into A + E, where A is a low-rank matrix and E is the associated sparse error. More precisely, given the input data matrix D, LR minimizes the rank of matrix A while reducing the number of non-zero values in ||E||_0 to derive the low-rank approximation of D:
\min \; \lambda \|E\|_0 + \text{rank}(A) \quad \text{s.t.} \quad D = A + E        (35)
Since the aforementioned optimization problem is NP-hard, Candès et al. [Candès et al., 2011] solve the following formulation to make the original LR problem tractable:

\min \; \|D - A - E\|_2^2 + \lambda_1 \|E\|_1 + \lambda_2 \|A\|_*        (36)

where D is the observed data, A is the low-rank matrix, and E is the error matrix. The ||·||_0 norm is replaced by ||·||_1, which sums the absolute values of the entries of E, and ||·||_* is the nuclear norm (i.e., the sum of the singular values), which approximates the rank of A.
It is shown in [Candès et al., 2011] that solving this convex relaxation is equivalent to solving the original low-rank matrix approximation problem, as long as the rank of A to be recovered is not too large and the number of errors in E is small (sparse). To solve the optimization problem of Eq. 36, the technique of inexact augmented Lagrange multipliers (ALM) [Candès et al., 2011, Lin et al., 2011] has been applied due to its computational efficiency.
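The following compact NumPy sketch follows the standard inexact ALM recipe for this relaxed problem. The parameter choices (λ, μ, ρ, tolerance) are commonly used defaults rather than values from this dissertation, and the optional non-negativity clipping is only a heuristic approximation of the constraints in Eq. 18.

```python
import numpy as np

def shrink(X, tau):                          # soft-thresholding operator
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def rpca_inexact_alm(D, lam=None, tol=1e-7, max_iter=500, nonneg=False):
    """Decompose D into a low-rank part A and a sparse part E."""
    m, n = D.shape
    lam = lam or 1.0 / np.sqrt(max(m, n))
    norm2 = np.linalg.norm(D, 2)
    Y = D / max(norm2, np.abs(D).max() / lam)       # dual variable initialization
    A, E = np.zeros_like(D), np.zeros_like(D)
    mu, rho = 1.25 / norm2, 1.5
    for _ in range(max_iter):
        # A-step: singular value thresholding of D - E + Y/mu.
        U, s, Vt = np.linalg.svd(D - E + Y / mu, full_matrices=False)
        A = (U * np.maximum(s - 1.0 / mu, 0.0)) @ Vt
        # E-step: elementwise shrinkage.
        E = shrink(D - A + Y / mu, lam / mu)
        if nonneg:                                   # heuristic projection (cf. Eq. 18)
            A, E = np.maximum(A, 0.0), np.maximum(E, 0.0)
        Z = D - A - E
        Y = Y + mu * Z
        mu = min(mu * rho, 1e7)
        if np.linalg.norm(Z, 'fro') / np.linalg.norm(D, 'fro') < tol:
            break
    return A, E
```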
C
Material Estimation

C.1 Objective Function

C.1.1 Inputs

• Model M with a total of m vertices, v = {v_1, v_2, v_3, ..., v_m}, and their normals n = {n_1, n_2, n_3, ..., n_m}
• n selected specular maps S = {S_1, S_2, S_3, ..., S_n} and their corresponding transformation matrices T = {T_1, T_2, T_3, ..., T_n}
C.1.2 Outputs

• Specular reflection constants ks = {ks_1, ks_2, ks_3, ..., ks_m} for each vertex
• Shininess constants α = {α_1, α_2, α_3, ..., α_m}
• Assuming a total of l incident lights, their light intensities i = {i_1, i_2, i_3, ..., i_l} and directions d = {d_1, d_2, d_3, ..., d_l}
C.1.3 Material Property and Light Color Estimation

FIX α AND I   Each k_b ∈ k is independent of the other vertices, and k_b is also independent of the incident lights. Thus, we can simplify the equation as follows:

k_b = \arg\min \sum_{a=1}^{n} \left( y_{ab} - k_b x'_{ab} \right)^2
\text{s.t.} \quad k_b x'_{ab} < y_{ab}, \quad k_b \geq 0        (37)

k_b = \max(\{ y_{ab} / x'_{ab} : a = 1, 2, 3, \dots, n \}, 0), where x'_{ab} = \sum_{c=1}^{l} (x_{abc})^{\alpha_b} i_c.
FIX K AND I   Each α_b ∈ α is independent of the other vertices, so we can simplify the equation as follows:

\alpha_b = \arg\min \sum_{a=1}^{n} \left( y_{ab} - k_b \sum_{c=1}^{l} i_c (x_{abc})^{\alpha_b} \right)^2
\text{s.t.} \quad \sum_{c=1}^{l} i_c (x_{abc})^{\alpha_b} < y_{ab} / k_b, \quad \alpha_b \geq 0        (38)
FIX K AND α

i_i = \arg\min \sum_{a=1}^{n} \sum_{b \in v_a} \left( y_{ab} - k_b \sum_{c=1}^{l} (x_{abc})^{\alpha_b} i_c \right)^2
    = \arg\min \sum_{a=1}^{n} \sum_{b \in v_a} \left( y_{ab} - k_b \sum_{c=1, c \neq i}^{l} (x_{abc})^{\alpha_b} i_c - k_b (x_{abi})^{\alpha_b} i_i \right)^2
    = \arg\min \sum_{a=1}^{n} \sum_{b \in v_a} \left( z_{abi} - x''_{abi} \, i_i \right)^2        (39)
where x''_{abi} = k_b (x_{abi})^{\alpha_b} and z_{abi} = y_{ab} - k_b \sum_{c=1, c \neq i}^{l} (x_{abc})^{\alpha_b} i_c.

Thus, for incident light i, the minimization is written as follows:

i = \arg\min \sum_{a=1}^{n} \sum_{b \in v_a} \left( z_{ab} - x''_{ab} \, i \right)^2
\text{s.t.} \quad x''_{ab} \, i < z_{ab}, \quad i \geq 0        (40)

i_i = \max(\{ z_{ab} / x''_{ab} : a \in n, \, b \in v_a \}, 0)
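The following schematic NumPy sketch mirrors the alternating updates above, fixing two of (k, α, i) and updating the third. The use of scipy's bounded 1-D minimizer for the α step is an implementation choice rather than part of the original derivation, and all names and clamping ranges are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def estimate_material(y, x, visible, n_lights, alpha_max=100.0, iters=10):
    """y[a, b]: specular observation; x[a, b, c] = (R_bc . V_ab);
    visible[a, b]: boolean visibility mask for vertex b in view a."""
    n_views, n_verts = y.shape
    k = np.full(n_verts, 0.5)
    alpha = np.full(n_verts, 10.0)
    i = np.full(n_lights, 0.5)
    for _ in range(iters):
        spec = (x ** alpha[None, :, None] * i[None, None, :]).sum(axis=2)  # sum_c (x_abc)^a_b i_c
        # Update k_b (Eq. 37): per-vertex ratio, clamped to [0, 1].
        for b in range(n_verts):
            a_idx = visible[:, b]
            if a_idx.any():
                k[b] = np.clip(np.max(y[a_idx, b] / np.maximum(spec[a_idx, b], 1e-8)), 0.0, 1.0)
        # Update alpha_b (Eq. 38): independent bounded 1-D problems.
        for b in range(n_verts):
            a_idx = visible[:, b]
            if not a_idx.any():
                continue
            def err(ab):
                s = (x[a_idx, b, :] ** ab * i[None, :]).sum(axis=1)
                return np.sum((y[a_idx, b] - k[b] * s) ** 2)
            alpha[b] = minimize_scalar(err, bounds=(0.0, alpha_max), method='bounded').x
        # Update i_c (Eq. 40): ratio update per light, clamped to [0, 1].
        for c in range(n_lights):
            rest = (x ** alpha[None, :, None] * i[None, None, :]).sum(axis=2) \
                   - x[:, :, c] ** alpha[None, :] * i[c]
            z = y - k[None, :] * rest
            xc = k[None, :] * x[:, :, c] ** alpha[None, :]
            ratio = z[visible] / np.maximum(xc[visible], 1e-8)
            i[c] = np.clip(np.max(ratio), 0.0, 1.0)
    return k, alpha, i
```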