Recording, Reconstructing, and Relighting Virtual Humans
by
Loc Vinh Huynh
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Science)
December 2021
Copyright 2021 Loc Vinh Huynh
Acknowledgements
This dissertation was accomplished thanks to the great effort, guidance, support, and encouragement
of many people. I would like to express my sincerest gratitude to those who helped make it
possible.
First, I would like to express my sincerest gratitude to my advisor Prof. Paul Debevec for his
wonderful guidance in research, generous support in many ways, and kind encouragement during
tough times. Thank you for enlightening me on many aspects of visual effects, and for giving me
opportunities to apply academic research in such applications. I would like to thank my dissertation
committee members Prof. Hao Li, Prof. Ulrich Neumann, Prof. Aiichiro Nakano, Prof. Michelle
Povinelli, and Prof. Andrew Nealen for their insightful suggestions and efforts. I would like to
thank Prof. Gérard Medioni and the VEF Foundation for giving me an opportunity to study in the
Ph.D. program at the University of Southern California and for providing guidance in my early days
in the US. I would like to thank USC’s CS department Ph.D. student adviser Lizsl DeLeon for
keeping me on track. I would also like to express my gratitude to my undergraduate supervisors
and mentors, Dr. Tien Dinh and Dr. Thang Dinh for encouraging me to study in the US.
I would like to express my sincerest gratitude to all members of the USC ICT Vision and Graph-
ics Lab. I would like to thank Prof. Hao Li for his interim guidance as my co-advisor, Jay Busch
for her cheerful support, Kathleen Haase, Michael Trejo, and Christina Trejo for their coordina-
tion, Bipin Kishore and Xinglei Ren for helping with hardware. Thank you, Marcel Ramos and
Pratusha Prasad for helping with data processing. Thank you, Graham Fyffe, Andrew Jones, Koki
Nagano, Weikai Chen, Shunsuke Saito, Jun Xing for staying overnight with me to finish awesome
submissions. Thank you, Yajie Zhao and Mingming He for giving me insightful discussions and
research directions.
I was very fortunate to collaborate with the SHOAH Foundation on the amazing project New Di-
mension in Testimony, which taught me a lot about history as well as the importance of
using technology for good. I would like to thank Kia Hays, Anita Pace, and Zachary Goode for
sharing precious data of the three Holocaust survivors Pinchas Gutter, Aaron Elster, and Eva Schloss.
I would like to thank Darren Hendler for giving me a chance to work at Digital Domain. I
would like to thank the amazing Software Research and Development team at Digital Domain for
showing me great technologies in VFX and for providing me kind support. I would like to thank
Doug Roble for his great discussions as a mentor, and for showing me the first look of Digital
Doug as a friend. Working at Digital Domain broadened my horizons and gave me a clear direction
for my future research.
I would like to thank the Apple Technology Development Group team: Olivier Soares for his
great leadership and for allowing me to work on an exciting project, Andrew Mason and Shaobo
Guan for their wonderful discussions, Jeong Wook Park for a lot of technical guidance and inspi-
rations, and Andrew Harvey for an amazing time in the Bay Area. My time at Apple inspired me to
do great things and ultimately to change the world with technology.
I would like to thank my family Vinh Huynh, Ha Pham, Phuong Huynh, Chi Huynh for their
love, support, and encouragement throughout my entire time in the program. Without them, I
would not have been able to come this far. Additionally, I would like to thank all of my friends, especially
my Ph.D. fellows Koki Nagano, Shunsuke Saito, and Chloe LeGendre for giving me a precious
experience during this journey.
Finally, I would like to thank my wonderful wife Giang Le for her patience, enthusiasm, un-
conditional love, and tenderness. She always believes in me, encourages me during tough times,
and pushes me to realize my dreams. I am very grateful to have her by my side.
Table of Contents
Acknowledgements
List of Figures
Abstract
Chapter 1: Introduction
1.1 Problem statement
1.2 Contributions
Chapter 2: Related Work
2.1 Facial Performance Tracking
2.2 Mesoscopic Facial Geometry Inference
2.2.1 Facial Geometry and Appearance Capture
2.2.2 Mesoscopic Facial Detail Capture
2.2.3 Geometric Detail Inference
2.3 Relighting a Dynamic Performance
Chapter 3: Multi-view Stereo on Consistent Face Topology
3.1 Introduction
3.2 Shared Template Mesh
3.3 Method Overview
3.3.1 Landmark-Based Initialization
3.3.2 Coarse-Scale Template Warping
3.3.3 Pose Estimation, Denoising, and Template Personalization
3.3.4 Fine-Scale Template Warping
3.3.5 Final Pose Estimation and Denoising
3.3.6 Detail Enhancement
3.4 Appearance-Driven Mesh Deformation
3.4.1 Image Warping
3.4.2 Optical Flow Based Update
3.4.3 Dense Mesh Representation
3.4.4 Laplacian Regularization
3.4.5 Updating Eyeballs and Eye Socket Interiors
3.5 PCA-Based Pose Estimation and Denoising
3.5.1 Rotation Alignment
3.5.2 Translation Alignment
3.5.3 Dimension Reduction
3.6 Results
3.7 Limitations
3.8 Discussion
Chapter 4: Mesoscopic Facial Geometry Inference Using Deep Neural Networks
4.1 Introduction
4.2 Overview
4.3 Geometry Detail Separation
4.4 Texture-to-Displacement Network
4.5 Super-Resolution Network
4.6 Implementation Details
4.7 Experimental Results
4.8 Discussion and Future Work
Chapter 5: Relighting Video with Reflectance Field Exemplars
5.1 Introduction
5.2 Method
5.2.1 Data Acquisition and Processing
5.2.2 Network Architecture
5.2.3 Loss Function
5.3 Evaluation
5.3.1 Single Image Portrait Relighting
5.3.2 Relighting as Style Transfer
5.3.3 User Study
5.3.4 Relighting Dynamic Performance
5.4 Future Work
5.5 Conclusion
Chapter 6: Conclusion and Future Work
6.1 Conclusion
6.2 Future Work
6.2.1 Multi-View Stereo Solve
6.2.2 Mesoscopic Facial Geometry Inference
6.2.3 Multi-view Reflectance Fields for Relighting
References
List of Figures
1.1 Uncanny valley. (a),(b) 2D characters, (c) 3D cartoon character, (d),(e) Digital humans in the valley, (f) Digital human climbing out of the valley, (g) Photograph.
1.2 Comparison photograph (left) and static rendering (right).
3.1 Our pipeline proceeds in six phases, illustrated as numbered circles. 1) A common template is fitted to multi-view imagery of a subject using landmark-based fitting (Section 3.3.1). 2) The mesh is refined for every frame using optical flow for coarse-scale consistency and stereo (Section 3.3.2). 3) The meshes of all frames are aligned and denoised using a PCA scheme (Section 3.3.3). 4) A personalized template is extracted and employed to refine the meshes for fine-scale consistency (Section 3.3.4). 5) Final pose estimation and denoising reduces “sizzling” (Section 3.3.5). 6) Details are estimated from the imagery (Section 3.3.6).
3.2 Production-quality mesh template and the cross-section of the volumetric template constructed from the surface.
3.3 (a) Facial landmarks detected on the source subject and (b) the corresponding template; (c) The template deformed based on detected landmarks on the template and subject photographs; (d) Detailed template fitting based on optical flow between the template and subject, and between views.
3.4 Facial expressions reconstructed without temporal flow.
3.5 (a) Dense base mesh; (b) Proposed detail enhancement; (c) “Dark is deep” detail enhancement.
3.6 Optical flow between photographs of different subjects (a) and (e) performs poorly, producing the warped images (b) and (f). Using 3D mesh estimates (e.g. a template deformed based on facial landmarks), we compute a smooth vector field to produce the warped images (c) and (g). Optical flow between the original images (a, e) and the warped images (c, g) produces the relatively successful final warped images (d) and (h).
3.7 (a) A stereo cue. An estimated point x is projected to the 2D point p_k in view k. A flow field F_k^l transfers the 2D point to p_l in a second view l. The point x is updated by triangulating the rays through p_k and p_l. (b) A reference cue. An estimated point y is projected to the 2D point q_j in view j. A flow field G_j^k transfers the 2D point to p_k in view k of a different subject or different time. A second flow field F_k^l transfers the 2D point to p_l in view l, and then point x is estimated by triangulating the rays through p_k and p_l.
3.8 Laplacian regularization results. Left: surface regularization only. Right: surface and volumetric regularization.
3.9 Comparison of rigid mesh alignment techniques on a sequence with significant head motion and extreme expressions. Top: center view. Middle: Procrustes alignment. Bottom: our proposed method. The green horizontal strike-through lines indicate the vertical position of the globally consistent eyeball pivots.
3.10 12 views of one frame of a performance sequence. Using flat blue lighting provides sharp imagery.
3.11 Zoomed renderings of different facial regions of three subjects. Top: results after coarse-scale fitting using the shared template (Section 3.3.2). The landmarks incorrectly located the eyes or mouth in some frames. Middle: results after pose estimation and denoising (Section 3.3.3). Bottom: results after fine-scale consistent mesh propagation (Section 3.3.4) showing the recovery of correct shapes.
3.12 Comparison using a morphable model of [20] as an initial template. (a) Front face region captured by the previous technique, (b) stitched on our full head topology. (c) Resulting geometry from Section 3.3.2 deformed using our method with (b) as a template, compared to (d) the result of using the Digital Emily template. The linear morphable model misses details in the nasolabial fold.
3.13 (a) Synthetic rendering of a morphable model using Figure 3.12. (b) Result using our image warping method to warp (a) to match real photograph (e). Similarly the common template image (c) is warped to match (e), producing plausible coarse-scale facial feature matching in (d).
3.14 (a) High quality mesh deformed using our technique; (b) high-resolution displacement details. Half-face visualization indicates good agreement and topology flow around geometric details within the facial expression.
3.15 Dynamic face reconstruction from a multi-view dataset of a male subject shown from one of the calibrated cameras (top). Wireframe rendering (second) and per-frame texture rendering (third) from the same camera. Enhanced details captured with our technique (bottom) show high quality agreement with the fine-scale details in the photograph.
3.16 Reconstructed mesh (a), enhanced displacement details with our technique (b), and comparison to previous work. Our method automatically captures whole head topology including nostrils, back of the head, mouth interior, and eyes, as well as skin details.
3.17 Comparison of our result (a) to PMVS2 [29] (b). Overlaying the meshes in (c) indicates a good geometric match.
3.18 Since our method reconstructs the face on a common head topology with coarse-scale feature consistency across subjects, blending between different facial performances is easy. Here we transition between facial performances from three different subjects.
3.19 While our system is able to reconstruct most surface facial features, it struggles to reconstruct features that are not represented by a template, such as the tongue and teeth.
4.1 System pipeline. From the multi-view captured images, we calculate the texture map and base mesh. The texture (1K resolution) is first fed into our trained Texture2Disp network to produce 1K high- and 1K middle-frequency displacement maps, followed by up-sampling them to 4K resolution using our trained SuperRes network and bicubic interpolation, respectively. The combined 4K displacement map can be embossed onto the base mesh to produce the final high-detail mesh.
4.2 The histogram of the medium and high frequency pixel count shows that the majority of the high frequency details lie within a very narrow band of displacement values compared to the medium frequency values spreading over a broader dynamic range.
4.3 By separating the displacement map into high and middle frequencies, the network could learn both the dominant structure and subtle details (a and b), and the details could be further enhanced via super-resolution (b and c).
4.4 Synthesis results given different input textures with variations in subject identity and expression. From (a) to (e), we show the input texture, base mesh, and the output geometry with the medium, 1K multi-scale (both medium and high frequency), and 4K multi-scale frequency displacement map. The closeup is shown under each result.
4.5 High frequency details of our method (center) compared with ground truth Light Stage data [55] (left) and the “dark is deep” heuristic [30] (right).
4.6 Inferred detail comparison.
4.7 Compared with Sela et al. [74], our method could produce a more detailed normal map.
4.8 Our method could generate much more subtle details (middle) than the surface normal prediction of Bansal et al. [70] (left).
4.9 Failure case with extreme makeup: (left) Input texture, (center) Our result, (right) Ground truth [55].
4.10 Results with unconstrained images. Left: input image, texture. Middle: displacement (zoom-in), rendering (zoom-in). Right: rendering.
5.1 The architecture of our neural network. The input image is passed through a U-Net style architecture to regress to the set of OLAT images. When the ground truth is available, the network prioritizes the reconstruction loss of the OLAT image set. Otherwise, the network is trained based on the feedback of the relit image.
5.2 End-to-end semi-supervised training scheme. We use reconstruction loss for synthetic images while image-based lighting loss is applied to both real and synthetic interview images.
5.3 Reflectance field: 27 of 41 one-light-at-a-time images.
5.4 Comparison with Single Image Portrait Relighting. Our result has greater lighting detail and looks much closer to the reference lighting.
5.5 Comparison with Style Transfer based relighting. Our method reproduces more convincing shadows and highlights.
5.6 Relighting results. Row 1: Input interview videos. Rows 2,3: OLAT predictions on two patterns. Rows 4,5: Relighting results with two HDRI lighting environments: Grace Cathedral and Pisa Courtyard. See more examples in our video.
Abstract
This dissertation presents methods and systems for recording, reconstructing, and relighting virtual
humans. The work aims to not only reproduce high fidelity digital humans but also to realistically
place their dynamic performance in the environmental context of the target scenes in the virtual
world. This work presents two approaches to address the stated problem.
The first approach is to use the traditional computer graphics rendering pipeline, where we need
to accurately estimate the geometry and materials of the human subject. First, we present a multi-
view stereo reconstruction technique that directly produces a complete high-fidelity head model
with a consistent facial mesh topology. Our approach consists of deforming a common template
model to match multi-view input images of the subject, while satisfying cross-view, cross-subject,
and cross-pose consistencies. While the quality of our results is on par with the current state-of-
the-art, our approach can be fully parallelized, does not suffer from drift, and produces face models
with production-quality mesh topologies.
Additionally, we present a learning-based approach for synthesizing facial geometry at medium
and fine scales from diffusely lit facial texture maps. Our model is trained with measured facial
detail collected using polarized gradient illumination. This enables us to produce plausible facial
detail across the entire face, including where previous approaches may incorrectly interpret dark
features as concavities such as moles, hair stubble, and occluded pores. Instead of directly infer-
ring 3D geometry, we encode fine details in high-resolution displacement maps which are learned
through a hybrid network adopting the state-of-the-art image-to-image translation network and
super-resolution network. To effectively capture geometric detail at both mid and high frequen-
cies, we factorize the learning into two separate sub-networks, enabling the full range of facial
detail to be modeled.
Deep learning offers powerful tools for tackling hard problems, including our dynamic re-
lighting problem. For this second approach, this work presents a learning-based method for es-
timating the 4D reflectance field of a person given video footage of the same subject illuminated
under a flat-lit environment. For training data, we use one light at a time to illuminate the subject
and capture the reflectance field data in a variety of poses and viewpoints. We then train a deep
convolutional neural network to regress the reflectance field from the synthetic relit images. We
also use a differentiable renderer to provide feedback for the network by matching the relit images
with the input video frames. This semi-supervised training scheme allows the neural network to
handle unseen poses in the dataset as well as compensate for the lighting estimation error.
Chapter 1
Introduction
1.1 Problem statement
One of the central goals of Computer Graphics research over the past decades has been to create
photo-realistic virtual humans and to animate and relight them in a way that is indistinguishable from
real humans. Digital doubles now appear in more and more blockbuster movies and games,
which emphasizes the importance of realism in the modern entertainment industry.
With the surge of data-hungry Artificial Intelligence (AI), the demand for
data is increasing tremendously, and high-quality data such as photo-real renderings of virtual
humans is in particular demand, as it has been shown to boost the performance of AI systems significantly.
In addition, the growing impact of Virtual Reality and Augmented Reality in daily-life applications,
from virtual teleconferencing and health care assistants to virtual concerts and exhibitions, has created
great demand for automated ways to create high-quality digital avatars and to seamlessly place them
in the virtual world. These virtual activities became a necessity during the COVID-19 pandemic, and
they have gradually integrated into our daily life, even into the post-pandemic future.
A major challenge of creating high-quality digital avatars is avoiding the “Uncanny Valley”
phenomenon [1]. The Uncanny Valley describes the reaction of observers as they interact with a
digital character. As the level of realism increases, observers feel more comfortable interacting
with the digital character. The level of comfort continues to increase until the character appears
“close to human”. The observers are alerted to strangeness or creepiness when their expectations
of a “human” are not met, and the comfort level drops significantly (Figure 1.1). Recent
advancements in 3D capture and modeling systems come close to climbing out of the valley (see Figure 1.2),
but a great deal of artist effort is still involved in the processing pipeline. Our first work, in Chapter 3,
focuses on techniques to automate the pipeline and make it faster without losing realism.
Figure 1.1: Uncanny valley. (a),(b) 2D characters, (c) 3D cartoon character, (d),(e) Digital humans
in the valley, (f) Digital human climbing out of the valley, (g) Photograph.
An important part of creating photo-realistic human faces is skin detail, from the dynamic
wrinkles around the eyes, mouth, and forehead that define expression to the fine-scale texture of
fine creases and pores that make up the real human skin surface. Constructing such details on a digital
character can take weeks of effort by digital artists, and often employs specialized and expensive
3D scanning equipment to measure skin details from real people. The problem is made much
more complicated by the fact that such skin details are dynamic: wrinkles form and disappear,
skin pores stretch and shear, and every change provides a cue to the avatar’s expression and their
realism. We address this problem in Chapter 4, which provides a solution for dynamically
capturing skin details at mesoscopic resolution.
Figure 1.2: Comparison of (a) a photograph and (b) a static rendering.
When an application focuses on single-view relighting, the image-
based rendering technique reproduces the most realistic lighting of the subject under a given
environment. This technique overcomes the need for highly accurate reconstruction of the subject’s
shape and reflectance. The technique, however, requires many photographs to fully capture
the reflectance of the subject, so it is normally used for static scans. In Chapter 5, we
propose a learning-based approach to acquire the reflectance of the subject at any moment of the input
video, and to realistically relight the subject with novel illuminations.
1.2 Contributions
This dissertation presents solutions for recording, reconstructing, and photo-realistically relighting
virtual humans. The traditional computer graphics solution focuses on highly accurate reconstruc-
tion of shape and reflectance while the modern learning-based solution bypasses the geometry
reconstruction to create convincing videos of the subject under various lighting conditions.
In summary, the contributions of this thesis are:
• A fully parallelizable multi-view stereo facial performance capture pipeline that produces a
high-quality facial reconstruction with consistent mesh topology.
– A passive multi-view capture system using static blue LEDs for imaging high resolu-
tion skin textures suitable for dynamic surface reconstruction.
– An appearance-driven mesh deformation algorithm using optical flow on high-resolution
imaging data combined with volumetric Laplacian regularization.
– A novel PCA-based pose estimation and denoising technique.
• The first deep learning framework that reconstructs high resolution dynamic displacement
comparable to active illumination scanning systems entirely from passive multi-view imagery.
– We show how it is possible to learn such inference from sparse but high-resolution
geometry data using a two-level image translation network with a conditional GAN
combined with a patch-based super resolution network.
– We leverage an existing 3D face tracking method to provide state-of-the-art tracking
results on low-resolution face videos.
– We provide robust reconstruction of both medium and high frequency structures, includ-
ing moles and hair stubble, correctly distinguishing surface pigmentation from actual
surface bumps, and outperforming other methods based on high-frequency hallucination or
simulation.
• A practical process for recording interview footage where the lighting can be controlled
realistically after filming.
– A capture system that enables 4D reflectance field estimation of moving subjects.
– A machine learning-based model that maps diffuse images lit from above to the whole
set of reflectance fields.
Chapter 2
Related Work
Accurate acquisition of geometry and reflectance from images under controlled or unconstrained
lighting conditions has been a long-studied research topic in computer vision and graphics. Re-
lighting, a direct application of this, has received more research interest recently thanks
to its wide applicability in visual effects and virtual and augmented reality. In this chap-
ter, we summarize some of the most related work on digitizing dynamic facial appearance and
relighting virtual humans.
2.1 Facial Performance Tracking
Driving the motion of digital characters with real actor performances has become a common and
effective process for creating realistic facial animation. Performance-driven facial animation dates
back at least as far as Williams [2], who used facial markers in a monocular video to animate and
deform a 3D scanned facial model. Guenter et al. [3] drove a digital character from multi-view
video by 3D tracking a few hundred facial markers seen in six video cameras. Yet even a dense set
of facial markers can miss subtle facial motion details necessary for conveying the entire meaning
of a performance. Addressing this, Disney’s Human Face Project [4] was perhaps the first to use
dense optical flow on multi-view video of a facial performance to obtain dense facial motion for
animation, setting the stage for the markerless multi-view facial capture system used to animate
realistic digital characters in the “Matrix” sequels [5]. In our work, we use a multi-view video
setup to record facial motion, but fit a model to the images per time instant rather than temporally
tracking the performance.
Realistic facial animation may also be generated through physical simulation as in [6, 7]. The
computer animation “The Jester” [8] tracked the performer’s face with a standard set of mocap
markers but used finite element simulation to simulate higher-resolution performance details such
as the skin wrinkling around the eyes. [9] used a bone, flesh, and muscle model of a face to reverse-
engineer the muscle activations which generate the same motion of the face as recorded with mocap
markers. Recent work showed a way to automatically construct a personalized anatomical model
for volumetric facial tissue simulations [10]. In our work, we use a volumetric facial model to
enable robust model fitting solutions including occluded regions.
Faces assume many shapes but have the same features in similar positions, and as a result,
can be modeled with generic templates such as morphable models [11]. Such models have proven
useful in recent work for real-time performance tracking [12–19], expression transfer [20–23],
and performance reconstruction from monocular video [24, 25]. Recent work demonstrated the
reconstruction of a personalized avatar from mobile free-form videos [26], medium-scale dynamic
wrinkles from RGB monocular video [27], and high fidelity mouth animation for a head-mounted
display [28]. These template-based approaches can provide facial animation in a common artist-
friendly topology with blendshape animations. But they cannot capture shape details outside of
the assumed linear deformation subspace, which may be important for high-quality expressive
facial animation. On the other hand, our technique captures accurate 3D shapes comparable to
multi-view stereo, on a common head topology without the need for complex facial rigs.
Multi-view stereo approaches [29–31] remain popular since they yield verifiable and accurate
geometry even though they require offline computation. Our dynamic performance reconstruction
technique differs from techniques such as [31–33] in that we do not begin by solving for indepen-
dent multi-view stereo geometry at each time instant. In fact, our method does not require a set
of high-resolution facial scans (or even a single facial scan) of the subject to assist performance
tracking as in [24, 34–36]. Instead, we employ optical flow and surface/volume Laplacian priors
to constrain 3D vertex estimates based on a template.
Video-based facial performance capture is susceptible to “drift”, meaning inconsistencies in
the relationship between facial features and the mesh parameterization across different instants in
time. For example, a naive algorithm that tracks vertices from one frame to the next will accu-
mulate error throughout a performance. Previous works have taken measures to mitigate drift in
a single performance. However, none of these approaches lends itself to multiple performance
clips or collections of single-frame captures. Furthermore, previous works addressing fine-scale
consistency involve at least one manual step if the high-quality topology is desired. [31] employs
a manually selected reference frame and geometry obtained from the stereo reconstruction. If
a clean topology is desired, it is edited manually. The method locates “anchor frames” similar
to the reference frame to segment the performance into short clips, and optical flow tracking is
performed within each clip and across clip seams. The main drawback of this method is that all
captures must contain well-distributed anchor frames that resemble the reference, which limits
the expressive freedom of the performer. [37] constructs a minimum spanning tree in appearance
space and employs non-sequential tracking to reduce drift, combined with temporal tracking to
reduce temporal seams. The user must manually create a mesh for the frame at the root of the
tree, based on geometry obtained from the stereo reconstruction. Despite the minimum spanning
tree, expressions far from the root expression still require concatenation of multiple flow fields,
accumulating drift. If single-frame captures are included in the data, it may fail altogether. [24]
employs a neutral facial scan as a template, and requires manually refined alignment of the neutral
scan to the starting frame of the performance. The method locates “keyframes” resembling the
neutral scan (much like anchor frames) to segment the performance into short clips that are tracked
via temporal optical flow and employs synthetic renderings of the neutral scan to reduce drift via
optical flow correction. This method is unsuitable for collections of multiple performances, as
the manual initial alignment required for each performance would be prohibitive and error-prone.
[36] employs multiple facial scans, with one neutral scan serving as a template. The neutral scan
topology is produced manually. The neutral scan is tracked directly to all performance frames and
all other scans using optical flow, which is combined with temporal optical flow and flows between
a sparse set of frames and automatically selected facial scans to minimize drift. This method han-
dles multiple performance clips and multiple single-frame captures in correspondence but requires
multiple facial scans spanning the appearance space of the subject’s face, one of which is manually
processed.
2.2 Mesoscopic Facial Geometry Inference
2.2.1 Facial Geometry and Appearance Capture.
The foundational work of Blanz and Vetter [11] showed that a morphable principal components
model built from 3D facial scans can be used to reconstruct a wide variety of facial shapes and
overall skin coloration. However, the scans used in their work did not include submillimeter-
resolution skin surface geometry, and the nonlinear nature of skin texture deformation would be
difficult to embody using such a linear model. For the skin detail synthesis, Saito et al. [38]
presented a photorealistic texture inference technique using a deep neural network-based feature
correlation analysis. The learned skin texture details can be used to enhance the fidelity of person-
alized avatars [39]. While the method learns to synthesize mesoscopic facial albedo details, the
same approach cannot be trivially extended to the geometry inference since the feature correlation
analysis requires a number of geometry scans.
2.2.2 Mesoscopic Facial Detail Capture.
There are many techniques for deriving 3D models of faces, and some are able to measure high-
resolution skin detail to a tenth of a millimeter, the level of resolution we address in this work.
Some of the best results are derived by scanning facial casts from a molding process [40, 41], but
the process is time-consuming and impossible to apply to a dynamic facial performance. Multi-
view stereo provides the basic geometry estimation technique for many facial scanning systems
[29–32, 36, 42–44]. However, stereo alone typically recovers limited surface detail due to the
semi-translucent nature of the skin.
Inferring local surface detail from shape from shading is a well-established technique for un-
constrained geometry capture [45–47], and has been employed in digitizing human faces [24–26,
48–51]. However, the fidelity of the inferred detail is limited due to the input image captured
under an unconstrained setting. Beeler et al. [30, 31] applied shape from shading to emboss high-
frequency skin shading as hallucinated mesoscopic geometric details for skin pores and creases.
While the result is visually plausible, some convexities on the surface can be misclassified as
geometric concavities, producing incorrect surface details. More recent work [44] extended this
scheme to employ relative shading change (“darker is deeper” heuristics) to mitigate the ambiguity
between the dark skin pigmentation and the actual geometric details.
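To make the ambiguity concrete, the heuristic can be caricatured in a few lines (a toy sketch of the general idea only, not the algorithm of [30] or [44]; the function name, filter width, and scale factor are illustrative):

import numpy as np
from scipy.ndimage import gaussian_filter

def dark_is_deep_displacement(luminance, sigma=4.0, scale=0.05):
    # Toy "dark is deep" emboss: pixels darker than their local average are pushed inward.
    # luminance is an HxW array in [0, 1]; negative output values indicate concavities.
    local_mean = gaussian_filter(luminance, sigma)   # low-frequency shading
    high_pass = luminance - local_mean               # pore/crease-scale variation
    return scale * high_pass                         # darker than average -> deeper

Because darkness alone drives the inferred depth, pigmentation such as a mole or stubble is embossed as a dent even when the surface is flat or convex, which is the failure mode the learned approach in Chapter 4 is designed to avoid.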
Active photometric stereo based on specular reflections has been used to measure detailed
surface geometry in devices such as a light stage [52–55] which can be used to create photoreal
digital humans [56–58]. A variant of the system has been introduced by Weyrich et al. [59] for sta-
tistical photorealistic reflectance capture. Ghosh et al. [55] employed multi-view polarized spher-
ical gradient illumination to estimate sub-millimeter accurate mesoscopic displacements. Static
[54] and dynamic [60] microstructures can be recorded using a similar photometric system or a
contact based method [61, 62]. Photometric stereo techniques have been extended to video perfor-
mance capture [63–65]. However, these require high-speed imaging equipment and synchronized
active illumination to record the data.
2.2.3 Geometric Detail Inference.
Previous work has successfully employed data-driven approaches for inferring facial geometric de-
tails. Skin detail can be synthesized using data-driven texture synthesis [61] or statistical skin detail
models [66]. Dynamic facial details can be inferred from sparse deformation using polynomial
texture maps [63] or radial basis functions [67]. However, these methods can require significant
effort to apply to a new person. More recently, Cao et al. [68] proposed to locally regress medium-
scale details (e.g. expression wrinkles) from high-resolution capture data. While generalizing to
new test data after training, their approach cannot capture pore-level details.
A neural network-based approach has been introduced for predicting image-domain pixel-wise
transformation with a conditional GAN [69] and inference of surface normals for general objects
[70, 71]. For facial geometry inference, Trigeorgis et al. [72] employed fully convolutional net-
works to infer a coarse face geometry through surface normal estimation. More recently, Richard-
son et al. [73] and Sela et al. [74] presented a learning-based approach to reconstruct detailed facial
geometry from a single image. However, none of the previous works has addressed the inference
of mesoscopic facial geometry, perhaps due to the limited availability of high fidelity geometric
data.
2.3 Relighting a Dynamic Performance
Relighting virtual humans from images and video is an active research topic in computer vision
and computer graphics. In this section, we summarize some of the most related work in inverse
rendering, image-based relighting, and learning-based relighting methods.
Inverse Rendering. If photos of a scene can be analyzed to derive an accurate model of the
scene’s geometry and materials, the model can be rendered under arbitrary new lighting using
forward rendering. This inverse rendering problem is a long-studied research topic in computer
vision and graphics [75–77]. Many of these approaches use strong assumptions such as known
illumination [78], or hand-crafted priors [47].
Unsurprisingly, relighting human bodies and faces has received particular interest recently.
Many parametric models have been proposed to jointly reconstruct geometry, reflectance, and il-
lumination of human bodies [79], faces [11, 23–26, 80], eyes [81], eyelids [82], and hair [83, 84].
[85] relights videos of humans based on estimation of parametric BRDF models and wavelet-based
incident illumination. [86] uses a diffuse model for the face to relight it with a radiance environ-
ment map using ratio images. [87] performs relighting by using spherical gradient illumination
images to fit a cosine lobe to the reflectance function. Several works estimate spatially-varying re-
flectance properties of a scene from either flash [88, 89] or flat-lit images [90]. [91] uses deep
neural networks to estimate the parameters of a predefined geometry and reflectance model from a
single image.
These parametric models are typically designed to handle specific parts of the human body.
Many of these techniques rely on lightweight morphable models for geometry, a Lambertian model
for skin reflectance, and a low-frequency 2nd-order spherical harmonic basis for illumination. Un-
fortunately, these strong priors only capture low-frequency detail and do not reproduce the appear-
ance of specular reflections and sub-surface scattering in the skin.
In contrast, we use a deep neural network to infer the subject’s reflectance field, which can
be used to relight the images without explicitly modeling the geometry, material reflectance, and
illumination of the images.
Image-based Relighting. When a person is recorded under a large number of individual lighting
conditions, they can be accurately relit by linearly combining those one-light-at-a-time (OLAT)
images with the target illumination [92]. [93] used high-speed video to capture dynamic subjects
with time-multiplexed lighting conditions for relighting, but required expensive cameras, optical
flow computation, and was data intensive. [94] recorded a coarser lighting basis from a multitude
of viewpoints, allowing post-production control of both the lighting and the viewpoint. However,
neither technique could be applied to long video segments due to the large size of the high frame
rate video.
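The relighting operation underlying [92] is simply a weighted sum over the lighting basis. A minimal sketch follows (the array shapes and the idea of integrating an environment map into per-light weights are assumptions for illustration, not the original implementation):

import numpy as np

def relight_from_olat(olat_images, light_weights):
    # olat_images:   (N, H, W, 3) float array, one image per basis light.
    # light_weights: (N, 3) per-light RGB weights, e.g. an HDR environment map
    #                integrated over the solid angle each basis light covers.
    # Returns the (H, W, 3) relit image as a channel-wise weighted sum over the N lights.
    return np.einsum('nc,nhwc->hwc', light_weights, olat_images)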
Another technique is to transfer reflectance field properties from a pre-captured subject to a
target subject’s performance as in [95]. However, the quality of the lighting transfer depends on
the number of captured poses and the similarity in appearance of the two subjects. Relighting can also
be performed by transferring local image statistics from one portrait image to the target portrait as
in [96]; however, this technique does not work well for extreme lighting changes.
Our approach also uses a set of lighting basis conditions to perform relighting. But instead of
recording OLATs for every moment of the video, our neural network infers OLATs for each video
frame based on exemplars from static poses, enabling dynamic performance relighting.
Learning-Based Relighting. [97] trains a deep neural network to map images of a subject lit by
spherical gradient basis illumination to a set of one-light-at-a-time (OLAT) images for relighting.
Similar to this approach, we map the interview lighting images to a set of OLAT images. But
unlike [97], we employ a semi-supervised training scheme to train the network due to the lack of
ground truth in our dataset and to work on the single interview lighting condition that’s available.
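The two training signals described above can be sketched as follows (a simplified illustration under assumed tensor shapes, L1 losses, and a scalar weight per OLAT light; the actual losses and weights used in Chapter 5 are not reproduced here):

import torch
import torch.nn.functional as F

def semi_supervised_loss(pred_olat, input_frame, light_weights, gt_olat=None,
                         w_rec=1.0, w_relight=1.0):
    # pred_olat:     (N, C, H, W) OLAT images predicted by the network.
    # input_frame:   (C, H, W) video frame under the known interview lighting.
    # light_weights: (N,) weights expressing the interview lighting in the OLAT basis.
    # gt_olat:       (N, C, H, W) ground-truth OLATs, available only for the
    #                synthetically relit training poses.
    # Differentiable relighting: combine the predicted OLATs under the interview lighting.
    relit = (light_weights.view(-1, 1, 1, 1) * pred_olat).sum(dim=0)
    loss = w_relight * F.l1_loss(relit, input_frame)
    if gt_olat is not None:
        # Supervised OLAT reconstruction term, used when ground truth exists.
        loss = loss + w_rec * F.l1_loss(pred_olat, gt_olat)
    return loss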
[98] proposes a neural network that takes a single portrait photo under any lighting environment
and relights the subject with arbitrary target illumination. This network was trained on a large
set of subjects under a dense set of lighting conditions to predict the input illumina-
tion and perform relighting by replacing the illumination at the bottleneck of the neural network.
The technique is overall successful, but the low resolution of the predicted illumination limits the
quality of the relit result. [99] also uses a large amount of data from 70 subjects with dense lighting
conditions to estimate a more detailed HDR lighting environment from a single portrait image.
[100] presents a recent advance in Style Transfer techniques, where a video can be changed to a
different style by registering to one or more keyframes in the new style. This is most often used
to transfer non-photorealistic rendering styles such as a pastel drawing, but can also be used to
transfer a new style of lighting. However, this technique has not been applied to create arbitrarily
relightable models and requires registration from the style exemplars to the video sequence. In
comparison, our method is designed to perform realistic relighting from a single lighting condition
by providing the neural network a set of reflectance field exemplars of how the subject actually
should appear under OLAT lighting conditions.
Chapter 3
Multi-view Stereo on Consistent Face Topology
3.1 Introduction
Video-based facial performance capture has become a widely established technique for the digiti-
zation and animation of realistic virtual characters in high-end film and game production. While
recent advances in facial tracking research are pushing the boundaries of real-time performance
and robustness in unconstrained capture settings, professional studios still rely on computationally
demanding offline solutions with high resolution imaging. To further avoid the uncanny valley,
time-consuming and expensive artist input, such as tracking clean-up or key-framing, is often
required to fine-tune the automated tracking results and ensure consistent UV parameterization
across the input frames.
State-of-the-art facial performance capture pipelines are mostly based on a multi-view stereo
setup to capture fine geometric details, and generally decouple the process of model building and
facial tracking. The facial model (often a parametric blendshape model) is designed to reflect the
expressiveness of the actor but also to ensure that any deformation stays within the shape and
expression space during tracking. Because of the complexity of facial expressions and potentially
large deformations, most trackers are initialized from the previous input frames. However, such
sequential approaches cannot be parallelized and naturally result in drift, which requires either
artist-assisted tracking corrections or ad-hoc segmentation of the performance into short temporal
clips.
We show in this work that it is possible to directly obtain, for any frame, a high-resolution
facial model with consistent mesh topology using a passive multi-view capture system with flat
illumination and high-resolution input images. We propose a framework that can accurately warp
a reference template model with existing texture parameterization to the face of any person, and
demonstrate successful results on a wide range of subjects and challenging expressions. While
existing multi-view methods either explicitly compute the geometry [31, 37] or implicitly encode
stereo constraints [101], they rely on optical flow or scene-flow to track a face model, for which
computation is only possible sequentially. Breaking up the performance into short clips using
anchor frames or key frames with a common appearance is only a partial solution, as it requires
the subject to return to a common expression repeatedly throughout the performance.
Our objective is to warp a common template model to a different person in arbitrary poses and
different expressions while ensuring consistent anatomical matches between subjects and accurate
tracking across frames. The key challenge is to handle the large variations of facial appearances
and geometries, as well as the complexity of facial expression and large deformations. We propose
an appearance-driven mesh deformation approach that produces intermediate warped photographs
for reliable and accurate optical flow computation. Our approach effectively avoids image disconti-
nuities and artifacts often caused by methods based on synthetic renderings or texture reprojection.
In a first pass, we compute temporally consistent animations, produced from indepen-
dently computed frames, by deforming a template model to the expressions of each frame while
enforcing consistent cross-subject correspondences. To initialize our face fitting, we leverage re-
cent work in facial landmark detection. In each subsequent phase of our method, the appearance-
based mesh warping is driven by the mesh estimate from the previous phase. We show that even
where the reference and target images exhibit significant differences in appearance (due to signifi-
cant head rotation, different subjects, or expression changes), our warping approach progressively
converges to a high-quality correspondence. Our method does not require a complex facial rig or
blendshape priors. Instead, we deform the full head topology according to the multi-view optical
flow correspondences, and use a combination of surface and volumetric Laplacian regularization
to produce a well-behaved shape, which helps especially in regions that are prone to occlusion and
inter-penetration such as the eyes and mouth pocket.
Figure 3.1: Our pipeline proceeds in six phases, illustrated as numbered circles. 1) A common
template is fitted to multi-view imagery of a subject using landmark-based fitting (Section 3.3.1).
2) The mesh is refined for every frame using optical flow for coarse-scale consistency and stereo
(Section 3.3.2). 3) The meshes of all frames are aligned and denoised using a PCA scheme (Section
3.3.3). 4) A personalized template is extracted and employed to refine the meshes for fine-scale
consistency (Section 3.3.4). 5) Final pose estimation and denoising reduces “sizzling” (Section
3.3.5). 6) Details are estimated from the imagery (Section 3.3.6).
As the unobserved regions such as the back of the head are inferred from the Laplacian defor-
mation, these regions may be temporally inconsistent in the presence of significant head motion
or expression changes. Thus we propose a novel PCA-based technique to estimate pose and de-
noise the facial meshes over the entire performance, which improves temporal consistency around
the top and back of the head and reduces high-frequency “sizzling” noise. We then compute a
subject-specific template and refine the performance capture in a second pass to achieve pore-level
tracking accuracy.
Our method never computes optical flow between neighboring frames, and never compares
a synthetic rendering to a photograph. Thus, our method does not suffer from drift, and accu-
rately corresponds regions that are difficult to render synthetically such as around the eyes. Our
method can be applied equally well to a set of single-frame expression captures with no temporal
continuity, bringing a wide variety of facial expressions into (u,v) correspondence with pore-level
accuracy. Furthermore, our joint optimization for stereo and fitting constraints also improves the
digitization quality around highly occluded regions such as the mouth, eyes, and nostrils, as they pro-
vide additional reconstruction cues in the form of shape priors. We report timings for each step of
our method, most of which are trivially parallelizable across multiple computers.
Figure 3.2: Production-quality mesh template and the cross-section of the volumetric template
constructed from the surface.
3.2 Shared Template Mesh
Rather than requiring a manually constructed personalized template for each subject, our method
automatically customizes a generic template including the eyes and mouth interior. We maintain
a consistent representation of this face mesh throughout our process: a shared template mesh with
its deformation parameterized on the vertices. The original template can be any high-quality artist
mesh with associated multi-view photographs. To enable volumetric regularization, we construct
a tetrahedral mesh for the template using TetGen [102] (Figure 3.2). We also symmetrize the
template mesh by averaging each vertex position with that of the mirrored position of the vertex
bilaterally opposite it. This is because we do not want to introduce any facial feature asymmetries
of the template into the Laplacian shape prior. For operations relating the template back to its multi-
view photographs, we use the original vertex positions. For operations employing the template as
a Laplacian shape prior, we employ the symmetrized vertex positions.
Figure 3.3: (a) Facial landmarks detected on the source subject and (b) the corresponding template;
(c) The template deformed based on detected landmarks on the template and subject photographs;
(d) Detailed template fitting based on optical flow between the template and subject, and between
views.
We demonstrate initialization using a high-quality artist mesh template constructed from multi-
view photography. We use the freely available “Digital Emily” mesh, photographs, and camera
calibration from [103]. The identity of the template is of no significance, though we purposely
chose a template with no extreme unique facial features. A single template can be reused for many
recordings of different subjects. We also compare to results obtained from a morphable model [20]
with synthetic renderings in place of multi-view photographs.
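The symmetrization step described above amounts to averaging each vertex with its reflected counterpart. A minimal sketch, assuming the template is aligned so that x = 0 is the sagittal plane and that a bilateral vertex correspondence has already been established (neither assumption is spelled out in the text):

import numpy as np

def symmetrize_template(vertices, mirror_index):
    # vertices:     (V, 3) template vertex positions.
    # mirror_index: (V,) index of the bilaterally opposite vertex for each vertex.
    # Returns symmetrized positions, used only for the Laplacian shape prior.
    mirrored = vertices[mirror_index].copy()
    mirrored[:, 0] *= -1.0           # reflect the counterpart across the x = 0 plane
    return 0.5 * (vertices + mirrored)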
3.3 Method Overview
Given an existing template mesh, we can reconstruct multiple video performances by optimizing
photoconsistency cues between different views, across different expressions, and across different
subjects. Our method consists of six sequential phases, illustrated in Figure 3.1. In this section,
18
we provide a short overview of each phase. Further technical details are provided in Sections 3.4
and 3.5. We report run times for each phase based on computers with dual 8-core Intel E5620
processors and NVidia GTX980 graphics cards. All phases except for rigid alignment are trivially
parallelizable across frames.
3.3.1 Landmark-Based Initialization
First, we leverage 2D facial landmark detection to deform the common template and compute an
initial mesh for each frame of the performance. Subsequent optical flow steps require a mesh
estimate which is reasonably close to the true shape. We estimate facial landmark positions on
all frames and views using the method of [104] implemented in the DLib library [105]. We then
triangulate 3D positions with outlier rejection, as the landmark detection can be noisy. We use
the same procedure for the template photographs to locate the template landmark positions. Fig-
ure 3.3(a) shows an example with detected landmarks as black dots, and triangulated landmarks
after outlier rejection as white dots. We transform the 3D landmarks of all poses to a common
coordinate system using an approximate rigid registration to the template landmarks. We perform
PCA-based denoising per subject in the registered space to remove any isolated errors, and then
transform the landmarks back into world space. We additionally apply Gaussian smoothing to the
landmark trajectories in each performance sequence. We finally compute a smooth deformation of
the template to non-rigidly register it to the world space 3D landmarks of each captured facial pose,
using Laplacian mesh deformation. Figure 3.3(c) shows an example of the template deformed to
a subject using only the landmarks. These deformed template meshes form the initial estimates in
our pipeline, and are not required to be entirely accurate. This phase takes only a few seconds per
frame, which are processed in parallel except for the PCA step.
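To make the triangulation step concrete, the following sketch triangulates a single detected landmark from several calibrated views by linear least squares and then rejects views whose reprojection error is large. It is a minimal illustration rather than the production code: the 3x4 projection matrices, the 10-pixel threshold, and the function names are assumptions for the example.

```python
import numpy as np

def triangulate_dlt(P_list, uv_list):
    """Linear (DLT) triangulation of one point from two or more views.
    P_list: list of 3x4 camera projection matrices.
    uv_list: list of (u, v) pixel observations, one per view."""
    A = []
    for P, (u, v) in zip(P_list, uv_list):
        A.append(u * P[2] - P[0])
        A.append(v * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    X = Vt[-1]
    return X[:3] / X[3]                      # homogeneous -> Euclidean

def triangulate_with_outlier_rejection(P_list, uv_list, thresh_px=10.0):
    """Triangulate, drop views whose reprojection error exceeds thresh_px
    (noisy landmark detections), and re-triangulate from the survivors."""
    X = triangulate_dlt(P_list, uv_list)
    keep_P, keep_uv = [], []
    for P, uv in zip(P_list, uv_list):
        x = P @ np.append(X, 1.0)
        if np.linalg.norm(x[:2] / x[2] - np.asarray(uv)) < thresh_px:
            keep_P.append(P)
            keep_uv.append(uv)
    if 2 <= len(keep_P) < len(P_list):
        X = triangulate_dlt(keep_P, keep_uv)
    return X
```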
3.3.2 Coarse-Scale Template Warping
Starting from the landmark-based initialization in Section 3.3.1, we employ an appearance-driven
mesh deformation scheme to propagate the shared template mesh onto the performance frames.
Figure 3.4: Facial expressions reconstructed without temporal flow.
More details on this algorithm are provided in Section 3.4. This phase takes 25 minutes per frame,
which are processed in parallel. (Most time is spent in the volumetric Laplacian solve.) After this
phase, the processed facial meshes are all high quality 3D scans with the same topology, and are
consistent at the level of coarse features such as the eyebrows, eyes, nostrils, and corners of the
mouth. Figure 3.4 shows the results at this phase directly deforming the template to multiple poses
of the same individual without using any temporal information. Despite significant facial motion,
the mesh topology remains consistent with the template. If only a single pose is desired for each
subject, we can stop here. If sequences or multiple poses were captured, we continue with the
remaining phases to improve consistency across poses.
3.3.3 Pose Estimation, Denoising, and Template Personalization
The face mesh estimates from Section 3.3.2 are reasonably good facial scans, but they exhibit two
sources of distracting temporal noise. First, they lack fine-scale consistency in the UV domain,
and second, any vertices that are extrapolated in place of missing data may differ considerably
from frame to frame (for example, around the back of the head). The primary purpose of this
phase is to produce a mesh sequence that is temporally smooth, with a plausible deformation
basis, and closer to the true face sequence than the original estimate in Section 3.3.1. We wish to
project the meshes into a reduced dimensional deformation basis to remove some of the temporal
noise, which requires the meshes to be registered to a rigidly aligned head pose space, rather than
roaming free in world space. Typically this is accomplished through iterative schemes, alternating
between pose estimation and deformation basis estimation. In Section 3.5 we describe a method
to decouple the pose from the deformation basis, allowing us to first remove the relative rotation
from the meshes without knowledge of the deformation basis, then remove the relative translation,
and finally compute the deformation basis via PCA. We truncate the basis retaining 95% of the
variance, which reduces temporal noise without requiring frame-to-frame smoothing. Finally, we
identify the frame whose shape is closest to the mean shape in the PCA basis, and let this frame
be the personalized template frame for the subsequent phases. This phase, which is not easily
parallelized, takes about 8 seconds per frame.
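A minimal sketch of the truncation and template-frame selection described above is given below, assuming the rigidly aligned meshes have already been flattened into rows of a data matrix; the 95% variance threshold follows the text, while the variable names are illustrative.

```python
import numpy as np

def pca_denoise_and_pick_template(meshes, keep_variance=0.95):
    """meshes: (M, 3N) array of rigidly aligned meshes (flattened vertices).
    Returns the denoised meshes and the index of the frame closest to the
    mean shape in the truncated PCA basis (the personalized template)."""
    mean = meshes.mean(axis=0)
    centered = meshes - mean
    # Economy SVD: rows are frames, columns are vertex coordinates.
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    var = s**2 / np.sum(s**2)
    k = int(np.searchsorted(np.cumsum(var), keep_variance)) + 1
    weights = U[:, :k] * s[:k]            # per-frame basis activations
    denoised = mean + weights @ Vt[:k]    # project back with the truncated basis
    template_idx = int(np.argmin(np.linalg.norm(weights, axis=1)))
    return denoised, template_idx
```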
3.3.4 Fine-Scale Template Warping
This phase is nearly identical to Section 3.3.2, except that we propagate the shared template only to
the personalized template frame identified in Section 3.3.3 (per subject), after which the updated
personalized template becomes the new template for the remaining frames (again, per subject).
This enables fine-scale consistency to be obtained from optical flow, as the pores, blemishes, and
fine wrinkles on a subject’s skin provide ample registration markers across poses. Further, we start
from the denoised estimates from Section 3.3.3 instead of the landmark based estimates of Section
3.3.1, which are much closer to the actual face shape of each frame, reducing the likelihood of
false matches in the optical flow.
3.3.5 Final Pose Estimation and Denoising
After the consistent mesh has been computed for all frames, we perform a final step of rigid regis-
tration to the personalized template and PCA denoising, similar to Section 3.3.3 but retaining 99%
of the variance. We found this helps remove “sizzling” noise produced by variations in the optical
flow. We also denoise the eye gaze animation using a simple Gaussian filter. More details on this
phase are provided in Section 3.5.
Figure 3.5: (a) Dense base mesh; (b) Proposed detail enhancement; (c) “Dark is deep” detail
enhancement.
3.3.6 Detail Enhancement
Finally we extract texture maps for each frame, and employ the high frequency information to
enhance the surface detail already computed on the dense mesh in Section 3.4.3, in a similar
manner as [30]. We make the additional observation that the sequence of texture maps holds an
additional cue: when wrinkles appear on the face, they tend to make the surface shading darker
relative to the neutral state. To exploit this, we compute the difference between the texture of each
frame and the texture of the personalized template, and then filter it with an orientation-sensitive
filter to remove fine pores but retain wrinkles. We call this the wrinkle map, and we employ it as
a medium-frequency displacement, in addition to the high-frequency displacement obtained from
a high-pass filter of all texture details. We call this scheme “darker is deeper”, as opposed to the
“dark is deep” schemes from the literature. Figure 3.5 shows the dense mesh, enhanced details
captured by our proposed technique, and details using a method similar to [31]. This step, including texture extraction and mesh displacement, takes 10 minutes per frame and is trivially parallelizable.
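The sketch below illustrates the "darker is deeper" idea on grayscale texture maps. It is only a rough stand-in for our implementation: an isotropic difference-of-Gaussians band-pass replaces the orientation-sensitive wrinkle filter, and the blending weights and filter widths are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def darker_is_deeper_displacement(tex, tex_neutral, base_disp,
                                  sigma_pore=1.5, sigma_wrinkle=8.0,
                                  w_mid=0.5, w_high=0.1):
    """tex, tex_neutral: (H, W) luminance textures of the current frame and
    the personalized template; base_disp: (H, W) displacement of the dense mesh.
    Returns a displacement with medium-frequency wrinkle and high-frequency
    pore detail added.  An isotropic band-pass stands in for the
    orientation-sensitive wrinkle filter; all weights are illustrative."""
    diff = tex.astype(np.float64) - tex_neutral.astype(np.float64)
    # Wrinkle map: suppress fine pores but keep wrinkle-scale darkening.
    wrinkle = gaussian_filter(diff, sigma_pore) - gaussian_filter(diff, sigma_wrinkle)
    # High-frequency detail: high-pass of the current texture itself.
    highpass = tex - gaussian_filter(tex.astype(np.float64), sigma_pore)
    # Darkening relative to the neutral pushes the surface inward.
    return base_disp + w_mid * np.minimum(wrinkle, 0.0) + w_high * highpass
```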
3.4 Appearance-Driven Mesh Deformation
We now describe in detail the deformation algorithm mentioned in Sections 3.3.2 and 3.3.4. Suppose we have a known reference mesh with vertices represented as $y_i \in Y$ and a set of photographs $I^Y_j \in \mathcal{I}^Y$ corresponding to the reference mesh along with camera calibrations. Now suppose we also have photographs $I^X_k \in \mathcal{I}^X$ and camera calibrations for some other, unknown mesh with vertices represented as $x_i \in X$. Our goal is to estimate $X$ given $Y$, $\mathcal{I}^Y$, $\mathcal{I}^X$. In other words, we propagate the known reference mesh $Y$ to the unknown configuration $X$ using evidence from the photographs of both. Suppose we have a previous estimate $\hat{X}$ somewhat close to the true $X$. (We explain how to obtain an initial estimate in Section 3.3.1.) We can improve the estimate $\hat{X}$ by first updating each vertex estimate $\hat{x}_i \in \hat{X}$ using optical flow (described in Section 3.4.2), then updating the entire mesh estimate $\hat{X}$ using Laplacian shape regularization with $Y$ as a reference shape (described in Section 3.4.4). Finally we position the eyeballs based on flow vectors and geometric evidence from the eyelid region (described in Section 3.4.5).
3.4.1 Image Warping
Before further discussion, we must address the difficult challenge of computing meaningful optical
flow between pairs of photographs that may differ in viewpoint, in facial expression, in subject
identity, or any combination of the three. We assume high-resolution images and flat illumination,
so different poses of a subject will have generally similar shading and enough fine details for
good registration. Still, if the pose varies significantly or if the subject differs, naive optical flow
estimation will generally fail. For example, Figure 3.6(b, f) shows the result of naively warping one
subject to another using optical flow, which would not be useful for facial correspondence since
the flow mostly fails.
Our solution is to warp the image of one face to resemble the other face before computing opti-
cal flow (and vice-versa to compute optical flow in the other direction). We warp the images based
on the current 3D mesh estimates (first obtained via the initialization in Section 3.3.1.) One might
try rendering a synthetic image in the first camera view using the first mesh and texture sourced
from the second image via the second mesh, to produce a warped version of the second image in a
similar configuration to the first. However this approach would introduce artificial discontinuities
(a) (b) (c) (d)
(e) (f) (g) (h)
Figure 3.6: Optical flow between photographs of different subjects (a) and (e) performs poorly,
producing the warped images (b) and (f). Using 3D mesh estimates (e.g. a template deformed
based on facial landmarks), we compute a smooth vector field to produce the warped images (c)
and (g). Optical flow between the original images (a, e) and the warped images (c, g) produces the
relatively successful final warped images (d) and (h).
wherever the current mesh estimates are not in precise alignment with the photographs, and such
discontinuities would confuse the optical flow algorithm. Thus we instead construct a smooth vec-
tor field to serve as an image-space warp that is free of discontinuities. We compute this vector field
by rasterizing the first mesh into the first camera view, but instead of storing pixel colors we write
the second camera’s projected image plane coordinates (obtained via the second mesh). We skip
pixels that are occluded in either view using a z buffer for each camera, and smoothly interpolate
the missing vectors across the entire frame. We then apply a small Gaussian blur to slightly smooth
any discontinuities, and finally warp the image using the smooth vector field. Examples using our
warping scheme are shown in Figure 3.6(c, g), and the shape is close enough to the true shape to produce a relatively successful optical flow result using the method of [106], shown in Figure 3.6(d, h). After computing flow between the warped image and the target image, we concatenate the vector field warp and the optical flow vector field to produce the complete flow field. (This is implemented simply by warping the vector field using the optical flow field.)

Figure 3.7: (a) A stereo cue. An estimated point $x$ is projected to the 2D point $p_k$ in view $k$. A flow field $F^l_k$ transfers the 2D point to $p_l$ in a second view $l$. The point $x$ is updated by triangulating the rays through $p_k$ and $p_l$. (b) A reference cue. An estimated point $y$ is projected to the 2D point $q_j$ in view $j$. A flow field $G^k_j$ transfers the 2D point to $p_k$ in view $k$ of a different subject or different time. A second flow field $F^l_k$ transfers the 2D point to $p_l$ in view $l$ and then point $x$ is estimated by triangulating the rays through $p_k$ and $p_l$.
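As a concrete illustration of the warping scheme in Section 3.4.1, the sketch below fills in and smooths the image-space vector field and uses it to warp the second image into the first view's configuration. It assumes the rasterization step has already written the second camera's projected coordinates into a per-pixel map (NaN where the mesh gives no value); the blur width and names are illustrative.

```python
import numpy as np
from scipy.interpolate import griddata
from scipy.ndimage import gaussian_filter, map_coordinates

def smooth_warp_field(corr, blur_sigma=5.0):
    """corr: (H, W, 2) map holding, for each pixel of view 1, the (row, col)
    it projects to in view 2 (NaN where the rasterized mesh gives no value,
    e.g. occluded or off-mesh pixels).  Missing vectors are interpolated
    across the whole frame and the field is lightly blurred."""
    H, W, _ = corr.shape
    grid_r, grid_c = np.mgrid[0:H, 0:W]
    valid = ~np.isnan(corr[..., 0])
    pts = np.stack([grid_r[valid], grid_c[valid]], axis=1)
    field = np.empty_like(corr)
    for ch in range(2):
        field[..., ch] = griddata(pts, corr[..., ch][valid],
                                  (grid_r, grid_c), method='nearest')
        field[..., ch] = gaussian_filter(field[..., ch], blur_sigma)
    return field

def warp_image(img2, field):
    """Resample view 2's image at the warped coordinates so it appears in a
    configuration similar to view 1 (per channel, bilinear)."""
    coords = [field[..., 0], field[..., 1]]
    if img2.ndim == 2:
        return map_coordinates(img2, coords, order=1, mode='nearest')
    return np.stack([map_coordinates(img2[..., c], coords, order=1, mode='nearest')
                     for c in range(img2.shape[-1])], axis=-1)
```

Interpolating and blurring the field, rather than compositing textures through the meshes, is what keeps the warp free of the discontinuities that would otherwise confuse the optical flow.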
3.4.2 Optical Flow Based Update
Within the set of images $\mathcal{I}^Y$, $\mathcal{I}^X$, we may find cues about the shape of $X$ and the relationship between $X$ and $Y$. We employ a stereo cue between images from the same time instant, and a reference cue between images from different time instants or different subjects, using optical flow and triangulation (Figure 3.7). First, consider a stereo cue. Given an estimate $p^k_i = P^k(\hat{x}_i) \approx P^k(x_i)$ with $P^k$ representing the projection from world space to the image plane coordinates of view $k$ of $X$, we can employ an optical flow field $F^l_k$ between views $k$ and $l$ of $X$, to estimate $p^l_i = F^l_k(p^k_i) \approx P^l(x_i)$. Defining $D^k(p^k_i) = I - d^k(p^k_i)\,d^k(p^k_i)^\top$ where $d^k(p^k_i)$ is the world space view vector passing through image plane coordinate $p^k_i$ of the camera of view $k$ of $X$, and $c^k$ the center of projection of the lens of the same camera, we may employ $p^k_i$ and $p^l_i$ together to triangulate $\hat{x}_i$ as:

$$\hat{x}_i \leftarrow \operatorname*{argmin}_{x_i} \; \|D^k(p^k_i)(x_i - c^k)\|^2 + \|D^l(p^l_i)(x_i - c^l)\|^2 \qquad (3.1)$$

which may be solved in closed form. Next, consider a reference cue, which is a cue involving the reference mesh $Y$, being either a shared template or another pose of the same subject. Given a known or estimated $q^j_i = Q^j(y_i)$ with $Q^j$ representing the projection from world space to the image plane coordinates of view $j$ of $Y$, we can employ an optical flow field $G^k_j$ between view $j$ of $Y$ and view $k$ of $X$, to estimate $p^k_i = G^k_j(q^j_i) \approx P^k(x_i)$. Next, we use $F^l_k$ to obtain $p^l_i$ from $p^k_i$ as we did for the stereo cue, and triangulate $\hat{x}_i$ as before. However, instead of triangulating all these different cues separately, we combine them into a single triangulation, introducing a scalar field $r^k_j$ representing the optical flow confidence for flow field $G^k_j$, and $s^l_k$ representing the optical flow confidence for flow field $F^l_k$. Including one more parameter $\gamma$ to balance between stereo cues and reference cues, the combined triangulation is:

$$\hat{x}_i \leftarrow \operatorname*{argmin}_{x_i} \; \gamma \sum_{k,l} \big(s^l_k(p^k_i)\big)^2 \Big[ \|D^k(p^k_i)(x_i - c^k)\|^2 + \|D^l(F^l_k(p^k_i))(x_i - c^l)\|^2 \Big] + (1-\gamma) \sum_{j,k,l} r^k_j(q^j_i)\, s^l_k(G^k_j(q^j_i)) \Big[ \|D^k(G^k_j(q^j_i))(x_i - c^k)\|^2 + \|D^l(F^l_k(G^k_j(q^j_i)))(x_i - c^l)\|^2 \Big] \qquad (3.2)$$
This differs from [36] in that reference flows are employed as a lookup into stereo flows instead
of attempting to triangulate pairs of reference flows, and differs from [31] in that the geometry is
not computed beforehand; rather the stereo and consistency are computed together. While (3.2) can
be trivially solved in closed form, the flow field is dependent on the previous estimate $\hat{x}_i$, and hence
we perform several iterations of optical flow updates interleaved with Laplacian regularization for
the entire face. We schedule the parameter g to range from 0 in the first iteration to 1 in the last
iteration, so that the solution respects the reference most at the beginning, and respects stereo most
at the end. We find five iterations to generally be sufficient, and we recompute the optical flow
fields after the second iteration as the mesh will be closer to the true shape, and a better flow may
be obtained.
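To make the closed-form solve of Eq. (3.1) explicit for a single vertex and a single stereo pair, the following sketch forms and solves the 3x3 normal equations; the camera centers, unit ray directions, and function names are illustrative inputs.

```python
import numpy as np

def triangulate_two_rays(c_k, d_k, c_l, d_l):
    """Closed-form minimizer of Eq. (3.1):
       || D_k (x - c_k) ||^2 + || D_l (x - c_l) ||^2,
    where D = I - d d^T for a unit view direction d.  c_k, c_l are the two
    camera centers and d_k, d_l the unit rays through the matched pixels."""
    I = np.eye(3)
    D_k = I - np.outer(d_k, d_k)
    D_l = I - np.outer(d_l, d_l)
    # Normal equations: (D_k + D_l) x = D_k c_k + D_l c_l
    # (D^T D = D because D is a symmetric projector).
    A = D_k + D_l
    b = D_k @ c_k + D_l @ c_l
    return np.linalg.solve(A, b)
```

The same pattern extends to Eq. (3.2) by accumulating the confidence-weighted $D$ matrices and right-hand sides over all cues before performing the single 3x3 solve.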
The optical flow confidence fields $r^k_j$ and $s^l_k$ are vitally important to the success of the method. The optical flow implementation we use provides an estimate of flow confidence [106] based on the optical flow matching term, which we extend in a few ways. First, we compute the flows both ways between each pair of images, and multiply the confidence by an exponentially decaying function of the round-trip distance. Specifically, in $s^l_k(p^k_i)$ we include a factor $\exp\!\big(-\kappa \|p^k_i - F^k_l(F^l_k(p^k_i))\|^2\big)$ (for normalized image coordinates), where $\kappa = 20$ is a parameter controlling round-trip strictness, and analogously for $r^k_j$. Since we utilize both directions of the flow fields anyway, this adds little computational overhead. For stereo flows (i.e. flows between views of the same pose) we include an additional factor penalizing epipolar disagreement, including in $s^l_k(p^k_i)$ the factor $\exp\!\big(-\lambda\, dl^2(c^k, d^k(p^k_i), c^l, d^l(F^l_k(p^k_i)))\big)$, where $dl(o_1, d_1, o_2, d_2)$ is the closest distance between the ray defined by origin $o_1$ and direction $d_1$ and the ray defined by origin $o_2$ and direction $d_2$, and $\lambda = 500$ is a parameter controlling epipolar strictness. Penalizing epipolar disagreement, rather than searching strictly on epipolar lines, allows our method to find correspondences even in the presence of noise in the camera calibrations. In consideration of visibility and occlusion, we employ the current estimate $\hat{X}$ to compute per-vertex visibility in each view of $X$ using a z-buffer and back-face culling on the GPU, and likewise for each view of $Y$. If vertex $i$ is not visible in view $k$, we set $s^l_k(p^k_i)$ to 0. Otherwise, we include a factor of $(n_i \cdot d^k(p^k_i))^2$ to soften the visibility based on the current surface normal estimate $n_i$. We include a similar factor for view $l$ in $s^l_k(p^k_i)$, and for view $j$ of $Y$ and view $k$ of $X$ in $r^k_j$. As an optimization, we omit flow fields altogether if the current estimated head pose relative to the camera differs significantly between the two views to be flowed. We compute the closest rigid transform between the two mesh estimates in their respective camera coordinates, and skip the flow field computation if the relative transform includes a rotation of more than twenty degrees.

Figure 3.8: Laplacian regularization results. Left: surface regularization only. Right: surface and volumetric regularization.
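The following sketch illustrates the round-trip and epipolar confidence factors described above for a single sample; the base confidence from the flow matcher is taken as given, the values $\kappa = 20$ and $\lambda = 500$ follow the text, and everything else is illustrative.

```python
import numpy as np

def ray_ray_distance(o1, d1, o2, d2):
    """Closest distance dl(.) between two rays (origin, unit direction)."""
    n = np.cross(d1, d2)
    n_norm = np.linalg.norm(n)
    if n_norm < 1e-9:                      # near-parallel rays
        return np.linalg.norm(np.cross(o2 - o1, d1))
    return abs(np.dot(o2 - o1, n)) / n_norm

def stereo_confidence(base_conf, p_k, roundtrip_p_k, ray_k, ray_l,
                      c_k, c_l, kappa=20.0, lam=500.0):
    """Extend the matcher's base confidence with the round-trip and epipolar
    factors described above (normalized image coordinates for the round-trip
    term).  ray_k / ray_l are unit directions through p_k and its flowed match."""
    roundtrip = np.exp(-kappa * np.sum((p_k - roundtrip_p_k) ** 2))
    epipolar = np.exp(-lam * ray_ray_distance(c_k, ray_k, c_l, ray_l) ** 2)
    return base_conf * roundtrip * epipolar
```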
3.4.3 Dense Mesh Representation
The optical flow fields used in Section 3.4.2 contain dense information about the facial shape, yet
we compute our solution only on the vertices of an artist-quality mesh. It would be a shame to
waste the unused information in the dense flow fields. Indeed we note that sampling the flow fields
only at the artist mesh vertices in Section 3.4.2 introduces some amount of aliasing, as flow field
values in between vertices are ignored. So we compute an auxiliary dense mesh with 262,144
vertices parameterized on a 512 × 512 vertex grid in UV space. Optical flow updates are applied
to all vertices of the dense mesh as in Section 3.4.2. We then regularize the dense mesh using the
surface Laplacian terms from Section 3.4.4, but omit the volumetric terms as they are prohibitive
for such a dense mesh. The surface Laplacian terms, on the other hand, are easily expressed and
solved on the dense grid parameterization.
This dense mesh provides two benefits. First, it provides an intermediate estimate that is free
of aliasing, which we utilize by looking up the dense vertex position at the same UV coordinate as
each artist mesh vertex. This estimate lacks volumetric regularization, but that will be applied next
in Section 3.4.4. Second, the dense mesh contains surface detail at finer scales than the artist mesh
vertices, and so we employ the dense mesh in Section 3.3.6 as a base for detailed displacement
map estimation.
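The lookup of dense-mesh positions at the artist-mesh UV coordinates amounts to a bilinear sample of the 512 × 512 position grid; a minimal sketch follows, with illustrative names.

```python
import numpy as np

def sample_dense_grid(dense_positions, artist_uvs, res=512):
    """dense_positions: (res, res, 3) vertex positions of the dense mesh laid
    out on a regular UV grid; artist_uvs: (V, 2) UV coordinates in [0, 1] of
    the artist-mesh vertices.  Returns one bilinearly interpolated position
    per artist vertex (the anti-aliased estimate used before the volumetric
    Laplacian solve)."""
    uv = np.clip(artist_uvs, 0.0, 1.0) * (res - 1)
    u0 = np.floor(uv).astype(int)
    u1 = np.minimum(u0 + 1, res - 1)
    f = (uv - u0)[..., None]                       # fractional part, (V, 2, 1)
    p00 = dense_positions[u0[:, 1], u0[:, 0]]
    p10 = dense_positions[u0[:, 1], u1[:, 0]]
    p01 = dense_positions[u1[:, 1], u0[:, 0]]
    p11 = dense_positions[u1[:, 1], u1[:, 0]]
    top = p00 * (1 - f[:, 0]) + p10 * f[:, 0]      # blend along u
    bot = p01 * (1 - f[:, 0]) + p11 * f[:, 0]
    return top * (1 - f[:, 1]) + bot * f[:, 1]     # blend along v
```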
3.4.4 Laplacian Regularization
After the optical flow update, we update the entire face mesh using Laplacian regularization, using
the position estimates from Section 3.4.3 as a target constraint. We use the framework of [107],
wherein we update the mesh estimate as follows:

$$\hat{X} \leftarrow \operatorname*{argmin}_{X} \; \sum_{i \in S} \alpha_i \|x_i - \hat{x}_i\|^2 + \sum_{i \in S} \|\mathcal{L}_S(x_i) - \epsilon_i\|^2 + \beta \sum_{i \in V} \|\mathcal{L}_V(x_i) - \delta_i\|^2 \qquad (3.3)$$

where $\alpha_i = \sigma\big(\gamma \sum_{k,l} \big(s^l_k(p^k_i)\big)^2 + (1-\gamma) \sum_{j,k,l} r^k_j(q^j_i)\, s^l_k(G^k_j(q^j_i))\big)$ is the constraint strength for vertex $i$ derived from the optical flow confidence with $\sigma = 15$ being an overall constraint weight, $S$ is the set of surface vertices, $\mathcal{L}_S$ is the surface Laplace operator, $\epsilon_i$ is the surface Laplacian coordinate of vertex $i$ in the rest pose, $V$ is the set of volume vertices, $\mathcal{L}_V$ is the volume Laplace operator, and $\delta_i$
is the volume Laplacian coordinate of vertex i in the rest pose. In our framework, the rest pose is
the common template in early phases, or the personalized template in later phases. We solve this
sparse linear problem using the sparse normal Cholesky routines implemented in the Ceres solver
[108]. We also estimate local rotation to approximate as-rigid-as-possible deformation [109]. We
locally rotate the Laplacian coordinate frame to fit the current mesh estimate in the neighborhood
of $\hat{x}_i$, and iterate the solve ten times to allow the local rotations to converge. Figure 3.8 illustrates the
effect of including the volumetric Laplacian term. While previous works employ only the surface
Laplacian term [101], we find the volumetric term is vitally important for producing good results
in regions with missing or occluded data such as the back of the head or the interior of the mouth,
which are otherwise prone to exaggerated extrapolation or interpenetration.
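For illustration, the sketch below assembles the surface terms of Eq. (3.3) into a sparse linear least-squares problem and solves it per coordinate; the volumetric term and the local rotation updates of the full method are omitted for brevity, and the Laplacian matrix and names are assumed inputs.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsqr

def laplacian_regularize(targets, alpha, L_surf, eps_rest):
    """One (surface-only) solve in the spirit of Eq. (3.3):
       sum_i alpha_i ||x_i - target_i||^2 + || L_S x - eps ||^2.
    targets: (N, 3) flow-updated vertex estimates; alpha: (N,) confidences;
    L_surf: (N, N) sparse Laplacian of the template surface;
    eps_rest: (N, 3) Laplacian coordinates of the rest pose.  The volumetric
    term of the full method would simply append more rows of the same form."""
    W = sp.diags(np.sqrt(alpha))              # weight the data rows
    A = sp.vstack([W, L_surf]).tocsr()
    X = np.empty_like(targets)
    for c in range(3):                        # solve each coordinate separately
        b = np.concatenate([np.sqrt(alpha) * targets[:, c], eps_rest[:, c]])
        X[:, c] = lsqr(A, b)[0]
    return X
```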
3.4.5 Updating Eyeballs and Eye Socket Interiors
Our template represents eyeballs as separate objects from the face mesh, with their own UV texture
parameterizations. We include the eyeball vertices in the optical flow based update, but not the
Laplacian regularization update. Instead, the eyes are treated as rigid objects, using the closest
rigid transform to the updated positions to place the entire eyeball. This alone does not produce
very good results, as the optical flow in the eye region tends to be noisy, partly due to specular
highlights in the eyes. To mitigate this problem, we do two things. First, we apply a 3 × 3 median filter to the face images in the region of the eye, using the current mesh estimate to rasterize a mask. Second, after each Laplacian regularization step, we consider distance constraints connecting the eye pivot points $e_0$ and $e_1$ to the vertices on the entire outer eyelid surfaces, and additional distance constraints connecting the eye pivot points to the vertices lining the inside of the eye socket. These additional constraints appear as $\sum_{j=0}^{1} \big[ \sum_{i \in E_j \cup O_j} (\|x_i - e_j\| - \|y_i - e^Y_j\|)^2 + \sum_{i \in E_j} \phi(\|x_i - e_j\|, \rho) \big]$, where $E_0$ and $E_1$ are the sets of left and right eyelid vertices, $O_0$ and $O_1$ are the sets of left and right socket vertices, $e^Y_0$ and $e^Y_1$ are the eye pivot positions in the reference mesh, $\phi$ is a distance constraint with a non-penetration barrier defined as $\phi(a, b) = (a - b)^2 \exp(10(b - a))$, and $\rho$ is a hard constraint distance representing the radius of the eyeball plus the minimum allowed thickness of the eyelid. The target distance of each constraint is obtained from the reference mesh. We minimize an energy function including both (3.3) and the eye pivot distance constraints, but update only the eye pivots, interior volume vertices, and vertices lining the eye sockets, leaving the outer
facial surface vertices constant. The distance constraints render this a nonlinear problem, which
we solve using the sparse Levenberg-Marquardt routines implemented in the Ceres solver [108].
After the eye pivots are computed, we compute a rotation that points the pupil towards the centroid
of the iris vertex positions obtained in the optical flow update. Although this scheme does not
personalize the size and shape of the eyeball, there is less variation between individuals in the eye
than in the rest of the face, and we obtain plausible results.
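A minimal sketch of the eyelid and socket distance terms follows, with the barrier $\phi$ exactly as defined above; the grouping of vertices and the names are illustrative, and in the full method these terms are minimized together with Eq. (3.3) by a Levenberg-Marquardt solver.

```python
import numpy as np

def barrier_phi(a, b):
    """phi(a, b) = (a - b)^2 * exp(10 (b - a)): a soft distance target with a
    steep non-penetration penalty once a falls below b."""
    return (a - b) ** 2 * np.exp(10.0 * (b - a))

def eye_pivot_energy(x_eyelid, x_socket, e_pivot,
                     y_eyelid, y_socket, eY_pivot, r_min):
    """Energy for one eye: eyelid and socket vertices keep their rest-pose
    distance to the pivot, and eyelid vertices may not come closer to the
    pivot than r_min (eyeball radius plus minimum lid thickness).  Inputs are
    (K, 3) vertex arrays and (3,) pivot positions; names are illustrative."""
    x_all = np.vstack([x_eyelid, x_socket])
    y_all = np.vstack([y_eyelid, y_socket])
    d_cur = np.linalg.norm(x_all - e_pivot, axis=1)
    d_ref = np.linalg.norm(y_all - eY_pivot, axis=1)
    soft = np.sum((d_cur - d_ref) ** 2)
    d_lid = np.linalg.norm(x_eyelid - e_pivot, axis=1)
    return soft + np.sum(barrier_phi(d_lid, r_min))
```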
3.5 PCA-Based Pose Estimation and Denoising
We next describe the pose estimation and denoising algorithm mentioned in Sections 3.3.3 and
3.3.5. Estimating the rigid transformation for each frame of a performance serves several pur-
poses. First, it is useful for representing the animation results in standard animation packages.
Second, it allows consistent global 3D localization of the eyeballs, which should not move with
respect to the skull. Third, it enables PCA-based mesh denoising techniques that reduce high-
frequency temporal noise and improve consistency of occluded or unseen regions that are inferred
from the Laplacian regularization. Section 3.5.1 describes a novel rotation alignment algorithm for
deformable meshes which does not require knowledge of the deformation basis. Section 3.5.2 de-
scribes a novel translation alignment algorithm that simultaneously estimates per-mesh translation
and globally consistent eyeball pivot placement. Section 3.5.3 then describes a straightforward
PCA dimension reduction scheme.
We compare our rigid alignment technique with Procrustes rigid alignment [110] as a baseline. Figure 3.9 shows several frames from the evaluation of our technique on a sequence with significant
head motion and extreme facial expressions including wide open mouth. The baseline method
exhibits significant misalignment especially under wide open mouth expressions, which becomes
more apparent when globally consistent eye pivots are included. Our proposed technique stabi-
lizes the rigid head motion well, enabling globally consistent eye pivots to be employed without
interpenetration.
3.5.1 Rotation Alignment
Suppose we have a set of meshes $X_t$, $t = 1 \ldots M$, each with $N$ vertices. We can interleave the x, y, and z coordinates of the mesh vertices so that each $X_t$ corresponds to a $3N$ dimensional vector, and then we may stack the vectors as columns in a single $3N \times M$ matrix $X$. Assuming low dimensionality, we can suppose that in the absence of rigid transformations, $X = BW$, where $B$ is a $3N \times K$ basis matrix with $K \ll M$, and $W$ is a $K \times M$ weight matrix with columns $W_t$ corresponding to the basis activations of each mesh $X_t$. (We do not separately add the mean mesh in the deformation model, so it will be included as the first column of $B$.) The trouble is that the meshes $X_t$ may each have a different rigid transform in addition to the deformation, so that $X_t = R_t B W_t + T_t$ for some unknown rotation $R_t$ and translation $T_t$, rendering $B$ and $W_t$ difficult to discover. We can ignore translation by analyzing the set of mesh edges $\tilde{X}_t$ rather than the vertices $X_t$, defining $\tilde{B}$ appropriately so that $\tilde{X}_t = R_t \tilde{B} W_t$, with the stacked $\tilde{X}_t$ denoted as $\tilde{X}$. However, the rotations will still obfuscate the solution.

To solve this, we exploit the fact that for small rotations of magnitude $O(\epsilon)$, composing rotations is equal to summing rotations up to an $O(\epsilon^2)$ error. Thus, we roughly align the meshes using Procrustes method, allowing the relative rotations to be linearized within some common neighborhood. We refer to the mean of the rotation neighborhood as $R_0$, and hence $R_t = (I + r^t_x G_x + r^t_y G_y + r^t_z G_z + O(\|r_t\|^2)) R_0$, where $r_t$ is the relative rotation between $R_0$ and $R_t$ in exponential map (Rodrigues) notation, and $G_x$, $G_y$, $G_z$ are the generator functions for the $SO(3)$ matrix Lie group:

$$G_x = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & -1 \\ 0 & 1 & 0 \end{bmatrix}, \quad G_y = \begin{bmatrix} 0 & 0 & 1 \\ 0 & 0 & 0 \\ -1 & 0 & 0 \end{bmatrix}, \quad G_z = \begin{bmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}. \qquad (3.4)$$

Defining $\mathbf{G}_x$ as a block diagonal matrix with $G_x$ repeated along the diagonal (and likewise for y, z), we construct the following $3N \times 4M$ basis to span the set of meshes (and hence deformation) as well as the local rotation neighborhood, within $O(\|r_t\|^2)$: $\hat{X} = \begin{bmatrix} \tilde{X} & \mathbf{G}_x \tilde{X} & \mathbf{G}_y \tilde{X} & \mathbf{G}_z \tilde{X} \end{bmatrix}$.
Figure 3.9: Comparison of rigid mesh alignment techniques on a sequence with significant head
motion and extreme expressions. Top: center view. Middle: Procrustes alignment. Bottom: our
proposed method. The green horizontal strike-through lines indicate the vertical position of the
globally consistent eyeball pivots.
Because the last three block columns of this matrix represent small rotation differentials, and composition of small rotations is linear, they will lie in the same subspace as the rotational components of the first block column. Hence, we will be able to tease out which principal components span rotation and deformation, and which span deformation only. Performing column-wise principal component analysis on this matrix without mean subtraction separates deformation and rotation bases, with $\hat{X} = \hat{B}\hat{W}$. There will be at most $M$ deformation bases in $\hat{B}$ and three times as many rotation bases, so we can assume that 3 out of 4 bases are rotational but need to identify which ones. To do this, we score each basis with a data weight and a rotation weight. The data weight for the basis at column $b$ in $\hat{B}$ is the sum of the squares of the coefficients in the first $M$ columns of row $b$ of $\hat{W}$, and the rotation weight is the sum of the squares of the coefficients in the last $3M$ columns of row $b$ of $\hat{W}$. The rotational score of column $b$ is then the rotation weight divided by the sum of the data weight and rotation weight. We assume the $3M$ columns with the greatest such score represent rotations of meshes and rotations of deformation bases. The remaining $M$ columns form a basis with deformation only (up to $O(\|r_t\|^2)$), which we call the rotation suppressed basis. We may then project the data $\tilde{X}$ onto the rotation-suppressed basis to suppress rotation. This basis may itself contain a global rotation, so we compute another rigid alignment between the mean resulting mesh and (the edges of) our template mesh, and apply this rotation to all $\tilde{X}_t$. We iterate the entire procedure starting from the construction of $\hat{X}$ until convergence, which we usually observed in 10 to 20 iterations. Finally, we compute the closest rigid rotation between the original $\tilde{X}_t$ and the final rotation suppressed $\tilde{X}_t$ to discover $R_t$.
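The construction of the augmented matrix and the rotational scoring can be written compactly; the sketch below is a simplified single pass (without the iteration or re-alignment to the template), using the standard SO(3) generators and illustrative names.

```python
import numpy as np

G = np.array([[[0, 0, 0], [0, 0, -1], [0, 1, 0]],    # G_x
              [[0, 0, 1], [0, 0, 0], [-1, 0, 0]],    # G_y
              [[0, -1, 0], [1, 0, 0], [0, 0, 0]]])   # G_z

def rotation_suppressed_projection(X_edges):
    """X_edges: (3E, M) matrix of stacked mesh edge vectors (one column per
    roughly Procrustes-aligned mesh).  Builds the (3E, 4M) augmented matrix
    [X, Gx X, Gy X, Gz X], performs PCA without mean subtraction, keeps the
    M columns with the lowest rotational score, and projects the data."""
    E3, M = X_edges.shape
    edges = X_edges.reshape(-1, 3, M)                      # (E, 3, M)
    blocks = [X_edges] + [np.einsum('ab,ebm->eam', g, edges).reshape(E3, M)
                          for g in G]
    X_aug = np.hstack(blocks)                              # (3E, 4M)
    U, s, Vt = np.linalg.svd(X_aug, full_matrices=False)   # PCA, no mean removal
    W_hat = s[:, None] * Vt                                 # basis coefficients
    data_w = np.sum(W_hat[:, :M] ** 2, axis=1)
    rot_w = np.sum(W_hat[:, M:] ** 2, axis=1)
    score = rot_w / (data_w + rot_w)
    keep = np.argsort(score)[:M]                            # lowest rotational score
    B_supp = U[:, keep]                                     # rotation-suppressed basis
    return B_supp @ (B_supp.T @ X_edges)                    # projected edge data
```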
3.5.2 Translation Alignment
After $R_t$ and $W_t$ are computed using the method in Section 3.5.1, there remains a translation ambiguity in obtaining $B$. We define the rotationally aligned mesh $A_t = R_t^\top X_t$, and we define the aligned-space translation $\tau_t = R_t^\top T_t$, thus $A_t = B W_t + \tau_t$. We compute the mean of $A_t$ and compute and apply the closest rigid translation to align it with the template mesh, denoting the result $\bar{A}_t$. We then wish to discover $\tau_t$, $t = 1 \ldots M$ such that each $A_t - \tau_t$ is well aligned to $\bar{A}_t$. Since our model has eyes, we also wish to discover globally consistent eye pivot points in an aligned head pose space, which we call $\bar{e}_0$ and $\bar{e}_1$, and allow the eyes to move around slightly relative to the facial surface, while being constrained to the eyelid vertices using the same distance constraints as in Section 3.4.5, in order to achieve globally consistent pivot locations. We find $\bar{e}_0$, $\bar{e}_1$, and $\tau_t$, $t = 1 \ldots M$ minimizing the following energy function using the Ceres solver [108]:

$$\sum_{t=1}^{M} \Big[ \sum_{i \in S} \psi(\|a^t_i - \tau_t - \bar{a}^t_i\|) + \sum_{j=0}^{1} \sum_{i \in E_j} \phi(\|a^t_i - \tau_t - \bar{e}_j\|, \|y_i - e^Y_j\|) \Big], \qquad (3.5)$$

where $a^t_i$ is a vertex in mesh $A_t$ (and $\bar{a}^t_i$ in $\bar{A}_t$), and $\psi$ is the Tukey biweight loss function tuned to ignore cumulative error past 1 cm. With $\tau_t$ computed, we let $T_t = R_t \tau_t$.
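The per-frame translation solve can be sketched with the Tukey biweight loss written out explicitly (scipy's built-in robust loss options do not include it); only the surface term of Eq. (3.5) is shown, the 1 cm cutoff follows the text, and the rest is illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def tukey_rho(r, c=1.0):
    """Tukey biweight loss (units of cm here): quadratic near zero and flat
    beyond c, so residuals past about 1 cm stop accumulating error."""
    r = np.minimum(np.abs(r), c)
    return (c**2 / 6.0) * (1.0 - (1.0 - (r / c) ** 2) ** 3)

def solve_frame_translation(A_t, A_bar, c=1.0):
    """Estimate the aligned-space translation tau_t for one frame by
    minimizing the surface term of Eq. (3.5); the eye-pivot terms would be
    appended analogously.  A_t, A_bar: (N, 3) aligned vertices and their
    references."""
    def energy(tau):
        res = np.linalg.norm(A_t - tau - A_bar, axis=1)
        return np.sum(tukey_rho(res, c))
    return minimize(energy, np.zeros(3), method='Nelder-Mead').x
```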
3.5.3 Dimension Reduction
With $R_t$ and $T_t$ computed for all meshes in Sections 3.5.1 and 3.5.2, we may remove the relative rigid transforms from all meshes to place them into an aligned pose space. We perform a weighted principal component analysis, with vertices weighted by the mean of the confidence $\alpha_i$ (see Section
3.4.4), producing the basis B and weight matrix W. We truncate the basis to reduce noise and
inconsistencies across poses in areas of ambiguous matching to the shared template, and in areas
of insufficient data that are essentially inferred by the Laplacian prior, such as the back and top of
the head.
3.6 Results
We demonstrate our method for dynamic facial reconstruction with five subjects: three male and two female. The first four subjects were recorded in an LED sphere under flat-lit static blue lighting.
The blue light gives us excellent texture cues for optical flow. We used 12 Ximea monochrome
2048x2048 machine vision cameras as seen in Figure 3.10. We synchronized the LEDs with
cameras at 72Hz, and only exposed the camera shutter for 2ms to eliminate motion blur as much
Figure 3.10: 12 views of one frame of a performance sequence. Using flat blue lighting provides
sharp imagery.
as possible. Pulsing the LEDs for a shorter period of time also reduces the perceived brightness to the subject, which is more suitable for recording natural facial performances. Though we captured at 72Hz, we only processed 24Hz in the results to reduce computation time.
Figure 3.11 shows intermediate results from each reconstruction step described in Section 3.3.2
through Section 3.3.4. This is a particularly challenging case as the initial facial landmark detector
matches large-scale facial proportions but struggles in the presence of facial hair that partially
occludes the lips and teeth. Artifacts remain even after dense optical flow in Section 3.3.2 (top).
PCA based denoising (Section 3.3.3) and fine-scale consistent mesh propagation (Section 3.3.4)
fill in more accurate mouth and eye contours that agree with the inset photograph.
An alternative approach would be to initialize image-based mesh warping with a morphable
model [20] in place of the Digital Emily template (Figure 3.12). To perform this comparison,
we fit a morphable model to multi-view imagery of a male subject (Figure 3.13 (a)) and filled in the rest of the head topology using Laplacian mesh deformation (Figure 3.12 (b)). We
Figure 3.11: Zoomed renderings of different facial regions of three subjects. Top: results after
coarse-scale fitting using the shared template (Section 3.3.2). The landmarks incorrectly located
the eyes or mouth in some frames. Middle: results after pose estimation and denoising (Section
3.3.3). Bottom: results after fine-scale consistent mesh propagation (Section 3.3.4) showing the
recovery of correct shapes.
(a) (b) (c) (d)
Figure 3.12: Comparison using a morphable model of [20] as an initial template. (a) Front face region captured by the previous technique, (b) stitched onto our full head topology. (c) Resulting geometry from Section 3.3.2 deformed using our method with (b) as a template, compared to (d) the result of using the Digital Emily template. The linear morphable model misses details in the nasolabial fold.
then generated synthetic renderings of the morphable model using the estimated morphable albedo and inpainted textures. Since our method employs stereo from multi-view data beyond the initial landmark detection, our technique reconstructs more geometric details, such as the nasolabial fold in Figure 3.12 (c), which are not captured well by the linear deformation model in Figure 3.12 (a). Figure 3.13 (b) illustrates the final image warped to match (e) in the same camera, compared to the warped image (d) starting from the Emily model (c) in a similar camera view.
Our final geometry closely matches fine detail in the original photographs. Figure 3.14 shows
directly deformed artist quality topology (a) and a captured detail layer (b) in a calibrated camera
view. Overlaying onto the calibrated camera view shows good agreement between the geometry, details such
as forehead wrinkles, and the input photograph.
Figure 3.15 shows several frames from a highly expressive facial performance reconstructed on
a high-quality template mesh using our pipeline. The reconstructed meshes shown in wire-frame
with and without texture mapping indicate good agreement with the actual performance as well
as texture consistency across frames. The detail enhancement from Section 3.3.6 produces high-
resolution dynamic details such as pores, forehead wrinkles, and crow's feet, adding greater fidelity
to the geometry.
(a) (b) (c) (d) (e)
Figure 3.13: (a) Synthetic rendering of a morphable model using Figure 3.12. (b) Result using our
image warping method to warp (a) to match real photograph (e). Similarly the common template
image (c) is warped to match (e), producing plausible coarse-scale facial feature matching in (d).
We directly compare our method with [31] based on publicly available video datasets as shown
in Figure 3.16. Our method is able to recover significantly greater skin detail and realistic facial
features particularly around the mouth, eyes, and nose, as well as completing the full head. Our
system also does not rely on temporal flow, making it easier to parallelize each frame independently
for faster processing times. Figure 3.17 also illustrates the accuracy of our method compared to the
multi-view reconstruction method of [29]. Though we never explicitly compute a point cloud or
depth map, the optical flow computation is closely related to stereo correspondence and our result
is a very close match to the multi-view stereo result (as indicated by the speckle pattern apparent
when overlaying the two meshes with different colors). Unlike [29] our technique naturally fills in
occluded or missing regions such as the back of the head and provides consistent topology across
subjects and dynamic sequences. We can reconstruct a single static frame as shown in Figure 3.4
or an entire consistent sequence.
Since our method reconstructs the shape and deformation on a consistent UV space and topol-
ogy, we can transfer attributes such as appearance or deformation between subjects. Figure 3.18
shows morphing between facial performances of three subjects, with smooth transition from one
subject to the next. Unlike previous performance transfer techniques, the recovered topology is
inherent to the reconstruction and does not require any post processing.
(a) (b)
Figure 3.14: (a) High quality mesh deformed using our technique; (b) high-resolution displacement
details. Half-face visualization indicates good agreement and topology flow around geometric
details within the facial expression.
3.7 Limitations
While our technique yields a robust system and provides several benefits compared to existing
techniques, it has several limitations. Initial landmark detection may incorrectly locate a landmark,
for example sometimes facial hair will be interpreted as a mouth. Improvements in landmark
detection would help here. The coarse-scale template alignment fails in some areas when the
appearance of the subject and the template differ significantly, which can happen in the presence
of facial hair or when the tongue and teeth become visible as they are not part of the template (see
Figure 3.19). While these errors are often mitigated by our denoising technique, in the future it
would be of interest to improve tracking in such regions by providing additional semantics such as
more detailed facial feature segmentation and classification, or by combining tracking from more
than one template to cover a larger appearance space.
Our facial surface details come from dynamic high-frequency appearance changes in the flat-
lit video, but as with other passive illumination techniques, they miss some of the facial texture
Figure 3.15: Dynamic face reconstruction from multi-view dataset of a male subject shown from
one of the calibrated cameras (top). Wireframe rendering (second) and per frame texture rendering
(third) from the same camera. Enhanced details captured with our technique (bottom) shows high
quality agreement in the fine-scale details in the photograph.
(a) (b) (c)
Figure 3.16: Reconstructed mesh (a), enhanced displacement details with our technique (b), and
comparison to previous work. Our method automatically captures whole head topology including
nostrils, back of the head, mouth interior, and eyes, as well as skin details.
(a) (b) (c)
Figure 3.17: Comparison of our result (a) to PMVS2 [29] (b). Overlaying the meshes in (c)
indicates a good geometric match.
Figure 3.18: Since our method reconstructs the face on a common head topology with coarse-scale
feature consistency across subjects, blending between different facial performances is easy. Here
we transition between facial performances from three different subjects.
(a) (b) (c) (d)
Figure 3.19: While our system is able to reconstruct most surface facial features, it struggles to reconstruct features that are not represented by the template, such as the tongue and teeth.
realism obtainable with active photometric stereo processes such as in [55]. If such an active-
illumination scan of the subject could be used as the template mesh, our technique could propagate
its high-frequency details to the entire performance, and dynamic skin microgeometry could be
simulated as in [60]. Furthermore, it would be of interest to allow the animator to conveniently
modify the captured performances; this could be facilitated by identifying sparse localized defor-
mation components as in [111] or performance morphing techniques as in [112].
3.8 Discussion
We have presented an entirely automatic method to accurately track facial performance geometry
from multi-view video, producing consistent results on an artist-friendly mesh for multiple subjects
from a single template. Unlike previous works that employ temporal optical flow, our approach
simultaneously optimizes stereo and consistency objectives independently for each instant in time.
We demonstrated an appearance-driven mesh deformation algorithm that leverages landmark de-
tection and optical flow techniques, which produces coarse-scale facial feature consistency across
subjects and fine-scale consistency across frames of the same subject. We also demonstrated a
displacement map estimation scheme that compares the UV-space texture of each frame against an
automatically selected neutral frame to produce stronger displacements in dynamic facial wrin-
kles. Our method operates solely in the desired artist mesh domain and does not rely on complex
facial rigs or morphable models. While performance retargeting is beyond the scope of this work,
performances captured using our proposed pipeline could be employed as high-quality inputs into
retargeting systems such as [12]. To our knowledge, this is the first method to produce facial
performance capture results with detail on par with multi-view stereo and pore-level consistent pa-
rameterization without temporal optical flow, and could lead to interesting applications in building
databases of morphable characters and simpler facial performance capture pipelines.
Chapter 4
Mesoscopic Facial Geometry Inference Using Deep Neural
Networks
4.1 Introduction
There is a growing demand for realistic, animated human avatars for interactive digital communi-
cation in augmented and virtual reality, but most real-time computer-generated humans continue
to be simplistic and stylized or require a great deal of effort to construct. An important part of cre-
ating a realistic, relatable human avatar is skin detail, from the dynamic wrinkles that form around
the eyes, mouth, and forehead that help express emotion, to the fine-scale texture of fine creases
and pores that make the skin surface look like that of a real human. Constructing such details on
a digital character can take weeks of effort by digital artists, and often employs specialized and
expensive 3D scanning equipment to measure skin details from real people. And the problem is
made much more complicated by the fact that such skin details are dynamic: wrinkles form and
disappear, skin pores stretch and shear, and every change provides a cue to the avatar’s expression
and their realism.
Scanning the overall shape of a face to an accuracy of a millimeter or two has been possible
since the 1980’s using commercial laser scanners such as a Cyberware system. In recent years,
advances in multiview stereo algorithms such as [29, 32] have enabled facial scanning using passive
multiview stereo which can be done with an ordinary setup of consumer digital cameras. However,
recording submillimeter detail at the level of skin pores and fine creases necessary for photorealism
remains a challenge. Some of today’s best results are obtained in a professional studio capture
setup with specialized lighting patterns, such as the polarized gradient photometric stereo process
of [53, 55]. Other techniques [54, 61, 66] use high-resolution measurements or statistics of a few
skin patches and perform texture synthesis over the rest of the face to imagine what the high-
resolution surface detail might be like. Other work uses a heuristic ”dark is deep” shape-from-
shading approach [30, 44] to infer geometric surface detail from diffuse texture maps, but can
confuse surface albedo variation with geometric structure.
In this work, we endeavor to efficiently reconstruct dynamic medium- and fine-scale geomet-
ric facial detail for static facial scans and dynamic facial performances across a wide range of
expressions, ages, gender, and skin types without requiring specialized capture hardware. To do
this, we propose the first deep learning-based approach to infer temporally coherent high-fidelity
facial geometry down to the level of skin pore detail directly from a sequence of diffuse texture
maps. To learn this mapping, we leverage a database of facial scans recorded with a state-of-the-
art active illumination facial scanning system which includes pairs of diffusely-lit facial texture
maps and high-resolution skin displacement maps. We then train a convolutional neural network
to infer high-resolution displacement maps from the diffuse texture maps, the latter of which can
be recorded much more easily with a passive multiview stereo setup. Our hybrid network fuses
two components: 1) an image-to-image translation net that translates input texture map to dis-
placement map, and 2) a super-resolution net that generates the high-resolution output given the
outcome of the preceding network. Our preliminary experiments demonstrate that medium-scale
and pore-level geometries are encoded in different dynamic ranges. Therefore, we introduce two
sub-networks in the image translation net to decouple the learning of middle and high-frequency
details. Experimental results indicate our architecture is capable of inferring a full range of detailed
geometries with quality that is on par with state-of-the-art facial scanning data.
Compared with conventional methods, our proposed approach provides much faster reconstruc-
tion of fine-scale facial geometry thanks to the deep learning framework. In addition, our model is
Figure 4.1: System pipeline. From the multi-view captured images, we calculate the texture map
and base mesh. The texture (1K resolution) is first fed into our trained Texture2Disp network to
produce 1K-high and 1K-middle frequency displacement maps, followed by up-sampling them to
4K resolution using our trained SuperRes Network and bicubic interpolation, respectively. The
combined 4K displacement map can be embossed to the base mesh to produce the final high de-
tailed mesh.
free from certain artifacts which can be introduced using a “dark is deep” prior to infer geometric
facial detail. Since our model is trained with high-resolution surface measurements from the active
illumination scanning system, the network learns the relationship between facial texture maps and
geometric detail which is not a simple function of local surface color variation.
4.2 Overview
Figure 4.1 illustrates the pipeline of our system. In the pre-processing, we first reconstruct a
base face mesh and a 1K-resolution UV texture map from input multi-view images of a variety
of subjects and expressions by fitting a template model with consistent topology using the state
of the art dynamic facial reconstruction [44]. The texture map is captured under a uniformly lit
environment to mimic the natural lighting. Our learning framework takes a texture map as an
input and generates a high-quality 4K-resolution displacement map that encodes a full range of
geometric details. In particular, it consists of two major components: a two-level image-to-image
translation network that synthesizes 1K resolution medium and high-frequency displacement maps
from the input facial textures, and a patch-based super resolution network that enhances the high-
frequency displacement map to 4K resolution, introducing sub-pore level details. The medium
frequency displacement map is upsampled using a naive bicubic upsampling, which turns out to
be sufficient in our experiments. The final displacement is obtained by combining individually
inferred medium and high frequency displacement maps. Finally, the inferred displacement is
applied on the given base mesh to get the final geometry with fine-scale details.
4.3 Geometry Detail Separation
A key to the success of our method is carefully processed geometric data and its representation.
While recent research has directly trained neural networks with 3D vertex positions [113], these
approaches can be memory intensive. While unstructured representation is suitable for general ob-
jects, it can be suboptimal for human faces, which share many common parts. In this work, we encode our facial mesoscopic geometry details in a high resolution displacement map parameterized in a common 2D texture space. The main advantages of using such a representation are two-fold. First, a displacement map is a commonly used [54, 60, 63] and more lightweight representation than
full 3D coordinates, requiring only a single channel to encode surface details. In particular, human
faces deform to develop similar skin texture patterns across different individuals. Graham et al.
[54] showed that cross-subject transfer of high frequency geometry details is possible among sim-
ilar ages and genders. With the displacement data parameterized in a common UV space encoding
the same facial regions of different individuals, this helps the network to encapsulate the geomet-
ric characteristics of each facial region from a limited number of facial scans. While our method
assumes fixed topology, existing methods can also be used to convert between different UV co-
ordinate systems. Secondly, from a learning point of view, 2D geometric representation can take
advantage of recent advances in convolutional neural networks that could serve for our purpose.
Our displacement data encodes high resolution geometry details that are beyond the resolution
of a few tens of thousands of vertices in our base mesh.

Figure 4.2: The histogram of the medium and high frequency pixel count shows that the majority of the high frequency details lie within a very narrow band of displacement values compared to the medium frequency values spreading over a broader dynamic range.

Thus it contains features ranging from relatively large, tens-of-millimeter forehead wrinkles down to submillimeter fine details. Figure 4.2 shows the histogram of the
displacement pixel count shown in a log scale with respect to the value of the displacement. As
shown here, there is a spike in the histogram at very small displacement values, implying that
there are distinctive geometry features at different scales. Special care must be taken to properly
learn the multi-scale nature of facial geometry details. Our experiments show that, if we naively train our texture-to-displacement network using the unfiltered displacement, the medium-scale geometry dominates the dynamic range of pixel values and leaves the high frequency details trivial,
making the network unable to learn high frequency geometric details. Inspired by previous work
[63], we factor the displacement into the medium and high frequency components, and learn them individually via two subnetworks. In particular, during training, we decouple the ground truth displacement map $D$ into two component maps, $D_L$ and $D_H$, which capture the medium and high frequency geometry respectively. The resulting data is fed into the corresponding subnetworks of the image-to-image translation network.

(a) no separation, 1K (b) separation, 1K (c) separation, 4K
Figure 4.3: By separating the displacement map into high and middle frequencies, the network could learn both the dominant structure and subtle details (a and b), and the details could be further enhanced via super-resolution (b and c).
We show in Figure 4.3 that the geometry detail separation is the key to achieving faithful reconstructions that capture multi-scale facial details (Figure 4.3b), while a naive approach without decoupling tends to lose all high frequency details and introduces artifacts (Figure 4.3a).
4.4 Texture-to-Displacement Network
Human skin texture provides a great deal of perceptual depth information through skin shading,
and previous work has leveraged the apparent surface shading to reveal the underlying surface
geometry using a variety of models and heuristics. Inferring faithful surface details is non-trivial due to the complex light transport in human skin and the non-linear nature of skin defor-
mation. To mitigate some of these challenges, we employ uniformly lit texture as an input to our
system. Since we employ the input texture and the displacement maps registered at a common
pixel coordinate, our inference problem can be posed as image-space translation problem. In this
paper, we propose to directly learn this image-to-image translation by leveraging the pairs of input
texture maps and corresponding geometry encoded in the displacement map. To our knowledge,
we are the first to solve pore-level geometry inference from an input texture as image-to-image
translation problem.
We adopt the state-of-the-art image-to-image translation network using a conditional GAN and
U-net architecture with skip connections [69]. The advantage of the proposed network is three-
fold. First, the adversarial training facilitates the learning of the input modality manifold and
produces sharp results, which is essential for high frequency geometry inference. On the other
hand, naive pixel-wise reconstruction loss in L2 or L1 norm often generates a blurry output, as
demonstrated in [69]. Furthermore, a patch-based discriminator, which makes real/fake decision
in local patches using a fully convolutional network, captures local structures in each receptive
field. As the discriminator in each patch shares the weights, the network can effectively learn
variations of skin details even when a large amount of data is not available. Last but not least, the
U-net architecture with skip connections utilizes local details and global features to make inference
[Ronneberger et al. 2015]. Combining local feature analysis and global reasoning greatly improves reconstruction fidelity, especially when the underlying skin albedo makes the translation ambiguous (e.g.
skin pigmentation, moles). Our texture-to-displacement network consists of two branches, each
fulfilled by the image-to-image translation network. The two subnetworks infer the medium and
high frequency displacement maps from the same input texture, respectively.
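To make the two-branch design concrete, the sketch below pairs two drastically simplified U-Net-style generators, one per frequency band, operating on the same input texture. It is a structural illustration only: the real generators are deeper, are trained adversarially with a PatchGAN discriminator as in [69], and their exact layer configuration is not reproduced here.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """A drastically simplified stand-in for a pix2pix-style U-Net generator:
    one downsampling pair, one upsampling pair, and a single skip connection,
    mapping a 3-channel texture to a 1-channel displacement map."""
    def __init__(self, ch=32):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(3, ch, 4, 2, 1), nn.LeakyReLU(0.2))
        self.down2 = nn.Sequential(nn.Conv2d(ch, ch * 2, 4, 2, 1),
                                   nn.BatchNorm2d(ch * 2), nn.LeakyReLU(0.2))
        self.up1 = nn.Sequential(nn.ConvTranspose2d(ch * 2, ch, 4, 2, 1),
                                 nn.BatchNorm2d(ch), nn.ReLU())
        self.up2 = nn.ConvTranspose2d(ch * 2, 1, 4, 2, 1)   # takes the skip concat

    def forward(self, x):
        d1 = self.down1(x)                       # skip connection source
        d2 = self.down2(d1)
        u1 = self.up1(d2)
        return torch.tanh(self.up2(torch.cat([u1, d1], dim=1)))

class TextureToDisplacement(nn.Module):
    """Two independent branches infer the medium- and high-frequency
    displacement maps from the same texture, mirroring the decoupled
    training described above."""
    def __init__(self):
        super().__init__()
        self.medium = TinyUNet()
        self.high = TinyUNet()

    def forward(self, texture):
        return self.medium(texture), self.high(texture)

# usage sketch (untrained weights, 1K input texture):
# net = TextureToDisplacement()
# d_med, d_high = net(torch.rand(1, 3, 1024, 1024))
```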
4.5 Super-Resolution Network
The effective texture resolution is determined by the ratio of the target face size and the final target
resolution (in our work, submillimeter details). In our setting, we find that a displacement map no smaller than 4K resolution is detailed enough to resolve the pore-level geometry needed to produce photorealistic renderings. However, in practice applying an image-to-image translation network to a texture of more than 1K resolution is computationally demanding and can be beyond the capacity of modern GPU hardware. To overcome this resolution limitation, we
propose to further upsample the resulting displacement map using a patch-based super-resolution
network. We build our super-resolution network based upon the state-of-the-art super resolution
network using sub-pixel convolution [114]. During the training, we downsample the 4K ground-
truth displacement maps $\{D_{hr}\}$ to obtain the corresponding 1K-resolution training set $\{D_{lr}\}$. We then randomly pick pairs of a 64 × 64 patch from $D_{lr}$ and its corresponding 256 × 256 patch from $D_{hr}$, which are fed into the network for training. At test time, we first divide the input image into a regular grid, with each block forming a 64 × 64 patch image. We then upsample each patch to 256 × 256 resolution using the super-resolution network. Finally, to ensure consistency between patch boundaries, we stitch the resulting upsampled patches using image quilting [115] to produce a 4K displacement map.
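A minimal sub-pixel convolution upsampler in the spirit of [114] is sketched below; it maps a 64 × 64 displacement patch to 256 × 256. The layer widths are illustrative, and the seam handling (image quilting [115]) is left out.

```python
import torch
import torch.nn as nn

class PatchSuperRes(nn.Module):
    """Minimal ESPCN-style sub-pixel convolution network: upsamples a
    single-channel 64x64 displacement patch by a factor of 4."""
    def __init__(self, scale=4, ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, ch, 5, padding=2), nn.ReLU(),
            nn.Conv2d(ch, ch // 2, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch // 2, scale * scale, 3, padding=1))
        self.shuffle = nn.PixelShuffle(scale)    # rearranges channels into pixels

    def forward(self, x):                        # x: (B, 1, 64, 64)
        return self.shuffle(self.body(x))        # -> (B, 1, 256, 256)

# usage sketch: upsample every 64x64 grid patch, then blend the seams.
# sr = PatchSuperRes()
# hr_patch = sr(torch.rand(1, 1, 64, 64))
```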
4.6 Implementation Details
It is important that the training data covers a wide range of ethnicities, ages, genders, and skin tones. We collected 328 corresponded Light Stage facial scans [55] as photometric stereo ground truth to
train the network. This includes 19 unique subjects, between the ages of 11 and 56, with multiple
expressions per subject to capture wrinkle and pore dynamics. 6 additional subjects were used to
test system performance. For each collected displacement, we apply a Gaussian filter to remove all
high frequency data, obtaining the medium frequency displacement map $D_L$. The high-frequency displacement map can be calculated by subtraction: $D_H = D - D_L$. Given the histogram of the displacement maps, we iteratively optimize the filter size of the Gaussian filter so that $D_H$ covers only high frequency data. We find that a filter size of 29 at 4K resolution gives the best results for most examples. We apply a 64× scale to the high frequency values to distribute the values well over the limited pixel intensity range to facilitate convergence during learning. For the medium frequency data, which usually exhibits a higher displacement range, we apply a sigmoid function so that all the values fit well into the pixel range without clipping. This step takes less than a second for a 1K displacement map.

(a) Input texture (b) Base mesh (c) Med-frequency (d) 1K multi-scale (e) 4K multi-scale
Figure 4.4: Synthesis results given different input textures with variations in subject identity and expression. From (a) to (e), we show the input texture, base mesh, the output geometry with the medium, 1K multi-scale (both medium and high frequency) and 4K multi-scale frequency displacement map. The closeup is shown under each result.
We train our network with pairs of texture and displacement maps at 1K resolution. The train-
ing time on a single NVidia Titan X GPU with 12GB memory is around 8 hours. For the super
resolution network, we feed in displacement maps at 4K resolution for training. It takes less than 2
hours to train with the same GPU. At test time, it takes one second to get both 1K resolution dis-
placement maps from a 1K input texture map. Then these maps are up-sampled using our super
resolution network. We get the final 4K displacement map after 5 seconds.
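The frequency-separation preprocessing described above can be sketched as follows. The filter size of 29 at 4K resolution and the 64× high-frequency scale come from the text; mapping the filter size to a Gaussian sigma of filter_size/6 and the particular sigmoid are assumptions of this example.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def separate_displacement(disp_4k, filter_size=29, high_scale=64.0):
    """Split a 4K displacement map into the medium- and high-frequency
    training targets.  The sigma derived from the filter size and the exact
    sigmoid are illustrative choices, not the tuned implementation."""
    D = disp_4k.astype(np.float64)
    D_L = gaussian_filter(D, sigma=filter_size / 6.0)   # medium frequencies
    D_H = (D - D_L) * high_scale                        # boosted high frequencies
    # Squash the broader medium-frequency range into pixel range without clipping.
    D_L_pix = 1.0 / (1.0 + np.exp(-D_L))
    return D_L_pix, D_H
```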
4.7 Experimental Results
We evaluate the effectiveness of our approach on different input textures with a variety of sub-
jects and expressions. In Figure 4.4, we show the synthesized geometries embossed by (c) only
medium-scale details, (d) 1K and (e) 4K combined multi-scale (both medium and high frequency)
displacement maps, with the input textures and base mesh shown in the first and second col-
umn, respectively. As seen from the results, our method can faithfully capture both the medium
and fine scale geometries. The final geometry synthesized using the 4K displacement map ex-
hibits meso-scale geometry on par with active facial scanning. None of these subjects are used in
training the network, and show the the robustness of our method to a variety of texture qualities,
expressions, gender, and ages.
Figure 4.5: High-frequency details of our method (center) compared with ground-truth Light Stage data [55] (left) and the "dark is deep" heuristic [30] (right).
We validate the effectiveness of geometry detail separation by comparing with an alternative solution that does not decouple middle and high frequencies. As illustrated in Figure 4.3a, the displacement map learned by the alternative method fails to capture almost all of the high-frequency details while introducing artifacts in the middle frequencies, which is manifested in the embossed geometry. Our method, on the other hand, faithfully replicates both medium- and fine-scale details in the resulting displacement map (Figure 4.3b).
We also assess the effectiveness of the proposed super-resolution network in our framework. Figure 4.3c and Figure 4.3b show the results with and without the super-resolution network, respectively. The reconstructed result using the super-resolution network is significantly better at faithfully replicating mesoscopic facial structures.
Comparisons. We compare the reconstruction quality of our method with Beeler et al. [30] and the ground truth by Ghosh et al. [55]. As demonstrated in Figure 4.6, our reconstruction (right) generally agrees with the ground truth (middle) in capturing the roughness variation between the tip of the nose and the cheek region, and the mole by the upper lip. The "dark is deep" heuristic [30] (left), on the other hand, fails to capture these geometric differences. In Figure 4.5, we provide a quantitative evaluation comparing with Beeler et al. [30]. We measure the reconstruction error using the L1 metric between our result and the ground-truth displacement map provided by Ghosh et al. [55]. The resulting error map is visualized in false color, with red and blue indicating an absolute difference of 1 mm and 0 mm, respectively. As shown in Figure 4.5, Beeler et al. [30] is prone to larger reconstruction errors, particularly in regions with stubble hair and eyebrows. Our model, trained with photometric scans, achieves superior accuracy and robust inference without being unduly confused by local skin albedo variations.
Our system can also generate dynamic displacement maps for video performances. In the supplemental video, we demonstrate that our results are stable across frames and accurately represent changing wrinkles and pores. We also compare our results against a dynamic sequence from the state-of-the-art dynamic multi-view face capture of Fyffe et al. 2017 [44].
Figure 4.6: Inferred detail comparison: (a) Beeler et al. [31], (b) ground truth [55], (c) our method.
Fyffe et al. rely on multi-view stereo to reconstruct medium frequencies and on inter-frame changes in shading to infer high-frequency detail. Our method produces more accurate fine-scale details, as it is trained on photometric stereo and can be computed on each frame independently.
We also evaluate our technique against similar neural network synthesis methods. Sela et al. [74] also use an image-to-image translation network, but to infer a facial depth map and dense correspondences from a single image. This is followed by non-rigid registration and shape from shading similar to Beeler et al. [30]. Their generated image lacks fine-scale details, as these are not encoded in their network (see Figure 4.7). We also provide a comparison with Bansal et al. [70] in terms of normal prediction accuracy. Bansal et al. [70] offer state-of-the-art performance in estimating surface normals using a convolutional neural network.
Figure 4.7: Compared with Sela et al. [74] (a), our method (b) produces a more detailed normal map.
Figure 4.8: Our method generates much more subtle details (middle) than the surface normal prediction of Bansal et al. [70] (left).
The normal map predicted by our approach is converted from the output displacement map. The model of Bansal et al. [70] is trained on the same data as ours. Figure 4.8 compares the reconstructed results after embossing the inferred normal map onto our base mesh. As illustrated in the figure, our approach significantly outperforms their technique in predicting high-fidelity mesoscopic details.
User Study. We assessed the realism of our inferred geometry with a user study. Users were asked to sort renderings of 3D faces without skin textures from least to most realistic. We used 6 subjects for rendering using (1) our synthesized geometry, (2) the Light Stage [55], and (3) the "dark is deep" synthesis [30], randomly sorted and aligned to the same head orientation to remove bias. We collected 58 answers from 25 subjects: 20.7% of users considered our reconstructions the most realistic, while 67.2% and 12.1% preferred the Light Stage and [30], respectively. Although the Light Stage still shows superior performance in terms of realism, our method compares favorably with the geometry synthesis method of [30].
4.8 Discussion and Future Work
Our primary observation is that a high-resolution diffuse texture map contains enough implicit in-
formation to accurately infer useful geometric detail. Secondly, neural network based synthesis
trained on ground truth photometric stereo data outperforms previous shape from shading heuris-
tics. Our system can successfully differentiate between skin pores, stubble, wrinkles, and moles
based on their location on the face, and how their appearance changes across different subjects
and expressions. Our method generates stable high-resolution displacement maps in only a few
seconds, with realistic dynamics suitable for both static scans and video performances.
The limitation of our method is that the training data need to be carefully corresponded. However, our learning framework does not strictly require dense registration, since there is no meaningful pore-to-pore correspondence across different identities. We ensure during training that correspondence is roughly maintained in UV space across different subjects so that the generated displacement maintains the correct skin detail distributions. Though our training data was captured in a flat-lit environment, our method could be integrated with previous albedo synthesis techniques that compensate for varying illumination and fill in occluded regions [38] in order to infer facial details of unconstrained images in the wild. We show additional results in Figure 4.10 to support these claims, using a novel topology obtained from a conventional 3D morphable model.
Figure 4.9: Failure case with extreme makeup: (left) input texture, (center) our result, (right) ground truth [55].
While our training dataset contains several examples of commonly applied cosmetics, more pro-
nounced theatrical makeup may introduce displacement artifacts (see Figure 4.9 for a failure ex-
ample).
We believe our results will continue to improve with additional training data featuring unusual
moles, blemishes, and scars. We would also like to incorporate other channels of input. For
example, wrinkles are correlated with low-frequency geometry stress [63, 67] and local specular
highlights can provide additional detail information.
Figure 4.10: Results with unconstrained images. Left: input image and texture. Middle: displacement (zoom-in) and rendering (zoom-in). Right: rendering.
Chapter 5
Relighting Video with Reflectance Field Exemplars
5.1 Introduction
The New Dimensions in Testimony project at the University of Southern California’s Institute for
Creative Technologies recorded extensive question-and-answer interviews with twelve survivors
of the World War II Holocaust. Each twenty-hour interview, conducted over five days, produced
over a thousand responses, providing the material for time-offset conversations through AI based
matching of novel questions to recorded answers [116]. These interviews were recorded inside a
large Light Stage system [94] with fifty-four high-definition video cameras. The multi-view data
enabled the conversations to be projected three-dimensionally on an automultiscopic display [117,
118].
The light stage system is designed for recording relightable reflectance fields, where the subject
is illuminated from one lighting direction at a time, and these datasets can be recombined through
image-based relighting [92]. If the subject is recorded with a high speed video camera, a large
number of lighting conditions can be recorded during a normal video frame duration [93, 94]
allowing a dynamic video to be lit with new lighting. This enables the subject to be realistically
composited into a new environment (for example, the place that the subject is speaking about)
such that their lighting is consistent with that of the environment. In 2012, the project performed a
successful early experiment using a Spherical Harmonic Lighting Basis as in [119] for relighting a
Holocaust survivor interview.
Figure 5.1: The architecture of our neural network. The input image is passed through a U-Net style architecture to regress the set of OLAT images. When ground truth is available, the network prioritizes the reconstruction loss over the OLAT image set; otherwise, the network is trained based on the feedback from the relit image.
However, recording with an array of high-speed cameras proved to be too expensive for the project, both in the cost of the hardware and in the greatly increased storage cost of numerous high-speed uncompressed video streams.
The project settled for recording the survivors in just a single interview lighting condition
consisting of diffuse, symmetrical lighting from above. But to enable relighting in the future, each
survivor was recorded in a basis of forty-one lighting conditions in several static poses in a special
session toward the end of each shoot as in Figure 5.3. The hope was that at some point, this
set of static poses in different lighting conditions, plus the interview footage in diffuse lighting,
could eventually be combined through machine learning to realistically show the interview as
if it had been recorded in any combination of the lighting conditions, enabling general purpose
relighting. This paper presents a technique to achieve this goal, which provides a practical process
for recording interview footage where the lighting can be controlled realistically after filming.
5.2 Method
One of the most effective ways to perform realistic relighting is to combine a dense set of basis lighting conditions (a reflectance field) according to a novel lighting environment to simulate the subject's appearance under the new lighting.
Figure 5.2: End-to-end semi-supervised training scheme. We use a reconstruction loss for synthetic images, while an image-based lighting loss is applied to both real and synthetic interview images.
However, this approach is not ideal for a dynamic performance, since it requires either high-speed cameras or that the actor sit still for several seconds to capture the set of OLAT images. Meka et al. [97] overcome this limitation by using neural networks to regress 4D reflectance fields from just two images of a subject lit by gradient illumination. They postulate that one can also use flat-lit images to achieve similar results with less high-frequency detail. Since the method casts relighting as a supervised regression problem, it requires pairs of tracking images and their corresponding OLAT images as ground truth for training.
In the New Dimensions in Testimony project, most of the Holocaust survivors' interview footage was captured in front of a green screen so that virtual backgrounds could be added during post-production. However, this setup poses difficulties for achieving consistent illumination between the actors and the backgrounds in the final testimony videos and does not provide the ground truth needed for supervised training. In this paper, we use the limited OLAT data to train a neural network to infer reflectance fields from synthetically relit images. The synthetic relit images are improved by matching them with the input interview images through a differentiable renderer, enabling an end-to-end training scheme. For more training details, see Figure 5.2.
In this section, we describe the data acquisition process, how we relate the OLAT reflectance field exemplars with the interview footage, and how we train an end-to-end neural network to regress reflectance fields for realistic relighting.
5.2.1 Data Acquisition and Processing
Each Holocaust survivor was recorded over a 180-degree field of view using an array of 50 Panasonic X900MK 60fps progressive-scan consumer camcorders, each four meters away and framed on the subject. Toward the end of each Holocaust survivor's lengthy interview, they were captured in several different static poses under a reflectance field lighting basis of 41 lighting conditions, as in Figure 5.3. The lighting conditions were formed using banks of approximately 22 lights each, drawn from the 931 light sources on the 8m-diameter dome [94]. This somewhat lower lighting resolution was chosen to keep the capture time shorter than what would be required to record each of the 931 lights individually and to avoid too great a degree of underexposure, so that we could use the same exposure settings as the interview lighting without touching the cameras.
Data Processing. The original resolution of our video frames is 1920×1080 in portrait orientation. For each image, we crop the full body of the actor and then use GrabCut [120] to mask out the background. The images are then padded and resized to 512×512.
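A minimal sketch of this preprocessing with OpenCV is shown below; the crop rectangle, the GrabCut initialization, and the iteration count are illustrative assumptions rather than the exact settings used.

import cv2
import numpy as np

def preprocess_frame(frame_bgr, crop_rect, out_size=512):
    # Crop the full body, mask the background with GrabCut, pad to square, resize.
    x, y, w, h = crop_rect                                    # assumed body crop rectangle
    crop = frame_bgr[y:y + h, x:x + w]
    mask = np.zeros(crop.shape[:2], np.uint8)
    bgd = np.zeros((1, 65), np.float64)
    fgd = np.zeros((1, 65), np.float64)
    rect = (5, 5, crop.shape[1] - 10, crop.shape[0] - 10)     # assumed initialization
    cv2.grabCut(crop, mask, rect, bgd, fgd, 5, cv2.GC_INIT_WITH_RECT)
    fg = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype(np.uint8)
    crop = crop * fg[:, :, None]
    side = max(crop.shape[:2])                                # pad to a square canvas
    canvas = np.zeros((side, side, 3), crop.dtype)
    oy, ox = (side - crop.shape[0]) // 2, (side - crop.shape[1]) // 2
    canvas[oy:oy + crop.shape[0], ox:ox + crop.shape[1]] = crop
    return cv2.resize(canvas, (out_size, out_size), interpolation=cv2.INTER_AREA)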
Synthetic Tracking Frames. We use a mirror ball image captured right after the interview ses-
sion as a light probe [121]. This light probe represents the illumination of the interview session.
Figure 5.3: Reflectance field: 27 of 41 one-light-at-a-time images.
For convenience, we convert the light probe to a latitude-longitude format. Then we use mirror
ball images captured in the OLAT session to find their projections in our target environment illu-
mination map. By taking a weighted combination of the images in the OLAT set according to these
projections, we are able to relight all the static poses of the actor.
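The sketch below illustrates this weighted recombination for a single static pose, assuming each lighting condition's footprint in the latitude-longitude map (condition_masks) has been measured from the OLAT-session mirror ball images; the solid-angle weighting and the omission of a global exposure factor are simplifications, not the project's exact procedure.

import numpy as np

def relight_from_latlong(olat_images, condition_masks, env_latlong):
    # Relight a static pose as a weighted sum of its OLAT images, with weights
    # obtained by integrating the environment map over each lighting condition's
    # footprint.  olat_images: (K, H, W, 3); condition_masks: (K, He, We);
    # env_latlong: (He, We, 3) HDR map in latitude-longitude format.
    he, we = env_latlong.shape[:2]
    sin_theta = np.sin((np.arange(he) + 0.5) / he * np.pi)[:, None]
    d_omega = (np.pi / he) * (2.0 * np.pi / we) * sin_theta   # per-texel solid angle
    weights = np.einsum("khw,hwc->kc", condition_masks, env_latlong * d_omega[..., None])
    # In practice a global scale would also be fit to match the footage exposure.
    return np.einsum("khwc,kc->hwc", olat_images, weights)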
The OLAT images are not as well exposed as the interview lighting since fewer light sources
are on, and we discovered that the consumer video cameras applied a weaker level of gamma
correction to the darker range of pixel values, making dark regions appear even darker, presumably
as a form of noise suppression. Thus, we developed a dual gamma correction curve to linearize the
image data:
I′ = (1 − I) · I^{g_1} + I · I^{g_2}    (5.1)
where g_1 and g_2 describe the gammas we use for the lower and upper parts of the gamma curve, and we interpolate between these two curves according to the brightness of the pixel. We optimize g_1 and g_2 so that the OLAT reflectance field exemplars, relit with the measured interview lighting condition, match the appearance of the first frame of the interview video. Though each subject is only recorded as a reflectance field in a few poses, these synthetic relit images play an important role in bringing the output of the network closer to the illumination of the input video footage.
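A minimal sketch of this dual-gamma linearization and a simple grid search for g_1 and g_2 is given below; the search range and the mean-squared-error objective are assumptions, and the weights argument reuses the per-condition lighting weights from the previous sketch rather than the actual optimizer used in the project.

import numpy as np

def dual_gamma(img, g1, g2):
    # Eq. (5.1): blend two gamma curves according to pixel brightness.
    img = np.clip(img, 0.0, 1.0)
    return (1.0 - img) * img ** g1 + img * img ** g2

def fit_gammas(olat_images, weights, reference_frame, grid=np.linspace(1.5, 3.0, 16)):
    # Grid-search g1, g2 so the linearized, relit exemplars match the first
    # interview frame.
    best = (grid[0], grid[0], np.inf)
    for g1 in grid:
        for g2 in grid:
            relit = np.einsum("khwc,kc->hwc", dual_gamma(olat_images, g1, g2), weights)
            err = np.mean((relit - reference_frame) ** 2)
            if err < best[2]:
                best = (g1, g2, err)
    return best[0], best[1]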
5.2.2 Network Architecture
We cast the relighting problem as the prediction of a reflectance field, and use these measurements to render the subject under arbitrary illumination. To be consistent with the Holocaust survivor dataset, we define a reflectance field to have 41 OLAT images. Our goal is to predict how the actor would look under the 41 specified lighting conditions for every frame of a dynamic performance. The structure of our neural network resembles the popular image transformation architecture with skip connections [122]. The encoder consists of ten blocks of 3×3 convolution layers, each followed by a batch-normalization layer and a leaky ReLU activation function. A blur-pooling operation [123] is used at the end of each block to decrease the spatial resolution and increase the number of channels. Note that the first block of the encoder does not have a batch-normalization layer and uses a 7×7 convolution layer.
The decoder follows a similar structure with ten blocks of bilinear upsampling followed by a convolution layer. At the end of each decoder block, we use skip connections to concatenate the network features with their corresponding activations in the encoder. All convolution layers are followed by a ReLU activation except for the last convolution layer, where a sigmoid activation is used. At the end of the decoder is a differentiable renderer that takes as input the whole set of OLAT images to render the subject under a new calibrated illumination condition. For network details, see Figure 5.1.
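The sketch below gives a simplified PyTorch rendition of one encoder block and one decoder block of this architecture; the channel counts are placeholders, and a fixed binomial blur followed by strided subsampling stands in for the blur-pooling layer of [123].

import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderBlock(nn.Module):
    # 3x3 conv -> batch norm -> leaky ReLU, then a fixed binomial blur with
    # stride-2 subsampling as a stand-in for blur-pooling [123].
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)
        blur = torch.tensor([[1., 2., 1.], [2., 4., 2.], [1., 2., 1.]]) / 16.0
        self.register_buffer("blur", blur.view(1, 1, 3, 3))
        self.out_ch = out_ch

    def forward(self, x):
        x = F.leaky_relu(self.bn(self.conv(x)), negative_slope=0.2)
        skip = x                                   # saved for the decoder skip connection
        kernel = self.blur.expand(self.out_ch, 1, 3, 3)
        x = F.conv2d(x, kernel, stride=2, padding=1, groups=self.out_ch)
        return x, skip

class DecoderBlock(nn.Module):
    # Bilinear upsampling -> 3x3 conv -> ReLU, then concatenation with the
    # corresponding encoder activations (the skip connection).
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x, skip):
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        x = F.relu(self.conv(x))
        return torch.cat([x, skip], dim=1)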
5.2.3 Loss Function
Our model is trained by minimizing a weighted combination of two loss functions. A reconstruction loss minimizes errors between the set of OLAT images in the dataset and the set of OLAT images predicted by the network. The second loss is an image-based relighting loss that minimizes the errors between the input image and the rendered image lit with the predicted reflectance field. The backgrounds are masked out in all loss calculations.
Reconstruction Loss. This loss ensures the accurate inference of the network by matching
the network prediction with the ground truth. Since the per-pixel photometric loss often leads
to blurry output images, we choose to minimize the loss in feature space with a perceptual loss.
Letting VGG^{(i)}(I) be the activations of the i-th layer of a VGG network [124], the reconstruction loss is defined as:

L_rec = Σ_{i=1}^{N} Σ_{j=1}^{M} ‖ VGG^{(j)}(I^{i}_{pred}) − VGG^{(j)}(I^{i}_{gt}) ‖_2    (5.2)
where N is the number of images in a complete OLAT set, and M is the number of VGG layers to
be used.
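A minimal sketch of such a perceptual reconstruction loss using torchvision's pretrained VGG-19 (torchvision >= 0.13) is shown below; the particular feature layers are placeholders, since the exact M layers used are not listed here, and inputs are assumed to be already normalized for VGG.

import torch
import torch.nn as nn
from torchvision.models import vgg19

class VGGPerceptualLoss(nn.Module):
    # Sum of L2 distances between VGG-19 feature maps of two images.
    def __init__(self, layer_ids=(3, 8, 17, 26)):        # placeholder layer choices
        super().__init__()
        feats = vgg19(weights="IMAGENET1K_V1").features.eval()
        for p in feats.parameters():
            p.requires_grad_(False)
        self.feats = feats
        self.layer_ids = set(layer_ids)

    def forward(self, pred, gt):
        loss, x, y = 0.0, pred, gt
        for i, layer in enumerate(self.feats):
            x, y = layer(x), layer(y)
            if i in self.layer_ids:
                loss = loss + torch.norm(x - y, p=2)
            if i >= max(self.layer_ids):
                break
        return loss

# Reconstruction loss over a full OLAT set (Eq. 5.2): sum the perceptual loss
# over the N predicted / ground-truth image pairs.
# loss_rec = sum(vgg_loss(pred_olat[k], gt_olat[k]) for k in range(N))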
Image-Based Relighting Rendering Loss. This self-supervision loss makes the network more robust to unseen poses of the actor in the training set. Given a predicted reflectance field R(θ, φ; x, y) and the calibrated interview lighting environment L_i, we can relight the actor as follows:

I_relit = Σ_{θ,φ} R_{x,y}(θ, φ) · L_i(θ, φ)    (5.3)
where R_{x,y}(θ, φ) represents how much light is reflected toward the camera by pixel (x, y) as a result of illumination from direction (θ, φ). Matching this relit image I_relit against the input image I gives us the rendering loss:

L_render = Σ_{j=1}^{M} ‖ VGG^{(j)}(I_relit) − VGG^{(j)}(I) ‖_2    (5.4)
The full objective is the weighted combination of the two loss functions:

L = λ_1 · L_rec + λ_2 · L_render    (5.5)
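The differentiable rendering of Eq. (5.3) and the combined objective of Eq. (5.5) can be sketched as follows; the tensor layout and the loss weights are placeholders, and vgg_loss is assumed to be a perceptual loss module such as the one sketched above.

import torch

def relight(pred_olat, light_weights):
    # Eq. (5.3): weighted sum of the predicted reflectance field over the 41
    # lighting conditions.  pred_olat: (B, K, 3, H, W); light_weights: (K, 3).
    return torch.einsum("bkchw,kc->bchw", pred_olat, light_weights)

def total_loss(pred_olat, gt_olat, input_frame, light_weights, vgg_loss,
               lam_rec=1.0, lam_render=1.0):
    # Eq. (5.5): weighted combination of the reconstruction and rendering losses.
    loss = 0.0
    if gt_olat is not None:                        # supervised branch (OLAT exemplars)
        loss = loss + lam_rec * sum(
            vgg_loss(pred_olat[:, k], gt_olat[:, k]) for k in range(pred_olat.shape[1]))
    relit = relight(pred_olat, light_weights)      # differentiable image-based relighting
    loss = loss + lam_render * vgg_loss(relit, input_frame)
    return loss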
Implementation details. We use two sets of data to train the network. The first set consists of six poses with ground-truth OLAT images and the six corresponding relit images showing the reflectance field exemplars under the simulated interview lighting condition. The second set consists of 100 frames of the target video. We train on the first set with both the reconstruction loss L_rec and the rendering loss L_render for 100 epochs, and then train on the second set for 4 epochs with only the rendering loss before going back to supervised training. The training process continues until we reach 1040 epochs. We use the ADAM optimizer [125] with β_1 = 0.9, β_2 = 0.999 and a learning rate of 0.001.
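This alternating supervised/self-supervised schedule can be sketched as follows; model, the two data loaders, light_weights, and vgg_loss are assumed to be defined elsewhere, and total_loss refers to the sketch above.

import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

def run_epoch(loader, supervised):
    for batch in loader:
        pred_olat = model(batch["frame"])                 # predicted reflectance field
        gt = batch["olat"] if supervised else None
        loss = total_loss(pred_olat, gt, batch["frame"], light_weights, vgg_loss)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

for cycle in range(10):                                   # 10 * (100 + 4) = 1040 epochs
    for _ in range(100):
        run_epoch(exemplar_loader, supervised=True)       # L_rec + L_render on OLAT poses
    for _ in range(4):
        run_epoch(interview_loader, supervised=False)     # L_render only on video frames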
5.3 Evaluation
We evaluate our technique by relighting several hundred frames of interview footage and comparing the results to relit images produced with [98] and [126]. We do not have ground-truth relighting for each frame of the video to compare against, so we employ a user study to evaluate our method against prior work. Finally, we show how our method is able to realistically relight the dynamic performance of the subject with arbitrary poses and motions.
5.3.1 Single Image Portrait Relighting
We first compared our method with a state-of-the-art lighting estimation and relighting method for portrait photos [98]. Their neural network was trained on a dataset of numerous synthetically relit portrait images of 18 individuals from pre-captured OLAT data. From Figure 5.4, we can see that our method performs much more believable relighting, as the single-image portrait relighting result only reproduces the low-frequency components of the novel illumination.
Figure 5.4: Comparison with Single Image Portrait Relighting: (a) reference, (b) Sun et al. [98], (c) ours. Our result has greater lighting detail and looks much closer to the reference lighting.
Note also that we cropped our method's result down to just the face to match the output capability of the Single Image Portrait Relighting network, whereas our model is able to relight more of the body, as shown in Figure 5.5.
5.3.2 Relighting as Style Transfer
We next compared our approach with the state-of-the-art style transfer technique of [126], which takes several keyframes as style exemplars and transfers their style, in this case the relighting, to the video. As we can see from Figure 5.5, the inner palms of the actor are not supposed to be in shadow, but since the provided keyframes do not cover this pose, [126] predicts the wrong shading in this area. In contrast, thanks to self-supervised learning, our network is able to recover a more reasonable rendition of the shading one would expect for this pose. For a side-by-side comparison, see our supplementary video.
5.3.3 User Study
We conducted a user study to evaluate which relighting technique produced preferable results. We showed users a reference image of the subject under one of the OLAT conditions, and then short video clips of the subject's interview re-lit by that condition using our approach, Single Image Portrait Relighting [98], and Style Transfer based relighting [126].
Figure 5.5: Comparison with Style Transfer based relighting: (a) reference, (b) Texler et al. [126], (c) ours. Our method reproduces more convincing shadows and highlights.
We then asked users two questions: 1) which video clip looks more like the reference image, and 2) which video clip looks better? From 61 responses, all users answered both questions with the same answer: 52 chose the video clip rendered with our approach and 9 chose the video rendered with [126], while none chose [98]. This shows a clear preference for our approach.
5.3.4 Relighting Dynamic Performance
We perform relighting for interview footage of three Holocaust survivors. The first survivor was recorded in 2012, while the other two survivors were recorded in 2015. In 2012, the OLAT set consisted of 41 patterns, while 146 patterns were used in 2015. Because our method is not restricted to any particular OLAT patterns, it can generalize to the newer setup as long as the diffuse lighting condition from above is guaranteed. For consistency, we choose 41 evenly distributed patterns out of the 146 OLAT patterns to train our network. It is important to note that none of the evaluated interview videos are used to train the neural network. As we can see from Figure 5.6, our network is able to predict convincing reflectance fields for novel poses in the interview videos, enabling it to realistically place these interviews in any lighting environment.
Figure 5.6: Relighting results. Row 1: input interview videos. Rows 2-3: OLAT predictions on two patterns. Rows 4-5: relighting results with two HDRI lighting environments, Grace Cathedral and Pisa Courtyard. See more examples in our video.
5.4 Future Work
In this project, we made use of both the diffusely-lit interview footage and the reflectance field
exemplars of each subject, but we only used a single one of the available viewpoints in the data.
It seems possible that even better relighting results could be obtained by leveraging some or all of
the views of the subject from the other cameras’ positions, even though these other views are also
recorded in the same diffuse interview lighting. The reason is that the multiple viewpoints carry
additional information about the subject’s three-dimensional shape, and knowing the subject’s 3D
shape is also useful for predicting their shading and shadowing under new lighting conditions. For
future work, it would be of interest to use the 50 viewpoints available to reconstruct a 3D model
such as a Neural Radiance Field [127] for each frame of the interview footage and to leverage these
models during training so that the network is better able to learn how shape and the appearance
under novel illumination are connected. However, at this time, such reconstruction techniques
might be prohibitively expensive to run on hours of video material.
5.5 Conclusion
In this paper, we presented a deep learning-based video relighting technique that takes diffusely
lit video and a set of reflectance field exemplars of the same subject as input. We designed this
technique to work with the data available from the Holocaust survivor interviews recorded in 2014
in the New Dimensions in Testimony project and showed how we can realistically render the
Holocaust survivor interview footage in novel lighting conditions. The technique suggests that this
approach could be used to obtain high-quality relighting of new interview footage, assuming that the subjects can also be recorded under a variety of directional lighting conditions in a number of static poses. This provides the relighting network with more subject-specific information about how to relight the video than the single interview lighting condition alone.
Chapter 6
Conclusion and Future Work
6.1 Conclusion
This dissertation has presented techniques and solutions to acquire the shape and reflectance of virtual humans, which can be used to faithfully place their dynamic performances in a virtual world. Concretely, the solutions range from the traditional computer graphics pipeline, which requires accurate estimation of geometry and materials, to modern deep learning-based image synthesis techniques.
In Chapter 3, we present an automatic system to accurately track facial performance geometry from multi-view video using a single template. Our technique directly solves for shape estimation and facial tracking with respect to stereo constraints and consistent parameterization. This approach not only eliminates the drift of temporal tracking but also enables parallelizable processing for dynamic facial performance capture. We have also presented a technique that produces medium- and high-frequency dynamic details to enhance the realism of dynamic faces from passive flat-lit performances.
To further improve the quality of the geometry, we present in Chapter 4 a learning-based approach for synthesizing facial geometry at medium and fine scales from diffusely lit facial texture maps. We present a hybrid network adopting a state-of-the-art image-to-image translation network and a super-resolution network to learn fine skin details encoded in high-resolution displacement maps. Since the training dataset comes from highly accurate facial scans captured in a Light Stage with polarized gradient illumination, the networks can correctly decide whether dark features correspond to concavities. The method effectively covers the full range of facial details by using two sub-networks to handle mid and high frequencies independently.
Accurate geometry acquisition is always challenging, especially when only one view of the subject is available. Without proper assets, the traditional computer graphics pipeline is unlikely to achieve photo-realism. We present a solution to address this problem in Chapter 5 and achieve photo-realistic relighting results for Holocaust survivors. Concretely, we present a deep learning-based video relighting technique that takes diffusely lit video and a set of reflectance field exemplars of the same subject as input. By combining a deep convolutional neural network with a differentiable renderer, the method learns to reproduce high-quality reflectance fields from both limited training data and relit images, enabling photo-realistic relighting with image-based lighting.
6.2 Future Work
6.2.1 Multi-View Stereo Solve
While our technique yields a robust system and provides several benefits compared to existing
techniques, there are some aspects that we can improve in future work. First, initial landmark
detection accuracy can be improved with recent deep learning models. Second, the coarse-scale
template alignment fails in some areas when the appearance of the subject and the template differ
significantly. In the future, it would be of interest to improve tracking in such regions by providing
additional semantics such as more detailed facial feature segmentation and classification, or by
combining tracking from more than one template to cover a larger appearance space.
6.2.2 Mesoscopic Facial Geometry Inference
In the future, when GPU memory is no longer a limitation, we would like to train the network in an end-to-end manner. While our training dataset contains several examples of commonly applied cosmetics, more pronounced theatrical makeup may introduce displacement artifacts. It would
be of interest to see the inference results when the training set is improved with more samples
featuring unusual moles, blemishes, and scars. We would also like to incorporate other channels of
input. For example, wrinkles are correlated with low-frequency geometry stress and local specular
highlights can provide additional detail information.
6.2.3 Multi-view Reflectance Fields for Relighting
In this thesis, we made use of both the diffusely-lit video performance and the reflectance field
exemplars of each subject, but we only used a single one of the available viewpoints in the data.
It seems possible that even better relighting could be obtained by leveraging some or all of the
views of the subject from the other cameras’ positions. In the future when the computation power
allows, it would be of interest to use all viewpoints available to reconstruct a 3D model such as
a Neural Radiance Field [127] for each frame of the input video and to leverage these models
during training so that the network is better able to learn how shape and the appearance under
novel illumination are connected. However, at this time, such reconstruction techniques might be
prohibitively expensive to run on hours of video material.
References
1. Mori, M., MacDorman, K. F. & Kageki, N. The Uncanny Valley [From the Field] in IEEE
Robotics and Automation Magazine (2012).
2. Williams, L. Performance-driven Facial Animation in Proceedings of the 17th Annual Con-
ference on Computer Graphics and Interactive Techniques (ACM, Dallas, TX, USA, 1990),
235–242. doi:10.1145/97879.97906.
3. Guenter, B., Grimm, C., Wood, D., Malvar, H. & Pighin, F. Making Faces in Proceedings
of the 25th Annual Conference on Computer Graphics and Interactive Techniques (ACM,
New York, NY , USA, 1998), 55–66. doi:10.1145/280814.280822.
4. Yeatman, H. Human Face Project in Proceedings of the 29th International Conference on
Computer Graphics and Interactive Techniques. Electronic Art and Animation Catalog.
(ACM, San Antonio, Texas, 2002), 162–162. doi:10.1145/2931127.2931216.
5. Borshukov, G., Piponi, D., Larsen, O., Lewis, J. P. & Tempelaar-Lietz, C. Universal Capture
- Image-based Facial Animation for ”The Matrix Reloaded” in ACM SIGGRAPH 2005
Courses (ACM, Los Angeles, California, 2005). doi:10.1145/1198555.1198596.
6. Platt, S. M. & Badler, N. I. Animating Facial Expressions. SIGGRAPH Comput. Graph. 15,
245–252. doi:10.1145/965161.806812 (1981).
7. Terzopoulos, D. & Waters, K. Analysis and Synthesis of Facial Image Sequences Using
Physical and Anatomical Models. IEEE Transactions on Pattern Analysis and Machine In-
telligence 15, 569–579 (1993).
8. Charette, P., Sagar, M., DeCamp, G. & Vallot, J. The Jester in ACM SIGGRAPH 99 Elec-
tronic Art and Animation Catalog (ACM, Los Angeles, California, USA, 1999), 151–.
doi:10.1145/312379.312968.
9. Sifakis, E., Neverov, I. & Fedkiw, R. Automatic Determination of Facial Muscle Activations
from Sparse Motion Capture Marker Data in ACM SIGGRAPH 2005 Papers (ACM, Los
Angeles, California, 2005), 417–425. doi:10.1145/1186822.1073208.
10. Cong, M., Bao, M., E, J. L., Bhat, K. S. & Fedkiw, R. Fully Automatic Generation of
Anatomical Face Simulation Models in Proceedings of the 14th ACM SIGGRAPH / Euro-
graphics Symposium on Computer Animation (ACM, Los Angeles, California, 2015), 175–
183. doi:10.1145/2786784.2786786.
11. Blanz, V . & Vetter, T. A morphable model for the synthesis of 3D faces in SIGGRAPH
’99: Proceedings of the 26th annual conference on Computer graphics and interactive tech-
niques (ACM Press/Addison-Wesley Publishing Co., New York, NY , USA, 1999), 187–194.
doi:http://doi.acm.org/10.1145/311535.311556.
12. Li, H., Yu, J., Ye, Y . & Bregler, C. Realtime Facial Animation with On-the-fly Correctives.
ACM Trans. Graph. 32, 42:1–42:10. doi:10.1145/2461912.2462019 (2013).
13. Li, H., Weise, T. & Pauly, M. Example-Based Facial Rigging. ACM Transactions on Graph-
ics (Proceedings SIGGRAPH 2010) 29 (2010).
14. Weise, T., Bouaziz, S., Li, H. & Pauly, M. Realtime Performance-Based Facial Animation.
ACM Transactions on Graphics (Proceedings SIGGRAPH 2011) 30 (2011).
15. Saito, S., Li, T. & Li, H. Real-Time Facial Segmentation and Performance Capture from
RGB Input in Proceedings of the European Conference on Computer Vision (ECCV) (2016).
16. Hsieh, P.-L., Ma, C., Yu, J. & Li, H. Unconstrained Realtime Facial Performance Capture
in Computer Vision and Pattern Recognition (CVPR) (2015).
17. Cao, C., Hou, Q. & Zhou, K. Displaced Dynamic Expression Regression for Real-time Fa-
cial Tracking and Animation. ACM Trans. Graph. 33, 43:1–43:10. doi:10.1145/2601097.
2601204 (2014).
18. Bouaziz, S., Wang, Y . & Pauly, M. Online Modeling for Realtime Facial Animation. ACM
Trans. Graph. 32, 40:1–40:10. doi:10.1145/2461912.2461976 (2013).
19. Cao, C., Wu, H., Weng, Y ., Shao, T. & Zhou, K. Real-time Facial Animation with Image-
based Dynamic Avatars. ACM Trans. Graph. 35, 126:1–126:12. doi:10.1145/2897824.
2925873 (2016).
20. Thies, J., Zollhofer, M., Niessner, M., Valgaerts, L., Stamminger, M. & Theobalt, C. Real-
time Expression Transfer for Facial Reenactment. ACM Trans. Graph. 34, 183:1–183:14.
doi:10.1145/2816795.2818056 (2015).
21. Weise, T., Li, H., Gool, L. V . & Pauly, M. Face/Off: Live Facial Puppetry in Proceed-
ings of the 2009 ACM SIGGRAPH/Eurographics Symposium on Computer animation (Proc.
SCA’09) (Eurographics Association, New Orleans, Louisiana, 2009).
22. Bhat, K. S., Goldenthal, R., Ye, Y ., Mallet, R. & Koperwas, M. High Fidelity Facial An-
imation Capture and Retargeting with Contours in Proceedings of the 12th ACM SIG-
GRAPH/Eurographics Symposium on Computer Animation (ACM, Anaheim, California,
2013), 7–14. doi:10.1145/2485895.2485915.
23. Thies, J., Zollhöfer, M., Stamminger, M., Theobalt, C. & Nießner, M. Face2Face: Real-
time Face Capture and Reenactment of RGB Videos in Proc. Computer Vision and Pattern
Recognition (CVPR), IEEE (2016).
24. Garrido, P., Valgaert, L., Wu, C. & Theobalt, C. Reconstructing Detailed Dynamic Face
Geometry from Monocular Video. ACM Trans. Graph. 32, 158:1–158:10. doi:10.1145/
2508363.2508380 (2013).
25. Garrido, P., Zollhoefer, M., Casas, D., Valgaerts, L., Varanasi, K., Perez, P., et al. Recon-
struction of Personalized 3D Face Rigs from Monocular Video. ACM Trans. Graph. (Pre-
sented at SIGGRAPH 2016) 35, 28:1–28:15 (2016).
26. Ichim, A. E., Bouaziz, S. & Pauly, M. Dynamic 3D Avatar Creation from Hand-held Video
Input. ACM Trans. Graph. 34, 45:1–45:14. doi:10.1145/2766974 (2015).
27. Cao, C., Bradley, D., Zhou, K. & Beeler, T. Real-time High-fidelity Facial Performance
Capture. ACM Trans. Graph. 34, 46:1–46:9. doi:10.1145/2766943 (2015).
28. Olszewski, K., Lim, J. J., Saito, S. & Li, H. High-Fidelity Facial and Speech Animation
for VR HMDs. ACM Transactions on Graphics (Proceedings SIGGRAPH Asia 2016) 35
(2016).
29. Furukawa, Y . & Ponce, J. Accurate, Dense, and Robust Multiview Stereopsis. Pattern Anal-
ysis and Machine Intelligence, IEEE Transactions on 32, 1362–1376. doi:10.1109/TPAMI.
2009.161 (2010).
30. Beeler, T., Bickel, B., Beardsley, P., Sumner, B. & Gross, M. High-Quality Single-Shot
Capture of Facial Geometry. ACM Trans. on Graphics (Proc. SIGGRAPH) 29, 40:1–40:9
(2010).
31. Beeler, T., Hahn, F., Bradley, D., Bickel, B., Beardsley, P., Gotsman, C., et al. High-quality
passive facial performance capture using anchor frames in ACM SIGGRAPH 2011 papers
(ACM, Vancouver, British Columbia, Canada, 2011), 75:1–75:10. doi:10.1145/1964921.
1964970.
32. Furukawa, Y . & Ponce, J. Dense 3D Motion Capture for Human Faces in Proc. of CVPR 09
(2009).
33. Bradley, D., Heidrich, W., Popa, T. & Sheffer, A. High Resolution Passive Facial Perfor-
mance Capture. ACM Trans. on Graphics (Proc. SIGGRAPH) 29 (2010).
34. Alexander, O., Rogers, M., Lambeth, W., Chiang, M. & Debevec, P. Creating a Photoreal
Digital Actor: The Digital Emily Project in Visual Media Production, 2009. CVMP ’09.
Conference for (2009), 176–187. doi:10.1109/CVMP.2009.29.
35. Alexander, O., Fyffe, G., Busch, J., Yu, X., Ichikari, R., Graham, P., et al. Digital Ira: High-
resolution Facial Performance Playback in ACM SIGGRAPH 2013 Computer Animation
Festival (ACM, Anaheim, California, 2013), 1–1. doi:10.1145/2503541.2503641.
36. Fyffe, G., Jones, A., Alexander, O., Ichikari, R. & Debevec, P. Driving High-Resolution
Facial Scans with Video Performance Capture. ACM Transactions on Graphics (TOG) 34,
1–13 (2014).
37. Klaudiny, M. & Hilton, A. High-detail 3D capture and non-sequential alignment of facial
performance in 3DIMPVT (2012).
38. Saito, S., Wei, L., Hu, L., Nagano, K. & Li, H. Photorealistic Facial Texture Inference Using
Deep Neural Networks in Proc. Computer Vision and Pattern Recognition (CVPR), IEEE
(2017).
39. Hu, L., Saito, S., Wei, L., Nagano, K., Seo, J., Fursund, J., et al. Avatar Digitization From A
Single Image For Real-Time Rendering in ACM Trans. Graph. (Proceedings of SIGGRAPH
Asia 2017) (2017).
40. Hyneman, W., Itokazu, H., Williams, L. & Zhao, X. Human Face Project in ACM SIG-
GRAPH 2005 Courses (ACM, Los Angeles, California, 2005). doi:10.1145/1198555.
1198585.
41. Acevedo, G., Nevshupov, S., Cowely, J. & Norris, K. An Accurate Method for Acquiring
High Resolution Skin Displacement Maps in ACM SIGGRAPH 2010 Talks (ACM, Los An-
geles, California, 2010), 4:1–4:1. doi:10.1145/1837026.1837032.
42. Bradley, D., Heidrich, W., Popa, T. & Sheffer, A. High resolution passive facial performance
capture in ACM transactions on graphics (TOG) (2010).
43. Valgaerts, L., Wu, C., Bruhn, A., Seidel, H.-P. & Theobalt, C. Lightweight Binocular Fa-
cial Performance Capture under Uncontrolled Lighting in ACM Transactions on Graphics
(Proceedings of SIGGRAPH Asia 2012) (2012). doi:10.1145/2366145.2366206.
44. Fyffe, G., Nagano, K., Huynh, L., Saito, S., Busch, J., Jones, A., et al. Multi-View Stereo on
Consistent Face Topology in Computer Graphics Forum (2017).
45. Langer, M. S. & Zucker, S. W. Shape-from-shading on a cloudy day. J. Opt. Soc. Am. A 11,
467–478. doi:10.1364/JOSAA.11.000467 (1994).
46. Glencross, M., Ward, G. J., Melendez, F., Jay, C., Liu, J. & Hubbold, R. A Perceptually
Validated Model for Surface Depth Hallucination in ACM SIGGRAPH 2008 Papers (ACM,
Los Angeles, California, 2008), 59:1–59:8. doi:10.1145/1399504.1360658.
47. Barron, J. T. & Malik, J. Shape, Illumination, and Reflectance from Shading. TPAMI (2015).
48. Kemelmacher-Shlizerman, I. & Basri, R. 3d face reconstruction from a single image using
a single reference face shape. IEEE TPAMI 33, 394–405 (2011).
49. Shi, F., Wu, H.-T., Tong, X. & Chai, J. Automatic Acquisition of High-fidelity Facial Per-
formances Using Monocular Videos. ACM Trans. Graph. 33, 222:1–222:13. doi:10.1145/
2661229.2661290 (2014).
50. Suwajanakorn, S., Kemelmacher-Shlizerman, I. & Seitz, S. M. in Computer Vision – ECCV
2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings,
Part IV 796–812 (Springer International Publishing, Cham, 2014). doi:10.1007/978-3-
319-10593-2_52.
51. Li, C., Zhou, K. & Lin, S. Intrinsic Face Image Decomposition with Human Face Priors in
ECCV (5)’14 (2014), 218–233.
52. Debevec, P., Hawkins, T., Tchou, C., Duiker, H.-P. & Sarokin, W. Acquiring the Reflectance
Field of a Human Face in SIGGRAPH (2000).
53. Ma, W.-C., Hawkins, T., Peers, P., Chabert, C.-F., Weiss, M. & Debevec, P. Rapid Acquisi-
tion of Specular and Diffuse Normal Maps from Polarized Spherical Gradient Illumination
in Proceedings of the 18th Eurographics Conference on Rendering Techniques (Eurograph-
ics Association, Grenoble, France, 2007), 183–194. doi:10.2312/EGWR/EGSR07/183-194.
54. Graham, P., Tunwattanapong, B., Busch, J., Yu, X., Jones, A., Debevec, P., et al. Measurement-
Based Synthesis of Facial Microgeometry in Computer Graphics Forum 32 (2013), 335–
344.
55. Ghosh, A., Fyffe, G., Tunwattanapong, B., Busch, J., Yu, X. & Debevec, P. Multiview face
capture using polarized spherical gradient illumination in Proceedings of the 2011 SIG-
GRAPH Asia Conference (ACM, Hong Kong, China, 2011), 129:1–129:10.
56. Alexander, O., Rogers, M., Lambeth, W., Chiang, M. & Debevec, P. The Digital Emily
Project: Photoreal Facial Modeling and Animation in ACM SIGGRAPH 2009 Courses
(ACM, New Orleans, Louisiana, 2009), 12:1–12:15. doi:10.1145/1667239.1667251.
57. Von der Pahlen, J., Jimenez, J., Danvoye, E., Debevec, P., Fyffe, G. & Alexander, O. Digital
Ira and Beyond: Creating Real-time Photoreal Digital Actors in ACM SIGGRAPH 2014
Courses (ACM, Vancouver, Canada, 2014), 1:1–1:384. doi:10.1145/2614028.2615407.
58. The Digital Human League. Digital Emily 2.0 http://gl.ict.usc.edu/Research/DigitalEmily2/.
2015.
59. Weyrich, T., Matusik, W., Pfister, H., Bickel, B., Donner, C., Tu, C., et al. Analysis of Hu-
man Faces using a Measurement-Based Skin Reflectance Model. ACM Trans. on Graphics
(Proc. SIGGRAPH 2006) 25, 1013–1024. doi:http://doi.acm.org/10.1145/1179352.
1141987 (2006).
60. Nagano, K., Fyffe, G., Alexander, O., Barbič, J., Li, H., Ghosh, A., et al. Skin Microstruc-
ture Deformation with Displacement Map Convolution. ACM Transactions on Graphics
(Proceedings SIGGRAPH 2015) 34 (2015).
61. Haro, A., Guenter, B. & Essa, I. Real-time, Photo-realistic, Physically Based Rendering of
Fine Scale Human Skin Structure in Eurographics Workshop on Rendering (eds Gortle, S. J.
& Myszkowski, K.) (2001). doi:10.2312/EGWR/EGWR01/053-062.
62. Johnson, M. K., Cole, F., Raj, A. & Adelson, E. H. Microgeometry Capture using an Elas-
tomeric Sensor. ACM Transactions on Graphics (Proc. ACM SIGGRAPH) 30, 46:1–46:8.
doi:http://dx.doi.org/10.1145/2010324.1964941 (2011).
63. Ma, W.-C., Jones, A., Chiang, J.-Y ., Hawkins, T., Frederiksen, S., Peers, P., et al. Facial
Performance Synthesis Using Deformation-driven Polynomial Displacement Maps in ACM
SIGGRAPH Asia 2008 Papers (ACM, Singapore, 2008), 121:1–121:10. doi:10.1145/
1457515.1409074.
64. Wilson, C. A., Ghosh, A., Peers, P., Chiang, J.-Y ., Busch, J. & Debevec, P. Temporal upsam-
pling of performance geometry using photometric alignment. ACM Transactions on Graph-
ics (TOG) 29, 17 (2010).
65. Gotardo, P. F., Simon, T., Sheikh, Y . & Matthews, I. Photogeometric scene flow for high-
detail dynamic 3d reconstruction in Proceedings of the IEEE International Conference on
Computer Vision (2015), 846–854.
66. Golovinskiy, A., Matusik, W., Pfister, H., Rusinkiewicz, S. & Funkhouser, T. A Statisti-
cal Model for Synthesis of Detailed Facial Geometry. ACM Trans. Graph. 25, 1025–1034.
doi:10.1145/1141911.1141988 (2006).
67. Bickel, B., Lang, M., Botsch, M., Otaduy, M. A. & Gross, M. Pose-space Animation and
Transfer of Facial Details in Proceedings of the 2008 ACM SIGGRAPH/Eurographics Sym-
posium on Computer Animation (Eurographics Association, Dublin, Ireland, 2008), 57–66.
68. Cao, C., Bradley, D., Zhou, K. & Beeler, T. Real-time high-fidelity facial performance cap-
ture. ACM Transactions on Graphics (TOG) 34, 46 (2015).
69. Isola, P., Zhu, J.-Y ., Zhou, T. & Efros, A. A. Image-to-image translation with conditional
adversarial networks. arXiv preprint arXiv:1611.07004 (2016).
70. Bansal, A., Russell, B. & Gupta, A. Marr revisited: 2d-3d alignment via surface normal
prediction in Proceedings of the IEEE Conference on Computer Vision and Pattern Recog-
nition (2016), 5965–5974.
71. Chen, W., Xiang, D. & Deng, J. Surface Normals in the Wild. arXiv preprint arXiv:1704.02956
(2017).
72. Trigeorgis, G., Snape, P., Kokkinos, I. & Zafeiriou, S. Face Normals “in-the-wild” using
Fully Convolutional Networks in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (2017).
73. Richardson, E., Sela, M., Or-El, R. & Kimmel, R. Learning detailed face reconstruction
from a single image. arXiv preprint arXiv:1611.05053 (2016).
74. Sela, M., Richardson, E. & Kimmel, R. Unrestricted Facial Geometry Reconstruction Using
Image-to-Image Translation. arXiv preprint arXiv:1703.10131 (2017).
75. Lombardi, S. & Nishino, K. Reflectance and Illumination Recovery in the Wild. IEEE
Transactions on Pattern Analysis and Machine Intelligence 38, 129–141. doi:10.1109/
TPAMI.2015.2430318 (2016).
76. Ramamoorthi, R. & Hanrahan, P. A Signal-Processing Framework for Inverse Rendering in
Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Tech-
niques (Association for Computing Machinery, New York, NY , USA, 2001), 117–128.
doi:10.1145/383259.383271.
77. Yu, Y ., Debevec, P., Malik, J. & Hawkins, T. Inverse Global Illumination: Recovering Re-
flectance Models of Real Scenes from Photographs in Proceedings of the 26th Annual Con-
ference on Computer Graphics and Interactive Techniques (ACM Press/Addison-Wesley
Publishing Co., USA, 1999), 215–224. doi:10.1145/311535.311559.
78. Horn, B. K. SHAPE FROM SHADING: A METHOD FOR OBTAINING THE SHAPE OF A
SMOOTH OPAQUE OBJECT FROM ONE VIEW tech. rep. (USA, 1970).
79. Theobalt, C., Ahmed, N., Lensch, H., Magnor, M. & Seidel, H.-P. Seeing People in Different
Light - Joint Shape, Motion, and Reflectance Capture. IEEE Transactions on Visualization
and Computer Graphics (TVCG) 13, 663–674 (2007).
80. Hawkins, T., Wenger, A., Tchou, C., Gardner, A., Goransson, F. & Debevec, P. Animatable
Facial Reflectance Fields in Eurographics Symposium on Rendering (Norkoping, Sweden,
2004).
81. Bérard, P., Bradley, D., Gross, M. & Beeler, T. Lightweight Eye Capture Using a Parametric
Model. ACM Trans. Graph. 35. doi:10.1145/2897824.2925962 (2016).
82. Bermano, A., Beeler, T., Kozlov, Y ., Bradley, D., Bickel, B. & Gross, M. Detailed Spatio-
Temporal Reconstruction of Eyelids. ACM Trans. Graph. 34. doi:10.1145/2766924 (2015).
83. Hu, L., Ma, C., Luo, L. & Li, H. Single-View Hair Modeling Using a Hairstyle Database.
ACM Trans. Graph. 34. doi:10.1145/2766931 (2015).
84. Zhang, M., Chai, M., Wu, H., Yang, H. & Zhou, K. A Data-Driven Approach to Four-View
Image-Based Hair Modeling. ACM Trans. Graph. 36. doi:10.1145/3072959.3073627
(2017).
85. Li, G., Wu, C., Stoll, C., Liu, Y ., Varanasi, K., Dai, Q., et al. Capturing Relightable Hu-
man Performances under General Uncontrolled Illumination. Computer Graphics Forum.
doi:10.1111/cgf.12047 (2013).
86. Zhen Wen, Zicheng Liu & Huang, T. S. Face relighting with radiance environment maps
in 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition,
2003. Proceedings. 2 (2003), II–158. doi:10.1109/CVPR.2003.1211466.
87. Fyffe, G. Cosine Lobe Based Relighting from Gradient Illumination Photographs in SIG-
GRAPH ’09: Posters (Association for Computing Machinery, New Orleans, Louisiana,
2009). doi:10.1145/1599301.1599381.
88. Nam, G., Lee, J. H., Gutierrez, D. & Kim, M. H. Practical SVBRDF Acquisition of 3D
Objects with Unstructured Flash Photography. ACM Trans. Graph. 37. doi:10.1145/
3272127.3275017 (2018).
89. Li, Z., Xu, Z., Ramamoorthi, R., Sunkavalli, K. & Chandraker, M. Learning to reconstruct
shape and spatially-varying reflectance from a single image in SIGGRAPH Asia 2018 Tech-
nical Papers (2018), 269.
90. Gotardo, P., Riviere, J., Bradley, D., Ghosh, A. & Beeler, T. Practical Dynamic Facial Ap-
pearance Modeling and Acquisition. ACM Trans. Graph. 37. doi:10.1145/3272127.
3275073 (2018).
91. Yamaguchi, S., Saito, S., Nagano, K., Zhao, Y ., Chen, W., Olszewski, K., et al. High-
Fidelity Facial Reflectance and Geometry Inference from an Unconstrained Image. ACM
Trans. Graph. 37. doi:10.1145/3197517.3201364 (2018).
92. Debevec, P., Hawkins, T., Tchou, C., Duiker, H.-P., Sarokin, W. & Sagar, M. Acquiring
the reflectance field of a human face in SIGGRAPH ’00: Proceedings of the 27th annual
conference on Computer graphics and interactive techniques (ACM Press/Addison-Wesley
Publishing Co., New York, NY , USA, 2000), 145–156.
93. Wenger, A., Gardner, A., Tchou, C., Unger, J., Hawkins, T. & Debevec, P. Performance Re-
lighting and Reflectance Transformation with Time-Multiplexed Illumination. ACM Trans.
Graph. 24, 756–764. doi:10.1145/1073204.1073258 (2005).
94. Einarsson, P., Chabert, C.-F., Jones, A., Ma, W.-C., Lamond, B., Hawkins, T., et al. Re-
lighting Human Locomotion with Flowed Reflectance Fields in Eurographics Symposium
on Rendering (2006) (2006).
95. Peers, P., Tamura, N., Matusik, W. & Debevec, P. Post-production Facial Performance Re-
lighting using Reflectance Transfer. ACM Transactions on Graphics 26. doi:http://doi.
acm.org/10.1145/1276377.1276442 (2007).
96. Shih, Y ., Paris, S., Barnes, C., Freeman, W. T. & Durand, F. Style Transfer for Headshot
Portraits. ACM Trans. Graph. 33. doi:10.1145/2601097.2601137 (2014).
97. Meka, A., Haene, C., Pandey, R., Zollhoefer, M., Fanello, S., Fyffe, G., et al. Deep Re-
flectance Fields - High-Quality Facial Reflectance Field Inference From Color Gradient
Illumination in. 38 (2019). doi:10.1145/3306346.3323027.
98. Sun, T., Barron, J. T., Tsai, Y .-T., Xu, Z., Yu, X., Fyffe, G., et al. Single Image Portrait
Relighting. ACM Trans. Graph. 38. doi:10.1145/3306346.3323008 (2019).
99. LeGendre, C., Ma, W.-C., Pandey, R., Fanello, S., Rhemann, C., Dourgarian, J., et al.
Learning Illumination from Diverse Portraits in SIGGRAPH Asia 2020 Technical Commu-
nications (Association for Computing Machinery, Virtual Event, Republic of Korea, 2020).
doi:10.1145/3410700.3425432.
100. Texler, O., Futschik, D., Kučera, M., Jamriška, O., Sochorová, Š., Chai, M., et al. Interactive
Video Stylization Using Few-Shot Patch-Based Training. ACM Transactions on Graphics
39, 73 (2020).
101. Valgaerts, L., Wu, C., Bruhn, A., Seidel, H.-P. & Theobalt, C. Lightweight binocular facial
performance capture under uncontrolled lighting. ACM Trans. Graph. 31, 187:1–187:11.
doi:10.1145/2366145.2366206 (2012).
102. Si, H. TetGen, a Delaunay-Based Quality Tetrahedral Mesh Generator. ACM Trans. Math.
Softw. 41, 11:1–11:36. doi:10.1145/2629697 (2015).
103. League, D. H. The Wikihuman Project http://gl.ict.usc.edu/Research/DigitalEmily2/.
Accessed: 2015-12-01. 2015.
104. Kazemi, V . & Sullivan, J. One Millisecond Face Alignment with an Ensemble of Regression
Trees in Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recog-
nition (IEEE Computer Society, Washington, DC, USA, 2014), 1867–1874. doi:10.1109/
CVPR.2014.241.
105. King, D. E. Dlib-ml: A Machine Learning Toolkit. Journal of Machine Learning Research
10, 1755–1758 (2009).
106. Werlberger, M., Trobin, W., Pock, T., Wedel, A., Cremers, D. & Bischof, H. Anisotropic
Huber-L1 Optical Flow in Proceedings of the British Machine Vision Conference (BMVC)
(London, UK, 2009).
107. Zhou, K., Huang, J., Snyder, J., Liu, X., Bao, H., Guo, B., et al. Large Mesh Deforma-
tion Using the Volumetric Graph Laplacian in ACM SIGGRAPH 2005 Papers (ACM, Los
Angeles, California, 2005), 496–503. doi:10.1145/1186822.1073219.
108. Agarwal, S., Mierle, K., et al. Ceres Solver http://ceres-solver.org.
109. Sorkine, O. & Alexa, M. As-rigid-as-possible Surface Modeling in Proceedings of the Fifth
Eurographics Symposium on Geometry Processing (Eurographics Association, Barcelona,
Spain, 2007), 109–116.
110. Gower, J. C. Generalized procrustes analysis. Psychometrika 40, 33–51. doi:10.1007/
BF02291478 (1975).
111. Neumann, T., Varanasi, K., Wenger, S., Wacker, M., Magnor, M. & Theobalt, C. Sparse
Localized Deformation Components. ACM Trans. Graph. 32, 179:1–179:10. doi:10.1145/
2508363.2508417 (2013).
112. Malleson, C., Bazin, J.-C., Wang, O., Bradley, D., Beeler, T., Hilton, A., et al. FaceDirec-
tor: Continuous Control of Facial Performance in Video in International Conference on
Computer Vision (ICCV) 2015 (2015).
113. Qi, C. R., Su, H., Mo, K. & Guibas, L. J. PointNet: Deep Learning on Point Sets for 3D
Classification and Segmentation. CoRR abs/1612.00593 (2016).
114. Shi, W., Caballero, J., Huszár, F., Totz, J., Aitken, A. P., Bishop, R., et al. Real-time single
image and video super-resolution using an efficient sub-pixel convolutional neural network
in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016),
1874–1883.
115. Efros, A. A. & Freeman, W. T. Image quilting for texture synthesis and transfer in Pro-
ceedings of the 28th annual conference on Computer graphics and interactive techniques
(2001), 341–346.
116. Artstein, R., Traum, D., Alexander, O., Leuski, A., Jones, A., Georgila, K., et al. Time-Offset
Interaction with a Holocaust Survivor in Proceedings of the 19th International Conference
on Intelligent User Interfaces (Association for Computing Machinery, Haifa, Israel, 2014),
163–168. doi:10.1145/2557500.2557540.
117. Jones, A., Nagano, K., Busch, J., Yu, X., Peng, H., Barreto, J., et al. Time-Offset Con-
versations on a Life-Sized Automultiscopic Projector Array in 2016 IEEE Conference on
Computer Vision and Pattern Recognition Workshops (CVPRW) (2016), 927–935. doi:10.
1109/CVPRW.2016.120.
118. Jones, A., Unger, J., Nagano, K., Busch, J., Yu, X., Peng, H.-Y ., et al. An Automultiscopic
Projector Array for Interactive Digital Humans in ACM SIGGRAPH 2015 Emerging Tech-
nologies (Association for Computing Machinery, Los Angeles, California, 2015). doi:10.
1145/2782782.2792494.
119. Fyffe, G., Hawkins, T., Watts, C., Ma, W.-C. & Debevec, P. Comprehensive Facial Perfor-
mance Capture. Comput. Graph. Forum 30, 425–434. doi:10.1111/j.1467-8659.2011.
01888.x (2011).
120. Rother, C., Kolmogorov, V . & Blake, A. ”GrabCut”: Interactive Foreground Extraction
Using Iterated Graph Cuts in ACM SIGGRAPH 2004 Papers (Association for Computing
Machinery, Los Angeles, California, 2004), 309–314. doi:10.1145/1186562.1015720.
121. Debevec, P. Rendering Synthetic Objects into Real Scenes: Bridging Traditional and Image-
Based Graphics with Global Illumination and High Dynamic Range Photography in Pro-
ceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques
(Association for Computing Machinery, New York, NY , USA, 1998), 189–198. doi:10.
1145/280814.280864.
122. Ronneberger, O., P.Fischer & Brox, T. U-Net: Convolutional Networks for Biomedical Im-
age Segmentation in Medical Image Computing and Computer-Assisted Intervention (MIC-
CAI) 9351. (available on arXiv:1505.04597 [cs.CV]) (Springer, 2015), 234–241.
123. Zhang, R. Making Convolutional Networks Shift-Invariant Again in ICML (2019).
124. Simonyan, K. & Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image
Recognition in International Conference on Learning Representations (2015).
125. Kingma, D. P. & Ba, J. Adam: A Method for Stochastic Optimization in 3rd International
Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015,
Conference Track Proceedings (2015).
126. Texler, O., Futschik, D., Fišer, J., Lukáč, M., Lu, J., Shechtman, E., et al. Arbitrary Style
Transfer using Neurally-Guided Patch-Based Synthesis. Computers & Graphics 87, 62–71
(2020).
127. Mildenhall, B., Srinivasan, P., Tancik, M., Barron, J., Ramamoorthi, R. & Ng, R. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis in ECCV 2020, 405–421 (2020). doi:10.1007/978-3-030-58452-8_24.