Human Appearance Analysis and Synthesis Using Deep Learning
by
Lingyu ‘Cosimo’ Wei
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Science)
May 2018
Copyright 2018 Lingyu ‘Cosimo’ Wei
Dedication
To my love, Ichi. It is an unimaginable gift to have met you and for both of us to have become the other half of each other. I still cannot believe that it all happened within these four years.
Acknowledgements
I would like to thank my advisor, Hao Li. Thank you for all the working hours we spent together in the
office, and for all the wings and Red Bulls there :). Thank you for building the research team, the graphics
capture lab, and the company. Thank you for all the midnight conversations and for being a tutor not only in research but also in life.
I would like to thank my committee members: Krishna Nayak, Aiichiro Nakano, Joseph Lim, and
Gaurav Sukhatme, for your participation in my qualifying exam and dissertation defense, as well as for all the insightful discussions and suggestions.
I would like to acknowledge my close collaborators and friends: Peter Huang, for always exchanging your brilliant ideas and for the effort of re-implementing all the algorithms. Thank you Etienne Vouga for all the last-minute help and your poetic writing. Thank you Liwen Hu, Shunsuke Saito, Kyle Olszewski, Tianye Li, and Zimo Li for those sleepless nights before deadlines, and for being the models for most of the figures in our papers. And of course the other Ph.D. students from USC: Ruizhe Wang and Koki Nagano. It was a wonderful time working with you on those publications.
I would like to thank all the warm and enthusiastic friends at Adobe: Linjie Luo, Duygu Ceylan, Ersin Yumer, Vladimir Kim, Nathan Carr, Radomír Měch, Sunil Hadap, and Kalyan Sunkavalli. The two internships at Adobe were always intense yet joyful. And not to mention the honor of receiving the Adobe Research Fellowship, which would not have been possible without your support. A special thanks goes to my supervisors Duygu Ceylan, Ersin Yumer, and Vladimir Kim. Thank you for never imposing overly specific research projects or plans, but always listening to my ideas and intuitions. Those publications would not exist without your unconditional support.
I would also like to thank the Pinscreen folks: Koki Nagano, Jaewoo Seo, Jens Fursund, Aviral Agarwal, Carrie Sun, Stephen Chen, and Hanwei Kung. Pinscreen is almost another research lab, if not a more productive one, and I am looking forward to working together with you on more creative and customer-oriented projects.
Finally, I would like to thank my father, my mother, my wife, and all my friends for their unfailing support during my four-year Ph.D. study at USC.
The author acknowledges the partial support of the research team by Adobe, Oculus & Facebook,
Huawei, Sony, Pelican Imaging, Panasonic, Embodee, the USC Integrated Media System Center, the
Google Faculty Research Award, the Okawa Foundation Research Grant, the Office of Naval Research
(ONR) / U.S. Navy, under award number N00014-15-1-2639, the Office of the Director of National Intel-
ligence (ODNI) and Intelligence Advanced Research Projects Activity (IARPA), under contract number
2014-14071600010, and the U.S. Army Research Laboratory (ARL) under contract W911NF-14-D-0005.
The views and conclusions contained herein should not be interpreted as necessarily representing the offi-
cial policies or endorsements, either expressed or implied, of ODNI, IARPA, or the U.S. Government.
Table of Contents
Dedication ii
Acknowledgements iii
List Of Tables viii
List Of Figures ix
Abstract xi
Chapter 1: Overview 1
  1.1 Deep Neural Network 2
  1.2 Dense Correspondence of 3D Scans 3
  1.3 Facial Texture Inference 4
  1.4 Hair Rendering 4
Chapter 2: Dense Correspondence of 3D Scans 6
  2.1 Introduction 6
  2.2 Related Work 8
  2.3 Problem Statement and Overview 10
    2.3.1 Descriptor learning as ensemble classification 11
    2.3.2 Correspondence Computation 13
  2.4 Implementation Details 14
    2.4.1 Training Data Generation 14
    2.4.2 Network Design and Training 15
  2.5 Results 16
Chapter 3: Facial Texture Inference 24
  3.1 Introduction 24
  3.2 Related Work 27
  3.3 Implementation Details 29
    3.3.1 Initial Face Model Fitting 29
    3.3.2 Texture Analysis 31
    3.3.3 Texture Synthesis 34
  3.4 Results 35
Chapter 4: Hair Rendering 44
  4.1 Introduction 44
  4.2 Related Work 47
  4.3 Method 48
  4.4 Implementation 51
    4.4.1 Architecture 51
    4.4.2 Data Preparation 53
    4.4.3 Training 53
  4.5 Results 54
Chapter 5: Discussion 61
  5.1 Limitations and Future Work 63
Reference List 64
List Of Tables
2.1 The end-to-end network architecture for the per-pixel feature descriptor. 13
2.2 Evaluation of error distances from different algorithms on the FAUST dataset. 19
2.3 Comparison to the work of Chen et al. [27] on the FAUST dataset. 20
4.1 Hair image evaluation on photo realism using Amazon Mechanical Turk. 56
List Of Figures
2.1 Dense correspondence between humans in arbitrary shapes and clothing. 6
2.2 Pipeline of network training and correspondence computation. 10
2.3 Illustration of multiple segmentations to ensure smoothness. 13
2.4 Sparse key point annotations of 33 landmarks across clothed human models of different datasets. 15
2.5 Full-to-full, partial-to-full, and partial-to-partial matchings between full 3D models and partial scans generated from a single depth map. 21
2.6 Evaluation of error distributions from different algorithms on the FAUST dataset. 22
2.7 Comparison to other non-rigid registration algorithms. 22
2.8 Smooth reconstruction without explicit temporal coherency by our method. 23
3.1 Synthesizing photorealistic facial texture along with 3D geometry from a single unconstrained image. 24
3.2 Overview of our texture inference framework. 25
3.3 Overview of the texture analysis step. 31
3.4 Convex combination of feature correlations. 31
3.5 Using convex constraints, we can ensure detail preservation for low-quality and noisy input data. 34
3.6 Detail weight for texture synthesis. 34
3.7 Photorealistic renderings of geometry and texture obtained using PCA model fitting and our method. 36
3.8 Comparison between different convolutional neural network architectures. 37
3.9 Fine-scale details from a largely downsized image, preserving similarity. 37
3.10 Evaluation of choosing different numbers of mid-layers. 38
3.11 Consistent and plausible reconstructions from two different viewpoints. 39
3.12 Results with unconstrained images. 40
3.13 Additional results with images from the AFW dataset [113]. 41
3.14 Comparison of our method with PCA-based model fitting [12], visio-lization [98], and the ground truth. 42
3.15 Comparison with PatchMatch [8] on partial input data. 42
3.16 Box plots of 150 turkers' ratings for realism and likeness of generated images compared to the ground truth. 42
3.17 Side-by-side renderings of 3D faces for AMT. 43
4.1 Rendering photoreal hair images in real time with referenced color and lighting. 44
4.2 Overview of our hair image analysis and synthesis framework. 49
4.3 Examples of our training data for hair analysis. 50
4.4 cVAE-GAN architecture for our hair pipeline. 51
4.5 Our interactive interface for real-time hair appearance manipulation. 54
4.6 Rendering results for different CG hair models. 57
4.7 Hair appearance interpolations in color, lighting, and structure. 58
4.8 Hair appearance interpolations in a 2D grid for color and lighting. 59
4.9 Comparisons between our method and a non-sequential network. 59
4.10 Comparisons between our method and real-time rendering. 60
4.11 Comparisons between our method and offline rendering. 60
4.12 User study examples of hair image realism. 60
Abstract
The Deep Neural Network (DNN) is a data-driven machine learning model that has achieved state-of-the-art performance in a wide range of AI applications. However, most current research focuses on vision-based tasks, e.g., object classification, detection, and segmentation, whose training data are often obtained from subjective judgment and manual labeling. Without accurate analysis of the geometric properties of the actual object, or the ability to synthesize images at high resolution and high accuracy, existing DNN frameworks are not sufficient for computer-graphics tasks such as performance capture, texture synthesis, or material analysis for realistic rendering. We propose multiple innovative modifications to current DNN frameworks that redefine the state of the art in several areas of graphics research and reduce cost and manual effort, by switching to automatic analysis in performance capture and modeling, and to image synthesis methods in graphics rendering.
First, we show how to use a neural network for human body capture and analysis. We predict shape-invariant, pose-invariant, and temporally consistent per-pixel descriptors of the human body. This allows us to compute dense correspondences of human motion directly, removes the requirement for watertight or occlusion-free geometry imposed by previous methods, and achieves state-of-the-art body registration accuracy.
Second, we show that mesoscopic skin details can be synthesized on human faces from a low-quality, partially occluded facial image using a neural network. Our method infers features of the subject's face, including skin characteristics, and reconstructs a photorealistic texture faithful to the subject's identity without any manual manipulation.
Finally, we propose a complete neural-network-based framework for human hair rendering. We use a neural network to retrieve hair information, including color, lighting environment, and hair structure, directly from an image. Using the predicted parameters, we can synthesize new and realistic hair images from CG renderings, without any knowledge of the 3D geometry, and in real time. Compared to the traditional manual process of modeling hair attributes, our new method is automatic and preserves a realistic style in the output images.
Chapter 1
Overview
Digitally capturing, analyzing, and resynthesizing human appearance, with deformable geometry, editable structure and material, and accurate, realistic detail in modeling and rendering, is one of the core tasks of Computer Graphics, Computer Vision, and Artificial Intelligence. Starting over 40 years ago, this technology was first widely used by biomechanics researchers and later adopted by filmmakers, video game developers, and clothing designers to provide not only vivid appearances for customers, but also insights into how a person perceives, behaves, and interacts with themselves and others. Recent developments in Virtual Reality (VR) and Augmented Reality (AR) push the demand for human capture to a new level: we need human models in higher detail to provide an immersive experience, yet a more automatic and rapid process to support ever-growing applications, e.g., full-CG movies and computer games with customizable appearances.
A complete capturing and rendering pipeline for human appearance consists of three critical steps. The performance capture step scans the surface of a body and records the movement of the subject. The data is digitized into a sequence of deformable and aligned 3D meshes storing positions, surface normals, and the topology of the surface. After that, material analysis records detailed information for every point on the surface, e.g., color and material properties. Such data is often stored as a 2D texture with a UV parameterization aligned to the surface's geometry. Lastly, after editing the captured data, a rendering engine generates a new sequence of images based on the subject's movement and properties. This rendering process is often physically based and produces realistic images.
Undoubtedly, a high-quality final result from this pipeline requires heavy parameter tuning for lighting and material, and a long time for computing the interactions of objects as well as simulating real-world physics. This dissertation is an attempt to simplify multiple steps of the above pipeline by using a general yet powerful data-driven model called the Deep Neural Network (DNN) [47].
1.1 Deep Neural Network
With the development of the Internet and the availability of mobile devices, far more data is produced every day than we can absorb. The desire to analyze such amounts of data has led to rapid developments in AI and machine learning. Neural networks that are trained with sufficient data and run on efficient GPUs are now capable of expressing highly complex functions with large numbers of parameters and of solving many fundamental computer vision and computer graphics tasks that were not possible before. With these deep models, prior techniques based on handcrafted models can be outperformed by a significant margin [72, 52]. Consequently, modern researchers often consider deep learning their first (and final) solution to everything.
However, data used for training a specific model is often not applicable to other tasks. For example, ImageNet [120] contains millions of images with annotated labels but cannot be used directly for segmentation tasks. Very often, a dataset can only be used for a limited number of applications, and certain research problems are difficult to study without dedicated datasets [120, 143, 145, 82]. As a consequence, deep learning currently improves only the solutions to specific problems for which massive amounts of training data are available. Furthermore, such annotation requires not only huge amounts of labor but also skill [19, 71]. That is also the main reason why we have millions of images with semantic labels [120, 143], but only thousands of images with dense pixel-wise segmentation [95, 18].
In the world of computer graphics, the input is often the opposite: every captured model and every generated frame is of high quality and rich in detail, while the number of available models is very limited. In fact, film and game makers can only afford a few models for each new product, and the number of studios capable of performance capture is also limited. The techniques proposed here leverage the power of deep learning methods but are trained with much smaller amounts of high-quality data, making neural networks practical for capture and rendering.
1.2 Dense Correspondence of 3D Scans
Standard capture of body movement in filmmaking often uses special suits with recognizable patterns and records only the movement of human joints, rather than the dense deformation of skin or clothes. A denser correspondence between 3D scans of people can capture accurate and continuous movement and provide richer details of the motion of the captured surface, including skin stretch and compression.
In Chapter 2, we propose a deep learning approach for finding such dense correspondences. Our
method requires only partial geometric information in the form of two depth maps or partially recon-
structed surfaces, works for humans in arbitrary poses and wearing any clothing, does not require the two
people to be scanned from similar viewpoints, and runs in real time. We use a deep convolutional neural
network to train a feature descriptor on depth map pixels, but crucially, rather than training the network
to solve the shape correspondence problem directly, we train it to solve a body region classification prob-
lem, modified to increase the smoothness of the learned descriptors near region boundaries. This approach
ensures that nearby points on the human body are nearby in feature space, and vice versa, rendering the fea-
ture descriptor suitable for computing dense correspondences between the scans. We validate our method
on real and synthetic data for both clothed and unclothed humans and show that our correspondences are
more robust than is possible with state-of-the-art unsupervised methods, and more accurate than those
found using methods that require full watertight 3D geometry.
1.3 Facial Texture Inference
While an accurate texture map can be captured in a professional studio, doing so requires the presence of the subject in a constrained environment, which may not always be feasible, e.g., when capturing the appearance of a subject from decades ago.
In Chapter 3, we present a data-driven inference method that can synthesize a photorealistic texture
map of a complete 3D face model given a partial 2D view of a person in the wild. After an initial es-
timation of shape and low-frequency albedo, we compute a high-frequency partial texture map, without
the shading component, of the visible face area. To extract the fine appearance details from this incom-
plete input, we introduce a multi-scale detail analysis technique based on mid-layer feature correlations
extracted from a deep convolutional neural network. We demonstrate that fitting a convex combination of
feature correlations from a high-resolution face database can yield a semantically plausible facial detail
description of the entire face. A complete and photorealistic texture map can then be synthesized by it-
eratively optimizing for the reconstructed feature correlations. Using these high-resolution textures and a
commercial rendering framework, we can produce high-fidelity 3D renderings that are visually comparable
to those obtained with state-of-the-art multi-view face capture systems. We demonstrate successful face
reconstructions from a wide range of low-resolution input images, including those of historical figures. In
addition to extensive evaluations, we validate the realism of our results using a crowdsourced user study.
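To make the notion of mid-layer feature correlations concrete, the sketch below computes Gram-matrix style correlations from intermediate layers of a pretrained VGG-19 in PyTorch. This is an illustrative sketch only: the particular network, layer indices, and normalization shown here are assumptions for exposition and assume a recent torchvision; they are not necessarily the exact choices used in Chapter 3.

```python
import torch
import torchvision.models as models

def feature_correlations(image, layer_indices=(5, 10, 19)):
    """Mid-layer feature correlations (Gram matrices) of a texture image (illustrative sketch).

    `image` is a (1, 3, H, W) tensor; `layer_indices` picks intermediate layers of a
    pretrained VGG-19. The layer choice and normalization are illustrative assumptions.
    """
    vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
    grams = []
    x = image
    with torch.no_grad():
        for i, layer in enumerate(vgg):
            x = layer(x)
            if i in layer_indices:
                _, c, h, w = x.shape
                f = x.flatten(2).squeeze(0)              # (C, H*W) feature matrix
                grams.append(f @ f.t() / (c * h * w))    # normalized C x C correlation matrix
    return grams
```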
1.4 Hair Rendering
Human hair comprises some of the most complicated geometry of the full body. Due to the nature of hair strands, the hundreds of thousands of strands on a person's head are highly flexible and can move independently under external forces. This makes analyzing and resynthesizing their accurate appearance almost infeasible.
In Chapter 4, we present a bidirectional pipeline for hair rendering as an alternative to conventional
computer graphics pipelines: an encoder network for analyzing the crucial factors for a photoreal rendering of hair, e.g., color, lighting, and hair structure; and, more importantly, an adversarial network for rendering photorealistic hair using such inferred parameters. Our deep learning approach requires neither low-level parameter tuning nor ad-hoc asset design. Our method simply takes a strand-based 3D hair model as input and provides intuitive user control of color and lighting through reference images. To handle the diversity of hairstyles and their geometric and appearance complexity, we disentangle hair structure, color, and illumination properties using a sequential generative adversarial network architecture and a semi-
supervised training approach. We also introduce an intermediate step that converts edge activation maps to orientation fields, in order to ensure semi-supervised training and a successful CG-to-photoreal transition while preserving the hair structure of the original input data. As we only require a feed-forward pass
through the network, our rendering approach performs in real-time. We demonstrate the synthesis of
photorealistic hair images on a wide range of intricate hairstyles and compare our technique with state-of-
the-art hair rendering methods.
Note that some of these results have been published at different conferences [150, 126], and some passages have been quoted verbatim from those sources.
Overall, we make the following contributions:
• We prove that the DNN architecture, with dedicated modifications, is sufficient to handle all of the above graphics-oriented tasks.
• With concrete evaluations, we show that DNN-based algorithms can achieve state-of-the-art speed and performance compared with existing work.
• We show that DNN-based image analysis is robust to raw, low-quality input images in the wild, thus simplifying existing processes, e.g., calibration and image pre-processing.
• The DNN architecture can also be used for photoreal image synthesis. Compared to graphics rendering techniques, DNNs are often significantly faster and suffer less from the uncanny valley phenomenon.
• The DNN core can be upgraded independently, without changes to other parts, if a newer DNN architecture with better performance appears in the future, while still keeping the overall pipeline universal.
Chapter 2
Dense Correspondence of 3D Scans
Figure 2.1: We introduce a deep learning framework for computing dense correspondences between human shapes in arbitrary, complex poses and wearing varying clothing (panels: full-to-full, full-to-partial, and partial-to-partial correspondences; error color bar: 0 to 20 cm). Our approach can handle full 3D models as well as partial scans generated from a single depth map. The source and target shapes do not need to be the same subject, as highlighted in the left pair.
2.1 Introduction
The computation of correspondences between geometric shapes is a fundamental building block for many
important tasks in 3D computer vision, such as reconstruction, tracking, analysis, and recognition. Temporally-
coherent sequences of partial scans of an object can be aligned by first finding corresponding points in
overlapping regions, then recovering the motion by tracking surface points through a sequence of 3D data;
semantics can be extracted by fitting a 3D template model to an unstructured input scan. With the pop-
ularization of commodity 3D scanners and recent advances in correspondence algorithms for deformable
shapes, human bodies can now be easily digitized [80, 100, 33] and their performances captured using a
single RGB-D sensor [77, 140].
Most techniques are based on robust non-rigid surface registration methods that can handle complex
skin and cloth deformations, as well as large regions of missing data due to occlusions. Because geometric
features can be ambiguous and difficult to identify and match, the success of these techniques generally
relies on the deformation between source and target shapes being reasonably small, with sufficient overlap.
While local shape descriptors [119] can be used to determine correspondences between surfaces that are
far apart, they are typically sparse and prone to false matches, which require manual clean-up. Dense cor-
respondences between shapes with larger deformations can be obtained reliably using statistical models of
human shapes [5, 14], but the subject has to be naked [13]. For clothed bodies, the automatic computation
of dense mappings [70, 83, 114, 27] has been demonstrated on full surfaces with significant shape variations, but is limited to compatible or zero-genus surface topologies. Consequently, an automated method for estimating accurate dense correspondences between partial shapes, such as scans from a single RGB-D camera, under arbitrarily large deformations has not yet been proposed.
We introduce a deep neural network structure for computing dense correspondences between shapes
of clothed subjects in arbitrary complex poses. The input surfaces can be a full model, a partial scan, or
a depth map, maximizing the range of possible applications (see Figure 2.1). Our system is trained with
a large dataset of depth maps generated from the human bodies of the SCAPE database [5], as well as
from clothed subjects of the Yobi3D [2] and MIT [147] dataset. While all meshes in the SCAPE database
are in full correspondence, we manually labeled the clothed 3D body models. We combined both training
datasets and learned a global feature descriptor using a network structure that is well-suited for the unified
treatment of different training data (bodies, clothed subjects).
Similar to the unified embedding approach of FaceNet [128], we extend the AlexNet [72] classification
network to learn distinctive feature vectors for different subregions of the human body. Traditional clas-
sification neural networks tend to separate the embedding of surface points lying in different but nearby
classes. Thus, using such learned feature descriptors for correspondence matching between deformed
surfaces often results in significant outliers at the segmentation boundaries. In this paper, we introduce
a technique based on repeated mesh segmentations to produce smoother embeddings into feature space.
This technique maps shape points that are geodesically close on the surface of their corresponding 3D
model to nearby points in the feature space. As a result, not only are outliers considerably reduced during
deformable shape matching, but we also show that the amount of training data can be drastically reduced
compared to conventional learning methods. While the performance of our dense correspondence com-
putation is comparable to state of the art techniques between two full models, we also demonstrate that
learning shape priors of clothed subjects can yield highly accurate matches between partial-to-full and
partial-to-partial shapes. Our examples include fully clothed individuals in a variety of complex poses. We also demonstrate the effectiveness of our method on a template-based performance capture application that
uses a single RGB-D camera as input. Our contributions are as follows:
• Ours is the first approach that finds accurate and dense correspondences between clothed human body shapes with partial input data and is considerably more efficient than traditional non-rigid registration techniques.
• We develop a new deep convolutional neural network architecture that learns a smooth embedding using a multi-segmentation technique on human shape priors. We also show that this approach can significantly reduce the amount of training data.
• We describe a unified learning framework that combines training data sets from human body shapes in different poses and a database of clothed subjects in a canonical pose.
2.2 Related Work
Finding shape correspondences is a well-studied area of geometry processing. However, the variation in
human clothing, pose, and topological changes induced by different poses make applying existing methods
very difficult.
The main computational challenge is that the space of possible correspondences between two surfaces
is very large: discretizing both surfaces using n points and attempting to naively match them is an O(n!)
calculation. The problem becomes tractable given enough prior knowledge about the space of possible
deformations; for instance if the two surfaces are nearly-isometric, both surfaces can be embedded in a
higher-dimensional Euclidean space where they can be aligned rigidly [39]. Other techniques can be used
if the mapping satisfies specific properties, e.g. being conformal [83, 69]. Kim et al [70] generalize this
idea by searching over a carefully-chosen polynomial space of blended conformal maps, but this method
does not extend to matching partial surfaces or to surfaces of nonzero genus.
Another common approach is to formulate the correspondence problem variationally: to define an
energy function on the space of correspondences that measures their quality, which is then maximized.
One popular objective is to measure preservation of pair-wise geodesic [16] or diffusion [17] distances.
Such global formulations often lead to NP-hard combinatorial optimization problems for which various
relaxation schemes are used, including spectral relaxation [75], Markov random fields [4], and convex re-
laxation [153, 27]. These methods require that the two surfaces are nearly-isometric, so that these distances
are nearly-preserved; this assumption is invalid for human motion involving topological changes.
A second popular objective is to match selected subsets of points on the two surfaces with similar
feature descriptors [121, 154, 84, 6]. However, finding descriptors that are both invariant to typical human
and clothing deformations and also robust to topological changes remains a challenge. Local geometric
descriptors, such as spin images [63] or curvature [110] have proven to be insufficient for establishing
reliable correspondences as they are extrinsic and fragile under deformations. A recent focus is on spectral
shape embedding and induced descriptors [62, 136, 103, 6, 96]. These descriptors are effective on shapes
that undergo near-isometric deformations. However, due to the sensitivity of spectral operators to partial
data and topological noise, they are not applicable to partial 3D scans.
A natural idea is to replace ad-hoc geometric descriptors with those learned from data. Several recent
papers [87, 50, 161, 45] have successfully used this idea for finding correspondences between 2D images,
and have shown that descriptors learned from deep neural networks are significantly better than generic
pixel-wise descriptors in this context. Inspired by these methods, we propose to use deep neural networks
to compute correspondence between full/partial scans of clothed humans. In this manner, our work is
(Figure 2.2 diagram: offline network training on the SCAPE, MIT, and Yobi3D data sets with a shared feature-extraction tower producing per-pixel descriptors and multiple classification predictions driving the loss; online correspondence computation extracts per-pixel descriptors from an input depth map, averages per-vertex descriptors over depth maps rendered from a full model, and outputs correspondences via closest features.)
Figure 2.2: We train a neural network which extracts a feature descriptor and predicts the corresponding
segmentation label on the human body surface for each point in the input depth maps. We generate per-
vertex descriptors for 3D models by averaging the feature descriptors in their rendered depth maps. We
use the extracted features to compute dense correspondences.
similar to Fischer et al [40], which applies deep learning to the problem of solving for the optical flows
between images; unlike Fischer, however, our method finds correspondences between two human shapes
even if there is little or no coherence between the two shapes. Regression forests [139, 109] can also be
used to infer geometric locations from depth images; however, such methods have not yet achieved accuracy comparable to state-of-the-art registration methods on full or partial data [27].
2.3 Problem Statement and Overview
We introduce a deep learning framework to compute dense correspondences across full or partial human
shapes. We train our system using depth maps of humans in arbitrary pose and with varying clothing.
Given depth maps of two humans $I_1, I_2$, our goal is to determine which two regions $R_i \subset I_i$ of the depth maps come from corresponding parts of the body, and to find the correspondence map $\phi : R_1 \to R_2$ between them. Our strategy for doing so is to formulate the correspondence problem first as a classification problem: we first learn a feature descriptor $f : I \to \mathbb{R}^d$ which maps each pixel in a single depth image to a feature vector. We then utilize these feature descriptors to establish correspondences across depth maps (see Figure 2.2). We desire the feature vector to satisfy two properties:
1. $f$ depends only on the pixel's location on the human body, so that if two pixels are sampled from the same anatomical location on depth scans of two different humans, their feature vectors should be nearly identical, irrespective of pose, clothing, body shape, and the angle from which the depth image was captured;
2. $\|f(p) - f(q)\|$ is small when $p$ and $q$ represent nearby points on the human body, and large for distant points.
The literature takes two different approaches to enforcing these properties when learning descriptors using
convolutional neural networks. Direct methods include in their loss functions terms penalizing failure of
these properties (by using e.g. Siamese or triplet-loss energies). However, it is not trivial how to sample
a dense set of training pairs or triplets that can all contribute to training [128]. Indirect methods instead
optimize the network architecture to perform classification. The network consists of a descriptor extraction
tower and a classification layer, and peeling off the classification layer after training leaves the learned
descriptor network (for example, many applications use descriptors extracted from the second-to-last layer
of the AlexNet.) This approach works since classification networks tend to assign similar (dissimilar)
descriptors to the input points belonging to the same (different) class, and thus satisfy the above properties
implicitly. We take the indirect approach, as our experiments suggest that an indirect method that uses an
ensemble of classification tasks has better performance and computational efficiency.
2.3.1 Descriptor learning as ensemble classification
There are two challenges to learning a feature descriptor for depth images of human models using this
indirect approach. First, the training data is heterogeneous: between different human models, it is only
possible to obtain a sparse set of key point correspondences, while for different poses of the same person,
we may have dense pixel-wise correspondences (e.g., SCAPE [5]). Second, smoothness of descriptors
learned through classification is not explicitly enforced. Even though some classes tend to be closer to
each other than the others in reality, the network treats all classes equally.
To address both challenges, we learn per-pixel descriptors for depth images by first training a network
to solve a group of classification problems, using a single feature extraction tower shared by the different
classification tasks. This strategy allows us to combine different types of training data as well as to design classification tasks for various objectives. Formally, suppose there are $M$ classification problems $C_i$, $1 \le i \le M$. Denote the parameters to be learned in classification problem $C_i$ as $(w_i, w)$, where $w_i$ and $w$ are the parameters corresponding to the classification layer and the descriptor extraction tower, respectively. We define descriptor learning as minimizing a combination of the loss functions of all classification problems:
$$\{w_i^\star\}, w^\star = \underset{\{w_i\}, w}{\arg\min} \sum_{i=1}^{M} l(w_i, w). \qquad (2.1)$$
After training, we take the optimized descriptor extraction tower as the output. It is easy to see that when $w_i, w$ are given by convolutional neural networks, Eqn. 2.1 can be effectively optimized using stochastic gradient descent through back-propagation.
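For concreteness, a minimal PyTorch-style sketch of this objective is shown below. It assumes a hypothetical `tower` module that maps a depth map to per-pixel d-dimensional descriptors, attaches one 1x1 convolutional classification head per task, and sums per-task cross-entropy losses as in Eqn. 2.1; the exact layer shapes, masking, and loss weighting are assumptions, not the dissertation's exact implementation.

```python
import torch
import torch.nn as nn

class EnsembleClassifier(nn.Module):
    """Shared descriptor tower with one classification head per task (sketch)."""

    def __init__(self, tower, descriptor_dim, num_classes_per_task):
        super().__init__()
        self.tower = tower  # maps (B, 1, H, W) depth maps to (B, d, H, W) descriptors
        self.heads = nn.ModuleList(
            [nn.Conv2d(descriptor_dim, c, kernel_size=1) for c in num_classes_per_task]
        )

    def forward(self, depth, task_id):
        feat = self.tower(depth)          # per-pixel descriptors, shared across tasks
        return self.heads[task_id](feat)  # logits for the selected classification task


def ensemble_loss(model, batches):
    """Sum of cross-entropy losses over all classification problems (cf. Eqn. 2.1).

    `batches` is a list of (depth, labels, mask) tuples, one per task; `labels` holds a
    class index per pixel (long tensor) and `mask` marks the annotated pixels.
    """
    ce = nn.CrossEntropyLoss(reduction="none")
    total = 0.0
    for task_id, (depth, labels, mask) in enumerate(batches):
        logits = model(depth, task_id)          # (B, C_i, H, W)
        loss = ce(logits, labels)               # (B, H, W)
        total = total + (loss * mask).sum() / mask.sum().clamp(min=1)
    return total
```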
To address the challenge of heterogeneous training sets, we include two types of classification tasks in our ensemble: one for classifying key points, used for inter-subject training where only sparse ground-truth correspondences are available, and one for classifying dense pixel-wise labels, e.g., by segmenting models into patches (see Figure 2.3), used for intra-subject training. Both contribute to the learning of the descriptor extraction tower.
To ensure descriptor smoothness, instead of introducing additional terms in the loss function, we pro-
pose a simple yet effective strategy that randomizes the dense-label generation procedure. Specifically, as
shown in Figure 2.3, we consider multiple segmentations of the same person, and introduce a classifica-
tion problem for each. Clearly, identical points will always be associated with the same label and far-apart
points will be associated with different labels. For the remaining points, the number of times they are assigned the same label is related to the distance between them. Consequently, the similarity of the feature descriptors is correlated with the distance between the points on the human body, resulting in a smooth embedding that satisfies the desired properties discussed at the beginning of this section.
(Figure 2.3 panels: training mesh and three example segmentations.)
Figure 2.3: To ensure smooth descriptors, we define a classification problem for multiple segmentations
of the human body. Nearby points on the body are likely to be assigned the same label in at least one
segmentation.
layer          0      1     2    3     4    5      6     7    8      9     10
               image  conv  max  conv  max  2conv  conv  max  2conv  int   conv
filter-stride  -      11-4  3-2  5-1   3-2  3-1    3-1   3-2  1-1    -     3-1
channel        1      96    96   256   256  384    256   256  4096   4096  16
activation     -      relu  lrn  relu  lrn  relu   relu  idn  relu   idn   relu
size           512    128   64   64    32   32     32    16   16     128   512
num            1      1     4    4     16   16     16    64   64     1     1
Table 2.1: The end-to-end network architecture generates a per-pixel feature descriptor and a classification label for all pixels in a depth map simultaneously. From top to bottom: the layer type, the filter size and stride, the number of filters, the type of activation function, the size of the image after filtering, and the number of copies reserved for up-sampling.
2.3.2 Correspondence Computation
Our trained network can be used to extract per-pixel feature descriptors for depth maps. For full or partial
3D scans, we first render depth maps from multiple viewpoints and compute a per-vertex feature descriptor
by averaging the per-pixel descriptors of the depth maps. We use these descriptors to establish correspon-
dences simply by a nearest neighbor search in the feature space (see Figure 2.2).
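A sketch of this step is shown below, assuming a hypothetical `descriptor_net` callable, pre-rendered depth maps, and per-pixel visibility maps (`pixel_to_vertex`) produced by the renderer; correspondences are plain nearest neighbors in descriptor space. The helper names and array layouts are assumptions for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree

def per_vertex_descriptors(vertex_count, depth_maps, pixel_to_vertex, descriptor_net, dim=16):
    """Average per-pixel descriptors over all rendered views in which a vertex is visible.

    depth_maps:      list of (H, W) depth images rendered from multiple viewpoints
    pixel_to_vertex: list of (H, W) integer maps with the visible vertex id per pixel (-1 if none)
    descriptor_net:  callable mapping a depth map to an (H, W, dim) per-pixel descriptor image
    """
    acc = np.zeros((vertex_count, dim), dtype=np.float64)
    cnt = np.zeros(vertex_count, dtype=np.int64)
    for depth, vid in zip(depth_maps, pixel_to_vertex):
        desc = descriptor_net(depth)                 # (H, W, dim)
        visible = vid >= 0
        np.add.at(acc, vid[visible], desc[visible])  # accumulate descriptors per vertex
        np.add.at(cnt, vid[visible], 1)
    return acc / np.maximum(cnt, 1)[:, None]

def dense_correspondences(desc_source, desc_target):
    """For each source vertex, return the index of the nearest target vertex in feature space."""
    tree = cKDTree(desc_target)
    _, nearest = tree.query(desc_source, k=1)
    return nearest
```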
For applications that require deforming one surface to align with the other, we can fit the correspon-
dences described in this paper into any existing deformation method to generate the alignment. In this
paper, we use the efficient as-rigid-as possible deformation model described in [77].
2.4 Implementation Details
We first discuss how we generate the training data and then describe the architecture of our network.
2.4.1 Training Data Generation
Collecting 3D Shapes. To generate the training data for our network, we collected 3D models from three
major resources: the SCAPE [5], the MIT [147], and the Yobi3D [2] data sets. The SCAPE database
provides 71 registered meshes of one person in different poses. The MIT dataset contains the animation
sequences of three different characters. Similar to SCAPE, the models of the same person have dense
ground truth correspondences. We used all the animation sequences except for the samba and swing
ones, which we reserve for evaluation. Yobi3D is an online repository that contains a diverse set of 2000
digital characters with varying clothing. Note that the Yobi3D dataset covers the shape variability in local
geometry, while the SCAPE and the MIT datasets cover the variability in pose.
Simulated Scans. We render each model from 144 different viewpoints to generate training depth images. We use a depth image resolution of 512 × 512 pixels, where the rendered human character covers roughly half of the height of the depth image. This setup is comparable to those captured from commercial depth cameras; for instance, the Kinect One (v2) camera provides a depth map with resolution 512 × 424, where a human of height 1.7 meters standing 2.5 meters away from the camera has a height of around 288 pixels in the depth image.
Key-point annotations. We employ human experts to annotate 33 key points across the input models as
shown in Figure 2.4. These key points cover a rich set of salient points that are shared by different human
models (e.g. left shoulder, right shoulder, left hip, right hip etc.). Note that for shapes in the SCAPE
and MIT datasets, we only annotate one rest-shape and use the ground-truth correspondences to propagate
annotations. The annotated key points are then propagated to simulated scans, providing 33 classes for
training.
(Figure 2.4 panels: models from the SCAPE, MIT, and Yobi3D datasets.)
Figure 2.4: Sparse key point annotations of 33 landmarks across clothed human models of different
datasets.
500-patch segmentation generation. For each distinct model in our collection, we divide it into multiple 500-patch segmentations. Each segmentation is generated by randomly picking 10 points on the model and then adding the remaining points via farthest point sampling. In total we use 100 pre-computed segmentations. Each such segmentation provides 500 classes for depth scans of the same person (in different poses).
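The sketch below illustrates one way to generate such a segmentation, given only mesh vertex positions. It uses Euclidean rather than geodesic distances for brevity, and assigns each vertex to its nearest sampled center; both are simplifying assumptions and not necessarily the exact procedure used here.

```python
import numpy as np
from scipy.spatial.distance import cdist

def random_segmentation(vertices, num_patches=500, num_random_seeds=10, rng=None):
    """Generate one random multi-patch segmentation of a mesh (sketch).

    Picks `num_random_seeds` centers at random, grows the center set to `num_patches`
    by farthest point sampling, then labels every vertex with its nearest center.
    Euclidean distances are used for brevity; geodesic distances would follow the
    surface more faithfully.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(vertices)
    centers = list(rng.choice(n, size=num_random_seeds, replace=False))

    # distance from every vertex to its closest center chosen so far
    d = cdist(vertices, vertices[centers]).min(axis=1)
    while len(centers) < num_patches:
        nxt = int(np.argmax(d))                       # farthest vertex from all current centers
        centers.append(nxt)
        d = np.minimum(d, np.linalg.norm(vertices - vertices[nxt], axis=1))

    # class label = index of the nearest of the num_patches centers
    return cdist(vertices, vertices[centers]).argmin(axis=1)
```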
2.4.2 Network Design and Training
The neural network structure we use for training consists of a descriptor extraction tower and a classifica-
tion module.
Extraction tower. The descriptor extraction tower takes a depth image as input and extracts a d-dimensional (d = 16 in this paper) descriptor vector for each pixel. A popular choice is to let the network extract each pixel descriptor from a neighboring patch (cf. [50, 161]). However, such a strategy is too expensive in our setting, as we would have to compute descriptors for tens of thousands of patches per scan.
Our strategy is instead to design a network that takes the entire depth image as input and simultaneously outputs a descriptor for each pixel. Compared with the patch-based strategy, the computation of patch descriptors is largely shared among adjacent patches, making descriptor computation fairly efficient at test time.
Table 2.1 describes the proposed network architecture. The first 7 layers are adapted from the AlexNet
architecture. Specifically, the first layer downsamples the input image by a factor of 4. This downsampling
not only makes the computations faster and more memory efficient, but also removes salt-and-pepper noise
which is typical in the output from depth cameras. Moreover, we adapt the strategy described in [129] to
modify the pooling and inner product layers so that we can recover the original image resolution through
upsampling. The final layer performs upsampling by using neighborhood information in a 3-by-3 window.
This upsampling implicitly performs linear smoothing between the descriptors of neighboring pixels. It is
possible to further smooth the descriptors of neighboring pixels in a post-processing step, but as shown in
our results, this is not necessary since our network is capable of extracting smooth and reliable descriptors.
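As a rough illustration, the sketch below is a simplified, fully convolutional stand-in for the extraction tower in PyTorch: AlexNet-style convolutions downsample the 512x512 depth map, and bilinear interpolation restores full resolution to produce a 16-dimensional descriptor per pixel. The padding choices and the bilinear upsampling are simplifications; they do not reproduce the exact shift-and-stitch style upsampling of Table 2.1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DescriptorTower(nn.Module):
    """Simplified per-pixel descriptor extractor (loose stand-in for Table 2.1).

    Takes a (B, 1, 512, 512) depth map and returns (B, 16, 512, 512) descriptors.
    The real network recovers full resolution with a shift-and-stitch style scheme;
    here we simply upsample bilinearly for clarity.
    """

    def __init__(self, descriptor_dim=16):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 96, kernel_size=11, stride=4, padding=5), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            nn.Conv2d(96, 256, kernel_size=5, stride=1, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            nn.Conv2d(256, 384, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, descriptor_dim, kernel_size=3, stride=1, padding=1),
        )

    def forward(self, depth):
        feat = self.features(depth)  # coarse per-pixel features
        return F.interpolate(feat, size=depth.shape[-2:], mode="bilinear", align_corners=False)
```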
Classification module. The classification module receives the per-pixel descriptors and predicts a class
for each annotated pixel (i.e., either key points in the 33-class case or all pixels in the 500-class case). Note
that we introduce one layer for each segmentation of each person in the SCAPE and the MIT datasets and
one shared layer for all the key points. Similar to AlexNet, we employ softmax when defining the loss
function.
Training. The network is trained using a variant of stochastic gradient descent. Specifically, we ran-
domly pick a task (i.e., key points or dense labels) for a random partial scan and feed it into the network
for training. If the task is dense labels, we also randomly pick a segmentation among all possible segmen-
tations. We run 200,000 iterations when tuning the network, with a batch size of 128 key points or dense
labels which may come from multiple datasets.
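One possible realization of this training loop is sketched below, assuming hypothetical batch samplers `sample_keypoint_batch` and `sample_dense_batch` and the ensemble model sketched earlier in Section 2.3.1; the 50/50 task split is an assumption, as the exact task-sampling probabilities are not specified.

```python
import random
import torch

def train(model, optimizer, sample_keypoint_batch, sample_dense_batch,
          num_segmentations=100, iterations=200_000):
    """Stochastic training over the ensemble of classification tasks (sketch).

    sample_keypoint_batch(): returns (depth, labels, mask, task_id) for the shared
                             33-key-point task.
    sample_dense_batch(seg): returns the same tuple for the 500-patch task induced
                             by segmentation index `seg`.
    """
    loss_fn = torch.nn.CrossEntropyLoss(reduction="none")
    for _ in range(iterations):
        if random.random() < 0.5:                   # key-point task
            depth, labels, mask, task_id = sample_keypoint_batch()
        else:                                       # dense-label task, random segmentation
            seg = random.randrange(num_segmentations)
            depth, labels, mask, task_id = sample_dense_batch(seg)

        logits = model(depth, task_id)
        loss = (loss_fn(logits, labels) * mask).sum() / mask.sum().clamp(min=1)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```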
2.5 Results
We evaluate our method extensively on various real and synthetic datasets, naked and clothed subjects, as
well as full and partial matching for challenging examples as illustrated in Figure 2.5. The real capture
data examples (last column) are obtained using a Kinect One (v2) RGB-D sensor and demonstrate the
effectiveness of our method in real-life scenarios. Each partial scan is a single depth map frame of 512 × 424 pixels, and the full template model is obtained using the non-rigid 3D reconstruction algorithm of [80]. All examples include complex poses (side views and bent postures), challenging garments (dresses and vests), and props (backpacks and hats).
We use 4 different synthetic datasets to provide quantitative error visualizations of our method using
the ground truth models. The 3D models from both SCAPE and MIT databases are part of the training data
of our neural network, while the FAUST and Mixamo models [1] are not used for training. The SCAPE
and FAUST data sets are exclusively naked human body models, while the MIT and Mixamo models are
clothed subjects. For all synthetic examples, the partial scans are generated by rendering depth maps from
a single camera viewpoint. The Adobe Fuse and Mixamo tools [1] were used to procedurally model realistic characters and generate complex animation sequences through a motion library provided by the software.
The correspondence colorizations validate the accuracy, smoothness, and consistency of our dense
matching computation for extreme situations, including topological variations between source and target.
While the correspondences are accurately determined in most surface regions, we often observe larger
errors on depth map boundaries, hands, and feet, as the segmented clusters are slightly too large in those
areas. Notice how the correspondences between front and back views are being correctly identified in
the real capture 1 example for the full-to-partial matchings. Popular skeleton extraction methods from
single-view 3D captures such as [131, 151, 144] often have difficulties resolving this ambiguity.
Evaluation on the FAUST data set. We show that our deep network structure for computing dense
correspondences achieves state-of-the-art performance on establishing correspondences between the intra-
and inter-subject pairs from the FAUST dataset [14]. For each 3D scan in this dataset, we compute a
per-vertex feature descriptor by first rendering depth maps from multiple viewpoints and averaging the
per-pixel feature descriptors. Correspondences are then established by nearest neighbor search in the
feature space. The accuracy of this direct method is already significantly better than all existing global
shape matching methods (that do not require initial poses as input), and is comparable to the state-of-the-
art non-rigid registration method proposed by Chen et al. [27], which uses the initial poses of the models
to refine correspondences. To make a fair comparison with Chen et al. [27], we use an off-the-shelf
non-rigid registration algorithm [78] to refine our results. We initialize the registration algorithm with
the correspondences established with the nearest-neighbor search and refine their positions after non-rigid
alignment. Results obtained with and without this refinement step are reported in Figure 2.6 and Table 2.2.
It is worth mentioning that per-vertex feature descriptors for each scan are pre-computed. Thus for each
pair of scans, we can obtain dense correspondences in less than a second. Though our method is designed
for clothed human subjects, our algorithm is far more efficient than all other known methods which rely
on local or global geometric properties.
Comparisons. General surface matching techniques which are not restricted to naked human body
shapes are currently the most suitable solutions for handling subjects with clothing. Though robust to par-
tial input scans such as single-view RGB-D data, cutting edge non-rigid registration techniques [58, 77]
often fail to converge for large scale deformations without additional manual guidance as shown in Fig-
ure 2.7. When both source and target shapes are full models, an automatic mapping between shapes with
considerable deformations becomes possible as shown in [70, 83, 114, 27]. We compare our method with
the recent work of Chen et al. [27] and compute correspondences between pairs of scans sampled from
the same (intra-subject) and different (inter-subject) subjects. Chen et al. evaluate a rich set of methods
on randomly sampled pairs from the FAUST database [14] and report the state of the art results for their
method. For a fair comparison, we also evaluate our method on the same set of pairs. As shown in Ta-
ble 2.3, our method improves the average accuracy for both the intra- and the inter-subject pairs. Note
that by using simple AlexNet structure, we can easily achieve an average accuracy of 10 cm. However, if
multiple segmentations are not adapted to enforce smoothness, the worst average error can be up to 30 cm
in our experiments.
(a) Accuracy on intra-subject pairs

method         AE (cm)   worst AE   10cm-recall
CNN-S          2.00      9.98       0.975
CNN            5.65      18.67      0.918
CO [27]        4.49      10.96      0.907
RF [114]       13.60     83.90      0.658
BIM [70]       14.99     80.40      0.615
Möbius [83]    22.26     69.26      0.548
ENC [115]      23.60     51.32      0.385
C2FSym [124]   26.87     100.23     0.335
EM [123]       30.11     95.42      0.293
C2F [122]      23.63     73.89      0.334
GMDS [16]      28.94     91.84      0.300
SM [108]       28.81     68.42      0.326

(b) Accuracy on inter-subject pairs

method         AE (cm)   worst AE   10cm-recall
CNN-S          2.35      10.12      0.972
CNN            5.73      18.03      0.917
CO [27]        5.95      14.18      0.858
RF [114]       17.36     86.76      0.539
BIM [70]       30.58     70.02      0.300
Möbius [83]    26.92     79.43      0.435
ENC [115]      29.29     57.28      0.303
C2FSym [124]   25.89     96.46      0.359
EM [123]       31.25     90.74      0.235
C2F [122]      25.51     90.62      0.277
GMDS [16]      35.06     91.21      0.188
SM [108]       32.66     75.38      0.240

Table 2.2: Evaluation on the FAUST dataset. CNN is the result obtained by performing nearest-neighbor search on descriptors produced by our network; CNN-S is the result after non-rigid registration. Data for algorithms other than ours are provided by Chen et al. [27]. (a): results for intra-subject pairs; (b): results for inter-subject pairs. For each method we report the average error on all pairs (AE, in centimeters), the worst average error among all pairs (worst AE), and the fraction of correspondences that are within 10 centimeters of the ground truth (10cm-recall).
Application. We demonstrate the effectiveness of our correspondence computation for a template-based performance capture application using a depth map sequence captured from a single RGB-D sensor.
The complete geometry and motion is reconstructed in every sequence by deforming a given template
model to match the partial scans at each incoming frame of the performance. Unlike existing meth-
ods [137, 77, 148, 140] which track a template using the previous frame, we always deform the template
model from its canonical rest pose using the computed full-to-partial correspondences in order to avoid
potential drifts. Deformation is achieved using the robust non-rigid registration algorithm presented in Li
method        intra AE   intra WE   inter AE   inter WE
Chen et al.   4.49       10.96      5.95       14.18
our method    2.00       9.98       2.35       10.12

Table 2.3: We compare our method to the recent work of Chen et al. [27] by computing correspondences for intra- and inter-subject pairs from the FAUST dataset. We provide the average error on all pairs (AE, in centimeters) and the average error on the worst pair for each technique (worst AE, in centimeters). While our results may introduce worse WE, overall accuracies are improved in both cases.
et al. [77], where the closest point correspondences are replaced with the ones obtained from the presented
method. Even though the correspondences are computed independently in every frame, we observe a tem-
porally consistent matching during smooth motions without enforcing temporal coherency as with existing
performance capture techniques as shown in Figure 2.8. Since our deep learning framework does not re-
quire source and target shapes to be close, we can effectively handle large and instantenous motions. For
the real capture data, we visualize the reconstructed template model at every frame and for the synthetic
model we show the error to the ground truth.
Performance. We perform all our experiments on a 6-core Intel Core i7-5930K Processor with 3.9 GHz
and 16GB RAM. Both offline training and online correspondence computation run on an NVIDIA GeForce
TITAN X (12GB GDDR5) GPU. While the complete training of our neural network takes about 250
hours of computation, the extraction of all the feature descriptors never exceeds 1 ms for each depth map.
The subsequent correspondence computation with these feature descriptors varies between 0.5 and 1 s,
depending on the resolution of our input data.
(Figure 2.5 panels: source, target, and error visualizations for full-to-full, full-to-partial, and partial-to-partial matchings on the SCAPE, FAUST, MIT, and Mixamo datasets and two real capture sequences; error color bar: 0 to 20 cm.)
Figure 2.5: Our system can handle full-to-full, partial-to-full, and partial-to-partial matchings between full
3D models and partial scans generated from a single depth map. We evaluate our method on various real
and synthetic datasets. In addition to correspondence colorizations for the source and target, we visualize
the error relative to the synthetic ground truth.
(Figure 2.6 plots: (a) cumulative error distribution, intra-subject; (b) cumulative error distribution, inter-subject; (c) average error for each intra-subject pair; (d) average error for each inter-subject pair. Compared methods: CNN-S, CNN, CO, BIM, Möbius, RF, ENC, C2FSym, EM, C2F, GMDS, SM. Axes are in centimeters and % of correspondences.)
Figure 2.6: Evaluation on the FAUST dataset. CNN is the result obtained by performing nearest neighbor
search on descriptors produced by our network. CNN-S is the result after non-rigid registration. Data for
algorithms other than ours are provided by Chen et al. [27]. Left: Results for intra-subject pairs. Right:
Results for inter-subject pairs. Top: Cumulative error distribution for each method, in centimeters. Bottom:
Average error for each pair, sorted within each method independently.
(Figure 2.7 panels: source / target, [Huang et al. 08], [Li et al. 09], and our method.)
Figure 2.7: We compare our method to other non-rigid registration algorithms and show that larger defor-
mations between a full template and a partial scan can be handled.
(Figure 2.8 panels: input depth map, dense correspondence, and dynamic geometry reconstruction for a synthetic Mixamo sequence and a real capture; error colorbar from 0 to 20 cm.)
Figure 2.8: We perform geometry and motion reconstruction by deforming a template model to captured
data at each frame using the correspondences computed by our method. Even though we do not enforce
temporal coherency explicitly, we obtain faithful and smooth reconstructions. We show examples on both real and synthetic data.
Chapter 3
Facial Texture Inference
(Figure 3.1 panels: input picture, output albedo map, renderings, and zoomed renderings.)
Figure 3.1: We present an inference framework based on deep neural networks for synthesizing photo-
realistic facial texture along with 3D geometry from a single unconstrained image. We can successfully
digitize historic figures that are no longer available for scanning and produce high-fidelity facial texture
maps with mesoscopic skin details.
3.1 Introduction
Until recently, the digitization of photorealistic faces has only been possible in professional studio set-
tings, typically involving sophisticated appearance measurement devices [152, 92, 44, 10, 43] and care-
fully controlled lighting conditions. While such a complex acquisition process is acceptable for production
purposes, the ability to build high-end 3D face models from a single unconstrained image could widely
impact new forms of immersive communication, education, and consumer applications. With virtual and
augmented reality becoming the next generation platform for social interaction, compelling 3D avatars
could be generated with minimal effort and puppeteered through facial performances [79, 102]. Within
the context of cultural heritage, iconic and historical personalities could be restored to life in captivating
(Figure 3.2 diagram: input image → initial face model fitting → complete albedo map (LF) and partial albedo map (HF) → texture analysis with a face database yielding feature correlations → texture synthesis → complete albedo map (HF) → output rendering.)
Figure 3.2: Overview of our texture inference framework. After an initial low-frequency (LF) albedo map
estimation, we extract partial high-frequency (HF) components from the visible areas using texture anal-
ysis. Mid-layer feature correlations are then reconstructed to produce a complete high-frequency albedo
map via texture synthesis.
3D digital forms from archival photographs. For example: can we use Computer Vision to bring back our
favorite boxing legend, Muhammad Ali, and relive his greatest moments in 3D?
Capturing accurate and complete facial appearance properties from images in the wild is a fundamen-
tally ill-posed problem. Often the input pictures have limited resolution, only a partial view of the subject is
available, and the lighting conditions are unknown. Most state-of-the-art monocular facial capture frame-
works [142, 117] rely on linear PCA models [12] and important appearance details for photorealistic ren-
dering such as complex skin pigmentation variations and mesoscopic-level texture details (freckles, pores,
stubble hair, etc.), cannot be modeled. Despite recent efforts in hallucinating details using data-driven
techniques [85, 98] and deep learning inference [35], it is still not possible to reconstruct high-resolution
textures while preserving the likeness of the original subject and ensuring photorealism.
From a single unconstrained image (potentially low resolution), our goal is to infer a high-fidelity tex-
tured 3D model which can be rendered in any virtual environment. The high-resolution albedo texture
map should match the resemblance of the subject while reproducing mesoscopic facial details. Without
capturing advanced appearance properties (bump maps, specularity maps, BRDFs, etc.), we want to show
that photorealistic renderings are possible using a reasonable shape estimation, a production-level render-
ing framework [135], and, most crucially, a high-fidelity albedo texture. The core challenge consists of
developing a facial texture inference framework that can capture the immense appearance variations of
faces and synthesize realistic high-resolution details, while maintaining fidelity to the target.
Inspired by the latest advancement in neural synthesis algorithms for style transfer [42, 41], we adopt
a factorized representation of low-frequency and high-frequency albedo as illustrated in Figure 3.2. While
the low-frequency map is simply represented by a linear PCA model (Section 3.3.1), we characterize high-
frequency texture details for mesoscopic structures as mid-layer feature correlations of a deep convolu-
tional neural network for general image recognition [133]. Partial feature correlations are first analyzed on
an incomplete texture map extracted from the unconstrained input image. We then infer complete feature
correlations using a convex combination of feature correlations obtained from a large database of high-
resolution face textures [91] (Section 3.3.2). A high-resolution albedo texture map can then be synthesized
by iteratively optimizing an initial low-frequency albedo texture to match these feature correlations via
back-propagation and quasi-Newton optimization (Section 3.3.3). Our high-frequency detail representa-
tion with feature correlations captures high-level facial appearance information at multiple scales, and
ensures plausible mesoscopic-level structures in their corresponding regions. The blending technique with
convex combinations in feature correlation space not only handles the large variation and non-linearity of
facial appearances, but also generates high-resolution texture maps, which is not possible with existing
end-to-end deep learning frameworks [35]. Furthermore, our method uses the publicly available and pre-
trained deep convolutional neural network, VGG-19 [133], and requires no further training. We make the
following contributions:
We introduce an inference method that can generate high-resolution albedo texture maps with plau-
sible mesoscopic details from a single unconstrained image.
We show that semantically plausible fine-scale details can be synthesized by blending high-resolution
textures using convex combinations of feature correlations obtained from mid-layer deep neural net
filters.
We demonstrate using a crowdsourced user study that our photorealistic results are visually compa-
rable to ground truth measurements from a cutting-edge Light Stage capture device [43, 141].
We introduce a new dataset of 3D face models with high-fidelity texture maps based on high-
resolution photographs of the Chicago Face Database [91], which will be publicly available to the
research community.
3.2 Related Work
Facial Appearance Capture. Specialized hardware for facial capture, such as the Light Stage, has been
introduced by Debevec et al. [28] and improved over the years [92, 43, 44], with full sphere LEDs and
multiple cameras to measure an accurate reflectance field. Though restricted to studio environments,
production-level relighting and appearance measurements (bump maps, specular maps, subsurface scat-
tering etc.) are possible. Weyrich et al. [152] adopted a similar system to develop a photorealistic skin
reflectance model for statistical appearance analysis and meso-scale texture synthesis. A contact-based
apparatus for path-based microstructure scale measurement using silicone mold material has been pro-
posed by Haro et al. [51]. Optical acquisition methods have also been suggested to produce full-facial
microstructure details [49] and skin microstructure deformations [99]. As an effort to make facial digitiza-
tion more deployable, monocular systems [138, 60, 21, 130, 142] that record multiple views have recently
been introduced to generate seamlessly integrated texture maps for virtual avatars. When only a single
input image is available, Kemelmacher-Shlizerman and Basri [67] proposed a shape-from-shading frame-
work that produces an albedo map using a Lambertian reflectance model. Barron and Malik [9] introduced
a statistical approach to estimate shape, illumination, and reflectance from arbitrary objects. Li et al. [76]
later presented an intrinsic image decomposition technique to separate diffuse and specular components
for faces. For all these methods, only textures from the visible regions can be computed and the resolution
is limited by the input.
Linear Face Models. Turk and Pentland [146] introduced the concept of Eigenfaces for face recognition
and were one of the first to represent facial appearances as linear models. In the context of facial tracking,
Edwards et al. [36] developed the widely used active appearance models (AAM) based on linear combina-
tions of shape and appearance, which has resulted in several important subsequent works [3, 32, 97]. The
seminal work on morphable face models of Blanz and Vetter [12] has put forward an analysis-by-synthesis
framework for textured 3D face modeling and lighting estimation. Since their Principal Component Anal-
ysis (PCA)-based face model is built from a database of 3D face scans, a complete albedo texture map
can be estimated robustly from a single image. Several extensions have been proposed leveraging internet
images [66] and large-scale 3D facial scans [15]. PCA-based models are fundamentally limited by their
linear assumption and fail to capture mesoscopic details as well as large variations in facial appearances
(e.g., hair texture).
Texture Synthesis. Non-parametric synthesis algorithms [38, 149, 37, 73] have been developed to syn-
thesize repeating structures using samples from small patches, while ensuring local consistency. These
general techniques only work for stochastic textures such as micro-scale skin structures [51], but are not
directly applicable to mesoscopic face details due to the lack of high-level visual cues about facial configu-
rations. The super resolution technique of Liu et al. [85] hallucinates high-frequency content using a local
patch-based Markov network, but the results remain relatively blurry and cannot predict missing regions.
Mohammed et al. [98] introduced a statistical framework for generating novel faces based on randomized
patches. While the generated faces look realistic, noisy artifacts appear for high-resolution images. Facial
detail enhancement techniques based on statistical models [46] have been introduced to synthesize pores
and wrinkles, but have only been demonstrated in the geometric domain.
Deep Learning Inference. Leveraging the vast learning capacity of deep neural networks and their abil-
ity to capture higher level representations, Duong et al. [35] introduced an inference framework based
on Deep Boltzmann Machines that can handle the large variation and non-linearity of facial appearances
effectively. A different approach consists of predicting non-visible regions based on context information.
Pathak et al. [106] adopted an encoder-decoder architecture trained with a Generative Adversarial Network
(GAN) for general in-painting tasks. However, due to fundamental limitations of existing end-to-end deep
neural networks, only images with very small resolutions can be processed. Gatys et al. [42, 41] recently
proposed a style-transfer technique using deep neural networks that has the ability to seamlessly blend the
content from one high-resolution image with the style of another while preserving consistent structures
of low and high-level visual features. They describe style as mid-layer feature correlations of a convo-
lutional neural network. We show in this work that these feature correlations are particularly effective in
representing high-frequency multi-scale appearance components including mesoscopic facial details.
3.3 Implementation Details
3.3.1 Initial Face Model Fitting
Figure 3.2 shows the overview of our proposed pipeline. Given an unconstrained input image, we use a
PCA-based morphable face model [12] to jointly estimate its facial shape and a low-frequency but com-
plete albedo map, both represented by PCA coefficients. We also compute its rigid head pose, and the
perspective transformation with respect to the camera parameters as intermediate results. A partial high-
frequency albedo map is then extracted from the visible area and represented in the same UV space of the
shape model. The partial high-frequency albedo is used to extract feature correlations in Section 3.3.2, and
the complete low-frequency albedo is used as the initialization for texture synthesis in Section 3.3.3. Our
model fitting framework is built upon the work of Thies et al. [142], and here we briefly describe the main
ideas and highlight key differences.
PCA Model Fitting. The facial albedo map and the face shape are represented as a multi-linear PCA
model with 53,000 vertices and 106,000 faces: We assume Lambertian surface reflectance and model the
illumination using second-order Spherical Harmonics (SH) [112], denoting the illumination $L \in \mathbb{R}^{27}$. We
use the Basel Face Model dataset [107] for the identity and albedo components, and FaceWarehouse [20]
for the expression component provided by [169]. Following the implementation in [142], we solve all the
unknowns with the following objective function:
$E(\chi) = w_c\, E_c(\chi) + w_{lan}\, E_{lan}(\chi) + w_{reg}\, E_{reg}(\chi), \qquad (3.1)$
where $E_c$ denotes the photo-consistency term measuring the error between the synthetic face and the input, $E_{lan}$ is the landmark term for the 2D distances between the synthetic landmarks and the ones detected on the input, and $E_{reg}$ is the regularization term forcing the coefficients to follow a normal distribution. We use $w_c = 1$, $w_{lan} = 10$, and $w_{reg} = 2.5 \times 10^{-5}$ accordingly in the following experiments. $E_{lan}$ and $E_{reg}$ are adopted from [142], but we augment the term $E_c$ with a visibility component:
$E_c(\chi) = \frac{1}{|M|} \sum_{p \in M} \left\| C_{input}(p) - C_{synth}(p) \right\|_2,$
where $C_{input}$ is the input image and $C_{synth}$ the synthesized image. We use $M$, a visibility mask of the facial segmentation estimated using a two-stream deconvolution network introduced by Saito et al. [125], and we only compute errors for pixels $p$ within $M$. The segmentation mask ensures that the objective function is computed with valid face pixels for improved robustness in the presence of occlusion. The objective function is optimized using a Gauss-Newton solver based on iteratively reweighted least squares with three levels of image pyramids (see [142] for details). In our experiments, the optimization converges within 30, 10, and 3 Gauss-Newton steps respectively from the coarsest level to the finest.
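For illustration, a minimal NumPy sketch of the visibility-masked photo-consistency term $E_c$ could look as follows (image and mask shapes are assumptions; this shows only the error term, not the Gauss-Newton solver itself):

import numpy as np

def photo_consistency(c_input, c_synth, mask):
    """c_input, c_synth: (H, W, 3) float images; mask: (H, W) boolean visibility mask M.
    Returns the average L2 color error over the valid face pixels."""
    diff = c_input[mask] - c_synth[mask]           # only pixels p within M
    per_pixel = np.linalg.norm(diff, axis=-1)      # ||C_input(p) - C_synth(p)||_2
    return per_pixel.mean()                        # average over |M|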
Partial High-Frequency Albedo. The PCA-based estimation provides only a complete texture with low-frequency details. On the other hand, the high-frequency information from the input, although only partially visible, allows the analysis of mesoscopic skin details. To obtain this high-frequency albedo map, we first remove the shading component of the input image by estimating the illumination $L$, the surface normals, and an optimized partial face geometry using the methods in [12, 116]. After that, we use an automatic facial segmentation technique [125] to unwrap the unshaded image and obtain the albedo map.
(Figure 3.3 diagram: the partial albedo (HF) is processed through a deep convolutional neural network to obtain partial feature correlations; these are fitted via a convex combination against the partial feature correlation set derived from the database, and the resulting coefficients are used to evaluate the complete feature correlations from the database's complete feature correlation set.)
Figure 3.3: Texture analysis. The hollow arrows indicate processing through a deep convolutional neural network.
Figure 3.4: Convex combination of feature correlations. The numbers indicate the number of subjects used
for blending correlation matrices.
3.3.2 Texture Analysis
As shown in Figure 3.3, we wish to extract multi-scale details from the resulting high-frequency partial
albedo map obtained in Section 3.3.1. These fine-scale details are represented by mid-layer feature cor-
relations from a deep convolutional neural network as explained in this Section. We first extract partial
feature correlations from the partially visible albedo map, then estimate coefficients of a convex combina-
tion of partial feature correlations from a face database with high-resolution texture maps. We use these
coefficients to evaluate feature correlations that correspond to convex combinations of full high-frequency
texture maps. These complete feature correlations represent the target detail distribution for the texture
synthesis step in Section 3.3.3. Notice that all processing is applied to the intensity channel Y of the YIQ color space to preserve the overall color, as in [41].
For an input uv map $I$, let $F_l(I)$ be the filter response of $I$ on layer $l$. We have $F_l(I) \in \mathbb{R}^{N_l \times M_l}$, where $N_l$ is the number of channels/filters and $M_l$ is the number of pixels (width $\times$ height) of the feature map. The correlation of the local structures can be represented as the normalized Gramian matrix $G_l(I)$:

$G_l(I) = \frac{1}{M_l}\, F_l(I)\, F_l(I)^T \in \mathbb{R}^{N_l \times N_l}$
We show that for a face texture, its feature response from the latter layers and the correlation matrices from the former ones sufficiently characterize the facial details to ensure photo-realism and perceptually identical images. A complete and photorealistic face texture can then be inferred from this information using the partially visible face in the uv map $I_0$.
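As a concrete illustration, the normalized Gramian of a VGG-19 feature map can be computed as in the following PyTorch sketch (it assumes torchvision is available; the chosen layer slice, the untrained weights, and the random stand-in image are illustrative only, not the thesis implementation):

import torch
import torchvision

def gram_matrix(feat):
    """feat: (1, N_l, H, W) feature response F_l(I). Returns the (N_l, N_l) Gramian."""
    n, h, w = feat.shape[1], feat.shape[2], feat.shape[3]
    f = feat.view(n, h * w)           # N_l x M_l
    return f @ f.t() / (h * w)        # (1 / M_l) F_l(I) F_l(I)^T

vgg = torchvision.models.vgg19().features.eval()   # load pretrained weights in practice
img = torch.rand(1, 3, 224, 224)                    # stand-in for the unwrapped uv map
with torch.no_grad():
    feat = vgg[:5](img)                              # response of an early layer
print(gram_matrix(feat).shape)                       # torch.Size([64, 64])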
As only the low-frequency appearance is encoded in the last few layers, exploiting the feature response from the complete low-frequency albedo $I(\alpha_{al})$ optimized in Sec. 3.3.1 gives us an estimation of the desired low-frequency feature response $\hat{F}$ for $I_0$:

$\hat{F}_l(I_0) = F_l\big(I(\alpha_{al})\big),$

where $I(\alpha_{al})$ is the PCA estimation of the complete face albedo. The remaining problem is to extract such a feature correlation (for the complete face) from a partially visible face, as illustrated in Figure 3.3.
Feature Correlation Extraction. A key observation is that the correlation matrices obtained from im-
ages of different faces can be linearly blended, and the combined matrices still produce realistic results.
See Figure 3.4 for examples of matrices blended from 4 up to 256 images. Hence, we conjecture that the desired correlation matrix can be obtained as a linear combination of such matrices given a sufficiently large database.
However, the input uv map $I_0$ often contains only a partially visible face, so we can only obtain the correlation in a partial region. To eliminate the change of correlation due to different visibility, complete textures in the database are masked out and their correlation matrices are recomputed to simulate the same visibility as the input. We define a mask-out function $M(I)$ to remove all non-visible pixels:

$M(I)_p = \begin{cases} 0.5, & \text{if } p \text{ is non-visible} \\ I_p, & \text{otherwise} \end{cases}$
where $p$ is an arbitrary pixel. We choose 0.5 as a constant intensity for non-visible regions. The new correlation matrix of layer $l$ for each image in the dataset $\{I_1, \dots, I_K\}$ is then:

$G_l^M(I_k) = G_l\big(M(I_k)\big), \quad \forall k \in \{1, \dots, K\}$
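A minimal PyTorch sketch of this mask-out step is given below; the channel-first tensor layout and the random example are assumptions for illustration, and the resulting masked texture would then be fed to a Gram computation such as the gram_matrix sketch above:

import torch

def mask_out(uv_map, visible):
    """uv_map: (1, 3, H, W) database texture; visible: (H, W) boolean visibility
    mask of the input uv map I_0. Returns the masked texture M(I)."""
    masked = uv_map.clone()
    masked[:, :, ~visible] = 0.5      # constant intensity for non-visible pixels
    return masked

# Example: mask a random texture with a random visibility pattern.
tex = torch.rand(1, 3, 256, 256)
vis = torch.rand(256, 256) > 0.3
print(mask_out(tex, vis).shape)       # torch.Size([1, 3, 256, 256])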
Multi-Scale Detail Reconstruction. Given the correlation matrices $\{G_l^M(I_k),\; k = 1, \dots, K\}$ derived from our database, we can find an optimal blending weight to linearly combine them so as to minimize the difference from $G_l^M(I_0)$ observed on the input $I_0$:

$\min_{w} \sum_l \Big\| \sum_k w_k\, G_l^M(I_k) - G_l^M(I_0) \Big\|_F \quad \text{s.t.} \quad \sum_{k=1}^{K} w_k = 1, \;\; w_k \geq 0 \;\; \forall k \in \{1, \dots, K\} \qquad (3.2)$
Here, the Frobenius norms of correlation matrix differences on different layers are accumulated. Note that
we add extra constraints to the blending weight so that the blended correlation matrix is located within
the convex hull of matrices derived from the database. While a simple least squares optimization without
constraints can find a good fitting for the observed correlation matrix, artifacts could occur if the observed
region in the input data is of poor quality. Enforcing convexity can reduce such artifacts, as shown in
Figure 3.5.
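For illustration, the constrained fit of Eq. 3.2 can be posed as a small simplex-constrained least-squares problem, as in the following sketch; it assumes SciPy's SLSQP solver, and the array layout is an illustrative assumption rather than the thesis implementation:

import numpy as np
from scipy.optimize import minimize

def fit_convex_weights(db_grams, input_grams):
    """db_grams:    list over layers l of arrays (K, N_l, N_l), the masked G_l^M(I_k).
    input_grams: list over layers l of arrays (N_l, N_l), the masked G_l^M(I_0)."""
    K = db_grams[0].shape[0]

    def objective(w):
        total = 0.0
        for Gk, G0 in zip(db_grams, input_grams):
            blend = np.tensordot(w, Gk, axes=(0, 0))    # sum_k w_k G_l^M(I_k)
            total += np.linalg.norm(blend - G0, 'fro')  # Frobenius norm per layer
        return total

    res = minimize(objective, np.full(K, 1.0 / K),
                   bounds=[(0.0, None)] * K,                               # w_k >= 0
                   constraints=({'type': 'eq',
                                 'fun': lambda w: w.sum() - 1.0},),        # sum_k w_k = 1
                   method='SLSQP')
    return res.x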
After obtaining the blending weights, we can simply compute the correlation matrix for the whole image:

$\hat{G}_l(I_0) = \sum_k w_k\, G_l(I_k), \quad \forall l$
(Figure 3.5 panels: input, visible texture, unconstrained least-squares result, and result with the convex constraint.)
Figure 3.5: Using convex constraints, we can ensure detail preservation for low-quality and noisy input
data.
(Figure 3.6 panels: input and synthesis results for increasing detail weights α.)
Figure 3.6: Effect of the detail weight α on texture synthesis.
3.3.3 Texture Synthesis
After obtaining our estimated feature $\hat{F}$ and correlation $\hat{G}$ from $I_0$, the final step is to synthesize a complete albedo $I$ matching both aspects. More specifically, we select a set of high-frequency preserving layers $L_G$ and low-frequency preserving layers $L_F$ and try to match $\hat{G}_l(I_0)$ and $\hat{F}_l(I_0)$ for layers in these sets, respectively.
The desired albedo is computed via the following optimization:

$\min_{I} \; \sum_{l \in L_F} \big\| F_l(I) - \hat{F}_l(I_0) \big\|_F^2 \;+\; \alpha \sum_{l \in L_G} \big\| G_l(I) - \hat{G}_l(I_0) \big\|_F^2 \qquad (3.3)$
where $\alpha$ is a weight balancing the effect of high- and low-frequency details. As illustrated in Figure 3.6, we choose $\alpha = 2000$ in all our experiments to preserve the details.
While this is a non-convex optimization problem, the gradient of this function can be easily computed. $G_l(I)$ can be considered as an extra layer in the neural network after layer $l$, and the optimization above is similar to the process of training a neural network with a Frobenius norm as its loss function. Note that here our goal is to modify the input $I$ rather than solving for the network parameters.
For the Frobenius loss function $L(X) = \|X - A\|_F^2$, where $A$ is a constant matrix, and for the Gramian matrix $G(X) = X X^T / n$, the gradients can be computed analytically as follows:

$\frac{\partial L}{\partial X} = 2(X - A), \qquad \frac{\partial G}{\partial X} = \frac{2}{n} X$
As the derivative of every high-frequency layer in $L_G$ and low-frequency layer in $L_F$ can be computed, we can apply the chain rule on this multi-layer neural network to back-propagate the gradient through the preceding layers all the way to the first one, obtaining the gradient $\nabla I$ with respect to the input. Given the size of the variables in this optimization problem and the limitation of GPU memory, we follow the choice of Gatys et al. [42] and use an L-BFGS solver to optimize $I$. We use the low-frequency albedo $I(\alpha_{al})$ from Section 3.3.1 to initialize the problem.
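The synthesis step can be summarized by the following PyTorch sketch, which optimizes the albedo image with L-BFGS against precomputed feature and Gramian targets; the VGG-19 layer indices, the dictionary-based interface for $\hat{F}$ and $\hat{G}$, the untrained weights, and the step count are illustrative assumptions rather than the thesis implementation (only the weight $\alpha = 2000$ follows the text):

import torch
import torchvision

vgg = torchvision.models.vgg19().features.eval()   # load pretrained weights in practice
for p in vgg.parameters():
    p.requires_grad_(False)

LF_LAYERS, HF_LAYERS, ALPHA = [21], [1, 6, 11, 20, 29], 2000.0   # illustrative layer choice

def layer_outputs(img, layer_ids):
    """Returns a dict {layer index: feature map} for the requested layers."""
    outs, x = {}, img
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in layer_ids:
            outs[i] = x
    return outs

def gram(feat):
    n, h, w = feat.shape[1:]
    f = feat.view(n, h * w)
    return f @ f.t() / (h * w)

def synthesize(init_albedo, F_hat, G_hat, steps=50):
    """F_hat / G_hat: dicts of target features / Gramians keyed by layer index."""
    I = init_albedo.clone().requires_grad_(True)     # optimize the image, not the network
    opt = torch.optim.LBFGS([I], max_iter=steps)

    def closure():
        opt.zero_grad()
        feats = layer_outputs(I, set(LF_LAYERS + HF_LAYERS))
        loss = sum(((feats[l] - F_hat[l]) ** 2).sum() for l in LF_LAYERS)
        loss = loss + ALPHA * sum(((gram(feats[l]) - G_hat[l]) ** 2).sum() for l in HF_LAYERS)
        loss.backward()                               # gradient w.r.t. the image I
        return loss

    opt.step(closure)
    return I.detach()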
3.4 Results
We processed a wide variety of input images with subjects of different races, ages, and genders, including celebrities and people from the publicly available annotated faces-in-the-wild (AFW) dataset [113]. We cover challenging examples of scenes with complex illumination as well as non-frontal faces. As showcased in Figures 3.1, 3.12 and 3.13, our inference technique produces high-resolution texture maps with complex skin tones and mesoscopic-scale details (pores, stubble hair), even from very low-resolution input images. Consequently, we are able to effortlessly produce high-fidelity digitizations of iconic personalities who have passed away, such as Muhammad Ali, or bring back their younger selves (e.g., young Hillary Clinton) from a single archival photograph. Until recently, such results would only be possible with high-end capture devices [152, 92, 44, 43] or intensive effort from digital artists. We also show photorealistic
renderings of our reconstructed face models from the widely used AFW database, which reveal high-
frequency pore structures, skin moles, as well as short facial hair. We clearly observe that low-frequency
albedo maps obtained from a linear PCA model [12] are unable to capture these details. Figure 3.7 illus-
trates the estimated shape and also compares the renderings between the low-frequency albedo and our
final results. For the renderings, we use Arnold [135], a Monte Carlo ray-tracer, with generic subsurface
scattering, image-based lighting, procedural roughness and specularity, and a bump map derived from the
synthesized texture.
(Figure 3.7 panels: input, fitted geometry, texture from PCA, texture from ours.)
Figure 3.7: Photorealistic renderings of geometry, texture obtained using PCA model fitting, and our
method.
Face Texture Database. For our texture analysis method (Section 3.3.2) and to evaluate our approach,
we built a large database of high-quality facial skin textures from the recently released Chicago Face
Database [91] used for psychological studies. The data collection contains a balanced set of standardized
high-resolution photographs of 592 individuals of different ethnicity, ages, and gender. While the images
were taken in a consistent environment, the shape and lighting conditions need to be estimated in order to
recover a diffuse albedo map for each subject. We extend the method described in Section 3.3.1 to fit a
PCA face model to all the subjects while solving for globally consistent lighting. Before we apply inverse
illumination, we remove specularities in SUV color space [93] by filtering the specular peak in the S
channel since the faces were shot with flash.
Evaluation. We evaluate the performance of our texture synthesis with three widely used convolutional
neural networks (CaffeNet, VGG-16, and VGG-19) [11, 133] for image recognition. While different
(Figure 3.8 panels: CaffeNet, VGG-16, VGG-19.)
Figure 3.8: Comparison between different convolutional neural network architectures.
(Figure 3.9 panels: albedo map, input, rendering, input (magnified).)
Figure 3.9: Even for largely downsized image resolutions, our algorithm can produce fine-scale details
while preserving the person’s similarity.
models can be used, deeper architectures tend to produce fewer artifacts and higher-quality textures, as the visualization in Figure 3.8 indicates. To validate our use of all 5 mid-layers of VGG-19 for the multi-scale representation of details, we show that if fewer layers are used, the synthesized textures become blurrier, as shown in Figure 3.10. While the texture synthesis formulation in Equation 3.3 suggests a blend
between the low-frequency albedo and the multi-scale facial details, we expect to maximize the amount of
detail and only use the low-frequency PCA model estimation for initialization.
We also evaluate the robustness of our inference framework for downsized image resolutions in Fig-
ure 3.9. We crop a diffusely lit face from a Light Stage capture [43]. The resulting image has 435 × 652 pixels and we decrease its resolution to 108 × 162 pixels. In addition to complex skin pigmentations, even
the tiny mole on the lower left cheek is properly reconstructed from the reduced input image using our
synthesis approach.
(Figure 3.10 panels: synthesis results using 1, 2, 3, 4, and 5 mid-layers.)
Figure 3.10: Different numbers of mid-layers affect the level of detail of our inference.
As depicted in Figure 3.11, we also demonstrate that our method is able to produce consistent high-
fidelity texture maps of a subject captured from different views. Even for extreme profiles, highly detailed
freckles are synthesized properly in the reconstructed textures. Please refer to our additional materials for
more evaluations.
Comparison. We provide in Figure 3.14 additional visualizations of our method when using the closest
feature correlation, unconstrained linear combinations, and convex combinations. We also compare against
a PCA-based model fitting [12] approach and the state-of-the-art visio-lization framework [98]. We notice
that only our proposed technique using convex combinations is effective in generating mesoscopic-scale
texture details. Both visio-lization and the PCA-based model result in lower frequency textures and less
Figure 3.11: Consistent and plausible reconstructions from two different viewpoints.
similar faces than the ground truth. Since our inference also fills holes, we compare our synthesis tech-
nique with a general inpainting solution for predicting unseen face regions. We test with the widely used
PatchMatch [8] technique as illustrated in Figure 3.15.
User Study A: Photorealism and Alikeness. To assess the photorealism and the likeness of our re-
constructed faces, we propose a crowdsourced experiment using Amazon Mechanical Turk (AMT). We
compare ground truth photographs from the Chicago Face Database with renderings of textures that are
generated with different techniques. These synthesized textures are then composited on the original images
using the estimated lighting and shading parameters. We randomly select 11 images from the database and
blur them using Gaussian filtering until the details are gone. We then synthesize high-frequency textures
from these blurred images using (1) a PCA model, (2) visio-lization, (3) our method using the closest
feature correlation, (4) our method using unconstrained linear combinations, and (5) our method using
convex combinations. We show the turkers the left and right halves of a face and inform them that the left half is always the ground truth. The right half has a 50% chance of being computer generated. The task consists
of deciding whether the right side is “real” and identical to the ground truth, or “fake”. We summarize our
analysis with the box plot in Figure 3.16 using 150 turkers. Overall, (5) outperforms all other solutions
and different variations of our method have similar means and medians, which indicates that non-technical
turkers have a hard time distinguishing between them.
(Figure 3.12 panels: input picture, albedo map (LF), rendering (HF), side rendering (HF), zoomed rendering (HF), albedo map (HF).)
Figure 3.12: Our method successfully reconstructs high-quality textured face models on a wide variety
of people from challenging unconstrained images. We compare the estimated low-frequency albedo map
based on PCA model fitting (second column) and our synthesized high-frequency albedo map (third col-
umn). Photorealistic renderings are produced using the commercial Arnold renderer [135] and only the
estimated shape and synthesized texture map.
Figure 3.13: Additional results with images from the annotated faces-in-the-wild (AFW) dataset [113].
Figure 3.14: Comparison of our method with PCA-based model fitting [12], visio-lization [98], and the
ground truth.
(Figure 3.15 panels: input, albedo map, PatchMatch, ours.)
Figure 3.15: Comparison with PatchMatch [8] on partial input data.
(Figure 3.16 bar chart: positive rate from 0 to 1 for real, ours, ours - least square, ours - closest, visio-lization, and PCA.)
Figure 3.16: Box plots of 150 turkers rating whether the image looks realistic and identical to the ground
truth. Each plot contains the positive rates for 11 subjects in the Chicago Face Database.
(Figure 3.17 panels: Light Stage, ours, PCA.)
Figure 3.17: Side-by-side renderings of 3D faces for AMT.
User Study B: Our method vs. Light Stage Capture. We also compare the photo-realism of renderings
produced using our method with the ones from Light Stage [43]. We use an interface on AMT that allows
turkers to rank the renderings from realistic to unrealistic. We show side-by-side renderings of 3D face
models as shown in Figure 3.17 using (1) our synthesized textures, (2) the ones from the Light Stage, and
(3) one obtained using PCA model fitting [12]. We asked 100 turkers to each sort 3 sets of pre-rendered
images, which are randomly shuffled. We used three subjects and perturbed their head rotations to produce
more samples. We found that our synthetically generated details can confuse the turkers for subjects that have smoother skin, with 56% finding the results from (1) more realistic than those from (2). Also, 74% of the turkers found that faces from (2) are more realistic than those from (3), and 72% think that method
(1) is superior to (3). Our experiments indicate that our results are visually comparable to those from the
Light Stage and that the level of photorealism is challenging to judge by a non-technical audience.
Chapter 4
Hair Rendering
(Figure 4.1 panels: reference image, reference hair model, and rendering results for two examples.)
Figure 4.1: We propose a real-time hair rendering method. Given a reference image, we can render a 3D
hair model with the referenced color and lighting in real-time.
4.1 Introduction
Computer-generated (CG) characters are widely used in visual effects and games, and are becoming increasingly prevalent in photo manipulation and virtual reality applications; all of these rely on hair as an essential visual component. While significant advancements in hair rendering have been made in the computer
graphics community, the production of aesthetically realistic and desirable hair renderings still relies on a
careful design process of strand models, shaders, lights, and even composites, generally by experienced
look development artists. Due to the geometric complexity and volumetric structure of hair, modern hair
rendering pipelines often combine efficient hair representations, physically-based shading models, shadow mapping techniques, and scattering approximations, which increase not only the computational cost but also the difficulty of tweaking parameters. In high-end film production, it is not unusual that a single frame of photorealistic hair takes several minutes to generate on a rendering farm. While com-
pelling real-time techniques have been introduced recently, including commercial solutions (e.g., NVIDIA
HairWorks, Unity Hair Tools), the results often appear synthetic and are difficult to author, even by skilled
digital artists. For instance, several weeks are often necessary to produce individualized hair geometries,
textures, and shaders for hero hair assets in modern games, such as Uncharted 4 and Call of Duty: Ghosts.
Inspired by recent advances in generative adversarial networks (GANs), we introduce the first deep
learning-based technique for rendering photorealistic hair. Our method takes a 3D hair model as input in
strand representation and uses an example input photograph to specify the desired hair color and lighting.
In addition to our intuitive user controls, we also demonstrate real-time performance, which makes our
approach suitable for interactive hair visualization and manipulation, as well as 3D avatar rendering.
Compared to conventional graphics rendering pipelines, which are grounded on complex parametric
models, reflectance properties, and light transport simulation, deep learning-based image synthesis tech-
niques have proven to be a promising alternative for the efficient generation of photorealistic images.
Successful image generations have been demonstrated on a wide range of data including urban scenes,
faces, and rooms, but fine level controls remain difficult to implement. For instance, when conditioned on
a semantic input, arbitrary image content and visual artifacts often appear, and variations are also difficult
to handle due to limited training samples. This problem is further challenged by the diversity of hairstyles,
the geometric intricacy of hair, and the aesthetic complexity of hair in natural environments. For a viable
photorealistic hair rendering solution, we need to preserve the intended strand structures of a given 3D hair
model, as well as provide controls such as color and lighting.
Furthermore, the link between CG and real-world images poses another challenge, since such training
data is difficult to obtain for supervised learning. Photoreal simulated renderings are time-consuming to
generate and often difficult to match with real-world images. In addition, capturing photorealistic hair
models is hard to scale, despite advances in hair digitization techniques.
In this work, we present an approach, based on sequential processing of a rendered input hair model using multiple GANs, that converts a semantic strand representation into a photorealistic image (see Figure 4.1 and Section 4.5). Color and lighting parameters are specified at intermediate stages. The input 3D
hair model is first rendered without any shading information, but strand colors are randomized to reveal
the desired hair structures. We then compute an edge activation map, an intermediate representation based on adaptive thresholding that allows us to connect the strand features between our input CG representation and a photoreal output for effective semi-supervised training. A conditional GAN
is then used to translate this edge activation map into a dense orientation map that is consistent with those
obtained from real-world hair photographs. We then concatenate two multi-modal image translation net-
works to disentangle color and lighting control parameters in latent space. These high-level controls are
specified using reference hair images as input, which allows us to describe complex hair color variations
and natural lighting conditions intuitively. We provide extensive evaluations of our technique and demon-
strate its effectiveness on a wide range of hairstyles. We compare our rendered images to ground truth
photographs and renderings obtained from state-of-the-art computer graphics solutions. We also conduct
a user study to validate the achieved level of realism.
Contributions: We demonstrate that a photorealistic and directable rendering of hair is possible using
a sequential GAN architecture and an intermediate conversion from edge activation map to orientation
field. Our network decouples color, illumination, and hair structure information using a semi-supervised
approach and does not require synthetic images for training. Our approach infers parameters from input
examples without tedious explicit low-level modeling specifications. We show that color and lighting
parameters can be smoothly interpolated in latent space to enable fine-level and user-friendly control.
Compared to conventional hair rendering techniques, our method does not require any low-level parameter
tweaking or ad-hoc texture design. Our rendering is computed in a feed forward pass through the network,
which is fast enough for real-time applications. Our method is also significantly easier to implement than
traditional global illumination techniques. We plan to release the code and data to the public.
4.2 Related Work
In this section we provide an overview of state-of-the-art techniques for hair rendering and image manip-
ulation and synthesis.
Fiber-level hair rendering produces highly realistic output, but it incurs substantial computational cost [94, 29, 157, 156, 158] and requires some level of expertise for asset preparation and parameter tuning by a digital artist. Various simplified models have been proposed recently, such as dual scattering [170], but its real-time variants have a rather plastic and solid appearance. Real-time rendering techniques generally avoid physically-based models and instead rely on approximations that only mimic hair's appearance, modeling hair as parametric surfaces [74, 68], meshes [160, 57], textured morphable models [21], or multiple semi-transparent layers [134, 159]. Choosing the right parametric model and setting the parameters for the desired appearance requires substantial artist expertise. Converting across different hair models can be cast as a challenging optimization or learning problem [157]. Instead, in this work, we demonstrate that
one can directly learn a representation for hair structure, appearance, and illumination using a sequence of
GANs, and that this representation can be intuitively manipulated by using example images.
Substantial efforts have been dedicated to estimating hair structures from natural images, such as with
multi-view hair capturing methods [162, 56, 54, 89, 88, 90, 105, 104]. Recently, single-view hair recon-
struction methods [22, 24, 25, 23, 55] are becoming increasingly important because of the popularity
in manipulating internet portraits and selfies. We view our work as complementary to these hair captur-
ing methods, since they rely on existing rendering techniques and do not estimate the appearance and
illumination for the hair. Our method can be used in similar applications, such as hair manipulation in
images [24, 25], but with simpler control over the rendering parameters.
Neural networks are increasingly used for the manipulation and synthesis of visual data such as
faces [132, 59], object views [165], and materials [86]. Generative models with an adversary [48, 111] can
successfully learn a data representation without explicit supervision. To enable more control, these models
have been further modified to consider user input [166] or to condition on a guiding image [61]. While the
latter provides a powerful manipulation tool via image-to-image translation, it requires strong supervision
in the form of paired images. This limitation has been further addressed by enforcing cycle-consistency
across unaligned datasets [167]. Another limitation of the original image translation architecture is that
it does not handle multimodal distributions which are common in synthesis tasks. This is addressed by
encouraging bijections between the output and latent spaces [168] in a recently introduced architecture
known as BicycleGAN. We assess this architecture as part of our sequential GAN for hair rendering.
Our method is also related to unsupervised learning methods that remove part of the input signal, such
as color [163], and then try to recover it via an auto-encoder architecture. However, in this work we focus
on high quality hair renderings instead of generic image analysis, and unlike SplitBrain [163], we use
image processing to connect two unrelated domains (CG hair models and real images).
Variants of these models have been applied to many applications including image compositing [81],
font synthesis [7], texture synthesis [155], facial texture synthesis [101], sketch colorization [127], makeup
transfer [26], and many more. Hair has a more intricate appearance model due to its thin semi-transparent
structures, inter-reflections, scattering effects, and very detailed geometry.
4.3 Method
We propose a semi-supervised approach to train our hair rendering network using only real hair pho-
tographs during training. The key idea of our approach is to gradually reduce the amount of information by processing the image, eventually bringing it to a simple low-dimensional representation, the edge activation map. This representation can also be trivially derived from a CG hair model, enabling us to connect the two domains. An encoder-decoder architecture is applied to each simplified representation, where the
(Figure 4.2 diagram: the input image is progressively simplified into segmented hair I1, gray image I2, orientation map I3, and edge activation map via F1, F2, F3; encoders E1, E2, E3 produce the color, lighting, and structure codes z, and generators G3, G2, G1 reconstruct the orientation map, gray image, and color image; the CG hair model enters the pipeline through its edge activation map.)
Figure 4.2: Overview of our method. Given a natural image, we first use simple image processing to strip it of salient details such as color, lighting, and fiber-level structure, leaving only the coarse structure captured in the edge activation map (top row). We encode each simplified image into its own latent space, which is further used by the generators. The CG hair is rendered in a style mimicking the extracted coarse structure, and the generators are applied in inverse order to add the details from real hair encoded in the latent spaces, yielding the realistic reconstruction (bottom row).
goal of the encoder is to capture the information removed by the image processing and the goal of the
decoder is to recover it (see Figure 4.2).
Given a 2D image $I_1$, we define image processing filters $\{F_i\}_{i=1}^{3}$ that generate intermediate simplified images $I_{i+1} := F_i(I_i)$. Each intermediate image $I_i$ is first encoded by a network $E_i(I_i)$ into a feature vector $z_i$ and then decoded with a generator $G_i(z_i, I_{i+1})$. The decoder is trained to recover the information that is lost in a particular image processing step. We use a conditional variational autoencoder GAN [168] for each encoder-decoder pair.
Figure 4.3: Examples of our training data. For each set of images from left to right, we show input image;
segmented hair; gray image; orientation map; and edge activation map.
Our sequential image processing operates in the following three steps. First, $I_2 := F_1(I_1)$ desaturates the segmented hair region of an input photograph to produce a grayscale image. Second, $I_3 := F_2(I_2)$ is the orientation map computed using the maximal response of a rotating DoG filter [88]. Third, $F_3(I_3)$ is an edge activation map obtained using adaptive thresholding, in which each pixel contains only the value 1 or -1, indicating whether it is activated (response higher than its neighboring pixels) or not. This edge activation map provides a basic representation for describing hair structure from a particular viewpoint, and edge activation maps derived from natural images or from renderings of a CG model with random strand colors can be processed equally well by our generators. Figure 4.3 demonstrates some examples of our processing pipeline applied to real hair.
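The following sketch illustrates one possible implementation of the three filters $F_1$-$F_3$ using OpenCV and NumPy; the oriented filter bank is only a rough stand-in for the rotating DoG of [88], and all kernel sizes, angle counts, and block sizes are illustrative assumptions rather than the thesis implementation:

import cv2
import numpy as np

def F1_desaturate(rgb, hair_mask):
    """rgb: (H, W, 3) uint8 image; hair_mask: (H, W) segmentation. Returns masked grayscale."""
    gray = cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY).astype(np.float32) / 255.0
    return gray * (hair_mask > 0)

def oriented_kernel(theta, size=17, sigma=1.5, k=2.0):
    """Rough oriented band-pass kernel, rotated by theta (radians)."""
    g1 = cv2.getGaussianKernel(size, sigma)
    g2 = cv2.getGaussianKernel(size, sigma * k)
    kernel = ((g1 - g2) @ cv2.getGaussianKernel(size, sigma * k).T).astype(np.float32)
    M = cv2.getRotationMatrix2D((size // 2, size // 2), np.degrees(theta), 1.0)
    return cv2.warpAffine(kernel, M, (size, size))

def F2_orientation(gray, n_angles=32):
    """Per-pixel orientation from the maximal response over a rotated filter bank."""
    responses = np.stack([cv2.filter2D(gray, -1, oriented_kernel(t))
                          for t in np.linspace(0, np.pi, n_angles, endpoint=False)])
    orientation = responses.argmax(axis=0) * (np.pi / n_angles)
    return orientation, responses.max(axis=0)

def F3_edge_activation(response, block=9):
    """Adaptive thresholding: +1 where the response exceeds its local mean, -1 otherwise."""
    local_mean = cv2.blur(response, (block, block))
    return np.where(response > local_mean, 1.0, -1.0)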
At inference time, we are given an input 3D hair model in strand representation (100 vertices per strand) and render it with randomized strand colors from a desired viewpoint. We apply the full image processing stack $F_3 \circ F_2 \circ F_1$ to obtain the edge activation map. We can then use the generators $G_1 \circ G_2 \circ G_3$ to recover a realistic-looking image from the edge activation map. Note that these generators rely on the encoded features $z_i$ to recover the desired details, which provides an effective tool for controlling rendering attributes such as color and lighting. We demonstrate that our method can effectively transfer these attributes by encoding an example image and feeding the resulting vector to the generator (where fine-grained hair structure is encoded in $z_3$, natural illumination and hair appearance properties are encoded in $z_2$, and detailed hair color is encoded in $z_1$).
(Figure 4.4 diagram: the ground truth image is passed through the encoder E to obtain Q(z|B), which is regularized toward N(z) with a KL loss; the generator G takes the input image and z and produces the network output, which is compared to the ground truth output with an L1 loss and a discriminator D.)
Figure 4.4: Our entire network is composed of three encoder-decoder pairs $(E_i, G_i)$ with the same cVAE-GAN architecture depicted in this figure.
4.4 Implementation
Given a pair of input/output images for each stage in the rendering pipeline (e.g., the segmented color image $I_1$ and its grayscale version $I_2$), we train both the encoder network and the generator network together: the encoder $E_1$ extracts the color information $z_1 := E_1(I_1)$, and the generator reconstructs a color image identical to $I_1$ from only the grayscale image $I_2$ and the parameter $z_1$, i.e., $G_1(z_1, I_2) = G_1(E_1(I_1), I_2) \approx I_1$. With two such networks, we can convert an image in the lower-dimensional domain (gray vs. color in this case) back to the domain with higher dimensionality.
Since the three sets of networks $(E_i, G_i)$ are trained in a similar manner and the input images $I_i, I_{i+1}$ are all derived from the input image $I$ and are invariant to the learning process, we can treat these three sets of networks independently and train them in parallel. Without loss of generality, we reuse the notation $G := G_i$, $E := E_i$, $I := I_i$, $I' := I_{i+1} = F_i(I_i)$, and $z := z_i = E_i(I_i)$ for the rest of this section to simplify the description of the training process.
4.4.1 Architecture
Figure 4.4 shows the training architecture of a pair of networks $E$ and $G$. We use the conditional variational autoencoder GAN (cVAE-GAN) as described in the work of Zhu et al. [168]. A ground truth image
$I$ is processed by the encoder $E$, producing the latent vector $z$. This $z$ and the input image $I'$ are both inputs of the generator $G$. We define the objective function as a combination of the following three parts:

$\mathcal{L}_1^{VAE}(G, E) = \big\| I - G(E(I), I') \big\|_1$

reports the reconstruction error between $I$ and the prediction from $G$.

$\mathcal{L}_{KL}(E) = D_{KL}\big(E(I) \,\|\, \mathcal{N}(0, I)\big)$

enforces that $z = E(I)$ is distributed in the latent space with a distribution similar to the Gaussian distribution. Here $D_{KL}$ is the KL-divergence of two probability distributions. This loss preserves the diversity of $z$ and allows us to efficiently re-sample a random $z$ from the normal distribution. Finally, an adversarial loss is introduced by training a patch-based discriminator [61] $D$ at the same time. The discriminator takes either $I$ or $G(z, I')$ and classifies whether it comes from real data or from synthesis. A patch-based discriminator learns to distinguish local features within its receptive field, so that it can avoid artifacts being produced in the local regions of $G(z, I')$.
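For illustration, the three loss terms can be combined as in the PyTorch sketch below; it assumes the common reparameterized implementation in which the encoder returns a mean and log-variance, and E, G, D, the KL weight, and the BCE adversarial loss are placeholders and assumptions rather than the actual trained networks and settings:

import torch
import torch.nn.functional as F

def cvae_gan_losses(E, G, D, I, I_prime):
    """I: ground truth image; I_prime: conditioning image I'. Returns (G/E loss, D loss)."""
    mu, logvar = E(I)                                   # encoder predicts a Gaussian over z
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
    I_rec = G(z, I_prime)

    l1 = F.l1_loss(I_rec, I)                            # ||I - G(E(I), I')||_1
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())   # KL to N(0, I)

    d_fake = D(I_rec)
    adv_g = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    d_real, d_fake_det = D(I), D(I_rec.detach())
    adv_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
             F.binary_cross_entropy_with_logits(d_fake_det, torch.zeros_like(d_fake_det)))

    return l1 + 0.01 * kl + adv_g, adv_d                # illustrative loss weighting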
We use the add-to-input method from the work of [168] to replicate $z$. For a tensor $I'$ of size $H \times W \times C$ and $z$ of size $1 \times Z$, we copy the values of $z$, extending it to an $H \times W \times Z$ tensor, and concatenate it with the tensor $I'$ along the channel dimension, resulting in an $H \times W \times (Z + C)$ tensor. This tensor is used as $G$'s input. We are aware that additional constraints, such as providing a randomly drawn latent vector to the generator [31, 34], are possible, but we did not achieve visible improvements by adding more loss terms in our experiments. We provide a comparison between cVAE-GAN and the BicycleGAN architecture in the results section.
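A minimal PyTorch sketch of this add-to-input replication (using the channel-first (B, C, H, W) layout; names are illustrative) is:

import torch

def add_to_input(I_prime, z):
    """I_prime: (B, C, H, W) conditioning image; z: (B, Z) latent vector.
    Returns a (B, Z + C, H, W) tensor used as the generator's input."""
    B, _, H, W = I_prime.shape
    z_tiled = z.view(B, -1, 1, 1).expand(B, z.shape[1], H, W)   # spatially replicate z
    return torch.cat([z_tiled, I_prime], dim=1)                  # concat on channels

# Example: a 512x512 gray image with an 8-dimensional latent code.
x = add_to_input(torch.rand(1, 1, 512, 512), torch.rand(1, 8))
print(x.shape)  # torch.Size([1, 9, 512, 512])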
4.4.2 Data Preparation
Segmentation. Since we are only focusing on hair appearance synthesis, we mask non-hair regions to
black to ignore their effect. We also trained a Pyramid Scene Parsing Network [164] to perform automatic
hair segmentation for our training. We first pick 3000 images from the CelebA-HQ dataset [65] and
manually segment out the corresponding hair regions. The network is trained from these images and
we use it to process the remaining ones from CelebA-HQ. We then manually remove obviously wrong
segmentations. This gives us around 25,000 images of segmented hair out of the 30,000 total images in CelebA-HQ (we will also release these segmentation masks for CelebA-HQ). In practice, we used only 5,000 images as our training data in our experiments.
Precomputation. We perform all deterministic operations on all training data, in order to obtain the
corresponding gray image, orientation maps, and edge activation maps for each hair image. For each pair
of $(E, G)$, only the two corresponding classes of $I$ and $I'$ in the training data are loaded.
4.4.3 Training
We apply data augmentation, including random rotation, translation, and color perturbation (only for input RGB images), to add more variations to the training set. Scaling is not applied, as the orientation map depends on the scale of the texture details in the grayscale image. We choose the U-net [118] architecture for $G$, which has an encoder-decoder architecture with symmetric skip connections, allowing generation of pixel-level details while preserving the global information. A ResNet [52] is used for the encoder $E$, which consists of 6 groups of residual blocks. In all experiments, we use a fixed resolution of 512 × 512 for all images, and the dimension of $z$ is 8 in each transformation, following the choice from the work of Zhu et al. [168]. We train each set of networks on 5,000 images for 100 epochs, with a learning rate gradually decreasing from 0.0001 to zero. The training time for each set is around 24 hours. Lastly, to stabilize the GAN training, we also add random Gaussian noise drawn from $\mathcal{N}(0, \sigma I)$, with $\sigma$ gradually decreasing, to the images before feeding them to $D$.
Figure 4.5: Our interactive interface.
4.5 Results
Our hair models in Figures 4.1, 4.6 and 4.10 are generated by the methods introduced in [55, 56, 54]. The
CG models in Figure 4.11 are manually created in Maya with XGen. We use the model from USC Hair
Salon [53] in Figure 4.7.
Real-Time Rendering System. To demonstrate the utility of our method, we developed a real-time
rendering interface for hair (see Figure 4.5). The user can load a CG hair model, pick the desired viewpoint,
and then provide an example image for color and lighting specification. This example-based approach is user-friendly, as it is often difficult to describe hair color using a single RGB value: hair might have dyed highlights or natural variations in follicle pigmentation. Figure 4.1 and Figure 4.6 demonstrate the
rendering results with our method. Note that our method successfully handles a diverse set of hairstyles,
complex hair textures, and natural lighting conditions. One can further refine the color and lighting by
providing additional examples for these attributes and interpolate between them in latent space (Figure 4.7).
This feature provides an intuitive user control when a desired input example is not available. We show in
Figure 4.8 that we can control the lighting $z$ and color $z$ of the output image independently, and each $z$ can be interpolated across the corresponding latent space.
Comparison. We compare our sequential network to running BicycleGAN to directly render the colored image from the orientation field, thus skipping one separation of networks (Figure 4.9). Even with double the number of parameters and training time, we notice that only the lighting, but not the color, is accurately captured. We believe that the lighting parameters are much harder to capture than the color parameters; hence the combined network may always try to minimize the error introduced by lighting changes first, without treating the change of color as an equally important factor. To qualitatively evaluate our system, we compare it to state-of-the-art rendering techniques. Unfortunately, these rendering methods do not take reference images into account and thus lack a similar level of control. We asked a professional artist to tweak the lighting and shading parameters to match our reference images. We used an enhanced version of Unity's Hair Tool, a high-quality real-time hair rendering plugin based on the Kajiya-Kay reflection model [64] (Figure 4.10). Notice that a similar real-time shading technique that approximates multiple scattering components was used in the state-of-the-art real-time avatar digitization work of [57]. Our approach appears less synthetic and contains more strand-scale texture variations. Our artist also used the default hair shader in Solid Angle's Arnold renderer, a commercial implementation of a hybrid shader based on the works of [30] and [170], to match the reference image. This is an offline system, and it takes about 5 minutes to render a single frame (Figure 4.11) on an 8-core AMD Ryzen 1800X machine with 32 GB RAM. While our renderer may not reach the level of fidelity obtained from high-end offline rendering systems, it offers real-time performance and instantaneously produces a realistic result that matches the lighting and hair color of a reference image, a task that took our experienced artist over an hour to perform.
User Study. We further conduct a user study to evaluate the quality of our renderings in comparison to
real hair extracted from images. We presented our synthesized result and an image crop to MTurk workers
side by side for 1 second, and asked them to pick the image that contained the real hair. We tested this with edge activation fields coming from either a CG model or a reference image (in both cases we used a latent space inferred from another image). If the synthetic images are not distinguishable from real images, the portion of
Hair Source         AMT Fooling Rate [%]
Real                50.0
Image               49.9
CG Model            48.5
CG Model (3 sec)    45.6
Table 4.1: MTurk workers were provided two images in a random order and were asked to select one that
looked more real to them. Either CG models or images were used to provide activation fields. Note that in
all cases MTurk users had a hard time distinguishing our results from real hair, even if we extend the time
that images are visible to 3 seconds.
MTurk workers being “fooled” and think they are real will be exactly 50%. We present qualitative results
in Figure 4.12 and the rate of selecting the synthetic images in Table 4.1. Note that in all cases, the users
had a hard time distinguishing our results from real hair, and that our system successfully rendered hair
produced from a CG model or a real image. We even extend the time that images are visible from 1 second
to 3 seconds, allowing people to assess the images more carefully.
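For reference, the fooling rate reported in Table 4.1 is simply the fraction of trials in which workers selected the synthetic image as the real one; the short sketch below illustrates this bookkeeping with made-up responses (not the actual study data).

```python
# Each trial records which image the worker picked ("synthetic" or "real").
# A worker is "fooled" when they select the synthetic image as the real one.
responses = ["synthetic", "real", "synthetic", "real", "real", "synthetic"]  # dummy data

fooled = sum(1 for r in responses if r == "synthetic")
fooling_rate = 100.0 * fooled / len(responses)
print(f"AMT fooling rate: {fooling_rate:.1f}%  (50% = indistinguishable from real)")
```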
Performance. We measure the performance of our feed-forward networks in a real-time hair rendering
application. We use a desktop with two Titan Xp GPUs, each with 12 GB of GPU memory. All online processing
steps, including F1, F2, F3, and all generators, run per frame. The average amount of time spent in F2
is 9 ms for looping over all possible rotation angles, and the computation time for F1 and F3 is negligible.
The three networks have identical architectures and thus run at a consistent speed, each taking around 15 ms.
On a single GPU, our demo runs at around 15 fps with a GPU memory consumption of 2.7 GB. Running the
demo on multiple GPUs allows real-time performance, with the frame rate varying between 24 and 27 fps.
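The per-network timings above follow standard GPU benchmarking practice; the sketch below (PyTorch, with a small placeholder network standing in for our generators, so the architecture and the resulting numbers are assumptions) shows the measurement pattern, including the device synchronization needed for meaningful per-frame timings.

```python
import time
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder stand-in for one of the image-to-image generator networks.
net = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(64, 3, 3, padding=1)).to(device).eval()
x = torch.randn(1, 3, 512, 512, device=device)

with torch.no_grad():
    for _ in range(10):            # warm-up iterations
        net(x)
    if device == "cuda":
        torch.cuda.synchronize()   # make sure queued kernels have finished
    start = time.perf_counter()
    for _ in range(100):
        net(x)
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed_ms = (time.perf_counter() - start) / 100 * 1000.0

print(f"average forward pass: {elapsed_ms:.1f} ms per frame")
```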
Figure 4.6: Rendering results for different CG hair models where color and lighting conditions are extracted
from a reference image. (Columns: reference image, reference hair model, reconstructed orientation map,
rendering result, other views.)
Figure 4.7: We demonstrate rendering of the same CG hair model where material attributes and lighting
conditions are extracted from different reference images, and show how the appearance can be further
refined by interpolating between these attributes (color, lighting, specular, roughness) in latent space.
(Columns: reference image A, result A, reference image B, result B, interpolation results.)
Figure 4.8: Example of 2D interpolations in color and lighting. Horizontally, the hair color changes from
blonde (left) to red (right); vertically, the lighting changes from bright (top) to dark (bottom).
Figure 4.9: Comparison between our method and BicycleGAN (i.e., no sequential network). From left
to right: (a) reference image; (b) rendering result with BicycleGAN; (c) rendering result with our
sequential pipeline.
Figure 4.10: Comparison between our method and the real-time hair rendering technique in Unity.
(Columns: reference image, Unity, ours, reference hair model.)
Figure 4.11: Comparison between our method and offline rendering with Arnold.
(Columns: reference image, reference hair model, Arnold, ours.)
Figure 4.12: User study example. From left to right: (a) real image, used to infer the latent
space z; (b) rendering result generated with z and the activation map of another real image; (c) rendering
result of z and a CG model.
Chapter 5
Discussion
We have shown that neural networks are a flexible and powerful tool that can jointly handle steps previously
considered independent. The results from neural networks outperform existing algorithms designed
specifically for individual steps, e.g., dense body correspondence and texture inpainting and synthesis. This
new unified pipeline can significantly decrease the amount of time and computation needed, and reduce the
cost of purchasing specialized hardware applicable to only a few tasks. With the rapid development of neural
networks in the field of vision, we expect that the quality and accuracy of capture can be further improved
by adopting more advanced network architectures.
A deep learning framework can be particularly effective at establishing accurate and dense correspon-
dences between partial scans of clothed subjects in arbitrary poses. The key insight is that a smooth em-
bedding needs to be learned to reduce the misclassification artifacts at segmentation boundaries that arise with
traditional classification networks. We have shown that a loss function based on the integration of multiple
random segmentations can be used to enforce smoothness. This segmentation scheme also significantly
decreases the amount of training data needed, as it eliminates an exhaustive pairwise distance computation
between the feature descriptors during training, as opposed to methods that work on pairs or triplets of
samples. Compared to existing classification networks, we also present the first framework that unifies
the treatment of human body shapes and clothed subjects. In addition to its remarkable efficiency, our
approach can handle both full models and partial scans, such as depth maps captured from a single view.
While not as general as some state-of-the-art shape matching methods [70, 83, 114, 27], our technique
significantly outperforms them for partial input shapes that are human bodies with clothing.
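As an illustration of the random-segmentation idea (a minimal sketch, not our exact training code; the tensor shapes, the linear classification heads, and the number of partitions are assumptions), per-point descriptors can be classified against several random partitions of the template surface, and averaging the resulting cross-entropy losses approximates integrating over segmentations, which discourages descriptors from jumping across segment boundaries.

```python
import torch
import torch.nn.functional as F

def random_segmentation_loss(descriptors, partitions, heads):
    """descriptors: (N, D) per-point embeddings produced by the network.
    partitions:  list of (N,) integer label tensors, one random partition each.
    heads:       list of nn.Linear(D, K_i) classification heads, one per partition."""
    losses = []
    for labels, head in zip(partitions, heads):
        logits = head(descriptors)                 # (N, K_i)
        losses.append(F.cross_entropy(logits, labels))
    # Averaging over many random partitions approximates integrating over all
    # segmentations, which encourages a smooth embedding across boundaries.
    return torch.stack(losses).mean()

# Tiny usage example with made-up sizes.
N, D, K = 1024, 16, 50
desc = torch.randn(N, D, requires_grad=True)
parts = [torch.randint(0, K, (N,)) for _ in range(4)]      # 4 random partitions
heads = [torch.nn.Linear(D, K) for _ in range(4)]
loss = random_segmentation_loss(desc, parts, heads)
loss.backward()
```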
Digitizing high-fidelity albedo texture maps is possible from a single unconstrained image. Despite
challenging illumination conditions, non-frontal faces, and low-resolution input, we can synthesize plausi-
ble appearances and realistic mesoscopic details. Our user study indicates that the resulting high-resolution
textures can yield photorealistic renderings that are visually comparable to those obtained using a state-of-
the-art Light Stage system. Mid-layer feature correlations are highly effective in capturing high-frequency
details and the general appearance of the person. Our proposed neural synthesis approach can handle high-
resolution textures, which is not possible with existing deep learning frameworks [35]. We also found that
convex combinations are crucial when blending feature correlations in order to ensure consistent fine-scale
details.
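The feature correlations referred to here are Gram matrices of mid-layer convolutional feature maps; the sketch below (illustrative only, with made-up feature maps and blending weights) shows how such correlations are computed and why a convex combination, with non-negative weights summing to one, yields a consistent blended correlation target.

```python
import torch

def gram_matrix(features):
    """features: (C, H, W) feature map from a mid-level convolutional layer."""
    c, h, w = features.shape
    f = features.reshape(c, h * w)
    return (f @ f.t()) / (c * h * w)     # (C, C) channel correlation matrix

# Hypothetical feature maps extracted from several retrieved face textures.
feature_maps = [torch.randn(256, 32, 32) for _ in range(3)]
grams = [gram_matrix(f) for f in feature_maps]

# Convex combination: non-negative weights that sum to one keep the blended
# correlations statistically consistent, preserving fine-scale detail statistics.
weights = torch.tensor([0.5, 0.3, 0.2])
blended = sum(w * g for w, g in zip(weights, grams))
```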
We presented the first deep learning approach for rendering photorealistic hair that runs in
real time. We have shown that our sequential GAN architecture and semi-supervised training approach
can effectively disentangle strand-level structures, appearance, and illumination properties across the highly
complex and diverse range of hairstyles. In particular, our evaluations show that without our sequential
architecture, the lighting parameter would dominate over color, and color specification would no longer
be effective. Moreover, our trained latent space is smooth, which allows us to interpolate continuously
between arbitrary color and lighting samples. Our evaluations also suggest that there are no significant
differences between a vanilla conditional GAN and a state-of-the-art network such as BicycleGAN, which
uses additional smoothness constraints during training. Our experiments further indicate that a direct con-
version from a CG rendering to a photoreal image using existing adversarial networks would lead to signif-
icant artifacts or unwanted hairstyles. Our intermediate conversion step from edge activation to orientation
map has proven to be an effective way to perform semi-supervised training and transition from synthetic input
to photoreal output while ensuring that the intended hairstyle structure is preserved.
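The sequential conversion described above can be summarized by the following sketch (hypothetical function names; the actual generators are the conditional GAN networks of Chapter 4, and the real pipeline involves additional stages): structure is first translated from edge activations to an orientation map, and appearance is then rendered from separately supplied color and lighting codes, which is what keeps the two factors disentangled.

```python
def render_hair(edge_activation, z_color, z_light,
                G_orientation, G_appearance):
    """Two-stage sequential rendering (sketch only).

    edge_activation: structure input, from either a CG model or a real photo.
    z_color, z_light: latent codes inferred from reference images (or interpolated).
    G_orientation, G_appearance: trained generator networks (assumed interfaces).
    """
    # Stage 1: structure only -- bridges the synthetic and real domains.
    orientation_map = G_orientation(edge_activation)
    # Stage 2: appearance, conditioned separately on color and lighting codes
    # so that the two factors remain disentangled.
    return G_appearance(orientation_map, z_color, z_light)
```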
5.1 Limitations and Future Work
Our data-driven frameworks for appearance analysis capture continuous, smooth, and accurate parameters
efficiently. However, the latent space of such inferred parameters has no direct correspondence to the
parameter spaces used in traditional pipelines, e.g., splitting color into RGB channels or precisely
defining the direction of a light in 3D. We would like to explore the ability to specify lighting configurations
and advanced shading parameters for a seamless integration of our hair rendering into virtual environments
and game engines. We believe that additional supervised training with controlled simulations and captured
hair data would be necessary.
The hair rendering is not entirely temporally coherent when rotating the view. While the per-frame
predictions are reasonable and most strand structures are consistent between frames, there are still visible
flickering artifacts. We believe that temporal consistency could be improved by training with 3D-rotation
augmentations or with video training data.
Our image synthesis results are based on training data consisting entirely of real images. While this
ensures a consistent and realistic feel in our results, we render only the face and hair regions.
Integration with existing CG renderings of other regions is not straightforward, as the synthetic appearance
of those parts also decreases the realism of our method, pushing the composition back into the uncanny
valley. General image blending with style adaptation would be required to blend network-synthesized
images and graphics-rendered images together.
Like other GAN techniques, our results are not fully indistinguishable from real images to a trained
eye given an extended period of observation, but we are confident that our proposed approach will benefit
from future advancements in GANs.
Reference List
[1] 3d animation online services, 3d characters, and character rigging - mixamo. https://www.
mixamo.com/. Accessed: 2015-10-03.
[2] Yobi3d - free 3d model search engine. https://www.yobi3d.com. Accessed: 2015-11-03.
[3] Brian Amberg, Andrew Blake, and Thomas Vetter. On compositional image alignment, with an
application to active appearance models. In IEEE CVPR, pages 1714–1721, 2009.
[4] Dragomir Anguelov, Praveen Srinivasan, Hoi cheung Pang, Daphne Koller, Sebastian Thrun, and
James Davis. The correlated correspondence algorithm for unsupervised registration of nonrigid
surfaces. In NIPS, pages 33–40. MIT Press, 2004.
[5] Dragomir Anguelov, Praveen Srinivasan, Daphne Koller, Sebastian Thrun, Jim Rodgers, and James
Davis. Scape: Shape completion and animation of people. pages 408–416, 2005.
[6] M. Aubry, U. Schlickewei, and D. Cremers. The wave kernel signature: A quantum mechanical
approach to shape analysis. In IEEE ICCV Workshops, 2011.
[7] Samaneh Azadi, Matthew Fisher, Vladimir G. Kim, Zhaowen Wang, Eli Shechtman, and Trevor
Darrell. Multi-content gan for few-shot font style transfer. CVPR, 2018.
[8] Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan Goldman. Patchmatch: a random-
ized correspondence algorithm for structural image editing. ACM Transactions on Graphics-TOG,
28(3):24, 2009.
[9] Jonathan T Barron and Jitendra Malik. Shape, albedo, and illumination from a single image of an
unknown object. CVPR, 2012.
[10] Thabo Beeler, Bernd Bickel, Paul Beardsley, Bob Sumner, and Markus Gross. High-quality single-
shot capture of facial geometry. ACM Trans. on Graphics (Proc. SIGGRAPH), 29(3):40:1–40:9,
2010.
[11] Berkeley Vision and Learning Center. CaffeNet, 2014. https://github.com/BVLC/caffe/tree/
master/models/bvlc_reference_caffenet.
[12] Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3d faces. In Proceedings
of the 26th annual conference on Computer graphics and interactive techniques, pages 187–194,
1999.
[13] Federica Bogo, Michael J. Black, Matthew Loper, and Javier Romero. Detailed full-body recon-
structions of moving people from monocular RGB-D sequences. December 2015.
[14] Federica Bogo, Javier Romero, Matthew Loper, and Michael J. Black. FAUST: Dataset and evalu-
ation for 3D mesh registration. 2014.
[15] James Booth, Anastasios Roussos, Stefanos Zafeiriou, Allan Ponniah, and David Dunaway. A 3d
morphable model learnt from 10,000 faces. In IEEE CVPR, 2016.
[16] Alexander M. Bronstein, Michael M. Bronstein, and Ron Kimmel. Generalized multidimen-
sional scaling: A framework for isometry-invariant partial surface matching. Proc. of the National
Academy of Science, pages 1168–1172, 2006.
[17] Alexander M. Bronstein, Michael M. Bronstein, Ron Kimmel, Mona Mahmoudi, and Guillermo
Sapiro. A gromov-hausdorff framework with diffusion geometry for topologically-robust non-rigid
shape matching. 89(2-3):266–286, 2010.
[18] Gabriel J Brostow, Julien Fauqueur, and Roberto Cipolla. Semantic object classes in video: A
high-definition ground truth database. Pattern Recognition Letters, 30(2):88–97, 2009.
[19] Michael Buhrmester, Tracy Kwang, and Samuel D Gosling. Amazon’s mechanical turk a new
source of inexpensive, yet high-quality, data? Perspectives on psychological science, 6(1):3–5,
2011.
[20] Chen Cao, Yanlin Weng, Shun Zhou, Yiying Tong, and Kun Zhou. Facewarehouse: A 3d facial
expression database for visual computing. IEEE TVCG, 20(3):413–425, 2014.
[21] Chen Cao, Hongzhi Wu, Yanlin Weng, Tianjia Shao, and Kun Zhou. Real-time facial animation
with image-based dynamic avatars. ACM Transactions on Graphics (TOG), 35(4):126, 2016.
[22] Menglei Chai, Linjie Luo, Kalyan Sunkavalli, Nathan Carr, Sunil Hadap, and Kun Zhou. High-
quality hair modeling from a single portrait photo. ACM Transactions on Graphics (Proc. SIG-
GRAPH Asia), 34(6), November 2015.
[23] Menglei Chai, Tianjia Shao, Hongzhi Wu, Yanlin Weng, and Kun Zhou. Autohair: Fully automatic
hair modeling from a single image. ACM Transactions on Graphics (TOG), 35(4):116, 2016.
[24] Menglei Chai, Lvdi Wang, Yanlin Weng, Xiaogang Jin, and Kun Zhou. Dynamic hair manipulation
in images and videos. ACM Trans. Graph., 32(4):75:1–75:8, July 2013.
[25] Menglei Chai, Lvdi Wang, Yanlin Weng, Yizhou Yu, Baining Guo, and Kun Zhou. Single-view hair
modeling for portrait manipulation. ACM Transactions on Graphics (TOG), 31(4):116, 2012.
[26] Huiwen Chang, Jingwan Lu, Fisher Yu, and Adam Finkelstein. Makeupgan: Makeup transfer via
cycle-consistent adversarial networks. CVPR, 2018.
[27] Qifeng Chen and Vladlen Koltun. Robust nonrigid registration by convex optimization. 2015.
[28] Paul Debevec, Tim Hawkins, Chris Tchou, Haarm-Pieter Duiker, and Westley Sarokin. Acquiring
the Reflectance Field of a Human Face. In SIGGRAPH, 2000.
[29] Eugene d’Eon, Guillaume Francois, Martin Hill, Joe Letteri, and Jean-Marie Aubry. An energy-
conserving hair reflectance model. In Proceedings of the Twenty-second Eurographics Conference
on Rendering, EGSR ’11, pages 1181–1187, Aire-la-Ville, Switzerland, Switzerland, 2011. Euro-
graphics Association.
[30] Eugene d’Eon, Steve Marschner, and Johannes Hanika. Importance sampling for physically-based
hair fiber models. In SIGGRAPH Asia 2013 Technical Briefs, SA ’13, pages 25:1–25:4, New York,
NY , USA, 2013. ACM.
[31] Jeff Donahue, Philipp Kr¨ ahenb¨ uhl, and Trevor Darrell. Adversarial feature learning. CoRR,
abs/1605.09782, 2016.
[32] R. Donner, M. Reiter, G. Langs, P. Peloschek, and H. Bischof. Fast active appearance model search
using canonical correlation analysis. IEEE TPAMI, 28(10):1690–1694, 2006.
[33] Mingsong Dou, Jonathan Taylor, Henry Fuchs, Andrew Fitzgibbon, and Shahram Izadi. 3d scanning
deformable objects with a single rgbd sensor. 2015.
[34] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martín Arjovsky, Olivier Mastropi-
etro, and Aaron C. Courville. Adversarially learned inference. CoRR, abs/1606.00704, 2016.
[35] Chi Nhan Duong, Khoa Luu, Kha Gia Quach, and Tien D Bui. Beyond principal components: Deep
boltzmann machines for face modeling. In IEEE CVPR, pages 4786–4794, 2015.
[36] G. J. Edwards, C. J. Taylor, and T. F. Cootes. Interpreting face images using active appearance
models. In Proceedings of the 3rd. International Conference on Face and Gesture Recognition, FG
’98, pages 300–. IEEE Computer Society, 1998.
[37] Alexei A. Efros and William T. Freeman. Image quilting for texture synthesis and transfer. In
Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques,
SIGGRAPH ’01, pages 341–346. ACM, 2001.
[38] Alexei A. Efros and Thomas K. Leung. Texture synthesis by non-parametric sampling. In IEEE
ICCV, pages 1033–, 1999.
[39] A. Elad and R. Kimmel. On bending invariant signatures for surfaces. 25(10):1285–1295, 2003.
[40] Philipp Fischer, Alexey Dosovitskiy, Eddy Ilg, Philip Häusser, Caner Hazirbas, Vladimir Golkov,
Patrick van der Smagt, Daniel Cremers, and Thomas Brox. Flownet: Learning optical flow with
convolutional networks. CoRR, abs/1504.06852, 2015.
[41] Leon A. Gatys, Matthias Bethge, Aaron Hertzmann, and Eli Shechtman. Preserving color in neural
artistic style transfer. CoRR, abs/1606.05897, 2016.
[42] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional
neural networks. In IEEE CVPR, pages 2414–2423, 2016.
[43] Abhijeet Ghosh, Graham Fyffe, Borom Tunwattanapong, Jay Busch, Xueming Yu, and Paul De-
bevec. Multiview face capture using polarized spherical gradient illumination. ACM Trans. Graph.,
30(6):129:1–129:10, 2011.
[44] Abhijeet Ghosh, Tim Hawkins, Pieter Peers, Sune Frederiksen, and Paul Debevec. Practical mod-
eling and acquisition of layered facial reflectance. In ACM Transactions on Graphics (TOG), vol-
ume 27, page 139. ACM, 2008.
[45] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accu-
rate object detection and semantic segmentation. In The IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), June 2014.
[46] Aleksey Golovinskiy, Wojciech Matusik, Hanspeter Pfister, Szymon Rusinkiewicz, and Thomas
Funkhouser. A statistical model for synthesis of detailed facial geometry. ACM Trans. Graph.,
25(3):1025–1034, 2006.
[47] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http:
//www.deeplearningbook.org.
[48] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair,
Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Proceedings of the 27th
International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, pages
2672–2680, Cambridge, MA, USA, 2014. MIT Press.
[49] Paul Graham, Borom Tunwattanapong, Jay Busch, Xueming Yu, Andrew Jones, Paul Debevec, and
Abhijeet Ghosh. Measurement-based Synthesis of Facial Microgeometry. In EUROGRAPHICS,
2013.
[50] Xufeng Han, Thomas Leung, Yangqing Jia, Rahul Sukthankar, and Alexander C Berg. Matchnet:
Unifying feature and metric learning for patch-based matching. In Computer Vision and Pattern
Recognition (CVPR), 2015 IEEE Conference on, pages 3279–3286. IEEE, 2015.
[51] Antonio Haro, Brian Guenter, and Irfan Essa. Real-time, Photo-realistic, Physically Based Ren-
dering of Fine Scale Human Skin Structure. In S. J. Gortler and K. Myszkowski, editors, Eurograph-
ics Workshop on Rendering, 2001.
[52] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recog-
nition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[53] Liwen Hu. http://www-scf.usc.edu/ liwenhu/shm/database.html. 2015.
[54] Liwen Hu, Chongyang Ma, Linjie Luo, and Hao Li. Robust hair capture using simulated examples.
ACM Transactions on Graphics (Proc. SIGGRAPH), 33(4), July 2014.
[55] Liwen Hu, Chongyang Ma, Linjie Luo, and Hao Li. Single-view hair modeling using a hairstyle
database. ACM Transactions on Graphics (Proc. SIGGRAPH), 34(4), August 2015.
[56] Liwen Hu, Chongyang Ma, Linjie Luo, Li-Yi Wei, and Hao Li. Capturing braided hairstyles. ACM
Trans. Graph., 33(6):225:1–225:9, 2014.
[57] Liwen Hu, Shunsuke Saito, Lingyu Wei, Koki Nagano, Jaewoo Seo, Jens Fursund, Iman Sadeghi,
Carrie Sun, Yen-Chun Chen, and Hao Li. Avatar digitization from a single image for real-time
rendering. ACM Trans. Graph., 36(6):195:1–195:14, November 2017.
[58] Qi-Xing Huang, Bart Adams, Martin Wicke, and Leonidas J. Guibas. Non-rigid registration under
isometric deformations. pages 1449–1457, 2008.
[59] Loc Huynh, Weikai Chen, Shunsuke Saito, Jun Xing, Koki Nagano, Andrew Jones, Paul Debevec,
and Hao Li. Mesoscopic facial geometry inference using deep neural networks. In Computer Vision
and Pattern Recognition (CVPR), pages –. IEEE, 2018.
[60] Alexandru Eugen Ichim, Sofien Bouaziz, and Mark Pauly. Dynamic 3d avatar creation from hand-
held video input. ACM Trans. Graph., 34(4):45:1–45:14, 2015.
[61] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with
conditional adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recog-
nition (CVPR), July 2017.
[62] Varun Jain and Hao Zhang. Robust 3d shape correspondence in the spectral domain. In SMI,
page 19. IEEE Computer Society, 2006.
[63] Andrew E. Johnson and Martial Hebert. Using spin images for efficient object recognition in clut-
tered 3d scenes. 21(5):433–449, May 1999.
[64] J. T. Kajiya and T. L. Kay. Rendering fur with three dimensional textures. SIGGRAPH Comput.
Graph., 23(3):271–280, July 1989.
[65] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for
improved quality, stability, and variation. In International Conference on Learning Representations,
2018.
[66] Ira Kemelmacher-Shlizerman. Internet-based morphable model. IEEE ICCV, 2013.
[67] Ira Kemelmacher-Shlizerman and Ronen Basri. 3d face reconstruction from a single image using a
single reference face shape. IEEE TPAMI, 33(2):394–405, 2011.
[68] Tae-Yong Kim and Ulrich Neumann. Interactive multiresolution hair modeling and editing. ACM
Trans. Graph., 21(3):620–629, July 2002.
[69] Vladimir G. Kim, Yaron Lipman, Xiaobai Chen, and Thomas Funkhouser. Möbius Transformations
For Global Intrinsic Symmetry Analysis. 2010.
[70] Vladimir G. Kim, Yaron Lipman, and Thomas Funkhouser. Blended Intrinsic Maps. volume 30,
2011.
[71] Aniket Kittur, Ed H Chi, and Bongwon Suh. Crowdsourcing user studies with mechanical turk.
In Proceedings of the SIGCHI conference on human factors in computing systems, pages 453–456.
ACM, 2008.
[72] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convo-
lutional neural networks. In Advances in neural information processing systems, pages 1097–1105,
2012.
[73] Vivek Kwatra, Arno Schödl, Irfan Essa, Greg Turk, and Aaron Bobick. Graphcut textures: Image
and video synthesis using graph cuts. In ACM SIGGRAPH 2003 Papers, SIGGRAPH ’03, pages
277–286. ACM, 2003.
[74] Doo-Won Lee and Hyeong-Seok Ko. Natural hairstyle modeling and animation. Graph. Models,
63(2):67–85, March 2001.
[75] Marius Leordeanu and Martial Hebert. A spectral technique for correspondence problems using
pairwise constraints. pages 1482–1489, Washington, DC, USA, 2005.
[76] Chen Li, Kun Zhou, and Stephen Lin. Intrinsic face image decomposition with human face priors.
In ECCV (5)’14, pages 218–233, 2014.
[77] Hao Li, Bart Adams, Leonidas J. Guibas, and Mark Pauly. Robust single-view geometry and motion
reconstruction. ACM Transactions on Graphics (Proceedings SIGGRAPH Asia 2009), 28(5), 2009.
[78] Hao Li, Robert W Sumner, and Mark Pauly. Global correspondence optimization for non-rigid
registration of depth scans. In Computer graphics forum, volume 27, pages 1421–1430. Wiley
Online Library, 2008.
[79] Hao Li, Laura Trutoiu, Kyle Olszewski, Lingyu Wei, Tristan Trutna, Pei-Lun Hsieh, Aaron Nicholls,
and Chongyang Ma. Facial performance sensing head-mounted display. ACM Transactions on
Graphics (Proceedings SIGGRAPH 2015), 34(4), July 2015.
[80] Hao Li, Etienne Vouga, Anton Gudym, Linjie Luo, Jonathan T. Barron, and Gleb Gusev. 3d self-
portraits. 2013.
[81] C. Lin, S. Lucey, E. Yumer, O. Wang, and E. Shechtman. St-gan: Spatial transformer generative
adversarial networks for image compositing. In Computer Vision and Pattern Recognition, 2018.
CVPR 2018. IEEE Conference on, pages –. -, 2018.
[82] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James
Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO:
common objects in context. CoRR, abs/1405.0312, 2014.
[83] Yaron Lipman and Thomas Funkhouser. Möbius voting for surface correspondence. pages 72:1–
72:12, 2009.
[84] R. Litman and A.M. Bronstein. Learning spectral descriptors for deformable shape correspondence.
36(1):171–180, 2014.
[85] Ce Liu, Heung-Yeung Shum, and William T. Freeman. Face hallucination: Theory and practice.
Int. J. Comput. Vision, 75(1):115–134, 2007.
[86] Guilin Liu, Duygu Ceylan, Ersin Yumer, Jimei Yang, and Jyh-Ming Lien. Material editing using a
physically based rendering network. In ICCV, 2017.
[87] Jonathan L. Long, Ning Zhang, and Trevor Darrell. Do convnets learn correspondence? In NIPS,
pages 1601–1609. 2014.
[88] Linjie Luo, Hao Li, Sylvain Paris, Thibaut Weise, Mark Pauly, and Szymon Rusinkiewicz. Multi-
view hair capture using orientation fields. In Computer Vision and Pattern Recognition (CVPR),
June 2012.
[89] Linjie Luo, Hao Li, and Szymon Rusinkiewicz. Structure-aware hair capture. ACM Transactions
on Graphics (Proc. SIGGRAPH), 32(4), July 2013.
[90] Linjie Luo, Cha Zhang, Zhengyou Zhang, and Szymon Rusinkiewicz. Wide-baseline hair capture
using strand-based refinement. In Computer Vision and Pattern Recognition (CVPR), June 2013.
[91] Debbie S. Ma, Joshua Correll, and Bernd Wittenbrink. The chicago face database: A free stimulus
set of faces and norming data. Behavior Research Methods, 47(4):1122–1135, 2015.
[92] Wan-Chun Ma, Tim Hawkins, Pieter Peers, Charles-Felix Chabert, Malte Weiss, and Paul Debevec.
Rapid Acquisition of Specular and Diffuse Normal Maps from Polarized Spherical Gradient Illumi-
nation. In Eurographics Symposium on Rendering, 2007.
[93] Satya P Mallick, Todd E Zickler, David J Kriegman, and Peter N Belhumeur. Beyond lambert:
Reconstructing specular surfaces using color. In IEEE CVPR, pages 619–626, 2005.
[94] Stephen R. Marschner, Henrik Wann Jensen, Mike Cammarano, Steve Worley, and Pat Hanrahan.
Light scattering from human hair fibers. ACM Trans. Graph., 22(3):780–791, July 2003.
[95] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and
its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc.
8th Int’l Conf. Computer Vision, volume 2, pages 416–423, July 2001.
[96] Jonathan Masci, Davide Boscaini, Michael M. Bronstein, and Pierre Vandergheynst. Geodesic
convolutional neural networks on riemannian manifolds. In The IEEE International Conference on
Computer Vision (ICCV) Workshops, December 2015.
[97] Iain Matthews and Simon Baker. Active appearance models revisited. Int. J. Comput. Vision,
60(2):135–164, 2004.
[98] Umar Mohammed, Simon J. D. Prince, and Jan Kautz. Visio-lization: Generating novel facial
images. In ACM SIGGRAPH 2009 Papers, pages 57:1–57:8. ACM, 2009.
[99] Koki Nagano, Graham Fyffe, Oleg Alexander, Jernej Barbič, Hao Li, Abhijeet Ghosh, and Paul
Debevec. Skin microstructure deformation with displacement map convolution. ACM Transactions
on Graphics (Proceedings SIGGRAPH 2015), 34(4), 2015.
[100] Richard A. Newcombe, Dieter Fox, and Steven M. Seitz. Dynamicfusion: Reconstruction and
tracking of non-rigid scenes in real-time. 2015.
[101] Kyle Olszewski, Zimo Li, Chao Yang, Yi Zhou, Ronald Yu, Zeng Huang, Sitao Xiang, Shunsuke
Saito, Pushmeet Kohli, and Hao Li. Realistic dynamic facial textures from a single image using
gans. ICCV, 2017.
[102] Kyle Olszewski, Joseph J. Lim, Shunsuke Saito, and Hao Li. High-fidelity facial and speech ani-
mation for vr hmds. ACM Transactions on Graphics (Proceedings SIGGRAPH Asia 2016), 35(6),
December 2016.
[103] Maks Ovsjanikov, Quentin Mérigot, Facundo Mémoli, and Leonidas J. Guibas. One point isometric
matching with the heat kernel. 29(5):1555–1564, 2010.
[104] Sylvain Paris, Hector M Briceño, and François X Sillion. Capture of hair geometry from multiple
images. In ACM Transactions on Graphics (TOG), volume 23, pages 712–719. ACM, 2004.
[105] Sylvain Paris, Will Chang, Oleg I Kozhushnyan, Wojciech Jarosz, Wojciech Matusik, Matthias
Zwicker, and Frédo Durand. Hair photobooth: geometric and photometric acquisition of real
hairstyles. In ACM Transactions on Graphics (TOG), volume 27, page 30. ACM, 2008.
[106] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context
encoders: Feature learning by inpainting. In IEEE CVPR, 2016.
[107] Pascal Paysan, Reinhard Knothe, Brian Amberg, Sami Romdhani, and Thomas Vetter. A 3d face
model for pose and illumination invariant face recognition. In Advanced video and signal based
surveillance, 2009. AVSS’09. Sixth IEEE International Conference on, pages 296–301. IEEE, 2009.
[108] Jonathan Pokrass, Alexander M Bronstein, Michael M Bronstein, Pablo Sprechmann, and Guillermo
Sapiro. Sparse modeling of intrinsic correspondences. In Computer Graphics Forum, volume 32,
pages 459–468. Wiley Online Library, 2013.
[109] Gerard Pons-Moll, Jonathan Taylor, Jamie Shotton, Aaron Hertzmann, and Andrew Fitzgibbon.
Metric regression forests for correspondence estimation. International Journal of Computer Vision,
113(3):163–175, 2015.
[110] Helmut Pottmann, Johannes Wallner, Qi-Xing Huang, and Yong-Liang Yang. Integral invariants for
robust geometry processing. Computer Aided Geometric Design, 26(1):37–60, 2009.
[111] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep
convolutional generative adversarial networks. ICLR, 2016.
[112] Ravi Ramamoorthi and Pat Hanrahan. An efficient representation for irradiance environment maps.
In Proceedings of the 28th annual conference on Computer graphics and interactive techniques,
pages 497–500. ACM, 2001.
[113] Deva Ramanan. Face detection, pose estimation, and landmark localization in the wild. In IEEE
CVPR, pages 2879–2886, 2012.
[114] Emanuele Rodolà, Samuel Rota Bulo, Thomas Windheuser, Matthias Vestner, and Daniel Cremers.
Dense non-rigid shape correspondence using random forests. In Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition, pages 4177–4184, 2014.
[115] Emanuele Rodola, Andrea Torsello, Tatsuya Harada, Yasuo Kuniyoshi, and Daniel Cremers. Elas-
tic net constraints for shape matching. In Proceedings of the IEEE International Conference on
Computer Vision, pages 1169–1176, 2013.
[116] Sami Romdhani. Face image analysis using a multiple features fitting strategy. PhD thesis, Univer-
sity of Basel, 2005.
[117] Sami Romdhani and Thomas Vetter. Estimating 3d shape and texture using pixel intensity, edges,
specular highlights, texture constraints and a prior. In CVPR (2), pages 986–993, 2005.
[118] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedi-
cal image segmentation. CoRR, abs/1505.04597, 2015.
[119] S. Rusinkiewicz, B. Brown, and M. Kazhdan. 3d scan matching and registration. In ICCV 2005
Short Course, 2005.
[120] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng
Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei.
ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision
(IJCV), 115(3):211–252, 2015.
[121] Raif M. Rustamov. Laplace-beltrami eigenfunctions for deformation invariant shape representation.
pages 225–233, 2007.
[122] Y. Sahillioğlu and Yücel Yemez. Coarse-to-fine combinatorial matching for dense isometric shape
correspondence. In Computer Graphics Forum, volume 30, pages 1461–1470. Wiley Online Li-
brary, 2011.
[123] Yusuf Sahillioğlu and Yücel Yemez. Minimum-distortion isometric shape correspondence using EM
algorithm. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 34(11):2203–2215,
2012.
[124] Yusuf Sahillioğlu and Yücel Yemez. Coarse-to-fine isometric shape correspondence by tracking
symmetric flips. In Computer Graphics Forum, volume 32, pages 177–189. Wiley Online Library,
2013.
[125] Shunsuke Saito, Tianye Li, and Hao Li. Real-time facial segmentation and performance capture
from rgb input. In ECCV, 2016.
[126] Shunsuke Saito, Lingyu Wei, Liwen Hu, Koki Nagano, and Hao Li. Photorealistic facial texture
inference using deep neural networks. In The IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), July 2017.
[127] Patsorn Sangkloy, Jingwan Lu, Chen Fang, Fisher Yu, and James Hays. Scribbler: Controlling deep
image synthesis with sketch and color. Computer Vision and Pattern Recognition, CVPR, 2017.
[128] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face
recognition and clustering. In The IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), June 2015.
[129] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun.
Overfeat: Integrated recognition, localization and detection using convolutional networks. CoRR,
abs/1312.6229, 2013.
[130] Fuhao Shi, Hsiang-Tao Wu, Xin Tong, and Jinxiang Chai. Automatic acquisition of high-fidelity
facial performances using monocular videos. ACM Trans. Graph., 33(6):222:1–222:13, 2014.
[131] Jamie Shotton, Ross Girshick, Andrew Fitzgibbon, Toby Sharp, Mat Cook, Mark Finocchio,
Richard Moore, Pushmeet Kohli, Antonio Criminisi, Alex Kipman, and Andrew Blake. Efficient
human pose estimation from single depth images. 2012.
[132] Z. Shu, E. Yumer, S. Hadap, K. Sunkavalli, E. Shechtman, and D. Samaras. Neural face editing
with intrinsic image disentangling. In Computer Vision and Pattern Recognition (CVPR), pages –.
IEEE, 2017.
[133] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image
recognition. arXiv preprint arXiv:1409.1556, 2014.
[134] Erik Sintorn and Ulf Assarsson. Hair self shadowing and transparency depth ordering using occu-
pancy maps. In Proceedings of the 2009 Symposium on Interactive 3D Graphics and Games, I3D
’09, pages 67–74, New York, NY, USA, 2009. ACM.
[135] Solid Angle, 2016. http://www.solidangle.com/arnold/.
[136] Jian Sun, Maks Ovsjanikov, and Leonidas Guibas. A concise and provably informative multi-scale
signature based on heat diffusion. pages 1383–1392, 2009.
[137] Jochen Süßmuth, Marco Winter, and Günther Greiner. Reconstructing animated meshes from time-
varying point clouds. In Computer Graphics Forum, volume 27, pages 1469–1476. Wiley Online
Library, 2008.
[138] Supasorn Suwajanakorn, Ira Kemelmacher-Shlizerman, and Steven M Seitz. Total moving face
reconstruction. In ECCV, pages 796–812. Springer, 2014.
[139] James Taylor, Jamie Shotton, Toby Sharp, and Andrew Fitzgibbon. The Vitruvian manifold: Infer-
ring dense correspondences for one-shot human pose estimation. In Computer Vision and Pattern
Recognition (CVPR), 2012 IEEE Conference on, pages 103–110. IEEE, 2012.
[140] Art Tevs, Alexander Berner, Michael Wand, Ivo Ihrke, Martin Bokeloh, Jens Kerber, and Hans-Peter
Seidel. Animation cartography – intrinsic reconstruction of shape and motion. ACM Transactions
on Graphics, 31(2):12:1–12:15, April 2012.
[141] The Digital Human League. Digital Emily 2.0, 2015. http://gl.ict.usc.edu/Research/
DigitalEmily2/.
[142] J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, and M. Nießner. Face2face: Real-time face
capture and reenactment of rgb videos. In IEEE CVPR, 2016.
[143] Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland,
Damian Borth, and Li-Jia Li. Yfcc100m: The new data in multimedia research. Communications
of the ACM, 59(2):64–73, 2016.
[144] Jonathan Tompson, Murphy Stein, Yann Lecun, and Ken Perlin. Real-time continuous pose re-
covery of human hands using convolutional networks. ACM Transactions on Graphics (TOG),
33(5):169, 2014.
[145] Antonio Torralba, Rob Fergus, and William T Freeman. 80 million tiny images: A large data set for
nonparametric object and scene recognition. IEEE transactions on pattern analysis and machine
intelligence, 30(11):1958–1970, 2008.
[146] Matthew Turk and Alex Pentland. Eigenfaces for recognition. J. Cognitive Neuroscience, 3(1):71–
86, 1991.
[147] Daniel Vlasic, Ilya Baran, Wojciech Matusik, and Jovan Popović. Articulated mesh animation from
multi-view silhouettes. In ACM Transactions on Graphics (TOG), volume 27, page 97. ACM, 2008.
[148] Michael Wand, Bart Adams, Maksim Ovsjanikov, Alexander Berner, Martin Bokeloh, Philipp
Jenke, Leonidas Guibas, Hans-Peter Seidel, and Andreas Schilling. Efficient reconstruction of
nonrigid shape and motion from real-time 3d scanner data. ACM Transactions on Graphics, 28(2),
2009.
[149] Li-Yi Wei and Marc Levoy. Fast texture synthesis using tree-structured vector quantization. In
Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques,
SIGGRAPH ’00, pages 479–488, 2000.
[150] Lingyu Wei, Qixing Huang, Duygu Ceylan, Etienne Vouga, and Hao Li. Dense human body cor-
respondences using convolutional networks. In Computer Vision and Pattern Recognition (CVPR),
2016.
[151] Xiaolin Wei, Peizhao Zhang, and Jinxiang Chai. Accurate realtime full-body motion capture using
a single depth camera. ACM Transactions on Graphics, 31(6):188:1–188:12, November 2012.
[152] Tim Weyrich, Wojciech Matusik, Hanspeter Pfister, Bernd Bickel, Craig Donner, Chien Tu, Janet
McAndless, Jinho Lee, Addy Ngan, Henrik Wann Jensen, and Markus Gross. Analysis of human
faces using a measurement-based skin reflectance model. ACM Trans. on Graphics (Proc. SIG-
GRAPH 2006), 25(3):1013–1024, 2006.
[153] T. Windheuser, U. Schlickewei, F.R. Schmidt, and D. Cremers. Geometrically consistent elastic
matching of 3d shapes: A linear programming solution. pages 2134–2141, 2011.
[154] T. Windheuser, M. Vestner, E. Rodola, R. Triebel, and D. Cremers. Optimal intrinsic descriptors for
non-rigid shape analysis. In British Machine Vision Conf., 2014.
[155] Wenqi Xian, Patsorn Sangkloy, Varun Agrawal, Amit Raj, Jingwan Lu, Chen Fang, Fisher Yu, and
James Hays. Texturegan: Controlling deep image synthesis with texture patches. CVPR, 2018.
[156] Ling-Qi Yan, Henrik Wann Jensen, and Ravi Ramamoorthi. An efficient and practical near and far
field fur reflectance model. ACM Transactions on Graphics (Proceedings of SIGGRAPH 2017),
36(4), 2017.
[157] Ling-Qi Yan, Weilun Sun, Henrik Wann Jensen, and Ravi Ramamoorthi. A bssrdf model for ef-
ficient rendering of fur with global illumination. ACM Transactions on Graphics (Proceedings of
SIGGRAPH Asia 2017), 2017.
[158] Ling-Qi Yan, Chi-Wei Tseng, Henrik Wann Jensen, and Ravi Ramamoorthi. Physically-accurate fur
reflectance: Modeling, measurement and rendering. ACM Transactions on Graphics (Proceedings
of SIGGRAPH Asia 2015), 34(6), 2015.
[159] Xuan Yu, Jason C. Yang, Justin Hensley, Takahiro Harada, and Jingyi Yu. A framework for ren-
dering complex scattering effects on hair. In Proceedings of the ACM SIGGRAPH Symposium on
Interactive 3D Graphics and Games, I3D ’12, pages 111–118, New York, NY, USA, 2012. ACM.
[160] Cem Yuksel, Scott Schaefer, and John Keyser. Hair meshes. ACM Transactions on Graphics
(Proceedings of SIGGRAPH Asia 2009), 28(5):166:1–166:7, 2009.
[161] Sergey Zagoruyko and Nikos Komodakis. Learning to compare image patches via convolutional
neural networks. pages 4353–4361, 2015.
[162] Meng Zhang, Menglei Chai, Hongzhi Wu, Hao Yang, and Kun Zhou. A data-driven approach to
four-view image-based hair modeling. ACM Transactions on Graphics (TOG), 36(4):156, 2017.
[163] Richard Zhang, Phillip Isola, and Alexei A Efros. Split-brain autoencoders: Unsupervised learning
by cross-channel prediction. In CVPR, 2017.
[164] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In 2017 IEEE Confer-
ence on Computer Vision and Pattern Recognition (CVPR), pages 6230–6239, July 2017.
[165] Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Malik, and Alexei A Efros. View synthesis
by appearance flow. In European Conference on Computer Vision, 2016.
[166] Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A. Efros. Generative visual manipu-
lation on the natural image manifold. In Proceedings of European Conference on Computer Vision
(ECCV), 2016.
[167] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation
using cycle-consistent adversarial networks. ICCV, 2017.
[168] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A Efros, Oliver Wang, and Eli
Shechtman. Toward multimodal image-to-image translation. In Advances in Neural Information
Processing Systems, pages 465–476, 2017.
[169] Xiangyu Zhu, Zhen Lei, Xiaoming Liu, Hailin Shi, and Stan Z. Li. Face alignment across large
poses: A 3d solution. CoRR, abs/1511.07212, 2015.
[170] Arno Zinke, Cem Yuksel, Andreas Weber, and John Keyser. Dual scattering approximation for
fast multiple scattering in hair. ACM Transactions on Graphics (Proceedings of SIGGRAPH 2008),
27(3):32:1–32:10, 2008.