Effective Data Representations for Deep Human Digitization
by
Shunsuke Saito
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Science)
May 2020
Copyright 2020 Shunsuke Saito
Acknowledgements
First of all, I would like to thank my advisor Prof. Hao Li for mentoring me in conducting cutting-edge
research, and instilling in me a never-give-up spirit. My time at USC has been the most intensive and fruitful
time of my career. His thoughtful suggestions and guidance allowed me to achieve results that would not
have been possible otherwise. I would also like to thank my committee members, Prof. Andy Nealen and
Prof. Aiichiro Nakano for the direction, discussions, and insightful suggestions. I am also indebted to
Kathleen Haase and Christina Trejo for handling administration at USC ICT, without whom none of my
research would have been possible.
Furthermore, I would like to express gratitude to my close collaborators: Thank you Angjoo Kanazawa,
Koki Nagano, Chongyang Ma, Linjie Luo, Weikai Chen, Jun Xing, and Yajie Zhao for the wonderful
collaborations. Their positive and constructive feedback always helped me proceed with confidence. I
would like to extend my gratitude to Jason Saragih, Duygu Ceylan, Nobuyuki Umetani, and Hanbyul Joo
for the excellent guidance and support during my internships at Facebook Reality Labs, Adobe Research,
University of Tokyo, and Facebook AI Research. I would also like to take the opportunity to thank my
undergraduate and master's supervisor, Shigeo Morishima, who introduced me to computer graphics and
computer vision.
I would also like to acknowledge my lab mates at USC for being amazing collaborators and sharing both
hard and fun times throughout many deadlines: Lingyu Wei, Liwen Hu, Tianye Li, Kyle Olszewski, Zeng
Huang, Shichen Liu, Zimo Li, and Yi Zhou. I also had the great pleasure to supervise many talented junior
students: Ronald Yu, Shugo Yamaguchi, and Ryota Natsume. I want to thank Zimo Li and Kyle Morgenroth
for proofreading this dissertation. Thanks also goes to my friends during my internship at Facebook Reality
Labs and Adobe Research: Timur Bagautdinov, Alejandro Newell, Fait Poms, Aayush Bansal, Chia-Yin
Tsai, Kyungdon Joo, Xinshuo Weng, Xuanyi Dong, Zhiwei Deng, Ke Xian and Simon Niklaus.
Finally, I would like to thank my father, my mother, and my sister for their encouragement and
understanding. A special thank you must go to my wife Rui for giving me support and comfort, as well as
sometimes being a capture subject for experiments.
Table of Contents
Acknowledgements ii
List Of Tables vi
List Of Figures viii
Abstract xiv
Chapter 1: Introduction 1
1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Chapter 2: Template-based Approach for High-Fidelity Face Digitization 10
2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.1 Facial Reflectance and Geometry Capture. . . . . . . . . . . . . . . . . . . . . . . 13
2.1.2 Texture Synthesis and Image Completion . . . . . . . . . . . . . . . . . . . . . . 15
2.1.3 Deep Learning Based Image Synthesis . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 High-fidelity Facial Texture and Geometry Inference . . . . . . . . . . . . . . . . . . . . 18
2.2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.2 Training Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.3 Reflectance and Geometry Inference . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.4 Symmetry-Aware Texture Completion and Refinement . . . . . . . . . . . . . . . 24
2.2.5 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2.7 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.2.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Chapter 3: Volumetric Representation for 3D Hair Digitization 43
3.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2 Single-View 3D Hair Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.2.2 Hair Data Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.2.3 Volumetric Variational Autoencoder . . . . . . . . . . . . . . . . . . . . . . . 50
3.2.4 Hair Embedding Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.2.5 Post-Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.3.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.3.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Chapter 4: Implicit Shape Representations for Clothed Human Digitization 67
4.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.2 Silhouette-based Shape Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2.2 Multi-View Silhouette Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.2.3 Deep Visual Hull Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.2.4 Front-to-Back Texture Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.2.5 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.2.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.2.7 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.2.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.3 Pixel-Aligned Implicit Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.3.2 Single-view Surface Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.3.3 Texture Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.3.4 Multi-View Stereo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.3.5 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.3.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.3.7 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
4.3.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Chapter 5: Conclusion and Future Directions 114
5.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.2 Open Questions and Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Reference List 118
List Of Tables
2.1 Quantitative evaluation. We measure the peak signal-to-noise ratio (PSNR) and the structural
similarity (SSIM) of the inferred images for 100 test images compared to the ground truth.
The inferred displacement value is computed using the output medium- and high-frequency
displacement maps to recover the overall displacement. . . . . . . . . . . . . . . . . . . . 28
2.2 Quantitative comparison of our diffuse albedo inference with several alternative methods,
measured using the PSNR and the root-mean-square error (RMSE). . . . . . . . . . . . . 36
2.3 Runtime performance for each component of our system. . . . . . . . . . . . . . . . . . . 36
3.1 Our volumetric VAE architecture. The last convolution layer in the encoder is duplicated
to produce the mean and the variance used in the reparameterization trick. The decoders for the
occupancy field and the orientation field share the same architecture except for the last channel
size (1 and 3, respectively); their weights are not shared. All convolutional layers are followed by batch
normalization and ReLU activation except the last layer in both the encoder and the decoder. 52
3.2 Evaluation of training loss functions in terms of reconstruction accuracy for occupancy field
(IOU, precision and recall) and flow field (L2 loss). We evaluate the effectiveness of our
proposed loss function by comparing it with (1) our loss function without the KL-divergence
loss term denoted as “Ours (AE)”, (2) a state-of-the-art volumetric generative model using
VAE [20] and (3) vanilla VAE [103]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.3 Evaluation of different embedding methods. First row: our linear PCA embedding. Second
row: a single-vector VAE (in contrast to our volumetric VAE). Third row: a non-linear
embedding with fully connected layers and ReLU activations. The dimension of latent
space is 512 for all the three methods here. . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.4 Evaluation of prediction methods. We compare our embedding method based on Iterative
Error Feedback (IEF) [26] with direct prediction in a single shot for hair coefficients and
end-to-end training where the network directly predicts the volumetric representation given
an input image. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.1 Evaluation of our silhouette-based representation compared to direct voxel prediction.
The errors are measured using Chamfer Distance (CD) and Earth Mover’s Distance (EMD)
between the reconstructed meshes and the ground-truth. . . . . . . . . . . . . . . . . . . . 87
4.2 Evaluation of our greedy sampling method to compute deep visual hull. . . . . . . . . . . 88
4.3 Ablation study of our silhouette-based representation. . . . . . . . . . . . . . . . . . . . . 89
4.4 Quantitative evaluation on RenderPeople and BUFF dataset for single-view reconstruction. 108
4.5 Quantitative comparison between multi-view reconstruction algorithms using 3 views. . . 108
4.6 Quantitative comparison between a template-based method [5] using a dense video sequence
and ours using 3 views. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.7 Ablation study on the sampling strategy. . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.8 Ablation study on network architectures. . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
List Of Figures
1.1 Recent real-time performance capture systems enable us to bring avatars to life at interactive
rates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 We humans have a mental model of a 3D human, which allows us to imagine the complete
3D shape and color of unseen subjects. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 The choice of data representation affects expressiveness and memory-efficiency. . . . . . . 4
1.4 Our system infers high-fidelity facial reflectance and geometry maps from a single image
(diffuse albedo, specular albedo, as well as medium- and high-frequency displacements).
These maps can be used for high-fidelity rendering under novel illumination conditions. . . 6
1.5 Our method automatically generates 3D hair strands from a variety of single-view inputs.
Each panel from left to right: input image, volumetric representation with color-coded local
orientations predicted by our method, and final synthesized hair strands rendered from two
viewing points. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.6 Given a single image of a person from the frontal view, we can automatically reconstruct a
complete and textured 3D clothed body shape. . . . . . . . . . . . . . . . . . . . . . . . . 8
1.7 Pixel-aligned Implicit function (PIFu): We present pixel-aligned implicit function (PIFu),
which allows recovery of high-resolution 3D textured surfaces of clothed humans from a
single input image (top row). Our approach can digitize intricate variations in clothing, such
as wrinkled skirts and high-heels, as well as complex hairstyles. The shape and textures can
be fully recovered including unseen regions such as the back of the subject. PIFu can be
also extended to multi-view input images (bottom row). . . . . . . . . . . . . . . . . . . . 9
2.1 System Overview. Given an unconstrained input image (left), the base mesh and corre-
sponding facial texture map are extracted. The diffuse and specular reflectance, and the
mid- and high-frequency displacement maps are inferred from the visible regions (Sec.
2.2.3). These maps are then completed, refined to include additional details inferred from
the visible regions, and then upsampled using a super-resolution algorithm (Sec. 2.2.4). The
resulting high-resolution reflectance and geometry maps may be used to render high-fidelity
avatars (right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2 Solving the described sub-tasks separately makes the complete texture inference pipeline
more tractable, allowing us to generate highly plausible output. Directly generating a
complete texture map from a partial input with a single network produces significantly
inferior results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 Reflectance inference pipeline. The texture extracted from the input image and the corre-
sponding visibility mask are passed through a U-net encoder-decoder framework to produce
a diffuse reflectance map. Another network takes the same input and produces the specular
reflectance and mid- and high-frequency displacement maps. These networks are trained
using a combination of L1 and adversarial loss (D), as well as feature matching loss (FM),
using features extracted from the discriminator. . . . . . . . . . . . . . . . . . . . . . . . 22
2.4 Our texture completion pipeline. The inferred texture and visibility mask are downsampled
by a factor of 4 and completed. The resulting low-resolution texture is upsampled to the
original resolution and blended with the input texture, then passed through a network that
refines the texture to add subtle yet crucial details. Finally, a super-resolution algorithm is
applied to generate high-fidelity 2048 × 2048 textures. . . . . . . . . . . . . . . . . . . 24
2.5 Feature flipping in the latent space. The intermediate features obtained from the convolu-
tional layers of the network are flipped across the V-axis and concatenated to the original
features. This process allows the texture completion process to exploit the natural near-
symmetry in human faces to infer texture maps that contain local variations but are nearly
symmetric. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.6 Our rendering pipeline. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.7 Inference in the wild. The first column contains the input image and the corresponding
inferred output applied to the base mesh. The second and third columns contain new
renderings of the avatar under novel lighting conditions (the lighting environments we use
are inset in the top left example renderings). . . . . . . . . . . . . . . . . . . . . . . . . 29
2.8 Additional results with images from the Chicago Face Dataset [125] . . . . . . . . . . . . 30
2.9 Zoom-in results showing synthesized mesoscopic details. . . . . . . . . . . . . . . . . . 31
2.10 Ablation study demonstrating the importance of each of the outputs of the proposed network.
The second column shows the rendering of the diffuse reflectance using the diffuse albedo
map, which lacks the surface geometry cues provided by the specular reflection. With
the specular reflection added (third and fifth column), the view-dependent nature of the
surface reflection enhances the sense of the object’s 3D shape. The specular albedo map
simulates the specular occlusion and provides local variations in the specular intensities
(third column). Adding the displacement map provides more lifelike skin textures, but the
rendering looks too shiny without the specular albedo map (fourth column). Combining all
the maps provides the best rendering result (fifth column). . . . . . . . . . . . . . . . . . 32
2.11 Consistency of the output obtained using input images of the same subject from different
viewpoints. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.12 Consistency of the output obtained using input images of the same subject captured under
different lighting conditions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.13 Evaluation on expressions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.14 Comparison with [82] and our network, both with and without the feature flipping layer. . 36
2.15 Comparison with PCA, Visio-lization [135], and a state-of-the-art diffuse albedo inference
method [167]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.16 Comparison of diffuse albedo inference with a data-driven intrinsic decomposition method
[112] (produced by original authors). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.17 Comparison of diffuse albedo inference with an unsupervised face alignment method [185],
in which skin textures are represented by a linear basis. . . . . . . . . . . . . . . . . . . . 38
2.18 Comparison of diffuse albedo inference with an unsupervised intrinsic decomposition
method [174]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.19 Comparison with PCA, Visio-lization [135], and a state-of-the-art diffuse albedo inference
method [167] using Light Stage ground truth data. . . . . . . . . . . . . . . . . . . . . . 39
2.20 Ground truth comparison using Light Stage (LS) data. . . . . . . . . . . . . . . . . . . . 40
2.21 Limitations. Our method produces artifacts in the presence of strong shadows (lower right)
and non-skin objects due to segmentation failures (upper left). Also volumetric beards are
not faithfully reconstructed (upper right). Strong dynamic wrinkles (lower left) may cause
artifacts in the inferred displacement maps. . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.1 Our pipeline overview. Our volumetric VAE consists of an encoder and two decoders (blue
blocks) with blue arrows representing the related dataflow. Our hair embedding network
(orange blocks) follows the red arrows to synthesize hair strands from an input image. . . . 48
3.2 Volumetric hairstyle representation. From left to right: original 3D hairstyle represented as
strands; our representation using occupancy and flow fields defined on regular grids, with
the visualization of the occupancy field boundary as a mesh surface and the encoding of local
flow values as surface color; regrown strands from our representation. . . . . . . . . . . 49
3.3 Comparison of different training schemes. From left to right, we show (1) the original
strands, (2) the ground-truth volumetric representation, the reconstruction results using (3)
vanilla VAE [103], (4) a volumetric VAE [20], (5) our proposed VAE, (6) our VAE with
non-linear embedding and (7) our VAE with PCA embedding, respectively. . . . . . . . . . 58
3.4 Modeling results of 3D hairstyle from single input image. From left to right, we show the
input image, occupancy field with color-coded local orientations predicted by our single-
view hair modeling pipeline, as well as the synthesized output strands. None of these input
images has been used for training of our embedding network. . . . . . . . . . . . . . . . 59
3.5 Comparisons between our method with AutoHair [29]. From left to right, we show the
input image, the result from AutoHair, the volumetric output of our VAE network, and our
final strands. We achieve comparable results on input images of typical hairstyles (a)-(f)(l),
and can generate results closer to the modeling target on more challenging examples (g)-(k).
The inset images of (g) and (h) show the intermediate segmentation masks generated by
AutoHair. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.6 Comparisons between our method with the state-of-the-art avatar digitization method [76]
using the same input images. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.7 Interpolation results of multiple hairstyles. The four input hairstyles are shown at the corners
while all the interpolation results are shown in-between based on bi-linear interpolation
weights. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.8 Comparison between direct interpolation of hair strands [210] and our latent space interpo-
lation results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.1 Overview of our framework. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2 Illustration of our silhouette synthesis network. . . . . . . . . . . . . . . . . . . . . . . . 72
4.3 GAN helps generate clean silhouettes in presence of ambiguity in silhouette synthesis from
a single view. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.4 Illustration of our front-to-back synthesis network. . . . . . . . . . . . . . . . . . . . . . 76
4.5 Illustration of our back-view rendering approach. . . . . . . . . . . . . . . . . . . . . . . 79
4.6 Our synthetically rendered training samples in our dataset. . . . . . . . . . . . . . . . . . 80
4.7 Illustration of the baseline voxel regression network. . . . . . . . . . . . . . . . . . . . . 81
4.8 Our 3D reconstruction results of clothed human body using test images from the DeepFash-
ion dataset [120]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.9 Our 3D reconstruction results of clothed human body using test images from the syntheti-
cally rendered data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.10 Comparison with multiview visual hull algorithms. Despite the single view input, our
method produces comparable reconstruction results. Note that input image in red color is
the single-view input for our method and the top four views are used for Huang et al. [79]. 85
4.11 We qualitatively compare our method with two state-of-the-art single view human recon-
struction techniques, HMR [92] and BodyNet [193]. . . . . . . . . . . . . . . . . . . . . 86
4.12 Qualitative evaluation of our silhouette-based shape representation as compared to direct
voxel prediction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.13 Comparisons between our deep visual hull method with a native visual hull algorithm, using
both random view selection and our greedy view sampling strategy. . . . . . . . . . . . . 88
4.14 Qualitative evaluation of different silhouette synthesis methods. From left to right: silhouette
of the input view, ground-truth silhouette of the target view, results of our full algorithm,
results without 2D pose information, results from a set of predefined view points, and the
ones by training on the SURREAL dataset [194]. . . . . . . . . . . . . . . . . . . . . . . 89
4.15 Failure cases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.16 Overview of our clothed human digitization pipeline. Given an input image, a pixel-aligned
implicit function (PIFu) predicts the continuous inside/outside probability field of a clothed
human. Similarly, PIFu for texture inference (Tex-PIFu) infers an RGB value at given 3D
positions of the surface geometry with arbitrary topology. . . . . . . . . . . . . . . . . . 94
4.17 Multi-view PIFu. PIFu can be extended to support multi-view inputs by decomposing the
implicit function f into a feature embedding function f1 and a multi-view reasoning
function f2. f1 computes a feature embedding from each view in the 3D world coordinate
system, which allows aggregation from arbitrary views. f2 takes the aggregated feature vector
to make a more informed 3D surface and texture prediction. . . . . . . . . . . . . . . . . 98
4.18 Qualitative single-view results on real images from DeepFashion dataset [120]. The pro-
posed Pixel-Aligned Implicit Functions, PIFu, achieves a topology-free, memory efficient,
spatially-aligned 3D reconstruction of geometry and texture of clothed human. . . . . . . . 100
4.19 Results on video sequences obtained from [196]. While ours uses a single view input, the
ground truth is obtained from 8 views with controlled lighting conditions. . . . . . . . . . 103
4.20 Comparison with other human digitization methods from a single image. For each input
image on the left, we show the predicted surface (top row), surface normal (middle row),
and the point-to-surface errors (bottom row). . . . . . . . . . . . . . . . . . . . . . . . . 104
4.21 Comparison with SiCloPe [138] on texture inference. While texture inference via a view
synthesis approach suffers from projection artifacts, the proposed approach does not, as it
directly inpaints textures on the surface geometry. . . . . . . . . . . . . . . . . . . . . . 105
4.22 Comparison with learning-based multi-view methods. Ours outperforms other learning-
based multi-view methods qualitatively and quantitatively. Note that all methods are trained
with three view inputs from the same training data. . . . . . . . . . . . . . . . . . . . . . 106
4.23 Our surface and texture predictions increasingly improve as more views are added. . . . . 107
4.24 Comparison with a template-based method [5]. Note that while Alldieck et al. uses a dense
video sequence without camera calibration, ours uses the calibrated three views as input. . 107
4.25 Comparison with Voxel Regression Network [86]. While [86] suffers from texture projection
error due to the limited precision of voxel representation, our PIFu representation efficiently
represents not only surface geometry in a pixel-aligned manner, but also complete texture
on the missing region. Note that [86] can only texture the visible portion of the person by
projecting the foreground to the recovered surface. In comparison, we recover the texture
of the entire surface, including the unseen regions. . . . . . . . . . . . . . . . . . . . . . 110
4.26 Reconstructed geometry and point to surface error visualization using different sampling
methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
4.27 Reconstructed geometry and point to surface error visualization using different architectures
for the image encoder. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.1 While explicit shape representations may suffer from poor visual quality due to limited
resolutions or failure to handle arbitrary topologies (a), implicit surfaces handle arbitrary
topologies with high resolutions in a memory efficient manner (b). However, in contrast
to the explicit representations, it is not feasible to directly project an implicit field onto a
2D domain via perspective transformation. Thus, we introduce a field probing approach
based on efficient ray sampling that enables unsupervised learning of implicit surfaces from
image-based supervision. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
Abstract
If creating 3D digital humans were as easy as taking a picture, the way content could be created and
consumed would be changed forever, and the impact would be felt in a wide range of applications from
communication, entertainment, design, and even manufacturing. In particular, accessible human digitization
technologies are key to democratizing immersive social experiences in a digital world enabled by
augmented and virtual reality. Furthermore, the prevalence of sensor-packed commodity devices is
facilitating this trend. Consumer devices such as mobile phones are now capable of imaging objects and
environments in high resolution. However, reconstructing a 3D human from minimal image inputs has
been an ill-posed problem due to depth ambiguity, the complexity of shapes and colors, and incomplete
observations.
In this thesis, we address the ill-posed nature of human digitization from minimal inputs by leveraging
high-capacity machine learning models (i.e., deep learning) together with effective data representations on
several domains including the face, hair, and clothed human body. For each of these domains, we develop a
novel method to infer shape and appearance from unconstrained inputs (e.g., a single image).
First, we introduce a method to infer physically plausible facial reflectance and geometry from a single
image. While the existing approaches either model only a coarse level of facial attributes or naively
hallucinate details with ad-hoc formulations, the proposed approach learns these attributes from physically
accurate ground truth with carefully designed training data. Furthermore, we propose a novel texture
inpainting approach that effectively leverages the shared parameterization across faces, generating complete
facial attributes with high-fidelity details retained. We demonstrate that our approach achieves a significantly
more realistic reconstruction of human faces from images in the wild compared to other methods.
Second, we propose a data-driven approach to model 3D hair from a monocular input. Unlike the face,
it is non-trivial to apply a template-based shape representation to 3D hair due to its large shape variation.
Thus, we introduce a volumetric shape representation that is interchangeably convertible with a hair-strand
representation. Using this volumetric shape representation, we learn a 3D hair manifold where we can
sample and interpolate new plausible hair styles using a variational autoencoder. Additionally, using the
learned compact embeddings, we train a regression network to infer corresponding 3D hair styles from
unconstrained images. We demonstrate that the proposed approach successfully handles very challenging
inputs where existing approaches fail.
Third, we explore implicit shape representations for clothed human digitization to achieve the best
trade-off between representation power and memory-efficiency. As clothed humans exhibit extremely
large variations in both shape and color, a highly expressive shape representation is required. Moreover, a
representation needs to be memory-efficient so that it can be incorporated into a deep learning framework.
To this end, we introduce a novel method to infer clothed humans by representing the underlying 3D shapes with
2D images. This way, a wide range of clothing types can be efficiently modeled in a unified framework.
Finally, we further push the envelope of implicit shape representations by introducing a Pixel-Aligned
Implicit Function (PIFu), which models shapes and colors in continuous scalar/vector fields, eliminating the
need for discretization. The proposed representation not only demonstrates unprecedentedly high-resolution
reconstructions of fully textured clothed humans, but also handles single-view and multi-view inputs in a
unified manner.
Chapter 1
Introduction
Creating a believable digital double is a long-standing challenge in the computer graphics and computer
vision communities. As shown in Fig. 1.1, together with the advent of real-time performance capture
techniques [23, 22, 166], photo-realistic embodiment of ourselves will connect distant people, bringing a
totally new experience to collaboration, entertainment, and communication. Furthermore, instantaneous
avatar generation could impact art, design, and even manufacturing in numerous ways. Recent movies and
games demonstrate that almost indistinguishable digital humans can be created with the help of advanced
photogrammetry capture systems [56, 48, 137]. However, this level of quality has remained largely inaccessible to the general
community due to the reliance on professional capture systems with strict environmental constraints (e.g.,
a large number of cameras, controlled illumination) that are prohibitively expensive and cumbersome to
deploy.
Meanwhile, powerful mobile devices have become widely available, providing access to mega-pixel
cameras for non-professionals. To ease the inaccessibility of high-end human digitization systems, a
natural direction is to utilize such light-weight capture systems for reconstructing 3D humans, which we
term human digitization from minimal inputs. However, this is generally an ill-posed problem. First,
observed images are formed by the integration of interactions between light and materials (i.e., geometry
and reflectance) [91], which inherently introduces ambiguity into their disentanglement. While many researchers
have attempted to solve this inverse problem using general cues such as shading [72, 11], solutions relying on
general priors have not reached the granularity needed to demonstrate the authenticity of digitization. Furthermore,
observations from monocular sensors are typically incomplete. That is, we lack the observation of the other
side, posing a significant challenge to obtaining complete 3D models.
Figure 1.1: Recent real-time performance capture systems enable us to bring avatars to life at interactive
rates.
Consider the image of the person in Figure 1.2. We humans can imagine how he looks in 3D, including
the backside. This is because we have a mental model of 3D humans that has been acquired from
accumulated observations of a wide range of humans in our daily life. In particular, since humans are our
primary beings to interact with, our mental reconstruction of a human is surprisingly precise and accurate,
even from incomplete observations. This fact motivates us to utilize category-specific priors learned from
data itself. To this end, this dissertation proposes data-driven human digitization algorithms that combine
observations from minimal inputs with existing high-quality data.
To infer fine-grained 3D shape and appearance from images, where designing hand-crafted features
is non-trivial, we need a high-capacity machine learning algorithm. The advent of deep learning shows
promise by eliminating the need for hand-crafted features and demonstrates groundbreaking performance
in many computer vision tasks including semantic segmentation [172], object detection [69], and human
pose estimation [25]. To fully harness category-specific data priors, we also base our algorithms on this
paradigm.
Figure 1.2: We humans have a mental model of a 3D human, which allows us to imagine the complete 3D
shape and color of unseen subjects.
Aside from the learning algorithms, in this dissertation, we argue that the key to successful human
digitization is the choice of data representation. As Fig. 1.3 shows, while various data representations
exist for human digitization, each of them exhibits different properties. When it comes to deep learning,
the important criteria are three-fold. First of all, expressiveness plays a critical role in the fidelity of the final
reconstructions. For example, certain domains such as the clothed human body require representations that
support topology change and large deformation. Thus, volumetric representations may be preferable over
template-based approaches. Second, memory-efficiency is another essential aspect for effective training of
deep neural networks. Due to the limited capacity of the current hardware, data representations that have a
large memory footprint are prohibitive in practice. Lastly, compatibility with deep learning techniques is
of great importance. In particular, leveraging spatial correlation via convolution layers is a common practice
for 3D and 2D deep learning. Converting geometric information into uniform grid structures enables us to
effectively utilize modern deep learning techniques such as generative adversarial networks (GAN) [60] or
variational autoencoder (VAE) [103].
Figure 1.3: The choice of data representation affects expressiveness and memory-efficiency.
In this dissertation, we develop learning algorithms to reconstruct high-fidelity shape and appearance
by choosing an effective data representation that suits various body parts. Based on its practical value and
importance, we choose face, hair, and the clothed human body as the target domains. Our first work focuses
on physically plausible facial reflectance and geometry modeling. To this end, we employ a template-based
shape representation as the shape and color variations of faces are well constrained. On top of it, we
represent person-specific details in reflectance and geometry on a 2D shared parameterization by utilizing a
texture mapping technique [27]. We demonstrate that this 2D representation allows us to view the inference
as an image-to-image translation problem, resulting in high-fidelity reconstructions in a memory-efficient
manner. Our second work addresses the machine learning compatibility problem of 3D hair by converting
hair strands into a regular volumetric representation. We show that complex, diverse hair styles can be
embedded into a low-dimensional subspace for plausible interpolation, sampling, and direct regression
from a single image without any preprocessing required. Our third and fourth works explore effective shape
representations for clothed human bodies that exhibit extremely large shape and color variations. More
specifically, to address the memory limitation of an explicit volumetric representation (i.e., voxels), we
introduce novel implicit surface representations that prescribe the underlying surface geometry by the level
set of continuous scalar occupancy fields. The third work uses a set of inferred silhouettes from novel
view points as an intermediate shape representation, demonstrating memory efficiency and flexibility of
the resulting geometry compared to other existing approaches. Our fourth work achieves unprecedentedly
high-resolution inference and memory efficiency by introducing a Pixel-Aligned Implicit Function (PIFu),
which queries arbitrary depth values along camera rays and generates occupancy probability fields without
the necessity of discretization.
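To make this representation trade-off concrete, the following minimal sketch (hypothetical names; a sphere stands in for a learned occupancy field) contrasts an explicit voxel grid, whose memory grows cubically with resolution, with an implicit occupancy function that can be queried at arbitrary continuous points, for example along a camera ray.

```python
import numpy as np

def occupancy(points, center=(0.0, 0.0, 0.0), radius=0.5):
    """Implicit occupancy field: 1 inside the surface, 0 outside.

    A sphere stands in for the learned field here; in the methods of this
    thesis the function would be a neural network queried at arbitrary
    continuous 3D points, so no resolution has to be fixed in advance.
    """
    d = np.linalg.norm(np.asarray(points, dtype=np.float32) - np.asarray(center), axis=-1)
    return (d < radius).astype(np.float32)

# Explicit voxel grid: memory grows cubically with resolution.
res = 128
axis = np.linspace(-1.0, 1.0, res, dtype=np.float32)
grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
voxels = occupancy(grid.reshape(-1, 3)).reshape(res, res, res)
print(voxels.nbytes / 2**20, "MiB for a", res, "^3 float32 grid")  # 8 MiB at 128^3

# Implicit query: evaluate only where needed, e.g. along a single camera ray.
ray = np.linspace(-1.0, 1.0, 64)[:, None] * np.array([0.0, 0.0, 1.0])
print(occupancy(ray))  # occupancy values along the ray, no grid stored
```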
1.1 Contributions
Template-based Approach for High-Fidelity Face Digitization
We present a deep learning-based technique to infer high-quality facial reflectance and geometry given a
single unconstrained image of the subject, which may contain partial occlusions and arbitrary illumination
conditions. The reconstructed high-resolution textures, which are generated in only a few seconds, include
high-resolution skin surface reflectance maps, representing both the diffuse and specular albedo, and
medium- and high-frequency displacement maps, thereby allowing us to render compelling digital avatars
under novel lighting conditions. To extract this data, we train our deep neural networks with a high-quality
skin reflectance and geometry database created with a state-of-the-art multi-view photometric stereo system
using polarized gradient illumination. Given the raw facial texture map extracted from the input image,
our neural networks synthesize complete reflectance and displacement maps, as well as complete missing
regions caused by occlusions. The completed textures exhibit consistent quality throughout the face due to
our network architecture, which propagates texture features from the visible region, resulting in high-fidelity
details that are consistent with those seen in visible regions. We describe how this highly underconstrained
problem is made tractable by dividing the full inference into smaller tasks, which are addressed by dedicated
neural networks. We demonstrate the effectiveness of our network design with robust texture completion
from images of faces that are largely occluded. With the inferred reflectance and geometry data, Fig. 1.4
demonstrates the rendering of high-fidelity 3D avatars from a variety of subjects captured under different
lighting conditions. In addition, we perform evaluations demonstrating that our method can infer plausible
facial reflectance and geometric details comparable to those obtained from high-end capture devices, and
outperform alternative approaches that require only a single unconstrained input image.
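As described for the inference networks in this system (Sec. 2.2.3 and Fig. 2.3), the reflectance and displacement maps are predicted by image-to-image networks trained with a combination of L1, adversarial, and feature-matching losses. The sketch below illustrates such a combined objective; the `generator` and `discriminator` modules, the loss weights, and the discriminator's output format are placeholders, not the exact configuration used in this thesis.

```python
import torch
import torch.nn.functional as F

def reflectance_loss(generator, discriminator, partial_texture, target_maps,
                     w_l1=10.0, w_adv=1.0, w_fm=10.0):
    """Sketch of a combined objective for the inference networks: an L1
    reconstruction term, an adversarial term, and a feature-matching term
    computed from the discriminator's intermediate activations.
    The discriminator is assumed to return (real/fake logits, feature list).
    """
    fake_maps = generator(partial_texture)

    # L1 reconstruction against ground-truth reflectance/displacement maps.
    loss_l1 = F.l1_loss(fake_maps, target_maps)

    # Adversarial term: the generator tries to make its output score as real.
    score_fake, feats_fake = discriminator(partial_texture, fake_maps)
    loss_adv = F.binary_cross_entropy_with_logits(
        score_fake, torch.ones_like(score_fake))

    # Feature matching: match discriminator features of fake and real outputs.
    with torch.no_grad():
        _, feats_real = discriminator(partial_texture, target_maps)
    loss_fm = sum(F.l1_loss(f, r) for f, r in zip(feats_fake, feats_real))

    return w_l1 * loss_l1 + w_adv * loss_adv + w_fm * loss_fm
```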
Figure 1.4: Our system infers high-fidelity facial reflectance and geometry maps from a single image
(diffuse albedo, specular albedo, as well as medium- and high-frequency displacements). These maps can
be used for high-fidelity rendering under novel illumination conditions.
Volumetric Representation for 3D Hair Digitization
We propose to represent the manifold of 3D hairstyles implicitly through a compact latent space of a
volumetric variational autoencoder (VAE). This deep neural network is trained with volumetric orientation
field representations of 3D hair models and can synthesize new hairstyles from a compressed code. To enable
end-to-end 3D hair inference, we train an additional embedding network to predict the code in the VAE latent
space from any input image. Strand-level hairstyles can then be generated from the predicted volumetric
representation. Our fully automatic framework does not require any ad-hoc face fitting, intermediate
classification and segmentation, or hairstyle database retrieval. Our hair synthesis approach is significantly
more robust and can handle a much wider variation of hairstyles than the state-of-the-art data-driven hair
modeling techniques with challenging inputs, including photos that are low-resolution, overexposed, or
contain extreme head poses (see Fig. 1.5). The storage requirements are minimal, and a 3D hair model
can be produced from an image in a second. Our evaluations also show that successful reconstructions
are possible from highly stylized cartoon images, non-human subjects, and pictures taken from behind a
person. Our approach is particularly well suited for continuous and plausible hair interpolation between
very different hairstyles.
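The architecture summary in Table 3.1 describes an encoder whose final convolution layer is duplicated for the reparameterization trick, together with two structurally identical decoders producing a 1-channel occupancy field and a 3-channel orientation (flow) field with unshared weights. The condensed PyTorch sketch below illustrates that layout; the layer counts, channel widths, and 4-channel input are illustrative assumptions rather than the thesis's exact configuration.

```python
import torch
import torch.nn as nn

class HairVolumeVAE(nn.Module):
    """Condensed sketch of a volumetric VAE with one encoder and two decoders
    (occupancy: 1 channel, flow/orientation: 3 channels, weights not shared).
    Layer counts and widths are illustrative only.
    """
    def __init__(self, latent_channels=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(4, 32, 4, stride=2, padding=1), nn.BatchNorm3d(32), nn.ReLU(),
            nn.Conv3d(32, 64, 4, stride=2, padding=1), nn.BatchNorm3d(64), nn.ReLU(),
        )
        # Final layer duplicated for the reparameterization trick (mean / log-variance).
        self.to_mu = nn.Conv3d(64, latent_channels, 4, stride=2, padding=1)
        self.to_logvar = nn.Conv3d(64, latent_channels, 4, stride=2, padding=1)

        def decoder(out_ch):
            return nn.Sequential(
                nn.ConvTranspose3d(latent_channels, 64, 4, stride=2, padding=1),
                nn.BatchNorm3d(64), nn.ReLU(),
                nn.ConvTranspose3d(64, 32, 4, stride=2, padding=1),
                nn.BatchNorm3d(32), nn.ReLU(),
                nn.ConvTranspose3d(32, out_ch, 4, stride=2, padding=1),
            )
        self.occupancy_decoder = decoder(1)   # occupancy field
        self.flow_decoder = decoder(3)        # 3D orientation (flow) field

    def forward(self, volume):
        h = self.encoder(volume)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return self.occupancy_decoder(z), self.flow_decoder(z), mu, logvar
```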
Figure 1.5: Our method automatically generates 3D hair strands from a variety of single-view inputs. Each
panel from left to right: input image, volumetric representation with color-coded local orientations predicted
by our method, and final synthesized hair strands rendered from two viewing points.
Implicit Shape Representations for Clothed Human Digitization
We introduce a new silhouette-based representation for modeling clothed human bodies using deep genera-
tive models. Our method can reconstruct a complete and textured 3D model of a person wearing clothes
from a single input picture. Inspired by the visual hull algorithm, our implicit representation uses 2D
silhouettes and 3D joints of a body pose to describe the immense shape complexity and variations of clothed
people. Given a segmented 2D silhouette of a person and its inferred 3D joints from the input picture, we
first synthesize consistent silhouettes from novel view points around the subject. The synthesized silhouettes
which are the most consistent with the input segmentation are fed into a deep visual hull algorithm for
robust 3D shape prediction. We then infer the texture of the subject’s back view using the frontal image and
segmentation mask as input to a conditional generative adversarial network. Our experiments demonstrate
that our silhouette-based model is an effective representation, and the appearance of the back view can be
predicted reliably using an image-to-image translation network. While classic methods based on parametric
models often fail for single-view images of subjects with challenging clothing, our approach can still
produce successful results, which are comparable to those obtained from multi-view input (see Fig. 1.6).
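For background, the visual hull that inspires this representation can be computed classically by carving away voxels that project outside any silhouette, given known camera projections. The sketch below shows that baseline, not the learned deep visual hull used here, which instead fuses possibly inconsistent synthesized silhouettes.

```python
import numpy as np

def visual_hull(silhouettes, projections, resolution=64, bound=1.0):
    """Classical visual hull carving: keep a voxel only if it projects inside
    every silhouette. `silhouettes` are binary HxW masks and `projections`
    the corresponding 3x4 camera matrices (assumed known in this sketch).
    """
    axis = np.linspace(-bound, bound, resolution)
    xs, ys, zs = np.meshgrid(axis, axis, axis, indexing="ij")
    points = np.stack([xs, ys, zs, np.ones_like(xs)], axis=-1).reshape(-1, 4)
    occupancy = np.ones(len(points), dtype=bool)

    for mask, P in zip(silhouettes, projections):
        uvw = points @ P.T                               # project voxel centers
        uv = uvw[:, :2] / np.clip(uvw[:, 2:3], 1e-6, None)
        u = np.round(uv[:, 0]).astype(int)
        v = np.round(uv[:, 1]).astype(int)
        h, w = mask.shape
        inside_img = (u >= 0) & (u < w) & (v >= 0) & (v < h)
        inside_sil = np.zeros(len(points), dtype=bool)
        inside_sil[inside_img] = mask[v[inside_img], u[inside_img]] > 0
        occupancy &= inside_sil                          # carve inconsistent voxels

    return occupancy.reshape(resolution, resolution, resolution)
```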
Figure 1.6: Given a single image of a person from the frontal view, we can automatically reconstruct a
complete and textured 3D clothed body shape.
Lastly, we introduce Pixel-aligned Implicit Function (PIFu), an implicit representation that locally aligns
pixels of 2D images with the global context of their corresponding 3D object. Using PIFu, we propose
an end-to-end deep learning method for digitizing highly detailed clothed humans that can infer both 3D
surface and texture from a single image, and optionally, leverage multiple input images. Highly intricate
shapes, such as hairstyles, clothing, as well as their variations and deformations, can be digitized in a
unified way. Compared to existing representations used for 3D deep learning, PIFu produces high-resolution
surfaces including largely unseen regions such as the back of a person. In particular, it is more memory
efficient than the voxel representation, can handle arbitrary topology, and the resulting surface is spatially
aligned with the input image. Furthermore, while previous techniques are designed to process either a
single image or multiple views, PIFu extends naturally to an arbitrary number of views (see Fig. 1.7). We
demonstrate high-resolution and robust reconstructions on real world images from the DeepFashion dataset,
which contains a variety of challenging clothing types. Our method achieves state-of-the-art performance
on a public benchmark, and outperforms the prior work for clothed human digitization from a single image.
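The core operation behind PIFu is a per-point query: a 3D point is projected into the image, the pixel-aligned feature at that location is sampled from a convolutional feature map, and an MLP maps the feature plus the point's depth to an inside/outside probability. The PyTorch sketch below illustrates this query; the feature dimensions, MLP widths, and the assumption of normalized orthographic coordinates are placeholders rather than the exact PIFu architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelAlignedQuery(nn.Module):
    """Sketch of a PIFu-style query: sample image features at the 2D
    projection of each 3D point and predict occupancy with an MLP.
    """
    def __init__(self, feat_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 1, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, feature_map, points):
        # feature_map: (B, C, H, W) from an image encoder.
        # points: (B, N, 3) in camera space, x/y already normalized to [-1, 1].
        xy = points[..., :2].unsqueeze(2)                 # (B, N, 1, 2)
        feats = F.grid_sample(feature_map, xy, align_corners=True)
        feats = feats.squeeze(-1).permute(0, 2, 1)        # (B, N, C), pixel-aligned
        z = points[..., 2:3]                              # depth along the ray
        occupancy_logit = self.mlp(torch.cat([feats, z], dim=-1))
        return torch.sigmoid(occupancy_logit)             # inside/outside probability
```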
Figure 1.7: Pixel-aligned Implicit function (PIFu): We present pixel-aligned implicit function (PIFu), which
allows recovery of high-resolution 3D textured surfaces of clothed humans from a single input image (top
row). Our approach can digitize intricate variations in clothing, such as wrinkled skirts and high-heels, as
well as complex hairstyles. The shape and textures can be fully recovered including unseen regions such as
the back of the subject. PIFu can be also extended to multi-view input images (bottom row).
1.2 Outline
Chapter 2 presents our first contribution to facial geometry and reflectance inference from a single image.
We demonstrate the effectiveness of a template-based shape representation with texture mapping for face
digitization. In Chapter 3, we introduce an effective way to infer 3D hair styles from a single image using a
variational autoencoder with a volumetric hair representation. Chapter 4 introduces memory-efficient
learning frameworks with implicit shape representations for the clothed human body. Finally in Chapter 5,
we conclude this dissertation with closing messages and future research directions.
Chapter 2
Template-based Approach for High-Fidelity Face Digitization
Realistic digital faces are increasingly important in digital media. The capabilities of modern graphics
hardware are perpetually reaching new heights, allowing for the use of effects comparable to those created
using offline, state-of-the-art cinematic special effects systems in real-time, consumer-grade applications
and video games. Meanwhile, the recent surge in augmented and virtual reality (AR/VR) platforms has
created an even stronger demand for high-quality content for virtual environments, with applications ranging
from entertainment to professional concerns, such as telepresence [113, 145, 188]. However, as immersive
virtual experiences are driven by compelling human interaction, the ability to create, animate and render
realistic faces plays a crucial role in achieving engaging face-to-face communication between digital avatars
in simulated environments.
To render a face that appears realistic in an arbitrary virtual environment, high-quality geometry and
reflectance data are required. However, acquiring this data from a real person is currently a time-consuming
and cumbersome process, requiring substantial manual effort, extensive computation, and specialized
capture systems operating in constrained and controlled conditions. While it would ideally be possible for
a novice user to accurately model a subject’s facial shape and reflectance from a single photograph (e.g.
ubiquitous mobile “selfie” images), in practice, significant compromises are made to balance the amount of
input data to be captured, the amount of computation required, and the quality of the final output.
We seek to efficiently create accurate, high-fidelity 3D avatars from a single input image, captured in
an unconstrained environment. These avatars must be close in quality to those created by professional
capture systems, with the appropriate mesoscopic geometry and reflectance attributes, yet require minimal
computation and no special expertise on the part of the photographer. These requirements pose several
significant technical challenges. A single photograph only provides partial data and may be taken under
challenging illumination conditions. Most importantly, skin reflectance is highly complex, and as such the
separation of the surface and subsurface components of the skin has only been achieved in constrained
environments. Furthermore, the acquisition of accurate mesoscopic surface geometry as represented in
displacement maps requires sophisticated capture hardware such as photometric multi-view stereo systems.
Less intrusive methods are based on simplifying assumptions such as the Lambertian reflectance of skin,
and often make use of linear appearance models that can recover low frequency facial appearances such as
the coarse shape and diffuse albedo, but fail for complex lighting conditions and detailed fine-scale facial
textures, such as those containing facial hair, wrinkles, pores, and moles.
Some state-of-the-art techniques infer texture details using a database of high-resolution face textures,
synthesizing them with a patch-based [135] or a neural synthesis approach [167]. However, these approaches
have only been demonstrated on the reflectance aspect of the facial appearance, and thus do not provide the
corresponding fine-scale geometric details needed to produce a realistic 3D rendering of the face in different
views and illumination conditions. Furthermore, while [167] creates a globally consistent diffuse reflectance
map from a partially occluded input texture, it replaces existing high-resolution details in the visible region,
rather than preserving them and only synthesizing consistent details in the missing regions. In addition, it
requires an expensive iterative optimization process, resulting in several minutes of computation time to
produce the final output.
We propose a deep-learning based approach for inferring a high-fidelity set of reflectance and geometric
data (including a diffuse albedo map, a specular albedo map, and medium- and high-frequency displacement
maps representing mesoscopic surface details) from a single unconstrained RGB input image. To achieve
robust and accurate inference in the wild, we train our model with high-resolution facial scans obtained
using a state-of-the-art multi-view photometric facial scanning system [55]. Given the unconstrained 2D
input image, which can be captured under arbitrary illumination and contain partial occlusions of the face,
our process infers these high-resolution and high-fidelity geometric and reflectance maps, which can then
be used to render a compelling and realistic 3D avatar in novel lighting environments, in only seconds.
Using our approach, it is now possible to robustly infer realistic and accurate high-fidelity mesoscopic-
level facial reflectance and geometric details from unconstrained images containing significant occlusions
and arbitrary illumination conditions. The resulting data can be used with the fitted 3D model to render high-
fidelity avatars in different lighting conditions and from arbitrary viewpoints. Furthermore, the resulting
avatars have specific features such as facial hair, moles, and other fine-scale facial details unique to the
captured subject. Once trained, our models can produce this data in only seconds, with quality comparable
to that obtained from much slower and more cumbersome active-illumination capture systems.
We thus present the following contributions:
• A system for obtaining a complete set of geometric and reflectance maps from a single input image.
We demonstrate that the proposed technique outperforms the state-of-the-art in terms of robustness
under challenging conditions, appearance preservation, and the ability to handle large appearance
variations (such as facial hair or specific fine-scale features).
• The demonstration and evaluation of how our approach makes this highly ill-posed problem tractable,
by performing the initial inference and texture completion using separate networks, each trained on
high-fidelity 3D scans obtained using a multi-view photometric facial capture system. We describe
how the architecture, training data and procedure, and data augmentation techniques are carefully
chosen so as to make it possible to train these networks to robustly and accurately infer an arbitrary
subject's facial appearance.
• A multi-resolution, symmetry-aware texture completion and refinement technique designed to handle
the high resolution and complexity of the training data. Our approach maintains a plausible degree of
symmetry in the resulting textures consistent with that seen in human faces, yet is consistent with the
data observed in the visible regions.
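As illustrated later in Fig. 2.5, the symmetry-aware completion flips intermediate convolutional features across the texture's axis of symmetry and concatenates them with the originals, so that the network can borrow plausible detail from the mirrored side of the face. A minimal sketch of that operation (the surrounding completion network is not shown):

```python
import torch
import torch.nn as nn

class SymmetryFlipConcat(nn.Module):
    """Flip feature maps across the texture's axis of symmetry and concatenate
    them with the originals, so later layers can borrow detail from the
    mirrored side of the face. Where this sits in the completion network and
    the subsequent layer widths are not specified in this sketch.
    """
    def forward(self, features):
        # features: (B, C, H, W) feature maps in UV texture space.
        flipped = torch.flip(features, dims=[-1])      # mirror left/right
        return torch.cat([features, flipped], dim=1)   # (B, 2C, H, W)
```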
2.1 Related Work
2.1.1 Facial Reflectance and Geometry Capture.
High-Fidelity Capture Photorealistic facial appearances can be captured by specialized hardware in
controlled environments with camera arrays, e.g. the Light Stage [36, 128, 62, 55]. Though restricted
to studio environments, such techniques have enabled production-level measurement of lighting and
appearance maps, e.g. diffuse albedo, specular maps, bump maps, subsurface scattering, etc., which can be
used to create realistic digital humans [2, 197, 186]. The appearance captured using such techniques can
also be used with videos of the subject performing dynamic expressions to achieve high-fidelity performance
capture [49]. Haro et al. [68] synthesize the full-face skin structures from partial data with a high degree of
accuracy. Cao et al. [22] perform local regression of medium-scale details (e.g. dynamic wrinkles caused
by facial expressions) using captured high-resolution geometry as training data. Once trained, their method
scales well to new users without additional training. Optical acquisition devices and elastomeric sensors
have also been introduced to the capturing pipeline for modeling facial microstructure details [62, 90]
and skin microstructure deformations [137]. Beeler et al. [12, 14] applied shape from shading to emboss
high-frequency skin shading as hallucinated mesoscopic geometric details for skin pores and creases. In
dynamic face capture, fine-scale facial appearance can be recovered using photometric stereo techniques,
e.g. photometric scene flow [61], spherical gradient illumination [212] and polynomial displacement maps
[129]. However, such systems require multiple images from a stereo capture, meaning that they cannot be
applied to legacy content such as unconstrained images and online videos.
Linear Modeling Modeling facial appearance variations as a linear combination of multiple bases has
proven to be a popular and effective method for representing faces. Turk and Pentland [192] present
Eigenfaces for face recognition, which is one of the earliest works to represent facial appearance using
a linear model. The active appearance model (AAM) proposed by Edwards et al. [41] is another widely-
adopted framework that employs a similar concept, in which faces are represented as a linear combination
of both shape and appearance. It has inspired several important works in the domains of image alignment
[163, 130] and appearance retrieval [38]. The seminal work by Blanz and Vetter [16] put forward the
concept of a morphable model for representing 3D textured faces. By leveraging Principal Component
Analysis (PCA), they first transform the shape and texture of example faces into a vector representation and
estimate the coefficients of a linear basis for fitting the model to the input image. This approach is useful
not only for appearance and expression modeling, but also for pose and expression normalization for face
recognition [237]. Extensions of morphable models have been developed by exploiting Internet images
[97, 99] and large-scale facial scans [18]. While computationally efficient, PCA-based models are limited
by the linear space spanned by the training samples, and thus are incapable of capturing fine-scale details or
large variations in facial appearance.
Capturing from Unconstrained Images Inferring local surface details using shape-from-shading is
a well-established technique for unconstrained geometry capture [107, 58, 11], and has been employed
in digitizing human faces [52, 98, 173]. However, the fidelity of the inferred details is limited by the
illumination conditions of the given input images, that are often captured under unconstrained settings.
There has been a substantial effort towards the goal of making facial digitization more accessible.
Monocular systems that record multiple views have been investigated to generate seamless texture maps
for digital avatars [182, 173, 81, 24, 187]. [214] improve the quality and robustness of monocular face
capture by introducing local constraints based on the anatomy of the face so as to better capture details
that are difficult to capture and express using traditional blendshape models. In the case that only a single
image is available, Kemelmacher-Shlizerman and Basri [98] leverage shading information and the closest
existing reference models to estimate both facial geometry and the albedo map. Barron and Malik [11]
utilize a hybrid approach to produce a reasonable estimate of shape, surface normals, reflectance and
illumination under a series of preset priors. [117] provide a comprehensive evaluation of the impact of
several important factors, such as the number of facial landmarks and mesh vertices used, when performing
cascaded regression to reconstruct 3D face shapes from a single RGB image. Li et al.[112] take advantage of
intrinsic image decomposition techniques to decouple the estimation of the specular and diffuse components
of the human face. While the aforementioned techniques succeed in generating high-quality appearance
models, they cannot infer the fine-scale reflectance and geometry in unseen regions. Recently, unsupervised
or weakly supervised learning on facial geometry and reflectance has been proposed using color consistency
[185, 170, 100, 184] or synthetic data [160, 133, 19, 170, 171, 161] as an additional supervisory signal.
2.1.2 Texture Synthesis and Image Completion
Many textures can be synthesized given a small exemplar patch using approaches based on the Markov
Random Field model, as the statistical features of local regions of the texture are quite similar to all others
across the entire image [205]. State-of-the-art texture synthesis techniques use various non-parametric
exemplar-based techniques, such as synthesizing textures by assembling individual pixels [206, 43] or
stitching patches [42, 106, 108] of the exemplar; progressively refining the texture using a global optimiza-
tion [105, 66]; or by computing high-dimensional appearance vectors for each exemplar pixel
and performing synthesis in this space [111]. In general, however, such texture synthesis techniques only
work for stochastic textures, such as micro-scale skin structures [68], and cannot be trivially applied to
medium- or fine-scale facial details, as they are highly structured in addition to exhibiting local consistency.
Li et al. [116] hallucinate high frequency details from low-resolution input using a patch-based Markov
network. However, the results remain blurry and missing regions cannot be inferred. Mohammed et
al. [135] generate novel faces from a random patch by combining both global and local models. Although
the synthesized faces look realistic, noisy artifacts are introduced in high-resolution images. A statistical
model for synthesizing detailed facial geometry has been introduced by Golovinskiy et al. [59], but it has
only been demonstrated in the geometric domain.
2.1.3 Deep Learning Based Image Synthesis
The advent of deep learning and its astonishing success in tasks such as image classification and face
recognition has led to recent efforts to apply these networks to the task of generating images [157, 104].
While early efforts suffered from artifacts such as blurry images, limited resolution and little control over
the synthesized image, recent efforts making use of Generative Adversarial Networks [60] have led to
a substantial increase in the quality of images generated using deep learning techniques, compared to
networks trained using only more conventional loss metrics (such as the L1 or L2 loss on reconstructed
images). In such efforts, a discriminator network is trained in conjunction with the generator, such that
the discriminator learns to distinguish between real images and synthetic images created by the generator.
Using loss values obtained from this discriminator results in more sophisticated criteria by which to judge
the synthesized images, and thus teach the generator to synthesize higher-quality images from a distribution
that more closely reflects the manifold of natural images.
However, GAN-based networks are more difficult to train, and typically fail to generate high-quality
images beyond a very low resolution. While recent progress has been made [95] in synthesizing high-
resolution images using an adversarial framework, it has still proven quite difficult to control precise details
in the synthesized images (such as the expression in an image of a face). In recent work by Isola et al. [84],
however, GAN training has proven to improve the quality of the output and resolution for image-to-image
translation tasks, in which there is a direct correspondence between pixels of the output image and those of
an input image used to guide the synthesis (e.g., when synthesizing cityscape images from a semantic label
map of an image). This work makes use of a conditional GAN framework, in which the discriminator is
provided the per-pixel label map and must determine whether the corresponding image is real or synthesized.
Olszewski et al. [144] employs an architecture derived from this image translation framework [84] to infer
dynamic facial textures from a sequence of images of a subject making a variety of facial expressions for
the purpose of facial performance retargeting, but this work does not recover the individual surface and
subsurface reflectance maps or the underlying mesoscopic geometry. We use an architecture similar to Isola
et al. [84] in our work, as we synthesize the resulting reflectance and geometry texture maps based on a texture
extracted from the input image in the corresponding UV space. However, substantial changes to the
architecture and training process were required to achieve our desired goal.
Neural networks have also been used to infer the reflectance properties of general objects [1]. In the
context of inferring facial appearance, Duong et al. [142, 39] propose a nonlinear replacement of the AAM
which leverages Deep Boltzmann Machines to capture both non-linearity and large variations of shape and
texture. Pathak et al. [151] introduce an encoder-decoder architecture that is conditioned on content for
general image inpainting task. Iizuka et al. [82] further incorporate both a local and a global discriminator
to synthesize high-quality local details that are consistent with global background. A similar approach is
used by [115] for face inpainting to enhance local and global coherency. Yeh et al. [222] iteratively search
the closest embedding of a corrupted facial image in the latent space learned by a deep generative model to
achieve realistic inpainting.
Recently, style transfer techniques using deep neural networks [54, 53] have demonstrated the capacity
to combine the content of an image with a target style while preserving the structure of key visual features
in the content image. Rather than synthesizing images using a forward pass through a network trained
for a specified image synthesis task, these approaches iteratively modify an image passed through a pre-
trained network using the feature activations of this network as guidance for the synthesis process. This
ensures that a subset of these features for the modified image closely match those of a style image (such
as an impressionist painting) while retaining the general content of the initial image. Inspired by the
idea of defining style as mid-layer feature correlations of a neural network [53, 54], Saito et al. [167]
model the facial texture as a convex combination of “style” features extracted from a high-resolution face
database, thereby achieving photorealistic texture inference from a partial view. Hu et al. [76] further
extend this approach to generate a full-head digital avatar from a single image. Though Saito et al. [167]
have achieved photorealistic quality, their inference requires a slow and intensive iterative optimization for
texture synthesis.
Our method, on the other hand, can achieve comparable quality with [167] at a speed that is close to
real time. In addition, our method is capable of inferring a much richer set of texture maps (diffuse albedo,
specular albedo and displacement maps) unlike a significant body of previous techniques that are limited to
diffuse albedo prediction under the assumption of Lambertian surface reflectance.
2.2 High-fidelity Facial Texture and Geometry Inference
2.2.1 Overview
Figure 2.1: System Overview. Given an unconstrained input image (left), the base mesh and corresponding
facial texture map are extracted. The diffuse and specular reflectance, and the mid- and high-frequency
displacement maps are inferred from the visible regions (Sec. 2.2.3). These maps are then completed, refined
to include additional details inferred from the visible regions, and then upsampled using a super-resolution
algorithm (Sec. 2.2.4). The resulting high-resolution reflectance and geometry maps may be used to render
high-fidelity avatars (right).
Our system pipeline is illustrated in Fig. 2.1. Given a single input image captured in unconstrained
conditions, we begin by extracting the base mesh of the face and the corresponding texture map obtained
by projecting the face in the input image onto this mesh. This map is passed through 2 convolutional
neural networks (CNNs) that perform inference to obtain the corresponding reflectance and displacement
maps (Sec. 2.2.3). The first network infers the diffuse albedo map, while the second infers the specular
albedo as well as the mid- and high-frequency displacement maps. However, these maps may contain large
missing regions due to occlusions in the input image. In the next stage, we perform texture completion and
Figure 2.2: Solving the described sub-tasks separately makes the complete texture inference pipeline more
tractable, allowing us to generate highly plausible output. Directly generating a complete texture map from
a partial input with a single network produces significantly inferior results.
refinement to fill these regions with content that is consistent with that found in the visible regions (Sec.
2.2.4). Finally, we perform super-resolution to increase the pixel resolution of the completed texture from
512 512 into 2048 2048. The resulting textures contain natural and high-fidelity details that can be
used with the base mesh to render high-fidelity avatars in novel lighting environments.
To obtain high-quality results, we found that it was essential to divide the inference and completion
process into these smaller objectives so as to make the training process more tractable, as seen in Fig. 2.2. Using
a single network that performs both the texture completion and detail refinement on all of the desired output
data (reflectance and geometry maps) produces significantly worse results than our described approach, in
which the problems are decomposed into separate stages addressed by networks trained for more specific
tasks, and in which the diffuse albedo is generated by a separate network than the one that generates the
remaining output data.
2.2.2 Training Data
Training the networks to infer and complete the geometry and reflectance maps from the projected texture
obtained from an input image requires a substantial corpus of input texture maps with corresponding ground
truth reflectance and geometry maps. This data is captured with seven high-resolution DSLR cameras and a
spherical LED dome, using the polarized gradient spherical illumination technique of [55]. The captured
data includes high-resolution photographs of the subject from multiple views, sub-millimeter accurate facial
geometry with a displacement map and a set of specular and diffuse albedo maps. The diffuse albedo (RGB
channel) and specular albedo (single channel) respectively indicate the view-independent diffuse intensities
and specular intensities with the Fresnel reflection normalized, derived from polarized spherical gradient
illumination as in [128, 55]. Thorough definitions of these terms can be found in [211]. The displacement
map contains the high- and medium-frequency geometric details relative to the base surface mesh, while
the original high-resolution mesh is recovered by embossing the base surface with these displacements.
Dense correspondences for the 3D scans are obtained with a state-of-the-art multi-view dynamic facial
capture method [50]. These texture maps are stored in a consistent UV space such that we can learn the
variation in common skin features shared by different individuals. The displacement maps are separated
into medium- and high-frequency displacements. We found that this separation, which is common for
facial capture [129, 62, 137], is necessary to make the training process for our networks tractable. Training
using the original displacement maps, in which both the medium-frequency displacements (which contain
geometric details in the range of several millimeters) and the high-frequency displacements (which may
be in the sub-millimeter range) are represented in a single map leads to the high-frequency displacements
being regarded as noise that is disregarded during training, and thus is not inferred properly. We separate
these components using a standard Difference of Gaussians operation. The very low-frequency components
of the displacement are first removed by subtracting the result of a 201 × 201 Gaussian filter from the raw
displacements. The medium-frequency displacements are extracted by applying a 17 × 17 Gaussian filter to
the resulting displacement maps. Subtracting these medium-frequency components from the input to this
filter yields the high-frequency displacements.
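As an illustration of this frequency separation, the following is a minimal sketch of the Difference-of-Gaussians described above using OpenCV. The filter sizes (201 × 201 and 17 × 17) come from the text, but the Gaussian standard deviations are not specified, so letting OpenCV derive them from the kernel size is an assumption of this sketch, as is the function name.

```python
import numpy as np
import cv2

def split_displacement(disp: np.ndarray):
    """Split a raw displacement map into medium- and high-frequency bands
    with a Difference of Gaussians, following the filter sizes in the text.
    Sigma values are not specified in the dissertation; here they are derived
    automatically from the kernel size (sigma = 0 in OpenCV), which is an
    assumption of this sketch."""
    disp = disp.astype(np.float32)

    # Remove the very low-frequency component captured by a 201x201 Gaussian.
    low = cv2.GaussianBlur(disp, (201, 201), 0)
    band = disp - low

    # Medium frequencies: a 17x17 Gaussian applied to the remaining band.
    disp_mid = cv2.GaussianBlur(band, (17, 17), 0)

    # High frequencies: what the 17x17 filter removed.
    disp_high = band - disp_mid
    return disp_mid, disp_high

if __name__ == "__main__":
    # Example with a random 512x512 single-channel displacement map.
    mid, high = split_displacement(np.random.rand(512, 512))
    print(mid.shape, high.shape)
```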
Our data set includes both male and female subjects covering a variety of ages and races. The population
ratio of our data is the following: male/female = 1 : 1, Caucasian/Asian/African = 80 : 15 : 5, and
ages 10s/20s/30s/40s/50s/60s = 5 : 40 : 25 : 20 : 5 : 5. It consists of 329 high-resolution facial
scans from 25 subjects performing up to 30 different facial expressions. We increase the data variation
using several data augmentation techniques:
Synthetic lighting augmentation: we augment the variation of the input lighting with synthetic
rendering in order to obtain robust inference in the wild. To do this, we employ the ground truth
facial geometry and reflectance to render the face in multiple natural HDR environments using the
hybrid normal rendering [128] and ambient occlusion technique. To simulate the natural occlusion
seen in unconstrained images, we randomly perturb the head orientation and generate a visibility
mask in UV space indicating which pixels are visible from this viewpoint. This visibility mask is
used both at training and test time. We note that synthetic renderings have been shown to reduce the
amount of training data required for and improve the robustness of learned subject-specific priors for
facial expression capture [133]. In this work we demonstrate that similar techniques can be used to
improve the quality of appearance capture results attainable using a tractable amount of high-quality
ground-truth geometry and reflectance data.
Skin diffuse albedo augmentation: we employ the Chicago Face Database (CFD) [125], which
contains photographs of subjects from a wide variety of races, to increase the variety of skin tones
in our dataset. We sample a number of subjects from the missing races from the CFD database
and transfer the overall skin tone to the subjects in our dataset. This process is performed during
training such that the distribution of skin tones in the diffuse albedo textures is similar to the skin
color distribution found in the CFD. We find that this makes our approach more robust to the variety
of skin tones seen in the unconstrained images used in our evaluations, particularly the darker skin
tones that are underrepresented in our captured dataset.
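The dissertation does not specify how the CFD skin tones are transferred to our subjects' diffuse albedo maps; as one illustrative possibility, the sketch below matches per-channel statistics in Lab space over the skin region, in the spirit of Reinhard-style color transfer. The function and argument names are hypothetical, and 8-bit BGR inputs are assumed.

```python
import numpy as np
import cv2

def transfer_skin_tone(albedo_bgr, reference_bgr, skin_mask):
    """Shift the overall skin tone of a diffuse albedo map toward a reference
    photograph by matching per-channel mean/std in Lab space over skin pixels.
    This statistic matching is only one plausible way to implement the tone
    transfer; for simplicity the reference statistics are taken over the whole
    reference photograph."""
    src = cv2.cvtColor(albedo_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    ref = cv2.cvtColor(reference_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    mask = skin_mask.astype(bool)

    out = src.copy()
    for c in range(3):
        s_mu, s_std = src[..., c][mask].mean(), src[..., c][mask].std() + 1e-6
        r_mu, r_std = ref[..., c].mean(), ref[..., c].std()
        # Re-center and re-scale the skin pixels toward the reference statistics.
        out[..., c][mask] = (src[..., c][mask] - s_mu) / s_std * r_std + r_mu

    out = np.clip(out, 0, 255).astype(np.uint8)
    return cv2.cvtColor(out, cv2.COLOR_LAB2BGR)
```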
2.2.3 Reflectance and Geometry Inference
We first adopt a pixel-wise optimization algorithm [187, 76] to obtain the base facial geometry, head
orientation, and camera parameters. Using this data, we can project the face in the input image into a
Figure 2.3: Reflectance inference pipeline. The texture extracted from the input image and the corresponding
visibility mask are passed through a U-net encoder-decoder framework to produce a diffuse reflectance map.
Another network takes the same input and produces the specular reflectance and mid- and high-frequency
displacement maps. These networks are trained using a combination of L1 and adversarial loss (D), as well
as feature matching loss (FM), using features extracted from the discriminator.
texture map in the UV space used in our pipeline. The non-skin region is removed in image space using
a state-of-the-art semantic segmentation [231] technique fine-tuned on the facial segmentation dataset
provided by [166]. Once the input RGB texture is extracted, it may be used in the reflectance and geometry
inference networks (Fig. 2.3) to obtain the corresponding diffuse and specular reflectance maps and the
mid- and high-frequency displacement maps.
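As a rough sketch of this texture-extraction step, the snippet below projects per-texel 3D positions of the fitted base mesh into the input image with a pinhole camera and samples the visible colors. The per-texel positions and normals, the OpenCV-style camera convention (K, R, t), and nearest-neighbor sampling are all assumptions of this sketch; self-occlusion handling and the semantic skin segmentation are omitted.

```python
import numpy as np

def sample_uv_texture(image, texel_positions, texel_normals, K, R, t):
    """Project each UV texel's 3D position on the fitted base mesh into the
    input image and sample its color, marking texels that face away from the
    camera (or fall outside the image) as invisible."""
    h, w = image.shape[:2]
    res = texel_positions.shape[:2]                     # e.g. (512, 512)

    pts = texel_positions.reshape(-1, 3) @ R.T + t      # world -> camera space
    nrm = texel_normals.reshape(-1, 3) @ R.T
    uvw = pts @ K.T
    xy = uvw[:, :2] / np.clip(uvw[:, 2:3], 1e-6, None)  # perspective divide

    # Visible if the texel faces the camera, lies in front of it, and projects
    # inside the image bounds.
    visible = (nrm[:, 2] < 0) & (uvw[:, 2] > 0) \
              & (xy[:, 0] >= 0) & (xy[:, 0] < w - 1) \
              & (xy[:, 1] >= 0) & (xy[:, 1] < h - 1)

    xi = np.clip(np.round(xy[:, 0]), 0, w - 1).astype(int)
    yi = np.clip(np.round(xy[:, 1]), 0, h - 1).astype(int)

    texture = np.zeros((*res, 3), dtype=image.dtype)
    texture.reshape(-1, 3)[visible] = image[yi[visible], xi[visible]]
    return texture, visible.reshape(res).astype(np.float32)
```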
For this task, we employ a U-net architecture with skip connections similar to [84]. Such an architecture
is well-suited to our task, as the skip connections between layers of the encoder and decoder modules allow
for the easy preservation of the overall structure of the input image in the output image, thereby avoiding the
artifacts and limited resolutions found in more typical encoder-decoder networks. This allows the network
to use more of its overall capacity to learn the appropriate transformation from the provided input to the
desired output. As we perform inference in UV space, there is a direct correspondence between each pixel
in the input RGB texture map and those in the inferred reflectance and geometry maps. However, we found
that using this network architecture and training process is insufficient to obtain reasonable results for our
task. To make this problem more tractable, we introduce several significant modifications, described below,
to increase the resulting image quality and stabilize the training process:
Training Loss. During training, the L1 and GAN discriminator losses are computed only within the
aforementioned visibility mask. This allows the network to focus on inferring details from only the regions
that will be used in the final output. We also add a feature matching loss term using the features obtained
from the discriminator, and use unconditional GAN loss, following recent efforts [236]. We found that
these modifications lead to better overall visual quality in the generated output. Note that, as we employ
high-fidelity rendering, including ambient occlusion and subsurface scattering with hybrid normal rendering
[128] in our training data, it is non-trivial to obtain a differentiable composition on-the-fly to compute
reconstruction loss, unlike [174, 185].
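The sketch below illustrates one way the masked L1, unconditional adversarial, and feature-matching terms described above could be combined; the default weights follow the values given later in Sec. 2.2.5 for the inference networks. The discriminator interface (returning a score and a list of intermediate features), the non-saturating BCE form of the adversarial loss, and masking by multiplication are assumptions of this sketch rather than details stated in the text.

```python
import torch
import torch.nn.functional as F

def masked_generator_loss(pred, target, mask, disc,
                          w_l1=10.0, w_adv=1.0, w_fm=0.005):
    """Generator objective sketch: L1 restricted to the visible region, an
    unconditional adversarial term, and a feature-matching term over
    discriminator activations. `disc` is assumed to return
    (score, [intermediate features])."""
    # L1 reconstruction only where the texture is actually observed.
    l1 = (mask * (pred - target)).abs().sum() / (mask.sum() + 1e-6)

    # Unconditional adversarial loss (non-saturating BCE form; one of several options).
    fake_score, fake_feats = disc(pred * mask)
    adv = F.binary_cross_entropy_with_logits(fake_score,
                                             torch.ones_like(fake_score))

    # Feature matching between real and fake discriminator activations.
    with torch.no_grad():
        _, real_feats = disc(target * mask)
    fm = sum(F.l1_loss(f, r) for f, r in zip(fake_feats, real_feats))

    return w_l1 * l1 + w_adv * adv + w_fm * fm
```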
Dual Networks. We use two networks with identical architectures, one operating on the diffuse albedo
map (subsurface component), and the other on the tensor obtained by concatenating the specular albedo
map with the mid- and high-frequency displacement maps (collectively surface components). We observed
that concatenating all the data into a single input tensor leads to poor overall performance. This is because
surface and subsurface components capture different optical features of the skin, and the conflicting
features interfere with one another and cause the network to fail to robustly recover each component. On
the other hand, separating each component and inferring them in isolation causes instability in the training
that interferes with the network’s ability to recover the high-frequency displacement. We found that this
separation of the diffuse component from the others results in the best overall performance, which is
reasonable given that the specular reflection has a significant correlation with fine-scale details in the surface
geometry (e.g., [128, 55] use specular analysis to recover such geometric details).
Network Architecture. To improve the accuracy of the high-frequency details, we change the stride size
from 2 to 1 in the first and last convolution layers. We also add additional convolutional layers to the U-net
such that the spatial dimension of the deepest layer is 1 × 1, which leads to a better encoding of the global
context.
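The sketch below illustrates only the encoder-side schedule implied by these modifications for a 512 × 512 input: a stride-1 first convolution followed by stride-2 convolutions until the bottleneck is 1 × 1. Channel widths, kernel sizes, and activations are assumptions of the sketch, and the decoder (with its stride-1 final convolution and skip connections) is omitted.

```python
import torch
import torch.nn as nn

def build_encoder(in_ch=3, base_ch=64, image_size=512):
    """Illustrative encoder schedule for the modified U-net: the first
    convolution uses stride 1 (no downsampling at the input), then stride-2
    convolutions are stacked until the spatial size reaches 1x1."""
    layers = [nn.Conv2d(in_ch, base_ch, 3, stride=1, padding=1),
              nn.LeakyReLU(0.2)]
    ch, size = base_ch, image_size
    while size > 1:
        out_ch = min(ch * 2, 512)
        layers += [nn.Conv2d(ch, out_ch, 4, stride=2, padding=1),
                   nn.LeakyReLU(0.2)]
        ch, size = out_ch, size // 2
    return nn.Sequential(*layers)

enc = build_encoder()
z = enc(torch.zeros(1, 3, 512, 512))
print(z.shape)  # torch.Size([1, 512, 1, 1]) -- the 1x1 bottleneck
```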
Figure 2.4: Our texture completion pipeline. The inferred texture and visibility mask are downsampled by
a factor of 4 and completed. The resulting low-resolution texture is upsampled to the original resolution
and blended with the input texture, then passed through a network that refines the texture to add subtle
yet crucial details. Finally, a super-resolution algorithm is applied to generate high-fidelity 2048 × 2048
textures.
2.2.4 Symmetry-Aware Texture Completion and Refinement
As the inferred reflectance and geometry maps often contain large missing regions due to occlusions caused
by various factors (e.g., hair and non-frontal viewpoints), this inference is followed by another stage in
which these missing regions are completed (Fig. 2.4). As with the inference stage, we find that the best
results are obtained by training one network pipeline to complete the diffuse albedo and another to complete
the other components (specular albedo, mid- and high-level displacement). However, we observe that
completing large areas at a high resolution still does not converge to a reasonable result due to the high
complexity of the learning objective. Furthermore, state-of-the-art inpainting methods work very poorly in
our scenario, in which the missing region can be quite large in the case of extreme occlusions. These regions
must be completed in a manner that results in natural, globally coherent facial textures free of distracting
artifacts. In such cases, the convolutional layers of these networks cannot extract meaningful features within
their receptive fields.
Thus, we propose to stabilize the training and improve the resulting quality by dividing the inpainting
problem into simpler sub-problems. The 512 × 512 resolution input textures are first resized to 128 × 128
and texture completion is performed by a network to obtain complete low-resolution textures. Second, we
perform bilinear upsampling by a factor of 4 on the completed textures and blend each with the visible
region in the corresponding input texture. This process is followed by detail refinement, in which these
Figure 2.5: Feature flipping in the latent space. The intermediate features obtained from the convolutional
layers of the network are flipped across the V-axis and concatenated to the original features. This process
allows the texture completion process to exploit the natural near-symmetry in human faces to infer texture
maps that contain local variations but are nearly symmetric.
completed textures are processed to create globally coherent textures with the same level of high-fidelity
details. These networks make use of the same architecture as those used for reflectance and geometry
inference (though the low-resolution completion network is modified to account for the 128 × 128 resolution
input and output).
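A minimal sketch of the downsample, complete, upsample, and blend steps described above is given below. Here completion_net stands in for the trained low-resolution completion network and is a placeholder, and the choice of interpolation modes is an assumption of the sketch.

```python
import torch
import torch.nn.functional as F

def complete_low_res_then_blend(texture, mask, completion_net):
    """Complete a 4x-downsampled texture, upsample the result bilinearly, and
    keep the original pixels wherever the input was visible (mask == 1)."""
    # Work at 1/4 resolution (512 -> 128), where completion is tractable.
    tex_lr = F.interpolate(texture, scale_factor=0.25, mode='bilinear',
                           align_corners=False)
    mask_lr = F.interpolate(mask, scale_factor=0.25, mode='nearest')
    completed_lr = completion_net(tex_lr, mask_lr)

    # Back to full resolution, then composite with the observed pixels.
    completed = F.interpolate(completed_lr, scale_factor=4, mode='bilinear',
                              align_corners=False)
    blended = mask * texture + (1.0 - mask) * completed
    return blended  # passed on to the detail-refinement network
```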
Furthermore, we leverage the spatial symmetry of UV parameterization and maximize the feature
coverage by flipping intermediate features over the V-axis in UV space and concatenating them with the original
features (Fig. 2.5). This technique allows the network to use the context provided by visible regions of the
face to complete missing parts of the corresponding region on the opposite side, such as when the left half
of the face is largely occluded due to a non-frontal viewpoint. We demonstrate that this feature flipping
results in completed textures that do not display an uncanny degree of near-perfect symmetry, but rather
contain a natural degree of symmetry as is seen in real faces. We found that this technique provided superior
results to common methods for expanding the receptive field of convolution layers, such as making use
of dilated convolutions or global pooling layers. Finally, the resulting 512 × 512 resolution textures are
upsampled to 2048 × 2048 using a state-of-the-art super-resolution algorithm [110].
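A minimal sketch of the feature-flipping operation is given below: the intermediate feature map is mirrored and concatenated along the channel dimension. Which tensor axis corresponds to the flip over the V-axis depends on the UV layout; the width axis is assumed here.

```python
import torch
import torch.nn as nn

class FlipConcat(nn.Module):
    """Latent-space feature flipping (Fig. 2.5): mirror intermediate features
    across the axis corresponding to the face's near-symmetry in UV space and
    concatenate them with the originals, doubling the channel count."""
    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        flipped = torch.flip(feat, dims=[-1])     # mirror along the width axis
        return torch.cat([feat, flipped], dim=1)  # concatenate along channels

# Example with a (batch, channels, height, width) feature map.
x = torch.randn(2, 256, 16, 16)
print(FlipConcat()(x).shape)  # torch.Size([2, 512, 16, 16])
```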
2.2.5 Implementation Details
We train each network using the Adam optimizer [102] with a learning rate set to 0.0002. In addition to the
aforementioned data augmentation techniques, we perform random flipping of the input images across the
V-axis to further increase the training dataset size. All training was performed on an NVIDIA GTX 1080 Ti
graphics card. To train the texture completion networks, we use an occlusion mask from the input image or
a random rectangular mask. We generate a mask at a random point in the image with an area ranging from
0.25 × W × H to 0.5 × W × H. For the inference network, we set the weights of the L1, discriminator,
and feature matching losses to 10, 1, and 0.005, respectively. For the completion network, the weights are
set to 10, 1, and 0.2. For the refinement network, the weights are set to 20, 1, and 0.05.
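For the random rectangular masks mentioned above (with an area between 0.25 × W × H and 0.5 × W × H), a minimal sketch follows. The aspect-ratio distribution and placement scheme are not specified in the text, so they are assumptions of this sketch.

```python
import numpy as np

def random_rect_mask(h, w, min_frac=0.25, max_frac=0.5, rng=None):
    """Sample a random rectangular occlusion mask whose area lies between
    min_frac*W*H and max_frac*W*H, placed at a random location.
    Returns 1 for occluded pixels and 0 elsewhere."""
    rng = rng or np.random.default_rng()
    area = rng.uniform(min_frac, max_frac) * h * w
    aspect = rng.uniform(0.5, 2.0)                  # assumed aspect-ratio range
    rh = int(np.clip(np.sqrt(area * aspect), 1, h))
    rw = int(np.clip(area / rh, 1, w))
    top = rng.integers(0, h - rh + 1)
    left = rng.integers(0, w - rw + 1)
    mask = np.zeros((h, w), dtype=np.float32)
    mask[top:top + rh, left:left + rw] = 1.0
    return mask

m = random_rect_mask(512, 512)
print(m.mean())  # fraction of occluded pixels, roughly within [0.25, 0.5]
```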
We use three separate discriminators, one for each of the output maps, to train the network that infers the
specular albedo and displacements. While this results in increased memory usage and computation when
training this network, we found that superior results were obtained compared to using one discriminator that
decides whether the combined output maps are real or fake. For the completion and refinement networks,
we found a single discriminator operating on the entire output tensor to be sufficient.
In addition to the specified input, each network also accepts the visibility mask extracted from the initial
3D mesh fitting, as seen in Fig. 2.4. This allows them to better distinguish between the regions on which
they must focus their capacity (such as the visible region for the initial reflectance and geometry inference,
or the occluded region that must be completed for the texture completion network). We only compute
and backpropagate loss for the visible region in the initial inference network, as the other regions will be
completed and refined by the subsequent networks. We found that superior results were obtained from the
refinement network when the adversarial and feature matching loss were backpropagated only from the
occluded regions, while the L1 loss is backpropagated from the entire image. This allows the network to
focus its capacity on refining the incomplete regions, which are only filled with the low-resolution output
of the completion network, while maintaining the overall quality of the visible regions of the inferred
reflectance and geometry maps. For the texture completion network, all losses are computed for the entire
input image.
The reflectance and geometry inference networks are trained for 60,000 iterations (requiring approx. 12
hours and 6 GB of GPU memory). The texture completion networks are each trained for 60,000 iterations
using the masked ground truth images, and another 30,000 iterations to fine-tune the network using the
output of the initial inference networks (approx. 6 hours, 1.5 GB GPU memory). The detail refinement
network is likewise trained for 60,000 iterations and fine-tuned for another 30,000 iterations using the output
of the trained texture completion networks (approx. 12 hours, 4.5 GB GPU memory). The super-resolution
network is trained for 1000 epochs using our training data as ground truth.
2.2.6 Results
Figure 2.6: Our rendering pipeline.
Our final rendering is produced with a layered surface and subsurface skin material, as shown in Fig.
2.6, using brute-force path tracing and the image-based lighting in Solid Angle’s Arnold renderer [178]. The
output displacement is applied on the base surface mesh to simulate the mesoscopic geometric deviations
        diffuse   specular   disp
PSNR    22.42     17.96      23.89
SSIM    0.81      0.44       0.73
Table 2.1: Quantitative evaluation. We measure the peak signal-to-noise ratio (PSNR) and the structural
similarity (SSIM) of the inferred images for 100 test images compared to the ground truth. The inferred
displacement value is computed using the output medium- and high-frequency displacement maps to recover
the overall displacement.
through geometry tessellation. The resulting specular albedo and diffuse albedo are respectively used to
influence the intensities of the specular BRDF and deep scattering components of the skin, as is done in the
open source Digital Human project [186].
In Fig. 2.7, we show several results obtained using unconstrained input images from the CelebA dataset
[121], with the input images, the corresponding inferred textures, and sample renderings using the inferred
data. Despite the widely varying subject appearance, lighting conditions, facial expressions, and view
angles with occlusions, the results demonstrate that the system is able to infer the data needed to render
compelling and high-fidelity avatars of these subjects.
We provide additional results with pictures from the Chicago Face Dataset [125] to further demonstrate
the fidelity of our reflectance and geometry inference for a wide range of ethnicities. In Fig. 2.8 on the left
side, we visualize the input (upper left), inferred diffuse albedo (lower left), specular albedo (lower right),
and meso-scale geometry (upper right). On the right, we provide a 3D rendering under a novel lighting
environment. As seen in the figure, our method captures person-specific skin tones, dynamic wrinkles (e.g.,
first column, third subject), and distinguishing features such as stubble hair (e.g., first column, sixth subject),
while illumination-dependent shading in the input image is successfully removed in the resulting reflectance
maps, making it possible to insert the digital faces into arbitrary virtual environments.
Evaluation We quantitatively measure the ability of our system to faithfully recover the reflectance
and geometry data from a set of 100 test images for which we have the corresponding ground-truth
measurements. The results are seen in Table 2.1. We see that the system is able to recover the diffuse albedo
and overall displacement quite well, though the higher complexity of the specular albedo results in a larger
Figure 2.7: Inference in the wild. The first column contains the input image and the corresponding inferred
output applied to the base mesh. The second and third columns contain new renderings of the avatar under
novel lighting conditions (the lighting environments we use are inset in the top left example renderings).
difference from the ground truth. However, our qualitative evaluations demonstrate that the inferred data is
still sufficient for rendering compelling and high-quality avatars.
Figure 2.8: Additional results with images from the Chicago Face Dataset [125]
Fig. 2.10 highlights the importance of each output from the proposed network. While rendering with
the diffuse reflectance layer only (second column) provides person-specific skin tones, it lacks the visual
cues for the overall surface geometry provided by the specular reflection layer (third and fifth column).
Figure 2.9: Zoom-in results showing synthesized mesoscopic details (input image, inferred diffuse albedo, specular albedo, and geometry, each with a zoomed-in crop).
Using the specular albedo map in addition to the diffuse albedo map (third column) provides local specular
variations in the surface reflection, but omits the subtle surface indentations provided by the displacement
map (fourth column). Using all the maps provides the most realistic result (fifth column).
Figs. 2.11 and 2.12 demonstrate that we obtain comparable results using a single image that is captured
from different viewpoints and different lighting conditions, respectively. As can be seen in Fig. 2.11, missing
regions such as the side of the cheek are plausibly completed with our texture completion method,
with appropriate natural symmetry, and are consistent with the rest of the skin.
colors, the amount of specularity, and the contrast in the images due to shadowing, the reconstructed textures
display a consistent skin quality matching the subject’s identity (Fig. 2.12). These results demonstrate that
our approach can recover plausible and consistent output despite large variations in the input images, such
as vastly differing viewpoints or extreme changes to the lighting environment.
Fig. 2.13 shows how our method performs under different facial expressions. While distinct, person-
specific details such as freckles are maintained across the expressions, the shading introduced due to the
expression around the nasolabial fold and the compressed forehead is successfully removed in the resulting
Figure 2.10: Ablation study demonstrating the importance of each of the outputs of the proposed network.
The second column shows the rendering of the diffuse reflectance using the diffuse albedo map, which lacks
the surface geometry cues provided by the specular reflection. With the specular reflection added (third
and fifth column), the view-dependent nature of the surface reflection enhances the sense of the object’s
3D shape. The specular albedo map simulates the specular occlusion and provides local variations in the
specular intensities (third column). Adding the displacement map provides more lifelike skin textures, but
the rendering looks too shiny without the specular albedo map (fourth column). Combining all the maps
provides the best rendering result (fifth column).
reflectance layer. With the captured dynamic geometry, the final rendering exhibits a similar magnitude of
the dynamic folds through simulated shading.
Comparison In Fig. 2.14, we compare our approach to [82]. Severe occlusions resulting in large missing
regions in the input texture cause their method to fail to faithfully recover the entire diffuse albedo map.
Our method, in contrast, is able to infer plausible and coherent data to fill the missing regions, resulting in a
Figure 2.11: Consistency of the output obtained using input images of the same subject from different
viewpoints.
much more natural albedo map that is suitable for rendering a digital avatar. We provide results using our
described approach both with and without the aforementioned feature flipping strategy, demonstrating the
importance of this technique in producing output images that are both complete and natural.
In Fig. 2.15, we compare our approach to several alternatives on a variety of input subjects captured
under different conditions. We show the results obtained by simply reconstructing the captured texture
using the PCA coefficients obtained from the 3D face fitting process [187] used to extract raw texture that
is provided as input to our system; the results obtained using [135]; and the result of applying [167]. We
show both the entire recovered diffuse texture as well as a close-up of a region of the texture. This clearly
demonstrates our approach’s ability to faithfully recover fine-scale details corresponding to the input image,
resulting in more coherent and plausible facial textures than these alternative approaches. Figs. 2.16, 2.17
and 2.18 provide additional comparisons with the results obtained using our approach and those obtained
Figure 2.12: Consistency of the output obtained using input images of the same subject captured under
different lighting conditions.
using several recently developed facial capture techniques. As seen in the figures, our method produces
significantly better skin texture (Fig. 2.16), sharp details (Fig. 2.17), and preserves distinct, person-specific
details such as freckles (Fig. 2.18).
We also provide quantitative comparisons of the fidelity of our diffuse albedo inference with that
obtained using these techniques. As seen in Table 2.2, our method produces albedo maps that are closer to
the ground truth than any of these alternatives (see Fig. 2.19 for a qualitative comparison). In Fig. 2.20, we
show rendering results using the data inferred with our approach, and compare with renderings generated
using the high-fidelity data acquired directly from the multi-view stereo system used to generate our training
data. The renderings with our inferred reflectance data applied to the ground truth base mesh from the Light
Stage suggest that our method can capture all the reflectance data necessary to render a high-fidelity avatar.
The last column shows the result using the base mesh obtained with our method. The final rendering of our
single-view technique indicates comparable quality to that obtained with a Light Stage capture device.
Figure 2.13: Evaluation on expressions.
Figure 2.14: Comparison with [82] and our network, both with and without the feature flipping layer.
method    PSNR      RMSE
[187]     17.6354   0.1369
[167]     15.6308   0.1767
[135]     18.34     0.1271
ours      19.333    0.1102
Table 2.2: Quantitative comparison of our diffuse albedo inference with several alternative methods,
measured using the PSNR and the root-mean-square error (RMSE).
stage diffuse specular, disp
inference 8 ms 8 ms
completion 6 ms 6 ms
refinement 3 ms 3 ms
super-resolution 300 ms 300 ms
Table 2.3: Runtime performance for each component of our system.
Performance Table 2.3 shows the runtime performance of each stage of our pipeline.
Figure 2.15: Comparison with PCA, Visio-lization [135], and a state-of-the-art diffuse albedo inference
method [167].
2.2.7 Future Work
Despite these findings, our approach has several limitations. While it is able to quickly infer high-fidelity
details given a sufficiently high-resolution input image, it cannot infer these details if the input image is
of very low quality or resolution, unlike the more computationally intensive transfer-based technique of
[167]. Furthermore, while it can recover details such as facial stubble, which can be represented as fine
Figure 2.16: Comparison of diffuse albedo inference with a data-driven intrinsic decomposition method
[112] (produced by original authors).
Figure 2.17: Comparison of diffuse albedo inference with an unsupervised face alignment method [185], in
which skin textures are represented by a linear basis.
details in the reflectance and geometry maps, it cannot recover other larger variations in facial appearances,
such as very dense and long facial hair. Furthermore, other features that do not correspond to semantic
features of the human face, such as glasses, cannot be recovered and may interfere with the fitting process
used to recover the base mesh and corresponding texture map from the input image. As our ability to
recover the input facial texture is limited by our ability to recover the base mesh and camera parameters
using a photometric-consistency optimization, very challenging conditions in the input images, such as
Figure 2.18: Comparison of diffuse albedo inference with an unsupervised intrinsic decomposition method
[174].
Figure 2.19: Comparison with PCA, Visio-lization [135], and a state-of-the-art diffuse albedo inference
method [167] using Light Stage ground truth data.
Figure 2.20: Ground truth comparison using Light Stage (LS) data.
extreme lighting conditions or largely non-frontal viewpoints, may cause failures in this stage. In addition,
strong dynamic expressions can introduce transient wrinkles that may lead to inconsistent reflectance and
geometry maps for a given subject compared to those that would be obtained using an image with a more
neutral facial expression. (Figure 2.21 contains example output produced under some of the aforementioned
conditions). Very specific and unique features, such as scars, will not be recovered as accurately as when
using a more cumbersome and computationally intensive approach relying on multi-view stereo capture of
each new subject.
In addition to addressing the aforementioned limitations, we believe that there are many avenues of
future work in the domain of high-quality facial capture in unconstrained scenarios that could build upon
our approach and make use of our high-quality facial scan database. We plan to expand our database to
cover dynamic facial details, such as those caused by strong facial expressions. Extending our approach to
recover dynamic fine-scale facial details from multiple input images, such as those taken in a short video
or a sequence of still images, is another promising area of exploration. This would allow for the recovery
of additional details when some of the input images suffer from issues such as low resolution or extreme
occlusions. It may also allow for a more accurate reconstruction of the base mesh, thereby allowing for
even more accurate renderings of digital avatars using the inferred textures.
Figure 2.21: Limitations. Our method produces artifacts in the presence of strong shadows (lower right)
and non-skin objects due to segmentation failures (upper left). Also volumetric beards are not faithfully
reconstructed (upper right). Strong dynamic wrinkles (lower left) may cause artifacts in the inferred
displacement maps.
2.2.8 Discussion
We have demonstrated the feasibility of inferring high-resolution reflectance and geometry maps using a
single unconstrained image of the captured subject. Not only are these maps high-fidelity and sufficient
for rendering compelling and realistic avatars, but they contain the fine details essential for preserving the
likeness of the captured subject (such as pores, moles, and facial hair). This is possible in large part due
to our use of high-quality ground truth 3D scans and the corresponding input images. This allows for the
training of networks specially designed for the inference, texture completion and detail refinement tasks
necessary to generate the data for rendering these avatars. By decomposing this problem into smaller tasks
that are addressed using specific variations of the network architecture and training procedure, we are able
to obtain high-resolution textures containing all the data needed to render characters with reflectance and
fine-scale geometry matching the target subject. This output is comparable in quality to that obtained by
[167], but is obtained in only a fraction of the time (several seconds rather than several minutes). Unlike the
aforementioned approach, the output includes all the mesoscopic geometric and illumination-independent
reflectance data required to produce realistic renderings under novel lighting conditions. Furthermore, our
approach maintains high-resolution details in the reflectance of the input image, rather than changing the
entire image to match the statistics of those in our training database, but still produces globally coherent
textures. To render realistic faces, the inferred textures should not have perfect symmetry, which would
result in uncanny renderings, but need to have local variations comparable to those seen in real faces. Our
technique of flipping and concatenating convolutional features encoded in the latent space of our model
allows us to perform texture completion in a manner that respects the natural degree of symmetry seen in
the human face.
Chapter 3
Volumetric Representation for 3D Hair Digitization
The 3D acquisition of human hair has become an active research area in computer graphics in order to
make the creation of digital humans more efficient, automated, and cost effective. High-end hair capture
techniques based on specialized hardware [148, 87, 71, 13, 124, 40, 218] can already produce high-quality
3D hair models, but can only operate in well-controlled studio environments. More consumer-friendly
techniques, such as those that only require a single input image [74, 28, 29, 76], are becoming increasingly
popular and important as they can facilitate the mass adoption of new 3D avatar-driven applications,
including personalized gaming and communication in VR [113, 145, 188]. Existing single-view hair modeling
methods all rely on a large database containing hundreds of 3D hairstyles, which is used as a shape prior for
further refinement and to handle the complex variations of possible hairstyles.
This paradigm comes with several fundamental limitations: (1) the large storage footprints of the hair
model database prohibit their deployment on resource-constrained platforms such as mobile devices; (2)
the search steps are usually slow and difficult to scale as the database grows to handle increasingly varied
hairstyles; (3) these techniques also rely on well-conditioned input photographs and are susceptible to the
slightest failures during the image pre-processing and analysis step, such as failed face detection, incorrect
head pose fitting, or poor hair segmentation. Furthermore, these data-driven algorithms are based on
hand-crafted descriptors and do not generalize well beyond their designed usage scenarios. They often fail
in practical scenarios, such as those with occluded face/hair, poor resolution, degraded quality, or artistically
stylized input.
To address the above challenges, we propose an end-to-end single-view 3D hair synthesis approach using
a deep generative model to represent the continuous space of hairstyles. We implicitly model the continuous
space of hairstyles using a compact generative model so that plausible hairstyles can be effectively sampled
and interpolated, and hence, eliminate the need for a comprehensive database. We also enable end-to-end
training and 3D hairstyle inference from a single input image by learning deep features from a large set of
unconstrained images.
To effectively model the space of hairstyles, we introduce the use of volumetric occupancy and flow
fields to represent 3D hairstyles for our generative hair modeling framework. We present a variant of
volumetric variational autoencoder (VAE) [103] to learn the mapping from a compact latent space to the
space of hairstyles represented by a volumetric representation of a large database of hairstyles [74].
To achieve end-to-end 3D hair inference, we train an additional hair embedding neural network to
predict the code in the learned VAE latent space from input images. Instead of predicting the latent code
directly, we perform Principal Component Analysis (PCA) on the latent space to obtain an embedding
subspace, and predict into this subspace to achieve better generalization performance. In addition, we apply Iterative
Error Feedback (IEF) [26] to our embedding network to further facilitate generalization.
We include an ablation study of different algorithmic components to validate our proposed architecture
(Section 3.3). We show that our method can synthesize faithful 3D hairstyles from a wide range of input
images with various occlusions, degraded image quality, extreme lighting conditions, uncommon hairstyles,
and significant artistic abstraction (see Fig 1.5 and Section 3.3.1). We also compare our technique to the
latest algorithm for single-view 3D hair modeling [29] and show that our approach is significantly more
robust on challenging input photos. Using our learned generative model, we further demonstrate that
plausible hairstyles can be interpolated effectively between drastically different ones, while the current
state-of-the-art method [210] fails.
Our main contributions are:
The first end-to-end framework for synthesis of 3D hairstyles from a single input image without
requiring face detection or hair segmentation. Our approach can handle a wider range of
hairstyles and is significantly more robust for challenging input images than existing data-driven
techniques.
A variational autoencoder using a volumetric occupancy and flow field representation. The corre-
sponding latent space is compact and models the wide range of possible hairstyles continuously.
Plausible hairstyles can be sampled and interpolated effectively using this VAE-based generative
model, and converted into a strand-based hair representation.
A hair embedding network with robust generalization performance using PCA embedding and an
iterative error feedback technique.
3.1 Related Work
The creation of high-quality 3D hair models is one of the most time consuming tasks when modeling CG
characters. Despite the availability of various design tools [101, 34, 213, 46, 224, 223] and commercial
solutions such as XGen, Ornatrix and HairFarm, production of a single 3D hair model for a hero character
can take hours or even days for professional character artists. A detailed discussion of seminal hair modeling
techniques can be found in Ward et al. [203].
Multi-View Hair Capture Hair digitization techniques have been introduced in attempts to reduce
and eliminate the laborious and manual effort of 3D hair modeling. Most high-end 3D hair capture
systems [147, 148, 87, 71, 13, 124, 40, 218] maximize the coverage of hair during acquisition and are
performed under controlled lighting conditions. The multi-view stereo technique of Luo et al. [124] shows
that, for the first time, highly complex real-world hairstyles can be convincingly reconstructed in 3D by
discovering locally coherent wisp structures. Hu et al. [73] later proposes a data-driven variant using
pre-simulated hair strands, which eliminates the generation of physically implausible hair strands. Their
follow-up work [75] solves the problem of capturing constrained hairstyles such as braids using procedurally
generated braid structures. In this work they used an RGB-D sensor (Kinect) that is swept around the
subjects instead of a collection of calibrated cameras. [229] recently proposes a generalized four-view
image-based hair modeling method that does not require all views to be from the same hairstyle, which
allows the creation of new hair models. These multi-view capture systems are not easily accessible to
end-users, as they often require expensive hardware equipment, controlled capture settings, and professional
manual clean-up.
Single-View Hair Modeling With the availability of internet pictures and the ease of taking selfies, single-
view hair modeling solutions are becoming increasingly important within the context of consumer-friendly
3D avatar digitization. Single-view hair modeling techniques were first introduced by Chai et al. [31, 30] for
portrait manipulation purposes. These early geometric optimization methods are designed for reconstructing
front-facing subjects and have difficulty approximating the geometry of the back of the hair.
Hu et al. [74] proposes a data-driven method to produce entire hairstyles from a single input photograph
and some user interactions. Their method assembles different hairstyles from a 3D hairstyle database
developed for the purpose of shape reconstruction. Chai et al. [29] later presents a fully automated variant
using an augmented 3D hairstyle database and a deep convolutional neural network to segment hair regions.
Hu et al. [76] further improves the retrieval performance by introducing a deep learning-based hair attribute
classifier, that increases the robustness for challenging input images from which local orientation fields are
difficult to extract. However, these data-driven methods rely on the quality and diversity of the database,
as well as a successful pre-processing and analysis of the input image. In particular, if a 3D hair model
with identifiable likeness is not available in the database, the reconstructed hair model is likely to fail.
Furthermore, handcrafted descriptors become difficult to optimize as the diversity or number of hair models
increases. Recently, Zhou et al. [234] presents a method for single-view hair modeling by directly inferring
3D strands from a 2D orientation field of segmented hair region.
Shape Space Embedding Embedding a high-dimensional shape space into a compact subspace has
been widely investigated for the shape modeling of human bodies [6, 122] and faces [16]. Since different
subjects are anatomically compatible, it is relatively easy to embed them in a continuous low-dimensional
subspace. However, it is not straightforward to apply these embedding techniques to hairstyles due to their
complex volumetric and topological structures, and the difficulty of annotating correspondences between
hairstyles.
3D Deep Learning The recent success of deep neural networks for tasks such as classification and
regression can be explained in part by their effectiveness in converting data into a high-dimensional feature
representation. Because convolutional neural networks are designed to process images, 3D shapes are
often converted into regular grid representations to enable convolutions. Multi-view CNNs [180, 155]
render 3D point clouds or meshes into depth maps and then apply 2D convolutions to them. Volumetric
CNNs [217, 131, 225, 155] apply 3D convolutions directly on the voxels, which are converted from a
3D mesh or point cloud. PointNet [154, 156] presents a unified architecture that can directly take point
clouds as input. Brock et al. [20] applies 3D CNNs to a variational autoencoder [103] in order to embed
3D volumetric objects into a compact subspace. These methods are limited to very low resolutions (e.g.,
32 × 32 × 32) and focus on man-made shapes, while our goal is to encode high-resolution (128 × 192 × 128)
3D orientation fields as well as volumes of hairstyles. Recently, Jackson et al. [85] proposed to infer 3D
face shape in the image space via direct volumetric regression from single-view input. While we are also
embedding a volumetric representation, our hairstyle representation uses a 3D direction field in addition
to an occupancy grid. Furthermore, we learn this embedding in a canonical space with fixed head size
and position, which allows us to handle cropped images, as well as head models in arbitrary positions and
orientations.
Figure 3.1: Our pipeline overview. Our volumetric VAE consists of an encoder and two decoders (blue
blocks) with blue arrows representing the related dataflow. Our hair embedding network (orange blocks)
follows the red arrows to synthesize hair strands from an input image.
3.2 Single-View 3D Hair Inference
3.2.1 Overview
In this section, we describe the entire pipeline of our algorithm for single-view 3D hair modeling (Figure 3.1).
We first explain our hair data representation using volumetric occupancy and flow fields (Section 3.2.2).
Using a dataset of more than two thousand different 3D hairstyles, we train a volumetric variational
autoencoder to obtain a compact latent space, which encodes the immense space of plausible 3D hairstyles
(Section 3.2.3). To enable end-to-end single-view 3D hairstyle modeling, we train an additional embedding
network to help predict the volumetric representation from an input image (Section 3.2.4). Finally, we
synthesize hair strands by growing them from the scalp of a head model based on the predicted volume. If a
face can be detected or manually fitted from the input image, we can optionally refine the output strands to
better match the single-view input (Section 3.2.5).
3.2.2 Hair Data Representation
Our hair data representation is constrained by two factors. First, the data representation itself needs to be
easily handled by neural networks for our training and inference algorithms. Second, our representation
should be compatible with the traditional strand-based representation for high-fidelity modeling and rendering.

Figure 3.2: Volumetric hairstyle representation. From left to right: original 3D hairstyle represented as
strands; our representation using occupancy and flow fields defined on regular grids, with the visualization
of the occupancy field boundary as a mesh surface and the local flow values encoded as surface color;
regrown strands from our representation.
To achieve these two goals, we adopt a similar concept used in previous approaches [147, 208, 148, 200]
and convert hair strands into a representation with two components, i.e., a 3D occupancy field and the
corresponding flow field, both defined on uniformly sampled grids of resolution 128 × 192 × 128. We use
a larger resolution along the y-axis (vertical direction) to better accommodate longer hairstyles.
Specifically, given a hairstyle of 3D strands, we generate an occupancy field O using the outer surface
extraction method proposed in Hu et al. [76]. Each grid cell of O has a value of 1 if the grid center is inside
the hair volume and is set to 0 otherwise. We also generate a 3D flow field F from the 3D hair strands. We
first compute the local 3D orientation for those grid cells inside the hair volume by averaging the orientations
of nearby strands [200]. Then we smoothly diffuse the flow field into the entire volume as proposed by
Paris et al. [147]. Conversely, given an occupancy field O and the corresponding flow field F, we can easily
regenerate 3D strands by growing them from hair roots on a fixed scalp. The hair strands are grown following
the local orientation of the flow field F until hitting the surface boundary defined by the volume O. See
Figure 3.2 for some concrete examples.
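To make the strand regeneration step concrete, the following is a minimal Python sketch (not the exact implementation used in this work) of growing a single strand by repeatedly stepping along the flow field until it leaves the occupancy volume; the step size, the nearest-neighbor voxel lookup, and the function name are illustrative assumptions.

```python
import numpy as np

def grow_strand(root, occupancy, flow, voxel_size=1.0, step=0.5, max_steps=1000):
    """Grow one hair strand from a root point by following the local flow
    direction until the strand exits the hair volume defined by `occupancy`.

    occupancy: (X, Y, Z) binary array, 1 inside the hair volume.
    flow:      (X, Y, Z, 3) array of local orientation vectors.
    """
    strand = [np.asarray(root, dtype=np.float64)]
    for _ in range(max_steps):
        p = strand[-1]
        # Nearest-neighbor lookup of the enclosing voxel (trilinear
        # interpolation would be smoother; this keeps the sketch short).
        idx = np.round(p / voxel_size).astype(int)
        if np.any(idx < 0) or np.any(idx >= occupancy.shape):
            break
        if occupancy[tuple(idx)] < 0.5:      # left the hair volume
            break
        direction = flow[tuple(idx)]
        norm = np.linalg.norm(direction)
        if norm < 1e-6:                      # degenerate flow, stop growing
            break
        strand.append(p + step * direction / norm)
    return np.stack(strand)
```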
3.2.3 Volumetric Variational Autoencoder
Variational Autoencoder Our approach is based on the variational autoencoder (VAE), which has emerged
as one of the most popular generative models in recent years [103, 159, 183]. A typical VAE consists of
an encoder E_θ(x) and a decoder D_φ(z). The encoder E_θ encodes an input x into a latent code z, and the
decoder D_φ generates an output x′ from the latent code z. The parameters θ and φ of the encoder and
the decoder can be jointly trained so that the reconstruction error between x and x′ is minimized. While
a vanilla autoencoder [15] uses a deterministic function for the encoder E_θ(x), a variational autoencoder
(VAE) [103] approximates E_θ(x) as a posterior distribution q(z|x), allowing us to generate new data x′
by sampling z from a prior distribution. We train the encoding and decoding parameters θ and φ using the
stochastic gradient variational Bayes (SGVB) algorithm [103] as follows:

    θ*, φ* = argmin_{θ,φ}  E_{z∼E_θ(x)} [ −log p_φ(x | z) ] + D_kl( E_θ(x) ‖ p(z) ),        (3.1)

where D_kl denotes the Kullback-Leibler divergence. Assuming a multivariate Gaussian distribution
E_θ(x) ∼ N(z_μ, diag(z_σ)) as the posterior and a standard isotropic Gaussian prior p(z) ∼ N(0, I), the
Kullback-Leibler divergence D_kl is formulated as

    D_kl( E_θ(x) ‖ N(0, I) ) = −(1/2) Σ_i ( 1 + 2 log z_{σ,i} − z_{μ,i}² − z_{σ,i}² ),        (3.2)

where z_μ and z_σ are the multidimensional outputs of E_θ(x), representing the mean and standard deviation
respectively, and D_kl is computed as a summation over all the channels of z_μ and z_σ. To make all the
operations differentiable for backpropagation, the random variable z is sampled from the distribution E_θ(x)
via the reparameterization trick [103] as below:

    z = z_μ + ε ⊙ z_σ,    ε ∼ N(0, I),        (3.3)

where ⊙ is an element-wise multiplication operator.
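As a concrete reference, here is a minimal PyTorch-style sketch of the reparameterization trick in Eq. 3.3 and the Gaussian KL term in Eq. 3.2; it parameterizes the encoder output as a log-variance, which is equivalent to z_σ up to a change of variables, and the tensor shape (batch, 64, 4, 6, 4) is an assumption matching our latent space.

```python
import torch

def reparameterize(z_mu, z_logvar):
    # z = z_mu + eps * z_sigma, with eps ~ N(0, I)  (Eq. 3.3).
    # Predicting log-variance instead of sigma keeps the std strictly positive.
    std = torch.exp(0.5 * z_logvar)
    eps = torch.randn_like(std)
    return z_mu + eps * std

def kl_divergence(z_mu, z_logvar):
    # D_kl(N(mu, sigma^2) || N(0, I)) summed over all latent channels (Eq. 3.2),
    # then averaged over the batch. Assumed shape: (batch, 64, 4, 6, 4).
    kl = -0.5 * (1.0 + z_logvar - z_mu.pow(2) - z_logvar.exp())
    return kl.sum(dim=[1, 2, 3, 4]).mean()
```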
Hairstyle Dataset To train the encoding and decoding parameters of a VAE using our volumetric repre-
sentation, we first collect 816 portrait images of various hairstyles and use a state-of-the-art single-view hair
modeling method [74] to reconstruct 3D hair strands as our dataset. For each portrait image, we manually
draw 1 to 4 strokes to model the global structure; the local strand shapes are then refined automatically.
By adding the 343 hairstyles from the USC-HairSalon dataset [74], we have collected 1159 different 3D
hairstyles in total. We further augment the data by flipping each hairstyle horizontally and obtain a dataset
of 2318 different hairstyles. The hair geometry is normalized and aligned by fitting to a fixed head model.
We randomly split the entire dataset into a training set of 2164 hairstyles and a test set of 154 hairstyles.
VAE Architecture From the volumetric representation of the collected 3D hairstyle dataset, we train an
encoder-decoder network to obtain a compact model for the space of 3D hairstyles. The architecture of
our VAE model is shown in Table 3.1. The encoder concatenates the occupancy field O and flow field
F together as a volumetric input of resolution 128 × 192 × 128 (see Section 3.2.2) and encodes the input
into a volumetric latent space z_μ and z_σ of resolution 4 × 6 × 4. Each voxel in the latent space has a
feature vector of dimension 64. Then we sample a latent code z ∈ R^{4×6×4×64} from z_μ and z_σ using the
reparameterization trick [103]. The latent code z is used as input for two decoders. One of them generates a
scalar field as a level-set representing the hair volume while the other is used to generate the 3D flow field.
Net    Type         Kernel  Stride  Output
enc.   conv.        4       2       64 × 96 × 64 × 4
enc.   conv.        4       2       32 × 48 × 32 × 8
enc.   conv.        4       2       16 × 24 × 16 × 16
enc.   conv.        4       2       8 × 12 × 8 × 32
enc.   conv.        4       2       4 × 6 × 4 × 64
dec.   transconv.   4       2       8 × 12 × 8 × 32
dec.   transconv.   4       2       16 × 24 × 16 × 16
dec.   transconv.   4       2       32 × 48 × 32 × 8
dec.   transconv.   4       2       64 × 96 × 64 × 4
dec.   transconv.   4       2       128 × 192 × 128 × {1, 3}
Table 3.1: Our volumetric VAE architecture. The last convolution layer in the encoder is duplicated for
z_μ and z_σ for the reparameterization trick. The decoders for the occupancy field and the orientation field share the same
architecture except for the last channel size (1 and 3, respectively). The weights of the decoders are not shared.
All the convolutional layers are followed by batch normalization and ReLU activation except the last layer
in both the encoder and the decoder.
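For illustration, a minimal PyTorch sketch of the encoder/decoder structure in Table 3.1 is given below; the helper names are illustrative, and the duplicated final encoder convolution for z_σ is omitted for brevity (only the z_μ branch is shown).

```python
import torch.nn as nn

def conv_block(cin, cout):
    # 3D convolution with kernel 4, stride 2, padding 1, followed by
    # batch normalization and ReLU, as used throughout the encoder.
    return nn.Sequential(nn.Conv3d(cin, cout, 4, stride=2, padding=1),
                         nn.BatchNorm3d(cout), nn.ReLU(inplace=True))

def deconv_block(cin, cout):
    # Transposed-convolution mirror of conv_block for the decoders.
    return nn.Sequential(nn.ConvTranspose3d(cin, cout, 4, stride=2, padding=1),
                         nn.BatchNorm3d(cout), nn.ReLU(inplace=True))

# Encoder: (4, 128, 192, 128) -> (64, 4, 6, 4); the last layer has no BN/ReLU
# and is duplicated in practice for the z_mu and z_sigma branches.
encoder = nn.Sequential(conv_block(4, 4), conv_block(4, 8), conv_block(8, 16),
                        conv_block(16, 32), nn.Conv3d(32, 64, 4, 2, 1))

# Occupancy decoder: (64, 4, 6, 4) -> (1, 128, 192, 128), sigmoid output.
occ_decoder = nn.Sequential(deconv_block(64, 32), deconv_block(32, 16),
                            deconv_block(16, 8), deconv_block(8, 4),
                            nn.ConvTranspose3d(4, 1, 4, 2, 1), nn.Sigmoid())

# Flow decoder: same structure with 3 output channels and tanh activation.
flow_decoder = nn.Sequential(deconv_block(64, 32), deconv_block(32, 16),
                             deconv_block(16, 8), deconv_block(8, 4),
                             nn.ConvTranspose3d(4, 3, 4, 2, 1), nn.Tanh())
```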
Loss Function Our loss function to train the network weights consists of reconstruction errors for the
occupancy field and the flow field, as well as a KL-divergence loss [103]. We use the Binary Cross-Entropy (BCE)
loss for the reconstruction of occupancy fields. The standard BCE loss is

    L_BCE = −(1/|V|) Σ_{i∈V} [ O_i log Ô_i + (1 − O_i) log(1 − Ô_i) ],

where V denotes the uniformly sampled grid, |V| is the total number of grid cells, O_i ∈ {0, 1} is the
ground-truth occupancy field value at a voxel v_i, and Ô_i is the value predicted by the network, which lies in the
range [0, 1]. Brock et al. [20] modify the BCE loss by setting the range of the target value to {−1, 2} to
prevent the gradient vanishing problem:

    L′_BCE = −(1/|V|) Σ_{i∈V} [ γ O_i log Ô_i + (1 − γ)(1 − O_i) log(1 − Ô_i) ],

where γ is a relative weight that penalizes false negatives more heavily [20]. Although the modified loss function
above improves the overall reconstruction accuracy, the details around the hair volume boundary from
a typical encoder-decoder network are usually over-smoothed, which may change the hairstyle structure
unnaturally (see Figure 3.3). To address this issue, we introduce a boundary-aware weighting scheme by
changing the loss function into:

    L_vol = −(1 / (n(λ − 1) + |V|)) Σ_{i∈V} w_i [ γ O_i log Ô_i + (1 − γ)(1 − O_i) log(1 − Ô_i) ],        (3.4)

    w_i = λ   if v_i ∈ N(B_t),
    w_i = 1   otherwise,

where w_i takes a constant weight λ larger than 1 when the voxel v_i belongs to the one-ring neighborhood N(B_t)
of any boundary voxel B_t inside the ground-truth hair volume, and n is the number of voxels in {v_i ∈ N(B_t)}.
For the 3D orientation field, we use an L1 loss because the L2 loss is known to produce over-smoothed
prediction results [84]:

    L_flow = ( Σ_{i∈V} O_i ‖ f_i − f̂_i ‖₁ ) / ( Σ_{i∈V} O_i ),        (3.5)

where f_i and f̂_i are the ground-truth and predicted flow vectors at voxel v_i, respectively. Our KL-divergence
loss is defined as:

    L_kl = D_kl( q(z | O, F) ‖ N(0, I) ),        (3.6)

where q is the Gaussian posterior E_θ(O, F). Then our total loss becomes

    L = L_vol + w_flow · L_flow + w_kl · L_kl,        (3.7)

where w_flow and w_kl are relative weights for the orientation field reconstruction loss and the KL-divergence
loss, respectively.
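A minimal PyTorch sketch of the boundary-aware occupancy loss in Eq. 3.4 and the masked L1 flow loss in Eq. 3.5 is shown below; the precomputed boundary-neighborhood mask, the tensor shapes, and the epsilon for numerical stability are assumptions.

```python
import torch

def boundary_aware_bce(occ_pred, occ_gt, boundary_mask, gamma=0.97, lam=50.0):
    """Eq. 3.4: modified BCE with extra weight `lam` on voxels adjacent to the
    hair volume boundary. `boundary_mask` is 1 for voxels in N(B_t); shapes are
    assumed to be (1, 1, X, Y, Z)."""
    eps = 1e-7  # assumed small constant for numerical stability
    bce = gamma * occ_gt * torch.log(occ_pred + eps) \
        + (1.0 - gamma) * (1.0 - occ_gt) * torch.log(1.0 - occ_pred + eps)
    w = 1.0 + (lam - 1.0) * boundary_mask       # lam near the boundary, 1 elsewhere
    n = boundary_mask.sum()                     # number of boundary-neighborhood voxels
    return -(w * bce).sum() / (n * (lam - 1.0) + occ_gt.numel())

def flow_l1(flow_pred, flow_gt, occ_gt):
    """Eq. 3.5: L1 flow error averaged over voxels inside the hair volume.
    flow tensors: (1, 3, X, Y, Z); occ_gt: (1, 1, X, Y, Z)."""
    diff = (flow_pred - flow_gt).abs().sum(dim=1, keepdim=True)  # per-voxel L1
    return (occ_gt * diff).sum() / occ_gt.sum().clamp(min=1.0)
```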
Implementation Details. For both the encoder and decoder networks, we use a kernel size of 4, a stride of
2, and a padding of 1 for all the convolution operations. All the convolutional layers are followed by batch
normalization and ReLU activation except the last layer in both networks. We use sigmoid and tanh as the
nonlinear activations for the occupancy and flow fields, respectively. The training parameters are fixed to
γ = 0.97, λ = 50, w_flow = 1.0, and w_kl = 2 × 10⁻⁵ based on cross validation. We minimize the loss
function for 400 epochs using the Adam solver [102]. We use a batch size of 4 and a learning rate of 1 × 10⁻³.
3.2.4 Hair Embedding Network
To achieve end-to-end single-view 3D hair synthesis, we train an embedding network to predict the hair
latent code z in the latent space from input images. We use the collected dataset of portrait photos and the
corresponding 3D hairstyles as training data (see Section 3.2.3).
Since our training data is limited, it is desirable to reduce the number of unknowns to be predicted
for more robust training of the embedding. We assume that the latent space of 3D hairstyles can be well
approximated in a low-rank linear space. Based on this assumption, we compute the PCA embedding of the
volumetric latent space and use the 512-dimensional PCA coefficients y as a compact feature representation
of the feasible space of 3D hairstyles. Then the goal of the embedding task is to match the predicted hair
coefficients ŷ to the ground-truth coefficients y by minimizing the following L2 loss:

    L_y = ‖ y − ŷ ‖².        (3.8)

Note that we use z_μ instead of the stochastically sampled latent code z ∼ N(z_μ, z_σ) to eliminate randomness
in the embedding process. Our hair embedding pipeline is shown in Figure 3.1 (bottom part).
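As an illustration of this embedding step, the following sketch reduces the flattened volumetric latent means to 512-dimensional hair coefficients with PCA; the use of scikit-learn here is an illustrative choice rather than the exact implementation.

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_pca(z_mu_train, n_components=512):
    # z_mu_train: (N, 64*4*6*4) flattened latent means of the N training hairstyles.
    pca = PCA(n_components=n_components)
    pca.fit(z_mu_train)
    return pca

def to_coefficients(pca, z_mu):
    # Latent code -> 512-D hair coefficients y.
    return pca.transform(z_mu.reshape(1, -1))[0]

def to_latent(pca, y):
    # Hair coefficients y -> reconstructed latent code, to be fed to the decoders.
    return pca.inverse_transform(y.reshape(1, -1))[0]
```

The same fitted PCA basis is reused at inference time to map predicted coefficients back to the volumetric latent space before decoding.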
Training Loss   IOU      Precision  Recall   L2 (flow)
Ours (VAE)      0.8243   0.8888     0.9191   0.2118
Ours (AE)       0.8135   0.8832     0.9116   0.2403
[20]            0.6879   0.8249     0.8056   0.2308
Vanilla VAE     0.5977   0.7672     0.7302   0.2341
Table 3.2: Evaluation of training loss functions in terms of reconstruction accuracy for the occupancy field
(IOU, precision, and recall) and the flow field (L2 loss). We evaluate the effectiveness of our proposed loss
function by comparing it with (1) our loss function without the KL-divergence loss term, denoted as “Ours
(AE)”, (2) a state-of-the-art volumetric generative model using a VAE [20], and (3) a vanilla VAE [103].
We use a ResNet-50 model [70] pretrained on ImageNet [37] and fine-tune the model as an image
encoder. We apply average pooling to the last convolution layer and take the output vector as an image
feature vector I ∈ R^2048. Then we apply the process of Iterative Error Feedback (IEF) [26, 92] to train our
hair embedding network. The embedding network P takes the image feature vector I together with the
current hair coefficients ŷ_t as input and predicts the updated coefficients ŷ_{t+1} as below:

    ŷ_{t+1} = ŷ_t + P(I, ŷ_t).        (3.9)

IEF is known to have better generalization performance compared to direct embedding in a single shot,
which usually overfits the ground-truth training data (see Section 3.3 for some evaluations). We run three
iterations of IEF since no further performance improvement is observed afterwards.
Our hair embedding network consists of two 1024-dimensional fully connected layers with ReLU and
dropout layers in-between, followed by an output layer with 512 neurons. The learning rate is set to 10⁻⁵
and 10⁻⁴ for the image encoder and the hair embedding network, respectively. We train the network for
1000 epochs using the Adam solver [102] on our collected hairstyle dataset (Section 3.2.3), with a batch
size of 16. To make our embedding network more robust against input
variations, we augment our image dataset by applying different random image manipulations, including
Gaussian noise (with standard deviation 0.15), Gaussian blur (with standard deviation 0.15), rotation (maximum
20 degrees), scaling (within the range of [0.5, 1.5]), occlusion (maximum 40% with random color [166]),
and color jittering (brightness, contrast, hue, and saturation).
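A minimal PyTorch sketch of the embedding MLP and the iterative error feedback update of Eq. 3.9 is given below; initializing ŷ_0 to the mean coefficients (zeros in PCA space) and the exact placement of the dropout layers are assumptions.

```python
import torch
import torch.nn as nn

class HairEmbedding(nn.Module):
    """MLP P(I, y_t): image feature (2048-D) + current coefficients (512-D)
    -> coefficient update (512-D)."""
    def __init__(self, feat_dim=2048, coeff_dim=512, hidden=1024, p_drop=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + coeff_dim, hidden), nn.ReLU(inplace=True),
            nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Dropout(p_drop),
            nn.Linear(hidden, coeff_dim))

    def forward(self, image_feat, y_t):
        return self.net(torch.cat([image_feat, y_t], dim=1))

def predict_coefficients(model, image_feat, n_iters=3):
    # Eq. 3.9: y_{t+1} = y_t + P(I, y_t), starting from zero (mean) coefficients.
    y = torch.zeros(image_feat.shape[0], 512, device=image_feat.device)
    for _ in range(n_iters):
        y = y + model(image_feat, y)
    return y
```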
Discussions about PCA Embedding. Unlike most VAE-based approaches, which reshape the encoding
result into one long vector and apply fully connected layers to further reduce the dimension, we use a volumetric
latent space to preserve the spatial dimensions. We argue that reshaping into a one-dimensional vector would
limit the expressiveness of the network significantly because it is difficult to fully cover local hair variations
in the training process. See Section 3.3 for the results of our ablation study. Zhang et al. [230] show that
PCA is the optimal solution for low-rank approximation in the linear case. We have also experimented with a
non-linear embedding using a multilayer perceptron (MLP) [165]. Although an MLP should be more general
than PCA with its nonlinear layers, we have found that using an MLP overfits our training data and leads to
larger generalization errors on the test set.
3.2.5 Post-Processing
After we predict the hair volume with local orientations using our hair embedding and decoding networks,
we can synthesize hair strands by growing them from the scalp, following the orientation inside the hair
volume. Since we represent the 3D hair geometry in a normalized model space, the synthesized strands may
not align with the head pose in the input image. If the head pose is available (e.g., via manual fitting or
face landmark detection) and the segmentation/orientation can be estimated reliably from the input image,
we can optionally apply several post-processing steps to further improve the modeling results, following
some prior methods with similar data representation [74, 29, 76]. Starting from the input image, we first
segment the pixel-level hair mask and digitize the head model [76]. Then we run a spatial deformation
step as proposed in Hu et al. [76] to fit our hairstyle to the personalized head model. Next we apply the
mask-based deformation method [29] to improve alignment with the hair segmentation mask. Finally, we
adopt the 2D orientation deformation and the depth estimation method, from Hu et al. [74], to match the
local details of synthesized strands to the 2D orientation map from the input image.
Embedding           IOU     Precision  Recall  L2 (flow)
PCA                 0.8127  0.8797     0.9143  0.2170
Single-vector VAE   0.6278  0.7214     0.7907  0.2223
Non-Linear          0.6639  0.7784     0.8186  0.2637
Table 3.3: Evaluation of different embedding methods. First row: our linear PCA embedding. Second row:
a single-vector VAE (in contrast to our volumetric VAE). Third row: a non-linear embedding with fully
connected layers and ReLU activations. The dimension of the latent space is 512 for all three methods.
Method              IOU     Precision  Recall  L2 (flow)
IEF                 0.6487  0.8187     0.7565  0.1879
Direct prediction   0.6346  0.8063     0.7374  0.2080
End-to-end          0.4914  0.5301     0.8630  0.3844
Table 3.4: Evaluation of prediction methods. We compare our embedding method based on Iterative Error
Feedback (IEF) [26] with direct prediction of hair coefficients in a single shot and with end-to-end training,
where the network directly predicts the volumetric representation given an input image.
3.3 Evaluation
In this section, we evaluate the design options of several algorithm components by comparing our method
with alternative approaches.
Loss Functions We run an ablation study on the proposed loss function in Eqn. 3.4 by comparing it with
three alternatives: a non-variational autoencoder with our reconstruction loss function, a state-of-the-art
volumetric generative model using a VAE [20], and a vanilla VAE [103]. We refer to the vanilla VAE as a VAE
trained using the naive Binary Cross-Entropy loss for occupancy fields, with the rest remaining the same. For
a fair comparison, we use the same architecture and the same parameters, with the exception of the loss function
used for training. Table 3.2 shows that our proposed loss function achieves the highest intersection over
union (IOU), the highest precision and recall for the reconstruction of the occupancy field, and the smallest error for
the reconstruction of the flow field. Fig. 3.3 demonstrates that our reconstruction results match the ground-truth
data closely, whereas the alternative approaches lead to over-smoothed output. Although we use the same
flow term defined in Eqn. 3.5 for all the comparisons in Table 3.2, our training scheme achieves superior
reconstruction accuracy for the flow field as well.
Figure 3.3: Comparison of different training schemes. From left to right, we show (1) the original strands,
(2) the ground-truth volumetric representation, and the reconstruction results using (3) a vanilla VAE [103],
(4) a volumetric VAE [20], (5) our proposed VAE, (6) our VAE with non-linear embedding, and (7) our VAE
with PCA embedding.
PCA Embedding We compare our PCA embedding to a non-linear embedding with fully connected
layers, which is commonly used for convolutional variational training [20]. For a fully connected VAE
with a latent space of resolution 4 × 6 × 4 × 64, the output of our encoder is reshaped into a long vector
and is passed to a multilayer perceptron (MLP). The dimensions of the MLP layers are 1024, 512,
1024, and 6144, respectively. Each layer is followed by batch normalization and ReLU activation, except
the layer with 512 neurons used for variational training. The output of the MLP is connected to the first layer
of the decoder by reshaping it back into 4 × 6 × 4 × 64. Table 3.3 shows that the PCA embedding achieves
significantly better reconstruction accuracy compared to a non-linear embedding VAE with fully connected
layers and ReLU activations (the second row in Table 3.3). We also compare our linear PCA embedding
to a non-linear embedding using the same MLP architecture as above. The MLP is trained to obtain a low-
dimensional embedding by minimizing the L2 reconstruction loss of the volumetric latent variables from
the proposed volumetric VAE on the dataset used in the PCA embedding. Due to our limited amount of
training data, we have observed poor generalization on the test set (the third row in Table 3.3). Additionally,
compared to our VAE model (the first row in Table 3.2), our PCA embedding (the first row in Table 3.3) has
led to very little increase in reconstruction errors. This observation validates our low-rank assumption of
the hair latent space.
Figure 3.4: Modeling results of 3D hairstyles from single input images. From left to right, we show the input
image, occupancy field with color-coded local orientations predicted by our single-view hair modeling
pipeline, as well as the synthesized output strands. None of these input images has been used for training of
our embedding network.
Embedding Method We compare our embedding network using IEF with two alternative approaches.
The first method directly predicts parameters in a single shot [145, 190], and the second one, end-to-end
training, directly predicts the volumetric representation given an input image. The numbers in Table 3.4
show that compared to direct prediction, our IEF based embedding network achieves better performance in
terms of IOU, precision and recall for occupancy field, as well as lower L2 error for prediction of flow field.
Moreover, end-to-end training has substantially worse reconstruction accuracy for both occupancy field and
flow field. These comparison results show that our two-step approach improves the stability of the training
process by separately learning the generative model and the embedding network.
3.3.1 Results
Single-View Hair Modeling We show single-view 3D hairstyle modeling results from a variety of input
images in Figures 1.5 and 3.4. For each image, we show the predicted occupancy field with color-coded
local orientation as well as synthesized strands with manually specified color. Note that none of these test
images are used to train our hair embedding network. Our method is end-to-end and does not require any
user interactions such as manually fitting a head model and drawing guiding strokes. Moreover, several
input images in Figure 3.4 are particularly challenging, because they are either over-exposed (the third row),
have low contrast between the hair and the background (the fourth row), have low resolution (the fifth row
and the sixth row), or are illustrated in a cartoon style (the last two rows). Although our training dataset
for the hair embedding network only consists of examples modeled from normal headshot photographs
without any extreme cases (e.g. poorly illuminated images or pictures of dogs), our method generalizes
very well due to the robustness of deep image features. A typical face detector will fail to detect a human
face from the third, the fifth, and the sixth input images in Figure 3.4, which will prevent the existing automatic
hair modeling method [76] from generating any meaningful results. In Figure 3.4, only the first image can
be handled by the system proposed by Chai et al. [29], since their algorithm requires both successful face
detection and high-quality hair segmentation.
Figure 3.5: Comparisons between our method with AutoHair [29]. From left to right, we show the input
image, the result from AutoHair, the volumetric output of our V AE network, and our final strands. We
achieve comparable results on input images of typical hairstyles (a)-(f) and (l), and can generate results closer
to the modeling target on more challenging examples (g)-(k). The inset images of (g) and (h) show the
intermediate segmentation masks generated by AutoHair.
Figure 3.6: Comparisons between our method with the state-of-the-art avatar digitization method [76]
using the same input images.
In Figure 3.5, we compare our method to a state-of-the-art automatic single-view hair modeling
technique [29] on a variety of input images. Our results are comparable to those by Chai et al. [29] on
the less challenging inputs of typical hairstyles (Figure 3.5(a)-(f) and (l)). For the more challenging cases
(Figure 3.5(g) - (k)), we can generate more faithful modeling output, since the method of Chai et al. [29]
relies on accurate hair segmentation which can be difficult to achieve with partial occlusions or less typical
hairstyles.
We also compare our method with another recent automatic avatar digitization method [76] in Figure 3.6.
Their hair attribute classifier can successfully identify the long hairstyle for the first image, but fails to
retrieve a proper hairstyle from the database because the hair segmentation is not accurate enough. For the
second input image in Figure 3.6, their method generates a less faithful result because the classifier cannot
correctly identify the target hairstyle as “with fringe”.
In all of our results, we have only applied the post-processing step (Section 3.2.5) to the top-left example
in Figure 1.5, the first one in Figure 3.4 and all those in Figure 3.5. All the other results are generated by
growing strands directly from the fields predicted by our network.
Hair Interpolation Our compact representation of the latent space for 3D hairstyles can be easily applied to
hair interpolation. Given multiple input hairstyles and normalized interpolation weights, we first
compute the corresponding hair coefficients of PCA embedding in the latent space for each hairstyle. Then
we obtain the interpolated PCA coefficients for the output by averaging the coefficients of input hairstyles
based on the weights. Finally we generate the interpolated hairstyle via our decoder network.
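Under the same assumptions as the PCA sketch above, the interpolation itself reduces to a weighted average of hair coefficients, as in the following illustrative snippet.

```python
import numpy as np

def interpolate_hairstyles(coeffs, weights):
    """Blend the PCA hair coefficients of several hairstyles.

    coeffs:  (K, 512) array, one coefficient vector per input hairstyle.
    weights: (K,) normalized interpolation weights summing to one.
    """
    weights = np.asarray(weights, dtype=np.float64)
    y = (weights[:, None] * np.asarray(coeffs)).sum(axis=0)
    # Feed y through the PCA inverse transform and the decoders
    # to obtain the interpolated occupancy and flow fields.
    return y
```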
In Figure 3.7, we show interpolation results of multiple hairstyles. The four input hairstyles are shown at
the corners. All the interpolation results are obtained by bi-linearly interpolating the hair coefficients of PCA
embedding computed from the four input hairstyles. We also compare our hairstyle interpolation results to
the output using a state-of-the-art method [210]. As shown in Figure 3.8, our compact representation of hair
latent space leads to much more plausible interpolation results from the two input hairstyles with drastically
different structures.
Memory and Timing Statistics Our hair model takes about 14 MB to store in main memory, including the
PCA embedding and the weights of the decoders for both the occupancy and flow fields. The training of our VAE
model and the hair embedding network takes about eight hours and one day, respectively. The prediction of
PCA coefficients takes less than one second for an input image of resolution 256 × 256, while decoding into
the volumetric representation of occupancy and flow fields takes only about two milliseconds on a GPU. The
generation of the final strands from the occupancy and flow fields takes 0.7 to 1.7 seconds, depending on the
strand lengths. All the timing statistics are measured on a single PC with an Intel Core i7 CPU, 64GB of memory,
and an NVIDIA GeForce GTX 1080 Ti graphics card.
Figure 3.7: Interpolation results of multiple hairstyles. The four input hairstyles are shown at the corners
while all the interpolation results are shown in-between based on bi-linear interpolation weights.
Figure 3.8: Comparison between direct interpolation of hair strands [210] and our latent space interpolation
results.
3.3.2 Discussion
We have presented a fully automatic single-view 3D hair reconstruction method based on a deep learning
framework that is trained end-to-end using a combination of artistically created and synthetically digitized
hair models. Convolutions are made possible by converting 3D hair strands into a volumetric occupancy
grid and a 3D orientation field. We also show that our volumetric variational autoencoder is highly effective
in encoding the immense space of possible hairstyles into a compact feature embedding. Plausible hairstyles
can be sampled and interpolated from the latent space of this VAE. We further show the effectiveness of
using a PCA embedding and an iterative error feedback technique to improve the hairstyle embedding network
for handling difficult input images. Compared to state-of-the-art data-driven techniques, our approach is
significantly faster and more robust, as we do not rely on successful image pre-processing, analysis, or
database retrieval. In addition to our ability to produce hairstyles that were not included in the training
data, we can also handle extremely challenging cases, such as inputs with occluded faces, poorly lit
subjects, and stylized pictures. Due to its minimal storage requirements and superior robustness compared
to existing methods, our 3D hair synthesis framework is particularly well-suited for next-generation avatar
digitization solutions. While we focus on the application of 3D hair digitization, we believe that our
volumetric VAE-based synthesis algorithm can be extended to reconstruct a broad range of non-trivial
shapes such as clothing, furry animals, and facial hair.
Chapter 4
Implicit Shape Representations for Clothed Human Digitization
The ability to digitize and predict a complete and fully textured 3D model of a clothed subject from a single
view can open the door to endless applications, ranging from virtual and augmented reality, gaming, virtual
try-on, to 3D printing. A system that could generate a full-body 3D avatar of a person by simply taking a
picture as input would significantly impact the scalability of producing virtual humans for immersive content
creation, as well as its attainability by the general population. Such single-view inference is extremely
difficult due to the vast range of possible shapes and appearances that clothed human bodies can take in
natural conditions. Furthermore, only a 2D projection of the real world is available and the entire back view
of the subject is missing.
While 3D range sensing [114, 140] and photogrammetry [169] are popular ways of obtaining complete
3D models, they are restricted to a tedious scanning process or require specialized equipment. The modeling
of humans from a single view, on the other hand, has been facilitated by the availability of large 3D human
model repositories [6, 122], where a parametric model of human shapes is used to guide the reconstruction
process [17]. However, these parametric models only represent naked bodies and do not describe the
clothing geometry nor the texture. Another option is to use a voxel representation in order to handle
large shape variations and topology change akin to the hair representation in the previous section, but the
high-memory footprint limits reconstruction resolution as well as network architectures.
In this chapter, we explore alternative shape representations to achieve high-resolution clothed human
digitization. More specifically, we introduce two novel representations using implicit surfaces, where the
underlying 3D shapes are prescribed as the level-set boundary of occupancy indicator functions.
4.1 Related Work
Single-View 3D Human Digitization. Single-view digitization techniques require strong priors due to the
ambiguous nature of the problem. Thus, parametric models of human bodies and shapes [6, 122] are widely
used for digitizing humans from input images. Silhouettes and other types of manual annotations [64, 232]
are often used to initialize the fitting of a statistical body model to images. Bogo et al. [17] proposed a
fully automated pipeline for unconstrained input data. Recent methods involve deep neural networks to
improve the robustness of pose and shape parameter estimation for highly challenging images [92, 152].
Methods that involve part segmentation as input [109, 146] can produce more accurate fittings. Despite
their capability to capture human body measurements and motions, parametric models only produce a naked
human body. The 3D surfaces of clothing, hair, and other accessories are fully ignored. For skin-tight
clothing, a displacement vector for each vertex is sometimes used to model some level of clothing as shown
in [4, 209, 3]. Nevertheless, these techniques fail for more complex topology such as dresses, skirts, and
long hair. To address this issue, template-free methods such as BodyNet [193] learn to directly generate a
voxel representation of the person using a deep neural network. Due to the high memory requirements of
voxel representations, fine-scale details are often missing in the output. More recently, [138] introduced a
multi-view inference approach by synthesizing novel silhouette views from a single image. While multi-view
silhouettes are more memory efficient, concave regions are difficult to infer, as is generating consistent
views. Consequently, fine-scale details cannot be produced reliably. In contrast, PIFu is
memory efficient and is able to capture fine-scale details present in the image, as well as predict per-vertex
colors.
Multi-View 3D Human Digitization. Multi-view acquisition methods are designed to produce a complete
model of a person and simplify the reconstruction problem, but are often limited to studio settings and
calibrated sensors. Early attempts are based on visual hulls [132, 195, 47, 45], which use silhouettes
from multiple views to carve out the visible areas of a capture volume. Reasonable reconstructions can
be obtained when large numbers of cameras are used, but concavities are inherently challenging to handle.
More accurate geometries can be obtained using multi-view stereo constraints [179, 238, 204, 48] or using
controlled illumination, such as multi-view photometric stereo techniques [196, 215]. Several methods use
parametric body models to further guide the digitization process [177, 51, 8, 78, 5, 3]. The use of motion
cues has also been introduced as additional priors [153, 221]. While it is clear that multi-view capture
techniques outperform single-view ones, they are significantly less flexible and deployable.
A middle ground solution consists of using deep learning frameworks to generate plausible 3D surfaces
from very sparse views. [35] train a 3D convolutional LSTM to predict the 3D voxel representation of objects
from arbitrary views. [94] combine information from arbitrary views using differentiable unprojection
operations. [88] also uses a similar approach, but requires at least two views. All of these techniques rely on
the use of voxels, which is memory intensive and prevents the capture of high-frequency details. [79, 57]
introduced a deep learning approach based on a volumetric occupancy field that can capture dynamic
clothed human performances using sparse viewpoints as input. At least three views are required for these
methods to produce reasonable output.
Texture Inference. When reconstructing a 3D model from a single image, the texture can be easily sampled
from the input. However, the appearance in occluded regions needs to be inferred in order to obtain a
complete texture. Related to the problem of 3D texture inference are view-synthesis approaches that predict
novel views from a single image [233, 149] or multiple images [181]. Within the context of texture mesh
inference of clothed human bodies, [138] introduced a view synthesis technique that can predict the back
view from the front one. Both front and back views are then used to texture the final 3D mesh, however
self-occluding regions and side views cannot be handled. Akin to the image inpainting problem [151],
[139] inpaints UV images that are sampled from the output of detected surface points, and [191, 67] infer
per-voxel colors, but the output resolution is very limited. [93] directly predicts RGB values on a UV
parameterization, but their technique can only handle shapes with known topology and is therefore not
suitable for clothing inference. Our proposed method can predict per-vertex colors in an end-to-end fashion
and can handle surfaces with arbitrary topology.
4.2 Silhouette-based Shape Representation
In this section, we propose a deep learning based non-parametric approach for generating the geometry and
texture of clothed 3D human bodies from a single frontal-view image. Our method can predict fine-level
geometric details of clothes and generalizes well to new subjects different from those being used during
training (See Figure 1.6).
While directly estimating 3D volumetric geometry from a single view is notoriously challenging and
likely to require a large amount of training data as well as extensive parameter tuning, two cutting-edge
deep learning techniques have shown that impressive results can be obtained using 2D silhouettes from very
sparse views [79, 193]. Inspired by these approaches based on visual hull, we propose the first algorithm to
predict 2D silhouettes of the subject from multiple views given an input segmentation, which implicitly
encodes 3D body shapes. We also show that a sparse 3D pose estimated from the 2D input [17, 162] can
help reduce the dimensionality of the shape deformation and guide the synthesis of consistent silhouettes
from novel views.
We then reconstruct the final 3D geometry from multiple silhouettes using a deep learning based visual
hull technique by incorporating a clothed human shape prior. Since silhouettes from arbitrary views can
be generated, we further improve the reconstruction result by greedily choosing view points that will lead
to improved silhouette consistency. To fully texture the reconstructed geometry, we propose to train an
image-to-image translation framework to infer the color texture of the back view given the input image
from the frontal view.
Figure 4.1: Overview of our framework.
We demonstrate the effectiveness of our method on a variety of input data, including both synthetic and
real ones. We also evaluate major design decisions using ablation studies and compare our approach with
state-of-the-art single-view as well as multi-view reconstruction techniques.
In summary, our contributions include:
• The first non-parametric solution for reconstructing fully textured and clothed 3D humans from a
single-view input image.
• An effective two-stage 3D shape reconstruction pipeline that consists of predicting multi-view 2D
silhouettes from a single input segmentation and a novel deep visual hull based mesh reconstruction
technique with view sampling optimization.
• An image-to-image translation framework to reconstruct the texture of a full body from a single
photo.
4.2.1 Overview
Our goal is to reconstruct a wide range of 3D clothed human body shapes with a complete texture from a
single image of a person in frontal view. Figure 4.1 illustrates an overview of our system. Given an input
image, we first extract the 2D silhouette and 3D joint locations, which are fed into a silhouette synthesis
network to generate plausible 2D silhouettes from novel viewpoints (Sec. 4.2.2).

Figure 4.2: Illustration of our silhouette synthesis network.

The network produces
multiple silhouettes with known camera projections, which are used as input for 3D reconstruction via visual
hull algorithms [195]. However, due to possible inconsistency between the synthesized silhouettes, the
subtraction operation of visual hull tends to excessively erode the reconstructed mesh. To further improve
the output quality, we adopt a deep visual hull algorithm similar to Huang et al. [79] with a greedy view
sampling strategy so that the reconstruction results account for domain-specific clothed human body priors
(Sec. 4.2.3). Finally, we inpaint the non-visible body texture on the reconstructed mesh by inferring the
back view of the input image using an image-to-image translation network (Sec. 4.2.4).
4.2.2 Multi-View Silhouette Synthesis
We seek an effective human shape representation that can handle the shape complexity due to different
clothing types and deformations. Inspired by visual hull algorithms [132] and recent advances in conditional
image generation [44, 127, 228, 227], we propose to train a generative network for synthesizing 2D
silhouettes from viewpoints other than the input image (see Figure 4.2). We use these silhouettes as an
intermediate implicit representation for the 3D shape inference.
Specifically, given the subject’s 3D pose, estimated from the input image as a set of 3D joint locations,
we project the 3D pose onto the input image plane and a target image plane to obtain the 2D pose P_s in the source
view and the pose P_t in the target view, respectively. Our silhouette synthesis network G_s takes the input
silhouette S_s together with P_s and P_t as input, and predicts the 2D silhouette S_t in the target view:

    S_t = G_s(S_s, P_s, P_t).        (4.1)

Our loss function for training the network G_s consists of reconstruction errors of the inferred silhouettes
using a binary cross-entropy loss L_BCE and a patch-based adversarial loss L_adv [84]. The total objective
function is given by

    L = λ_BCE · L_BCE + L_adv,        (4.2)

where the relative weight λ_BCE is set to 750. In particular, the adversarial loss turns out to be critical for
synthesizing sharp and detailed silhouettes. Figure 4.3 shows that the loss function with the adversarial term
generates much sharper silhouettes, whereas omitting the adversarial loss leads to blurry synthesis output.
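A minimal PyTorch sketch of the generator objective in Eq. 4.2 is shown below; the non-saturating adversarial formulation and the patch-discriminator interface are assumptions, since only the weighting of the two terms is specified above.

```python
import torch
import torch.nn.functional as F

def silhouette_generator_loss(pred_sil, gt_sil, disc_out_fake, lambda_bce=750.0):
    """Eq. 4.2: L = lambda_BCE * L_BCE + L_adv for the silhouette generator.

    pred_sil:      predicted silhouette probabilities in [0, 1].
    gt_sil:        ground-truth binary silhouette (float tensor).
    disc_out_fake: patch-discriminator logits for the predicted silhouette,
                   conditioned on the inputs (assumed pix2pix-style interface).
    """
    l_bce = F.binary_cross_entropy(pred_sil, gt_sil)
    # Non-saturating adversarial term: push every patch towards "real".
    l_adv = F.binary_cross_entropy_with_logits(
        disc_out_fake, torch.ones_like(disc_out_fake))
    return lambda_bce * l_bce + l_adv
```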
Discussions. The advantages of using silhouettes to guide the 3D reconstruction are two-fold. First, since
silhouettes are binary masks, the synthesis can be formulated as a pixel-wise classification problem, which
can be trained more robustly without the need of complex loss functions or extensive hyper parameter tuning
in contrast to novel-view image synthesis [126, 7]. Second, the network can predict a much higher spatial
resolution since it does not store 3D voxel information explicitly, unlike volumetric representations [193],
which are bounded by the limited output resolution.
Figure 4.3: GAN helps generate clean silhouettes in the presence of ambiguity in silhouette synthesis from a
single view.
4.2.3 Deep Visual Hull Prediction
Although our silhouette synthesis algorithm generates sharp prediction of novel-view silhouettes, the
estimated results may not be perfectly consistent as the conditioned 3D joints may fail to fully disambiguate
the details in the corresponding silhouettes (e.g., fingers, wrinkles of garments). Therefore, naively applying
conventional visual hull algorithms is prone to excessive erosion in the reconstruction, since the visual
hull is designed to subtract the inconsistent silhouettes in each view. To address this issue, we propose a
deep visual hull network that reconstructs a plausible 3D shape of clothed body without requiring perfectly
view-consistent silhouettes by leveraging the shape prior of clothed human bodies.
In particular, we use a network structure based on [79]. At a high level, Huang et al. [79] propose
to map 2D images to a 3D volumetric field through a multi-view convolutional neural network. The
3D field encodes the probabilistic distribution of 3D points on the captured surface. By querying the
resulting field, one can instantiate the geometry of clothed human body at an arbitrary resolution. However,
unlike their approach which takes carefully calibrated color images from fixed views as input, our network
only consumes the probability maps of novel-view silhouettes, which can be inconsistent across different
views. Although an arbitrary number of novel-view silhouettes can be generated, it remains challenging to
properly select optimal input views to maximize the network performance. Therefore, we introduce several
improvements to increase the reconstruction accuracy.
Greedy view sampling. We propose a greedy view sampling strategy to choose proper views that can
lead to better reconstruction quality. Our key idea is to generate a pool of candidate silhouettes and then
select the views that are most consistent in a greedy manner. In particular, the candidate silhouettes are
rendered from 12 view bins {B_i}: the main orientations of the bins are obtained by uniformly sampling
12 angles about the yaw axis. The first bin only contains the input view and thus has to be aligned with the
orientation of the input viewpoint. Each of the other bins consists of 5 candidate viewpoints, which are
distributed along the pitch axis with angles sampled from {0°, 15°, 30°, 45°, 60°}. In the end, we obtain 55
candidate viewpoints {V_i} to cover most parts of the 3D body.
To select the views with maximal consistency, we first compute an initial bounding volume of the target
model based on the input 3D joints. We then carve the bounding volume using the silhouette of the input
image and obtain a coarse visual hull H_1. The bins with remaining views are iterated in a clockwise order,
i.e., only one candidate view will be sampled from each bin by the end of the sampling process. Starting
from the second bin B_2, the previously computed visual hull H_1 is projected to its enclosed views. The
candidate silhouette that has the maximum 2D intersection over union (IoU) with H_1's projection will
be selected as the next input silhouette for our deep visual hull algorithm. After the best silhouette V̂_2 is
sampled from B_2, H_1 is further carved by V̂_2 and the updated visual hull H_2 is passed to the next iteration.
We iterate until all the view bins have been sampled.
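The following Python sketch summarizes the greedy view sampling loop at a high level; the `project` and `carve` helpers are hypothetical placeholders for the visual hull machinery described above.

```python
import numpy as np

def iou(a, b):
    # 2D intersection-over-union of two binary masks.
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / max(union, 1)

def greedy_view_sampling(visual_hull, view_bins, project, carve):
    """Pick one candidate silhouette per view bin, maximizing consistency
    with the progressively carved visual hull.

    view_bins: list of bins; each bin is a list of (view, silhouette) pairs.
               The first bin contains only the input view, so it is chosen trivially.
    project:   project(visual_hull, view) -> binary mask of the hull in that view (assumed helper).
    carve:     carve(visual_hull, view, silhouette) -> carved visual hull (assumed helper).
    """
    selected = []
    for bin_candidates in view_bins:
        best = max(bin_candidates,
                   key=lambda vs: iou(vs[1], project(visual_hull, vs[0])))
        selected.append(best)
        visual_hull = carve(visual_hull, best[0], best[1])
    return selected, visual_hull
```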
The selected input silhouettes generated by our greedy view sampling algorithm are then fed into a
deep visual hull network. The choice of our network design is similar to that of [79]. The main difference
lies in the format of inputs. Specifically, in addition to multi-view silhouettes, our network also takes the
2D projection of the 3D pose as an additional channel concatenated with the corresponding silhouette.

Figure 4.4: Illustration of our front-to-back synthesis network.

This change helps to regularize the body part generation by passing the semantic supervision to the network and
thus improves robustness. Moreover, we also reduce some layers of the network of [79] to achieve a more
compact model and to prevent overfitting.
4.2.4 Front-to-Back Texture Synthesis
When capturing the subject from a single viewpoint, only one side of the texture is visible and therefore
predicting the other side of the texture appearance is required to reconstruct a fully textured 3D body shape.
Our key observation is that the frontal view and the back view of a person are spatially aligned by sharing
the same contour and many visual features. This fact has inspired us to solve the problem of back-view
texture prediction using an image-to-image translation framework based on a conditional generative adversarial
network. Specifically, we train a generator G_t to predict the back-view texture Î_b from the frontal-view
input image I_f and the corresponding silhouette S_f:

    Î_b = G_t(I_f, S_f).        (4.3)

We train the generator G_t in a supervised manner by leveraging textured 3D human shape repositories
to generate a dataset that suffices for our training objective (Sec. 4.2.5). Adopted from a high-resolution
image-to-image translation network [201], our loss function consists of a feature matching loss L_FM
that minimizes the discrepancy of the intermediate layer activations of the discriminator D, a perceptual loss
L_VGG using a VGG19 model pre-trained for the image classification task [175], and an adversarial loss L_adv
conditioned on the input frontal image (see Figure 4.4). The total objective is defined as:

    L = λ_FM · L_FM + λ_VGG · L_VGG + L_adv,        (4.4)

where we set the relative weights as λ_FM = λ_VGG = 10.0 in our experiments.
The resulting back-view image is used to complete the per-vertex color texture of the reconstructed 3D
mesh. If the dot product between the surface normal n in the input camera space and the camera ray c is
negative (i.e., the surface is facing towards the camera), the vertex color is sampled from the input-view image
at the corresponding screen coordinate. Likewise, if the dot product is positive (i.e., the surface is facing in the
opposite direction), the vertex color is sampled from the synthesized back-view image. When the surface is
perpendicular to the camera ray (i.e., |n · c| < 1.0 × 10⁻⁴), we blend the colors from the front and
back views so that there are no visible seams across the boundary.
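A sketch of this per-vertex color assignment rule is given below; the `project` and `sample` helpers (screen-space projection and bilinear image sampling) are hypothetical placeholders, and the blending threshold is an assumption.

```python
import numpy as np

def assign_vertex_colors(vertices, normals, cam_rays, front_img, back_img,
                         project, sample, eps=1e-4):
    """Color each vertex from the front or back image depending on whether the
    surface faces the camera. `project` maps a vertex to screen coordinates and
    `sample` bilinearly reads an image at those coordinates (assumed helpers)."""
    colors = np.zeros((len(vertices), 3))
    for i, (v, n, c) in enumerate(zip(vertices, normals, cam_rays)):
        uv = project(v)
        front = sample(front_img, uv)
        back = sample(back_img, uv)
        d = float(np.dot(n, c))
        if d < -eps:          # facing the camera: take the input-view color
            colors[i] = front
        elif d > eps:         # facing away: take the synthesized back view
            colors[i] = back
        else:                 # near-silhouette: blend to hide the seam
            colors[i] = 0.5 * (front + back)
    return colors
```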
4.2.5 Implementation Details
Body mesh datasets. We have collected 73 rigged meshes with full textures from aXYZ and 194 meshes
from Renderpeople. We randomly split the dataset into a training set and a test set of 247 and 20 meshes,
respectively. We apply 48 animation sequences (such as walking, waving, and Samba dancing) from
Mixamo to each mesh from Renderpeople to collect body meshes of different poses. Similarly, the meshes
from aXYZ have been animated into 11 different sequences. To render synthetic training data, we have also
obtained 163 second-order spherical harmonics of indoor environment maps from HDRI Haven and they
are randomly rotated around the yaw axis.
Camera settings for synthetic data. We place the projective camera so that the pelvis joint is aligned
with the image center and the relative body size in the screen space remains unchanged. Since our silhouette
synthesis network takes an unconstrained silhouette as input and generates a new silhouette at predefined
view points, we separate the data generation for the source silhouettes and the target silhouettes. We render
our data images at a resolution of 256 × 256. For the source silhouettes, a yaw angle is randomly sampled
from the full 360° range and a pitch angle between −10° and 60°, whereas for the target silhouettes, a yaw angle is
sampled every 7.5° and a pitch angle from {0°, 15°, 30°, 45°, 60°}. The camera has a randomly sampled
35mm-film-equivalent focal length ranging between 40 and 135mm for the source silhouettes and a fixed
focal length of 800mm for the target silhouettes. For the front-to-back image synthesis, we set the yaw
angle to be frontal and sample the pitch angle from {0°, 7.5°, 15°} with a focal length of 800mm. Given the
camera projection, we project 13 joint locations that are compatible with MPII [136] onto each view point.
Front-to-back rendering. Figure 4.5 illustrates how we generate a pair of front- and back-view images.
Given a camera ray, normal rendering of a 3D mesh sorts the depth of the triangles per pixel and displays the
rasterization result assigned from the closest triangle. To obtain the corresponding image from the other
side, we instead take that of the furthest triangle. Note that most common graphics libraries (e.g., OpenGL,
DirectX) support this function, allowing us to generate training samples within a reasonable amount of time.
Figure 4.6 shows a collection of our rendered examples with both frontal and back views.
Network architectures. Both our silhouette synthesis network and the front-to-back synthesis network
follow the U-Net architecture in [84, 219, 80, 202, 199] with an input channel size of 7 and 4,
respectively. All the weights in these networks are initialized from a Gaussian distribution. We use the
Adam optimizer with learning rates of 2.0 × 10⁻⁴, 1.0 × 10⁻⁴, and 2.0 × 10⁻⁴, batch sizes of 30, 1, and 1,
250,000, 160,000, and 50,000 iterations, and no weight decay for the silhouette synthesis,
deep visual hull, and front-to-back synthesis networks, respectively. The deep visual hull network is trained with the output of
our silhouette synthesis network so that the distribution gap between the output of the silhouette synthesis and
the input of the deep visual hull algorithm is minimized.
Figure 4.5: Illustration of our back-view rendering approach.
Additional networks. Although 2D silhouette segmentation and 3D pose estimation are not our major
contributions and in practice one can use any existing methods, we train two additional networks to
automatically process the input image with consistent segmentation and joint configurations. For the
silhouette segmentation, we adopt a stacked hourglass network [141] with three stacks. Given an input
image of resolution 256 × 256 × 3, the network predicts a probability map of resolution 64 × 64 × 1 for
silhouettes. We further apply a deconvolution layer with a kernel size of 4 to obtain sharper silhouettes, after
concatenating the 2× upsampled probability map and the latent features after the first convolution in the hourglass
network. The network is trained with the mean-squared error between the predicted probability map and the
ground truth of the UP dataset [109]. For 3D pose estimation, we adopt a state-of-the-art 3D face alignment
network [21] without modification. We train the pose estimation network using our synthetically rendered
body images of resolution 256 × 256 together with the corresponding 3D joints. We use the RMSProp
optimizer with a learning rate of 2.0 × 10⁻⁵, a batch size of 8, and no weight decay for training both the
silhouette segmentation and pose estimation networks.
Figure 4.6: Our synthetically rendered training samples in our dataset.
Baseline Methods. To validate our design choice, we compare our silhouette-based reconstruction with
volumetric reconstruction using voxels [193]. Additionally, we evaluate our choice of silhouette input by comparing
with results from RGB input. We describe the implementation details of these baseline methods below.
Figure 4.7: Illustration of the baseline voxel regression network.
For 2D silhouette synthesis using RGB input, we use a network architecture based on U-Net [84],
replacing the original single-channel segmentation with RGB images in our proposed network. We use the
same loss function and optimizer as our silhouette synthesis network. The voxel prediction network is based
on a stacked hourglass network [141]. This network takes as input a silhouette/image ({1, 3} × 256 × 256),
a 2D pose (3 × 256 × 256), and a 3D pose (304 × 64 × 64), where the depth-wise joint heat maps for each
joint are concatenated into the channel dimension (16 × 19 = 304). Here we use two stacks for both the
silhouette-input and RGB-input cases. Following [194], we concatenate the 3D pose information after a 4×
downsampling operation by pooling in the network. The network predicts an occupancy field of the human
body at a resolution of 64 × 64 × 64, which is optimized using a BCE loss L_vol between the ground truth and
the prediction, together with additional reprojection losses from the front view, L_pf, and the side view, L_ps
(see Figure 4.7). The reprojection loss computes the BCE loss between the ground-truth silhouettes and the 2D
projected voxels along the x and z axes using a max operation, constraining the resulting silhouettes from each
view to be consistent with the ground truth [193]. The total loss function is given by

    L = L_vol + λ_p · (L_pf + L_ps),

where the relative weight λ_p is set to 0.1 in our experiments. We use the RMSProp optimizer with a learning
rate of 2.0 × 10⁻⁴ and a batch size of 4. Note that this ablation study uses only frontal views as input for
simplicity.
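For reference, a minimal PyTorch sketch of the reprojection term used in the baseline loss above is shown below; the axis conventions for the front and side projections are assumptions.

```python
import torch
import torch.nn.functional as F

def reprojection_loss(voxels, sil_front, sil_side, lambda_p=0.1):
    """voxels: (B, 1, X, Y, Z) predicted occupancy probabilities.
    sil_front, sil_side: ground-truth binary silhouettes of matching resolution.
    Max-projection along an axis approximates the rendered silhouette."""
    proj_front = voxels.max(dim=4)[0]   # collapse the z axis (assumed front view)
    proj_side = voxels.max(dim=2)[0]    # collapse the x axis (assumed side view)
    l_pf = F.binary_cross_entropy(proj_front, sil_front)
    l_ps = F.binary_cross_entropy(proj_side, sil_side)
    return lambda_p * (l_pf + l_ps)
```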
4.2.6 Results
Figure 4.9 shows our 3D reconstruction results of clothed human bodies using test images from the synthetically
rendered data. These test images have not been used for training. For each image, we show the back-view
synthesis result, the reconstructed 3D geometry with plain shading, as well as the fully textured output
mesh rendered from a different view point. Figure 4.8 shows our reconstruction results of 3D clothed
human bodies with full textures on different single-view input images from the DeepFashion dataset [120].
For each input, we show the back-view texture synthesis result, the reconstructed 3D geometry rendered
with plain shading, as well as the final textured geometry. Our method can robustly handle a variety of
realistic test photos of different poses, body shapes, and cloth styles, although we train the networks using
synthetically rendered images only.
Comparisons In Figure 4.10, we compare our method using single-view input with a naive visual hull
algorithm using 8 input views as well as Huang et al. [79] using four input views. For each result, we show
both the plain shaded 3D geometry and the color-coded 3D reconstruction error. Although using a single
image as input each time, we can still generate results that are visually comparable to those from methods
based on multi-view input.
Figure 4.8: Our 3D reconstruction results of clothed human body using test images from the DeepFashion
dataset [120].
Figure 4.9: Our 3D reconstruction results of clothed human body using test images from the synthetically
rendered data.
In Figure 4.11, we qualitatively compare our results with state-of-the-art single-view human reconstruc-
tion techniques [92, 193]. Since existing methods focus on body shape only using parametric models, our
approach can generate more faithful results in cases of complex clothed geometry.
4.2.7 Evaluation
Silhouette Representation. We verify the effectiveness of our silhouette-based representation by com-
paring it with several alternative approaches based on the Renderpeople dataset, including direct voxel
84
input visual hull
(8 views)
[Huang et al.]
(4 views)
ours
(1 view)
Figure 4.10: Comparison with multi-view visual hull algorithms. Despite the single-view input, our method
produces comparable reconstruction results. Note that the input image marked in red is the single-view input for
our method, while the top four views are used for Huang et al. [79].
Figure 4.11: We qualitatively compare our method with two state-of-the-art single view human reconstruc-
tion techniques, HMR [92] and BodyNet [193].
prediction from the 3D pose, and using an RGB image instead of a 2D silhouette as input to the deep visual hull
algorithm. For all the methods, we report (1) the 2D Intersection over Union (IoU) for the synthetically gen-
erated side view and (2) the 3D reconstruction error, measured as the Chamfer distance (in centimeters) between the
reconstructed meshes and the ground truth, in Table 4.1. It is evident that direct voxel prediction leads
to poor accuracy when matching the side view in 2D and aligning with the ground-truth geometry in 3D, as
compared to our silhouette-based representation. Figure 4.12 shows qualitative comparisons demonstrating
the advantages of our silhouette-based representation.
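For reference, the two metrics reported in Table 4.1 can be computed as in the following sketch; shapes, units, and the exact averaging convention of the Chamfer distance are assumptions rather than the evaluation code used here.

```python
import numpy as np
from scipy.spatial import cKDTree

def silhouette_iou(pred_mask, gt_mask):
    """pred_mask, gt_mask: boolean arrays of the same 2D shape."""
    intersection = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return intersection / max(union, 1)

def chamfer_distance(points_a, points_b):
    """points_a, points_b: (N, 3) and (M, 3) surface samples (in centimeters).
    Symmetric nearest-neighbor distance; averaging convention is an assumption."""
    d_ab, _ = cKDTree(points_b).query(points_a)  # nearest point in B for each point in A
    d_ba, _ = cKDTree(points_a).query(points_b)  # nearest point in A for each point in B
    return d_ab.mean() + d_ba.mean()
```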
Visual hull reconstruction. In Table 4.2 and Figure 4.13, we compare our deep visual hull algorithm
(Sec. 4.2.3) with a naive visual hull method. We also evaluate our greedy view sampling strategy by
comparing it with random view selection. We use 12 inferred silhouettes as input for different methods
Figure 4.12: Qualitative evaluation of our silhouette-based shape representation as compared to direct voxel
prediction.
Input                  Output       IoU (2D)   CD     EMD
RGB + 2D Pose          Silhouette   0.826      1.66   4.38
Silhouette + 2D Pose   Silhouette   0.886      1.36   3.69
RGB + 3D Pose          Voxel        0.471      2.49   5.67
Silhouette + 3D Pose   Voxel        0.462      2.77   6.23

Table 4.1: Evaluation of our silhouette-based representation compared to direct voxel prediction. The errors
are measured using the Chamfer Distance (CD) and Earth Mover's Distance (EMD) between the reconstructed
meshes and the ground truth.
and evaluate the reconstruction errors using Chamfer distances. For random view selection, we repeat
the process 100 times and compute the average error. As additional references, we also provide the
corresponding results using the naive visual hull method with 8 ground-truth silhouettes, as well as the
method in [79] using 4 ground-truth images. As shown in Table 4.2, our deep visual hull algorithm outperforms
the naive approach, and our greedy view sampling strategy significantly improves the results in terms of
reconstruction errors. In addition, for the deep visual hull algorithm, our view sampling strategy is better
than 69% of the randomly selected view sets, while for the naive visual hull method, our approach always outperforms
random view selection. Figure 4.13 demonstrates that our deep visual hull method helps fix some artifacts
Figure 4.13: Comparison between our deep visual hull method and a naive visual hull algorithm, using
both random view selection and our greedy view sampling strategy.
Input            Method                        CD     EMD
Inferred         visual hull (random)          2.12   6.95
silhouettes      visual hull (optimized)       1.37   6.93
                 deep v-hull (random)          1.41   3.79
                 deep v-hull (optimized)       1.34   3.66
GT silhouettes   visual hull (8 views)         0.67   3.19
GT images        Huang et al. [79] (4 views)   0.98   4.09

Table 4.2: Evaluation of our greedy sampling method to compute the deep visual hull.
and missing parts, especially in concave regions, which are caused by inconsistencies among the multi-view
silhouette synthesis results.
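For context, the naive visual hull baseline can be summarized by the following sketch, in which a voxel is kept only if it projects inside the silhouette of every view; the camera model and the project() helpers are hypothetical placeholders, not the calibration used in our experiments.

```python
import numpy as np

def naive_visual_hull(silhouettes, projections, grid_points):
    """
    silhouettes: list of (H, W) boolean masks.
    projections: list of functions mapping (N, 3) world points to (N, 2) pixel coords.
    grid_points: (N, 3) voxel-center coordinates.
    Returns a boolean occupancy vector of length N.
    """
    occupancy = np.ones(len(grid_points), dtype=bool)
    for sil, project in zip(silhouettes, projections):
        uv = np.round(project(grid_points)).astype(int)
        # Only points projecting inside the image can be inside the silhouette.
        in_image = ((uv[:, 0] >= 0) & (uv[:, 0] < sil.shape[1]) &
                    (uv[:, 1] >= 0) & (uv[:, 1] < sil.shape[0]))
        inside_sil = np.zeros(len(grid_points), dtype=bool)
        inside_sil[in_image] = sil[uv[in_image, 1], uv[in_image, 0]]
        occupancy &= inside_sil  # intersect with this view's silhouette cone
    return occupancy
```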
Silhouette Synthesis Figure 4.14 and Table 4.3 show the ablation study on the design choice of our
silhouette synthesis network. To validate the importance of 2D pose information of the input view, we train
the same silhouette synthesis network using the same configuration but without the 2D pose of the input
view. The reconstruction accuracy is evaluated by computing the mean Intersection over Union (IoU) using the
subjects in our test set from the 12 predefined views spanning every 30 degrees about the yaw axis.
Figure 4.14: Qualitative evaluation of different silhouette synthesis methods. From left to right: silhouette
of the input view, ground-truth silhouette of the target view, results of our full algorithm, results without 2D
pose information, results from a set of predefined view points, and the ones by training on the SURREAL
dataset [194].
The model without 2D pose from the input view has difficulty associating loose clothes (e.g., dresses)
with the novel view points, impairing the overall performance (see the fourth column in Figure 4.14).
We also train a silhouette synthesis network that predicts a set of silhouettes from a set of predefined
view points in one go, instead of independently predicting silhouettes from each view point together with 2D
joint information from the target view. The network generates the silhouettes from the predefined 12 view
points at once. All the other configurations are identical to our main algorithm. This alternative approach
training dataset   output        w/ input view pose   IoU (2D)
RenderPeople       single view   yes                  0.882
RenderPeople       single view   no                   0.875
RenderPeople       all views     yes                  0.806
SURREAL            single view   yes                  0.782

Table 4.3: Ablation study of our silhouette-based representation.
Figure 4.15: Failure cases: incorrect pose estimation, poor back-view inference, and failed segmentation.
also fails to produce plausible silhouettes and severely overfits to the training data samples (see the fifth
column of Figure 4.14).
Lastly, we demonstrate the importance of our clothed human training dataset for faithfully capturing
subjects in various clothes. We train our proposed network on the SURREAL dataset [194], in which
all the subjects are in tightly-fitting clothes. We randomly select 14,490 meshes from the training set of
SURREAL and train our silhouette synthesis network with the same configuration as ours. Due to the lack
of varied clothing details, the resulting model is unable to predict plausible silhouettes for loose clothes (see
the last column of Figure 4.14).
4.2.8 Discussion
In this section, we present a framework for monocular 3D human reconstruction using deep neural networks.
From a single input image of the subject, we can predict the textured 3D geometry of the clothed body,
without requiring a parametric model or a pre-captured template. To this end, we propose a novel-
view silhouette synthesis network based on adversarial training, an improved deep visual hull algorithm
with a greedy view selection strategy, as well as a front-to-back texture synthesis network.
One major limitation of our current implementation is that our synthetic training data is very limited
and may be biased with respect to real images. See Figure 4.15 for a few typical failure cases, in which the 3D pose
estimation may fail or there are additional accessories not covered by our training data. It would be
helpful to add realistic training data, which may however be tedious and costly to acquire. The output mesh from our
method is not rigged and thus cannot be directly used for animation. Also, we do not explicitly separate the
geometry of the clothing from the human body. In the future, we plan to extend our method to predict output with high-
frequency details and semantic labels. Finally, it would be interesting to infer relightable textures such as diffuse
and specular albedo maps.
4.3 Pixel-Aligned Implicit Functions
For certain domain-specific objects, such as faces, human bodies, or known man-made objects, it is already
possible to infer relatively accurate 3D surfaces from images with the help of parametric models, data-driven
techniques, or deep neural networks. Recent 3D deep learning advances have shown that general shapes can
be inferred from very few images and sometimes even a single input. However, the resulting resolutions and
accuracy are typically limited, due to ineffective model representations, even for domain specific modeling
tasks.
In this section, we propose a new Pixel-aligned Implicit Function (PIFu) representation for 3D deep
learning for the challenging problem of textured surface inference of clothed 3D humans from a single
or multiple input images. While most successful deep learning methods for 2D image processing (e.g.,
semantic segmentation [172], 2D joint detection [189], etc.) take advantage of “fully-convolutional” network
architectures that preserve the spatial alignment between the image and the output, this is particularly
challenging in the 3D domain. While voxel representations [193] can be applied in a fully-convolutional
manner, the memory-intensive nature of the representation inherently restricts its ability to produce fine-scale
detailed surfaces. Inference techniques based on global representations [63, 92, 3] are more memory
efficient, but cannot guarantee that details of input images are preserved. Similarly, methods based on
implicit functions [33, 150, 134] rely on the global context of the image to infer the overall shape, which
may not align with the input image accurately. On the other hand, PIFu aligns individual local features
at the pixel level to the global context of the entire object in a fully convolutional manner, and does not
require high memory usage, as in voxel-based representations. This is particularly relevant for the 3D
reconstruction of clothed subjects, whose shape can be of arbitrary topology, highly deformable and highly
detailed. While [79] also utilizes local features, due to the lack of a 3D-aware feature fusion mechanism, their
approach is unable to reason about 3D shapes from a single view. In this work we show that the combination of
local features and a 3D-aware implicit surface representation makes a significant difference, enabling highly
detailed reconstruction even from a single view.
Specifically, we train an encoder to learn individual feature vectors for each pixel of an image that takes
into account the global context relative to its position. Given this per-pixel feature vector and a specified
z-depth along the outgoing camera ray from this pixel, we learn an implicit function that can classify
whether a 3D point corresponding to this z-depth is inside or outside the surface. In particular, our feature
vector spatially aligns the global 3D surface shape to the pixel, which allows us to preserve local details
present in the input image while inferring plausible ones in unseen regions.
Our end-to-end and unified digitization approach can directly predict high-resolution 3D shapes of
a person with complex hairstyles and wearing arbitrary clothing. Despite the amount of unseen regions,
particularly for a single-view input, our method can generate a complete model similar to ones obtained
from multi-view stereo photogrammetry or other 3D scanning techniques. As shown in Figure 1.7, our
algorithm can handle a wide range of complex clothing, such as skirts, scarfs, and even high-heels while
capturing high frequency details such as wrinkles that match the input image at the pixel level.
By simply adopting the implicit function to regress RGB values at each queried point along the ray,
PIFu can be naturally extended to infer per-vertex colors. Hence, our digitization framework also generates
a complete texture of the surface, while predicting plausible appearance details in unseen regions. Through
additional multi-view stereo constraints, PIFu can also be naturally extended to handle multiple input
images, as is often desired for practical human capture settings. Since producing a complete textured mesh
is already possible from a single input image, adding more views only improves our results further by
providing additional information for unseen regions.
We demonstrate the effectiveness and accuracy of our approach on a wide range of challenging real-
world and unconstrained images of clothed subjects. We also show for the first time, high-resolution
examples of monocular and textured 3D reconstructions of dynamic clothed human bodies reconstructed
from a video sequence. We provide comprehensive evaluations of our method using ground truth 3D
scan datasets obtained using high-end photogrammetry. We compare our method with prior work and
demonstrate the state-of-the-art performance on a public benchmark for digitizing clothed humans.
Figure 4.16: Overview of our clothed human digitization pipeline. Given an input image, a pixel-aligned im-
plicit function (PIFu) predicts the continuous inside/outside probability field of a clothed human. Similarly,
PIFu for texture inference (Tex-PIFu) infers an RGB value at given 3D positions of the surface geometry
with arbitrary topology.
4.3.1 Overview
Given single or multi-view images, our goal is to reconstruct the underlying 3D geometry and texture of
a clothed human while preserving the detail present in the image. To this end, we introduce Pixel-Aligned
Implicit Functions (PIFu), a memory-efficient and spatially aligned 3D representation for 3D
surfaces. An implicit function defines a surface as a level set of a function f, e.g., f(X) = 0 [168]. This
results in a memory efficient representation of a surface where the space in which the surface is embedded
does not need to be explicitly stored. The proposed pixel-aligned implicit function consists of a fully
convolutional image encoder g and a continuous implicit function f represented by multi-layer perceptrons
(MLPs), where the surface is defined as a level set of

    f(F(x), z(X)) = s, \quad s \in \mathbb{R},        (4.5)

where, for a 3D point X, x = \pi(X) is its 2D projection, z(X) is the depth value in the camera coordinate
space, and F(x) = g(I(x)) is the image feature at x. We assume a weak-perspective camera, but extending to
perspective cameras is straightforward. Note that we obtain the pixel-aligned feature F(x) using bilinear
sampling, because the 2D projection of X is defined in a continuous space rather than a discrete one (i.e.,
pixel).
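A minimal sketch of the pixel-aligned feature lookup in Eq. 4.5 is shown below, assuming a weak-perspective camera so that the projection x of a 3D point is simply its normalized (x, y) image coordinate; the encoder and MLP are placeholders for the architectures described later in Sec. 4.3.5.

```python
import torch
import torch.nn.functional as F

def query_pifu(feature_map, points, mlp):
    """
    feature_map: (B, C, H, W) output of the image encoder g.
    points:      (B, N, 3) query points; xy in [-1, 1] is the 2D projection,
                 z is the depth in the camera coordinate space.
    mlp:         network implementing f(F(x), z(X)) -> value in [0, 1].
    """
    xy = points[:, :, :2].unsqueeze(2)            # (B, N, 1, 2) sampling grid
    # Bilinear sampling gives the pixel-aligned feature F(x) at continuous x.
    feat = F.grid_sample(feature_map, xy, mode='bilinear', align_corners=True)
    feat = feat.squeeze(-1).transpose(1, 2)       # (B, N, C)
    z = points[:, :, 2:]                          # (B, N, 1) depth values z(X)
    return mlp(torch.cat([feat, z], dim=-1))      # (B, N, 1) predicted field value
```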
The key observation is that we learn an implicit function over the 3D space with pixel-aligned image
features rather than global features, which allows the learned functions to preserve the local detail present in
the image. The continuous nature of PIFu allows us to generate detailed geometry with arbitrary topology
in a memory efficient manner. Moreover, PIFu can be cast as a general framework that can be extended to
various co-domains such as RGB colors.
Digitization Pipeline. Figure 4.16 illustrates the overview of our framework. Given an input image, PIFu
for surface reconstruction predicts the continuous inside/outside probability field of a clothed human, in
which iso-surface can be easily extracted (Sec. 4.3.2). Similarly, PIFu for texture inference (Tex-PIFu)
outputs an RGB value at 3D positions of the surface geometry, enabling texture inference in self-occluded
surface regions and shapes of arbitrary topology (Sec. 4.3.3). Furthermore, we show that the proposed
approach can handle single-view and multi-view input naturally, which allows us to produce even higher
fidelity results when more views are available (Sec. 4.3.4).
4.3.2 Single-view Surface Reconstruction
For surface reconstruction, we represent the ground truth surface as a 0.5 level set of a continuous 3D
occupancy field:

    f_v(X) = \begin{cases} 1, & \text{if } X \text{ is inside the mesh surface,} \\ 0, & \text{otherwise.} \end{cases}        (4.6)
We train a pixel-aligned implicit function (PIFu) f_v by minimizing the average mean squared error:

    L_V = \frac{1}{n} \sum_{i=1}^{n} \left| f_v(F_V(x_i), z(X_i)) - f_v(X_i) \right|^2,        (4.7)

where X_i ∈ R^3, F_V(x) = g(I(x)) is the image feature from the image encoder g at x = \pi(X), and n
is the number of sampled points. Given a pair of an input image and the corresponding 3D mesh that
is spatially aligned with the input image, the parameters of the image encoder g and PIFu f_v are jointly
updated by minimizing Eq. 4.7. As Bansal et al. [9] demonstrate for semantic segmentation, training an
image encoder with a subset of pixels does not hurt convergence compared with training with all the pixels.
During inference, we densely sample the probability field over the 3D space and extract the iso-surface
of the probability field at threshold 0.5 using the Marching Cubes algorithm [123]. This implicit surface
representation is suitable for detailed objects with arbitrary topology. Aside from PIFu's expressiveness
and memory efficiency, we develop a spatial sampling strategy that is critical for achieving high-fidelity
inference.
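The inference step above can be sketched as follows; the grid resolution, bounding box, and chunked evaluation are illustrative choices rather than fixed parts of the method.

```python
import numpy as np
from skimage import measure

def reconstruct_mesh(query_fn, resolution=256, bbox_min=-1.0, bbox_max=1.0):
    """query_fn maps an (N, 3) array of 3D points to (N,) occupancy probabilities."""
    coords = np.linspace(bbox_min, bbox_max, resolution)
    grid = np.stack(np.meshgrid(coords, coords, coords, indexing='ij'), axis=-1)
    points = grid.reshape(-1, 3)

    # Evaluate the probability field densely, in chunks to bound memory use.
    occupancy = np.concatenate([
        query_fn(chunk) for chunk in np.array_split(points, 256)
    ]).reshape(resolution, resolution, resolution)

    # Extract the iso-surface of the probability field at threshold 0.5.
    verts, faces, normals, _ = measure.marching_cubes(occupancy, level=0.5)
    # Map vertex coordinates from grid indices back to world coordinates.
    verts = bbox_min + verts * (bbox_max - bbox_min) / (resolution - 1)
    return verts, faces, normals
```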
Spatial Sampling. The resolution of the training data plays a central role in achieving the expressiveness
and accuracy of our implicit function. Unlike voxel-based methods, our approach does not require dis-
cretization of ground truth 3D meshes. Instead, we can directly sample 3D points on the fly from the ground
truth mesh at the original resolution using an efficient ray tracing algorithm [198]. Note that this operation
requires watertight meshes. In the case of non-watertight meshes, one can use off-the-shelf solutions to
make the meshes watertight [10]. Additionally, we observe that the sampling strategy can largely influence
the final reconstruction quality. If one uniformly samples points in the 3D space, the majority of points
are far from the iso-surface, which would unnecessarily weight the network toward outside predictions.
On the other hand, sampling only around the iso-surface can cause overfitting. Consequently, we propose
to combine uniform sampling and adaptive sampling based on the surface geometry. We first randomly
sample points on the surface geometry and add offsets drawn from a normal distribution N(0, σ) (σ = 5.0 cm in our
experiments) along the x, y, and z axes to perturb their positions around the surface. We combine these samples
with uniformly sampled points within the bounding box using a ratio of 16:1. We provide an ablation study
on our sampling strategy in Sec. 4.3.7.
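A minimal sketch of this sampling scheme is given below, using trimesh purely for illustration; it assumes a watertight mesh in centimeter units so that inside/outside labels can be queried directly.

```python
import numpy as np
import trimesh

def sample_training_points(mesh, n_total=5000, sigma=5.0, ratio=16):
    """Mix surface-adaptive samples and uniform samples at a ratio:1 proportion."""
    n_surface = n_total * ratio // (ratio + 1)
    n_uniform = n_total - n_surface

    # Adaptive samples: points on the surface, perturbed along x, y, z with N(0, sigma).
    surface_pts, _ = trimesh.sample.sample_surface(mesh, n_surface)
    surface_pts = surface_pts + np.random.normal(scale=sigma, size=surface_pts.shape)

    # Uniform samples inside the bounding box.
    b_min, b_max = mesh.bounds
    uniform_pts = np.random.uniform(b_min, b_max, size=(n_uniform, 3))

    points = np.concatenate([surface_pts, uniform_pts], axis=0)
    # Ground-truth occupancy labels; requires a watertight mesh.
    labels = mesh.contains(points).astype(np.float32)
    return points, labels
```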
4.3.3 Texture Inference
While texture inference is often performed on either a 2D parameterization of the surface [93, 65] or in
view-space [138], PIFu enables us to directly predict the RGB colors on the surface geometry by defining s
in Eq. 4.5 as an RGB vector field instead of a scalar field. This supports texturing of shapes with arbitrary
topology and self-occlusion. However, extending PIFu to color prediction is a non-trivial task, as RGB
colors are defined only on the surface while the 3D occupancy field is defined over the entire 3D space.
Here, we highlight the modifications of PIFu in terms of training procedure and network architecture.
Given sampled 3D points on the surface X ∈ Ω, the objective function for texture inference is the
average L1 error of the sampled colors:

    L_C = \frac{1}{n} \sum_{i=1}^{n} \left| f_c(F_C(x_i), z(X_i)) - C(X_i) \right|,        (4.8)

where C(X_i) is the ground truth RGB value at the surface point X_i ∈ Ω and n is the number of sampled
points. We found that naively training f_c with the loss function above severely suffers from overfitting.
The problem is that f_c is expected to learn not only the RGB color on the surface but also the underlying 3D
surface of the object, so that f_c can infer the texture of unseen surfaces with different pose and shape during
inference, which poses a significant challenge. We address this problem with the following modifications.
First, we condition the image encoder for texture inference on the image features learned for surface
reconstruction, F_V. This way, the image encoder can focus on color inference of a given geometry even if
unseen objects have a different shape, pose, or topology. Additionally, we introduce an offset ε ∼ N(0, d) to
the surface points along the surface normal N so that the color can be defined not only on the exact surface
Figure 4.17: Multi-view PIFu. PIFu can be extended to support multi-view inputs by decomposing the implicit
function f into a feature embedding function f_1 and a multi-view reasoning function f_2. f_1 computes a
feature embedding from each view in the 3D world coordinate system, which allows aggregation from
arbitrary views. f_2 takes the aggregated feature vector to make a more informed 3D surface and texture
prediction.
but also in the 3D space around it. With the modifications above, the training objective function can be
rewritten as:

    L_C = \frac{1}{n} \sum_{i=1}^{n} \left| f_c(F_C(x'_i, F_V), X'_{i,z}) - C(X_i) \right|,        (4.9)

where X'_i = X_i + ε · N_i. We used d = 1.0 cm for all the experiments.
4.3.4 Multi-View Stereo
Additional views provide more coverage of the person and should improve the digitization accuracy.
Our formulation of PIFu provides the option to incorporate information from more views for both surface
reconstruction and texture inference. We achieve this by using PIFu to learn a feature embedding for every
3D point in space. Specifically, the output domain of Eq. 4.5 is now an n-dimensional vector space s ∈ R^n
that represents the latent feature embedding associated with the specified 3D coordinate and the image
feature from each view. Since this embedding is defined in the 3D world coordinate space, we can aggregate
the embedding from all available views that share the same 3D point. The aggregated feature vector can be
used to make a more confident prediction of the surface and the texture.
Specifically, we decompose the pixel-aligned function f into a feature embedding network f_1 and a
multi-view reasoning network f_2, as f := f_2 ∘ f_1. See Figure 4.17 for an illustration. The first function
f_1 encodes the image feature F_i(x_i), with x_i = \pi_i(X), and the depth value z_i(X) from each view point i into
a latent feature embedding Φ_i. This allows us to aggregate the corresponding pixel features from all the
views. Since the corresponding 3D point X is shared by the different views, each image can project X
onto its own image coordinate system via \pi_i(X) and z_i(X). Then, we aggregate the latent features Φ_i by
an average pooling operation and obtain the fused embedding \bar{Φ} = mean({Φ_i}). The second function f_2 maps
the aggregated embedding \bar{Φ} to our target implicit fields (i.e., inside/outside probability for surface
reconstruction and RGB value for texture inference). The additive nature of the latent embedding allows
us to incorporate an arbitrary number of inputs. Note that a single-view input can also be handled without
modification in the same framework, as the average operation simply returns the original latent embedding.
For training, we use the same training procedure as in the aforementioned single-view case, including the loss
functions and the point sampling scheme. While we train with three random views, our experiments show
that the model can incorporate information from more than three views (see Sec. 4.3.7).
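A minimal sketch of this decomposition is shown below; module sizes and layer counts are placeholders, and only the average pooling over per-view embeddings is essential to the idea.

```python
import torch
import torch.nn as nn

class MultiViewPIFu(nn.Module):
    def __init__(self, feat_dim=256, embed_dim=256):
        super().__init__()
        # f1: maps per-view (pixel-aligned feature, depth) to a latent embedding Phi_i.
        self.f1 = nn.Sequential(nn.Linear(feat_dim + 1, embed_dim), nn.ReLU(),
                                nn.Linear(embed_dim, embed_dim), nn.ReLU())
        # f2: maps the fused embedding to the target field (occupancy here).
        self.f2 = nn.Sequential(nn.Linear(embed_dim, 128), nn.ReLU(),
                                nn.Linear(128, 1), nn.Sigmoid())

    def forward(self, per_view_feats, per_view_depths):
        """
        per_view_feats:  (V, N, feat_dim) pixel-aligned features, one row per view.
        per_view_depths: (V, N, 1) depth of each query point in each view's camera.
        """
        embeddings = self.f1(torch.cat([per_view_feats, per_view_depths], dim=-1))
        fused = embeddings.mean(dim=0)   # average pooling over views (works for any V >= 1)
        return self.f2(fused)            # (N, 1) inside/outside probability
```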
4.3.5 Implementation Details
Experimental Setup. Since there are no large-scale datasets for high-resolution clothed humans, we
collected photogrammetry data of 491 high-quality textured human meshes with a wide range of clothing,
shapes, and poses, each consisting of about 100,000 triangles, from RenderPeople. We refer to this database
as the High-Fidelity Clothed Human Dataset. We randomly split the dataset into a training set of 442 subjects
and a test set of 49 subjects. To efficiently render digital humans, Lambertian diffuse shading with
surface normals and spherical harmonics is typically used due to its simplicity and efficiency [194, 138].
However, we found that to achieve high-fidelity reconstructions on real images, the synthetic renderings
need to correctly simulate light transport effects resulting from both global and local geometric properties
such as ambient occlusion. To this end, we use a precomputed radiance transfer technique (PRT) that
precomputes visibility on the surface using spherical harmonics and efficiently represents global light
Figure 4.18: Qualitative single-view results on real images from the DeepFashion dataset [120]. The proposed
Pixel-Aligned Implicit Function (PIFu) achieves a topology-free, memory-efficient, spatially aligned 3D
reconstruction of the geometry and texture of clothed humans.
transport effects by multiplying the spherical harmonics coefficients of illumination and visibility [176]. PRT
only needs to be computed once per object and can be reused with arbitrary illuminations and camera
angles. Together with PRT, we use 163 second-order spherical harmonics of indoor scenes from HDRI
Haven, using random rotations around the y axis. We render the images by aligning subjects to the image center
using a weak-perspective camera model and an image resolution of 512 × 512. We also rotate the subjects
over 360 degrees about the yaw axis, resulting in 360 × 442 = 159,120 images for training. For the evaluation, we
render 49 subjects from RenderPeople and 5 subjects from the BUFF data set [226] using 4 views spanning
every 90 degrees about the yaw axis. Note that we render the images without the background. We also test our
approach on real images of humans from the DeepFashion data set [120]. In the case of real data, we use an
off-the-shelf semantic segmentation network together with Grab-Cut refinement [164].
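For illustration, diffuse PRT shading as described in the setup above amounts to a dot product between precomputed per-vertex transfer coefficients (which bake in visibility/ambient occlusion) and the lighting coefficients; the sketch below assumes a (V, 9) transfer matrix for second-order spherical harmonics and is not the renderer used to generate the training data.

```python
import numpy as np

def prt_shade(albedo, transfer_coeffs, light_coeffs):
    """
    albedo:          (V, 3) per-vertex diffuse albedo.
    transfer_coeffs: (V, 9) precomputed per-vertex transfer (visibility) coefficients.
    light_coeffs:    (9, 3) SH coefficients of the environment illumination.
    """
    irradiance = transfer_coeffs @ light_coeffs      # (V, 3) incoming light per vertex
    return albedo * np.clip(irradiance, 0.0, None)   # clamp to avoid negative radiance
```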
Network Architecture. Since the framework of PIFu is not limited to a specific network architecture, one
can technically use any fully convolutional neural network as the image encoder. For surface reconstruction,
we adapt the stacked hourglass network [141] with modifications proposed by [86]. We also replace batch
normalization with group normalization [216], which improves the training stability when the batch sizes
are small. Similar to [86], the intermediate features of each stack are fed into PIFu, and the losses from
all the stacks are aggregated for the parameter update. We have conducted an ablation study on the network
architecture design and compare against other alternatives (VGG16, ResNet34) in Appendix II. The image
encoder for texture inference adopts the architecture of CycleGAN [235], consisting of 6 residual blocks
[89]. Instead of using transpose convolutions to upsample the latent features, we directly feed the output of
the residual blocks to the following Tex-PIFu.
PIFu for surface reconstruction is based on a multi-layer perceptron, where the number of neurons is
(257, 1024, 512, 256, 128, 1), with non-linear activations using leaky ReLU except for the last layer, which uses
a sigmoid activation. To effectively propagate the depth information, each layer of the MLP has skip connections
from the image feature F(x) ∈ R^256 and the depth z, in the spirit of [33]. For multi-view PIFu, we simply take
the output of the 4-th layer as the feature embedding and apply average pooling to aggregate the embeddings from
different views. Tex-PIFu takes F_C(x) ∈ R^256 together with the image feature for surface reconstruction,
F_V(x) ∈ R^256, by setting the number of neurons in the first layer of the MLP to 513 instead of 257. We also replace
the last layer of PIFu with 3 neurons, followed by a tanh activation, to represent RGB values.
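The following is a minimal sketch of such an MLP with the stated layer widths; exactly which hidden layers receive the skip-connected input is an implementation detail assumed here.

```python
import torch
import torch.nn as nn

class PIFuMLP(nn.Module):
    def __init__(self, widths=(257, 1024, 512, 256, 128, 1)):
        super().__init__()
        self.layers = nn.ModuleList()
        in_dim = widths[0]
        for i, out_dim in enumerate(widths[1:]):
            # Hidden layers after the first also receive the skip-connected input.
            extra = widths[0] if (i > 0 and out_dim != 1) else 0
            self.layers.append(nn.Linear(in_dim + extra, out_dim))
            in_dim = out_dim
        self.act = nn.LeakyReLU()

    def forward(self, x):
        """x: (N, 257) concatenation of the pixel-aligned feature F(x) and depth z."""
        skip = x
        h = x
        for i, layer in enumerate(self.layers):
            if 0 < i < len(self.layers) - 1:
                h = torch.cat([h, skip], dim=-1)  # skip connection from the input
            h = layer(h)
            h = torch.sigmoid(h) if i == len(self.layers) - 1 else self.act(h)
        return h
```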
Training procedure. Since the texture inference module requires pretrained image features from the
surface reconstruction module, we first train PIFu for surface reconstruction and then for texture
inference, using the learned image features F_V as condition. We use RMSProp for the surface reconstruction,
following [141], and Adam for the texture inference with a learning rate of 1 × 10^{-3}, as in [235], batch
sizes of 3 and 5, 12 and 6 epochs, and 5,000 and 10,000 sampled points per
object in every training batch, respectively. The learning rate of RMSProp is decayed by a factor of 0.1 at
the 10-th epoch, following [141]. The multi-view PIFu is fine-tuned from the models trained for single-view
surface reconstruction and texture inference with a learning rate of 1 × 10^{-4} for 2 epochs. The training
of PIFu for single-view surface reconstruction and texture inference takes 4 and 2 days, respectively, and
fine-tuning for multi-view PIFu can be done within 1 day on a single 1080 Ti GPU.
4.3.6 Results
In Figure 4.18, we present our digitization results using real world input images from the DeepFashion
dataset [120]. We demonstrate that our PIFu can handle a wide variety of clothing, including skirts, jackets, and
dresses. Our method can produce high-resolution local details, while inferring plausible 3D surfaces in
unseen regions. Complete textures are also inferred successfully from a single input image, which allows us
to view our 3D models from 360 degrees. In particular, we show how dynamic clothed human performances
and complex deformations can be digitized in 3D from a single 2D input video.
Results on Video Sequences. We also apply our approach to video sequences obtained from [196]. For
the reconstruction, video frames are center-cropped and scaled so that the size of the subjects is roughly
aligned with our training data. Note that the cropping and scaling are fixed for each sequence. Figure 4.19
demonstrates that our reconstructed results are reasonably temporally coherent even though the frames are
processed independently.
Quantitative Results We quantitatively evaluate our reconstruction accuracy with three metrics. In the
model space, we measure the average point-to-surface Euclidean distance (P2S) in cm from the vertices
on the reconstructed surface to the ground truth. We also measure the Chamfer distance between the
reconstructed and the ground truth surfaces. In addition, we introduce the normal reprojection error to
measure the fineness of reconstructed local details, as well as the projection consistency from the input
image. For both reconstructed and ground truth surfaces, we render their normal maps in the image space
from the input viewpoint respectively. We then calculate the L2 error between these two normal maps.
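Two of these metrics can be sketched as follows, using trimesh for the point-to-surface queries; the masking and averaging conventions for the normal reprojection error, and the normal-map rendering itself, are assumptions.

```python
import numpy as np
import trimesh

def point_to_surface(recon_vertices, gt_mesh):
    """Average distance (in the mesh units, cm here) from each reconstructed
    vertex to its closest point on the ground-truth surface."""
    _, distances, _ = trimesh.proximity.closest_point(gt_mesh, recon_vertices)
    return distances.mean()

def normal_reprojection_error(pred_normal_map, gt_normal_map, mask):
    """pred/gt normal maps: (H, W, 3) in [-1, 1]; mask: (H, W) foreground pixels."""
    diff = (pred_normal_map - gt_normal_map)[mask]
    return np.sqrt((diff ** 2).sum(axis=-1)).mean()
```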
Single-View Reconstruction. In Table 4.4 and Figure 4.20, we evaluate the reconstruction errors for
each method on both the BUFF and RenderPeople test sets. Note that while Voxel Regression Network (VRN)
Figure 4.19: Results on video sequences obtained from [196]. While ours uses a single view input, the
ground truth is obtained from 8 views with controlled lighting conditions.
[86], IM-GAN [33], and ours are retrained with the same High-Fidelity Clothed Human dataset we use
for our approach, the reconstructions of [138, 193] are obtained from their trained models as off-the-shelf
Figure 4.20: Comparison with other human digitization methods (VRN, IM-GAN, SiCloPe, BodyNet) from a
single image. For each input image on the left, we show the predicted surface (top row), surface normal
(middle row), and the point-to-surface errors (bottom row).
solutions. Since single-view inputs leave the scale factor ambiguous, the evaluation is performed with
the known scale factor for all the approaches. In contrast to the state-of-the-art single-view reconstruction
method using implicit functions (IM-GAN) [33], which reconstructs the surface from one global feature per image,
our method outputs pixel-aligned high-resolution surface reconstructions that capture hair styles and
wrinkles of the clothing. We also demonstrate the expressiveness of our PIFu representation compared with
voxels. Although VRN and ours share the same network architecture for the image encoder, the higher
expressiveness of the implicit representation allows us to achieve higher fidelity.
In Figure 4.21, we also compare our single-view texture inferences with a state-of-the-art texture
inference method for clothed humans, SiCloPe [138], which infers a 2D image from the back view and
stitches it together with the input front-view image to obtain textured meshes. While SiCloPe suffers from
projection distortion and artifacts around the silhouette boundary, our approach predicts textures on the
surface mesh directly, removing projection artifacts.
Multi-View Reconstruction. In Table 4.5 and Figure 4.22, we compare our multi-view reconstruction with
other deep learning-based multi-view methods, including LSM [94] and a deep visual hull method proposed
by Huang et al. [77]. All approaches are trained on the same High-Fidelity Clothed Human Dataset using
Figure 4.21: Comparison with SiCloPe [138] on texture inference. While texture inference via a view-
synthesis approach suffers from projection artifacts, the proposed approach does not, as it directly inpaints
textures on the surface geometry.
three-view input images. Note that Huang et al. can be seen as a degeneration of our method in which the
multi-view feature fusion process relies solely on image features, without explicit conditioning on the 3D
coordinate information. To evaluate the importance of conditioning on the depth, we denote our network
architecture with z removed from the input of PIFu as Huang et al. in our experiments. We demonstrate that PIFu
achieves state-of-the-art reconstruction qualitatively and quantitatively in our metrics. We also show that
our multi-view PIFu allows us to increasingly refine the geometry and texture by incorporating an arbitrary
number of views in Figure 4.23.
Comparison with Template-based Method. In Figure 4.24 and Table 4.6, we compare our approach
with a template-based method [5] that takes a dense 360-degree-view video as input, on the BUFF dataset.
Figure 4.22: Comparison with learning-based multi-view methods. Ours outperforms other learning-based
multi-view methods qualitatively and quantitatively. Note that all methods are trained with three view inputs
from the same training data.
From 3 views, we outperform the template-based method. Note that Alldieck et al. require an uncalibrated
dense video sequence, while ours requires calibrated sparse-view inputs.
Comparison with Voxel Regression Network. We provide an additional comparison with Voxel Regres-
sion Network (VRN) [86] to clarify the advantages of PIFu. Figure 4.25 demonstrates that the proposed
PIFu representation can align the 3D reconstruction with pixels at higher resolution, while VRN suffers
from misalignment due to the limited precision of its voxel representation. Additionally, the generality of
PIFu offers texturing of shapes with arbitrary topology and self-occlusion, which has not been addressed by
the work of VRN. Note that VRN is only able to project the image texture onto the recovered surface, and
does not provide an approach for texture inpainting on the unseen side.
Figure 4.23: Our surface and texture predictions increasingly improve as more views are added (1, 3, 6, and
9 views shown).
Figure 4.24: Comparison with a template-based method [5]. Note that while Alldieck et al. use a dense
video sequence without camera calibration, ours uses three calibrated views as input.
              RenderPeople                BUFF
Methods    Normal   P2S    Chamfer    Normal   P2S    Chamfer
BodyNet    0.262    5.72   5.64       0.308    4.94   4.52
SiCloPe    0.216    3.81   4.02       0.222    4.06   3.99
IM-GAN     0.258    2.87   3.14       0.337    5.11   5.32
VRN        0.116    1.42   1.56       0.130    2.33   2.48
Ours       0.084    1.52   1.50       0.0928   1.15   1.14

Table 4.4: Quantitative evaluation on the RenderPeople and BUFF datasets for single-view reconstruction.
                 RenderPeople                BUFF
Methods       Normal   P2S    Chamfer    Normal   P2S    Chamfer
LSM           0.251    4.40   3.93       0.272    3.58   3.30
Deep V-Hull   0.093    0.639  0.632      0.119    0.698  0.709
Ours          0.094    0.554  0.567      0.107    0.665  0.641

Table 4.5: Quantitative comparison between multi-view reconstruction algorithms using 3 views.
4.3.7 Evaluation
We evaluate our proposed approach on a variety of datasets, including RenderPeople [158] and BUFF [226].
Spatial Sampling. In Table 4.7 and Figure 4.26, we show the effects of different sampling methods for surface
reconstruction. The most straightforward way is to sample uniformly inside the bounding box of the target
object. Although it helps to remove artifacts caused by overfitting, the decision boundary becomes less
sharp, losing all the local details (see Figure 4.26, first column). To obtain a sharper decision boundary, we
propose to sample points around the surface with offsets following a normal distribution with standard deviation σ
from the actual surface mesh. We use σ = 3, 5, and 15 cm. The smaller σ becomes, the sharper the decision boundary is, but
the result also becomes more prone to artifacts outside the decision boundary (second column). We found that
combining adaptive sampling with σ = 5 cm and uniform sampling achieves qualitatively and quantitatively
the best results (right-most column). Note that each sampling scheme is trained with the identical setup as
our training procedure described in Sec. 4.3.5.
Network Architecture. In this section, we show comparisons of different architectures for the surface
reconstruction and provide insight into the design choices for the image encoders. One option is to use the bottleneck
                                 BUFF
Methods                      Normal   P2S     Chamfer
Alldieck et al. 18 (Video)   0.127    0.820   0.795
Ours (3 views)               0.107    0.665   0.641

Table 4.6: Quantitative comparison between a template-based method [5] using a dense video sequence and
ours using 3 views.
                       RenderPeople                BUFF
Methods             Normal   P2S    Chamfer    Normal   P2S    Chamfer
Uniform             0.119    5.07   4.23       0.132    5.98   4.53
σ = 3 cm            0.104    2.03   1.62       0.114    6.15   3.81
σ = 5 cm            0.105    1.73   1.55       0.115    1.54   1.41
σ = 15 cm           0.100    1.49   1.43       0.105    1.37   1.26
σ = 5 cm + Uniform  0.084    1.52   1.50       0.092    1.15   1.14

Table 4.7: Ablation study on the sampling strategy.
features of fully convolutional networks [89, 207, 141]. Due to its state-of-the-art performance in volumetric
regression for human faces and bodies, we choose the Stacked Hourglass network [141] with a modification
proposed by [86], denoted as HG. Another option is to aggregate features from multiple layers to obtain a
multi-scale feature embedding [9, 79]. Here we use two widely used network architectures, VGG16 [175]
and ResNet34 [70], for the comparison. We extract the features from the layers 'relu1_2', 'relu2_2',
'relu3_3', 'relu4_3', and 'relu5_3' of the VGG network using bilinear sampling based on x, resulting in 1472-
dimensional features. Similarly, we extract the features before every pooling layer in ResNet, resulting in
1024-D features. We modify the first channel size in PIFu to incorporate the feature dimensions and train
the surface reconstruction model using the Adam optimizer with a learning rate of 1 × 10^{-3}, 10,000
sampled points, and batch sizes of 8 and 4 for VGG and ResNet, respectively. Note that VGG and
ResNet are initialized with models pretrained on ImageNet [37]. The other hyper-parameters are the same
as the ones used for our sequential network based on the Stacked Hourglass.
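The multi-layer aggregation can be sketched as below; the feature maps are assumed to be collected (e.g., via forward hooks) from the listed layers, and only the bilinear sampling and concatenation are essential.

```python
import torch
import torch.nn.functional as F

def sample_multiscale_features(feature_maps, xy):
    """
    feature_maps: list of (B, C_k, H_k, W_k) activations from different layers.
    xy:           (B, N, 2) continuous 2D projections in [-1, 1].
    Returns (B, N, sum_k C_k) concatenated pixel-aligned features
    (e.g., 64 + 128 + 256 + 512 + 512 = 1472 channels for the VGG16 layers above).
    """
    grid = xy.unsqueeze(2)  # (B, N, 1, 2)
    sampled = []
    for fmap in feature_maps:
        feat = F.grid_sample(fmap, grid, mode='bilinear', align_corners=True)
        sampled.append(feat.squeeze(-1).transpose(1, 2))  # (B, N, C_k)
    return torch.cat(sampled, dim=-1)
```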
In Table 4.8 and Figure 4.27, we show comparisons of three architectures using our evaluation data.
While ResNet has slightly better performance in the same domain as the training data (i.e., the test set of the
RenderPeople dataset), we observe that the network suffers from overfitting, failing to generalize to other
Figure 4.25: Comparison with Voxel Regression Network [86]. While [86] suffers from texture projection
errors due to the limited precision of its voxel representation, our PIFu representation efficiently represents not
only the surface geometry in a pixel-aligned manner, but also the complete texture in the missing regions. Note that
[86] can only texture the visible portion of the person by projecting the foreground onto the recovered surface.
In comparison, we recover the texture of the entire surface, including the unseen regions.
domains (i.e., the BUFF and DeepFashion datasets). Thus, we adopt a sequential architecture based on the Stacked
Hourglass network as our final model.
Figure 4.26: Reconstructed geometry and point-to-surface error visualization using different sampling
methods (uniform, σ = 3, 5, and 15 cm, and uniform + σ = 5 cm).
Figure 4.27: Reconstructed geometry and point-to-surface error visualization using different architectures
for the image encoder (HG, ResNet34, VGG16).
               RenderPeople                BUFF
Methods     Normal   P2S    Chamfer    Normal   P2S    Chamfer
VGG16       0.125    3.02   2.25       0.144    4.65   3.08
ResNet34    0.097    1.49   1.43       0.099    1.68   1.50
HG          0.084    1.52   1.50       0.092    1.15   1.14

Table 4.8: Ablation study on network architectures.
4.3.8 Discussion
We introduced a novel pixel-aligned implicit function, which spatially aligns the pixel-level information of
the input image with the shape of the 3D object, for deep learning based 3D shape and texture inference of
clothed humans from a single input image. Our experiments indicate that highly plausible geometry can be
inferred including largely unseen regions such as the back of a person, while preserving high-frequency
details present in the image. Unlike voxel-based representations, our method can produce high-resolution
output since we are not limited by the high memory requirements of volumetric representations. Furthermore,
we also demonstrate how this method can be naturally extended to infer the entire texture on a person given
partial observations. Unlike existing methods, which synthesize the back regions based on frontal views in
an image space, our approach can predict colors in unseen, concave and side regions directly on the surface.
In particular, our method is the first approach that can inpaint textures for shapes of arbitrary topology.
Since we are capable of generating textured 3D surfaces of a clothed person from a single RGB camera, we
are moving a step closer toward monocular reconstructions of dynamic scenes from video without the need
of a template model. Our ability to handle arbitrary additional views also makes our approach particularly
suitable for practical and efficient 3D modeling settings using sparse views, where traditional multi-view
stereo or structure-from-motion would fail.
Chapter 5
Conclusion and Future Directions
5.1 Summary
This dissertation demonstrates the importance of using appropriate data representations to fully leverage
high-capacity deep neural networks for human digitization. We present methods and algorithms that infer
the shape and appearance of the face, hair, and clothed human body from minimal inputs, typically a single
image.
In Chapter 2, we introduce a framework to reconstruct physically plausible facial reflectance and
geometry from a single image. Due to the anatomically constrained shape variations, coarse-level geometry
can be successfully modeled by a single template mesh with a linear deformation subspace. Furthermore,
a single template offers a 2D shared parameterization that maps the surface of faces to the 2D texture
space, allowing for stable high-resolution synthesis in a canonical 2D image domain. As a result, we can
effectively reconstruct fine-grained details in geometry and appearance by leveraging recently introduced
image-to-image translation techniques [84] to decouple intrinsic attributes (i.e., diffuse/specular albedo,
displacements) from unconstrained facial images. To obtain complete texture maps, we also present
a novel inpainting method that leverages the symmetric nature of facial texture in a canonical space.
In our experiments, we demonstrate the efficacy of the proposed approach and present highly realistic
reconstructions.
Our second contribution is an algorithm to infer 3D hair from a single image without requiring an
explicit hair database or hand-crafted feature descriptors (e.g., hair segmentation and orientation maps).
We propose a direct regression framework that takes as input a single image and directly predicts 3D hair
without intermediate steps. The core insight is that while the irregular structure of hair strands is challenging
to incorporate into a deep learning framework, regular volumetric structures such as voxels are effectively
processed with high-capacity 3D convolutional neural networks. Thus, we convert hair strands into a
regular volumetric representation consisting of 3D orientation and occupancy fields. Applying volumetric
variational autoencoders, 3D hair styles are effectively embedded into a low-dimensional subspace, allowing
us to directly regress hair coefficients from input images. We demonstrate that eliminating the need for hand-
crafted features and intermediate steps significantly improves not only robustness but also generalization to
unseen domains, such as highly stylized portraits.
Third, we propose learning algorithms to achieve high-resolution reconstruction of clothed human
bodies that exhibit extremely large variations in both shapes and colors. To achieve this, we first introduce
an implicit shape representation by inferring novel-view silhouettes from inferred 3D joints and a silhouette
from the input view. Since a set of 2D silhouettes is more light-weight than explicitly modeling 3D
occupancy using voxels, the proposed approach demonstrates more expressive reconstructions together
with efficient training. Finally, we significantly improve the fidelity of reconstruction by introducing
Pixel-Aligned Implicit Function (PIFu), where continuous scalar/vector fields in 3D space are inferred
from pixel-aligned image features without discretization. We further demonstrate that PIFu is a general
framework, and is easily extended to topology-agnostic texture inference, as well as multi-view stereo in a
unified manner.
5.2 Open Questions and Future Directions
In this section, we discuss open questions and the potential research directions for the future work.
Figure 5.1: While explicit shape representations (voxels, point clouds, meshes) may suffer from poor visual quality due to limited
resolutions or failure to handle arbitrary topologies (a), implicit surfaces handle arbitrary topologies with high resolutions in a memory
efficient manner (b). However, in contrast to the explicit representations, it is not feasible to directly project an implicit field onto a 2D
domain via perspective transformation. Thus, we introduce a field probing approach based on efficient ray sampling that enables
unsupervised learning of implicit surfaces from image-based supervision.
Learning 3D by 2D Supervision. In this thesis, we focus on supervised learning algorithms with effective
data representations by leveraging high-fidelity ground truth data. However, its modeling capabilities are
constrained by the quantity and variations of available 3D datasets. In contrast, far more 2D photographs
are being taken and shared over the Internet. A natural extension is to incorporate these large-scale
image datasets into a learning framework to further improve the robustness and fidelity of reconstructions.
To this end, various differentiable rendering techniques have been recently proposed for different shape
representations including mesh [96, 118], point cloud [83], and voxel [220]. While the aforementioned
explicit shape representations can be efficiently rendered by projection, implicit surface representations are
non-trivial to render due to the expensive evaluation process in space (see Fig 5.1). To this end, we propose
an efficient rendering framework of implicit functions using importance sampling techniques. While
our preliminary work has shown promising results [119], the proposed approach is currently limited to
silhouettes. Supporting other attributes (e.g., color, shading) for image-based supervision can be addressed
by future work.
Additionally, since learning 3D from 2D inputs is an ill-posed problem, ad-hoc shape priors such as
Laplacian regularization [96] are employed to generate visually pleasing reconstructions. Another potential
research direction is to jointly learn shape priors from observations rather than relying on hand-crafted
regularization.
Dynamic Reconstruction from Monocular Inputs. In Chapter 4, we demonstrate that highly accurate
3D reconstruction is possible even from a monocular RGB input. While per-frame reconstruction is fairly
temporally coherent, it still exhibits artifacts such as jittering and lack of occluded body parts. Recently
Niemeyer et al. [143] extend implicit functions to the temporal domain. The proposed PIFu is pixel-aligned,
allowing us to effectively associate image features to predictions unlike [143], which use global image
features for encoding. One potential direction is to extend PIFu to the temporal domain and associate with
motion information in the image space (i.e., optical flow). Aside from improving stability of reconstruction,
this would open a new venue for 3D performance captures of objects that could not be captured with
traditional depth sensing or motion capture techniques (e.g., wild animals, fishes, hair, garments in motion).
Controllability In Chapter 4, we show that non-parametric shape representations are more suitable for
learning large deformations including topology change. However, one drawback of such a non-parametric
representation is that the resulting outputs do not provide explicit correspondences, and lack semantically
meaningful low-dimensional embeddings for control, unlike the parametric face models in Chapter 2. In
many graphics applications, controllability is a critical property for authoring reconstructions to create a new
content. More specifically, embedding skeletal control (i.e., rigging) is widely used to create animations
with 3D avatars. However, since traditional skinning techniques cannot handle topology change, raw
reconstructions from our algorithm may not be directly applicable to these applications. Thus, jointly
learning fine-grained control by disentangling pose and identity-specific shape information is one interesting
direction to pursue.
Reference List
[1] Miika Aittala, Timo Aila, and Jaakko Lehtinen. Reflectance modeling by neural texture synthesis.
ACM Trans. Graph., 35(4):65, 2016.
[2] Oleg Alexander, Mike Rogers, William Lambeth, Matt Chiang, and Paul Debevec. The digital emily
project: Photoreal facial modeling and animation. In ACM SIGGRAPH 2009 Courses, SIGGRAPH
’09, pages 12:1–12:15, New York, NY , USA, 2009. ACM.
[3] Thiemo Alldieck, Marcus Magnor, Bharat Lal Bhatnagar, Christian Theobalt, and Gerard Pons-Moll.
Learning to reconstruct people in clothing from a single RGB camera. In IEEE Conference on
Computer Vision and Pattern Recognition, pages 1175–1186, 2019.
[4] Thiemo Alldieck, Marcus Magnor, Weipeng Xu, Christian Theobalt, and Gerard Pons-Moll. Detailed
human avatars from monocular video. In International Conference on 3D Vision, pages 98–109,
2018.
[5] Thiemo Alldieck, Marcus A Magnor, Weipeng Xu, Christian Theobalt, and Gerard Pons-Moll. Video
based reconstruction of 3d people models. In IEEE Conference on Computer Vision and Pattern
Recognition, pages 8387–8397, 2018.
[6] Dragomir Anguelov, Praveen Srinivasan, Daphne Koller, Sebastian Thrun, Jim Rodgers, and James
Davis. SCAPE: shape completion and animation of people. ACM Transactions on Graphics,
24(3):408–416, 2005.
[7] Guha Balakrishnan, Amy Zhao, Adrian V Dalca, Fredo Durand, and John Guttag. Synthesizing
images of humans in unseen poses. In IEEE Conference on Computer Vision and Pattern Recognition,
pages 8340–8348, 2018.
[8] Alexandru O Balan, Leonid Sigal, Michael J Black, James E Davis, and Horst W Haussecker.
Detailed human shape and pose from images. In IEEE Conference on Computer Vision and Pattern
Recognition, pages 1–8, 2007.
[9] Aayush Bansal, Xinlei Chen, Bryan Russell, Abhinav Gupta, and Deva Ramanan. Pixelnet: Repre-
sentation of the pixels, by the pixels, and for the pixels. arXiv:1702.06506, 2017.
[10] Gavin Barill, Neil Dickson, Ryan Schmidt, David I.W. Levin, and Alec Jacobson. Fast winding
numbers for soups and clouds. ACM Transactions on Graphics, 37(4):43, 2018.
[11] Jonathan T Barron and Jitendra Malik. Shape, illumination, and reflectance from shading. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 37(8):1670–1687, 2015.
[12] Thabo Beeler, Bernd Bickel, Paul Beardsley, Bob Sumner, and Markus Gross. High-quality single-
shot capture of facial geometry. In ACM Transactions on Graphics, volume 29, pages 40:1–40:9,
2010.
[13] Thabo Beeler, Bernd Bickel, Gioacchino Noris, Paul Beardsley, Steve Marschner, Robert W. Sumner,
and Markus Gross. Coupled 3d reconstruction of sparse facial hair and skin. ACM Trans. Graph.,
31(4):117:1–117:10, 2012.
[14] Thabo Beeler, Fabian Hahn, Derek Bradley, Bernd Bickel, Paul Beardsley, Craig Gotsman, Robert W
Sumner, and Markus Gross. High-quality passive facial performance capture using anchor frames.
In ACM Trans. Graph., volume 30, page 75. ACM, 2011.
[15] Yoshua Bengio et al. Learning deep architectures for AI. Foundations and Trends in Machine
Learning, 2(1):1–127, 2009.
[16] Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3d faces. In SIGGRAPH
’99, pages 187–194, 1999.
[17] Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J
Black. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In
European Conference on Computer Vision, pages 561–578, 2016.
[18] James Booth, Anastasios Roussos, Stefanos Zafeiriou, Allan Ponniah, and David Dunaway. A 3d
morphable model learnt from 10,000 faces. In Proc. CVPR, pages 5543–5552, 2016.
[19] Derek Bradley, Thabo Beeler, Kenny Mitchell, et al. Real-time multi-view facial capture with
synthetic training. In Computer Graphics Forum, volume 36, pages 325–336. Wiley Online Library,
2017.
[20] Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. Generative and discriminative
voxel modeling with convolutional neural networks. In 3D Deep Learning Workshop, Advances in
neural information processing systems (NIPS), pages 1–9, 2016.
[21] Adrian Bulat and Georgios Tzimiropoulos. How far are we from solving the 2d & 3d face alignment
problem?(and a dataset of 230,000 3d facial landmarks). In IEEE International Conference on
Computer Vision, pages 1021–1030, 2017.
[22] Chen Cao, Derek Bradley, Kun Zhou, and Thabo Beeler. Real-time high-fidelity facial performance
capture. ACM Trans. Graph., 34(4):46, 2015.
[23] Chen Cao, Qiming Hou, and Kun Zhou. Displaced dynamic expression regression for real-time
facial tracking and animation. ACM Trans. Graph., 33(4):43, 2014.
[24] Chen Cao, Hongzhi Wu, Yanlin Weng, Tianjia Shao, and Kun Zhou. Real-time facial animation with
image-based dynamic avatars. ACM Trans. Graph., 35(4):126, 2016.
[25] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2D pose estimation
using part affinity fields. In IEEE Conference on Computer Vision and Pattern Recognition, pages
7291–7299, 2017.
[26] Joao Carreira, Pulkit Agrawal, Katerina Fragkiadaki, and Jitendra Malik. Human pose estimation
with iterative error feedback. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 4733–4742, 2016.
[27] Edwin Catmull. A subdivision algorithm for computer display of curved surfaces. Technical report,
UTAH UNIV SALT LAKE CITY SCHOOL OF COMPUTING, 1974.
[28] Menglei Chai, Linjie Luo, Kalyan Sunkavalli, Nathan Carr, Sunil Hadap, and Kun Zhou. High-quality
hair modeling from a single portrait photo. ACM Trans. Graph., 34(6):204:1–204:10, 2015.
[29] Menglei Chai, Tianjia Shao, Hongzhi Wu, Yanlin Weng, and Kun Zhou. Autohair: Fully automatic
hair modeling from a single image. ACM Trans. Graph., 35(4):116:1–116:12, 2016.
[30] Menglei Chai, Lvdi Wang, Yanlin Weng, Xiaogang Jin, and Kun Zhou. Dynamic hair manipulation
in images and videos. ACM Trans. Graph., 32(4):75, 2013.
[31] Menglei Chai, Lvdi Wang, Yanlin Weng, Yizhou Yu, Baining Guo, and Kun Zhou. Single-view hair
modeling for portrait manipulation. ACM Trans. Graph., 31(4):116:1–116:8, 2012.
[32] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-
decoder with atrous separable convolution for semantic image segmentation. In European Conference
on Computer Vision, pages 801–818, 2018.
[33] Zhiqin Chen and Hao Zhang. Learning implicit fields for generative shape modeling. In IEEE
Conference on Computer Vision and Pattern Recognition, pages 5939–5948, 2019.
[34] Byoungwon Choe and Hyeong-Seok Ko. A statistical wisp model and pseudophysical approaches
for interactive hairstyle generation. IEEE Transactions on Visualization and Computer Graphics,
11(2):160–170, 2005.
[35] Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3d-r2n2: A
unified approach for single and multi-view 3d object reconstruction. In European Conference on
Computer Vision, pages 628–644, 2016.
[36] Paul Debevec, Tim Hawkins, Chris Tchou, Haarm-Pieter Duiker, and Westley Sarokin. Acquiring
the Reflectance Field of a Human Face. In Proc. SIGGRAPH, 2000.
[37] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale
hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition,
pages 248–255, 2009.
[38] R. Donner, M. Reiter, G. Langs, P. Peloschek, and H. Bischof. Fast active appearance model
search using canonical correlation analysis. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 28(10):1690–1694, 2006.
[39] Chi Nhan Duong, Khoa Luu, Kha Gia Quach, and Tien D Bui. Beyond principal components: Deep
boltzmann machines for face modeling. In Proc. CVPR, pages 4786–4794, 2015.
[40] Jose I Echevarria, Derek Bradley, Diego Gutierrez, and Thabo Beeler. Capturing and stylizing hair
for 3d fabrication. ACM Trans. Graph., 33(4):125, 2014.
[41] G. J. Edwards, C. J. Taylor, and T. F. Cootes. Interpreting face images using active appearance
models. In Proceedings of the 3rd. International Conference on Face and Gesture Recognition, FG
’98, pages 300–. IEEE Computer Society, 1998.
[42] Alexei A. Efros and William T. Freeman. Image quilting for texture synthesis and transfer. In
Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques,
SIGGRAPH ’01, pages 341–346. ACM, 2001.
[43] Alexei A. Efros and Thomas K. Leung. Texture synthesis by non-parametric sampling. In IEEE
ICCV, pages 1033–, 1999.
[44] Patrick Esser, Ekaterina Sutter, and Björn Ommer. A variational u-net for conditional appearance
and shape generation. In IEEE Conference on Computer Vision and Pattern Recognition, pages
8857–8866, 2018.
[45] Carlos Hernández Esteban and Francis Schmitt. Silhouette and stereo fusion for 3d object modeling.
Computer Vision and Image Understanding, 96(3):367–392, 2004.
[46] Hongbo Fu, Yichen Wei, Chiew-Lan Tai, and Long Quan. Sketching hairstyles. In Proceedings of
the 4th Eurographics Workshop on Sketch-based Interfaces and Modeling, pages 31–36, 2007.
[47] Yasutaka Furukawa and Jean Ponce. Carved visual hulls for image-based modeling. In European
Conference on Computer Vision, pages 564–577, 2006.
[48] Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multiview stereopsis. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 32(8):1362–1376, 2010.
[49] Graham Fyffe, Andrew Jones, Oleg Alexander, Ryosuke Ichikari, and Paul Debevec. Driving
high-resolution facial scans with video performance capture. ACM Trans. Graph., 34(1):8, 2014.
[50] Graham Fyffe, Koki Nagano, Loc Huynh, Shunsuke Saito, Jay Busch, Andrew Jones, Hao Li,
and Paul Debevec. Multi-view stereo on consistent face topology. In Computer Graphics Forum,
volume 36, pages 295–309. Wiley Online Library, 2017.
[51] Juergen Gall, Carsten Stoll, Edilson De Aguiar, Christian Theobalt, Bodo Rosenhahn, and Hans-Peter
Seidel. Motion capture using joint skeleton tracking and surface estimation. In IEEE Conference on
Computer Vision and Pattern Recognition, pages 1746–1753, 2009.
[52] Pablo Garrido, Levi Valgaerts, Chenglei Wu, and Christian Theobalt. Reconstructing detailed
dynamic face geometry from monocular video. In ACM Trans. Graph., volume 32, pages 158:1–
158:10, November 2013.
[53] Leon A. Gatys, Matthias Bethge, Aaron Hertzmann, and Eli Shechtman. Preserving color in neural
artistic style transfer. CoRR, abs/1606.05897, 2016.
[54] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Texture synthesis and the controlled
generation of natural stimuli using convolutional neural networks. CoRR, abs/1505.07376, 2015.
[55] Abhijeet Ghosh, Graham Fyffe, Borom Tunwattanapong, Jay Busch, Xueming Yu, and Paul De-
bevec. Multiview face capture using polarized spherical gradient illumination. ACM Trans. Graph.,
30(6):129:1–129:10, 2011.
[56] Abhijeet Ghosh, Tim Hawkins, Pieter Peers, Sune Frederiksen, and Paul Debevec. Practical modeling
and acquisition of layered facial reflectance. In ACM Trans. Graph., volume 27, page 139. ACM,
2008.
[57] Andrew Gilbert, Marco Volino, John Collomosse, and Adrian Hilton. Volumetric performance
capture from minimal camera viewpoints. In European Conference on Computer Vision, pages
566–581, 2018.
[58] Mashhuda Glencross, Gregory J Ward, Francho Melendez, Caroline Jay, Jun Liu, and Roger Hubbold.
A perceptually validated model for surface depth hallucination. ACM Trans. Graph., 27(3):59, 2008.
[59] Aleksey Golovinskiy, Wojciech Matusik, Hanspeter Pfister, Szymon Rusinkiewicz, and Thomas
Funkhouser. A statistical model for synthesis of detailed facial geometry. ACM Trans. Graph.,
25(3):1025–1034, 2006.
[60] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair,
Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling,
C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing
Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
[61] Paulo FU Gotardo, Tomas Simon, Yaser Sheikh, and Iain Matthews. Photogeometric scene flow for
high-detail dynamic 3d reconstruction. In Proceedings of the IEEE International Conference on
Computer Vision, pages 846–854, 2015.
[62] Paul Graham, Borom Tunwattanapong, Jay Busch, Xueming Yu, Andrew Jones, Paul Debevec, and
Abhijeet Ghosh. Measurement-based synthesis of facial microgeometry. In Computer Graphics
Forum, volume 32, pages 335–344. Wiley Online Library, 2013.
[63] Thibault Groueix, Matthew Fisher, Vladimir G Kim, Bryan C Russell, and Mathieu Aubry. Atlasnet:
A papier-mâché approach to learning 3d surface generation. In IEEE Conference on Computer Vision
and Pattern Recognition, 2018.
[64] Peng Guan, Alexander Weiss, Alexandru O Balan, and Michael J Black. Estimating human shape and
pose from a single image. In IEEE International Conference on Computer Vision, pages 1381–1388,
2009.
[65] Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. Densepose: Dense human pose estimation
in the wild. In IEEE Conference on Computer Vision and Pattern Recognition, pages 7297–7306,
2018.
[66] Jianwei Han, Kun Zhou, Li-Yi Wei, Minmin Gong, Hujun Bao, Xinming Zhang, and Baining
Guo. Fast example-based surface texture synthesis via discrete optimization. The Visual Computer,
22(9-11):918–925, 2006.
[67] Christian Häne, Shubham Tulsiani, and Jitendra Malik. Hierarchical surface prediction for 3d object
reconstruction. arXiv preprint arXiv:1704.00710, 2017.
[68] Antonio Haro, Brian Guenter, and Irfan Essa. Real-time, Photo-realistic, Physically Based
Rendering of Fine Scale Human Skin Structure. In S. J. Gortler and K. Myszkowski, editors,
Eurographics Workshop on Rendering, 2001.
[69] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the
IEEE international conference on computer vision, pages 2961–2969, 2017.
[70] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778,
2016.
[71] Tomas Lay Herrera, Arno Zinke, and Andreas Weber. Lighting hair from the inside: A thermal
approach to hair reconstruction. ACM Trans. Graph., 31(6):146:1–146:9, 2012.
[72] Berthold KP Horn. Shape from shading: A method for obtaining the shape of a smooth opaque
object from one view. 1970.
[73] Liwen Hu, Chongyang Ma, Linjie Luo, and Hao Li. Robust hair capture using simulated examples.
ACM Trans. Graph., 33(4):126:1–126:10, 2014.
[74] Liwen Hu, Chongyang Ma, Linjie Luo, and Hao Li. Single-view hair modeling using a hairstyle
database. ACM Trans. Graph., 34(4):125:1–125:9, 2015.
[75] Liwen Hu, Chongyang Ma, Linjie Luo, Li-Yi Wei, and Hao Li. Capturing braided hairstyles. ACM
Trans. Graph., 33(6):225:1–225:9, 2014.
[76] Liwen Hu, Shunsuke Saito, Lingyu Wei, Koki Nagano, Jaewoo Seo, Jens Fursund, Iman Sadeghi,
Carrie Sun, Yen-Chun Chen, and Hao Li. Avatar digitization from a single image for real-time
rendering. ACM Trans. Graph., 36(6), 2017.
[77] Haibin Huang, Evangelos Kalogerakis, Siddhartha Chaudhuri, Duygu Ceylan, Vladimir G Kim,
and Ersin Yumer. Learning local shape descriptors from part correspondences with multiview
convolutional networks. ACM Transactions on Graphics, 37(1):6, 2018.
[78] Yinghao Huang, Federica Bogo, Christoph Lassner, Angjoo Kanazawa, Peter V Gehler, Javier
Romero, Ijaz Akhter, and Michael J Black. Towards accurate marker-less human shape and pose
estimation over time. In International Conference on 3D Vision, pages 421–430, 2017.
[79] Zeng Huang, Tianye Li, Weikai Chen, Yajie Zhao, Jun Xing, Chloe LeGendre, Linjie Luo, Chongyang
Ma, and Hao Li. Deep volumetric video from very sparse multi-view performance capture. In
European Conference on Computer Vision, pages 336–354, 2018.
[80] Loc Huynh, Weikai Chen, Shunsuke Saito, Jun Xing, Koki Nagano, Andrew Jones, Paul Debevec,
and Hao Li. Mesoscopic facial geometry inference using deep neural networks. In IEEE Conference
on Computer Vision and Pattern Recognition, pages 8407–8416, 2018.
[81] Alexandru Eugen Ichim, Sofien Bouaziz, and Mark Pauly. Dynamic 3d avatar creation from hand-
held video input. ACM Trans. Graph., 34(4):45:1–45:14, 2015.
[82] Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. Globally and Locally Consistent Image
Completion. ACM Trans. Graph., 36(4):107:1–107:14, 2017.
[83] Eldar Insafutdinov and Alexey Dosovitskiy. Unsupervised learning of shape and pose with differ-
entiable point clouds. In Advances in Neural Information Processing Systems, pages 2802–2812,
2018.
[84] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with
conditional adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition,
pages 1125–1134, 2017.
[85] Aaron S Jackson, Adrian Bulat, Vasileios Argyriou, and Georgios Tzimiropoulos. Large pose 3d
face reconstruction from a single image via direct volumetric cnn regression. In Proceedings of
International Conference on Computer Vision, pages 1031–1039, 2017.
[86] Aaron S Jackson, Chris Manafas, and Georgios Tzimiropoulos. 3D Human Body Reconstruction
from a Single Image via Volumetric Regression. In ECCV Workshop Proceedings, PeopleCap 2018,
pages 0–0, 2018.
[87] Wenzel Jakob, Jonathan T Moon, and Steve Marschner. Capturing hair assemblies fiber by fiber.
ACM Trans. Graph., 28(5):164:1–164:9, 2009.
[88] Mengqi Ji, Juergen Gall, Haitian Zheng, Yebin Liu, and Lu Fang. Surfacenet: An end-to-end 3d
neural network for multiview stereopsis. In IEEE Conference on Computer Vision and Pattern
Recognition, pages 2307–2315, 2017.
[89] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and
super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
[90] Micah K. Johnson, Forrester Cole, Alvin Raj, and Edward H. Adelson. Microgeometry capture using
an elastomeric sensor. ACM Trans. Graph, 30(4):46:1–46:8, 2011.
[91] James T Kajiya. The rendering equation. In ACM SIGGRAPH computer graphics, volume 20, pages
143–150. ACM, 1986.
[92] Angjoo Kanazawa, Michael J. Black, David W. Jacobs, and Jitendra Malik. End-to-end recovery of
human shape and pose. In IEEE Conference on Computer Vision and Pattern Recognition, pages
7122–7131, 2018.
[93] Angjoo Kanazawa, Shubham Tulsiani, Alexei A. Efros, and Jitendra Malik. Learning category-
specific mesh reconstruction from image collections. In European Conference on Computer Vision,
pages 371–386, 2018.
[94] Abhishek Kar, Christian Häne, and Jitendra Malik. Learning a multi-view stereo machine. In
Advances in Neural Information Processing Systems, pages 364–375, 2017.
[95] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for
improved quality, stability, and variation. CoRR, abs/1710.10196, 2017.
[96] Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. Neural 3d mesh renderer. In IEEE Conference
on Computer Vision and Pattern Recognition, pages 3907–3916, 2018.
[97] Ira Kemelmacher-Shlizerman. Internet-based morphable model. IEEE ICCV, 2013.
[98] Ira Kemelmacher-Shlizerman and Ronen Basri. 3d face reconstruction from a single image using
a single reference face shape. IEEE Transactions on Pattern Analysis and Machine Intelligence,
33(2):394–405, 2011.
[99] Ira Kemelmacher-Shlizerman and Steven M Seitz. Face reconstruction in the wild. In IEEE ICCV,
pages 1746–1753. IEEE, 2011.
[100] Hyeongwoo Kim, Michael Zollhöfer, Ayush Tewari, Justus Thies, Christian Richardt, and Christian
Theobalt. InverseFaceNet: Deep monocular inverse face rendering. In Proc. CVPR, June 2018.
[101] Tae-Yong Kim and Ulrich Neumann. Interactive multiresolution hair modeling and editing. ACM
Trans. Graph., 21(3):620–629, 2002.
[102] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of
International Conference on Learning Representations (ICLR), 2015.
[103] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In Proceedings of Interna-
tional Conference on Learning Representations (ICLR), 2014.
[104] Tejas D Kulkarni, William F. Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional
inverse graphics network. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett,
editors, Advances in Neural Information Processing Systems 28, pages 2539–2547. Curran Associates,
Inc., 2015.
[105] Vivek Kwatra, Irfan Essa, Aaron Bobick, and Nipun Kwatra. Texture optimization for example-based
synthesis. ACM Trans. Graph., 24(3):795–802, 2005.
[106] Vivek Kwatra, Arno Schödl, Irfan Essa, Greg Turk, and Aaron Bobick. Graphcut textures: Image
and video synthesis using graph cuts. In Proc. SIGGRAPH, SIGGRAPH ’03, pages 277–286. ACM,
2003.
[107] Michael S Langer and Steven W Zucker. Shape-from-shading on a cloudy day. JOSA A, 11(2):467–
478, 1994.
[108] Anass Lasram and Sylvain Lefebvre. Parallel patch-based texture synthesis. In Proceedings of the
Fourth ACM SIGGRAPH/Eurographics conference on High-Performance Graphics, pages 115–124.
Eurographics Association, 2012.
[109] Christoph Lassner, Javier Romero, Martin Kiefel, Federica Bogo, Michael J Black, and Peter V
Gehler. Unite the people: Closing the loop between 3d and 2d human representations. In IEEE
Conference on Computer Vision and Pattern Recognition, pages 6050–6059, 2017.
[110] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro
Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single
image super-resolution using a generative adversarial network. arXiv preprint arXiv:1609.04802,
2016.
[111] Sylvain Lefebvre and Hugues Hoppe. Appearance-space texture synthesis. ACM Trans. Graph.,
25(3):541–548, 2006.
[112] Chen Li, Kun Zhou, and Stephen Lin. Intrinsic face image decomposition with human face priors.
In ECCV (5)’14, pages 218–233, 2014.
[113] Hao Li, Laura Trutoiu, Kyle Olszewski, Lingyu Wei, Tristan Trutna, Pei-Lun Hsieh, Aaron Nicholls,
and Chongyang Ma. Facial performance sensing head-mounted display. ACM Trans. Graph.,
34(4):47:1–47:9, 2015.
[114] Hao Li, Etienne Vouga, Anton Gudym, Linjie Luo, Jonathan T. Barron, and Gleb Gusev. 3D
self-portraits. ACM Transactions on Graphics, 32(6):187:1–187:9, 2013.
[115] Yijun Li, Sifei Liu, Jimei Yang, and Ming-Hsuan Yang. Generative face completion. In Proc. CVPR,
2017.
[116] Ce Liu, Heung-Yeung Shum, and William T. Freeman. Face hallucination: Theory and practice. Int.
J. Comput. Vision, 75(1):115–134, 2007.
[117] Feng Liu, Dan Zeng, Jing Li, and Qi-jun Zhao. On 3d face reconstruction via cascaded regression
in shape space. Frontiers of Information Technology & Electronic Engineering, 18(12):1978–1990,
2017.
[118] Shichen Liu, Tianye Li, Weikai Chen, and Hao Li. Soft rasterizer: A differentiable renderer for
image-based 3d reasoning. In IEEE International Conference on Computer Vision (ICCV), 2019.
[119] Shichen Liu, Shunsuke Saito, Weikai Chen, and Hao Li. Learning to infer implicit surfaces without
3d supervision. In Advances in Neural Information Processing Systems 32, pages 8293–8304. Curran
Associates, Inc., 2019.
[120] Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. Deepfashion: Powering robust
clothes recognition and retrieval with rich annotations. In IEEE Conference on Computer Vision and
Pattern Recognition, pages 1096–1104, 2016.
[121] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild.
In IEEE ICCV, 2015.
[122] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. SMPL:
A skinned multi-person linear model. ACM Transactions on Graphics, 34(6):248, 2015.
[123] William E Lorensen and Harvey E Cline. Marching cubes: A high resolution 3d surface construction
algorithm. In ACM siggraph computer graphics, volume 21, pages 163–169. ACM, 1987.
[124] Linjie Luo, Hao Li, and Szymon Rusinkiewicz. Structure-aware hair capture. ACM Trans. Graph.,
32(4):76:1–76:12, 2013.
[125] Debbie S. Ma, Joshua Correll, and Bernd Wittenbrink. The chicago face database: A free stimulus
set of faces and norming data. Behavior Research Methods, 47(4):1122–1135, 2015.
[126] Liqian Ma, Xu Jia, Qianru Sun, Bernt Schiele, Tinne Tuytelaars, and Luc Van Gool. Pose guided
person image generation. In Advances in Neural Information Processing Systems, pages 406–416,
2017.
[127] Liqian Ma, Qianru Sun, Stamatios Georgoulis, Luc Van Gool, Bernt Schiele, and Mario Fritz.
Disentangled person image generation. In IEEE Conference on Computer Vision and Pattern
Recognition, pages 99–108, 2018.
[128] Wan-Chun Ma, Tim Hawkins, Pieter Peers, Charles-Felix Chabert, Malte Weiss, and Paul De-
bevec. Rapid Acquisition of Specular and Diffuse Normal Maps from Polarized Spherical Gradient
Illumination. In Eurographics Symposium on Rendering, 2007.
[129] Wan-Chun Ma, Andrew Jones, Jen-Yuan Chiang, Tim Hawkins, Sune Frederiksen, Pieter Peers,
Marko Vukovic, Ming Ouhyoung, and Paul Debevec. Facial performance synthesis using
deformation-driven polynomial displacement maps. In Proc. SIGGRAPH, pages 121:1–121:10.
ACM, 2008.
[130] Iain Matthews and Simon Baker. Active appearance models revisited. Int. J. Comput. Vision,
60(2):135–164, 2004.
[131] D. Maturana and S. Scherer. Voxnet: A 3d convolutional neural network for real-time object
recognition. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages
922–928, 2015.
[132] Wojciech Matusik, Chris Buehler, Ramesh Raskar, Steven J Gortler, and Leonard McMillan. Image-
based visual hulls. In ACM SIGGRAPH, pages 369–374, 2000.
[133] Steven McDonagh, Martin Klaudiny, Derek Bradley, Thabo Beeler, Iain Matthews, and Kenny
Mitchell. Synthetic prior design for real-time face tracking. In 3D Vision (3DV), 2016 Fourth
International Conference on, pages 639–648. IEEE, 2016.
[134] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Oc-
cupancy networks: Learning 3d reconstruction in function space. arXiv preprint arXiv:1812.03828,
2018.
[135] Umar Mohammed, Simon J. D. Prince, and Jan Kautz. Visio-lization: Generating novel facial images.
In ACM Trans. Graph., pages 57:1–57:8. ACM, 2009.
[136] Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele. 2d human pose estimation:
New benchmark and state of the art analysis. In IEEE Conference on Computer Vision and Pattern
Recognition, pages 3686–3693, 2014.
[137] Koki Nagano, Graham Fyffe, Oleg Alexander, Jernej Barbič, Hao Li, Abhijeet Ghosh, and Paul
Debevec. Skin microstructure deformation with displacement map convolution. ACM Trans. Graph.,
34(4), 2015.
[138] Ryota Natsume, Shunsuke Saito, Zeng Huang, Weikai Chen, Chongyang Ma, Hao Li, and Shigeo
Morishima. Siclope: Silhouette-based clothed people. In IEEE Conference on Computer Vision and
Pattern Recognition, pages 4480–4490, 2019.
[139] Natalia Neverova, Riza Alp Guler, and Iasonas Kokkinos. Dense pose transfer. In European
Conference on Computer Vision, pages 123–138, 2018.
[140] Richard A. Newcombe, Dieter Fox, and Steven M. Seitz. DynamicFusion: Reconstruction and
tracking of non-rigid scenes in real-time. In IEEE Conference on Computer Vision and Pattern
Recognition, pages 343–352, 2015.
[141] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation.
In European Conference on Computer Vision, pages 483–499, 2016.
[142] Chi Nhan Duong, Khoa Luu, Kha Gia Quach, and Tien D Bui. Beyond principal components: Deep
boltzmann machines for face modeling. In Proc. CVPR, pages 4786–4794, 2015.
[143] Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. Occupancy flow: 4d
reconstruction by learning particle dynamics. In International Conference on Computer Vision,
October 2019.
[144] Kyle Olszewski, Zimo Li, Chao Yang, Yi Zhou, Ronald Yu, Zeng Huang, Sitao Xiang, Shunsuke
Saito, Pushmeet Kohli, and Hao Li. Realistic dynamic facial textures from a single image using gans.
In IEEE ICCV, Oct 2017.
[145] Kyle Olszewski, Joseph J. Lim, Shunsuke Saito, and Hao Li. High-fidelity facial and speech
animation for vr hmds. ACM Trans. Graph., 35(6):221:1–221:14, 2016.
[146] Mohamed Omran, Christoph Lassner, Gerard Pons-Moll, Peter V. Gehler, and Bernt Schiele. Neural
body fitting: Unifying deep learning and model-based human pose and shape estimation. In
International Conference on 3D Vision, pages 484–494, 2018.
[147] Sylvain Paris, Hector M Briceño, and François X Sillion. Capture of hair geometry from multiple
images. ACM Trans. Graph., 23(3):712–719, 2004.
[148] Sylvain Paris, Will Chang, Oleg I Kozhushnyan, Wojciech Jarosz, Wojciech Matusik, Matthias
Zwicker, and Frédo Durand. Hair photobooth: geometric and photometric acquisition of real
hairstyles. ACM Trans. Graph., 27(3):30:1–30:9, 2008.
[149] Eunbyung Park, Jimei Yang, Ersin Yumer, Duygu Ceylan, and Alexander C. Berg. Transformation-
grounded image generation network for novel 3d view synthesis. In IEEE Conference on Computer
Vision and Pattern Recognition, pages 3500–3509, 2017.
[150] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove.
Deepsdf: Learning continuous signed distance functions for shape representation. arXiv preprint
arXiv:1901.05103, 2019.
[151] Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context
encoders: Feature learning by inpainting. In IEEE Conference on Computer Vision and Pattern
Recognition, pages 2536–2544, 2016.
[152] Georgios Pavlakos, Luyang Zhu, Xiaowei Zhou, and Kostas Daniilidis. Learning to estimate 3D
human pose and shape from a single color image. In IEEE Conference on Computer Vision and
Pattern Recognition, pages 459–468, 2018.
[153] Gerard Pons-Moll, Sergi Pujades, Sonny Hu, and Michael J Black. Clothcap: Seamless 4d clothing
capture and retargeting. ACM Transactions on Graphics, 36(4):73, 2017.
[154] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets
for 3d classification and segmentation. arXiv preprint arXiv:1612.00593, 2016.
[155] Charles R Qi, Hao Su, Matthias Nießner, Angela Dai, Mengyuan Yan, and Leonidas J Guibas.
Volumetric and multi-view cnns for object classification on 3d data. In IEEE Conference on
Computer Vision and Pattern Recognition, pages 5648–5656, 2016.
[156] Charles R Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning
on point sets in a metric space. In Advances in Neural Information Processing Systems, pages
5099–5108, 2017.
[157] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep
convolutional generative adversarial networks. CoRR, abs/1511.06434, 2015.
[158] Renderpeople, 2018. https://renderpeople.com/3d-people.
[159] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and
approximate inference in deep generative models. In Proceedings of International Conference on
International Conference on Machine Learning (ICML), pages 1278–1286, 2014.
[160] Elad Richardson, Matan Sela, and Ron Kimmel. 3d face reconstruction by learning from synthetic
data. In 3D Vision (3DV), 2016 Fourth International Conference on, pages 460–469. IEEE, 2016.
[161] Elad Richardson, Matan Sela, Roy Or-El, and Ron Kimmel. Learning detailed face reconstruction
from a single image. In Proc. CVPR, pages 5553–5562. IEEE, 2017.
[162] Gregory Rogez, Philippe Weinzaepfel, and Cordelia Schmid. LCR-Net++: Multi-person 2D and 3D
Pose Detection in Natural Images. arXiv preprint arXiv:1803.00455, 2018.
[163] Sami Romdhani and Thomas Vetter. Estimating 3d shape and texture using pixel intensity, edges,
specular highlights, texture constraints and a prior. In Proc. CVPR, pages 986–993, 2005.
[164] Carsten Rother, Vladimir Kolmogorov, and Andrew Blake. Grabcut: Interactive foreground extrac-
tion using iterated graph cuts. ACM Transactions on Graphics, 23(3):309–314, 2004.
[165] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning internal representations by
error propagation. Technical report, Institute for Cognitive Science, University of California, San Diego,
1985.
[166] Shunsuke Saito, Tianye Li, and Hao Li. Real-time facial segmentation and performance capture from
rgb input. In Proceedings of the European Conference on Computer Vision, pages 244–261, 2016.
[167] Shunsuke Saito, Lingyu Wei, Liwen Hu, Koki Nagano, and Hao Li. Photorealistic facial texture
inference using deep neural networks. In Proc. CVPR, 2017.
[168] Stan Sclaroff and Alex Pentland. Generalized implicit functions for computer graphics, volume 25.
ACM, 1991.
[169] Steven M Seitz, Brian Curless, James Diebel, Daniel Scharstein, and Richard Szeliski. A comparison
and evaluation of multi-view stereo reconstruction algorithms. In IEEE Conference on Computer
Vision and Pattern Recognition, pages 519–528, 2006.
[170] Matan Sela, Elad Richardson, and Ron Kimmel. Unrestricted facial geometry reconstruction using
image-to-image translation. In IEEE ICCV, pages 1585–1594. IEEE, 2017.
[171] Soumyadip Sengupta, Angjoo Kanazawa, Carlos D Castillo, and David Jacobs. Sfsnet: Learning
shape, reflectance and illuminance of faces in the wild. arXiv preprint arXiv:1712.01261, 2017.
[172] Evan Shelhamer, Jonathan Long, and Trevor Darrell. Fully convolutional networks for semantic
segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 39(4):640–651, 2017.
[173] Fuhao Shi, Hsiang-Tao Wu, Xin Tong, and Jinxiang Chai. Automatic acquisition of high-fidelity
facial performances using monocular videos. ACM Trans. Graph., 33(6):222, 2014.
[174] Zhixin Shu, Ersin Yumer, Sunil Hadap, Kalyan Sunkavalli, Eli Shechtman, and Dimitris Samaras.
Neural face editing with intrinsic image disentangling. arXiv preprint arXiv:1704.04131, 2017.
[175] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image
recognition. arXiv preprint arXiv:1409.1556, 2014.
[176] Peter-Pike Sloan, Jan Kautz, and John Snyder. Precomputed radiance transfer for real-time rendering
in dynamic, low-frequency lighting environments. In ACM Transactions on Graphics, volume 21,
pages 527–536, 2002.
[177] Cristian Sminchisescu and Alexandru Telea. Human pose estimation from silhouettes. a consistent
approach using distance level sets. In International Conference on Computer Graphics, Visualization
and Computer Vision, volume 10, 2002.
[178] Solid Angle, 2016. http://www.solidangle.com/arnold/.
[179] Jonathan Starck and Adrian Hilton. Surface capture for performance-based animation. IEEE
Computer Graphics and Applications, 27(3):21–31, 2007.
[180] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view Convolu-
tional Neural Networks for 3D Shape Recognition. In IEEE International Conference on Computer
Vision, pages 945–953, 2015.
[181] Shao-Hua Sun, Minyoung Huh, Yuan-Hong Liao, Ning Zhang, and Joseph J Lim. Multi-view to
novel view: Synthesizing novel views with self-learned confidence. In European Conference on
Computer Vision, pages 155–171, 2018.
[182] Supasorn Suwajanakorn, Ira Kemelmacher-Shlizerman, and Steven M Seitz. Total moving face
reconstruction. In ECCV, pages 796–812. Springer, 2014.
[183] Qingyang Tan, Lin Gao, Yu-Kun Lai, and Shihong Xia. Variational autoencoders for deforming 3d
mesh models. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition,
pages 5841–5850, 2018.
[184] Ayush Tewari, Michael Zollhöfer, Pablo Garrido, Florian Bernard, Hyeongwoo Kim, Patrick Pérez,
and Christian Theobalt. Self-supervised multi-level face model learning for monocular reconstruction
at over 250 hz. arXiv preprint arXiv:1712.02859, 2017.
[185] Ayush Tewari, Michael Zollhöfer, Hyeongwoo Kim, Pablo Garrido, Florian Bernard, Patrick Pérez,
and Christian Theobalt. MoFA: Model-based deep convolutional face autoencoder for unsupervised
monocular reconstruction. In Proceedings of the IEEE International Conference on Computer Vision,
pages 3735–3744, 2017.
[186] The Digital Human League. Digital Emily 2.0, 2015. http://gl.ict.usc.edu/Research/DigitalEmily2/.
[187] J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, and M. Nießner. Face2face: Real-time face
capture and reenactment of rgb videos. In Proc. CVPR, 2016.
[188] Justus Thies, Michael Zollhöfer, Marc Stamminger, Christian Theobalt, and Matthias Nießner.
Facevr: Real-time facial reenactment and eye gaze control in virtual reality. ACM Trans. Graph.,
37(2):25:1–25:15, 2018.
[189] Jonathan J Tompson, Arjun Jain, Yann LeCun, and Christoph Bregler. Joint training of a convolutional
network and a graphical model for human pose estimation. In Advances in neural information
processing systems, pages 1799–1807, 2014.
[190] Anh Tuan Tran, Tal Hassner, Iacopo Masi, and Gerard Medioni. Regressing robust and discriminative
3d morphable models with a very deep neural network. In Proceedings of IEEE Conference on
Computer Vision and Pattern Recognition, pages 1493–1502, 2017.
[191] Shubham Tulsiani, Tinghui Zhou, Alexei A. Efros, and Jitendra Malik. Multi-view supervision for
single-view reconstruction via differentiable ray consistency. In IEEE Conference on Computer
Vision and Pattern Recognition, pages 2626–2634, 2017.
[192] Matthew Turk and Alex Pentland. Eigenfaces for recognition. J. Cognitive Neuroscience, 3(1):71–86,
1991.
[193] Gül Varol, Duygu Ceylan, Bryan Russell, Jimei Yang, Ersin Yumer, Ivan Laptev, and Cordelia
Schmid. BodyNet: Volumetric inference of 3D human body shapes. In European Conference on
Computer Vision, pages 20–36, 2018.
[194] Gül Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J. Black, Ivan Laptev, and
Cordelia Schmid. Learning from synthetic humans. In IEEE Conference on Computer Vision and
Pattern Recognition, pages 109–117, 2017.
[195] Daniel Vlasic, Ilya Baran, Wojciech Matusik, and Jovan Popović. Articulated mesh animation from
multi-view silhouettes. ACM Transactions on Graphics, 27(3):97, 2008.
[196] Daniel Vlasic, Pieter Peers, Ilya Baran, Paul Debevec, Jovan Popović, Szymon Rusinkiewicz, and
Wojciech Matusik. Dynamic shape capture using multi-view photometric stereo. ACM Transactions
on Graphics, 28(5):174, 2009.
[197] Javier von der Pahlen, Jorge Jimenez, Etienne Danvoye, Paul Debevec, Graham Fyffe, and Oleg
Alexander. Digital ira and beyond: Creating real-time photoreal digital actors. In ACM SIGGRAPH
2014 Courses, SIGGRAPH ’14, pages 1:1–1:384, New York, NY, USA, 2014. ACM.
[198] Ingo Wald, Sven Woop, Carsten Benthin, Gregory S Johnson, and Manfred Ernst. Embree: a kernel
framework for efficient cpu ray tracing. ACM Transactions on Graphics, 33(4):143, 2014.
[199] Chuan Wang, Haibin Huang, Xiaoguang Han, and Jue Wang. Video inpainting by jointly learning
temporal structure and spatial details. arXiv preprint arXiv:1806.08482, 2018.
[200] Lvdi Wang, Yizhou Yu, Kun Zhou, and Baining Guo. Example-based hair geometry synthesis. ACM
Trans. Graph., 28(3):56:1–56:9, 2009.
[201] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-
resolution image synthesis and semantic manipulation with conditional gans. In IEEE Conference on
Computer Vision and Pattern Recognition, pages 8798–8807, 2018.
[202] Yang Wang, Haibin Huang, Chuan Wang, Tong He, Jue Wang, and Minh Hoai. Gif2video: Color
dequantization and temporal interpolation of gif images. arXiv preprint arXiv:1901.02840, 2019.
[203] Kelly Ward, Florence Bertails, Tae-Yong Kim, Stephen R Marschner, Marie-Paule Cani, and Ming C
Lin. A survey on hair modeling: Styling, simulation, and rendering. IEEE Transactions on
Visualization and Computer Graphics, 13(2):213–234, 2007.
[204] Michael Waschbüsch, Stephan Würmlin, Daniel Cotting, Filip Sadlo, and Markus Gross. Scalable
3D video of dynamic scenes. The Visual Computer, 21(8):629–638, 2005.
[205] Li-Yi Wei, Sylvain Lefebvre, Vivek Kwatra, and Greg Turk. State of the art in example-based texture
synthesis. In Eurographics 2009, State of the Art Report, EG-STAR, pages 93–117. Eurographics
Association, 2009.
[206] Li-Yi Wei and Marc Levoy. Fast texture synthesis using tree-structured vector quantization. In
Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques,
SIGGRAPH ’00, pages 479–488, 2000.
[207] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional pose machines.
In IEEE Conference on Computer Vision and Pattern Recognition, pages 4724–4732, 2016.
[208] Yichen Wei, Eyal Ofek, Long Quan, and Heung-Yeung Shum. Modeling hair from multiple views.
ACM Trans. Graph., 24(3):816–820, 2005.
[209] Chung-Yi Weng, Brian Curless, and Ira Kemelmacher-Shlizerman. Photo wake-up: 3d character
animation from a single photo. arXiv preprint arXiv:1812.02246, 2018.
[210] Yanlin Weng, Lvdi Wang, Xiao Li, Menglei Chai, and Kun Zhou. Hair interpolation for portrait
morphing. Computer Graphics Forum, 32(7):79–84, 2013.
[211] Tim Weyrich, Wojciech Matusik, Hanspeter Pfister, Bernd Bickel, Craig Donner, Chien Tu, Janet
McAndless, Jinho Lee, Addy Ngan, Henrik Wann Jensen, and Markus Gross. Analysis of human
faces using a measurement-based skin reflectance model. ACM Trans. Graph., 25(3):1013–1024,
2006.
[212] Cyrus A Wilson, Abhijeet Ghosh, Pieter Peers, Jen-Yuan Chiang, Jay Busch, and Paul Debevec.
Temporal upsampling of performance geometry using photometric alignment. ACM Trans. Graph.,
29(2):17, 2010.
[213] Jamie Wither, Florence Bertails, and Marie-Paule Cani. Realistic hair from a sketch. In IEEE
International Conference on Shape Modeling and Applications, pages 33–42, 2007.
[214] Chenglei Wu, Derek Bradley, Markus Gross, and Thabo Beeler. An anatomically-constrained local
deformation model for monocular face capture. ACM Trans. Graph., 35(4):115, 2016.
[215] Chenglei Wu, Kiran Varanasi, and Christian Theobalt. Full body performance capture under uncon-
trolled and varying illumination: A shading-based approach. European Conference on Computer
Vision, pages 757–770, 2012.
[216] Yuxin Wu and Kaiming He. Group normalization. In European Conference on Computer Vision,
pages 3–19, 2018.
[217] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong
Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of IEEE Conference
on Computer Vision and Pattern Recognition, pages 1912–1920, 2015.
[218] Zexiang Xu, Hsiang-Tao Wu, Lvdi Wang, Changxi Zheng, Xin Tong, and Yue Qi. Dynamic hair
capture using spacetime optimization. ACM Trans. Graph., 33(6):224:1–224:11, 2014.
[219] Shuco Yamaguchi, Shunsuke Saito, Koki Nagano, Yajie Zhao, Weikai Chen, Kyle Olszewski,
Shigeo Morishima, and Hao Li. High-fidelity facial reflectance and geometry inference from an
unconstrained image. ACM Transactions on Graphics, 37(4):162, 2018.
[220] Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. Perspective transformer
nets: Learning single-view 3d object reconstruction without 3d supervision. In Advances in Neural
Information Processing Systems, pages 1696–1704, 2016.
[221] Jinlong Yang, Jean-Sébastien Franco, Franck Hétroy-Wheeler, and Stefanie Wuhrer. Estimation
of human body shape in motion with wide clothing. In European Conference on Computer Vision,
pages 439–454, 2016.
[222] Raymond A. Yeh*, Chen Chen*, Teck Yian Lim, Alexander G. Schwing, Mark Hasegawa-Johnson,
and Minh N. Do. Semantic image inpainting with deep generative models. In Proc. CVPR, 2017.
(* equal contribution)
[223] Xuan Yu, Zhan Yu, Xiaogang Chen, and Jingyi Yu. A hybrid image-cad based system for modeling
realistic hairstyles. In Proceedings of the ACM SIGGRAPH Symposium on Interactive 3D Graphics
and Games (I3D), pages 63–70, 2014.
[224] Cem Yuksel, Scott Schaefer, and John Keyser. Hair meshes. ACM Trans. Graph., 28(5):166:1–166:7,
2009.
[225] M Ersin Yumer and Niloy J Mitra. Learning semantic deformation flows with 3d convolutional
networks. In Proceedings of the European Conference on Computer Vision, pages 294–311, 2016.
[226] Chao Zhang, Sergi Pujades, Michael Black, and Gerard Pons-Moll. Detailed, accurate, human shape
estimation from clothed 3D scan sequences. In IEEE Conference on Computer Vision and Pattern
Recognition, pages 4191–4200, 2017.
[227] He Zhang and Vishal M Patel. Densely connected pyramid dehazing network. In IEEE Conference
on Computer Vision and Pattern Recognition, pages 3194–3203, 2018.
[228] He Zhang, Vishwanath Sindagi, and Vishal M Patel. Image de-raining using a conditional generative
adversarial network. arXiv preprint arXiv:1701.05957, 2017.
[229] Meng Zhang, Menglei Chai, Hongzhi Wu, Hao Yang, and Kun Zhou. A data-driven approach to
four-view image-based hair modeling. ACM Trans. Graph., 36(4):156:1–156:11, 2017.
[230] Xiangyu Zhang, Jianhua Zou, Kaiming He, and Jian Sun. Accelerating very deep convolutional
networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 38(10):1943–1955, 2016.
[231] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing
network. In Proc. CVPR, 2017.
[232] Shizhe Zhou, Hongbo Fu, Ligang Liu, Daniel Cohen-Or, and Xiaoguang Han. Parametric reshaping
of human bodies in images. In ACM Transactions on Graphics, page 126, 2010.
[233] Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Malik, and Alexei A Efros. View synthesis
by appearance flow. In European Conference on Computer Vision, pages 286–301, 2016.
[234] Yi Zhou, Liwen Hu, Jun Xin, Weikai Chen, Han-Wei Kung, Xin Tong, and Hao Li. Hairnet: Single-
view hair reconstruction using convolutional neural networks. In Proceedings of the European
Conference on Computer Vision, pages 235–251, 2018.
[235] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation
using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision,
pages 2223–2232, 2017.
[236] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A Efros, Oliver Wang, and Eli
Shechtman. Toward multimodal image-to-image translation. In Advances in Neural Information
Processing Systems 30. 2017.
[237] Xiangyu Zhu, Zhen Lei, Junjie Yan, Dong Yi, and Stan Z Li. High-fidelity pose and expression
normalization for face recognition in the wild. In Proc. CVPR, pages 787–796, 2015.
[238] C Lawrence Zitnick, Sing Bing Kang, Matthew Uyttendaele, Simon Winder, and Richard Szeliski.
High-quality video view interpolation using a layered representation. ACM Transactions on Graphics,
23(3):600–608, 2004.