Complete Human Digitization for Sparse Inputs
by
Zeng Huang
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Science)
December 2020
Copyright 2020 Zeng Huang
Dedication
This dissertation is dedicated . . .
To my parents,
Shanhua Huang and Xiaojing Chen,
the Huang and Chen family,
and especially to Dr. Zhiping Chen, my grandfather.
Acknowledgements
Firstly, I would like to express my sincere gratitude to my advisor Prof. Hao Li for the continuous support of my Ph.D. study and related research, and for his patience, motivation, and immense vision and knowledge. Oh, and also his craziness. His guidance helped me throughout my research and beyond, and made me the better person I am today. I could not have imagined having a better advisor and mentor for my Ph.D. study.
I would like to thank my thesis committee, Prof. Randal Hill, Prof. Remo Rohs, Prof. Stefanos Nikolaidis, and Prof. Aiichiro Nakano, as well as the other members of my previous qualification and proposal committees, Prof. Laurent Itti, Prof. Andrew Nealen, Prof. Ramesh Govindan, and Prof. Ning Wang, for their insightful comments and encouragement, but also for the questions that prompted me to widen my research from various perspectives.
The four years spent at the University of Southern California and the Institute for Creative Technologies have been truly wonderful, and I cannot thank enough those who granted me access to all the learning and research opportunities. Special thanks go to Kathleen Haase, Christina Trejo, and Jeff Karp; without their strong support and wise administration it would not have been possible to conduct this research. I would also like to thank Lizsl De Leon, Tracy Charles, and the OIS office at USC for their invaluable advice and help.
I would like to acknowledge my colleagues at ICT: Yajie Zhao, Weikai Chen, Jun Xing, Mingming He, Chloe LeGendre, Xinglei Ren, Pratusha Prasad, Bipin Kishore, Kalle Bladin, Marcel Ramos, and Ari Shapiro, for all the hard work together on various projects pushing forward the digitization of humans. Further acknowledgement goes to Ryota Natsume, Angjoo Kanazawa, Shigeo Morishima, Linjie Luo, and Chongyang Ma for the wonderful collaboration. Sincere thanks also go to Ronald Mallet, Carsten Stoll, and Tony Tung, who twice gave me the opportunity to join the Facebook Reality Labs Sausalito team as an intern. It was a truly amazing experience working there with scientists, engineers, and artists like Michael Ranieri, Christoph Lassner, Yuanlu Xu, Nikolaos Sarafianos, Aaron Ferguson, Nick Porcino, and many others.
A special acknowledgement goes to my amazing lab mates and Ph.D. buddies, not only for the collaboration, but also for sharing all the ups and downs, the hard times and the fun: Lingyu Wei, Liwen Hu, Kyle Olszewski, Tianye Li, Shunsuke Saito, Ronald Yu, Yi Zhou, Zimo Li, Sitao Xiang, Pengda Xiang, Shichen Liu, Haiwei Chen, Ruilong Li, Yuliang Xiu, and Jiaman Li.
Last but not least, I would like to thank my parents for their wise counsel and sympathetic ear; you are always there for me. Finally, I could not have completed this journey without the support of my friends, Zhenchao Wu, Yujian Shi, Wenqian Zhu, Wenchu Zhu, Yingfei Wang, and many others, who provided stimulating discussions as well as happy distractions to rest my mind outside of my research.
Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Overview
1.1 Motivation
1.2 Camera Calibration
1.3 Volumetric Reconstruction
1.4 Deep Human Model

Chapter 2: Camera Calibration
2.1 Introduction
2.2 Related Work
2.3 Perspective Undistortion for Portraits
2.3.1 Camera Distance Prediction Network
2.3.2 FlowNet
2.3.3 CompletionNet
2.3.4 Implementation Details
2.4 Experiments
2.4.1 Datasets
2.4.2 Results and Comparisons
2.4.3 Ablation Study
2.4.4 Applications

Chapter 3: Volumetric Reconstruction
3.1 Introduction
3.2 Related Work
3.3 Deep Visual Hull
3.3.1 Network Architecture
3.3.2 Training
3.3.3 Surface Reconstruction
3.4 Pixel-Aligned Implicit Function
3.4.1 Single-view Surface Reconstruction
3.4.2 Texture Inference
3.4.3 Multi-View Case
3.4.4 Implementation Details
3.5 Experiments
3.5.1 Datasets
3.5.2 Multi-View Results and Comparisons
3.5.3 Single-View Results and Comparisons

Chapter 4: Deep Human Model
4.1 Introduction
4.2 Related Work
4.3 Animatable Reconstruction of Clothed Humans
4.3.1 Semantic Space and Deformation Field
4.3.2 Implicit Surface Reconstruction
4.3.3 Granular Render-and-Compare
4.3.4 Training
4.3.5 Inference
4.3.6 Implementation Details
4.4 Experiments
4.4.1 Datasets
4.4.2 Results and Comparisons

Chapter 5: Conclusion and Future Work
5.1 Conclusion
5.2 Ongoing and Future Work

Bibliography
List of Tables

2.1 Comparison of face verification accuracy for images with and without our undistortion as pre-processing.
3.1 Quantitative evaluation on the RenderPeople and BUFF datasets for single-view reconstruction.
4.1 Quantitative comparisons of normal, P2S, and Chamfer errors between posed reconstruction and ground truth on the RenderPeople and BUFF datasets.
4.2 Ablation study on the effectiveness of spatial features.
List of Figures

2.1 A learning-based method to remove perspective distortion from portraits.
2.2 Perspective distortion artifacts.
2.3 The pipeline workflow and applications of our approach.
2.4 Cascade network structure.
2.5 Illustration of the camera distance prediction classifier.
2.6 Training dataset.
2.7 Comparisons on undistortion results of our beam splitter system.
2.8 Comparisons on undistortion results of synthetic data generated from the BU-4DFE dataset.
2.9 Evaluation and comparisons on a variety of datasets, including in-the-wild images.
2.10 Distance prediction probability curves of three different input portraits, with the query distance sampled densely along the whole range.
2.11 Ablation analysis on the cascade network.
2.12 Ablation analysis on attaching the distance label as an input to FlowNet.
2.13 Receiver operating characteristic (ROC) curve for face verification performance on raw input and undistorted input using our method.
2.14 Cumulative error curve for facial landmark detection performance given unnormalized images and images normalized by our method.
2.15 Comparing 3D face reconstruction from portraits, without and with our undistortion technique.
3.1 End-to-end deep learning based volumetric reconstruction.
3.2 Network architecture.
3.3 Classification boundary.
3.4 Overview of our clothed human digitization pipeline.
3.5 Multi-view extension.
3.6 Camera setting for the reported four-view results.
3.7 Results reconstructed from four views.
3.8 Sequence results.
3.9 Reconstructions with different views.
3.10 Comparisons on 4-view reconstruction.
3.11 Qualitative single-view results on real images from the DeepFashion dataset [105].
3.12 Comparison with other human digitization methods from a single image.
3.13 Comparison with SiCloPe [118] on texture inference.
4.1 ARCH: Animatable Reconstruction of Clothed Humans.
4.2 ARCH overview.
4.3 Illustration of the loss computation through differentiable rendering.
4.4 Illustration of reposing 3D scans to the canonical space.
4.5 An example of animating a predicted avatar.
4.6 Evaluation on BUFF.
4.7 Reconstruction quality of clothing details.
4.8 Reconstruction example using different types of spatial features.
4.9 Qualitative comparisons against state-of-the-art methods [83, 173, 145] on unseen images.
4.10 Challenging cases.
Abstract
3D human digitization has been explored for several decades in the field of computer vision and computer
graphics. Accurate reconstruction methods have been proposed using various types of sensors, and several
applications have become popular in sports, medicine and entertainment (e.g., movies, games, AR/VR
experiences). However, these setups either require tightly controlled environments or are constrained to a certain body part, such as the face or hair. To date, full-body 3D human digitization with detailed geometry, appearance, and rigging from in-the-wild pictures (i.e., pictures taken in natural conditions as opposed to laboratory environments) remains challenging.
In an era where immersive technologies and personal digital devices are becoming increasingly prevalent, the ability to create virtual 3D humans at scale, accessible to end users, is essential for numerous applications. In this dissertation, we explore a complete pipeline for digitizing the full human body from very sparse multi-view or single-view setups, using only consumer-level RGB camera(s). We first look into the pre-processing of the camera data and introduce our method for removing perspective distortion, which is commonly seen in near-range photos of humans, especially portraits. Then, we introduce our end-to-end deep-learning based volumetric reconstruction framework, which enables highly detailed reconstruction and can infer both 3D surface and texture from a single image and, optionally, multiple input images. Finally, building upon this framework, we perform the reconstruction alongside a pre-defined body rig, enabling us to reconstruct a full-body model in a normalized canonical pose, in a format ready for animation.
Chapter 1
Overview
1.1 Motivation
Digitization of humans is essential for a variety of applications ranging from gaming and visual effects to free-viewpoint video. High-end capture solutions, which mostly involve a large number of cameras and active projection or controlled lighting conditions, are usually restricted to professional studio settings. The increasing popularity of VR/AR technologies and personal devices has further triggered the need for affordable and accessible digitization systems, which calls for an end-to-end solution for capturing dynamic clothed digital humans. The latest related work has been able to faithfully digitize individual body parts such as the face, hair, and hands with more lightweight systems, thanks to their domain specialization and the recent development of deep learning. However, due to the varying poses with large movements, changing topologies, and the large number of degrees of freedom of the full body, existing methods cannot be applied directly to it. In this dissertation, we address the still-challenging problem of digitizing the full body, and complete a full pipeline for digitizing humans from sparse inputs.
1.2 Camera Calibration
The first stage of our pipeline is image pre-processing, which aims at normalizing the raw data captured by the cameras for later procedures. We utilize common image processing algorithms such as tone mapping, color balance, and distortion correction. Still, handling the perspective distortion caused by the varying distance between the subject and the camera remains an open problem due to the lack of existing methods. The effect of perspective distortion is commonly seen in human photographs, as the camera is usually not far from the person, especially for portraits, where the camera is most of the time within a meter of the face. An unconstrained, perspective-distorted image as input can greatly reduce the performance of downstream algorithms, as shown in Section 2.4.4.
In Chapter 2, we present how we address the perspective distortions in portraits, and our method can
be extended to a general approach addressing perspective distortion in processing human-related photos.
Near-range portrait photographs often contain perspective distortion artifacts that bias human perception
and challenge both facial recognition and reconstruction techniques. We present the first deep learning
based approach to remove such artifacts from unconstrained portraits. In contrast to the previous state-of-
the-art approach, our method handles even portraits with extreme perspective distortion, as we avoid the
inaccurate and error-prone step of first fitting a 3D face model. Instead, we predict a distortion correction
flow map that encodes a per-pixel displacement that removes distortion artifacts when applied to the input
image. Our method also automatically infers missing facial features, e.g., ears occluded by strong perspective distortion, with coherent details. We demonstrate that our approach significantly outperforms the previous state-of-the-art both qualitatively and quantitatively, particularly for portraits with extreme perspective distortion or facial expressions. We further show that our technique benefits a number of fundamental tasks, significantly improving the accuracy of both face recognition and 3D reconstruction, and enabling a novel camera calibration technique from a single portrait. Moreover, we also build the first perspective portrait database with a large diversity in identities, expressions, and poses, which will benefit related research in this area.
1.3 Volumetric Reconstruction
After processing our captured images, we would like to establish a framework for reconstructing 3D from 2D inputs. Conventional reconstruction systems require pre-scanned templates, large numbers of cameras, or active sensors, and fail to generate plausible results for extremely sparse inputs. So far, only data-driven approaches have been able to greatly lower the requirement for controlled environments, but most existing methods suffer from inferior reconstruction quality due to their costly or inflexible representations.
In Chapter 3, we focus on the task of template-free 3D reconstruction from sparse inputs. We augment the idea of the conventional visual hull algorithm using deep learning, and extend it to a general implicit representation called Pixel-aligned Implicit Function (PIFu), which locally aligns pixels of 2D images with
the global context of their corresponding 3D object. Using this representation, we propose an end-to-end
deep learning framework that can infer both 3D surface and texture from either a single image or multi-
ple input images, and dedicate our framework towards digitizing highly detailed clothed humans. Highly
intricate shapes, such as hairstyles, clothing, as well as their variations and deformations can be recon-
structed in a unified way. Compared to existing representations used for 3D deep learning, PIFu produces
high-resolution surfaces including largely unseen regions such as the back of a person. In particular, it is
memory efficient unlike the voxel representation, can handle arbitrary topology, and the resulting surface
is spatially aligned with the input image. We demonstrate high-resolution and robust reconstructions on
real-world images from multiple datasets containing a variety of challenging clothing types. Our method achieves state-of-the-art performance on a public benchmark and outperforms prior work on clothed human digitization from a single image.
1.4 Deep Human Model
Our deep learning framework makes 3D reconstruction as simple as taking a picture. Its data-driven nature eliminates the requirement for sophisticated 3D scanning devices, multi-view stereo algorithms, or tedious capture procedures, and it implicitly learns a shape prior for recovering occluded regions and fine details, especially when trained on and applied to full-body digitization. On top of this general reconstruction approach, it remains interesting to see how we can integrate a human prior into our pipeline and allow the creation of human models specifically for digital avatar applications, such as simulation or animation.
In Chapter 4, we propose ARCH (Animatable Reconstruction of Clothed Humans), a novel end-to-end
framework for accurate reconstruction of animation-ready 3D clothed humans from a monocular image.
Thus far, existing approaches to digitize 3D humans struggle to handle pose variations and recover details,
or they do not produce models that are animation ready. In contrast, ARCH is a learned pose-aware model
that produces detailed 3D rigged full-body human avatars from a single unconstrained RGB image. A
Semantic Space and a Semantic Deformation Field are created using a parametric 3D body estimator.
They allow the transformation of 2D/3D clothed humans into a canonical space, reducing ambiguities
in geometry caused by pose variations and occlusions in training data. Detailed surface geometry and
appearance are learned using an implicit function representation with spatial local features. Furthermore,
we propose additional per-pixel supervision on the 3D reconstruction using opacity-aware differentiable
rendering. Our experiments indicate that ARCH increases the fidelity of the reconstructed humans. We
obtain more than 50% lower reconstruction errors for standard metrics compared to previous methods
on public datasets. We also show numerous qualitative examples of animated, high-quality reconstructed
avatars unseen in the literature so far.
Chapter 2
Camera Calibration
2.1 Introduction
Perspective distortion artifacts are often observed in human related photographs, especially portraits, in
part due to the popularity of the “selfie” image captured at a near-range distance, as shown in Fig. 2.2.
When the object-to-camera distance is comparable to the size of a human head, as in the 25cm distance
example, there is a large proportional difference between the camera-to-nose distance and camera-to-ear
distance. This difference creates a face with unusual proportions, with the nose and eyes appearing larger
and the ears vanishing altogether [182].
Figure 2.1: A learning-based method to remove perspective distortion from portraits.
For a subject with two different facial expressions, we show the input photos (b) (e), our undistortion results (c) (f), and reference images (d) (g) captured simultaneously using a beam splitter rig (a). Our approach handles even extreme perspective distortions.

Perspective distortion in portraits not only influences the way humans perceive one another [25], but also greatly impairs a number of computer vision tasks, such as face verification and landmark detection. Prior research [100, 101] has demonstrated that face recognition is strongly compromised by
the perspective distortion of facial features. Additionally, 3D face reconstruction from such portraits is
highly inaccurate, as geometry fitting starts from biased facial landmarks and distorted textures.
Correcting perspective distortion in portrait photography is a largely unstudied topic. Recently, Fried
et al. [51] investigated a related problem, aiming to manipulate the relative pose and distance between a
camera and a subject in a given portrait. Towards this goal, they fit a full perspective camera and parametric
3D head model to the input image and performed 2D warping according to the desired change in 3D.
However, the technique relied on a potentially inaccurate 3D reconstruction from facial landmarks biased
by perspective distortion. Furthermore, if the 3D face model fitting step failed, as it could for extreme
perspective distortion or dramatic facial expressions, so would their broader pose manipulation method.
In contrast, our approach does not rely on model fitting from distorted inputs and thus can handle even
these challenging inputs. Our GAN-based synthesis approach also enables high-fidelity inference of any
occluded features, not considered by Fried et al. [51].
In our approach, we propose a cascaded network that maps a near-range portrait with perspective distortion to its distortion-free counterpart at a canonical distance of 1.6m (although any distance between 1.4m and 2m could be used as the target distance for good portraits). Our cascaded network includes a distortion correction flow network and a completion network. Our distortion correction flow method encodes a per-pixel displacement, maintaining the original image's resolution and its high-frequency details in the output. However, as near-range portraits often suffer from significant perspective occlusions, flowing individual pixels often does not yield a complete final image. Thus, the completion network inpaints any missing features. A final texture blending step combines the face from the completion network and the warped output from the distortion correction network. As the possible range of per-pixel flow values varies with camera distance, we first train a camera distance prediction network and feed its prediction along with the input portrait to the distortion correction network.
Training our proposed networks requires a large corpus of paired portraits with and without perspective
distortion. However, to the best of our knowledge, no previously existing dataset is suited for this task.
As such, we construct the first portrait dataset rendered from 3D head models with large variations in camera distance, head pose, subject identity, and illumination. To visually and numerically evaluate the effectiveness of our approach on real portrait photographs, we also design a beam-splitter photography system (see Fig. 2.1 (a)) to capture portrait pairs of real subjects simultaneously on the same optical axis,
eliminating differences in poses, expressions and lighting conditions.
Experimental results demonstrate that our approach removes a wide range of perspective distortion
artifacts (e.g., increased nose size, squeezed face), and even restores missing facial features like ears
or the rim of the face. We show that our approach significantly outperforms Fried et al. [51] both qualita-
tively and quantitatively for a synthetic dataset, constrained portraits, and unconstrained portraits from the
Internet. We also show that our proposed face undistortion technique, when applied as a pre-processing
step, improves a wide range of fundamental tasks in computer vision and computer graphics, including
face recognition/verification, landmark detection on near-range portraits (such as head mounted cameras
in visual effects), and 3D face model reconstruction, which can help 3D avatar creation and the generation
of 3D photos (Section 2.4.4). Additionally, our novel camera distance prediction provides accurate camera
calibration from a single portrait.
Our main contributions can be summarized as follows:

• The first deep learning based method to automatically remove perspective distortion from an unconstrained near-range portrait, benefiting a wide range of fundamental tasks in computer vision and graphics.

• A novel and accurate camera calibration approach that only requires a single near-range portrait as input.

• A new perspective portrait database for face undistortion with a wide range of subject identities, head poses, camera distances, and lighting conditions.
Figure 2.2: Perspective distortion artifacts.
The person is photographed from distances of 160cm and 25cm.
2.2 Related Work
Face Modeling. We refer the reader to [129] for a comprehensive overview and introduction to the mod-
eling of digital faces. With advances in 3D scanning and sensing technologies, sophisticated laboratory
capture systems [19, 20, 24, 44, 59, 99, 113, 184] have been developed for high-quality face reconstruc-
tion. However, 3D face geometry reconstruction from a single unconstrained image remains challenging.
The seminal work of Blanz and Vetter [22] proposed a PCA-based morphable model, which laid the foun-
dation for modern image-based 3D face modeling and inspired numerous extensions including face mod-
eling from internet pictures [89], multi-view stereo [10], and reconstruction based on shading cues [90].
To better capture a variety of identities and facial expressions, the multi-linear face models [178] and the
FACS-based blendshapes [28] were later proposed. When reconstructing a 3D face from images, sparse
2D facial landmarks [41, 42, 147, 192] are widely used for a robust initialization. Shape regressions have
been exploited in the state-of-the-art landmark detection approaches [29, 88, 138] to achieve impressive
accuracy.
Due to the low dimensionality and effectiveness of morphable models in representing facial geometry,
there have been significant recent advances in single-view face reconstruction [164, 141, 92, 163, 70].
However, for near-range portrait photos, the perspective distortion of facial features may lead to erroneous
reconstructions even when using the state-of-the-art techniques. Therefore, portrait perspective undistor-
tion must be considered as a part of a pipeline for accurately modeling facial geometry.
Face Normalization. Unconstrained photographs often include occlusions, non-frontal views, perspec-
tive distortion, and even extreme poses, which introduce a myriad of challenges for face recognition and
reconstruction. However, many prior works [66, 200, 144, 72] only focused on normalizing head pose.
Hassner et al. [66] “frontalized” a face from an input image by estimating the intrinsic camera matrix
given a fixed 3D template model. Cole et al. [37] introduced a neural network that mapped an uncon-
strained facial image to a front-facing image with a neutral expression. Huang et al. [72] used a generative
model to synthesize an identity-preserving frontal view from a profile. Bas et al. [18] proposed an approach
for fitting a 3D morphable model to 2D landmarks or contours under either orthographic or perspective
projection.
Psychological research suggests a direct connection between camera distance and human portrait per-
ception. Bryan et al. [25] showed that there is an “optimal distance” at which portraits should be taken.
Cooper et al. [40] showed that the 50mm lens is most suitable for photographing an undistorted facial
image. Valente et al. [172] proposed to model perspective distortion as a one-parameter family of warping
functions with known focal length. Most related to our work, Fried et al. [51] investigated the problem of
editing the facial appearance by manipulating the distance between a virtual camera and a reconstructed
head model. Though this technique corrected some mild perspective distortion, it was not designed to
handle extreme distortions, as it relied on a 3D face fitting step. In contrast, our technique requires no
shape prior and therefore can generate undistorted facial images even from highly distorted inputs.
Image-based Camera Calibration. Camera calibration is an essential prerequisite for extracting precise
and reliable 3D metric information from images. We refer the reader to [137, 146, 116] for a survey of such
techniques. The state-of-the-art calibration methods mainly require a physical target such as checkerboard
pattern [166, 205] or circular control points [32, 39, 68, 80, 43], used for locating point correspondences.
Flores et al. [49] proposed the first method to infer camera-to-subject distance from a single image with a
calibrated camera. Burgos-Artizzu et al. [26] built the Caltech Multi-Distance Portraits Dataset (CMDP)
of portraits of a variety of subjects captured from seven distances. Many recent works directly estimate camera parameters using deep neural networks. PoseNet [91] proposed an end-to-end solution for 6-DOF camera pose localization. Others [187, 188, 69] proposed to extract camera parameters using vanishing points from a single scene photo with horizontal lines. To the best of our knowledge, our method is the first to estimate camera parameters from a single portrait.

Figure 2.3: The pipeline workflow and applications of our approach.
The input portrait is first segmented and scaled in the preprocessing stage and then fed to a network consisting of three cascaded components. The FlowNet rectifies the distortion artifacts in the visible regions of the input by predicting a distortion correction flow map. The CompletionNet inpaints the facial features missing due to strong perspective distortion and obtains the completed image. The outputs of the two networks are then scaled back to the original resolution and blended with a high-fidelity mean texture to restore fine details.
2.3 Perspective Undistortion for Portraits
The overview of our system is shown in Fig. 2.3. We pre-process the input portraits with background
segmentation, scaling, and spatial alignment (see appendix), and then feed them to a camera distance
prediction network to estimate camera-to-subject distance. The estimated distance and the portrait are
fed into our cascaded network including FlowNet, which predicts a distortion correction flow map, and
CompletionNet, which inpaints any missing facial features caused by perspective distortion. Perspective
undistortion is not a typical image-to-image translation problem, because the input and output pixels are
not spatially aligned. Thus, we factor this challenging problem into two sub-tasks: first finding a
per-pixel undistortion flow map, and then image completion via inpainting. In particular, the vectorized flow representation undistorts an input image at its original resolution, preserving its high-frequency details, which would be challenging using only generative image synthesis techniques. In our cascaded architecture (Fig. 2.4), CompletionNet is fed the warping result of FlowNet. Finally, we combine the results of the two cascaded networks using Laplacian blending [1] to synthesize high-resolution details while maintaining a plausible blend with the existing skin texture.

Figure 2.4: Cascade Network Structure.

Figure 2.5: Illustration of the Camera Distance Prediction Classifier.
The green curve and the red curve are the response curves of inputs a and b; d1 and d2 are the predicted distances of inputs a and b.
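To make the stages and data flow concrete, here is a minimal orchestration sketch. All callables passed in (segment_head, predict_distance_label, flow_net, completion_net, warp_with_flow, laplacian_blend) are hypothetical stand-ins for the components described above, not actual APIs.

```python
def undistort_portrait(image, segment_head, predict_distance_label,
                       flow_net, completion_net, warp_with_flow, laplacian_blend):
    """Sketch of the cascaded undistortion pipeline (all callables are hypothetical)."""
    # 1. Pre-processing: segment the head region and normalize to 512x512.
    head, mask = segment_head(image)

    # 2. Camera distance prediction: coarse distance-interval label.
    label = predict_distance_label(head)

    # 3. FlowNet: per-pixel correction flow conditioned on the distance label.
    flow = flow_net(head, label)            # H x W x 2 displacement map

    # 4. Warp the input with the flow; this undistorts visible features
    #    but may leave holes where occluded features (e.g. ears) should appear.
    warped = warp_with_flow(head, flow)

    # 5. CompletionNet: inpaint the missing facial features.
    completed = completion_net(warped)

    # 6. Blend warped high-frequency detail with the completed image.
    return laplacian_blend(warped, completed, mask)
```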
2.3.1 Camera Distance Prediction Network
Rather than regressing directly from the input image to the camera distance $D$, which is known to be challenging to train, we use a distance classifier: we check whether the camera-to-subject distance of the input is larger than a query distance $d \in (17.4\,\text{cm}, 130\,\text{cm})$; these bounds correspond to 14mm and 105mm in 35mm-equivalent focal length. Our strategy learns a continuous mapping from input images to the target distances. Given a query distance $d$ and the input image, forming the pair $(\text{input}, d)$, the output is a floating-point number in the range $[0, 1]$ indicating the probability that the distance of the input image is greater than the query distance $d$. As shown in Fig. 2.5, the vertical axis indicates the output of our distance prediction network, while the horizontal axis is the query distance. To predict the distance, we locate the query distance at which the network output equals 0.5. Denoting our network by $\Phi$, the output is monotone in the query distance: if $d_1 > d_2$, then $\Phi(\text{input}, d_1) \le \Phi(\text{input}, d_2)$.

To train the camera distance network, we append the value of $\log_2 d$ as an additional channel of the input image, extract features from the input image using the VGG-11 network [152], and feed them to a classic classifier consisting of fully connected layers. As training data, for each training image with ground-truth distance $D$, we sample a set of query distances $\log_2 d \sim \mathcal{N}(\log_2 D,\ 0.5^2)$.
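Given this monotone classifier, the continuous distance can be recovered by searching for the query distance whose output crosses 0.5. A minimal sketch, assuming a trained callable phi(image, log2_d) that returns the probability described above (the name and the bisection search are illustrative, not the dissertation's exact procedure):

```python
import math

def predict_distance(phi, image, d_min=17.4, d_max=130.0, iters=30):
    """Bisection for the query distance d where phi(image, log2 d) crosses 0.5.

    phi is assumed to be monotonically non-increasing in the query distance.
    """
    lo, hi = math.log2(d_min), math.log2(d_max)
    if phi(image, lo) < 0.5:      # subject closer than the minimum query distance
        return d_min
    if phi(image, hi) > 0.5:      # subject farther than the maximum query distance
        return d_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if phi(image, mid) >= 0.5:
            lo = mid              # true distance still larger than mid
        else:
            hi = mid
    return 2.0 ** (0.5 * (lo + hi))
```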
2.3.2 FlowNet
The FlowNet operates on the normalized $512 \times 512$ input image $A$ and estimates a correction forward flow $F$ that rectifies the facial distortions. However, due to the immense range of possible perspective distortions, the correction displacements for portraits taken at different distances exhibit different distributions. Directly predicting such a high-dimensional per-pixel displacement is highly under-constrained and often leads to inferior results (Figure 2.12). To ensure more efficient learning, we attach the estimated distance to the input of FlowNet in a similar way as in Section 2.3.1. Instead of directly attaching the predicted number, we classify the distances into eight intervals, [23, 26), [26, 30), [30, 35), [35, 43), [43, 62), [62, 105), [105, 168), and [168, +∞) centimeters, and use the class label as the input to FlowNet. Using the label decreases the risk of accumulating errors from the camera distance prediction network, because predicting a label is more accurate than predicting a floating-point number.
FlowNet takes $A$ and the distance label $L$ as input and predicts a forward flow map $F_{AB}$, which yields the undistorted output $B$ when applied to $A$. For each pixel $(x, y)$ of $A$, $F_{AB}$ encodes the translation vector $(\Delta_x, \Delta_y)$; denoting the correspondence of $(x, y)$ on $B$ as $(x', y')$, we have $(x', y') = (x, y) + (\Delta_x, \Delta_y)$. In FlowNet, we denote the generator and discriminator as $G$ and $D$, respectively. The L1 loss on the flow is

$$\mathcal{L}_G = \lVert y - F_{AB} \rVert_1, \tag{2.1}$$

where $y$ is the ground-truth flow. For the discriminator loss, since the forward flow $F_{AB}$ is a per-pixel correspondence defined on $A$ rather than on $B$, the warped image $B$ contains holes, seams, and discontinuities that are hard to use in a discriminator. To make the problem more tractable, instead of applying a discriminator on $B$, we use $F_{AB}$ to map $B$ back to $A'$ and use $A$ and $A'$ as pairs for the discriminator, conditioned on $L$:

$$\mathcal{L}_D = \min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(A, L)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(A', L)\big)\big], \tag{2.2}$$

where $p_{\text{data}}(x)$ and $p_z(z)$ represent the distributions of the real data $x$ (the input image $A$ domain) and of the noise variables $z$ in the domain of $A$, respectively. The discriminator penalizes the joint configuration in the space of $A$, which leads to sharper results.
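The following sketch shows one way Eqs. (2.1) and (2.2) can be written as a binary cross-entropy objective in PyTorch. It is an illustrative formulation under stated assumptions: the conditional discriminator D(image, label) and the two warping callables are hypothetical stand-ins, not the dissertation's actual training code.

```python
import torch
import torch.nn.functional as F

def flownet_losses(D, A, label, flow_pred, flow_gt, warp_forward, warp_back):
    """Sketch of Eq. (2.1) and Eq. (2.2).

    warp_forward applies the predicted flow to A (giving the undistorted B);
    warp_back uses the same flow to map B back into the A domain, giving A'.
    Both warps are assumed callables and are not defined here.
    """
    # Eq. (2.1): L1 loss between ground-truth and predicted correction flow.
    loss_l1 = (flow_gt - flow_pred).abs().mean()

    # Build A' and discriminate (A, L) vs. (A', L).
    B = warp_forward(A, flow_pred)
    A_prime = warp_back(B, flow_pred)
    real_logit = D(A, label)
    fake_logit = D(A_prime, label)

    # Eq. (2.2) written as a BCE objective for D, with the usual
    # non-saturating adversarial term for the generator.
    loss_D = F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit)) + \
             F.binary_cross_entropy_with_logits(fake_logit, torch.zeros_like(fake_logit))
    loss_G_adv = F.binary_cross_entropy_with_logits(fake_logit, torch.ones_like(fake_logit))
    return loss_l1, loss_G_adv, loss_D
```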
FlowNet follows a factorized learning paradigm that leverages multiple sub-networks, each addressing a portion of the overall regression problem. In particular, we classify the portraits into the eight distance intervals listed above, chosen such that within each interval the distributions of displacement vectors stay similar. We then train eight corresponding sub-networks separately using the categorized data, each inferring the correction flow map for portraits falling into its assigned distance interval. To determine which sub-network to deploy at run time, we use the distance classifier, which predicts the interval at which the portrait was taken (Figure 2.4).
We denote the undistorted output image as $B$. Each pixel $(x, y)$ of $F$ encodes the translation vector $(\Delta_x, \Delta_y)$ from $A$ to $B$: supposing the pixel $(x, y)$ of $B$ corresponds to the pixel $(x', y')$ of $A$, the translation vector encoded at pixel $(x, y)$ of $F$ is $(\Delta_x, \Delta_y) = (x, y) - (x', y')$. When recovering $B$ from the input image $A$ and the flow map $F$, we traverse the pixels in $A$ and copy their color values to the corresponding positions in $B$ by querying the translation vectors encoded in $F$. As the undistorted face tends to cover more pixels than it does in the input, we interpolate the scattered pixel colors using a triangulation-based approach [11]. Applying the output correction flow map from FlowNet automatically undistorts the existing facial features (including ears) in the input.
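A minimal sketch of this forward-warping step, using scipy.interpolate.griddata (which triangulates the scattered points internally) as a stand-in for the triangulation-based interpolation; this is an illustrative implementation, not the cited one [11].

```python
import numpy as np
from scipy.interpolate import griddata

def warp_forward(A, flow):
    """Warp image A (H x W x C) into B using a forward flow (H x W x 2).

    Each source pixel (x, y) lands at (x + dx, y + dy); the scattered samples
    are then interpolated back onto the regular target grid.
    """
    H, W, C = A.shape
    ys, xs = np.mgrid[0:H, 0:W]
    points = np.stack([(xs + flow[..., 0]).ravel(),      # target x positions
                       (ys + flow[..., 1]).ravel()], 1)  # target y positions
    grid_x, grid_y = np.meshgrid(np.arange(W), np.arange(H))
    B = np.zeros((H, W, C), dtype=np.float64)
    for c in range(C):  # interpolate each color channel separately
        B[..., c] = griddata(points, A[..., c].astype(np.float64).ravel(),
                             (grid_x, grid_y), method="linear", fill_value=0.0)
    return B
```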
2.3.3 CompletionNet
The distortion-free result is then fed into the CompletionNet, which focuses on inpainting missing features and filling the holes. Note that, as it is trained on a vast number of paired examples with large variations in camera distance, CompletionNet learns an adaptive mapping with respect to inputs of varying distortion magnitude. In particular, an input image with complete facial features remains largely unchanged after passing through the network, while images that still lack facial cues are automatically completed.
2.3.4 Implementation Details
Both the CompletionNet and the sub-networks in FlowNet are implemented using identical architectures.
In particular, we employ a U-net structure with skip connections similar to [76]. Such an architecture is
well-suited to our task, as the skip connections between layers of the encoder and decoder modules allow
for easy preservation of the overall structure of the input in the output image while avoiding the artifacts
and limited resolutions found in more typical encoder-decoder networks. This enables the network to
leverage more of its overall capacity to learn an appropriate transformation from the provided input to the
desired output. In addition, the face completion task is likely to introduce significant topology changes
when the missing features are recovered. We found that trivially applying the U-net structure fails to
generate reasonable results for our purpose. To make the problem more tractable, we introduce several
modifications to improve the resulting image quality and stabilize the training process. The details are
stated below.
There is no direct correspondence between each pixel in the input and those in the output. In FlowNet, the L1 and GAN discriminator losses are computed only within the segmentation mask, leading the network to focus on correcting details that will actually be used in the final output. In CompletionNet, as the output tends to cover more pixels than the input, we compute during training a mask that denotes the region that is novel with respect to the input and assign a higher L1 loss weight to this region; in our implementation, the ratio of the loss weights inside and outside this mask is set to 5:1 for CompletionNet. We found that these modifications lead to better inference performance while producing results with more details.
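A short sketch of the mask-restricted and mask-weighted L1 terms described above (PyTorch-style; the tensor layout and the averaging convention are assumptions):

```python
import torch

def masked_l1(pred, target, mask):
    """FlowNet: L1 restricted to the segmentation mask (mask is 1 inside the head)."""
    diff = (pred - target).abs() * mask
    return diff.sum() / mask.sum().clamp(min=1.0)

def weighted_l1(pred, target, novel_mask, w_in=5.0, w_out=1.0):
    """CompletionNet: 5:1 weighting between newly revealed regions and the rest."""
    w = novel_mask * w_in + (1.0 - novel_mask) * w_out
    return ((pred - target).abs() * w).mean()
```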
We train FlowNet and CompletionNet separately using the Adam optimizer [93] with a learning rate of 0.0002. All training was performed on an NVIDIA GTX 1080 Ti graphics card. For both networks, we set the weights of the L1 and GAN losses to 10.0 and 1.0, respectively. As shown in Fig. 2.4, the generator in both networks uses a mirrored structure in which both the encoder and the decoder are 8-layer convolutional networks. ReLU activations and batch normalization are used in all layers except the input and output layers. The discriminator consists of four convolutional layers and one fully connected layer.
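The generator described here can be sketched as a minimal pix2pix-style U-net in PyTorch. This is an illustration under stated assumptions: the channel widths, the 4-channel input (RGB plus the attached label/distance channel), and the 2-channel flow output are illustrative choices, not the dissertation's exact configuration.

```python
import torch
import torch.nn as nn

def enc_block(cin, cout, norm=True):
    # stride-2 convolution, optional batch norm, ReLU
    layers = [nn.Conv2d(cin, cout, 4, stride=2, padding=1)]
    if norm:
        layers.append(nn.BatchNorm2d(cout))
    layers.append(nn.ReLU(inplace=True))
    return nn.Sequential(*layers)

def dec_block(cin, cout, norm=True, act=True):
    # stride-2 transposed convolution, optional batch norm / activation
    layers = [nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1)]
    if norm:
        layers.append(nn.BatchNorm2d(cout))
    if act:
        layers.append(nn.ReLU(inplace=True))
    return nn.Sequential(*layers)

class UNetGenerator(nn.Module):
    """8-layer encoder / 8-layer decoder with skip connections, 512x512 in and out."""

    def __init__(self, in_ch=4, out_ch=2):
        super().__init__()
        enc_chs = [64, 128, 256, 512, 512, 512, 512, 512]
        dec_chs = [512, 512, 512, 512, 256, 128, 64, out_ch]
        self.encoder = nn.ModuleList()
        prev = in_ch
        for i, c in enumerate(enc_chs):
            self.encoder.append(enc_block(prev, c, norm=(i != 0)))  # no norm on input layer
            prev = c
        self.decoder = nn.ModuleList()
        for i, c in enumerate(dec_chs):
            cin = prev if i == 0 else dec_chs[i - 1] + enc_chs[-(i + 1)]  # skip concat
            last = i == len(dec_chs) - 1
            self.decoder.append(dec_block(cin, c, norm=not last, act=not last))
            prev = c

    def forward(self, x):
        skips = []
        for enc in self.encoder:
            x = enc(x)
            skips.append(x)
        for i, dec in enumerate(self.decoder):
            if i > 0:
                x = torch.cat([x, skips[-(i + 1)]], dim=1)  # skip connection from encoder
            x = dec(x)
        return x
```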
2.4 Experiments
2.4.1 Datasets
A neural network based approach naïvely requires pairs of portraits with and without perspective distortion, ideally including diverse subjects, expressions, illuminations, and head poses captured from a variety of distances. Although the Caltech Multi-Distance Portrait dataset (CMDP) [26] provides 53 subjects captured at 7 distances, these images were not captured simultaneously, so both camera pose and facial expression change slightly across distances for each subject; the images therefore cannot serve as pairs that differ only in perspective. As a result, we construct our own training dataset, described next.
Training Data Acquisition and Rendering. Since no database of paired portraits differing only in perspective exists, we generate a novel training dataset in which we control the subject-to-camera distance, head pose, and illumination, ensuring that the image differences are caused only by perspective distortion. Our synthetic training corpus is rendered from 3D head models acquired by two scanning systems. The first is a light-stage [59, 113] scanning system which produces pore-level 3D head models for photo-real rendering. Limited by the post-processing time, cost, and number of individuals that can be rapidly captured, we also employ a second capture system engineered for rapid throughput. In total, we captured 15 subjects with well-defined expressions in the high-fidelity system, generating 307 individual meshes, and 200 additional subjects in the rapid-throughput system. Since we have the ground-truth 3D face geometry and the corresponding reflectance maps, it is easy during rendering to trace the 2D projections of the same surface point and thereby compute the flow maps for network training and evaluation.
We rendered synthetic portraits using a variety of camera distances, head poses, and incident illumination conditions. We sample distances from 23cm to 160cm, with corresponding 35mm-equivalent focal lengths from 18mm to 128mm to keep the framing the same. We also randomly sample candidate head poses in the range of -45° to +45° in pitch, yaw, and roll. We used 107 different image-based environments for global illumination, combined with point lights. With random combinations of camera distances, head poses, and illuminations, we generate a total of 35,654 pairs of distorted/undistorted portraits along with forward flow maps.

To supplement this data with additional identities, we also generate portraits from the 3D facial scans of the BU-4DFE dataset [199] after randomly sampling candidate head poses in the range of -10° to +10° in pitch and yaw. Out of 58 female and 43 male subjects in total, we trained using 38 female and 25 male subjects, keeping the rest aside as test data, which yields 17,000 additional training pairs and expands the diversity of identities. Fig. 2.6 shows examples of our training data.
Figure 2.6: Training dataset.
The left side shows synthetic images generated from the BU-4DFE dataset [199]; from left to right, the camera-to-subject distances are 24cm, 30cm, 52cm, and 160cm. The right side shows synthetic images rendered from our high-resolution 3D models; from left to right, the camera-to-subject distances are 22cm, 30cm, 53cm, and 160cm.

Test Data Acquisition. To demonstrate that our system scales well to real-world portraits, we also devised a two-camera beam-splitter capture system that enables simultaneous photography of a subject at two different distances. As shown in Fig. 2.1 (a), we set up a beam splitter (50% reflection, 50% transmission) at 45° along a metal rail, such that a first DSLR camera was placed at the canonical fixed distance of 1.6m from the subject, with a 180mm prime lens to image the subject directly, while a second DSLR camera was placed at a variable distance of 23cm to 1.37m to image the subject's reflection off the beam splitter with a 28mm zoom lens. With careful geometric and color calibration, the two hardware-synced cameras were able to capture nearly ground-truth portrait pairs of real subjects, both with and without perspective distortion. (More details can be found in the appendix.)
Before feeding an input photo to our framework, we first detect its face bounding box using the method of [175] and then segment out the head region with the segmentation network of [106], trained with modified portrait data from [150]. The input is then scaled and aligned to match the training data format. In order to handle images of arbitrary resolution, we pre-process the segmented images to a uniform size of 512 × 512: the input image is first scaled so that its detected inter-pupillary distance matches a target length, computed by averaging that of the ground-truth images, and we then crop the image to 512 × 512 while maintaining the alignment of the inner corner of the right eye. Images too small to be cropped are padded with black.
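A rough sketch of this normalization step, assuming pupil positions are available from the landmark detector; the target inter-pupillary distance and the anchor position for the right eye are made-up illustrative constants, not the dissertation's values.

```python
import cv2
import numpy as np

def normalize_portrait(img, left_eye, right_eye, target_ipd=100.0, out_size=512,
                       eye_anchor=(200, 256)):
    """Scale so the inter-pupillary distance matches target_ipd, then crop/pad to 512x512.

    target_ipd and eye_anchor (where the right eye lands in the crop) are
    illustrative values only.
    """
    ipd = np.linalg.norm(np.asarray(right_eye, float) - np.asarray(left_eye, float))
    scale = target_ipd / max(ipd, 1e-6)
    scaled = cv2.resize(img, None, fx=scale, fy=scale, interpolation=cv2.INTER_LINEAR)

    # Place the scaled right eye at a fixed anchor; pad with black where needed.
    ex, ey = int(right_eye[0] * scale), int(right_eye[1] * scale)
    x0, y0 = ex - eye_anchor[0], ey - eye_anchor[1]
    out = np.zeros((out_size, out_size, img.shape[2]), dtype=img.dtype)
    sy0, sx0 = max(0, y0), max(0, x0)
    sy1 = min(scaled.shape[0], y0 + out_size)
    sx1 = min(scaled.shape[1], x0 + out_size)
    out[sy0 - y0:sy1 - y0, sx0 - x0:sx1 - x0] = scaled[sy0:sy1, sx0:sx1]
    return out
```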
2.4.2 Results and Comparisons
Face Image Undistortion. In Fig. 2.7, Fig. 2.8 and Fig. 2.9 we show the undistortion results of ours
compared to Fried et al.[51]. To better visualize the reconstruction error, we also show the error map
17
Figure 2.7: Comparisons on undistortion results of our beam splitter system.
From left to the right : inputs, results of Fried et al. [51], error maps of Fried et al. [51] compared to
references, ours results, error map of our results compared to references, reference.
18
Figure 2.8: Comparisons on undistortion results of synthetic data generated from BU-4DFE dataset.
From left to the right : inputs, results of Fried et al. [51], ours results, ground truth, error map of Fried et
al. [51] compared to ground truth, error map of our results compared to ground truth.
compared to groundtruth or references in Fig. 2.7 and Fig. 2.8. We perform histogram equalization before
computing the error maps. Our results are visually more accurate than Fried et al.[51] especially on the
face boundaries. To numerically evaluate our undistortion accuracy, we compare with Fried et al.[51] the
average error over 1000 synthetic pair from BU-4DFE dataset. With an average intensity error of 0.39 we
significantly outperform Fried et al.[51] which has an average intensity error of 1.28. In Fig. 2.9, as we do
not have references or ground truth, to better visualize the motion of before-after undistortion, we replace
theg channel of input withg channel of result image to amplify the difference.
Camera Parameter Estimation. Under the assumption of consistent framing (keeping the head at the same scale in all photos), distance is equivalent to focal length up to a scalar. The scalar converting distance to focal length is s = 0.8: when taking a photo at a 160cm camera-to-subject distance, a 128mm focal length should be used to achieve the framing used in this paper. Thus, as long as we can predict an accurate distance for the input photo, we can directly obtain its 35mm-equivalent focal length. We numerically evaluate the accuracy of our camera distance prediction network by testing on 1000 synthetic distorted portraits generated from the BU-4DFE dataset. The mean error of the distance prediction is 8.2%, with a standard deviation of 0.083. We also evaluate the accuracy of the labeling. As the intervals mentioned in Section 2.3.2 are contiguous, some images may lie on the boundary between two neighboring intervals, so we regard a label prediction as correct if it is within one interval of the ground truth. Under this assumption, the labeling accuracy is 96.9%, which ensures reliable input for the cascaded networks. Fig. 2.10 shows the distance prediction probability curves of three images: for each of them we densely sampled query distances along the whole distance range, and the classifier output changes monotonically. We tested on 1000 images and found that on average the transitivity property holds in 98.83% of cases.
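As a small worked example of the distance-to-focal-length conversion under the constant-framing assumption (s = 0.8):

```python
def equivalent_focal_length_mm(distance_cm, s=0.8):
    """35mm-equivalent focal length implied by the camera-to-subject distance."""
    return s * distance_cm

# Values consistent with the text: 160cm -> 128mm, 23cm -> ~18mm,
# 17.4cm -> ~14mm, 130cm -> ~105mm.
assert equivalent_focal_length_mm(160) == 128.0
```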
2.4.3 Ablation Study
In Fig. 2.11, we compare a single network with the proposed cascade network. The results show that with FlowNet as a prior, the recovered texture is sharper, especially on boundaries, and large holes and missing textures are more likely to be inpainted properly. Fig. 2.12 demonstrates the effectiveness of the label channel introduced in FlowNet: the results without the label channel are more distorted than those with the label as input, especially in the proportions of the nose, the mouth region, and the face rim.
2.4.4 Applications
Face Verification. Our facial undistortion technique can improve the performance of face verification, which we test using the common face recognition system OpenFace [12]. We synthesized 6,976 positive (same identity) and 6,980 negative (different identity) pairs from the BU-4DFE dataset [199] as test data. In each pair (A, B), we rendered image A as a near-distance portrait with perspective distortion, while we rendered B at the canonical distance of 1.6m to minimize distortion. This matches the setting of most face verification security systems, which retrieve the nearest neighbor from a database. We evaluated face verification performance on the raw data (A, B) and on the data (N(A), B) and (N(A), N(B)), in which perspective distortion was removed using our method (denoted N). Verification accuracy and receiver operating characteristic (ROC) comparisons are shown in Table 2.1 and Fig. 2.13.
Input                          mean     std
Raw input (A, B)               0.9137   0.0090
Undistorted input (N(A), B)    0.9473   0.0067

Table 2.1: Comparison of face verification accuracy for images with and without our undistortion as pre-processing. Accuracy is reported over random 10-folds of the test data (mean and standard deviation).
Landmark Detection Enhancement. We use the state-of-the-art facial landmark tracker OpenPose [30] on 6,539 renderings from the BU-4DFE dataset [199], generated as previously described, where each input image is rendered at a short camera-to-subject distance with significant perspective distortion. We either apply landmark detection directly to the raw image, or apply it to the image undistorted by our network and then map the landmarks back onto the raw input using the flow from our network. Landmark detection succeeds on 100% of our pre-processed images, to which domain alignment has been applied, while it fails on 9.0% of the original perspective-distorted portraits. For quantitative comparison, we evaluate the landmark accuracy using a standard metric, the Normalized Mean Error (NME) [202]. Given the ground-truth 3D facial geometry, we can find the ground-truth 2D landmark locations of the inputs. For images with successful detection on both the raw and the undistorted portraits, our method produces lower landmark error, with a mean NME of 4.4% (undistorted images) compared to 4.9% (raw images). Fig. 2.14 shows the cumulative error curves, showing a clear improvement in facial landmark detection for portraits undistorted using our approach.
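For reference, a minimal NME computation. Normalizing by the inter-ocular distance is a common convention and is an assumption here, since the exact normalizer of [202] is not restated in the text.

```python
import numpy as np

def normalized_mean_error(pred, gt, left_eye_idx, right_eye_idx):
    """pred, gt: (N, 2) landmark arrays; normalizer: inter-ocular distance (assumed)."""
    d_norm = np.linalg.norm(gt[right_eye_idx] - gt[left_eye_idx])
    return np.mean(np.linalg.norm(pred - gt, axis=1)) / max(d_norm, 1e-8)
```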
Face Reconstruction. One difficulty in reconstructing highly distorted faces is that the boundaries can be severely self-occluded (e.g., disappearing ears or occlusion by the hair), which is a common problem in 3D face reconstruction methods regardless of whether the method is based on 2D landmarks or texture. Fig. 2.15 shows that processing a near-range portrait with our method in advance can significantly improve 3D face reconstruction. The 3D facial geometry is obtained by fitting a morphable model (FaceWarehouse [28]) to 2D facial landmarks. Using the original perspective-distorted image as input, the reconstructed geometry appears distorted, while applying our technique as a pre-processing step retains both identity and geometric details. We show the error maps of the 3D geometry compared to ground truth, demonstrating that our method applied as a pre-processing step improves reconstruction accuracy compared with the baseline approach without any perspective-distortion correction.
Limitations. One limitation of our work is that the proposed approach does not generalize to lens distortions, as our synthetic training dataset, rendered with an ideal perspective camera, does not include this artifact. Similarly, our current method is not explicitly trained to handle portraits with large occlusions or extreme head poses. We plan to resolve both of these limitations in future work by augmenting the training examples with lens distortions, large facial occlusions, and varied head poses. Another future avenue is to investigate end-to-end training of the cascaded network, which could further boost the performance of our approach but would require fully-differentiable image warping.
Figure 2.9: Evaluation and comparisons on a variety of datasets, including in-the-wild images.
(a) inputs; (b) Fried et al. [51]; (c) ours; (d) blend of (a) and (b) for better visualization of the undistortion;
(e) blend of (a) and (c). Shaded portraits indicate failure cases of Fried et al. [51].
Figure 2.10: Distance prediction probability curves for three different input portraits, with the query
distance sampled densely along the whole range (query distance in cm vs. probability).
Ground-truth vs. predicted distances for the three portraits: gt = 21.76, pred = 19.16; gt = 80.90,
pred = 77.95; gt = 51.41, pred = 52.65.
Figure 2.11: Ablation analysis on cascade network.
In each triplet, from left to right: inputs, results of a single image-to-image network similar to [76], and
results of the cascade network including FlowNet and CompletionNet.
Figure 2.12: Ablation analysis on attaching the label as an input to FlowNet.
In each triplet, from left to right: inputs, results from the network without the label channel, and results of
our proposed network.
Figure 2.13: Receiver operating characteristic (ROC) curves for face verification performance on raw
input and on input undistorted using our method (false positive rate vs. true positive rate).
Raw input (A, B) is compared to undistorted input (N(A), B).
Figure 2.14: Cumulative error curves for facial landmark detection on unnormalized images and on
images normalized by our method (normalized mean error vs. cumulative percentage).
The metric is the normalized mean error (NME).
Figure 2.15: Comparing 3D face reconstruction from portraits without and with our undistortion tech-
nique.
(a) and (c) are heavily distorted portraits together with the 3D meshes fitted using their landmarks; (b)
and (d) are the undistorted results of (a) and (c) with 3D reconstructions based on them. Gray meshes show
reconstructed facial geometry and color-coded meshes show reconstruction error.
Chapter 3
Volumetric Reconstruction
3.1 Introduction
In an era where immersive technologies and sensor-packed autonomous systems are becoming increasingly
prevalent, our ability to create virtual 3D content at scale goes hand-in-hand with our ability to digitize and
understand 3D objects in the wild. If digitizing an entire object in 3D was as simple as taking a picture,
there would be no need for sophisticated 3D scanning devices, multi-view stereo algorithms, or tedious
capture procedures, where a sensor needs to be moved around.
For certain domain-specific objects, such as faces, human bodies, or known man-made objects, it is
already possible to infer relatively accurate 3D surfaces from images with the help of parametric models,
data-driven techniques, or deep neural networks. Recent 3D deep learning advances have shown that
general shapes can be inferred from very few images and sometimes even a single input. However, the
resulting resolutions and accuracy are typically limited, due to ineffective model representations, even for
domain specific modeling tasks.
We propose a new Pixel-aligned Implicit Function (PIFu) representation for 3D deep learning for the
challenging problem of textured surface inference of clothed 3D humans from a single or multiple input
images. While most successful deep learning methods for 2D image processing (e.g., semantic segmen-
tation [149], 2D joint detection [165], etc.) take advantage of “fully-convolutional” network architectures
Figure 3.1: End-to-end deep learning based volumetric reconstruction.
Our approach allows recovery of high-resolution 3D textured surfaces of clothed humans from a single
input image (top row, single-view PIFu) or from multi-view input images (bottom row, multi-view PIFu).
Our approach can digitize intricate variations in clothing, such as wrinkled skirts and high heels, including
complex hairstyles. The shape and textures can be fully recovered, including unseen regions such as the
back of the subject.
that preserve the spatial alignment between the image and the output, this is particularly challenging in
the 3D domain. While voxel representations [173] can be applied in a fully-convolutional manner, the
memory intensive nature of the representation inherently restricts its ability to produce fine-scale detailed
surfaces. Inference techniques based on global representations [60, 84, 3] are more memory efficient, but
cannot guarantee that details of input images are preserved. Similarly, methods based on implicit func-
tions [34, 127, 117] rely on the global context of the image to infer the overall shape, which may not align
with the input image accurately. On the other hand, PIFu aligns individual local features at the pixel level
to the global context of the entire object in a fully convolutional manner, and does not require high memory
usage, as in voxel-based representations. This is particularly relevant for the 3D reconstruction of clothed
subjects, whose shape can be of arbitrary topology, highly deformable and highly detailed.
Specifically, we train an encoder to learn individual feature vectors for each pixel of an image that
takes into account the global context relative to its position. Given this per-pixel feature vector and a
specified z-depth along the outgoing camera ray from this pixel, we learn an implicit function that can
classify whether a 3D point corresponding to this z-depth is inside or outside the surface. In particular, our
feature vector spatially aligns the global 3D surface shape to the pixel, which allows us to preserve local
details present in the input image while inferring plausible ones in unseen regions.
Our end-to-end and unified digitization approach can directly predict high-resolution 3D shapes of a
person with complex hairstyles and wearing arbitrary clothing. Despite the amount of unseen regions,
particularly for a single-view input, our method can generate a complete model similar to ones obtained
from multi-view stereo photogrammetry or other 3D scanning techniques. As shown in Figure 3.1, our
algorithm can handle a wide range of complex clothing, such as skirts, scarves, and even high heels, while
capturing high frequency details such as wrinkles that match the input image at the pixel level.
By simply adopting the implicit function to regress RGB values at each queried point along the ray,
PIFu can be naturally extended to infer per-vertex colors. Hence, our digitization framework also generates
a complete texture of the surface, while predicting plausible appearance details in unseen regions.
We demonstrate the effectiveness and accuracy of our approach on a wide range of challenging real-
world and unconstrained images of clothed subjects. We also show for the first time, high-resolution
examples of monocular and textured 3D reconstructions of dynamic clothed human bodies reconstructed
from a video sequence. We provide comprehensive evaluations of our method using ground truth 3D
scan datasets obtained using high-end photogrammetry. We compare our method with prior work and
demonstrate the state-of-the-art performance on a public benchmark for digitizing clothed humans.
3.2 Related Work
Silhouette-Based Multi-View Reconstruction. Visual hulls created from multi-view silhouette images
are widely used for multi-view reconstruction [176, 47, 35, 157, 211, 52, 104], since they are fast and easy
to compute and approximate the underlying 3D geometry well. Further progress has been made on the
visual-hull-based viewing experience [115], on smoothing the geometry with fewer cameras [50], and on real-
time performance [107]. Approaches have also emerged to recover geometric details using multi-view
constraints [158, 210, 183] and photometric stereo [179, 189]. Recently, Collet et al. [38] introduced a
system for high-quality free-viewpoint video by fusing multi-view RGB, IR and silhouette inputs.
Despite the speed and robustness of silhouette-based reconstruction methods, their reliance on visual
hulls implies bias against surface concavities as well as susceptibility to artifacts in invisible space.
Multi-View 3D Deep Learning. Multi-view convolutional neural networks (CNN) have been introduced
to learn deep features for various 3D tasks including shape segmentation [82], object recognition and
classification [159, 151, 135], correspondence matching [71], and novel view synthesis [160, 208, 125].
More closely related, a number of previous works apply multi-view CNNs to 3D reconstruction problems
in both unsupervised [140] and supervised approaches to obtain the final geometry directly [153, 36],
or indirectly via normal maps [112], silhouettes [156], or color images [162]. Inspired by the multi-view
stereo constraint, others [167, 86] have formulated ray consistency and feature projection in a differentiable
manner, incorporating this formulation into an end-to-end network to predict a volumetric representation
of a 3D object.
Hartmann et al. [65] propose a deep learning based approach to predict the similarity between the
image patches across multiple views, which enables 3D reconstruction using stereopsis. In contrast, our
approach aims for a different and more challenging task: predicting the per-point probability of lying on
the reconstructed surface, and directly connecting the 3D volume to its 2D projections on the image planes.
Closer to our work, Ji et al. [79] propose a learned metric to infer the per-voxel probability of being on
the reconstructed surface in a volumetric shape representation. However, due to their reliance on multi-view
stereopsis, these methods [79, 65] fail to faithfully reconstruct textureless surfaces or to generate dense
reconstruction from sparse views. In addition, as both the input images and the output surface need to be
converted into volumetric representations, it remains difficult for prior methods to generate high-resolution
results. Our approach, on the other hand, can work on textureless surfaces and produce results with much
higher resolution by leveraging an implicit representation.
Additionally, Dibra et al. [45] propose a cross-modal neural network that captures parametric body
shape from a single silhouette image. However, this method can only predict naked body shapes in neutral
poses, while our approach generalizes well to dynamic clothed bodies in extreme poses.
Multi-View 3D Human Digitization. Multi-view acquisition methods are designed to produce a com-
plete model of a person and simplify the reconstruction problem, but are often limited to studio settings
and calibrated sensors. Early attempts are based on visual hulls [115, 176, 52, 47], which use silhouettes
from multiple views to carve out the visible areas of a capture volume. Reasonable reconstructions can be
obtained when large numbers of cameras are used, but concavities are inherently challenging to handle.
More accurate geometries can be obtained using multi-view stereo constraints [158, 210, 183, 54] or using
controlled illumination, such as multi-view photometric stereo techniques [179, 190]. Several methods use
parametric body models to further guide the digitization process [155, 57, 14, 73, 8, 3]. The use of motion
cues has also been introduced as additional priors [132, 197]. While it is clear that multi-view capture
techniques outperform single-view ones, they are significantly less flexible and deployable.
A middle ground solution consists of using deep learning frameworks to generate plausible 3D sur-
faces from very sparse views. [36] train a 3D convolutional LSTM to predict the 3D voxel representation
of objects from arbitrary views. [86] combine information from arbitrary views using differentiable unpro-
jection operations. [79] also uses a similar approach, but requires at least two views. All of these techniques
rely on the use of voxels, which is memory intensive and prevents the capture of high-frequency details.
Single-View 3D Human Digitization. Single-view digitization techniques require strong priors due to
the ambiguous nature of the problem. Thus, parametric models of human bodies and shapes [13, 108]
are widely used for digitizing humans from input images. Silhouettes and other types of manual anno-
tations [61, 207] are often used to initialize the fitting of a statistical body model to images. Bogo et
al. [23] proposed a fully automated pipeline for unconstrained input data. Recent methods involve deep
neural networks to improve the robustness of pose and shape parameters estimations for highly challenging
images [84, 131]. Methods that involve part segmentation as input [97, 123] can produce more accurate
fittings. Despite their capability to capture human body measurements and motions, parametric models
only produce a naked human body. The 3D surfaces of clothing, hair, and other accessories are fully ig-
nored. For skin-tight clothing, a displacement vector for each vertex is sometimes used to model some
level of clothing as shown in [5, 185, 3]. Nevertheless, these techniques fail for more complex topology
such as dresses, skirts, and long hair. To address this issue, template-free methods such as BodyNet [173]
learn to directly generate a voxel representation of the person using a deep neural network. Due to the high
memory requirements of voxel representations, fine-scale details are often missing in the output.
Texture Inference. When reconstructing a 3D model from a single image or multiple images, the texture
can be easily sampled from the input. However, the appearance in occluded regions needs to be inferred
in order to obtain a complete texture. Related to the problem of 3D texture inference are view-synthesis
approaches that predict novel views from a single image [208, 126] or multiple images [161]. Within the
context of texture mesh inference of clothed human bodies, [118] introduced a view synthesis technique
that can predict the back view from the front one. Both front and back views are then used to texture
the final 3D mesh, however self-occluding regions and side views cannot be handled. Akin to the image
inpainting problem [130], [120] inpaints UV images that are sampled from the output of detected surface
points, and [168, 64] infers per voxel colors, but the output resolution is very limited. [85] directly predicts
RGB values on a UV parameterization, but their technique can only handle shapes with known topology
and are therefore not suitable for clothing inference.
3.3 Deep Visual Hull
We start by looking at a sparse multi-view reconstruction setup. Given multiple views and their corre-
sponding camera calibration parameters as input, our method aims to predict a dense 3D field that encodes
the probabilistic distribution of the reconstructed surface. We formulate the probability prediction as a
classification problem. At a high level, our approach resembles the spirit of the shape-from-silhouette
method: reconstructing the surface according to the consensus from multi-view images on any 3D point
Figure 3.2: Network Architecture.
staying inside the reconstructed object. However, instead of directly using silhouettes, which only contain
limited information, we leverage deep features learned by a multi-view convolutional neural network.
Thus, we refer to our approach as the Deep Visual Hull (DVH).
As demonstrated in Figure 3.2, for each query point in the 3D space, we project it onto the multi-view
image planes using the input camera parameters. We then collect the multi-scale CNN features learned at
each projected location and aggregate them through a pooling layer to obtain the final global feature for the
query point. The per-point feature is then fed into a classification network to infer its probabilities of lying
inside and outside the reconstructed object, respectively. As our method outputs a dense probability field,
the surface geometry can be faithfully reconstructed from the field using marching cubes.
3.3.1 Network Architecture
Our network consists of two parts: one feature extraction network that learns discriminative features for
each query point in the 3D space, and one classification network that consumes the output of the preceding
network and predicts the per-point probabilities of lying inside and outside the reconstructed body. Both
networks are trained in an end-to-end manner.
Feature Extraction. The feature extraction network takes multi-view images along with their corre-
sponding camera calibration parameters and 3D query points as input. The multi-view images are first
passed to a shared-weight fully convolutional network, whose building block includes a convolutional
layer, a ReLU activation layer, and a pooling layer. Batch normalization [75] is utilized in each convolu-
tional layer.
We then associate each query point $p_i$ with its features by projecting it onto the multi-view image
planes. Let $q_{ij}$ denote $p_i$'s projection onto image plane $j$. As shown in Figure 3.2, we track each $q_{ij}$
throughout the feature maps at each level of convolutional layers. The features retrieved from each layer
at the projected location are concatenated to obtain a single-view feature vector $F_{ij}$.
Since the view projection has floating-point precision, ambiguity may arise for feature extraction if the
projected point lies on the boundary between two adjacent pixels. To address this issue, at each level of the
feature maps, we perform bilinear interpolation on the nearest four pixels according to the local coordinate
of the projected location. It is worth mentioning that by applying bilinear interpolation, our method further
increases the receptive field of the feature vector at each layer, and makes the network more robust against
boundary points around the silhouette. If the projection of a query point is out of scope of the input image,
we fill its feature vector with zeros and do not include it in the back propagation.
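The following Python sketch (assuming PyTorch; all names are illustrative rather than our exact implementation) illustrates this per-view step: a query point is projected with the camera parameters, the feature map is sampled bilinearly at the projection, and points falling outside the image receive zero features.

```python
import torch
import torch.nn.functional as F

def sample_pixel_aligned_features(feat, K, Rt, points, image_size):
    """Project 3D query points into one view and bilinearly sample its
    feature map at the projected locations.

    feat:   (C, H, W) feature map of view j
    K, Rt:  (3, 3) intrinsics and (3, 4) extrinsics of view j
    points: (N, 3) query points in world coordinates
    image_size: (W_img, H_img) of the original input image
    Returns (N, C) features; out-of-image points get zeros.
    """
    n = points.shape[0]
    homo = torch.cat([points, torch.ones(n, 1)], dim=1)        # (N, 4)
    cam = Rt @ homo.T                                          # (3, N) camera coords
    uv = K @ cam                                               # (3, N)
    uv = uv[:2] / uv[2:].clamp(min=1e-6)                       # pixel coords (2, N)

    # Normalize to [-1, 1] for grid_sample.
    w, h = image_size
    grid = torch.stack([uv[0] / w * 2 - 1, uv[1] / h * 2 - 1], dim=-1)  # (N, 2)
    inside = (grid.abs() <= 1).all(dim=-1).float().unsqueeze(1)         # (N, 1)

    sampled = F.grid_sample(feat[None], grid[None, :, None, :],
                            mode='bilinear', align_corners=False)       # (1, C, N, 1)
    sampled = sampled[0, :, :, 0].T                                      # (N, C)
    return sampled * inside   # zero out projections that fall off the image
```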
Scale-invariant Symmetric Pooling. After obtaining the feature vector $F_{ij}$ from each view $j$, one key
module must effectively aggregate these view-dependent signatures. However, the viewing distance and
focal length may differ for each camera, and so the scales of projections of the same 3D volume may vary
significantly from viewpoint to viewpoint. As a result, features on the same level of convolutional layers
may have different 3D receptive fields across different views. Therefore, direct element-wise pooling on
view-dependent features may not be effective, as it could be operated on mismatched scales.
To resolve this issue, we introduce shared-weight MLP layers before the pooling operation so that
multi-scale features will be more uniformly distributed to all element entries, enabling the follow-up pool-
ing module to be feature scale invariant. Then, we apply a permutation invariant pooling module on the
output feature vectors of the MLP layers. The outputs of the pooling module are the final feature vector
associated with each query point.
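A minimal sketch of this pooling module is shown below, assuming PyTorch; the layer sizes and the choice of max pooling as the permutation-invariant operator are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ViewPooling(nn.Module):
    """Shared-weight MLP applied to each per-view feature, followed by a
    permutation-invariant pooling over views."""

    def __init__(self, in_dim=256, out_dim=256):
        super().__init__()
        self.shared_mlp = nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.ReLU(),
            nn.Linear(out_dim, out_dim), nn.ReLU(),
        )

    def forward(self, view_feats):
        """view_feats: (V, N, in_dim) features of N query points from V views.
        Returns (N, out_dim) pooled, view-independent features."""
        redistributed = self.shared_mlp(view_feats)   # same weights for every view
        pooled, _ = redistributed.max(dim=0)          # permutation-invariant pooling
        return pooled

# Usage: pool features of 1,000 query points seen from 4 views.
pooled = ViewPooling()(torch.randn(4, 1000, 256))
print(pooled.shape)   # torch.Size([1000, 256])
```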
Figure 3.3: Classification boundary.
Points within a distance $\tau$ on either side of the surface boundary are treated as near-surface; the remaining
regions are labeled as inside and outside.
Classification Network. After obtaining a feature vector for a query point, we employ a classification
network to infer its probability of being on the reconstructed surface. A simple structure consisting of
multiple fully connected layers is used for this classification task. In particular, we predict two labels
$(P_{in}, P_{out})$ for each point, where $P_{in}$ and $P_{out}$ stand for the probability of the 3D point being inside and
outside the reconstructed object, respectively. For a query point $p$ and a ground-truth mesh $M$, if $p$ is
inside $M$, we mark its labels as $(1, 0)$; if $p$ lies on the surface, it is marked as $(1, 1)$; otherwise, $p$ is marked
as $(0, 1)$.
In reality, only very few sample points lie exactly on the surface. To better capture the surface, we
relax the criteria for determining the inside/outside labels. As shown in Figure 3.3, in addition to the
points inside the surface, we also include those outside points whose distance to the surface is below a
threshold $\tau$ ($\tau$ is set to 1 cm) and mark their $P_{in}$ label as 1. Similarly, we apply the same threshold to mark
$P_{out}$. Therefore, points in the near-surface region are labeled as $(1, 1)$. We predict the two labels
independently and train the network using a sigmoid cross-entropy loss. Therefore, the predicted values of
$P_{in}$ and $P_{out}$ range from 0 to 1, where a larger value indicates a higher probability. More details of the
network design are provided in the supplementary materials.
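For clarity, the relaxed labeling rule can be written compactly as below; the sketch assumes signed point-to-mesh distances in metres with negative values inside, and is illustrative rather than the exact implementation.

```python
import numpy as np

def inside_outside_labels(signed_dist, tau=0.01):
    """Relaxed (P_in, P_out) training labels from signed point-to-mesh
    distances (negative inside, metres): points within tau of the surface
    are labeled (1, 1)."""
    p_in = (signed_dist < tau).astype(np.float32)    # inside, or outside but near
    p_out = (signed_dist > -tau).astype(np.float32)  # outside, or inside but near
    return np.stack([p_in, p_out], axis=-1)          # (N, 2)

# Example: deep inside, near-surface (both sides), and deep outside points.
print(inside_outside_labels(np.array([-0.10, -0.005, 0.005, 0.10])))
# [[1. 0.] [1. 1.] [1. 1.] [0. 1.]]
```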
3.3.2 Training
As our approach aims to predict a dense probability field, for each 3D mesh it is necessary to generate a
large number of query points for training the network. However, uniformly sampling the 3D space would
be prohibitive in terms of computational cost. In fact, we only care about points that are near the final
reconstructed surface. We therefore adopt an adaptive sampling strategy to emphasize such points. For
each ground-truth mesh $M$, we first generate a regular point grid of resolution $256^3$ filling the space of an
enlarged (1.5 times) bounding box of $M$. We compute signed distances of the grid points with the method
given by [193]. We then calculate the largest distance $l$ from any interior grid point to $M$'s surface:
$l = |\min_i \mathrm{dist}(t_i, M)|$.
To select points that are more concentrated around the surface of $M$, we use Monte Carlo sampling
to keep those grid points $t_i$ whose distance $|\mathrm{dist}(t_i, M)|$ follows the Gaussian distribution
$\mathcal{N}(\mu = 0, \sigma = l)$. For each combination of multi-view images and camera matrices that will appear
in the training, we augment the data by first reconstructing the visual hull from the input views, and then
randomly sampling more points inside the visual hull while ensuring that the newly added points are equally
distributed inside and outside the ground-truth mesh $M$. We stop adding samples when the total number
of query points for each $M$ reaches 100,000.
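A minimal sketch of this Gaussian-weighted selection of grid points is given below; it assumes precomputed signed distances (negative inside) and omits the visual-hull-based augmentation, so it is illustrative only.

```python
import numpy as np

def sample_near_surface(grid_points, signed_dist, n_samples, rng=None):
    """Monte Carlo selection of grid points concentrated near the surface:
    a point with distance d is kept with probability proportional to the
    Gaussian density N(0, sigma=l) evaluated at d, where l is the largest
    interior distance."""
    rng = rng or np.random.default_rng()
    l = np.abs(signed_dist[signed_dist < 0].min())      # largest interior distance
    weights = np.exp(-0.5 * (signed_dist / l) ** 2)     # unnormalized Gaussian weights
    weights /= weights.sum()
    idx = rng.choice(len(grid_points), size=n_samples, replace=False, p=weights)
    return grid_points[idx]
```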
We train the network using various combinations of camera views. For a certain number of views (3,
4 or 8), we train an individual model. We test each model using corresponding number of views. The
combinations of views are selected such that every adjacent two of them have a wide baseline and all the
views together cover the entire subject in a loop. The query points and their labels for each mesh are
pre-computed so as to save training time.
During training, we randomly draw 10,000 query points from the pre-computed set for each sample.
We directly take the color images of each view as inputs at their original resolutions, which vary from
1600×1200 to 1920×1080. For each batch, we only load images from one multi-view scene due to
the limited GPU memory. The network is optimized using the Adam optimizer. We start with a learning rate
of 0.00001 and decay it exponentially by a factor of 0.7 every 100,000 batches. We train the
network for 20 epochs on a single NVIDIA GTX 1080Ti GPU.
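For reference, this optimization schedule corresponds to roughly the following PyTorch setup (the model and training loop are placeholders; only the optimizer and decay schedule reflect the values above):

```python
import torch

# Adam with initial lr 1e-5, multiplied by 0.7 every 100,000 batches.
# `model` is a placeholder module; the forward/backward pass is omitted.
model = torch.nn.Linear(8, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.7)

for step in range(1, 300_001):
    # loss = ...; loss.backward(); optimizer.step(); optimizer.zero_grad()
    if step % 100_000 == 0:
        scheduler.step()   # exponential decay by 0.7 every 100k batches
```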
3.3.3 Surface Reconstruction
At test time, we first use our network to generate a dense probability field from the input images. As the
near-surface region only occupies little volume compared to the space it encloses, it is highly inefficient
to apply uniform sampling over the space. Therefore we employ an octree-based approach to achieve a
high-resolution reconstruction with a low computational cost. In particular, we first compute the center of
the scene according to the camera positions and their calibration parameters. A bounding box of length 3
meters on each side is placed at the scene center. We then fill the bounding box with a regular 3D grid.
By traversing each cube in the grid, we subdivide those cubes whose centers are surface points, or whose
vertices contain both inside and outside points, as recognized by our network. As our network predicts two
probabilities $(P_{in}, P_{out})$ per point, we propose to aggregate the two probabilities into one signed distance
for surface point prediction and later reconstruction of the entire surface.
As discussed in Section 3.3.2 and illustrated in Figure 3.3, $P_{in}$ and $P_{out}$ indicate the relaxed proba-
bilities of being inside and outside the object, respectively. Since $P_{in}$ and $P_{out}$ are independent events,
the probability of a point being near the surface can be simply computed as $P_{surf} = P_{in} \cdot P_{out}$. By
excluding the near-surface region (defined above), we define the probability of reliably staying inside the
object as $P'_{in} = P_{in} \cdot (1 - P_{out})$. Similarly, the probability of lying in the outer region with a
point-to-mesh distance larger than $\tau$ can be calculated as $P'_{out} = P_{out} \cdot (1 - P_{in})$. We compute all three
probabilities $\{P_{surf}, P'_{in}, P'_{out}\}$ for each grid point. We then determine the signed distance value for each
point by selecting the largest probability. In particular, we only assign three discrete signed distance val-
ues, $\{-1, 0, 1\}$, which represent inner, surface and outer points respectively. For instance, for one query
point, if its $P_{surf}$ is larger than the other probabilities, it will be assigned 0 and treated as a surface
point. A similar strategy is applied to determine inner and outer points and to assign their corresponding
signed distances.
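The aggregation rule can be summarized by the short sketch below, which maps the two predicted probabilities to a discrete signed distance in {-1, 0, +1}; names are illustrative.

```python
import numpy as np

def discrete_signed_distance(p_in, p_out):
    """Aggregate the predicted probabilities into a discrete signed distance
    in {-1, 0, +1} by picking the largest of P_surf, P'_in, P'_out.
    p_in, p_out: arrays of values in [0, 1]."""
    p_surf = p_in * p_out                # near-surface
    p_inside = p_in * (1.0 - p_out)      # reliably inside
    p_outside = p_out * (1.0 - p_in)     # reliably outside
    probs = np.stack([p_inside, p_surf, p_outside], axis=-1)
    # argmax index 0 -> inside (-1), 1 -> surface (0), 2 -> outside (+1)
    return np.argmax(probs, axis=-1) - 1

print(discrete_signed_distance(np.array([0.95, 0.9, 0.1]),
                               np.array([0.05, 0.8, 0.95])))   # [-1  0  1]
```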
We then generate a dense signed distance field in a coarse-to-fine manner. As discussed previously,
we subdivide those cubes marked by the network, further infer the signed distance for all the octant cubes,
Figure 3.4: Overview of our clothed human digitization pipeline.
Given an input image, a pixel-aligned implicit function (PIFu) predicts the continuous inside/outside prob-
ability field of a clothed human. Similarly, PIFu for texture inference (Tex-PIFu) infers an RGB value at
given 3D positions of the surface geometry with arbitrary topology.
and iterate until a target resolution is achieved. Finally, after obtaining the signed distance field, we use
the marching cubes algorithm to reconstruct the surface whose signed distance equals 0.
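For illustration, surface extraction from such a dense field can be performed with off-the-shelf tools, e.g. scikit-image's marching cubes and trimesh as the mesh container; the snippet below is a generic sketch on a toy signed distance field, not our octree-accelerated implementation.

```python
import numpy as np
from skimage import measure
import trimesh

def extract_mesh(field, level, spacing):
    """Extract the level-set surface of a dense 3D field (signed distance or
    occupancy) with marching cubes; `spacing` is the grid cell size."""
    verts, faces, normals, _ = measure.marching_cubes(field, level=level,
                                                      spacing=spacing)
    return trimesh.Trimesh(vertices=verts, faces=faces, vertex_normals=normals)

# Toy example: a sphere of radius 0.4 m inside a 1 m^3 volume on a 128^3 grid.
res = 128
xs = np.linspace(-0.5, 0.5, res)
grid = np.stack(np.meshgrid(xs, xs, xs, indexing='ij'), axis=-1)
sdf = np.linalg.norm(grid, axis=-1) - 0.4
mesh = extract_mesh(sdf, level=0.0, spacing=(xs[1] - xs[0],) * 3)
print(mesh.vertices.shape, mesh.faces.shape)
```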
3.4 Pixel-Aligned Implicit Function
Having established the Deep Visual Hull (DVH), we would like to apply it to more general and challenging
capture cases. Given a single image or multi-view images, our goal is to reconstruct the underlying 3D geometry
and texture of a clothed human while preserving the detail present in the image. To this end, we extend
the idea of DVH and introduce Pixel-Aligned Implicit Functions (PIFu) which is a memory efficient and
spatially-aligned 3D representation for 3D surfaces. An implicit function defines a surface as a level set of
a function $f$, e.g. $f(X) = 0$ [148]. This results in a memory efficient representation of a surface where the
space in which the surface is embedded does not need to be explicitly stored. The proposed pixel-aligned
implicit function consists of a fully convolutional image encoder $g$ and a continuous implicit function $f$
represented by multi-layer perceptrons (MLPs), where the surface is defined as a level set of
$f(F(x), z(X)) = s, \quad s \in \mathbb{R},$   (3.1)
where for a 3D point $X$, $x = \pi(X)$ is its 2D projection, $z(X)$ is the depth value in the camera coordinate
space, and $F(x) = g(I(x))$ is the image feature at $x$. We assume a weak-perspective camera, but extending to
perspective cameras is straightforward. Note that we obtain the pixel-aligned feature $F(x)$ using bilinear
sampling, because the 2D projection of $X$ is defined in a continuous space rather than a discrete one (i.e.,
pixel).
The key observation is that we learn an implicit function over the 3D space with pixel-aligned image
features rather than global features, which allows the learned functions to preserve the local detail present
in the image. The continuous nature of PIFu allows us to generate detailed geometry with arbitrary topol-
ogy in a memory efficient manner. Moreover, PIFu can be cast as a general framework that can be extended
to various co-domains such as RGB colors.
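A minimal sketch of the PIFu query in Eq. 3.1 is shown below, assuming PyTorch and a weak-perspective camera whose image-plane coordinates are already normalized to [-1, 1]; the layer sizes and names are illustrative, not the exact architecture described in Sec. 3.4.4.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelAlignedQuery(nn.Module):
    """Sketch of Eq. 3.1: bilinearly sample a pixel-aligned image feature F(x)
    at the 2D projection of a 3D point, concatenate its depth z(X), and
    evaluate an MLP that outputs an inside/outside probability."""

    def __init__(self, feat_dim=256, out_dim=1):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 1, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, out_dim), nn.Sigmoid(),
        )

    def forward(self, feat_map, points):
        """feat_map: (1, C, H, W) encoder output; points: (N, 3) with x, y
        in [-1, 1] (normalized image plane) and z the depth."""
        xy = points[None, :, None, :2]                        # (1, N, 1, 2)
        f = F.grid_sample(feat_map, xy, mode='bilinear',
                          align_corners=False)[0, :, :, 0].T  # (N, C)
        z = points[:, 2:3]                                    # (N, 1)
        return self.mlp(torch.cat([f, z], dim=1))             # (N, 1)

# Usage with a dummy feature map and random query points.
query = PixelAlignedQuery(feat_dim=256)
occ = query(torch.randn(1, 256, 128, 128), torch.rand(1000, 3) * 2 - 1)
print(occ.shape)   # torch.Size([1000, 1])
```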
Digitization Pipeline. Figure 3.4 illustrates the overview of our framework. Given an input image, PIFu
for surface reconstruction predicts the continuous inside/outside probability field of a clothed human, in
which iso-surface can be easily extracted (Sec. 3.4.1). Similarly, PIFu for texture inference (Tex-PIFu)
outputs an RGB value at 3D positions of the surface geometry, enabling texture inference in self-occluded
surface regions and shapes of arbitrary topology (Sec. 3.4.2). Furthermore, we show that the proposed
approach can handle single-view and multi-view input naturally, which allows us to produce even higher
fidelity results when more views are available (Sec. 3.4.3).
3.4.1 Single-view Surface Reconstruction
For surface reconstruction, we represent the ground truth surface as a 0.5 level-set of a continuous 3D
occupancy field:
$f_v^*(X) = \begin{cases} 1, & \text{if } X \text{ is inside the mesh surface} \\ 0, & \text{otherwise} \end{cases}$   (3.2)
We train a pixel-aligned implicit function (PIFu) $f_v$ by minimizing the average mean squared error:
$\mathcal{L}_V = \frac{1}{n} \sum_{i=1}^{n} \left| f_v(F_V(x_i), z(X_i)) - f_v^*(X_i) \right|^2,$   (3.3)
where $X_i \in \mathbb{R}^3$, $F_V(x) = g(I(x))$ is the image feature from the image encoder $g$ at $x = \pi(X)$, and
$n$ is the number of sampled points. Given a pair of an input image and the corresponding 3D mesh that
is spatially aligned with the input image, the parameters of the image encoder $g$ and of PIFu $f_v$ are jointly
updated by minimizing Eq. 3.3. As Bansal et al. [16] demonstrate for semantic segmentation, training an
image encoder with a subset of pixels does not hurt convergence compared with training with all the pixels.
During inference, we densely sample the probability field over the 3D space and extract the iso-surface
of the probability field at threshold 0.5 using the Marching Cubes algorithm [111]. This implicit surface
representation is suitable for detailed objects with arbitrary topology. Aside from PIFu’s expressiveness
and memory-efficiency, we develop a spatial sampling strategy that is critical for achieving high-fidelity
inference.
Spatial Sampling. The resolution of the training data plays a central role in achieving the expressive-
ness and accuracy of our implicit function. Unlike voxel-based methods, our approach does not require
discretization of ground truth 3D meshes. Instead, we can directly sample 3D points on the fly from the
ground truth mesh in the original resolution using an efficient ray tracing algorithm [181]. Note that this
operation requires water-tight meshes. In the case of non-watertight meshes, one can use off-the-shelf so-
lutions to make the meshes watertight [17]. Additionally, we observe that the sampling strategy can largely
influence the final reconstruction quality. If one uniformly samples points in the 3D space, the majority of
points are far from the iso-surface, which would unnecessarily weight the network toward outside predic-
tions. On the other hand, sampling only around the iso-surface can cause overfitting. Consequently, we
propose to combine uniform sampling and adaptive sampling based on the surface geometry. We first ran-
domly sample points on the surface geometry and add offsets drawn from a normal distribution $\mathcal{N}(0, \sigma)$ ($\sigma = 5.0$
cm in our experiments) for the x, y, and z axes to perturb their positions around the surface. We combine those
samples with uniformly sampled points within bounding boxes using a ratio of 16:1. We provide an
ablation study on our sampling strategy in the supplemental materials.
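The sampling mix described above can be sketched as follows, assuming a watertight mesh loaded with trimesh and units in metres; the helper names are illustrative.

```python
import numpy as np
import trimesh

def sample_training_points(mesh, n_total=10000, sigma=0.05, ratio=16):
    """Mix of surface-adaptive and uniform samples at a 16:1 ratio:
    surface points perturbed by Gaussian offsets (sigma = 5 cm), plus
    uniform samples inside the mesh bounding box."""
    n_surface = n_total * ratio // (ratio + 1)
    n_uniform = n_total - n_surface

    surf_pts, _ = trimesh.sample.sample_surface(mesh, n_surface)
    surf_pts = surf_pts + np.random.randn(n_surface, 3) * sigma

    lo, hi = mesh.bounds
    uniform_pts = np.random.uniform(lo, hi, size=(n_uniform, 3))

    pts = np.concatenate([surf_pts, uniform_pts], axis=0)
    labels = mesh.contains(pts).astype(np.float32)   # 1 inside, 0 outside
    return pts, labels
```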
3.4.2 Texture Inference
While texture inference is often performed on either a 2D parameterization of the surface [85, 63] or in
view-space [118], PIFu enables us to directly predict the RGB colors on the surface geometry by defining $s$
in Eq. 3.1 as an RGB vector field instead of a scalar field. This supports texturing of shapes with arbitrary
topology and self-occlusion. However, extending PIFu to color prediction is a non-trivial task as RGB
colors are defined only on the surface while the 3D occupancy field is defined over the entire 3D space.
Here, we highlight the modification of PIFu in terms of training procedure and network architecture.
Given sampled 3D points on the surface $X \in \Omega$, the objective function for texture inference is the
average L1 error of the sampled colors:
$\mathcal{L}_C = \frac{1}{n} \sum_{i=1}^{n} \left| f_c(F_C(x_i), z(X_i)) - C(X_i) \right|,$   (3.4)
where $C(X_i)$ is the ground truth RGB value at the surface point $X_i \in \Omega$ and $n$ is the number of sampled
points. We found that naively training $f_c$ with the loss function above severely suffers from overfitting.
The problem is that $f_c$ is expected to learn not only the RGB color on the surface but also the underlying 3D
surface of the object, so that $f_c$ can infer the texture of unseen surfaces with different pose and shape during
inference, which poses a significant challenge. We address this problem with the following modifications.
Figure 3.5: Multi-view Extension.
PIFu can be extended to support multi-view inputs by decomposing the implicit function $f$ into a feature
embedding function $f_1$ and a multi-view reasoning function $f_2$. $f_1$ computes a feature embedding from
each view in the 3D world coordinate system, which allows aggregation from arbitrary views. $f_2$ takes the
aggregated feature vector to make a more informed 3D surface and texture prediction.
First, we condition the image encoder for texture inference with the image features learned for surface
reconstruction, $F_V$. This way, the image encoder can focus on color inference of a given geometry even if
unseen objects have different shape, pose, or topology. Additionally, we introduce an offset $\epsilon \sim \mathcal{N}(0, d)$
to the surface points along the surface normal $N$ so that the color can be defined not only on the exact
surface but also in the 3D space around it. With the modifications above, the training objective function
can be rewritten as:
$\mathcal{L}_C = \frac{1}{n} \sum_{i=1}^{n} \left| f_c(F_C(x'_i, F_V), X'_{i,z}) - C(X_i) \right|,$   (3.5)
where $X'_i = X_i + \epsilon \cdot N_i$. We use $d = 1.0$ cm for all the experiments. Please refer to the supplemental
material for the network architecture for texture inference.
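The normal-offset sampling for texture supervision can be sketched as below, assuming a watertight trimesh mesh with per-vertex colors and units in metres; the nearest-vertex color lookup is a simplification of proper barycentric interpolation.

```python
import numpy as np
import trimesh

def sample_texture_points(mesh, n_points=5000, d=0.01):
    """Sample surface points with colors and jitter them along the surface
    normal by epsilon ~ N(0, d), d = 1 cm, so that color supervision also
    covers a thin shell around the surface."""
    pts, face_idx = trimesh.sample.sample_surface(mesh, n_points)
    normals = mesh.face_normals[face_idx]                    # (N, 3)
    eps = np.random.normal(0.0, d, size=(n_points, 1))       # per-point offset
    jittered = pts + eps * normals                           # X'_i = X_i + eps * N_i

    # Ground-truth colors are taken at the original (un-jittered) surface points;
    # here approximated by the color of one vertex of the sampled face.
    vc = np.asarray(mesh.visual.vertex_colors)[:, :3] / 255.0
    colors = vc[mesh.faces[face_idx][:, 0]]
    return jittered, colors
```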
3.4.3 Multi-View Case
Additional views provide more coverage about the person and should improve the digitization accuracy.
Our formulation of PIFu provides the option to incorporate information from more views for both surface
reconstruction and texture inference. We achieve this by using PIFu to learn a feature embedding for
every 3D point in space. Specifically, the output domain of Eq. 3.1 is now an $n$-dimensional vector space
$s \in \mathbb{R}^n$ that represents the latent feature embedding associated with the specified 3D coordinate and the
image feature from each view. Since this embedding is defined in the 3D world coordinate space, we can
aggregate the embeddings from all available views that share the same 3D point. The aggregated feature
vector can be used to make a more confident prediction of the surface and the texture.
Specifically, we decompose the pixel-aligned function $f$ into a feature embedding network $f_1$ and a
multi-view reasoning network $f_2$ as $f := f_2 \circ f_1$. See Figure 3.5 for an illustration. The first function
$f_1$ encodes the image feature $F_i(x_i)$, with $x_i = \pi_i(X)$, and the depth value $z_i(X)$ from each view point $i$ into
a latent feature embedding $\Phi_i$. This allows us to aggregate the corresponding pixel features from all the
views. Since the corresponding 3D point $X$ is shared by different views, each image can project $X$
into its own image coordinate system by $\pi_i(X)$ and $z_i(X)$. Then, we aggregate the latent features $\Phi_i$ by
an average pooling operation and obtain the fused embedding $\bar{\Phi} = \mathrm{mean}(\{\Phi_i\})$. The second function $f_2$
maps from the aggregated embedding $\bar{\Phi}$ to our target implicit fields (i.e., inside/outside probability for
surface reconstruction and RGB value for texture inference). The additive nature of the latent embedding
allows us to incorporate an arbitrary number of inputs. Note that a single-view input can also be handled
without modification in the same framework, as the average operation simply returns the original latent
embedding. For training, we use the same training procedure as in the aforementioned single-view case,
including the loss functions and the point sampling scheme. While we train with three random views, our
experiments show that the model can incorporate information from more than three views.
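A minimal sketch of this decomposition is given below, assuming PyTorch and pre-extracted pixel-aligned features per view; layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class MultiViewPIFu(nn.Module):
    """Sketch of f = f2 ∘ f1: f1 embeds each view's pixel-aligned feature and
    depth into a latent vector, the latents are averaged over views, and f2
    maps the fused embedding to the target field (occupancy here)."""

    def __init__(self, feat_dim=256, embed_dim=256):
        super().__init__()
        self.f1 = nn.Sequential(nn.Linear(feat_dim + 1, embed_dim), nn.ReLU(),
                                nn.Linear(embed_dim, embed_dim), nn.ReLU())
        self.f2 = nn.Sequential(nn.Linear(embed_dim, 128), nn.ReLU(),
                                nn.Linear(128, 1), nn.Sigmoid())

    def forward(self, per_view_feats, per_view_depths):
        """per_view_feats: (V, N, feat_dim); per_view_depths: (V, N, 1)."""
        phi = self.f1(torch.cat([per_view_feats, per_view_depths], dim=-1))  # (V, N, D)
        fused = phi.mean(dim=0)      # average pooling over views; V = 1 also works
        return self.f2(fused)        # (N, 1) inside/outside probability

# Works identically for a single view or several views.
net = MultiViewPIFu()
print(net(torch.randn(3, 500, 256), torch.randn(3, 500, 1)).shape)  # (500, 1)
```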
3.4.4 Implementation Details
Since the framework of PIFu is not limited to a specific network architecture, one can technically use any
fully convolutional neural network as the image encoder. For surface reconstruction, we found that stacked
hourglass [122] architectures are effective with better generalization on real images. The image encoder
for texture inference adopts the architecture of CycleGAN [209] consisting of residual blocks [81]. The
implicit function is based on a multi-layer perceptron whose layers have skip connections from the image
feature $F(x)$ and depth $z$, in the spirit of [34], to effectively propagate the depth information. Tex-PIFu takes
$F_C(x)$ together with the image feature for surface reconstruction, $F_V(x)$, as input. For multi-view PIFu,
we simply take an intermediate layer output as feature embedding and apply average pooling to aggregate
the embedding from different views.
3.5 Experiments
3.5.1 Datasets
A good training set of sufficient size is key to a successful deep learning model. However, existing datasets
of clothed body capture usually consist of only a few subjects, making them unsuitable for training a deep
neural network. The SURREAL dataset [174] contains a large number of synthetic humans, but it lacks
geometric details of clothing and is thus not suitable for our task. Since there are no large-scale datasets of
high-resolution clothed humans, our first attempt, to verify our Deep Visual Hull framework, is to generate
a synthetic dataset by rendering rigged and animated human character models from Mixamo [2] as seen
from multiple views. The characters share the same rig, so a variety of animations and human poses
could be rapidly synthesized for different figures dressed in many clothing types and styles. In total we
render images with 50 characters and 13 animations, for eight camera viewpoints with known projection
matrices. We use 43 characters and 10 animations for training. The remaining seven characters and three
animations are used for validation and testing.
Although only trained on synthetic data, our network can generalize to handle real footage from multi-
view body performance capture, as shown in Section 3.5.2. Also, because single-view reconstruction involves
larger unseen regions, it is crucial that we further improve the quality of our ground-truth models
and rendered images. This encourages us to enlarge our datasets and to acquire high-resolution, realistic
3D scans. We then collected photogrammetry data of 491 high-quality textured human meshes from
RenderPeople [139], covering a wide range of clothing, shapes, and poses, each consisting of about 100,000
triangles. We refer to this database as the High-Fidelity Clothed Human dataset. We randomly split it into a
training set of 442 subjects and a test set of 49 subjects. To efficiently render the digital humans, Lamber-
tian diffuse shading with surface normals and spherical harmonics is typically used due to its simplicity
and efficiency [174, 118]. However, we found that to achieve high-fidelity reconstructions on real images,
the synthetic renderings need to correctly simulate light transport effects resulting from both global and
local geometric properties such as ambient occlusion. To this end, we use a precomputed radiance transfer
technique (PRT) that precomputes visibility on the surface using spherical harmonics and efficiently rep-
resents global light transport effects by multiplying spherical harmonics coefficients of illumination and
visibility [154]. PRT only needs to be computed once per object and can be reused with arbitrary illumi-
nations and camera angles. Together with PRT, we use 163 second-order spherical harmonics of indoor
scenes from HDRI Haven [67] with random rotations around the y axis. We render the images by aligning
subjects to the image center using a weak-perspective camera model at an image resolution of 512×512.
We also rotate each subject over 360 degrees about the yaw axis, resulting in 360 × 442 = 159,120 images for
training. For the evaluation, we render 49 subjects from RenderPeople and 5 subjects from the BUFF data
set [204] using 4 views spaced every 90 degrees about the yaw axis. Note that we render the images without
backgrounds. We also test our approach on real images of humans from the DeepFashion dataset [105].
In the case of real data, we use an off-the-shelf semantic segmentation network together with Grab-Cut
refinement [143].
Figure 3.6: Camera setting for reported four-view results.
3.5.2 Multi-View Results and Comparisons
In this section, we evaluate our model on various datasets, including [176], [179], [158], as well as our
own synthetic data.
Qualitative Results. We first reconstruct results from four views on a grid of resolution $1024^3$, as
shown in Figure 3.7. All the results are generated directly from our pipeline without any post-processing
except edge collapse to reduce file sizes. All the results are generated from test cases. To validate the
accuracy of the reconstructed geometry, we colorize each vertex from the visible cameras using simple
cosine-weighted blending. Our rendering results could be further improved via recent real-time [46] or
offline [134] texturing approaches.
Figure 3.6 shows the camera setting for the results. From only four-view inputs with limited overlap
between each of them, our network reconstructs a watertight surface that resembles the subject’s geometry
and recovers reasonable local details. Even for ambiguous areas where no cameras have line-of-sight, our
network can still predict plausible shapes. We also present results on a sequence of challenging motion
performance from Vlasic et al. [176] in Figure 3.8. Even for the challenging poses and extreme occlusion,
our network can robustly recover a plausible shape.
Figure 3.7: Results reconstructed from four views.
Top to bottom rows: input multi-view images, reconstructed mesh, textured mesh, and error visualization.
From left to right, median mesh-to-scan distance: 0.90cm, 0.66cm, 0.85cm, 0.54cm, 0.59cm; mean
mesh-to-scan distance: 1.18cm, 0.88cm, 1.10cm, 0.65cm, 0.76cm.
Figure 3.8: Sequence results.
Top to bottom rows: multi-view images, reconstructed mesh, textured mesh, and error visualization.
From left to right, median mesh-to-scan distance: 0.94cm, 0.86cm, 0.82cm, 0.76cm, 0.85cm; mean
mesh-to-scan distance: 1.31cm, 1.27cm, 1.21cm, 1.06cm, 1.25cm
Figure 3.9: Reconstructions with different views.
Top to bottom rows: reconstructed mesh, textured mesh, and error visualization. Left to right columns:
three-view results, four-view results, and eight-view-view results, for both two test cases respectively.
Median mesh-to-scan distance: left subject: 0:84cm (three-view), 0:77cm (four-view), 0:45cm
(eight-view); right subject: 1:38cm (three-view), 1:06cm (four-view), 0:59cm (eight-view).
As our network is not limited by the number of views, we train and test our models with different
numbers of views. We test our model with three-view, four-view, and eight-view settings, selecting input
views incrementally. As shown in Figure 3.9, with more views, our network can predict more details, e.g.
facial shape and hairstyle. For most of the results shown in this chapter, we use a four-view setting, which
achieves the best balance between view-sparsity and reconstruction quality.
Quantitative Results. We evaluate our reconstruction accuracy by measuring Euclidean distance from
reconstructed surface vertices to the reference scan. For real world data, we use results given by [176]
and [158] as our references, which approximate the ground-truth surface using a much more advanced
capturing setup. We show visualizations for the mesh-to-scan distances and evaluate the distance statistics.
As shown in Figure 3.7, given inputs from various test sets, our network predicts accurate surfaces,
with a median mesh-to-scan distance below 0.9 cm for all examples. As shown in Figure 3.8, our network
also predicts accurate reconstructions for the challenging input image sequences, with median mesh-to-
scan distances below 0.95 cm. In Figure 3.9, we observe that the distance error decreases as more views are
available during network training. The median distance for the eight-view setting drops to less than half of
that for the three-view setting.
Comparisons. We compare our approach with existing methods using four-view input in Figure 3.10.
While traditional multi-view stereo PMVS [55] is able to reconstruct an accurate point cloud, it often fails
to produce complete geometry given wide-baseline (four views covering 360 degrees in this case) and texture-
less inputs. As a learning-based approach, SurfaceNet [79] reconstructs a more complete point cloud, but
still fails in regions with few correspondences due to the wide baseline. It remains difficult to reconstruct
a complete surface from the sparse point cloud results of PMVS and SurfaceNet. Although the visual hull [107]
based approach can reconstruct a complete shape, the reconstruction deviates significantly from the true
shape due to its inability to capture concavities. In contrast, our method is able to reconstruct a
complete model with the clothed human shape well sculpted, given as few as four views.
Figure 3.10: Comparisons on 4-view reconstruction.
Top to bottom rows: input 4-view images, PMVS, SurfaceNet, visual hull, and ours.
In terms of runtime, PMVS takes 3.2 seconds with 12 threads using four views. Since SurfaceNet is
not designed to reconstruct objects over 360 degrees, we run it on neighboring views four times and then
fuse the results to obtain a complete reconstruction. This process takes 15 minutes with one Titan X GPU.
For the visual hull, it takes 30 ms on a Titan X GPU at $512^3$ resolution with an octree implementation. Our
multi-view network takes 4.4 seconds at $256^3$ resolution, and 18 seconds at $512^3$ resolution with an octree
implementation on a GTX 1080 Ti. Since image feature extraction, pooling, point queries, octree traversal,
and marching cubes can all be executed in parallel, the performance of our method could potentially be
boosted further.
Figure 3.11: Qualitative single-view results on real images from the DeepFashion dataset [105].
Each example shows the input image, the reconstructed geometry, and the textured reconstruction. The
proposed Pixel-Aligned Implicit Function, PIFu, achieves a topology-free, memory-efficient, spatially-
aligned 3D reconstruction of the geometry and texture of clothed humans.
3.5.3 Single-View Results and Comparisons
We evaluate the PIFu reconstruction pipeline on a variety of datasets, including RenderPeople [139] and BUFF
[204], which have ground-truth measurements, as well as DeepFashion [105], which contains a diverse
variety of complex clothing.
                 RenderPeople                    BUFF
Methods          Normal   P2S     Chamfer        Normal   P2S     Chamfer
BodyNet          0.262    5.72    5.64           0.308    4.94    4.52
SiCloPe          0.216    3.81    4.02           0.222    4.06    3.99
IM-GAN           0.258    2.87    3.14           0.337    5.11    5.32
VRN              0.116    1.42    1.56           0.130    2.33    2.48
Ours             0.084    1.52    1.50           0.0928   1.15    1.14
Table 3.1: Quantitative evaluation on the RenderPeople and BUFF datasets for single-view reconstruction.
P2S and Chamfer distances are reported in cm; Normal is the L2 normal reprojection error.
Qualitative Results. In Figure 3.11, we present our digitization results using real-world input images
from the DeepFashion dataset [105]. We demonstrate that PIFu can handle a wide variety of clothing,
including skirts, jackets, and dresses. Our method can produce high-resolution local details, while inferring
plausible 3D surfaces in unseen regions. Complete textures are also inferred successfully from a single
input image, which allows us to view our 3D models from 360 degrees.
Quantitative Results. We quantitatively evaluate our reconstruction accuracy with three metrics. In the
model space, we measure the average point-to-surface Euclidean distance (P2S) in cm from the vertices
on the reconstructed surface to the ground truth. We also measure the Chamfer distance between the
reconstructed and the ground truth surfaces. In addition, we introduce the normal reprojection error to
measure the fineness of reconstructed local details, as well as the projection consistency from the input
image. For both reconstructed and ground truth surfaces, we render their normal maps in the image space
from the input viewpoint respectively. We then calculate the L2 error between these two normal maps.
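For reference, the point-based distance metrics can be approximated as in the following sketch, which evaluates nearest-neighbor distances between dense point samples of the two surfaces; the exact Chamfer averaging convention may differ from the one used in our evaluation.

```python
import numpy as np
from scipy.spatial import cKDTree

def p2s_and_chamfer(recon_pts, gt_pts):
    """Point-to-surface and symmetric Chamfer distances approximated on
    dense point samples of the two surfaces (same units as the inputs)."""
    tree_gt = cKDTree(gt_pts)
    tree_rc = cKDTree(recon_pts)
    d_rc_to_gt, _ = tree_gt.query(recon_pts)   # reconstruction -> ground truth
    d_gt_to_rc, _ = tree_rc.query(gt_pts)      # ground truth -> reconstruction
    p2s = d_rc_to_gt.mean()
    chamfer = 0.5 * (d_rc_to_gt.mean() + d_gt_to_rc.mean())
    return p2s, chamfer

# Toy usage on random point sets.
print(p2s_and_chamfer(np.random.rand(2000, 3), np.random.rand(2000, 3)))
```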
In Table 3.1 and Figure 3.12, we evaluate the reconstruction errors for each method on both the BUFF and
RenderPeople test sets. Note that while Voxel Regression Network (VRN) [77], IM-GAN [34], and our method
are retrained with the same High-Fidelity Clothed Human dataset we use for our approach, the reconstructions
of [118, 173] are obtained from their trained models as off-the-shelf solutions. Since single-view input
leaves the scale factor ambiguous, the evaluation is performed with the known scale factor for all the
approaches. In contrast to the state-of-the-art single-view reconstruction method using implicit functions
(IM-GAN) [31], which reconstructs a surface from one global feature per image, our method outputs pixel-
aligned, high-resolution surface reconstructions that capture hairstyles and wrinkles of the clothing. We
also demonstrate the expressiveness of our PIFu representation compared with voxels. Although VRN
and ours share the same network architecture for the image encoder, the higher expressiveness of the implicit
representation allows us to achieve higher fidelity.
Figure 3.12: Comparison with other human digitization methods from a single image (columns from left
to right: ours, VRN, IM-GAN, SiCloPe, BodyNet).
For each input image on the left, we show the predicted surface (top row), surface normal (middle row),
and the point-to-surface errors (bottom row).
Figure 3.13: Comparison with SiCloPe [118] on texture inference.
While texture inference via a view-synthesis approach suffers from projection artifacts, the proposed approach
does not, as it directly inpaints textures on the surface geometry.
In Figure 3.13, we also compare our single-view texture inferences with a state-of-the-art texture infer-
ence method on clothed human, SiCloPe [118], which infers a 2D image from the back view and stitches it
together with the input front-view image to obtain textured meshes. While SiCloPe suffers from projection
distortion and artifacts around the silhouette boundary, our approach predicts textures on the surface mesh
directly, removing projection artifacts.
Chapter 4
Deep Human Model
4.1 Introduction
3D human reconstruction has been explored for several decades in the field of computer vision and com-
puter graphics. Accurate methods based on stereo or fusion have been proposed using various types of sen-
sors [53, 170, 114, 121, 136, 194, 195], and several applications have become popular in sports, medicine
and entertainment (e.g., movies, games, AR/VR experiences). However, these setups require tightly con-
trolled environments. To date, full 3D human reconstruction with detailed geometry and appearance from
in-the-wild pictures (i.e., taken in natural conditions as opposed to laboratory environments) is still
challenging. Moreover, the lack of automatic rigging prevents animation-based applications.
Recent computer vision models have enabled the recovery of 2D and 3D human pose and shape es-
timation from a single image. However, they usually rely on representations that have limitations: (1)
skeletons [48] are kinematic structures that accurately represent 3D poses, but do not carry body shape
information; (2) surface meshes [83, 124, 196] can represent body shape geometry, but have topology
constraints; (3) voxels [173] are topology-free, but memory-costly with limited resolution, and need to
be rigged for animation. In this chapter, we introduce the ARCH (Animatable Reconstruction of Clothed
Humans) framework that possesses all benefits of current representations. In particular, we introduce a
learned model that has human body structure knowledge (i.e., body part semantics), and is trained with
humans in arbitrary poses.
Figure 4.1: ARCH: Animatable Reconstruction of Clothed Humans.
Given an image of a subject in arbitrary pose (left), ARCH creates an accurate and animatable avatar with
detailed clothing (center). As rigging and albedo are estimated, the avatar can be reposed and relit in new
environments (right).
First, 3D body pose and shape estimation can be inferred from a single image of a human in arbitrary
pose by a prediction model [196]. This initialization step is used for normalized-pose reconstruction of
clothed human shape within a canonical space. This allows us to define a Semantic Space (SemS) and a
Semantic Deformation Field (SemDF) by densely sampling 3D points around the clothed body surface and
assigning skinning weights. We then learn an implicit function representation of the 3D occupancy in the
canonical space based on SemS and SemDF, which enables the reconstruction of high-frequency details
of the surface (including clothing wrinkles, hair style, etc.) superior to the state of the art [119, 145, 173].
The surface representing a clothed human in a neutral pose is implicitly rigged in order to be used as
an animatable avatar. Moreover, a differentiable renderer is used to refine normal and color information
for each 3D point in space by Granular Render-and-Compare. Here, we regard each point as a sphere and
develop a new blending formulation based on the estimated occupancy. See Fig. 4.2 for an overview of the
framework.
In our experiments, we evaluate ARCH on the task of 3D human reconstruction from a single im-
age. Both quantitative and qualitative experimental results show ARCH outperforms state-of-the-art body
reconstruction methods on public 3D scan benchmarks and in-the-wild 2D images. We also show that
our reconstructed clothed humans can be animated by motion capture data, demonstrating the potential
applications for human digitization for animation.
Contributions. The main contributions are threefold: 1) we introduce the Semantic Space (SemS) and
Semantic Deformation Field (SemDF) to handle implicit function representation of clothed humans in ar-
bitrary poses, 2) we propose opacity-aware differentiable rendering to refine our human representation via
Granular Render-and-Compare, and 3) we demonstrate how reconstructed avatars can directly be rigged
and skinned for animation. In addition, we learn per-pixel normals to obtain high-quality surface details,
and surface albedo for relighting applications.
4.2 Related Work
3D clothed human reconstruction focuses on the task of reconstructing 3D humans with clothes. There
are multiple attempts to solve this task with video inputs [6, 7, 133, 4, 198], RGB-D data [201, 206]
and in multi-view settings [15, 56, 58, 177, 180, 186, 191, 21]. Though richer inputs clearly provide
more information than single images, the resulting pipelines impose stronger hardware requirements and
additional time costs in deployment. Recently, some progress [23, 62, 83, 94, 95, 97, 169, 196, 203] has
been made in estimating parametric human bodies from a single RGB image, yet it remains under-explored
to what extent 3D clothing details can be reconstructed from such inputs. In recent work [96, 98,
9], the authors learn to generate surface geometry details and appearance using 2D UV maps. While details can be learned, these methods cannot reconstruct loose clothing (e.g., dresses) or recover complex shapes such as hair or fine structures (e.g., shoe heels). Because clothing topology varies widely, volumetric reconstruction has great benefits in this scenario. For example, BodyNet [173] takes a person image as input and learns to reconstruct voxels of the person with additional supervision through body priors (e.g., 2D pose, 3D pose, part mask); PIFu [145], in contrast, assumes no body prior and learns an implicit surface function based on aligned image features, leading to more clothing detail but less robustness against pose variations.
In this paper, we incorporate body prior knowledge to transform people in arbitrary poses to the canon-
ical space, and then learn to reconstruct an implicit representation.
Differentiable rendering makes the rendering operation differentiable and uses it to optimize param-
eters of the scene representation. Existing approaches can be roughly divided into two categories: mesh
rasterization based rendering [33, 87, 102, 109, 171] and volume based rendering [74, 103]. For example,
OpenDR [109] and Neural Mesh Renderer [87] manually define approximated gradients of the rendering
operation to move the faces. SoftRasterizer [102] and DIB-R [33], in contrast, redefine the rasterization
as a continuous and differentiable function, allowing gradients to be computed automatically. For volume-
based differentiable rendering, [74] represents each 3D point as a multivariate Gaussian and performs
occlusion reasoning with grid discretization and ray tracing. Such methods require an explicit volume to
perform occlusion reasoning. [103] develops differentiable rendering for implicit surface representations
with a focus on reconstructing rigid objects.
In contrast, we use a continuous rendering function as in [102], but revisit it to handle opacity, and we
use geometric primitives at points of interest and optimize their properties.
4.3 Animatable Reconstruction of Clothed Humans
ARCH contains three components, after 3D body estimation by [196] (see Fig. 4.2): pose-normalization
using Semantic Space (SemS) and Semantic Deformation Field (SemDF), implicit surface reconstruction,
and refinement using a differentiable renderer by Granular Render-and-Compare (see Sec. 4.3.3).
Figure 4.2: ARCH overview.
The framework contains three components: i) estimation of correspondences between an input image
space and the canonical space, ii) implicit surface reconstruction in the canonical space from surface
occupancy, normal and color estimation, iii) refinement of normal and color through differentiable
rendering.
4.3.1 Semantic Space and Deformation Field
Our goal is to transform an arbitrary (deformable) object into a canonical space where the object is in a
predefined rest pose. To do so, we introduce two concepts: the Semantic Space (SemS) and the Semantic
Deformation Field (SemDF). SemS $\mathcal{S} = \{(p, s_p) : p \in \mathbb{R}^3\}$ is a space consisting of 3D points, where each point $p \in \mathcal{S}$ is associated to semantic information $s_p$ enabling the transformation operation. SemDF is a vector field represented by a vector-valued function $\mathcal{V}$ that accomplishes the transformation.
In computer vision and graphics, 3D human models have been widely represented by a kinematic
structure mimicking the anatomy that serves to control the pose, and a surface mesh that represents the
human shape and geometry. Skinning is the transformation that deforms the surface given the pose. It is
parameterized by skinning weights that individually influence body part transformations [108]. In ARCH,
we define SemS in a similar form, with skinning weights.
Assuming a skinned body template model $T$ in a normalized A-pose (i.e., the rest pose), its associated skeleton in the canonical space, and skinning weights $W$, SemS is then

$\mathcal{S} = \{(p, \{w_{i,p}\}_{i=1}^{N_K}) : p \in \mathbb{R}^3\}, \qquad (4.1)$

where each point $p$ is associated to a collection of skinning weights $\{w_{i,p}\}$ defined with respect to $N_K$ body parts (e.g., skeleton bones). In this paper, we approximate $\{w_{i,p}\}$ by retrieving the closest point $p'$ on the template surface to $p$ and assigning the corresponding skinning weights from $W$. In practice, we set a distance threshold to cut off points that are too far away from $T$.

In ARCH, SemDF actually performs an inverse-skinning transformation, putting a human in arbitrary pose back into its normalized pose in the canonical space. This extends standard skinning (e.g., Linear Blend Skinning or LBS [108]), usually applied to structured objects, to arbitrary 3D space, and enables transforming an entire space in arbitrary poses to the canonical space, as every point $p'$ can be expressed as a linear combination of points $p$ with skinning weights $\{w_{i,p}\}$.
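To make this concrete, the following is a minimal sketch (not the exact ARCH implementation) of the nearest-neighbor skinning-weight assignment just described; the names assign_space_weights, template_verts, template_weights and the 10 cm cutoff are illustrative assumptions.

import numpy as np
from scipy.spatial import cKDTree

def assign_space_weights(points, template_verts, template_weights, max_dist=0.1):
    # points: (P, 3) samples in canonical space; template_verts: (N_V, 3);
    # template_weights: (N_V, N_K) LBS weights W; max_dist: distance cutoff (assumed, in meters).
    tree = cKDTree(template_verts)
    dist, idx = tree.query(points)          # closest template vertex p' for each point p
    weights = template_weights[idx]         # copy the corresponding skinning weights
    valid = dist <= max_dist                # discard points that are too far away from T
    weights[~valid] = 0.0
    return weights, valid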
Following LBS, the canonical space of the human body is tied to a skeletal rig. The state of the rig is described by relative rotations $R = \{r_i\}_{i=1}^{N_K}$ of all skeleton joints $X = \{x_i\}_{i=1}^{N_K}$. Every rotation is relative to the orientation of the parent element in a kinematic tree. For a skeleton with $N_K$ body parts, $R \in \mathbb{R}^{3 N_K}$ and $X \in \mathbb{R}^{3 N_K}$. Given a body template model $T$ in rest pose with $N_V$ vertices, the LBS function $\mathcal{V}(v_i, X, R, W)$ takes as input the vertices $v_i \in T$, the joints $X$, a target pose $R$, and deforms every $v_i$ to the posed position $v'_i$ with skinning weights $W \in \mathbb{R}^{N_V \times N_K}$, namely,

$\mathcal{V}(v_i, X, R, W) = \sum_{k=1}^{N_K} w_{k,i}\, G_k(R, X)\, v_i, \qquad (4.2)$

where $G_k(R, X)$ is the rest-pose corrected affine transformation to apply to body part $k$.
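As a reference for Eq. 4.2, here is a minimal LBS sketch, assuming the per-part rest-pose-corrected transforms $G_k(R, X)$ have already been assembled into an (N_K, 4, 4) array; the helper name lbs and the array layout are assumptions for illustration rather than the exact ARCH code.

import numpy as np

def lbs(verts, G, W):
    # verts: (N_V, 3) rest-pose vertices; G: (N_K, 4, 4) per-part affine transforms;
    # W: (N_V, N_K) skinning weights. Returns the posed vertices (N_V, 3).
    v_h = np.concatenate([verts, np.ones((verts.shape[0], 1))], axis=1)  # homogeneous coordinates
    T = np.einsum('vk,kij->vij', W, G)        # per-vertex blended transform: sum_k w_{k,i} G_k
    v_posed = np.einsum('vij,vj->vi', T, v_h)[:, :3]
    return v_posed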
4.3.2 Implicit Surface Reconstruction
We use the occupancy map $\mathcal{O}$ to implicitly represent the 3D clothed human, i.e.,

$\mathcal{O} = \{(p, o_p) : p \in \mathbb{R}^3,\ 0 \le o_p \le 1\}, \qquad (4.3)$

where $o_p$ denotes the occupancy for a point $p$. To obtain a surface, we can simply threshold the occupancy map $\mathcal{O}$ to obtain the isosurface $\mathcal{O}'$.

In this paper, we incorporate a human body prior by always reconstructing a neutral-posed shape in the canonical space. Similar to [145], we develop a deep neural network that takes a canonical space point $p$, its corresponding 2D position $q$, and the 2D image $I$ as inputs and estimates the occupancy $o_p$, normal $n_p$, and color $c_p$ for $p$; that is,

$o_p = \mathcal{F}(f^s_p, I; \theta_o),$
$n_p = \mathcal{F}(f^s_p, I, f^o_p; \theta_n),$
$c_p = \mathcal{F}(f^s_p, I, f^o_p, f^n_p; \theta_c),$
$f^s_p \in \mathbb{R}^{171},\ f^o_p \in \mathbb{R}^{256},\ f^n_p \in \mathbb{R}^{64},\ f^c_p \in \mathbb{R}^{64}, \qquad (4.4)$
where $\theta_o$, $\theta_n$ and $\theta_c$ denote the occupancy, normal and color sub-network weights, and $f^s_p$ is the spatial feature extracted based on SemS. We use the estimated 57 canonical body landmarks from [196] and compute the Radial Basis Function (RBF) distance between $p$ and the $i$-th landmark $p'_i$, that is,

$f^s_p(i) = \exp\{-D(p, p'_i)\}, \qquad (4.5)$

where $D(\cdot)$ is the Euclidean distance. We also evaluate the effects of different types of spatial features in Sec. 4.4.2. $f^o_p$ and $f^n_p$ are the feature maps extracted from the occupancy and normal sub-networks, respectively (see also Fig. 4.2). The three sub-networks are defined as follows:
The Occupancy sub-network uses a Stacked Hourglass (SHG) [122] as the image feature encoder and a Multi-Layer Perceptron (MLP) as the regressor. Given a $512 \times 512$ input image $I$, the SHG produces a feature map $f \in \mathbb{R}^{512 \times 512 \times 256}$ with the same grid size. For each 3D point $p$, we consider the feature located at the corresponding projected pixel $q$ as its visual feature descriptor $f^o_p \in \mathbb{R}^{256}$. For points that do not align onto the grid, we apply bi-linear interpolation on the feature map to obtain the feature at that pixel-aligned location. The MLP takes the spatial feature of the 3D point $p \in \mathbb{R}^3$ and the pixel-aligned image features $f^o_p \in \mathbb{R}^{256}$ as inputs and estimates the occupancy $o_p \in [0, 1]$ by classifying whether this point lies inside the clothed body or not.
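A minimal sketch of the two per-point inputs described here, namely the RBF spatial feature of Eq. 4.5 and the pixel-aligned image feature obtained by bilinear interpolation; the tensor shapes and the assumption that projected pixel coordinates are already normalized to [-1, 1] are illustrative, not the exact ARCH code.

import torch
import torch.nn.functional as F

def rbf_spatial_feature(p, landmarks):
    # p: (B, P, 3) canonical-space points; landmarks: (B, L, 3) canonical body landmarks.
    # Returns (B, P, L) features exp(-D(p, p'_i)).
    return torch.exp(-torch.cdist(p, landmarks))

def pixel_aligned_feature(feat_map, q):
    # feat_map: (B, C, H, W) encoder output; q: (B, P, 2) projected pixel coordinates in [-1, 1].
    grid = q.unsqueeze(2)                                    # (B, P, 1, 2)
    sampled = F.grid_sample(feat_map, grid, mode='bilinear', align_corners=True)
    return sampled.squeeze(-1).permute(0, 2, 1)              # (B, P, C) per-point features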
The Normal sub-network uses a U-net [142] as the image feature encoder and an MLP which takes the spatial feature and the feature descriptors $f^n_p \in \mathbb{R}^{64}$ and $f^o_p \in \mathbb{R}^{256}$ from its own backbone and from the occupancy sub-network as inputs, and estimates the normal vector $n_p$.
The Color sub-network also uses a U-net [142] as the image feature encoder and an MLP which takes the spatial feature and the feature descriptors $f^c_p \in \mathbb{R}^{64}$, $f^n_p \in \mathbb{R}^{64}$ and $f^o_p \in \mathbb{R}^{256}$ from its own backbone as well as from the normal and occupancy sub-networks as inputs, and estimates the color $c_p$ in RGB space.
For each sub-network, the MLP takes the pixel-aligned image features and the spatial features (as described in Sec. 4.3.1) as inputs, where the numbers of hidden neurons are (1024, 512, 256, 128). Similar to [145], each layer of the MLP has skip connections from the input features. For the occupancy sub-network, the MLP estimates a one-dimensional occupancy $o_p \in [0, 1]$ using a Sigmoid activation. For the normal sub-network, the MLP estimates a three-dimensional normal $n_p \in [0, 1]^3$ with $\|n_p\|_2 = 1$ using L2 normalization. For the color sub-network, the MLP estimates a three-dimensional color $c_p \in [0, 1]^3$ using range clamping.
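The following is a minimal sketch of such a per-point MLP with skip connections from the input features and the (1024, 512, 256, 128) hidden sizes; the exact layer composition in ARCH (e.g., batch normalization, per-head activations) may differ, so PointMLP is only an illustrative stand-in.

import torch
import torch.nn as nn

class PointMLP(nn.Module):
    def __init__(self, in_dim, out_dim, hidden=(1024, 512, 256, 128)):
        super().__init__()
        layers, prev = [], in_dim
        for h in hidden:
            # every hidden layer after the first also re-receives the input features (skip connection)
            layers.append(nn.Linear(prev + in_dim if layers else in_dim, h))
            prev = h
        self.hidden = nn.ModuleList(layers)
        self.out = nn.Linear(prev, out_dim)

    def forward(self, x):
        h = x
        for i, layer in enumerate(self.hidden):
            inp = h if i == 0 else torch.cat([h, x], dim=-1)
            h = torch.relu(layer(inp))
        # heads: occupancy -> sigmoid, normal -> L2 normalization, color -> clamp to [0, 1]
        return self.out(h)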
4.3.3 Granular Render-and-Compare
The prediction from the model is an implicit function representation. By sampling points in a predefined volume and optimizing $\mathcal{L}^o_{3d}$, $\mathcal{L}^n_{3d}$ and $\mathcal{L}^c_{3d}$, we can optimize the occupancy, normal and color at these points directly given 3D ground truth. However, it is not clear what the gradients should be for points that are not located directly on the surface of the ground truth mesh. To address this problem, we propose to use a differentiable renderer.
We first create an explicit geometric representation of the scene at hand. For every sample point to
optimize, we place a geometric primitive with a spatial extent at its position. To be independent of the
viewpoint, we choose this to be a sphere with 1 cm radius for every sampled point (for an overview of the
differentiable rendering loss computation, see Fig. 4.3). During training, every scene to render contains
51 200 spheres.
We then define a differentiable rendering function [102] to project the spheres onto the image plane so that we can perform pixel-level comparisons with the projected ground truth. We use a linear combination with a weight $w^i_j$ to associate the color contribution from point $p_i$ to the pixel $q_j$. Having the color $c_i$ and normal $n_i$ for point $p_i$, the color and normal for pixel $q_j$ are calculated as the weighted linear combinations of point values, $\sum_i w^i_j c_i$ and $\sum_i w^i_j n_i$.
We define $w^i_j$ considering two factors: the depth of the sphere for point $p_i$ at pixel $q_j$, $z^i_j$, and the proximity of the projected surface of the sphere for point $p_i$ to pixel $q_j$, $d^i_j$. To make occlusion possible, the depth needs to have a strong effect on the resulting weight. Hence, [102] defines the weight as

$w^i_j = \dfrac{d^i_j \exp(z^i_j / \gamma)}{\sum_k d^k_j \exp(z^k_j / \gamma) + \exp(\epsilon / \gamma)}, \qquad (4.6)$

with $\epsilon$ being a small numerical constant. With this definition, the proximity has linear influence on the resulting weight while the depth has exponential influence. The impact ratio is controlled by the scaling factor $\gamma$, which we fix to $1 \times 10^{-5}$ in our experiments.
In contrast to [102], we also need to use an opacity $\sigma_i$ per sphere for rendering. We tie this opacity value $\sigma_i$ directly to the predicted occupancy value through linear scaling and shifting. To stay with the formulation of the render function, we integrate $\sigma_i$ into the weight formulation in Eqn. 4.6.

If the opacity is used as a linear factor in this equation, the softmax function will still render spheres with very low opacity over other spheres with a lower depth value. The problem is the exponential function that is applied to the scaled depth values. On the other hand, if an opacity factor is only incorporated into the exponential function, spheres will remain visible in front of the background (their weight factor is still larger than the background factor $\exp(\epsilon / \gamma)$). We found a solution by using the opacity value both as a linear scaling factor and as an exponential depth scaling factor. This solution turned out to be numerically stable and well suited for optimization, with all desired properties. This changes the weight function to the following:

$w^i_j = \dfrac{\sigma_i\, d^i_j \exp(\sigma_i z^i_j / \gamma)}{\sum_k \sigma_k\, d^k_j \exp(\sigma_k z^k_j / \gamma) + \exp(\epsilon / \gamma)}. \qquad (4.7)$

Using this formulation, we optimize the color channel values $c_i$ and normal values $n_i$ per point. A per-pixel L1 loss is computed between the rendering and a rendering of the ground truth data and back-propagated through the model. For our experiments with $\gamma = 1 \times 10^{-5}$ and the given depth of the volume, we map the occupancy values that define the isosurface at the value 0.5 to the threshold where $\sigma$ shifts to transparency. We experimentally determined this value to be roughly 0.7.
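A minimal sketch of the opacity-aware blending weights of Eq. 4.7 for a single image, assuming per-sphere proximity maps d and depth terms z of shape (K, H, W) and opacities sigma of shape (K,) have already been computed; the value of the small constant eps and the max-subtraction trick used to keep the exponentials finite are assumptions, not necessarily the exact ARCH implementation.

import torch

def blend_weights(d, z, sigma, gamma=1e-5, eps=1e-5):
    # d, z: (K, H, W) proximity and depth terms per sphere; sigma: (K,) per-sphere opacities.
    # Returns (K, H, W) weights that, together with the background term, sum to 1 per pixel.
    s = sigma.view(-1, 1, 1)
    exponent = s * z / gamma
    m = torch.maximum(exponent.max(dim=0, keepdim=True).values,
                      torch.tensor(eps / gamma))             # stabilize the exponentials
    num = s * d * torch.exp(exponent - m)
    denom = num.sum(dim=0, keepdim=True) + torch.exp(eps / gamma - m)
    return num / denom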
Figure 4.3: Illustration of the loss computation through differentiable rendering.
From left to right: points are sampled according to a Gaussian distribution around our template mesh in
the canonical space. They are transformed with the estimated Semantic Deformation Field and processed
by the model. The model provides estimations of occupancy, normal and color for each 3D point. We use
a differentiable renderer to project those points onto a new camera view and calculate pixel-wise
differences to the rendered ground truth.
4.3.4 Training
During training, we optimize the parameters of all three sub-models, i.e., the occupancy, normal and color
models. We define the training in three separate loops to train each part with the appropriate losses and
avoid computational bottlenecks. The total loss function is defined as
$\mathcal{L} = \mathcal{L}^o_{3d} + \mathcal{L}^n_{3d} + \mathcal{L}^c_{3d} + \mathcal{L}^n_{2d} + \mathcal{L}^c_{2d}, \qquad (4.8)$

where $\mathcal{L}^o_{3d}$ is the 3D loss for the occupancy network, $\mathcal{L}^n_{3d}$ and $\mathcal{L}^n_{2d}$ are the 3D and 2D losses for the normal network, and $\mathcal{L}^c_{3d}$ and $\mathcal{L}^c_{2d}$ are the 3D and 2D losses for the color network. For every training iteration, we
perform the following three optimizations.
Occupancy. We use the available ground truth to train the occupancy prediction model in a direct and supervised way. First, we sample 20 480 points in the canonical space. They are sampled around the template mesh according to a normal distribution with a standard deviation of 5 cm. This turned out to cover the various body shapes and clothing well in our experiments, but can be selected according to the data distribution at hand. These points are then processed by the occupancy model, providing us with an estimated occupancy value for every sampled point. We use a sigmoid function on these values to normalize the network output to the interval [0, 1], where we select 0.5 as the position of the isosurface. 0.5 is the position where the derivative of the sigmoid function is the highest and where we expect to optimize the surface prediction best. The loss $\mathcal{L}^o_{3d}$ is defined as the Huber loss comparing the occupancy prediction and ground truth. Similar to [128], we found a less aggressive loss function than the squared error better suited for the optimization, but found the quadratic behavior of the Huber loss around zero to be beneficial.
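A minimal sketch of this occupancy stage under the stated sampling scheme; occupancy_model and gt_inside_fn are hypothetical placeholders for the ARCH occupancy sub-network and a ground-truth inside/outside query, and smooth_l1_loss stands in for the Huber loss.

import torch
import torch.nn.functional as F

def occupancy_step(occupancy_model, template_surface_pts, gt_inside_fn, n=20480, sigma=0.05):
    # Sample n points around the template surface with a 5 cm standard deviation.
    idx = torch.randint(0, template_surface_pts.shape[0], (n,))
    pts = template_surface_pts[idx] + sigma * torch.randn(n, 3)
    pred = torch.sigmoid(occupancy_model(pts)).squeeze(-1)   # normalize to [0, 1], isosurface at 0.5
    gt = gt_inside_fn(pts).float()                           # 1 inside the clothed body, 0 outside
    return F.smooth_l1_loss(pred, gt)                        # Huber-style loss L^o_3d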
Normals and colors for surface points. Colors and normals can be optimized directly from the ground
truth mesh for points that lie on its surface. To use this strong supervision signal we introduce a dedicated
training stage. In this stage, we sample points only from the mesh surface and push them through the color
and normal models. In our setup, we use 51 200 point samples per model per training step. The loss terms
$\mathcal{L}^n_{3d}$ and $\mathcal{L}^c_{3d}$ are defined as the L1 loss comparing the predicted normals and colors with the ground truth
across all surface points. The occupancy predictions are kept unchanged.
Normals and colors for points not on the surface. For points not on the mesh surface, it is not
clear how the ground truth information can be used in the best way to improve the prediction without an
additional mapping. In a third step for the training, we sample another set of 51 200 points, and push them
through the occupancy, color and normal models and use a differentiable renderer on the prediction. We
render the image using the occupancy information as opacity and the color channels to represent colors or normals, and use the gradients to update the predicted values. $\mathcal{L}^n_{2d}$ and $\mathcal{L}^c_{2d}$ are defined as the per-pixel L1 loss between the rendered image and the ground truth. For details on this step, see Fig. 4.3 and Sec. 4.3.3.
4.3.5 Inference
For inference, we take as input a single RGB image representing a human in an arbitrary pose, and run the
forward model as described in Sec. 4.3.2 and Fig. 4.2. The network outputs a densely sampled occupancy
field over the canonical space, from which we use the Marching Cubes algorithm [110] to extract the isosurface at threshold 0.5. The isosurface represents the reconstructed clothed human in the canonical pose. Colors and normals for the whole surface are also inferred by the forward pass and are pixel-aligned to the input image (see Sec. 4.3.2). The human model can then be transformed to its original pose $R$ by LBS using SemDF and the per-point corresponding skinning weights $W$ as defined in Sec. 4.3.1.
Furthermore, since the implicit function representation is equipped with skinning weights and a skeleton rig, it can naturally be warped to arbitrary poses. The proposed end-to-end framework can then be used to
create a detailed 3D avatar that can be animated with unseen sequences from a single unconstrained photo
(see Fig. 4.5).
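A minimal sketch of this inference path, reusing the hypothetical assign_space_weights and lbs helpers sketched in Sec. 4.3.1; the grid bounds and resolution are assumptions, and skimage's Marching Cubes stands in for the extraction step.

import numpy as np
from skimage import measure

def reconstruct_and_repose(occupancy_fn, template_verts, template_weights, G, res=256, bound=1.0):
    # Evaluate the occupancy field on a dense canonical-space grid.
    xs = np.linspace(-bound, bound, res)
    grid = np.stack(np.meshgrid(xs, xs, xs, indexing='ij'), axis=-1).reshape(-1, 3)
    occ = occupancy_fn(grid).reshape(res, res, res)
    # Extract the isosurface at threshold 0.5 and map voxel indices back to world coordinates.
    verts, faces, _, _ = measure.marching_cubes(occ, level=0.5)
    verts = verts / (res - 1) * 2 * bound - bound
    # Attach skinning weights and warp the canonical mesh back to the original pose R via LBS.
    W, _ = assign_space_weights(verts, template_verts, template_weights)
    return lbs(verts, G, W), faces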
4.3.6 Implementation Details
ARCH is implemented in PyTorch. We train the neural network model using the RMSprop optimizer
with a learning rate starting from 1e-3. The learning rate is updated using an exponential schedule every
3 epochs by multiplying with the factor 0.1. We use 582 3D scans to train the model, with 360 views per scan, resulting in 209 520 training images per epoch. Training the model on an NVIDIA
DGX-1 system with one Tesla V100 GPU takes 90 h for 9 epochs.
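A minimal sketch of this optimizer setup; the model defined here is only a placeholder standing in for the ARCH sub-networks.

import torch

model = torch.nn.Linear(3, 1)   # placeholder for the ARCH sub-networks
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)  # x0.1 every 3 epochs

for epoch in range(9):
    # ... run the three per-iteration optimizations over the training images ...
    scheduler.step()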
4.4 Experiments
We present details on the ARCH implementation and the training datasets, with results and comparisons to the
state of the art.
4.4.1 Datasets
Our training dataset is composed of 375 3D scans from the RenderPeople dataset (http://renderpeople.com), and 207 3D scans from the AXYZ dataset (http://secure.axyz-design.com). The scans are watertight meshes which are mostly free of noise. They represent
subjects wearing casual clothes, and potentially holding small objects (e.g., mobile phones, books and
purses). Our test dataset contains 37 scans from the RenderPeople dataset, 1 scan from the AXYZ dataset,
26 scans from the BUFF dataset [204], and 2D images from the DeepFashion [105] dataset, representing
clothed people with a large variety of complex clothing. The subjects in the training dataset are mostly
in standing pose, while the subjects in the test dataset are in arbitrary poses (standing, bending, sitting,
Figure 4.4: Illustration of reposing 3D scans to the canonical space.
(a) An original 3D scan from the RenderPeople dataset. (b) Automatically detected topology changes.
Red marks points with self-contacts, blue marks regions that are also removed before reposing to avoid problems
with normals. (c, d) Reposed scan.
. . . ). We create renders of the 3D scans using Blender. For each 3D scan, we produce 360 images by
rotating a camera around the vertical axis with intervals of 1 degree. For the current experiments, we only
considered the weak perspective projection (orthographic camera) but this can be easily adapted. We also
used 38 environment maps to render each scan with different natural lighting conditions. The proposed
model is trained to predict albedo (given by ground truth scan color). We also observed that increasing the
number of images improves the fidelity of predicted colors (as in [145]).
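A minimal sketch of such a weak perspective (orthographic) camera rotating about the vertical axis; the scale and 2D offset parameters and their defaults are illustrative assumptions rather than the exact rendering setup.

import numpy as np

def orthographic_project(points, yaw_deg, scale=1.0, trans=(0.0, 0.0)):
    # points: (N, 3) world coordinates. Returns (N, 2) image-plane coordinates.
    a = np.deg2rad(yaw_deg)
    R = np.array([[ np.cos(a), 0.0, np.sin(a)],
                  [ 0.0,       1.0, 0.0      ],
                  [-np.sin(a), 0.0, np.cos(a)]])   # rotation around the vertical (y) axis
    cam = points @ R.T                             # rotate into the camera frame
    return scale * cam[:, :2] + np.asarray(trans)  # drop depth: orthographic projection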
In order to use a 3D scan for model training, we fit a rigged 3D body template to the scan mesh to
estimate the 3D body pose (see Fig. 4.4). The estimated parametric 3D body can directly serve as ground
truth input data during the model training step (see Sec. 4.3.4). This also allows us to obtain SemS and
SemDF for the scan. However, since each 3D scan has its own topology, artifacts due to topology changes
will occur when pose-normalization is naively applied to models containing self-contact (for example arms
touching the body). This creates inaccurate deformations. Hence, we first detect regions of self-contact
and topology changes and cut the mesh before pose-normalization (see Fig. 4.4 (c) and (d)). Holes are
Methods          | RenderPeople               | BUFF
                 | Normal   P2S    Chamfer    | Normal   P2S    Chamfer
BodyNet [173]    | 0.26     5.72   5.64       | 0.31     4.94   4.52
SiCloPe [119]    | 0.22     3.81   4.02       | 0.22     4.06   3.99
IM-GAN [34]      | 0.26     2.87   3.14       | 0.34     5.11   5.32
VRN [78]         | 0.12     1.42   1.6        | 0.13     2.33   2.48
PIFu [145]       | 0.08     1.52   1.50       | 0.09     1.15   1.14
ARCH, baseline   | 0.080    1.98   1.85       | 0.081    1.74   1.75
  + SemDF        | 0.042    0.74   0.85       | 0.045    0.82   0.87
  + GRaC         | 0.038    0.74   0.85       | 0.040    0.82   0.87
Table 4.1: Quantitative comparisons of normal, P2S and Chamfer errors between posed reconstruction
and ground truth on the RenderPeople and BUFF datasets.
Lower values are better.
Figure 4.5: An example for animating a predicted avatar.
We use a predicted, skinned avatar from our test set and drive it using off-the-shelf motion capture data.
This avatar has been created using only a single, frontal view. Our model produces a plausible prediction
for the unseen parts, for example the hair and the back of the dress.
then filled up using Smooth Signed Distance Surface reconstruction [27] (see Fig. 4.4 (c) and (d)). For
inference on 2D images from the DeepFashion dataset, we obtain 3D body poses using the pre-trained
models from [196].
4.4.2 Results and Comparisons
We evaluate the reconstruction accuracy of ARCH with three metrics similar to [145]. We reconstruct the
results on the same test set and repose them back to the original poses of the input images and compare
Figure 4.6: Evaluation on BUFF.
Our method outperforms [145] for detailed reconstruction from arbitrary poses. We show results from
different angles.
the reconstructions with the ground truth surfaces in the original poses. We report the average point-to-
surface Euclidean distance (P2S) in centimeters, the Chamfer distance in centimeters, and the L2 normal
re-projection error in Tab. 4.1.
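A minimal sketch of the P2S and Chamfer metrics as used above, approximating each surface by a dense point sample; the symmetric averaging convention for the Chamfer distance and the omission of the normal re-projection term are simplifying assumptions.

import numpy as np
from scipy.spatial import cKDTree

def p2s_and_chamfer(recon_pts, gt_pts):
    # recon_pts: (N, 3) points sampled on the reconstruction; gt_pts: (M, 3) on the ground truth.
    d_r2g = cKDTree(gt_pts).query(recon_pts)[0]     # reconstruction -> ground truth distances
    d_g2r = cKDTree(recon_pts).query(gt_pts)[0]     # ground truth -> reconstruction distances
    p2s = d_r2g.mean()                              # average point-to-surface distance (cm)
    chamfer = 0.5 * (d_r2g.mean() + d_g2r.mean())   # symmetric Chamfer distance (cm)
    return p2s, chamfer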
In addition to comparing with state-of-the-art methods [34, 78, 83, 119, 145, 173], we include scores
of an ablative study with the proposed method. In particular, we evaluate three variants and validate the
effectiveness of two main components: the Semantic Deformation Field and the Granular Render-and-
Compare loss.
ARCH, baseline: a variant of [145] using our own network specifications, taking an image as input and
directly estimating the implicit surface reconstruction.
Semantic Deformation Field (SemDF): we first estimate the human body configuration by [196] and
then reconstruct the canonical shape using the implicit surface reconstruction, and finally repose the canon-
ical shape to the original pose in the input image.
Granular Render-and-Compare (GRaC): based on the previous step, we further refine the reconstructed
surface normal and color using differentiable render-and-compare.
The ARCH baseline specification already achieves state-of-the-art performance in normal estimation, but
has inferior performance w.r.t. P2S and Chamfer error compared to PIFu [145]. We use a different training
dataset compared to PIFu that apparently does not represent the test set as well. Also, PIFu normalizes
Figure 4.7: Reconstruction quality of clothing details.
The geometry reconstruction from our method reproduces larger wrinkles and the seam of the pants and
shoes while the predicted normals reproduce fine wrinkles. The normal and color predictions rendered
together produce a plausible image.
every scan at training and prediction time to have its geometric center at the coordinate origin, whereas
we use origin-placed scans with slight displacements. Lastly, PIFu performs a size normalization of the body using the initial 3D body configuration estimate. The image is rescaled so that the height of the person matches the canonical size. This makes person height estimation for PIFu impossible, whereas we properly reconstruct it, at the cost of a more difficult task to solve. The benefit of this operation is not
reflected in the scores because the metrics are calculated in the original image space.
When adding SemDF, we see a substantial gain in performance compared to our own baseline, but also compared to the previously best-performing PIFu metrics. We outperform PIFu on average with an improvement of over 50% on the RenderPeople dataset and an average improvement of over 60% on the BUFF dataset. When adding the Granular Render-and-Compare loss, these numbers improve again slightly, especially on the normal estimation. Additionally, the results gain a lot of visual fidelity and many visual artifacts are removed.
Figure 4.8: Reconstruction example using different types of spatial features.
XYZ: absolute coordinates, L2: Euclidean distances to each joint, RBF: Radial basis function based
distance to each joint. The proposed RBF preserves notably more details.
Spatial Feature Types | Normal   P2S    Chamfer
XYZ                   | 0.045    0.75   0.91
L2                    | 0.043    0.76   0.89
RBF                   | 0.042    0.74   0.85
Table 4.2: Ablation study on the effectiveness of spatial features.
The XYZ feature uses the plain location of body landmarks. The L2 and RBF features both improve the
performance.
Fig. 4.7 shows the level of detail of geometry, normal and color predictions our model can achieve.
Note that, for example, the zipper is not reproduced in the predicted normal map. This is an indicator that
the model does not simply reproduce differences in shading directly in the normal map, but is able to learn
about geometric and shading properties of human appearance. In Fig. 4.6, we show qualitative results on
challenging poses from the BUFF dataset. In Fig. 4.9, we provide a comparison of results of our method
with a variety of state-of-the-art models [173, 83, 145].
Ablation Studies. We evaluate the effectiveness of different types of spatial features in Tab. 4.2 and
Fig. 4.8. We evaluate three different features: XYZ uses the absolute position of the sampled point, L2
uses the Euclidean distance from the sampled point to each body landmark, and RBF denotes our proposed
method in Sec. 4.3.1. It can be observed that the RBF feature works best for this use case both qualitatively
Figure 4.9: Qualitative comparisons against state-of-the-art methods [83, 173, 145] on unseen images.
ARCH (Ours) handles arbitrary poses with self-contact and occlusions robustly, and reconstructs a higher
level of details than existing methods. Images are from RenderPeople. Results on DeepFashion are of
similar quality but are not shown due to copyright concerns. Please contact us for more information.
Figure 4.10: Challenging cases.
Reconstruction of rare poses and details in occluded areas could be further improved.
and quantitatively. RBF features strongly emphasize landmarks that are close in distance to the currently analyzed point and put less emphasis on landmarks further away, facilitating optimization and preserving details.
Animating Reconstructed Avatars. With the predicted occupancy field we can reconstruct a mesh
that is already rigged and can directly be animated. We show the animation of an avatar we reconstructed
from the AXYZ dataset in Fig. 4.5, driven by an off-the-shelf retargeted Mixamo animation [196]. By
working in the canonical space, the avatar is automatically rigged and can be directly animated. Given
only a single view image, the avatar is reconstructed in 3D and looks plausible from all sides.
Limitations. As shown in Fig. 4.10, rare poses not sufficiently covered in the training dataset (e.g., kneeling) yield an inaccurate body prior and are then challenging to reconstruct. Also, details (i.e., normals) in occluded areas could be improved with dedicated occlusion-aware estimation.
Chapter 5
Conclusion and Future Work
5.1 Conclusion
In this dissertation, we introduce our whole pipeline for complete digitization of humans, starting from image pre-processing, through general volumetric reconstruction, to modeling a full-body avatar. All our methods are based on data-driven deep learning approaches.
In Chapter 2, we present the first automatic approach that corrects the perspective distortion of uncon-
strained near-range portraits. Our approach even handles extreme distortions. We proposed a novel cascade network including a camera parameter prediction network, a forward flow prediction network and a feature inpainting network. We also built the first database of perspective portrait pairs with large variations in identity, expression, illumination and head pose. Furthermore, we designed a novel duo-camera system to capture testing image pairs of real humans. Our approach significantly outperforms the state-of-the-art approach on the task of perspective undistortion, with accurate camera parameter prediction.
Our approach also boosts the performance of fundamental tasks like face verification, landmark detection
and 3D face reconstruction.
In Chapter 3, we present DVH and PIFu, an end-to-end deep learning framework to infer the 3D shape and texture of clothed humans from a single input image, or from sparse multi-view images where traditional multi-view stereo or structure-from-motion would fail. Our experiments indicate that highly plausible geometry can be inferred, including largely unseen regions such as the back of a person, while preserving the high-frequency details present in the image(s). Unlike voxel-based representations, our method can produce high-resolution output since we are not limited by the high memory requirements of volumetric representations. Furthermore, we also demonstrate how this method can be naturally extended to infer the entire texture of a person given partial observations. Unlike existing methods, which synthesize the back regions based on frontal views in an image space, our approach can predict colors in unseen, concave and side regions directly on the surface. In particular, our method is the first approach that can inpaint textures for shapes of arbitrary topology. Since we are capable of generating textured 3D surfaces of a clothed person from a single RGB camera, we move a step closer toward monocular reconstruction of dynamic scenes from video without the need for a template model.
In Chapter 4, we present ARCH, an end-to-end framework to reconstruct complete clothed human
models from unconstrained photos. By introducing the Semantic Space and Semantic Deformation Field,
we are able to handle reconstruction from an arbitrary pose. We also propose a Granular Render-and-
Compare loss for our implicit function representation to further constrain visual similarity under random-
ized camera views. ARCH shows higher fidelity in clothing details including pixel-aligned colors and
normals with a wider range of human body configurations. The resulting models are animation-ready and
can be driven by arbitrary motion sequences. This work opens up many opportunities for ad-hoc animation, with applications ranging from gaming and e-commerce to research.
5.2 Ongoing and Future Work
Inference from Sequence. Thus far, all our methods use photo(s) taken at a single time as input, and
treat video inputs as individual images on a frame-by-frame basis. Due to the ill-posed nature of the sparse capture setup, where a single frame contains too many occlusions and too few features, data-driven approaches have to be used to infer the unseen regions. However, this is not necessarily the case with videos, especially when a person is moving around and multiple body regions can be seen by the camera in different frames. As short-video apps become popular and sequence data become the common media on mobile devices, there is a rising need for algorithms specifically designed for video inputs, which can be expected to further improve quality over per-frame approaches. As our latest ARCH work reconstructs in the canonical space, it will be interesting to study how dynamic fusion would work in that space along sequences, and new deep learning based fusion techniques can be developed in this direction.
In-the-wild inputs. Our whole pipeline has a cascade structure in which the reconstruction frameworks assume well-calibrated and normalized inputs. In addition to improving the pre-processing quality, another future avenue is to investigate end-to-end training of the whole pipeline, which would require fully differentiable connections between the stages.
Bibliography
[1] Edward H Adelson, Charles H Anderson, James R Bergen, Peter J Burt, and Joan M Ogden. Pyra-
mid methods in image processing. RCA engineer, 29(6):33–41, 1984.
[2] Adobe. Mixamo, 2013. https://www.mixamo.com/.
[3] Thiemo Alldieck, Marcus Magnor, Bharat Lal Bhatnagar, Christian Theobalt, and Gerard Pons-
Moll. Learning to reconstruct people in clothing from a single RGB camera. In IEEE Conference
on Computer Vision and Pattern Recognition, pages 1175–1186, 2019.
[4] Thiemo Alldieck, Marcus Magnor, Bharat Lal Bhatnagar, Christian Theobalt, and Gerard Pons-
Moll. Learning to reconstruct people in clothing from a single RGB camera. In IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), jun 2019.
[5] Thiemo Alldieck, Marcus Magnor, Weipeng Xu, Christian Theobalt, and Gerard Pons-Moll. De-
tailed human avatars from monocular video. In International Conference on 3D Vision, pages
98–109, 2018.
[6] Thiemo Alldieck, Marcus Magnor, Weipeng Xu, Christian Theobalt, and Gerard Pons-Moll. De-
tailed human avatars from monocular video. In International Conference on 3D Vision, 2018.
[7] Thiemo Alldieck, Marcus Magnor, Weipeng Xu, Christian Theobalt, and Gerard Pons-Moll. Video
based reconstruction of 3d people models. In IEEE Conference on Computer Vision and Pattern
Recognition, 2018.
[8] Thiemo Alldieck, Marcus A Magnor, Weipeng Xu, Christian Theobalt, and Gerard Pons-Moll.
Video based reconstruction of 3d people models. In IEEE Conference on Computer Vision and
Pattern Recognition, pages 8387–8397, 2018.
[9] Thiemo Alldieck, Gerard Pons-Moll, Christian Theobalt, and Marcus Magnor. Tex2shape: Detailed
full human body geometry from a single image. In IEEE International Conference on Computer
Vision (ICCV). IEEE, oct 2019.
[10] Brian Amberg, Andrew Blake, Andrew Fitzgibbon, Sami Romdhani, and Thomas Vetter. Recon-
structing high quality face-surfaces using model based stereo. In Computer Vision, 2007. ICCV
2007. IEEE 11th International Conference on, pages 1–8. IEEE, 2007.
[11] Isaac Amidror. Scattered data interpolation methods for electronic imaging systems: a survey.
Journal of electronic imaging, 11(2):157–177, 2002.
[12] Brandon Amos, Bartosz Ludwiczuk, and Mahadev Satyanarayanan. Openface: A general-purpose
face recognition library with mobile applications. CMU School of Computer Science, 2016.
[13] Dragomir Anguelov, Praveen Srinivasan, Daphne Koller, Sebastian Thrun, Jim Rodgers, and James
Davis. Scape: shape completion and animation of people. In ACM Transactions on Graphics (TOG),
volume 24, pages 408–416. ACM, 2005.
[14] Alexandru O Balan, Leonid Sigal, Michael J Black, James E Davis, and Horst W Haussecker.
Detailed human shape and pose from images. In Computer Vision and Pattern Recognition, 2007.
CVPR’07. IEEE Conference on, pages 1–8. IEEE, 2007.
[15] Alexandru O. Balan, Leonid Sigal, Michael J. Black, James E. Davis, and Horst W. Haussecker.
Detailed human shape and pose from images. In IEEE Conference on Computer Vision and Pattern
Recognition, 2007.
[16] Aayush Bansal, Xinlei Chen, Bryan Russell, Abhinav Gupta, and Deva Ramanan. Pixelnet: Repre-
sentation of the pixels, by the pixels, and for the pixels. arXiv:1702.06506, 2017.
[17] Gavin Barill, Neil Dickson, Ryan Schmidt, David I.W. Levin, and Alec Jacobson. Fast winding
numbers for soups and clouds. ACM Transactions on Graphics, 37(4):43, 2018.
[18] Anil Bas and William AP Smith. What does 2d geometric information really tell us about 3d face
shape? arXiv preprint arXiv:1708.06703, 2017.
[19] Thabo Beeler, Bernd Bickel, Paul Beardsley, Bob Sumner, and Markus Gross. High-quality single-
shot capture of facial geometry. In ACM Transactions on Graphics (ToG), volume 29, page 40.
ACM, 2010.
[20] Thabo Beeler, Fabian Hahn, Derek Bradley, Bernd Bickel, Paul Beardsley, Craig Gotsman,
Robert W Sumner, and Markus Gross. High-quality passive facial performance capture using anchor
frames. In ACM Transactions on Graphics (TOG), volume 30, page 75. ACM, 2011.
[21] Bharat Lal Bhatnagar, Garvita Tiwari, Christian Theobalt, and Gerard Pons-Moll. Multi-garment
net: Learning to dress 3d people from images. In IEEE International Conference on Computer
Vision (ICCV). IEEE, oct 2019.
[22] Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3d faces. In Proceedings
of the 26th annual conference on Computer graphics and interactive techniques, pages 187–194.
ACM Press/Addison-Wesley Publishing Co., 1999.
[23] Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J.
Black. Keep it SMPL: Automatic estimation of 3d human pose and shape from a single image. In
European Conference on Computer Vision, 2016.
[24] Derek Bradley, Wolfgang Heidrich, Tiberiu Popa, and Alla Sheffer. High resolution passive facial
performance capture. In ACM transactions on graphics (TOG), volume 29, page 41. ACM, 2010.
[25] Ronnie Bryan, Pietro Perona, and Ralph Adolphs. Perspective distortion from interpersonal distance
is an implicit visual cue for social judgments of faces. PloS one, 7(9):e45301, 2012.
[26] Xavier P Burgos-Artizzu, Matteo Ruggero Ronchi, and Pietro Perona. Distance estimation of an
unknown person from a portrait. In European Conference on Computer Vision, pages 313–327.
Springer, 2014.
[27] Fatih Calakli and Gabriel Taubin. SSD: smooth signed distance surface reconstruction. Comput.
Graph. Forum, 30(7):1993–2002, 2011.
[28] Chen Cao, Yanlin Weng, Shun Zhou, Yiying Tong, and Kun Zhou. Facewarehouse: A 3d fa-
cial expression database for visual computing. IEEE Transactions on Visualization and Computer
Graphics, 20(3):413–425, 2014.
[29] Xudong Cao, Yichen Wei, Fang Wen, and Jian Sun. Face alignment by explicit shape regression.
International Journal of Computer Vision, 107(2):177–190, 2014.
[30] Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. OpenPose: realtime multi-
person 2D pose estimation using Part Affinity Fields. In arXiv preprint arXiv:1812.08008, 2018.
[31] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-
decoder with atrous separable convolution for semantic image segmentation. In European Confer-
ence on Computer Vision, pages 801–818, 2018.
[32] Qian Chen, Haiyuan Wu, and Toshikazu Wada. Camera calibration with two arbitrary coplanar
circles. In European Conference on Computer Vision, pages 521–532. Springer, 2004.
[33] Wenzheng Chen, Jun Gao, Huan Ling, Edward Smith, Jaakko Lehtinen, Alec Jacobson, and Sanja
Fidler. Learning to predict 3d objects with an interpolation-based differentiable renderer. In Annual
Conference on Neural Information Processing Systems, 2019.
[34] Zhiqin Chen and Hao Zhang. Learning implicit fields for generative shape modeling. IEEE Confer-
ence on Computer Vision and Pattern Recognition, 2019.
[35] German KM Cheung, Simon Baker, and Takeo Kanade. Visual hull alignment and refinement across
time: A 3d reconstruction algorithm combining shape-from-silhouette with stereo. In Computer
Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on,
volume 2, pages II–375. IEEE, 2003.
[36] Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3d-r2n2: A
unified approach for single and multi-view 3d object reconstruction. In European Conference on
Computer Vision, pages 628–644. Springer, 2016.
[37] Forrester Cole, David Belanger, Dilip Krishnan, Aaron Sarna, Inbar Mosseri, and William T Free-
man. Synthesizing normalized faces from facial identity features. In Computer Vision and Pattern
Recognition (CVPR), 2017 IEEE Conference on, pages 3386–3395. IEEE, 2017.
[38] Alvaro Collet, Ming Chuang, Pat Sweeney, Don Gillett, Dennis Evseev, David Calabrese, Hugues
Hoppe, Adam Kirk, and Steve Sullivan. High-quality streamable free-viewpoint video. ACM Trans-
actions on Graphics (TOG), 34(4):69, 2015.
[39] Carlo Colombo, Dario Comanducci, and Alberto Del Bimbo. Camera calibration with two arbitrary
coaxial circles. In European Conference on Computer Vision, pages 265–276. Springer, 2006.
[40] Emily A Cooper, Elise A Piazza, and Martin S Banks. The perceptual basis of common photo-
graphic practice. Journal of vision, 12(5):8–8, 2012.
[41] Timothy F. Cootes, Gareth J. Edwards, and Christopher J. Taylor. Active appearance models. IEEE
Transactions on pattern analysis and machine intelligence, 23(6):681–685, 2001.
[42] David Cristinacce and Tim Cootes. Automatic feature localisation with constrained local models.
Pattern Recognition, 41(10):3054–3067, 2008.
[43] Ankur Datta, Jun-Sik Kim, and Takeo Kanade. Accurate camera calibration using iterative re-
finement of control points. In Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th
International Conference on, pages 1201–1208. IEEE, 2009.
[44] Paul Debevec, Tim Hawkins, Chris Tchou, Haarm-Pieter Duiker, Westley Sarokin, and Mark Sagar.
Acquiring the reflectance field of a human face. In Proceedings of the 27th annual conference
on Computer graphics and interactive techniques, pages 145–156. ACM Press/Addison-Wesley
Publishing Co., 2000.
[45] Endri Dibra, Himanshu Jain, Cengiz Oztireli, Remo Ziegler, and Markus Gross. Human shape from
silhouettes using generative hks descriptors and cross-modal neural networks. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA,
volume 5, 2017.
[46] Ruofei Du, Ming Chuang, Wayne Chang, Hugues Hoppe, and Amitabh Varshney. Montage4d:
Interactive seamless fusion of multiview video textures. In Proceedings of ACM SIGGRAPH Sym-
posium on Interactive 3D Graphics and Games (I3D), pages 124–133. ACM, May 2018.
[47] Carlos Hernández Esteban and Francis Schmitt. Silhouette and stereo fusion for 3d object modeling.
Computer Vision and Image Understanding, 96(3):367–392, 2004.
[48] Hao-Shu Fang, Yuanlu Xu, Wenguan Wang, Xiaobai Liu, and Song-Chun Zhu. Learning pose
grammar to encode human body configuration for 3d pose estimation. In AAAI Conference on
Artificial Intelligence, 2018.
[49] Arturo Flores, Eric Christiansen, David Kriegman, and Serge Belongie. Camera distance from face
images. In International Symposium on Visual Computing, pages 513–522. Springer, 2013.
[50] Jean-Sébastien Franco, Marc Lapierre, and Edmond Boyer. Visual shapes of silhouette sets. In
3D Data Processing, Visualization, and Transmission, Third International Symposium on, pages
397–404. IEEE, 2006.
[51] Ohad Fried, Eli Shechtman, Dan B Goldman, and Adam Finkelstein. Perspective-aware manipula-
tion of portrait photos. ACM Transactions on Graphics (TOG), 35(4):128, 2016.
[52] Yasutaka Furukawa and Jean Ponce. Carved visual hulls for image-based modeling. In European
Conference on Computer Vision, pages 564–577. Springer, 2006.
[53] Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multi-view stereopsis. In IEEE
Conference on Computer Vision and Pattern Recognition, 2007.
[54] Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multiview stereopsis. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 32(8):1362–1376, 2010.
[55] Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multiview stereopsis. IEEE Trans.
Pattern Anal. Mach. Intell., 32(8):1362–1376, August 2010.
[56] Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multiview stereopsis. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 32(8):1362–1376, 2010.
[57] Juergen Gall, Carsten Stoll, Edilson De Aguiar, Christian Theobalt, Bodo Rosenhahn, and Hans-
Peter Seidel. Motion capture using joint skeleton tracking and surface estimation. In Computer
Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1746–1753. IEEE,
2009.
[58] Juergen Gall, Carsten Stoll, Edilson de Aguiar, Christian Theobalt, Bodo Rosenhahn, and Hans-
Peter Seidel. Motion capture using joint skeleton tracking and surface estimation. In IEEE Confer-
ence on Computer Vision and Pattern Recognition, 2009.
[59] Abhijeet Ghosh, Graham Fyffe, Borom Tunwattanapong, Jay Busch, Xueming Yu, and Paul De-
bevec. Multiview face capture using polarized spherical gradient illumination. In ACM Transactions
on Graphics (TOG), volume 30, page 129. ACM, 2011.
[60] Thibault Groueix, Matthew Fisher, Vladimir G Kim, Bryan C Russell, and Mathieu Aubry. Atlasnet:
A papier-mâché approach to learning 3d surface generation. In IEEE Conference on Computer
Vision and Pattern Recognition, 2018.
[61] Peng Guan, Alexander Weiss, Alexandru O Balan, and Michael J Black. Estimating human shape
and pose from a single image. In Computer Vision, 2009 IEEE 12th International Conference on,
pages 1381–1388. IEEE, 2009.
[62] Riza Alp Guler and Iasonas Kokkinos. Holopose: Holistic 3d human reconstruction in-the-wild. In
IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[63] Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. Densepose: Dense human pose estimation
in the wild. In IEEE Conference on Computer Vision and Pattern Recognition, pages 7297–7306,
2018.
[64] Christian Häne, Shubham Tulsiani, and Jitendra Malik. Hierarchical surface prediction for 3d object
reconstruction. In arXiv preprint arXiv:1704.00710. 2017.
[65] Wilfried Hartmann, Silvano Galliani, Michal Havlena, Luc Van Gool, and Konrad Schindler.
Learned multi-patch similarity. In 2017 IEEE International Conference on Computer Vision (ICCV),
pages 1595–1603. IEEE, 2017.
[66] Tal Hassner, Shai Harel, Eran Paz, and Roee Enbar. Effective face frontalization in unconstrained
images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages
4295–4304, 2015.
[67] HDRI Haven, 2018. https://hdrihaven.com/.
[68] Janne Heikkila. Geometric camera calibration using circular control points. IEEE Transactions on
pattern analysis and machine intelligence, 22(10):1066–1077, 2000.
[69] Yannick Hold-Geoffroy, Kalyan Sunkavalli, Jonathan Eisenmann, Matthew Fisher, Emiliano Gam-
baretto, Sunil Hadap, and Jean-François Lalonde. A perceptual measure for deep single image
camera calibration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recog-
nition, pages 2354–2363, 2018.
[70] Liwen Hu, Shunsuke Saito, Lingyu Wei, Koki Nagano, Jaewoo Seo, Jens Fursund, Iman Sadeghi,
Carrie Sun, Yen-Chun Chen, and Hao Li. Avatar digitization from a single image for real-time
rendering. ACM Transactions on Graphics (TOG), 36(6):195, 2017.
[71] Haibin Huang, Evangelos Kalogerakis, Siddhartha Chaudhuri, Duygu Ceylan, Vladimir G Kim, and
Ersin Yumer. Learning local shape descriptors from part correspondences with multiview convolu-
tional networks. ACM Transactions on Graphics (TOG), 37(1):6, 2018.
[72] Rui Huang, Shu Zhang, Tianyu Li, Ran He, et al. Beyond face rotation: Global and local
perception gan for photorealistic and identity preserving frontal view synthesis. arXiv preprint
arXiv:1704.04086, 2017.
[73] Yinghao Huang, Federica Bogo, Christoph Classner, Angjoo Kanazawa, Peter V Gehler, Ijaz
Akhter, and Michael J Black. Towards accurate markerless human shape and pose estimation over
time. arXiv preprint arXiv:1707.07548, 2017.
[74] Eldar Insafutdinov and Alexey Dosovitskiy. Unsupervised learning of shape and pose with differ-
entiable point clouds. In Annual Conference on Neural Information Processing Systems, 2018.
[75] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by
reducing internal covariate shift. In International conference on machine learning, pages 448–456,
2015.
[76] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with
conditional adversarial networks. arXiv preprint, 2017.
[77] Aaron S Jackson, Chris Manafas, and Georgios Tzimiropoulos. 3D Human Body Reconstruction
from a Single Image via Volumetric Regression. In ECCV Workshop Proceedings, PeopleCap 2018,
pages 0–0, 2018.
[78] Aaron S. Jackson, Chris Manafas, and Georgios Tzimiropoulos. 3d human body reconstruction from
a single image via volumetric regression. European Conference of Computer Vision Workshops,
2018.
[79] Mengqi Ji, Juergen Gall, Haitian Zheng, Yebin Liu, and Lu Fang. Surfacenet: An end-to-end 3d
neural network for multiview stereopsis. arXiv preprint arXiv:1708.01749, 2017.
[80] Guang Jiang and Long Quan. Detection of concentric circles for camera calibration. In Computer
Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, volume 1, pages 333–340.
IEEE, 2005.
[81] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and
super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
[82] Evangelos Kalogerakis, Melinos Averkiou, Subhransu Maji, and Siddhartha Chaudhuri. 3d shape
segmentation with projective convolutional networks. Proc. CVPR, IEEE, 2, 2017.
[83] Angjoo Kanazawa, Michael J. Black, David W Jacobs, and Jitendra Malik. End-to-end recovery of
human shape and pose. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[84] Angjoo Kanazawa, Michael J. Black, David W. Jacobs, and Jitendra Malik. End-to-end recovery of
human shape and pose. In IEEE Conference on Computer Vision and Pattern Recognition, pages
7122–7131, 2018.
[85] Angjoo Kanazawa, Shubham Tulsiani, Alexei A. Efros, and Jitendra Malik. Learning category-
specific mesh reconstruction from image collections. In European Conference on Computer Vision,
pages 371–386, 2018.
[86] Abhishek Kar, Christian Häne, and Jitendra Malik. Learning a multi-view stereo machine. In
Advances in Neural Information Processing Systems, pages 364–375, 2017.
[87] Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. Neural 3d mesh renderer. In IEEE Confer-
ence on Computer Vision and Pattern Recognition, 2018.
[88] Vahid Kazemi and Josephine Sullivan. One millisecond face alignment with an ensemble of regres-
sion trees. In 27th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014,
Columbus, United States, 23 June 2014 through 28 June 2014, pages 1867–1874. IEEE Computer
Society, 2014.
[89] Ira Kemelmacher-Shlizerman. Internet based morphable model. In Computer Vision (ICCV), 2013
IEEE International Conference on, pages 3256–3263. IEEE, 2013.
[90] Ira Kemelmacher-Shlizerman and Ronen Basri. 3d face reconstruction from a single image using
a single reference face shape. IEEE transactions on pattern analysis and machine intelligence,
33(2):394–405, 2011.
[91] Alex Kendall, Matthew Grimes, and Roberto Cipolla. Posenet: A convolutional network for real-
time 6-dof camera relocalization. In Proceedings of the IEEE international conference on computer
vision, pages 2938–2946, 2015.
[92] Hyeongwoo Kim, Michael Zollhöfer, Ayush Tewari, Justus Thies, Christian Richardt, and Christian
Theobalt. Inversefacenet: Deep single-shot inverse face rendering from a single image. arXiv
preprint arXiv:1703.10956, 2017.
[93] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
[94] Nikos Kolotouros, Georgios Pavlakos, Michael J. Black, and Kostas Daniilidis. Learning to recon-
struct 3d human pose and shape via model-fitting in the loop. In IEEE International Conference on
Computer Vision, 2019.
[95] Nikos Kolotouros, Georgios Pavlakos, and Kostas Daniilidis. Convolutional mesh regression for
single-image human shape reconstruction. In IEEE Conference on Computer Vision and Pattern
Recognition, 2019.
[96] Zorah Laehner, Daniel Cremers, and Tony Tung. Deepwrinkles: Accurate and realistic clothing
modeling. In European Conference on Computer Vision, 2018.
[97] Christoph Lassner, Javier Romero, Martin Kiefel, Federica Bogo, Michael J. Black, and Peter V
Gehler. Unite the people: Closing the loop between 3d and 2d human representations. In IEEE
Conference on Computer Vision and Pattern Recognition, 2017.
[98] Verica Lazova, Eldar Insafutdinov, and Gerard Pons-Moll. 360-degree textures of people in clothing
from a single image. In International Conference on 3D Vision, 2019.
[99] Hao Li, Bart Adams, Leonidas J Guibas, and Mark Pauly. Robust single-view geometry and motion
reconstruction. In ACM Transactions on Graphics (TOG), volume 28, page 175. ACM, 2009.
[100] Chang Hong Liu and Avi Chaudhuri. Face recognition with perspective transformation. Vision
Research, 43(23):2393–2402, 2003.
[101] Chang Hong Liu and James Ward. Face recognition in pictures is affected by perspective transfor-
mation but not by the centre of projection. Perception, 35(12):1637–1650, 2006.
[102] Shichen Liu, Tianye Li, Weikai Chen, and Hao Li. Soft rasterizer: A differentiable renderer for
image-based 3d reasoning. IEEE International Conference on Computer Vision, 2019.
[103] Shichen Liu, Shunsuke Saito, Weikai Chen, and Hao Li. Learning to infer implicit surfaces without
3d supervision. Annual Conference on Neural Information Processing Systems, 2019.
[104] Yebin Liu, Qionghai Dai, and Wenli Xu. A point-cloud-based multiview stereo algorithm for free-
viewpoint video. IEEE transactions on visualization and computer graphics, 16(3):407–418, 2010.
[105] Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. Deepfashion: Powering robust
clothes recognition and retrieval with rich annotations. In IEEE Conference on Computer Vision
and Pattern Recognition, 2016.
[106] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic
segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 3431–3440, 2015.
[107] Charles Loop, Cha Zhang, and Zhengyou Zhang. Real-time high-resolution sparse voxelization
with application to image-based modeling. In Proceedings of the 5th High-Performance Graphics
Conference, pages 73–79. ACM, 2013.
[108] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black.
Smpl: A skinned multi-person linear model. ACM Transactions on Graphics, 34(6):248, 2015.
[109] Matthew M. Loper and Michael J. Black. Opendr: An approximate differentiable renderer. In
European Conference on Computer Vision, 2014.
[110] William E. Lorensen and Harvey E. Cline. Marching cubes: A high resolution 3D surface construction algorithm. ACM SIGGRAPH Computer Graphics, 21(4):163–169, 1987.
[111] William E Lorensen and Harvey E Cline. Marching cubes: A high resolution 3d surface construction
algorithm. In ACM siggraph computer graphics, volume 21, pages 163–169. ACM, 1987.
[112] Zhaoliang Lun, Matheus Gadelha, Evangelos Kalogerakis, Subhransu Maji, and Rui Wang.
3d shape reconstruction from sketches via multi-view convolutional networks. arXiv preprint
arXiv:1707.06375, 2017.
[113] Wan-Chun Ma, Tim Hawkins, Pieter Peers, Charles-Felix Chabert, Malte Weiss, and Paul Debevec.
Rapid acquisition of specular and diffuse normal maps from polarized spherical gradient illumina-
tion. In Proceedings of the 18th Eurographics conference on Rendering Techniques, pages 183–194.
Eurographics Association, 2007.
[114] Takashi Matsuyama, Shohei Nobuhara, Takeshi Takai, and Tony Tung. 3D Video and Its Applica-
tions. Springer, 2012.
[115] Wojciech Matusik, Chris Buehler, Ramesh Raskar, Steven J Gortler, and Leonard McMillan. Image-
based visual hulls. In Proceedings of the 27th annual conference on Computer graphics and inter-
active techniques, pages 369–374. ACM Press/Addison-Wesley Publishing Co., 2000.
[116] Gerard Medioni and Sing Bing Kang. Emerging topics in computer vision. Prentice Hall PTR,
2004.
[117] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas
Geiger. Occupancy networks: Learning 3d reconstruction in function space. arXiv preprint
arXiv:1812.03828, 2018.
[118] Ryota Natsume, Shunsuke Saito, Zeng Huang, Weikai Chen, Chongyang Ma, Hao Li, and Shigeo
Morishima. Siclope: Silhouette-based clothed people. In IEEE Conference on Computer Vision and
Pattern Recognition, pages 4480–4490, 2019.
[119] Ryota Natsume, Shunsuke Saito, Zeng Huang, Weikai Chen, Chongyang Ma, Hao Li, and Shigeo
Morishima. Siclope: Silhouette-based clothed people. In IEEE Conference on Computer Vision and
Pattern Recognition, 2019.
[120] Natalia Neverova, Riza Alp Guler, and Iasonas Kokkinos. Dense pose transfer. In European Con-
ference on Computer Vision, pages 123–138, 2018.
[121] Richard A. Newcombe, Dieter Fox, and Steven M. Seitz. Dynamicfusion: Reconstruction and
tracking of non-rigid scenes in real-time. In IEEE Conference on Computer Vision and Pattern
Recognition, 2015.
[122] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estima-
tion. In European Conference on Computer Vision, 2016.
[123] Mohamed Omran, Christoph Lassner, Gerard Pons-Moll, Peter V . Gehler, and Bernt Schiele. Neu-
ral body fitting: Unifying deep learning and model-based human pose and shape estimation. In
International Conference on 3D Vision, pages 484–494, 2018.
[124] Mohamed Omran, Christoph Lassner, Gerard Pons-Moll, Peter V . Gehler, and Bernt Schiele. Neu-
ral body fitting: Unifying deep learning and model-based human pose and shape estimation. In
International Conference on 3D Vision, 2018.
86
[125] Eunbyung Park, Jimei Yang, Ersin Yumer, Duygu Ceylan, and Alexander C Berg. Transformation-
grounded image generation network for novel 3d view synthesis. In 2017 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), pages 702–711. IEEE, 2017.
[126] Eunbyung Park, Jimei Yang, Ersin Yumer, Duygu Ceylan, and Alexander C. Berg. Transformation-
grounded image generation network for novel 3d view synthesis. In IEEE Conference on Computer
Vision and Pattern Recognition, pages 3500–3509, 2017.
[127] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove.
Deepsdf: Learning continuous signed distance functions for shape representation. arXiv preprint
arXiv:1901.05103, 2019.
[128] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove.
Deepsdf: Learning continuous signed distance functions for shape representation. In The IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[129] Frederic I Parke and Keith Waters. Computer facial animation. CRC Press, 2008.
[130] Deepak Pathak, Philipp Kr¨ ahenb¨ uhl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context
encoders: Feature learning by inpainting. In IEEE Conference on Computer Vision and Pattern
Recognition, pages 2536–2544, 2016.
[131] Georgios Pavlakos, Luyang Zhu, Xiaowei Zhou, and Kostas Daniilidis. Learning to estimate 3D
human pose and shape from a single color image. In IEEE Conference on Computer Vision and
Pattern Recognition, pages 459–468, 2018.
[132] Gerard Pons-Moll, Sergi Pujades, Sonny Hu, and Michael Black. Clothcap: Seamless 4d clothing
capture and retargeting. ACM Transactions on Graphics,(Proc. SIGGRAPH)[to appear], 1, 2017.
[133] Gerard Pons-Moll, Sergi Pujades, Sonny Hu, and Michael J. Black. Clothcap: seamless 4d clothing
capture and retargeting. ACM Transactions on Graphics, 36(4):73:1––73:15, 2017.
[134] Fabi´ an Prada, Misha Kazhdan, Ming Chuang, Alvaro Collet, and Hugues Hoppe. Spatiotemporal
atlas parameterization for evolving meshes. ACM Transactions on Graphics (TOG), 36(4):58, 2017.
[135] Charles R Qi, Hao Su, Matthias Nießner, Angela Dai, Mengyuan Yan, and Leonidas J Guibas.
V olumetric and multi-view cnns for object classification on 3d data. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 5648–5656, 2016.
[136] Hang Qi, Yuanlu Xu, Tao Yuan, Tianfu Wu, and Song-Chun Zhu. Scene-centric joint parsing of
cross-view videos. In AAAI Conference on Artificial Intelligence, 2018.
[137] Fabio Remondino and Clive Fraser. Digital camera calibration methods: considerations and com-
parisons. International Archives of Photogrammetry, Remote Sensing and Spatial Information Sci-
ences, 36(5):266–272, 2006.
[138] Shaoqing Ren, Xudong Cao, Yichen Wei, and Jian Sun. Face alignment at 3000 fps via regressing
local binary features. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 1685–1692, 2014.
[139] Renderpeople, 2018. https://renderpeople.com/3d-people.
[140] Danilo Jimenez Rezende, SM Ali Eslami, Shakir Mohamed, Peter Battaglia, Max Jaderberg, and
Nicolas Heess. Unsupervised learning of 3d structure from images. In Advances In Neural Infor-
mation Processing Systems, pages 4996–5004, 2016.
87
[141] Elad Richardson, Matan Sela, Roy Or-El, and Ron Kimmel. Learning detailed face reconstruc-
tion from a single image. In 2017 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pages 5553–5562. IEEE, 2017.
[142] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedi-
cal image segmentation. In Medical Image Computing and Computer-Assisted Intervention, 2015.
[143] Carsten Rother, Vladimir Kolmogorov, and Andrew Blake. Grabcut: Interactive foreground extrac-
tion using iterated graph cuts. In ACM transactions on graphics (TOG), volume 23, pages 309–314.
ACM, 2004.
[144] Christos Sagonas, Yannis Panagakis, Stefanos Zafeiriou, and Maja Pantic. Robust statistical face
frontalization. In Proceedings of the IEEE International Conference on Computer Vision, pages
3871–3879, 2015.
[145] Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao
Li. Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization. In IEEE
International Conference on Computer Vision, 2019.
[146] Joaquim Salvi, Xavier Armangu´ e, and Joan Batlle. A comparative review of camera calibrating
methods with accuracy evaluation. Pattern recognition, 35(7):1617–1635, 2002.
[147] Jason M Saragih, Simon Lucey, and Jeffrey F Cohn. Deformable model fitting by regularized
landmark mean-shift. International Journal of Computer Vision, 91(2):200–215, 2011.
[148] Stan Sclaroff and Alex Pentland. Generalized implicit functions for computer graphics, volume 25.
ACM, 1991.
[149] Evan Shelhamer, Jonathan Long, and Trevor Darrell. Fully convolutional networks for semantic
segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 39(4):640–651, 2017.
[150] Xiaoyong Shen, Aaron Hertzmann, Jiaya Jia, Sylvain Paris, Brian Price, Eli Shechtman, and Ian
Sachs. Automatic portrait segmentation for image stylization. In Computer Graphics Forum, vol-
ume 35, pages 93–102. Wiley Online Library, 2016.
[151] Baoguang Shi, Song Bai, Zhichao Zhou, and Xiang Bai. Deeppano: Deep panoramic representation
for 3-d shape recognition. IEEE Signal Processing Letters, 22(12):2339–2343, 2015.
[152] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image
recognition. arXiv preprint arXiv:1409.1556, 2014.
[153] Ayan Sinha, Asim Unmesh, Qixing Huang, and Karthik Ramani. Surfnet: Generating 3d shape
surfaces using deep residual networks. In Proc. CVPR, 2017.
[154] Peter-Pike Sloan, Jan Kautz, and John Snyder. Precomputed radiance transfer for real-time ren-
dering in dynamic, low-frequency lighting environments. In ACM Transactions on Graphics, vol-
ume 21, pages 527–536, 2002.
[155] Cristian Sminchisescu and Alexandru Telea. Human pose estimation from silhouettes. a consistent
approach using distance level sets. In International Conference on Computer Graphics, Visualiza-
tion and Computer Vision, volume 10, 2002.
[156] Amir Arsalan Soltani, Haibin Huang, Jiajun Wu, Tejas D Kulkarni, and Joshua B Tenenbaum.
Synthesizing 3d shapes via modeling multi-view depth maps and silhouettes with deep generative
networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 1511–1519, 2017.
88
[157] Dan Song, Ruofeng Tong, Jian Chang, Xiaosong Yang, Min Tang, and Jian Jun Zhang. 3d body
shapes estimation from dressed-human silhouettes. In Computer Graphics Forum, volume 35, pages
147–156. Wiley Online Library, 2016.
[158] Jonathan Starck and Adrian Hilton. Surface capture for performance-based animation. IEEE com-
puter graphics and applications, 27(3), 2007.
[159] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view convolu-
tional neural networks for 3d shape recognition. In Proceedings of the IEEE international confer-
ence on computer vision, pages 945–953, 2015.
[160] Hao Su, Fan Wang, Li Yi, and Leonidas Guibas. 3d-assisted image feature synthesis for novel views
of an object. arXiv preprint arXiv:1412.0003, 2014.
[161] Shao-Hua Sun, Minyoung Huh, Yuan-Hong Liao, Ning Zhang, and Joseph J Lim. Multi-view to
novel view: Synthesizing novel views with self-learned confidence. In European Conference on
Computer Vision, pages 155–171, 2018.
[162] Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Multi-view 3d models from single
images with a convolutional network. In European Conference on Computer Vision, pages 322–
337. Springer, 2016.
[163] Ayush Tewari, Michael Zollh¨ ofer, Hyeongwoo Kim, Pablo Garrido, Florian Bernard, Patrick Perez,
and Christian Theobalt. Mofa: Model-based deep convolutional face autoencoder for unsupervised
monocular reconstruction. In The IEEE International Conference on Computer Vision (ICCV),
volume 2, 2017.
[164] Justus Thies, Michael Zollh¨ ofer, Marc Stamminger, Christian Theobalt, and Matthias Nießner.
Face2face: Real-time face capture and reenactment of rgb videos. In Computer Vision and Pat-
tern Recognition (CVPR), 2016 IEEE Conference on, pages 2387–2395. IEEE, 2016.
[165] Jonathan J Tompson, Arjun Jain, Yann LeCun, and Christoph Bregler. Joint training of a convolu-
tional network and a graphical model for human pose estimation. In Annual Conference on Neural
Information Processing Systems, 2014.
[166] Roger Tsai. A versatile camera calibration technique for high-accuracy 3d machine vision metrol-
ogy using off-the-shelf tv cameras and lenses. IEEE Journal on Robotics and Automation, 3(4):323–
344, 1987.
[167] Shubham Tulsiani, Tinghui Zhou, Alexei A Efros, and Jitendra Malik. Multi-view supervision for
single-view reconstruction via differentiable ray consistency. In CVPR, volume 1, page 3, 2017.
[168] Shubham Tulsiani, Tinghui Zhou, Alexei A. Efros, and Jitendra Malik. Multi-view supervision for
single-view reconstruction via differentiable ray consistency. In IEEE Conference on Computer
Vision and Pattern Recognition, pages 2626–2634, 2017.
[169] Hsiao-Yu Tung, Hsiao-Wei Tung, Ersin Yumer, and Katerina Fragkiadaki. Self-supervised learning
of motion capture. In Annual Conference on Neural Information Processing Systems, 2017.
[170] Tony Tung, Shohei Nobuhara, and Takashi Matsuyama. Complete multi-view reconstruction of
dynamic scenes from probabilistic fusion of narrow and wide baseline stereo. In IEEE 12th Inter-
national Conference on Computer Vision ICCV, 2009.
[171] Fredo Durand Tzu-Mao Li, Miika Aittala and Jaakko Lehtinen. Differentiable monte carlo ray
tracing through edge sampling. ACM Transactions on Graphics, 37(6):222:1–222:11, 2018.
89
[172] Joachim Valente and Stefano Soatto. Perspective distortion modeling, learning and compensation. In
Computer Vision and Pattern Recognition Workshops (CVPRW), 2015 IEEE Conference on, pages
9–16. IEEE, 2015.
[173] Gul Varol, Duygu Ceylan, Bryan Russell, Jimei Yang, Ersin Yumer, Ivan Laptev, and Cordelia
Schmid. BodyNet: V olumetric inference of 3D human body shapes. In European Conference on
Computer Vision, 2018.
[174] G¨ ul Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J. Black, Ivan Laptev, and
Cordelia Schmid. Learning from synthetic humans. In CVPR, 2017.
[175] Paul Viola and Michael J Jones. Robust real-time face detection. International journal of computer
vision, 57(2):137–154, 2004.
[176] Daniel Vlasic, Ilya Baran, Wojciech Matusik, and Jovan Popovi´ c. Articulated mesh animation from
multi-view silhouettes. In ACM Transactions on Graphics (TOG), volume 27, page 97. ACM, 2008.
[177] Daniel Vlasic, Ilya Baran, Wojciech Matusik, and Jovan Popovic. Articulated mesh animation from
multi-view silhouettes. ACM Transactions on Graphics, 27(3):97:1—-97:9, 2008.
[178] Daniel Vlasic, Matthew Brand, Hanspeter Pfister, and Jovan Popovi´ c. Face transfer with multilinear
models. ACM transactions on graphics (TOG), 24(3):426–433, 2005.
[179] Daniel Vlasic, Pieter Peers, Ilya Baran, Paul Debevec, Jovan Popovi´ c, Szymon Rusinkiewicz, and
Wojciech Matusik. Dynamic shape capture using multi-view photometric stereo. ACM Transactions
on Graphics (TOG), 28(5):174, 2009.
[180] Daniel Vlasic, Pieter Peers, Ilya Baran, Paul Debevec, Jovan Popovic, Szymon Rusinkiewicz, and
Wojciech Matusik. Dynamic shape capture using multi-view photometric stereo. In ACM SIG-
GRAPH, 2009.
[181] Ingo Wald, Sven Woop, Carsten Benthin, Gregory S Johnson, and Manfred Ernst. Embree: a kernel
framework for efficient cpu ray tracing. ACM Transactions on Graphics, 33(4):143, 2014.
[182] Brittany Ward, Max Ward, Ohad Fried, and Boris Paskhover. Nasal distortion in short-distance
photographs: The selfie effect. JAMA facial plastic surgery, 2018.
[183] Michael Waschb¨ usch, Stephan W¨ urmlin, Daniel Cotting, Filip Sadlo, and Markus Gross. Scalable
3d video of dynamic scenes. The Visual Computer, 21(8):629–638, 2005.
[184] Thibaut Weise, Hao Li, Luc Van Gool, and Mark Pauly. Face/off: Live facial puppetry. In Proceed-
ings of the 2009 ACM SIGGRAPH/Eurographics Symposium on Computer animation, pages 7–16.
ACM, 2009.
[185] Chung-Yi Weng, Brian Curless, and Ira Kemelmacher-Shlizerman. Photo wake-up: 3d character
animation from a single photo. arXiv preprint arXiv:1812.02246, 2018.
[186] Ramesh Raskar Leonard McMillan Wojciech Matusik, Chris Buehler and Steven Gortler. Image-
based visual hulls. In ACM SIGGRAPH, 2000.
[187] Scott Workman, Connor Greenwell, Menghua Zhai, Ryan Baltenberger, and Nathan Jacobs. Deep-
focal: a method for direct focal length estimation. In Image Processing (ICIP), 2015 IEEE Interna-
tional Conference on, pages 1369–1373. IEEE, 2015.
[188] Scott Workman, Menghua Zhai, and Nathan Jacobs. Horizon lines in the wild. In British Machine
Vision Conference (BMVC), 2016.
90
[189] Chenglei Wu, Kiran Varanasi, Yebin Liu, Hans-Peter Seidel, and Christian Theobalt. Shading-based
dynamic shape refinement from multi-view video under general illumination. In Computer Vision
(ICCV), 2011 IEEE International Conference on, pages 1108–1115. IEEE, 2011.
[190] Chenglei Wu, Kiran Varanasi, and Christian Theobalt. Full body performance capture under uncon-
trolled and varying illumination: A shading-based approach. Computer Vision–ECCV 2012, pages
757–770, 2012.
[191] Chenglei Wu, Kiran Varanasi, and Christian Theobalt. Full body performance capture under uncon-
trolled and varying illumination: A shading-based approach. In European Conference on Computer
Vision, 2012.
[192] Xuehan Xiong and Fernando De la Torre. Supervised descent method and its applications to face
alignment. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages
532–539. IEEE, 2013.
[193] Hongyi Xu and Jernej Barbiˇ c. Signed distance fields for polygon soup meshes. Graphics Interface
2014, 2014.
[194] Yuanlu Xu, Xiaobai Liu, Yang Liu, and Song-Chun Zhu. Multi-view people tracking via hierarchical
trajectory composition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[195] Yuanlu Xu, Xiaobai Liu, Lei Qin, and Song-Chun Zhu. Multi-view people tracking via hierarchical
trajectory composition. In AAAI Conference on Artificial Intelligence, 2017.
[196] Yuanlu Xu, Song-Chun Zhu, and Tony Tung. DenseRaC: Joint 3D pose and shape estimation by
dense render-and-compare. In IEEE International Conference on Computer Vision, 2019.
[197] Jinlong Yang, Jean-S´ ebastien Franco, Franck H´ etroy-Wheeler, and Stefanie Wuhrer. Estimation of
human body shape in motion with wide clothing. In European Conference on Computer Vision,
pages 439–454. Springer, 2016.
[198] Jinlong Yang, Jean-S´ ebastien Franco, Franck H´ etroy-Wheeler, and Stefanie Wuhrer. Estimation of
human body shape in motion with wide clothing. In European Conference on Computer Vision,
2016.
[199] Lijun Yin, Xiaochen Chen, Yi Sun, Tony Worm, and Michael Reale. A high-resolution 3d dynamic
facial expression database. In Automatic Face & Gesture Recognition, 2008. FG’08. 8th IEEE
International Conference on, pages 1–6. IEEE, 2008.
[200] Xi Yin, Xiang Yu, Kihyuk Sohn, Xiaoming Liu, and Manmohan Chandraker. Towards large-pose
face frontalization in the wild. In Proc. ICCV, pages 1–10, 2017.
[201] Tao Yu, Zerong Zheng, Kaiwen Guo, Jianhui Zhao, Qionghai Dai, Hao Li, Gerard Pons-Moll, and
Yebin Liu. Doublefusion: Real-time capture of human performances with inner body shapes from
a single depth sensor. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[202] Stefanos Zafeiriou, George Trigeorgis, Grigorios Chrysos, Jiankang Deng, and Jie Shen. The menpo
facial landmark localisation challenge: A step towards the solution. In The IEEE Conference on
Computer Vision and Pattern Recognition (CVPR) Workshops, volume 1, page 2, 2017.
[203] Yixuan Wei Qionghai Dai Yebin Liu Zerong Zheng, Tao Yu. Deephuman: 3d human reconstruction
from a single image. In IEEE International Conference on Computer Vision, 2019.
[204] Chao Zhang, Sergi Pujades, Michael Black, and Gerard Pons-Moll. Detailed, accurate, human shape
estimation from clothed 3D scan sequences. In IEEE Conference on Computer Vision and Pattern
Recognition, 2017.
91
[205] Zhengyou Zhang. A flexible new technique for camera calibration. IEEE Transactions on pattern
analysis and machine intelligence, 22(11):1330–1334, 2000.
[206] Zerong Zheng, Tao Yu, Hao Li, Kaiwen Guo, Qionghai Dai, Lu Fang, and Yebin Liu. Hybridfu-
sion: Real-time performance capture using a single depth sensor and sparse imus. In European
Conference on Computer Vision, 2018.
[207] Shizhe Zhou, Hongbo Fu, Ligang Liu, Daniel Cohen-Or, and Xiaoguang Han. Parametric reshaping
of human bodies in images. In ACM Transactions on Graphics, page 126, 2010.
[208] Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Malik, and Alexei A Efros. View synthesis
by appearance flow. In European conference on computer vision, pages 286–301. Springer, 2016.
[209] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation
using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision,
pages 2223–2232, 2017.
[210] C Lawrence Zitnick, Sing Bing Kang, Matthew Uyttendaele, Simon Winder, and Richard Szeliski.
High-quality video view interpolation using a layered representation. In ACM Transactions on
Graphics (TOG), volume 23, pages 600–608. ACM, 2004.
[211] Xinxin Zuo, Chao Du, Sen Wang, Jiangbin Zheng, and Ruigang Yang. Interactive visual hull re-
finement for specular and transparent object surface reconstruction. In Proceedings of the IEEE
International Conference on Computer Vision, pages 2237–2245, 2015.
92
Abstract
3D human digitization has been explored for several decades in computer vision and computer graphics. Accurate reconstruction methods have been proposed using various types of sensors, and applications have become popular in sports, medicine, and entertainment (e.g., movies, games, AR/VR experiences). However, these setups either require tightly controlled environments or are restricted to a single body part, such as the face or hair. To date, full-body 3D human digitization with detailed geometry, appearance, and rigging from in-the-wild pictures (i.e., taken under natural conditions rather than in a laboratory) remains challenging.

In an era where immersive technologies and personal digital devices are increasingly prevalent, the ability to create virtual 3D humans at scale, accessible to end users, is essential to a wide range of applications. In this dissertation, we explore a complete pipeline for digitizing the full human body from very sparse multi-view or single-view setups, using only consumer-level RGB cameras. We first address the pre-processing of the camera data and introduce a method for correcting the perspective distortion commonly seen in near-range photos of humans, especially portraits. We then present an end-to-end, deep-learning-based volumetric reconstruction framework that infers highly detailed 3D surface geometry and texture from a single image and, optionally, from multiple input images. Finally, building on this framework, we perform the reconstruction alongside a pre-defined body rig, enabling us to recover a full-body model in a normalized canonical pose, in a format ready for animation.
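To make the volumetric reconstruction framework summarized above more concrete, the following PyTorch sketch illustrates the pixel-aligned implicit-function idea at its core: the occupancy of an arbitrary 3D point is predicted from an image feature sampled at the point's 2D projection, together with the point's depth. This is a minimal sketch under stated assumptions, not the dissertation's implementation: the toy encoder, the MLP sizes, and the orthographic projection helper are illustrative placeholders, and the texture branch and iso-surface extraction are omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelAlignedImplicitFunction(nn.Module):
    # Predicts occupancy in [0, 1] for 3D query points from image features
    # sampled at their 2D projections (pixel-aligned features) plus depth.
    def __init__(self, feat_dim=256):
        super().__init__()
        # Toy fully convolutional encoder (stand-in for a real backbone).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # MLP maps (pixel feature, depth) -> occupancy probability.
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 1, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Sigmoid(),
        )

    def forward(self, image, points, project):
        # image:   (B, 3, H, W) input photo
        # points:  (B, N, 3) query points in camera space
        # project: maps 3D points to normalized 2D coordinates in [-1, 1]
        feat = self.encoder(image)                              # (B, C, H', W')
        uv = project(points)                                    # (B, N, 2)
        sampled = F.grid_sample(feat, uv.unsqueeze(2),
                                align_corners=True)             # (B, C, N, 1)
        sampled = sampled.squeeze(-1).permute(0, 2, 1)          # (B, N, C)
        depth = points[..., 2:3]                                # (B, N, 1)
        return self.mlp(torch.cat([sampled, depth], dim=-1))    # (B, N, 1)

def orthographic(points):
    # Orthographic projection: keep x, y (assumed already in [-1, 1]).
    return points[..., :2]

if __name__ == "__main__":
    model = PixelAlignedImplicitFunction()
    image = torch.rand(1, 3, 512, 512)
    points = torch.rand(1, 1024, 3) * 2 - 1
    occupancy = model(image, points, orthographic)  # (1, 1024, 1)
    print(occupancy.shape)

In practice, a full surface is recovered by evaluating such a function on a dense 3D grid of query points and extracting an iso-surface, and an analogous function that regresses RGB values at surface points provides the texture.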