SCALABLE DYNAMIC DIGITAL HUMANS
by
Tianye Li
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2022
Copyright 2023 Tianye Li
献给我的家人
To my family
Acknowledgements
I am forever in debt to Randall Hill, who provided a fantastic computer graphics research envi-
ronment. His support and advice encouraged me to work through difficult problems in research,
and perhaps more importantly, in life. I am grateful to Hao Li for introducing me to the world of
computer graphics and encouraging me to think big and be bold. I thank Stefan Scherer, Andrew
Nealen, Cyrus Shahabi, Ramesh Govindan, and Stefanos Nikolaidis for taking the time to provide
valuable feedback in the qualifying examination and the dissertation defense.
At the University of Southern California (USC) and the USC Institute for Creative Tech-
nologies (ICT), I was fortunate to collaborate with a great team of researchers: Shunsuke Saito,
Zeng Huang, Chloe Legendre, Shichen Liu, Yajie Zhao, Weikai Chen, Jun Xing, Xinglei Ren,
Ari Shapiro, and Jiayi Liu. I realize in hindsight that during the exhausting, sleepless nights we
spent together, we created not only great research results but also treasured memories. I thank
Mingming He, Karl Bladin, Pratusha Prasad, Bipin Kishore, Chinmay Chinara, Aakash Shanbhag,
Marcel Ramos, Owen Ingraham, Koki Nagano, and Andrew Jones for their valuable support, and
Kathleen Haase, Christina Trejo, Lizsl De Leon, Shereen Lanzarotta, Jeff Karp, Jennifer Gerson,
Tracy Charles, Andy Shangson Chen, Aaron Thompson, and the advisors at the USC Office of In-
ternational Services for their support behind the scenes, without which none of this would have
been possible.
My lab mates at USC/ICT, with whom I have been through life’s exciting ups and downs
together - Liwen Hu, Lingyu “Cosimo” Wei, Kyle Olszewski, Shunsuke Saito, Yi Zhou, Zeng
Huang, Chloe Legendre, Zimo Li, Sitao Xiang, Ronald Yu, Ruizhe Wang, Shichen Liu, Haiwei
Chen, Pengda Xiang, Yuliang Xiu, Ruilong Li, Zhengfei Kuang, Jiaman Li, Yuming Gu, Jing Yang,
Hanyuan Xiao, Ziqi Zeng, Yuka Murata, Junying Wang, Kyle Morgenroth, Yijing Li, Bohan Wang,
Hongyi Xu, Danyong Zhao, Mianlun Zheng, Giovanni Sutanto, Qiangeng Xu, Cho-Ying Wu, Yiqi
Zhong, Arka Sadhu, Xuefeng Hu, Qiangui Huang, Weiyue Wang, Loc Huynh, Kuan-Wen Huang,
Joanne Kao, Kuang Liu, and Jian Li - I appreciate your friendship.
I was fortunate to build connections outside of USC. Javier Romero - thank you for inviting
me to an amazing internship at the Max Planck Institute for Intelligent Systems (MPI). I will never
forget your infinite passion for research. I appreciate Michael J. Black’s advice and insights. You
have always been a role model for young researchers like me. I thank Timo Bolkart for being a
stalwart friend to me and supporting me throughout my Ph.D. journey. I appreciate the stimulat-
ing discussions and the companionship with many members of the MPI: Dimitris Tzionas, Anurag
Ranjan, Yiyi Liao, Julieta Martinez, Yinghao Huang, Aseem Behl, Naureen Mahmood, Talha Za-
man, Fatma Güney, Siyu Tang, Jonas Wulff, Sandra M. Kim (and little Lea), Osman Ulusoy, Varun
Jampani, Christoph Lassner, Gerard Pons-Moll, Sergi Pujades, Joel Janai, Thomas Nestmeyer,
Peter Gehler, Naejin Kong, Meekyoung Kim, Simon Donne, David Stut, Despoina Paschalidou,
Cassidy Laidlaw, Sivaram Prasad Mudunuri, Chao Zhang, Sergey Prokudin, Ahmed A. A. Osman,
Laura Sevilla-Lara, Alejandra Quirós-Ramírez, Lars Mescheder, Gernot Riegler, Benjamin Coors,
Shane Gu, and Rocko. Melanie Feldhofer, Nicole Overbaugh, Andrea Keller, Tsvetelina Alexiadis,
Ra Enciaud, and Jorge Márquez provided support that made everything possible.
Chongyang Ma and Linjue Luo offered me the opportunity for a fantastic internship at Snap
Research. I also thank Ruotian Luo, Yongxi Lu, Jianwei Yang, Zhenglin Geng, Seonghyeon Nam,
Liuhao Ge, Jianfei Yu, Zhizhong Li, Zhengyuan Yang, Anhong Guo, Xuecheng Nie, Yuchong
Xiang, Han Wang, and Yifan Sun for the fruitful discussions and a summer full of excitement.
Furthermore, I thank Rachel Greenfield and Melanie Poblete for their support.
Zhaoyang Lv and Mira Slavcheva hosted me for a fruitful internship at Meta Reality Labs
during a rather difficult time (2020-2021). Michael Zollhöfer, Simon Green, Christoph Lassner,
Changil Kim, Tanner Schmidt, Steven Lovegrove, Michael Goesele, and Richard Newcombe pro-
vided valuable input into our project. I thank Mahmoud M. Azab for his mentorship, Binbin Xu
and Wentao Yuan for their companionship, and Sarah Rathbun, Courtney Marshall, and Braelena
Hills for their support.
I appreciate the opportunity to work as a teaching assistant with Parag Havaldar, Ram Nevatia,
and Hao Li. Saty Raghavachary’s computer graphics course and Bart Kosko’s probability and
statistics courses have always been an inspiration to me. I am grateful to Hao Li, Keith Jenkins,
and Stefan Scherer for their advice and support during the master’s program at USC. I appreciate
the warm encouragement from Kevin Wines and Charlene Delapena at Dolby Laboratories and
Ilias Diakonikolas at USC.
Thabo Beeler, Darren Cosker, Adrian Hilton, and Qianli Ma provided invaluable datasets for
my research. I thank Mike Seymour for letting me use the image from MeetMike. I thank Federica
Bogo, J. P. Lewis, Mengqi Ji, Tao Yu, Qi Ye, and Ziqian Bai for the fruitful discussions, and I thank
Christian Richardt and Orazio Gallo for their tips on LaTeX.
I thank the anonymous reviewers for their sometimes sobering yet always valuable feedback,
as well as anyone who has rejected me over the years but provided constructive criticism in the
process. Your feedback often inspired the most introspection and helped me to grow.
I am thankful to my friends, Bowei, Xiaomo, Han-Wei, Sophia, Anne-Katrin, Rewati, Grace,
Mark, Suyun, Jaewoo, Rui, Nancy, and Spencer, who have always been there for me.
Finally, I am grateful to my family, especially my parents, for your unlimited love and support.
Without you, I could not have finished this long and arduous journey. My grandfather Sizhao and
my aunt Yuan, unfortunately, will not see the end result of this dissertation, but I know you would
be proud.
In addition, I thank the ICT pool table and MPI coffee machine for being an endless fount of
inspiration.
Table of Contents
Dedication ii
Acknowledgements iii
List of Tables x
List of Figures xii
Abstract xix
Chapter 1: Introduction 1
1.1 Digital Humans 1
1.2 Scalable Digital Humans 4
1.2.1 Generic Face Modeling 6
1.2.2 Efficient and Automated Processing 7
1.2.3 General Methodology 8
1.3 Contributions 8
Chapter 2: Generic Face Modeling 12
2.1 Introduction 12
2.2 Related Work 16
2.3 Model Formulation 19
2.4 Temporal Registration 23
2.4.1 Initial Model 24
2.4.2 Single-frame Registration 25
2.4.3 Sequential Registration 28
2.5 Data 29
2.5.1 Capture Setup 29
2.5.2 Training Data 31
2.5.3 Test Data 32
2.5.4 Implementation Details 32
2.6 Model Training 34
2.6.1 Pose Parameter Training 34
2.6.2 Expression Parameter Training 35
2.6.3 Shape Parameter Training 36
2.6.4 Optimization Structure 36
2.7 Experiments 36
2.7.1 Registration Quality 37
2.7.2 Model Quality 40
2.7.3 Comparison to State-of-the-art 44
2.7.4 Shape Reconstruction from Images 48
2.7.5 Expression Transfer 50
2.7.6 Discussion 51
2.8 Conclusion 52
Chapter 3: Efficient Inference for Topologically Consistent Face Meshes 54
3.1 Introduction 54
3.2 Related Work 57
3.3 Multi-View Face Inference 59
3.3.1 Global Geometry Stage 61
3.3.2 Local Geometry Stage 62
3.3.3 Appearance and Detail Capture 64
3.4 Experiments 65
3.4.1 Datasets 65
3.4.2 Implementation Details 66
3.4.3 Results 67
3.5 Conclusion 75
Chapter 4: General Capture and Modeling with Dynamic Neural Radiance Fields 77
4.1 Introduction 78
4.2 Related Work 80
4.3 DyNeRF: Dynamic Neural Radiance Fields 83
4.3.1 Representation 84
4.3.2 Efficient Training 86
4.4 Experiments 89
4.4.1 Datasets 90
4.4.2 Evaluation Settings 91
4.4.3 Results 93
4.5 Conclusion 100
Chapter 5: Conclusion and Outlook 101
5.1 Summary of Contributions 101
5.2 Discussion 105
5.3 Outlook 111
Bibliography 114
Appendix A 134
Generic Face Modeling 134
Appendix B 140
Efficient Inference for Topologically Consistent Face Meshes 140
Appendix C 152
General Capture and Modeling with Dynamic Neural Radiance Fields 152
C.1 Supplemental Video 152
C.2 Datasets 153
C.3 Importance Sampling Schemes 156
C.4 More Results 158
List of Tables
3.1 Comparison on run time on base mesh, given images from 15 views and measured in seconds. 73
4.1 Quantitative comparison of our proposed method to baselines of existing methods and radiance field baselines trained at 200K iterations on a 10-second sequence. 96
4.2 Quantitative comparison of our proposed method to baselines using the perceptual video quality metric Just-Objectionable-Difference (JOD) [127]. A higher number (maximum 10) indicates less noticeable visual difference to the ground truth. 96
B.1 Comparison on geometry accuracy (median s2m) and correspondence accuracy (median v2v) among the learning-based methods, measured in millimeters. “PP” denotes the result after a post-processing Procrustes alignment that solves for the optimal rigid pose (i.e. 3D rotation and translation) and scale to best align the reconstructed mesh with the ground truth. Note that our method requires no post-processing. 141
C.1 Quantitative comparison of our proposed method to baselines of existing methods and radiance field baselines trained at 200K iterations on a 10-second sequence. DyNeRF-IS* uses both sampling strategies (ISG and IST) and thus runs for more iterations: 250K iterations of ISG, followed by 100K of IST; it is shown here only for completeness. 159
C.2 Comparison in model storage size of our method (DyNeRF) to alternative solutions. For HEVC, we use the default GoPro 7 video codec. For JPEG, we employ a compression rate that maintains the highest image quality. For NeRF, we use a set of the original NeRF networks [134] reconstructed frame by frame. For HEVC, PNG and JPEG, the required memory may vary within a factor of 3 depending on the video appearance. For NeuralVolumes (NV), the figure only accounts for the neural network size without counting its dependency on additional input streams. For NeRF, NeuralVolumes and DyNeRF, the required memory is constant. All calculations are based on 10 seconds of 30 FPS videos captured by 18 cameras. 163
C.3 Ablation studies on the latent code dimension on a sequence of 60 consecutive frames. Codes of dimension 8 are insufficient to capture sharp details, while codes of dimension 8,192 take too long to be processed by the network. We use 1,024 for our experiments, which allows for high quality while converging fast. *Note that with a code length of 8,192 we cannot fit the same number of samples in GPU memory as in the other cases, so we report a score from a later iteration when roughly the same number of samples have been used. 164
List of Figures
1.1 Realistic digital humans. (a) Meet Mike [180, 181] (courtesy of Mike Seymour and Epic Games); (b) Project Starline [78, 105] (courtesy of Google); (c) New Dimensions in Testimony [206] (courtesy of USC ICT). 2
1.2 Example facial assets in making The Curious Case of Benjamin Button. Image credit: image (a) through (d) are from TED Talk “How Benjamin Button Got His Face” [208, 209], presented by Ed Ulbrich. Image (e) used with permission from USC Institute for Creative Technologies, Vision and Graphics Lab. 3
2.1 FLAME example. Top: Samples of the D3DFACS dataset. Middle: Model-only registration. Bottom: Expression transfer to Beeler et al. [18] subject using model only. 12
2.2 Parametrization of our model (female model shown). Left: Activation of the first three shape components between −3 and +3 standard deviations. Middle: Pose parameters actuating four of the six neck and jaw joints in a rotational manner. Right: Activation of the first three expression components between −3 and +3 standard deviations. 20
2.3 Joint locations of the female (left) and male (right) FLAME models. Pink/yellow represent right/left eyes. Red is the neck joint and blue the jaw. 21
2.4 Overview of the face registration, model training, and application to expression transfer. 24
2.5 Predicted 49 landmarks from the CMU Intraface landmark tracker [225] (left) and the same landmarks defined on our topology (right). 26
2.6 Sample registrations. Top: shape data extracted from the CAESAR body database. Middle: sample registrations of the self-captured pose data with head rotations around the neck (left) and mouth articulations (right). Bottom: sample registrations of the expression data from D3DFACS (left) and self-captured sequences (right). Appendix A shows further registrations. 30
2.7 Visualization of the coupling weight. Head regions with higher coupling edge weight (left) and higher Laplacian weight (right). 33
2.8 Results of the model-only, coupled, and texture-based registration steps for one scan. Top: scan, registrations, and scan-to-mesh distance for each registration visualized color-coded on the scan. Bottom: original texture image, synthesized texture image for each step, and the corresponding photometric errors. 38
2.9 Results of the alternating registration approach. 39
2.10 Median per-vertex distance between registration and the scan surface. Left: Distance measure across all frames of all female (a) and male (b) training sequences. Right: Distance measure across all registered frames for the Beeler et al. [18] sequence (c) and the ground-truth error (d) measuring the within-surface drift. The supplemental video shows the full registration sequence. 40
2.11 Registration quality. Sample frames, registrations, and scan-to-mesh distance of one sequence of the D3DFACS database (top) and one of our self-captured sequences (bottom). Appendix A shows further registrations. 41
2.12 Quantitative evaluation of identity shape space (top) and expression space (bottom) of the female and male FLAME models. From left to right: compactness, generalization female, generalization male, specificity female, and specificity male. 43
2.13 Expressiveness of the FLAME identity space for fitting neutral scans of the BU-3DFE face database with a varying number of identity components. Appendix A shows further examples. 44
2.14 Influence of the pose blendshapes for different actuations of the neck and jaw joints in a rotational manner. Visualization of FLAME without (top) and with (bottom) activated pose blendshapes. 45
2.15 Cumulative scan-to-mesh distance computed over all model fits of the neutral BU-3DFE scans. 46
2.16 Comparison on identity space of Basel Face Model (BFM) [152], FaceWarehouse model [38] and FLAME for fitting neutral scans of the BU-3DFE database. Appendix A shows further examples. 47
2.17 Median per-vertex distance between registration and the scan surface, measured across all frames of the test data. Top: female data. Bottom: male data. 48
2.18 Reconstruction quality from high-resolution motion sequences compared to FaceWarehouse (FW). Intermediate frames of three motion sequences. FLAME is restricted to have the same number of parameters as FW. 49
2.19 Comparison of FaceWarehouse model (top) and FLAME (bottom) for 3D face fitting from a single 2D image. Note that the scan (pink) is only used for evaluation. Appendix A shows further examples. 49
2.20 Expression transfer from a source sequence (blue) to a static target scan (pink). The aligned personalized template for the scan is shown in green, the transferred expression in yellow. Appendix A shows further examples. 51
3.1 ToFu examples. Given (a) multi-view images, our face modeling framework ToFu uses volumetric sampling to predict (b) accurate base meshes in consistent topology as well as (c) high-resolution details and appearances. Our efficient pipeline enables (d) rapid creation of production-quality avatars for animation. 54
3.2 Overview of our end-to-end face modeling system. Given images captured from multiple views, the progressive mesh generation network predicts an accurate face mesh in consistent topology. Then the appearance and detail capture network synthesizes high-resolution skin detail and attribute maps, which enables highly detailed geometry and photo-realistic renderings. 60
3.3 Overview of the progressive mesh generation network. 61
3.4 The iterative upsampling and refinement process in the local geometry stage. 62
3.5 Evaluation on method robustness. 68
3.6 Qualitative comparison on geometric accuracy with the existing methods. The scan-to-mesh distance is visualized as a heatmap (red means > 5 mm). Note that 3DMM and DFNRMVS [11] need rigid ICP as post-processing. Our outputs require no post-processing, while outperforming the existing learning-based method in geometry accuracy. 69
3.7 Visualization on correspondence accuracy. 71
3.8 Qualitative evaluation on correspondence compared to optical flow. 72
3.9 Effect of ToFu-inferred appearance and detail maps. Based on our reliable base meshes, our appearance and detail capture network predicts realistic face skin details and attributes, without special hardware such as a Light Stage at test time, enabling photo-realistic rendering. 73
3.10 Ablation studies. Left: number of input camera views; Right: normal displacement weights in the mesh upsampling function. 74
3.11 Generalization to new capture setups. Results on the CoMA [161] dataset. 74
4.1 Neural 3D video synthesis. We propose a novel method for representing and rendering high quality 3D video. Our method trains a novel and compact dynamic neural radiance field (DyNeRF) in an efficient way. Our method demonstrates near photorealistic dynamic novel view synthesis for complex scenes including challenging scene motions and strong view-dependent effects. We demonstrate three synthesized 3D videos, and show the associated high quality geometry in the heatmap visualization in each top right corner. The embedded animations only play in Adobe Reader or KDE Okular. Please see the full video for the high-quality renderings and additional information. 77
4.2 Dynamic Neural Radiance Fields (DyNeRF). We learn the 6D plenoptic function by our novel dynamic neural radiance field that conditions on position, view direction and a compact, yet expressive time-variant latent code. 84
4.3 Overview of our efficient training strategies. We perform hierarchical training first using keyframes (b) and then on the full sequence (c). At both stages, we apply the ray importance sampling technique to focus on the rays with high time-variant information based on weight maps that measure the temporal appearance changes (a). We show a visualized example of the sampling probability based on the global median map using a heatmap (red and opaque means high probability). 87
4.4 High-quality novel view videos synthesized by our approach for dynamic real-world scenes. We visualize normalized depth in color space in the last column of each row. Our representation is compact, yet expressive and even handles complex specular reflections and translucency. 94
4.5 Comparison of our final model to existing methods, including Multi-view Stereo (MVS), local light field fusion (LLFF) [133] and NeuralVolumes (NV) [123]. The first row shows novel view rendering on a test view. The second row visualizes the FLIP error compared to the ground truth image. Compared to alternative methods, our method achieves the best visual quality. 95
4.6 Qualitative comparisons of DyNeRF variants on one image of the sequence whose averages are reported in Tab. 4.1. From left to right we show the rendering by each method, then zoom onto the moving flame gun, then visualize DSSIM and FLIP for this region using the viridis colormap (dark blue is 0, yellow is 1, lower is better). 97
4.7 Snapshots of novel view rendered videos on immersive video datasets [34]. 98
4.8 Limitation. A few examples of failed outdoor reconstruction using DyNeRF. 99
5.1 Summary of contributions in this thesis from the perspective of traditional content creation pipelines. The red blocks denote manual or semi-automated components. The blue blocks denote fully automated components. The half-transparent components were not the major emphasis of the chapter. 106
5.2 Future directions. (a) Hybrid modeling [108, 168]; (b) Unconstrained capture [131]; (c) & (d) Cross-modality, including speech animation [240] and text-driven synthesis [157]. 111
A.1 Sample registrations of the shape data extracted from the CAESAR body database. 134
A.2 Sample registrations of the self-captured pose data. Top: Head rotations around the neck. Bottom: Mouth articulations. 134
A.3 Sample registrations of the expression data from D3DFACS (top) and self-captured sequences (bottom). 135
A.4 Registration quality. Sample frames, registrations, and scan-to-mesh distance of one sequence of the D3DFACS database (top) and one of our self-captured sequences (bottom). The texture-based registration allows tracking of subtle motions such as raising eyebrows (top). 137
A.5 Expressiveness of the FLAME identity space for fitting neutral scans of the BU-3DFE face database with a varying number of identity components. 138
A.6 Additional comparison on identity space of Basel Face Model (BFM) [152], FaceWarehouse model [38] and FLAME for fitting neutral scans of the BU-3DFE database. 138
A.7 Additional comparison on 3D face fitting from a single 2D image of FaceWarehouse model (top) and FLAME (bottom). Note that the scan is only used for evaluation. 139
A.8 Additional results on expression transfer from a source sequence (blue) to a static target scan (pink). The aligned personalized template for the scan is shown in green, the transferred expression in yellow. 139
B.1 Quantitative evaluation by cumulative error curves for scan-to-mesh distances among learning-based methods. 141
B.2 Example results from DFNRMVS [11]. 141
B.3 Dynamic facial performance capture using ToFu. Base mesh reconstruction for a multi-view video sequence overlaid on the video frames. Our method captures the facial performance well. The result meshes are temporally stable and accurately align with the input images. Visualizing with a shared checkerboard texture indicates good tracking quality. Please see the supplemental video for better visualization. 142
B.4 Inferred meshes for each level of the ToFu pipeline. Global stage M_0 and after upsampling and refinement for each local stage M_i (1 ≤ i ≤ 3). 144
B.5 Quantitative evaluation by cumulative error curves for scan-to-mesh distances among local refinement stages. 144
B.6 Quantitative evaluation by cumulative error curves for scan-to-mesh distances among various numbers of views. 145
B.7 ToFu results on clothed human body. Our system can also infer clothed human body surfaces in consistent topology. 146
B.8 Visualization of cross-subject dense correspondence of the base meshes inferred by ToFu in a shared checkerboard texture. 147
B.9 More results of reconstructed meshes in dense correspondence. The scan-to-mesh distance is visualized color-coded on the reference scan, where red denotes an error above 5 millimeters. 150
B.10 ToFu-inferred facial appearances. Our method can generate reliable base alignment meshes, on top of which a comprehensive face modeling pipeline can be built. Here we show more renderings with inferred normal displacements and additional albedo and specular maps. 151
C.1 Our multi-view capture setup using synchronized GoPro Black Hero 7 cameras. 153
C.2 Frames from our captured multi-view video flame salmon sequence (top). We use 18 camera views for training (downsized on the right), and hold out the upper row center view of the rig as the novel view for quantitative evaluation. We captured sequences at different physical locations, times, and under varying illumination conditions. Our data shows a large variety of challenges in high quality wide angle 3D video synthesis. 154
C.3 Comparison of importance sampling strategies over training iterations. 161
C.4 Qualitative comparisons of DyNeRF variants on one image of the sequence whose averages are reported in Tab. C.1. From left to right we show the rendering by each method, then zoom onto the moving flame gun, then visualize DSSIM and FLIP for this region using the viridis colormap (dark blue is 0, yellow is 1, lower is better). The three hierarchical DyNeRF variants outperform these baselines: DyNeRF-ISG has sharper details than DyNeRF-IST, but DyNeRF-IST recovers more of the flame, while DyNeRF* combines both of these benefits. 162
Abstract
High-fidelity digital humans play a crucial role in visual storytelling in the film and game industry.
Meanwhile, digital humans are gaining interest in augmented reality (AR) and virtual reality
(VR). While state-of-the-art digital humans are approaching the point of being indistinguishable from
real humans, the creative process often requires a large team of highly-skilled artists due to a
heavy design workload. Furthermore, the resulting digital human models tend to be person-
specific and require slow and sometimes manual processing. In addition, the methodology is
hardly generalizable to new objects and scenes. As a result, these constraints prevent high-quality
digital humans from being accessible to everyone. This dissertation investigates the algorithms
and frameworks to enable the scalable creation of dynamic digital humans at high realism.
We begin our investigation by addressing the problem of generic face modeling. By utilizing
the structural similarity among the facial geometries, we can compress the high variations into
a compact model in a meaningful manner. To achieve this goal, we first curate a massive 4D face
dataset that contains realistic shapes and deformations. The key is to establish dense correspon-
dence among the varied and deformed faces. We design a system to reconstruct and register a
large quantity of high-quality 4D faces by solving a coarse-to-fine optimization problem. To ad-
dress the generic modeling problem, we propose a modeling structure and a training procedure
that disentangles facial identities, expressions, and poses. We show that the resulting generic face
model, FLAME, is a lightweight yet expressive model that covers the shape variations in a wide
range of populations and realistic deformations due to expressions and pose changes.
Next, we shift our attention to accelerating and further automating the facial capture systems,
which suffer from long processing time and the requirement for manual clean-up and adjustment
(due to noise, outliers, and errors). We propose ToFu (Topologically consistent Face inference
from multi-view). This geometry inference framework can produce densely corresponded meshes
across facial identities and expressions using a volumetric representation instead of an explicit
underlying 3DMM. We show that the ToFu framework can produce high-quality face registrations
without the traditional photogrammetry and mesh registration. The system achieves state-of-
the-art geometric and correspondence accuracy while taking only 0.385 seconds, three orders of
magnitude faster than traditional techniques. We further show that ToFu captures high-quality
appearance and detail maps, readily usable by production studios for avatar creation, animation,
and physically-based rendering.
Lastly, we explore the capture and modeling methods beyond human faces (e.g., clothed hu-
man bodies, general objects, and scenes). It is dicult to adapt the classical mesh-based method-
ology to general scenes due to a lack of structural similarity among the object categories and
challenging eects (e.g., high specularity, volumetric eects, and topological changes). We pro-
pose a novel dynamic neural radiance eld (DyNeRF) representation that models the geometry
and appearance of a dynamic real-world scene. Since the ray-based training procedure is slow,
we further design a novel hierarchical training scheme with ray importance sampling, which
signicantly boosts the training speed and the perceptual quality of the generated imagery. We
demonstrate that our method can render high-quality wide-angle novel views at over 1K resolu-
tion, for complex and dynamic scenes, with a highly compact representation.
Chapter 1
Introduction
1.1 Digital Humans
Any sufficiently advanced technology is indistinguishable from magic, said the British science fic-
tion writer Arthur C. Clarke, as one of his famous “Three Laws” [44]. In the field of computer
graphics, one piece of magic that researchers, engineers, and artists have been pursuing for decades
is to create realistic digital humans. In recent years, realistic digital humans have shown signifi-
cant impact in applications spanning entertainment, social interaction, and education. The
most straightforward examples are in the film-making and game industry, where artists cre-
ate compelling visual effects to facilitate storytelling. A highly realistic counterpart of an actor
can save a great amount of time and money to produce shots that were not taken during the pro-
duction. In some cases, a high-quality digital double can even create “impossible” performances,
for example, to allow the audience to travel through time [136, 210]. Beyond digital doubles in the
entertainment industry, real-time human capture and rendering enables people to connect and
collaborate with each other in virtual reality (VR) or augmented reality (AR) (see Fig. 1.1). For
example, [78] demonstrates a high-fidelity communication system through real-time 3D capture
Figure 1.1: Realistic digital humans. (a) Meet Mike [180, 181] (courtesy of Mike Seymour
and Epic Games); (b) Project Starline [78, 105] (courtesy of Google); (c) New Dimensions in Testi-
mony [206] (courtesy of USC ICT).
and rendering, enabling immersive social interactions and telepresence. Furthermore, a realistic
digital human can give valuable lessons to future generations. In [206], researchers recorded the
testimony of a Holocaust survivor in high quality. With a natural language dialogue system and
3D display, the students can experience an engaging and interactive lesson.
Seeing these great examples, one might assume the task of creating high-quality digital hu-
mans has already been achieved. The response is yes, but with great effort. The fundamental
challenges of reaching realism that is indistinguishable from reality lie in the incredible complex-
ity of humans. We all have very distinctive shapes and appearances that are linked to our very
identities. Our shapes and appearances contain tremendous details, for example, in pores, eyes,
and translucent skin. Besides, we are constantly expressing ourselves through facial expressions
and body motions, conveying complicated yet subtle inner emotional states and intentions. More
importantly, we humans are highly sensitive to nuances and subtle changes in all the factors
above, as summarized in the famous “Uncanny Valley” phenomenon [135]. As many applications
(e.g. VFX in films and games) require animated digital humans to deliver performances, the re-
quirement for the target realism is multiplied for the dynamic digital humans compared to the
(a) Face maquette; (b) Dynamic expression capture; (c) Deformation transfer; (d) Skin deformation model; (e) Appearance capture
Figure 1.2: Example facial assets in making The Curious Case of Benjamin Button. Image credit:
image (a) through (d) are from TED Talk “How Benjamin Button Got His Face” [208, 209], pre-
sented by Ed Ulbrich. Image (e) used with permission from USC Institute for Creative Technolo-
gies, Vision and Graphics Lab.
static cases. Therefore, for a very long time, the most realistic digital humans were only created
by highly skilled artists.
One great example of realistic digital humans that almost beat the Visual Turing Test∗ is
the 2008 film, The Curious Case of Benjamin Button. In the film, the main character Benjamin,
played by Brad Pitt, is a man who ages backward, i.e., the young Benjamin appears aged, yet is
a child’s size. What is impressive is that, for the first hour of the film, the head of Benjamin is
completely computer-generated [64, 179]. The visual effects would be difficult to achieve with
physical makeup, as this additive process cannot create a child-size character from the actor
(adult). To realize this impossible effect, the filmmakers chose the digital method, and the foun-
dation for the final success was to create a digital animation model with compelling realism. The
team built a highly realistic digital head for Benjamin, containing realistic shapes, appearances,
facial expressions, and skin deformations. Some notable steps included building and scanning
∗The concept “Visual Turing Test” covers a wide range of definitions [73, 182]. Here we define it in the computer
graphics sense: the goal is to produce realistic visual effects that are indistinguishable from reality for human eyes.
detailed sculptures, expression capture and transfer, realistic skin appearance capture, and head
rigging with realistic skin deformations [208], as shown in Fig. 1.2. These efforts achieved great
success in conveying a believable performance, which deceived many audiences. However, the
pipeline to create such high-quality digital humans is hardly scalable, as many of the steps re-
quire a large team of highly-skilled artists to manually create the assets to reach realism, which
is time-consuming and labor-intensive. For example, the realistic effects in The Curious Case of
Benjamin Button took over two years with a team of 155 people [208]. Furthermore, the process is
usually tailored to a specific actor, which makes it difficult to transfer the assets to other subjects
or even arbitrary categories of characters and objects.
1.2 Scalable Digital Humans
Given the issues we observe in the traditional digital human creation pipeline, we aim to
make digital human technology more accessible to everyone. Achieving this vision entails the
development of scalable pipelines to create digital humans. On the one hand, this can ease the
production process in animation in film-making and game design. On the other hand, a scalable
digital human pipeline can open new doors for future technologies, powering content creation,
virtual assistants and tutors, and VR/AR applications.
This dissertation investigates the fundamental algorithms and computational frameworks for
scalable pipelines of realistic digital human creation. While the term digital human can have
various definitions, in this dissertation, our goal is to explore automatic pipelines that produce
animation models, such that novel visual contents (3D animations and renderings) of realistic
humans can be created. The resulting model is a digital human representation that covers realistic
geometry and appearance. As the performances and dynamics of digital humans are crucial for
the applications, we emphasize digitizing dynamic humans, such that vivid performances are
captured and reproduced in both the spatial and temporal dimensions. Further, the model should
be manipulable by the users. A typical example would be a morphable and rigged animation
model that is readily applicable in production such as performance capture and photo-realistic
rendering. While animation and rendering are crucial for the final result, we do not focus on
the specific techniques in those two areas but focus on producing realistic animation models that
enable high-quality animation and rendering.
What do we mean by “scalable”? The following are several major directions that are worth
considering.
1. Generic modeling: The traditional pipeline puts a tremendous amount of effort into build-
ing the animation model for a particular character, as we discussed in Section 1.1. The re-
sulting animation model is person-specific and hardly useful for other subjects. We aim for
a digital human model that is generic enough to support arbitrary identities and capture
expressions and pose changes.
2. Efficient and automated processing: The digital human model requires many high-
quality assets, including registered surfaces for facial geometry and detailed texture and
appearance maps. These important assets are traditionally obtained with manual creation
and slow computations, which is highly inefficient and requires artistic skills. In contrast,
we develop an efficient and robust system to reconstruct and register 4D human perfor-
mance data, such that the data processing that supports the generic modeling can be scal-
able and free of human supervision.
3. General methodology: Most of the traditional process is not easily transferable to general
categories, such as clothing, animals, objects, and background environments. We strive
to develop general-purpose algorithms such that humans as well as complex scenes and
objects can be simultaneously captured and visualized.
As faces are crucial in storytelling and social interaction, we put most of our attention in the
first two directions on realistic human face (head) capture and modeling, while also exploring
full-body capture in some settings. In the general methodology direction, we explore capturing
and modeling humans along with general dynamic scenes.
1.2.1 Generic Face Modeling
Digital human creation in the film industry is often tailored to specific actors, which makes
it difficult to adapt to arbitrary characters. Building a generic model for humans is, however,
challenging, as humans are highly diverse: (1) each individual contains distinct shapes and ap-
pearances that are associated with their identity; (2) the surfaces deform as the person speaks,
emotes, and articulates. As faces (heads) are at the center of storytelling and social interactions,
we focus on generic face modeling.
A desirable generic face model should cover a wide range of diversities and variations, which
correspond to different identities (subjects), expressions, and poses [23, 149]. Such a generic
model can be helpful for animators to create new characters and animate them. In order to support
the artistic design process, the model should be manipulable in an intuitive manner. This requires
the model to be compact with few controllable variables to edit the shapes instead of requiring
the artists to directly modify each shape element (e.g., vertex). On the other hand, a compact
and accurate human model is crucial in computer vision. As 3D reasoning from incomplete 2D
observations is a highly non-convex optimization, a well-behaved compact model can serve as a
reliable prior model, easing challenging optimization problems for recovering human shapes and
motion [124].
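As a schematic illustration of this point (not a formulation used verbatim later in the thesis, and with purely illustrative symbols), fitting a compact model with identity, expression, and pose parameters (β, ψ, θ) to K detected 2D landmarks x_k can be posed as

\begin{equation*}
\min_{\beta,\,\theta,\,\psi} \;\; \sum_{k=1}^{K} \big\| \Pi\big( M_k(\beta, \theta, \psi) \big) - \mathbf{x}_k \big\|_2^2 \;+\; \lambda_{\beta} \|\beta\|_2^2 \;+\; \lambda_{\psi} \|\psi\|_2^2,
\end{equation*}

where Π is the camera projection and M_k(·) is the model point associated with landmark k. The quadratic penalties act as priors that keep the solution inside the learned shape and expression spaces, turning an otherwise ill-posed per-vertex reconstruction into a low-dimensional, regularized optimization.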
1.2.2 Efficient and Automated Processing
As we see in making the digital Benjamin Button, the digitization process involves many steps
of manual work. Each step takes a long time and requires artistic skills. We aim to build com-
putational systems to automate and thus scale up digital human creation. A crucial asset for
building realistic face models is the registered geometry that encodes the surface deformations.
In the computer vision and computer graphics community, there has been extensive work to re-
place the manual design with computational systems. A common system design choice is to first
build 3D models with various methods, such as multi-view stereo (MVS) [66, 77], shape-from-
silhouette (also known as visual hull) [57, 104], depth from active sensors [183, 218] or photometric
stereo [76, 126]. To analyze and model the observed deformation and variation, the next step is
to establish dense surface correspondence among the surfaces. A common way is to deform a
pre-defined template mesh to register the 3D surface [55]. We observe that those methods tend
to be slow to compute. They are prone to errors from input data, such as noise or occlusion
in raw images. Furthermore, the multi-stage design in the current computational systems can
propagate the errors from previous to later steps. Those issues require further human interven-
tions for quality control. These constraints are preventing the above systems from being fully
automated and scalable. Motivated by these issues, we aim to build efficient, robust, and unified
systems for scalable human digitization with minimal manual work.
1.2.3 General Methodology
Humans constantly interact with the world and with other humans. It is necessary to build digital
representations for general objects and environments. Unlike humans, which share a similar
structure across subjects (instances), general objects and scenes can take arbitrary shapes and
appearances. Furthermore, the general objects and scenes can contain many challenging effects,
such as view-dependent appearances, volumetric effects, and topology changes. Consequently,
it is not trivial to adapt the methods that capture humans for general objects and scenes. The
traditional graphics pipeline requires the artists to produce those 3D assets, either by manual
design or some computational counterparts (such as photogrammetry). This process is difficult to
scale up as each individual object needs to be produced in high quality and then animated along
with the human character. There is a great need to efficiently capture humans with dynamic
objects and scenes with a general (and unified) method.
1.3 Contributions
This dissertation investigates the fundamental algorithms and frameworks to scale up realistic
dynamic digital human creation in the following three aspects: (1) generic modeling; (2) efficient
and automated processing; (3) general methodology. As human faces are crucial for social in-
teractions and content creation, we emphasize our work for generic modeling and automated
processes on the human face (head). We then explore a generalized representation and system to
capture full-body dynamic humans within general dynamic scenes. We describe our method-
ologies and contributions along with the insights behind the proposed solutions as follows.
Chapter 2: Generic Face Modeling We begin our investigation by addressing the problem of
scalable facial capture and modeling. Human faces are highly diverse and deformable. We can
model and compress the high variations into a compact model in a meaningful manner by utiliz-
ing the structural similarity of faces. We propose a generic face model, FLAME (Faces Learned
with an Articulated Model and Expressions). FLAME combines a linear shape space with an ar-
ticulated jaw, neck, and eyeballs, pose-dependent corrective blendshapes, and additional global
expression blendshapes. This modeling structure effectively disentangles facial deformations into
three factors: identities, expressions, and poses. The FLAME model is built based on carefully-
curated massive face datasets that contain realistic shape variations across identities, expressions,
and poses. The key is to establish dense correspondence among the facial geometries. To obtain
high-quality registration data, we design a system to reconstruct and register a large quantity
of high-quality 4D faces by solving a coarse-to-fine optimization problem with geometric, pho-
tometric, and motion cues. In total, FLAME is trained from over 33,000 scans. We demonstrate
that FLAME is low-dimensional but more expressive than state-of-the-art models such as the
FaceWarehouse model and the Basel Face Model. We further show that FLAME is able to accu-
rately reconstruct 3D faces given a single-view image and is readily applicable in performance
retargeting.
Chapter 3: Efficient Inference for Topologically Consistent Face Meshes The traditional
face capture systems often combine multi-view stereo (MVS) techniques for 3D reconstruction
and a non-rigid registration step to establish dense correspondence across identities and expres-
sions. We observe that the long processing time and the requirement for manual clean-up and ad-
justment due to noise, outliers, and errors are preventing traditional systems from being efficient
and fully automated. Although most learning-based methods are robust given in-the-wild inputs,
they cannot achieve high geometric accuracy due to the underlying 3D morphable model (3DMM)
and the global bottleneck of regression architectures. We propose ToFu (Topologically consistent Face
inference from multi-view), a geometry inference framework that can produce topologically con-
sistent meshes across facial identities and expressions using a volumetric representation instead
of an explicit underlying 3DMM. We show that the ToFu framework can produce high-quality
face registrations without the traditional photogrammetry and mesh registration. We demon-
strate state-of-the-art geometric and correspondence accuracy, while only taking 0.385 seconds
to compute a mesh with 10K vertices, which is three orders of magnitude faster than traditional
techniques. We further show that ToFu captures displacement maps for pore-level geometric de-
tails and facilitates high-quality rendering in the form of albedo and specular reflectance maps.
These high-quality assets are readily usable by production studios for avatar creation, animation,
and physically-based skin rendering.
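To give a concrete flavor of the volumetric prediction idea (the exact ToFu architecture is described in Chapter 3), the following minimal PyTorch sketch converts a per-vertex probability volume into a 3D vertex position with a differentiable soft-argmax; the tensor shapes, grid extent, and function names are illustrative assumptions rather than the actual implementation.

import torch

def soft_argmax_3d(prob_volume, grid_points):
    # prob_volume: (V, D, H, W) unnormalized per-vertex scores from a network.
    # grid_points: (D*H*W, 3) world-space coordinates of the sampling grid.
    # Returns (V, 3): the expected 3D position for each mesh vertex.
    num_vertices = prob_volume.shape[0]
    scores = prob_volume.reshape(num_vertices, -1)
    probs = torch.softmax(scores, dim=-1)        # normalize each vertex's volume
    return probs @ grid_points                   # probability-weighted average location

# Illustrative usage: a 32^3 grid spanning a 30 cm cube around the head.
res = 32
axis = torch.linspace(-0.15, 0.15, res)
grid = torch.stack(torch.meshgrid(axis, axis, axis, indexing="ij"), dim=-1).reshape(-1, 3)
scores = torch.randn(68, res, res, res)          # stand-in for network output (68 vertices)
vertices = soft_argmax_3d(scores, grid)          # (68, 3), fully differentiable

Because the estimate is an expectation over an explicit 3D grid rather than a regression of 3DMM coefficients, the prediction is not constrained to a linear shape space, which is the property the chapter exploits for geometric accuracy.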
Chapter 4: General Capture and Modeling with Dynamic Neural Radiance Fields We
explore the capture and modeling methods beyond human faces (for example, clothed human bod-
ies, general objects, and scenes). Due to a lack of structural similarity among the object categories,
it is difficult to adapt the methodology that we use in modeling faces to general objects. Further-
more, the geometry and appearances are highly varied and contain challenging effects such as
high specularity, volumetric effects, and topological changes. We adopt the approach from the
field of novel view synthesis and design a system to capture and model dynamic humans along
with arbitrary scenes. The system takes multi-view video recordings of a dynamic real-world
scene and compresses the content (including geometry and appearance) into the neural repre-
sentation that supports high-quality 3D video synthesis with features such as view synthesis and
motion interpolation. At the core of our approach is a novel time-conditioned neural radiance
field that represents scene dynamics using a set of compact latent codes. Since the common ray-
based training procedure is inefficient in utilizing the smooth nature of the spatial-temporal data,
we design a novel hierarchical training scheme in combination with ray importance sampling,
which significantly boosts the training speed and perceptual quality of the generated imagery.
We show that our learned representation is highly compact and able to represent a 10-second 30
FPS multi-view video recording by 18 cameras with a model size of only 28MB. We demonstrate
that our method can render high-fidelity wide-angle novel views at over 1K resolution, even for
complex and dynamic scenes.
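As a minimal sketch of what "time-conditioned" means here, the snippet below queries a single MLP with a positionally encoded 3D point, a view direction, and a learnable per-frame latent code; the layer sizes, encoding frequencies, and names are assumptions for illustration and not the exact DyNeRF network of Chapter 4.

import torch
import torch.nn as nn

def positional_encoding(x, num_freqs):
    # Standard NeRF-style sin/cos encoding of each coordinate.
    freqs = (2.0 ** torch.arange(num_freqs)) * torch.pi
    angles = x.unsqueeze(-1) * freqs                        # (..., 3, num_freqs)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(-2)

class TimeConditionedRadianceField(nn.Module):
    # F(x, d, z_t) -> (rgb, sigma): one MLP plus a learnable latent code per frame.
    def __init__(self, num_frames, code_dim=1024, hidden=256):
        super().__init__()
        self.codes = nn.Parameter(0.01 * torch.randn(num_frames, code_dim))
        self.trunk = nn.Sequential(
            nn.Linear(3 * 2 * 10 + code_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.sigma_head = nn.Linear(hidden, 1)
        self.color_head = nn.Sequential(
            nn.Linear(hidden + 3 * 2 * 4, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3))

    def forward(self, xyz, view_dir, frame_idx):
        z = self.codes[frame_idx]                           # (N, code_dim), shared per frame
        h = self.trunk(torch.cat([positional_encoding(xyz, 10), z], dim=-1))
        sigma = torch.relu(self.sigma_head(h))              # non-negative density
        rgb = torch.sigmoid(self.color_head(
            torch.cat([h, positional_encoding(view_dir, 4)], dim=-1)))
        return rgb, sigma

Rendering then proceeds as in a static NeRF (volume rendering along rays), so the per-frame latent codes are the only addition carrying the temporal dimension; this is what keeps the representation compact relative to storing one network per frame.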
Finally, in Chapter 5, we summarize our contributions. In particular, we will point out the
design choices that enable the contributions as well as a few take-home messages. We will then
discuss the future directions.
Chapter 2
Generic Face Modeling
Figure 2.1: FLAME example. Top: Samples of the D3DFACS dataset. Middle: Model-only regis-
tration. Bottom: Expression transfer to Beeler et al. [18] subject using model only.
2.1 Introduction
This chapter addresses a significant gap in the field of 3D face modeling. At one end of the
spectrum are highly accurate, photo-realistic, 3D models of individuals that are learned from
scans or images of that individual and/or involve significant input from a 3D artist (e.g. [3]). At
the other end are simple generic face models that can be fit to images, video, or RGB-D data but
that lack realism (e.g. [112]). What is missing are generic 3D face models that are compact, can
be fit to data, capture realistic 3D face details, and enable animation. Our goal is to move the “low
end” models towards the “high end” by learning a model of facial shape and expression from 4D
scans (sequences of 3D scans).
Early generic face models are built from limited numbers of 3D face scans of mostly young
Europeans in a neutral expression [23, 152]. More recently, the FaceWarehouse model [38] uses
scans of 150 people with variation in age and ethnicity and with 20 different facial poses. While
widely used, the limited amount of data constrains the range of facial shapes that the above
models can express.
To address the limitations of existing models, we exploit three heterogeneous datasets, using more
than 33,000 3D scans in total. Our FLAME model (Faces Learned with an Articulated Model and
Expressions) is factored in that it separates the representation of identity, pose, and facial expres-
sion, similar to models of the human body [10, 124]. To keep the model simple, computationally
efficient, and compatible with existing game and rendering engines, we define a vertex-based
model with a relatively low polygon count, articulation, and blend skinning. Specifically, FLAME
includes a learned shape space of identity variations, an articulated jaw and neck, and eyeballs
that rotate. Additionally we learn pose-dependent blendshapes for the jaw and neck from exam-
ples. Finally, we learn “expression” blendshapes to capture non-rigid deformations of the face.
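Although the precise formulation follows in Section 2.3, the structure just described can be previewed schematically in SMPL-style notation; the symbols below are for orientation only and are made exact later:

\begin{align*}
M(\vec{\beta}, \vec{\theta}, \vec{\psi}) &= W\big(T_P(\vec{\beta}, \vec{\theta}, \vec{\psi}),\, \mathbf{J}(\vec{\beta}),\, \vec{\theta},\, \mathcal{W}\big), \\
T_P(\vec{\beta}, \vec{\theta}, \vec{\psi}) &= \overline{\mathbf{T}} + B_S(\vec{\beta}; \mathcal{S}) + B_P(\vec{\theta}; \mathcal{P}) + B_E(\vec{\psi}; \mathcal{E}),
\end{align*}

i.e., a template mesh is offset by identity blendshapes B_S, pose-corrective blendshapes B_P, and expression blendshapes B_E, and the offset template is then posed by a standard linear blend skinning function W with joints J(β) and learned skinning weights. The value of this factorization is that identity (β), pose (θ), and expression (ψ) can be estimated, edited, or transferred independently of one another.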
We train the identity shape space from the heads of roughly 4000 CAESAR body scans [165]
spanning a wide range of ages, ethnicities, and both genders. To model pose and expression vari-
ation we use over 400 4D face capture sequences from the D3DFACS dataset [46] and additional
4D sequences that we captured, spanning more expression variation. All the model parameters
are learned from data to minimize 3D reconstruction error. To make this possible we perform a
detailed temporal registration of our template mesh to all the scans (CAESAR and 4D).
The CAESAR dataset has been widely used for modeling 3D body shape [5, 6, 24, 42, 86, 124,
155] but not explicitly for face modeling, and existing body models built from CAESAR do not
capture facial articulation or expression. Here we take an approach similar to the SMPL body
model [124] but apply it to the face, neck, and head. SMPL is a parameterized blend-skinned
body model that combines an identity shape space, articulated pose, and pose-dependent correc-
tive blendshapes. SMPL does not model facial motion and we go beyond it to learn expression
blendshapes.
Given that faces are relatively low-resolution in full body scans, the task of precisely reg-
istering the scans is both critical and difficult. To achieve accurate registration a form of co-
registration [86] is used in which we jointly build a face model and use it to align the raw data.
Given registrations we build a facial shape model and show that the resulting identity shape space
is richer than that of the Basel Face Model (BFM) [152] and the FaceWarehouse model.
To the best of our knowledge, FaceWarehouse is the only publicly available 3D face database
with a large number of facial expressions that comes together with template meshes aligned to
raw scan data (from a depth sensor). The D3DFACS dataset has much higher quality scans but
does not contain aligned meshes. Registering such 4D data presents yet another challenge. To do
so we use co-registration and image texture to obtain high quality alignment from a sequence of
3D scans with texture; this is similar to work on full bodies [25]. Including eyeballs in the model
also improves alignment for the eye region, particularly the eyelids. The registration and model
learning process is fully automatic.
In a departure from previous work, we do not tie the expression blendshapes to facial action
units (FACS) [56]. Instead we learn the blendshapes with a global linear model that captures
correlations across the face. FACS models are overcomplete in that multiple settings can produce
the same shape; this complicates solving for the parameters from data. The FLAME model, in
contrast, uses an orthonormal expression space, which is further factored into identity and pose.
We argue that this is advantageous for fitting to noisy, partial, or sparse data. Other types of
sparse rigs can be built on top of, or derived from, our representation.
Unlike most previous models, we model the head and neck together. This allows the head
to rotate relative to the neck and we learn pose-dependent blendshapes to capture how the neck
deforms during rotation. This captures effects like the protrusion of neck tendons during rotation,
increasing realism.
Our key contribution is a statistical head model that is significantly more accurate and expressive than existing head and face models, while remaining compatible with standard graphics software. In contrast to existing models, FLAME explicitly models head pose and eyeball rotation. Additionally we provide a detailed quantitative comparison between, and analysis of, different models. We make our trained models publicly available for research purposes at the project website https://flame.is.tue.mpg.de. The release comprises female and male models along with
software to animate and use the model. Furthermore, we make the temporal registration of the
D3DFACS dataset publicly available at the project website for research purposes, enabling others
to train new models.
2.2 Related Work
Blanz and Vetter [23] propose the first generic 3D face model learned from scan data. They define a linear subspace to represent shape and texture using principal component analysis (PCA) and show how to fit the model to data. The model is built from head scans of 200 young, mostly Caucasian adults, all in a roughly neutral expression. The model has had significant impact because it was available for research purposes as the Basel Face Model (BFM) [152]. Booth et al. [28, 29] learn a linear face model from almost 10,000 facial scans of more diverse subjects in a neutral
expression.
To additionally model variations in facial expression, Amberg et al. [7] combine a PCA model
of neutral face shape with a PCA space learned on the residual vectors of expressions from the
neutral shape. The recently published Face2Face framework [202] uses a similar model com-
bining linear identity and expression models with an additional linear albedo model to capture
appearance. Yang et al. [228] build several PCA models, one per facial expression, while Vlasic et
al. [212] use a multilinear face model; i.e. a tensor-based model that jointly represents the vari-
ations of facial identity and expression. The limited data used to train these methods constrains
the range of facial shapes that they can express. Since the identity space of our method is trained
from much richer data, our model is more flexible and more able to capture person-specific facial
shapes. Tensor-based models assume that facial expressions can be captured by a small number
of discrete poses that correspond between people. In contrast, our expression space is trained
from sequences of 3D scans. It is unclear how to extend existing tensor methods to deal with the
complexity and variability of our temporal data.
Modeling facial motion locally is inspired both by animation and the psychology community
where the idea of the Facial Action Coding System (FACS) [56] is popular. To capture localized
facial details, Neumann et al. [138] and Ferrari et al. [63] use sparse linear models. Brunton et
al. [35] use a large number of localized multilinear wavelet models. For animation, facial rigs use
localized, hand-crafted, blendshapes to give the animator full control. These rigs, however, suffer from significant complexity and redundancy, with overlapping blendshapes. This makes them ill suited as a model to fit to data since they afford multiple solutions for the same shape.
Because generic face models are often quite coarse, several methods augment coarse face
shape with additional higher-frequency details. Dutreve et al. [54], Shi et al. [184], and Li et
al. [114] add actor-specific fine-scale details by defining a wrinkle displacement map from training images. Garrido et al. [71] build an actor-specific blendshape model with the rest-pose shape created from a binocular stereo reconstruction and expressions from an artist generated blendshape model. All these methods are non-generic as they require offline actor-specific preprocessing [54, 114] or an actor-specific initial 3D mesh.
Cao et al. [36] use a probability map to model person-specific features such as wrinkles on top of a personalized blendshape model. In their later work [72], they use a generic model to estimate a coarse face shape of an actor, and build personalized high-frequency face rigs by relating high-frequency details to the low-resolution parameters of the personalized blendshape model. Xu et al. [227] decompose facial performance in a multi-resolution way to transfer details from one mesh to another. They use pre-defined expression blendshapes and do not learn a model. The methods above could be applied as a refinement to add additional facial details to FLAME with a
displacement or normal map.
Kozlov et al. [102] add non-rigid dynamics to facial animation by using “blend materials” to
control physical simulation of dynamics; they do not learn the model from scans. Still other work
takes collections of images from the Internet and uses a variety of methods, including shape from
shading, to extract a person-specific 3D shape [98]. They animate the face using 3D flow, warping,
and a texture synthesis approach driven by a video sequence [191, 192].
Alexander et al. [3] generate a personalized facial rig for an actress using high-resolution
facial scanning and track a facial performance of this actress using a semi-automatic animation
system. Wu et al. [222] combine an anatomical subspace with a local patch-based deformation
subspace to realistically model the facial performance of three actors. Similar to our work, the
jaw has a rotational degree of freedom, but their method uses personalized subspaces to capture
shape details and therefore is not applicable to arbitrary targets.
Personalized blendshape models are often used for facial performance capture. Such person-
alized rigs typically require a user-specific calibration or training procedure [111, 217]. Bouaziz
et al. [31] use an identity PCA model along with deformation transfer [190]. Cao et al. [38]
generate personalized blendshapes using a multilinear face model based on their FaceWarehouse
database. Ichim et al. [89] generate personalized blendshapes from images of the neutral rest pose
and facial motion recordings. These methods either use artist-designed generic blendshapes as
initialization, or low resolution local expressions designed to resemble FACS action units (Face-
Warehouse).
A key step in building our model is the alignment, or registration, of a template face mesh to
3D scan data. Generic shape alignment is a vast field (e.g. [32, 49]), which we do not summarize
here. We focus on methods for aligning 3D meshes to scan data to build articulated shape and
pose models. There have been many approaches for aligning static face scans [172, 8] but few
methods focus on 4D data (sequences of 3D meshes). Approaches like Vlasic et al. [212] rely
on manual key points; such approaches do not scale to deal with thousands of scans. Beeler
et al. [18] use repeated anchor frames with the same expression to prevent drift. They register
high-resolution meshes with great detail but do so only for three actors, and demonstrate results
only qualitatively on several hundred frames; here our automated method generalizes to tens of
thousands of frames. Cosker et al. [46] describe a method to align the D3DFACS dataset using
an active appearance model. They do not evaluate the accuracy of the alignment in 3D and do
not make the aligned data available. Our approach to face registration uses co-registration [86],
which has previously only been used with full bodies.
We note that most previous methods have ignored the eyes in the alignment process. This
biases the eyelids to explain the noisy geometry of the eyeballs, and creates substantial photo-
metric errors in the eye region. Consequently we add eyeballs to our mesh and show that this
helps the alignment process.
2.3 Model Formulation
FLAME adapts the SMPL body model formulation [124] to heads. The SMPL body model neither models facial pose (articulation of jaw or eyes) nor facial expressions. Extending SMPL makes our model computationally efficient and compatible with existing game engines. We use a consistent
notation with SMPL.
In SMPL, geometric deformations are due to the intrinsic shape of the subject, or deforma-
tions related to pose changes in the kinematic tree. With faces, however, many deformations
are due to muscle activation, which are not related to any articulated pose change. We therefore
Figure 2.2: Parametrization of our model (female model shown). Left: Activation of the first three shape components between −3 and +3 standard deviations. Middle: Pose parameters actuating four of the six neck and jaw joints in a rotational manner. Right: Activation of the first three expression components between −3 and +3 standard deviations.
extend SMPL with additional expression blendshapes as shown in Figure 2.2. Note that in several
experiments we show just the face region for comparison to other methods but FLAME models
the face, full head, and neck.
FLAME uses standard vertex-based linear blend skinning (LBS) with corrective blendshapes, with $N = 5023$ vertices, $K = 4$ joints (neck, jaw, and eyeballs as shown in Figure 2.3), and blendshapes, which will be learned from data. FLAME is described by a function $M(\vec{\beta}, \vec{\theta}, \vec{\psi}) : \mathbb{R}^{|\vec{\beta}| \times |\vec{\theta}| \times |\vec{\psi}|} \rightarrow \mathbb{R}^{3N}$ that takes coefficients describing shape $\vec{\beta} \in \mathbb{R}^{|\vec{\beta}|}$, pose $\vec{\theta} \in \mathbb{R}^{|\vec{\theta}|}$, and expression $\vec{\psi} \in \mathbb{R}^{|\vec{\psi}|}$, and returns $N$ vertices. Each pose vector $\vec{\theta} \in \mathbb{R}^{3K+3}$ contains $K+1$ rotation vectors ($\in \mathbb{R}^{3}$) in axis-angle representation; i.e. one three-dimensional rotation vector per joint plus the global rotation.
The model consists of a template mesh, $\bar{\mathbf{T}} \in \mathbb{R}^{3N}$, in the “zero pose” $\vec{\theta}^{*}$, a shape blendshape function, $B_S(\vec{\beta}; \mathcal{S}) : \mathbb{R}^{|\vec{\beta}|} \rightarrow \mathbb{R}^{3N}$, to account for identity related shape variation, corrective pose blendshapes, $B_P(\vec{\theta}; \mathcal{P}) : \mathbb{R}^{|\vec{\theta}|} \rightarrow \mathbb{R}^{3N}$, to correct pose deformations that cannot be explained solely by LBS, and expression blendshapes, $B_E(\vec{\psi}; \mathcal{E}) : \mathbb{R}^{|\vec{\psi}|} \rightarrow \mathbb{R}^{3N}$, that capture facial expressions. A standard skinning function $W(\bar{\mathbf{T}}, \mathbf{J}, \vec{\theta}, \mathcal{W})$ is applied to rotate the vertices of $\bar{\mathbf{T}}$ around joints $\mathbf{J} \in \mathbb{R}^{3K}$, linearly smoothed by blendweights $\mathcal{W} \in \mathbb{R}^{K \times N}$. Figure 2.2 visualizes the parametrization of FLAME, showing the degrees of freedom in shape (left), pose (middle), and expression (right).
Figure 2.3: Joint locations of the female (left) and male (right) FLAME models. Pink/yellow represent right/left eyes. Red is the neck joint and blue the jaw.
More formally, the model is defined as
$$M(\vec{\beta}, \vec{\theta}, \vec{\psi}) = W(T_P(\vec{\beta}, \vec{\theta}, \vec{\psi}), \mathbf{J}(\vec{\beta}), \vec{\theta}, \mathcal{W}), \qquad (2.1)$$
where
$$T_P(\vec{\beta}, \vec{\theta}, \vec{\psi}) = \bar{\mathbf{T}} + B_S(\vec{\beta}; \mathcal{S}) + B_P(\vec{\theta}; \mathcal{P}) + B_E(\vec{\psi}; \mathcal{E}) \qquad (2.2)$$
denotes the template with added shape, pose, and expression offsets.
Since different face shapes imply different joint locations, the joints are defined as a function of the face shape, $\mathbf{J}(\vec{\beta}; \mathcal{J}, \bar{\mathbf{T}}, \mathcal{S}) = \mathcal{J}(\bar{\mathbf{T}} + B_S(\vec{\beta}; \mathcal{S}))$, where $\mathcal{J}$ is a sparse matrix defining how to compute joint locations from mesh vertices. This joint regression matrix will be learned from training examples below. Figure 2.3 illustrates the learned location of the joints, which vary automatically with head shape.
Shape Blendshapes: The variations in shape of different subjects are modeled by linear blendshapes as
$$B_S(\vec{\beta}; \mathcal{S}) = \sum_{n=1}^{|\vec{\beta}|} \beta_n \mathbf{S}_n, \qquad (2.3)$$
where $\vec{\beta} = [\beta_1, \ldots, \beta_{|\vec{\beta}|}]^T$ denotes the shape coefficients, and $\mathcal{S} = [\mathbf{S}_1, \ldots, \mathbf{S}_{|\vec{\beta}|}] \in \mathbb{R}^{3N \times |\vec{\beta}|}$ denotes the orthonormal shape basis, which will be learned below with PCA. The training of the shape space is described in Section 2.6.3.
Pose Blendshapes: Let $R(\vec{\theta}) : \mathbb{R}^{|\vec{\theta}|} \rightarrow \mathbb{R}^{9K}$ be a function from a face/head/eye pose vector $\vec{\theta}$ to a vector containing the concatenated elements of all the corresponding rotation matrices. The pose blendshape function is defined as
$$B_P(\vec{\theta}; \mathcal{P}) = \sum_{n=1}^{9K} \left( R_n(\vec{\theta}) - R_n(\vec{\theta}^{*}) \right) \mathbf{P}_n, \qquad (2.4)$$
where $R_n(\vec{\theta})$ and $R_n(\vec{\theta}^{*})$ denote the $n$-th element of $R(\vec{\theta})$ and $R(\vec{\theta}^{*})$, respectively. The vector $\mathbf{P}_n \in \mathbb{R}^{3N}$ describes the vertex offsets from the rest pose activated by $R_n$, and the pose space $\mathcal{P} = [\mathbf{P}_1, \ldots, \mathbf{P}_{9K}] \in \mathbb{R}^{3N \times 9K}$ is a matrix containing all pose blendshapes. While the pose blendshapes are linear in $R$, they are non-linear with respect to $\vec{\theta}$ due to the non-linear mapping from $\vec{\theta}$ to rotation matrix elements. Details on how to compute the pose parameters from data are described in Section 2.6.1.
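For illustration, the pose blendshape evaluation in Equation 2.4 can be sketched in a few lines of NumPy. The function names, the array layout of the pose blendshape matrix, and the convention that the zero pose $\vec{\theta}^{*}$ corresponds to identity rotation matrices (with the global rotation excluded from the $9K$-dimensional feature) are assumptions made for this sketch, not details of the released model code.
```python
import numpy as np

def rodrigues(axis_angle):
    """Convert one axis-angle vector (3,) to a 3x3 rotation matrix."""
    angle = np.linalg.norm(axis_angle)
    if angle < 1e-8:
        return np.eye(3)
    axis = axis_angle / angle
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def pose_blendshape_offsets(theta, pose_dirs):
    """Eq. (2.4): offsets = sum_n (R_n(theta) - R_n(theta*)) P_n.

    theta:     (3K+3,) axis-angle pose; the first 3 entries (global rotation)
               are assumed not to drive pose blendshapes.
    pose_dirs: (9K, 3N) matrix whose rows are the pose blendshapes P_n.
    """
    K = (theta.shape[0] - 3) // 3
    joint_rots = theta[3:].reshape(K, 3)                  # per-joint axis-angle
    R = np.stack([rodrigues(r) for r in joint_rots])      # (K, 3, 3)
    R_star = np.broadcast_to(np.eye(3), (K, 3, 3))        # zero pose -> identity
    feature = (R - R_star).reshape(-1)                    # (9K,)
    return feature @ pose_dirs                            # (3N,) vertex offsets
```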
Expression Blendshapes: Similar to the shape blendshapes, the expression blendshapes are modeled by linear blendshapes as
$$B_E(\vec{\psi}; \mathcal{E}) = \sum_{n=1}^{|\vec{\psi}|} \psi_n \mathbf{E}_n, \qquad (2.5)$$
where $\vec{\psi} = [\psi_1, \ldots, \psi_{|\vec{\psi}|}]^T$ denotes the expression coefficients, and $\mathcal{E} = [\mathbf{E}_1, \ldots, \mathbf{E}_{|\vec{\psi}|}] \in \mathbb{R}^{3N \times |\vec{\psi}|}$ denotes the orthonormal expression basis. The SMPL model does not have anything equivalent to these expression blendshapes, which are not driven by pose. The training of the expression space is described in Section 2.6.2.
Template Shape: Note that the shape, pose, and expression blendshapes are all displacements from a template mesh $\bar{\mathbf{T}}$. We begin with a generic face template mesh and then learn $\bar{\mathbf{T}}$ from scans along with the rest of the model. We also learn the blend weights $\mathcal{W}$, as described below.
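Putting Equations 2.1–2.5 together, the following NumPy/SciPy sketch is a minimal, self-contained illustration of one way the full forward evaluation could be implemented. The array shapes, the dense joint regressor, and the kinematic-tree convention (a `parents` array with -1 for the global rotation, topologically ordered) are assumptions made for readability, not the storage format of the released model.
```python
import numpy as np
from scipy.spatial.transform import Rotation

def flame_forward(betas, theta, psi, T_bar, S, E, P, J_reg, weights, parents):
    """Sketch of Eqs. (2.1)-(2.5): blendshapes + joint regression + LBS.

    betas: (B,), theta: (3(K+1),), psi: (Q,) coefficients
    T_bar: (N, 3) template; S: (B, N, 3), E: (Q, N, 3), P: (9K, N, 3) bases
    J_reg: (K+1, N) joint regressor (dense here for simplicity)
    weights: (N, K+1) skinning weights; parents: (K+1,), parents[0] = -1
    """
    # Eq. (2.2): template plus shape, expression, and pose offsets.
    shaped = T_bar + np.einsum('b,bnc->nc', betas, S) + np.einsum('q,qnc->nc', psi, E)
    rots = Rotation.from_rotvec(theta.reshape(-1, 3)).as_matrix()   # (K+1, 3, 3)
    pose_feat = (rots[1:] - np.eye(3)).reshape(-1)                  # R(theta) - R(theta*)
    posed = shaped + np.einsum('p,pnc->nc', pose_feat, P)           # Eq. (2.4) offsets

    # Joints regressed from the shaped template: J(beta) = J_reg (T_bar + B_S).
    joints = J_reg @ (T_bar + np.einsum('b,bnc->nc', betas, S))     # (K+1, 3)

    # Linear blend skinning W(.): rigid transform per joint along the kinematic chain.
    K1 = joints.shape[0]
    G = np.zeros((K1, 4, 4))
    for k in range(K1):                     # assumes parent index < child index
        local = np.eye(4)
        local[:3, :3] = rots[k]
        local[:3, 3] = joints[k] - (joints[parents[k]] if parents[k] >= 0 else 0)
        G[k] = local if parents[k] < 0 else G[parents[k]] @ local
    # Subtract the transformed rest joints so rotation happens about each joint.
    G_rel = G.copy()
    G_rel[:, :3, 3] -= np.einsum('kij,kj->ki', G[:, :3, :3], joints)

    T = np.einsum('nk,kij->nij', weights, G_rel)                    # per-vertex transform
    verts_h = np.concatenate([posed, np.ones((posed.shape[0], 1))], axis=1)
    return np.einsum('nij,nj->ni', T, verts_h)[:, :3]
```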
2.4 Temporal Registration
Statistically modeling facial shape requires all training shapes to be in full vertex correspondence. Given sequences of 3D scans, for each scan $i$, the registration process computes an aligned template $\mathbf{T}_i \in \mathbb{R}^{3N}$. The registration pipeline alternates between registering meshes while regularizing to a FLAME model and training a FLAME model from the registrations as shown in Figure 2.4. This alternating registration is similar to that used for human bodies [25].
Figure 2.4: Overview of the face registration, model training, and application to expression transfer.
2.4.1 Initial Model
The alternating registration process requires an initial FLAME model. As described in Section 2.3, FLAME consists of parameters for shape $\{\bar{\mathbf{T}}, \mathcal{S}\}$, pose $\{\mathcal{P}, \mathcal{W}, \mathcal{J}\}$, and expression $\mathcal{E}$, that require an initialization, which we then refine to fit registered scan data.
Shape: To get an initial head shape space, we extract the head region from the full-body registrations of SMPL [124] to the CAESAR dataset. We refine the mesh structure of the full-body SMPL template and adjust the topology to contain holes for the mouth and eyes. We then use deformation transfer [190], between the SMPL full-body shape registrations and our refined template, to get full-body registrations with the refined head template. Using these registered head templates, we compute the initial shape blendshapes, representing identity, by applying PCA to the vertices.
To make the registration process more stable, and to increase the visual quality of our model,
we add eyeballs to our shape model. To initialize the eyes, we place the left eyeball using the eye
region model of Woods et al. [221] and regress its geometric center given a set of vertices around
the left eye. Finally, we apply the same regressor to the equivalent (i.e. mirrored) set of vertices
around the right eye.
Pose: The blendweights $\mathcal{W}$ and joint regressor $\mathcal{J}$ are initialized with weights defined manually by an artist. The initial vertices for the eyeball joint regressors are manually selected to result in joints close to the eyeball geometric center.
Expression: To initialize the expression parameters $\mathcal{E}$, we establish a correspondence through mesh registration between our head template and the artist generated FACS-based blendshape model of Li et al. [112]. We then use deformation transfer to transfer the expression blendshapes to our model. Although this initial expression basis does not conform to our requirements of orthogonality and expression realism, it is useful for bootstrapping the registration process.
2.4.2 Single-frame Registration
The data to which we align our mesh includes 3D scan vertices, multi-view images (two for D3DFACS, three for our sequences), and camera calibrations. To align a sequence of an individual, we compute a personalized template and texture map of resolution 2048×2048 pixels as described later in Section 2.4.3.
Our model-based registration of a face scan consists of three steps.
Model-only: First, we estimate the model coefficients $\{\vec{\beta}, \vec{\theta}, \vec{\psi}\}$ that best explain the scan by optimizing
$$E(\vec{\beta}, \vec{\theta}, \vec{\psi}) = E_D + \lambda_L E_L + E_P, \qquad (2.6)$$
with the data term
$$E_D = \lambda_D \sum_{\mathbf{v}_s} \rho\!\left( \min_{\mathbf{v}_m \in M(\vec{\beta}, \vec{\theta}, \vec{\psi})} \lVert \mathbf{v}_s - \mathbf{v}_m \rVert \right), \qquad (2.7)$$
that measures the scan-to-mesh distance of the scan vertices $\mathbf{v}_s$ and the closest point in the surface of the model. The weight $\lambda_D$ controls the influence of the data term. A Geman-McClure robust penalty function [74], $\rho$, gives robustness to outliers in the scan.
The objective $E_L$ denotes a landmark term, measuring the L2-norm distance between image landmarks and corresponding vertices on the model template, projected into the image using the known camera calibration. We use CMU Intraface [225] to fully automatically predict 49 landmarks (Figure 2.5 left) in all multi-view camera images. We manually define the corresponding 49 landmarks in our template (see Figure 2.5 right). The weight $\lambda_L$ describes the influence of the landmark term.
Figure 2.5: Predicted 49 landmarks from the CMU Intraface landmark tracker [225] (left) and the same landmarks defined on our topology (right).
The prior term
$$E_P = \lambda_{\vec{\theta}} E_{\vec{\theta}} + \lambda_{\vec{\beta}} E_{\vec{\beta}} + \lambda_{\vec{\psi}} E_{\vec{\psi}} \qquad (2.8)$$
regularizes the pose coefficients $\vec{\theta}$, shape coefficients $\vec{\beta}$, and expression coefficients $\vec{\psi}$ to be close to zero by penalizing their squared values.
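A minimal sketch of the data and prior terms is given below, assuming the closest model vertex as a stand-in for the closest point on the model surface; the Geman-McClure scale and the weight values shown are placeholders, and a full implementation would additionally use point-to-triangle distances and the landmark term.
```python
import numpy as np
from scipy.spatial import cKDTree

def geman_mcclure(r, sigma=1.0):
    """Geman-McClure robust penalty rho(r) = r^2 / (r^2 + sigma^2)."""
    r2 = r ** 2
    return r2 / (r2 + sigma ** 2)

def data_term(scan_points, model_vertices, lambda_D=100.0, sigma=1.0):
    """Approximation of Eq. (2.7): robust scan-to-mesh distance via nearest vertex."""
    tree = cKDTree(model_vertices)           # nearest-neighbour lookup on the model
    dists, _ = tree.query(scan_points)       # distance from every scan vertex
    return lambda_D * float(np.sum(geman_mcclure(dists, sigma)))

def prior_term(theta, beta, psi, l_theta=0.03, l_beta=0.03, l_psi=0.03):
    """Eq. (2.8): quadratic penalties keeping pose/shape/expression near zero.
    The weight values here are illustrative placeholders."""
    return (l_theta * float(np.sum(theta ** 2))
            + l_beta * float(np.sum(beta ** 2))
            + l_psi * float(np.sum(psi ** 2)))
```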
Coupled: Second, we allow the optimization to leave the model space by optimizing
$$E(\mathbf{T}, \vec{\beta}, \vec{\theta}, \vec{\psi}) = E_D + E_C + E_R + E_P, \qquad (2.9)$$
with respect to the model parameters $\{\vec{\beta}, \vec{\theta}, \vec{\psi}\}$ and the vertices of the template mesh $\mathbf{T}$, which is allowed to deform. In contrast to the model-only registration, $E_D$ now measures the scan-to-mesh distance from the scan to the aligned mesh $\mathbf{T}$. The coupling term $E_C$ constrains $\mathbf{T}$ to be close to the current statistical model by penalizing edge differences between $\mathbf{T}$ and the model $M(\vec{\beta}, \vec{\theta}, \vec{\psi})$ as
$$E_C = \sum_{e} \lambda_e \left\lVert \mathbf{T}_e - M(\vec{\beta}, \vec{\theta}, \vec{\psi})_e \right\rVert, \qquad (2.10)$$
where $\mathbf{T}_e$ and $M(\vec{\beta}, \vec{\theta}, \vec{\psi})_e$ are the edges of $\mathbf{T}$ and $M(\vec{\beta}, \vec{\theta}, \vec{\psi})$, respectively, and $\lambda_e$ denotes an individual weight assigned to each edge. The coupling uses edge differences to spread the coupling influence on single points across its neighbors. The optimization is performed simultaneously over $\mathbf{T}$ and model parameters in order to recover possible model errors in the first stage. The regularization term for each vertex $\mathbf{v}_k \in \mathbb{R}^3$ in $\mathbf{T}$ is the discrete Laplacian approximation [100]
$$E_R = \frac{1}{N} \sum_{k=1}^{N} \lambda_k \lVert U(\mathbf{v}_k) \rVert^2, \qquad (2.11)$$
with $U(\mathbf{v}) = \sum_{\mathbf{v}_r \in \mathcal{N}(\mathbf{v})} \frac{\mathbf{v}_r - \mathbf{v}}{|\mathcal{N}(\mathbf{v})|}$, where $\mathcal{N}(\mathbf{v})$ denotes the set of vertices in the one-ring neighborhood of $\mathbf{v}$. The regularization term avoids fold-overs in the registration and hence makes the registration approach robust to noise and partial occlusions. The weight $\lambda_k$ for each vertex allows for more regularization in noisy scan regions.
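The coupling and Laplacian terms can be sketched as follows; the per-edge Euclidean norm and the neighbor-list data structure are assumptions made for this illustration.
```python
import numpy as np

def coupling_term(T_verts, model_verts, edges, edge_weights):
    """Eq. (2.10): penalize edge differences between the free mesh T and the
    current model fit M(beta, theta, psi).

    edges: (E, 2) vertex index pairs; edge_weights: (E,) per-edge weights lambda_e.
    """
    e_T = T_verts[edges[:, 0]] - T_verts[edges[:, 1]]
    e_M = model_verts[edges[:, 0]] - model_verts[edges[:, 1]]
    return float(np.sum(edge_weights * np.linalg.norm(e_T - e_M, axis=1)))

def laplacian_term(T_verts, one_rings, vert_weights):
    """Eq. (2.11): discrete (umbrella) Laplacian regularizer.

    one_rings: list of index arrays, the one-ring neighborhood of each vertex.
    """
    N = T_verts.shape[0]
    total = 0.0
    for k in range(N):
        U = T_verts[one_rings[k]].mean(axis=0) - T_verts[k]   # U(v_k)
        total += vert_weights[k] * float(U @ U)               # lambda_k ||U(v_k)||^2
    return total / N
```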
Texture-based: Third, we include a texture term $E_T$ to obtain
$$E(\mathbf{T}, \vec{\beta}, \vec{\theta}, \vec{\psi}) = E_D + E_C + \lambda_T E_T + E_R + E_P, \qquad (2.12)$$
where $E_T$ measures the photometric error between the real image $I$ and the rendered textured image $\hat{I}$ of $\mathbf{T}$ from all $V$ views as
$$E_T = \sum_{l=0}^{3} \sum_{v=1}^{V} \left\lVert \Gamma(I_l^{(v)}) - \Gamma(\hat{I}_l^{(v)}) \right\rVert_F^2, \qquad (2.13)$$
where $\lVert X \rVert_F$ denotes the Frobenius norm of $X$. Ratio of Gaussians filters $\Gamma$ [25] help minimize the influence of lighting changes between real and rendered images. Further, as photometric errors are only meaningful for small displacements, a multi-level pyramid with four resolution levels is used during optimization to increase the spatial extent of the photometric error. The image $I$ of resolution level $l$ from view $v$ is denoted by $I_l^{(v)}$.
2.4.3 Sequential Registration
Our temporal registration approach uses a personalization phase that builds a personalized template for each subject in the database, which is then kept constant while tracking the facial performance.
Personalization: We assume that each captured sequence begins with a neutral pose and expression. During personalization, we use a coupled registration (Equation 2.9) and we average the results $\mathbf{T}_i$ across multiple sequences to get a personalized template for each subject. We randomly select one of the $\mathbf{T}_i$ for each subject to generate a personalized texture map that is used later for texture-based registration. This personalization increases the stability of the registration, and improves the performance of the optimization, as it significantly reduces the number of parameters being optimized in each step.
Sequence Fitting: During sequence fitting, we replace the generic model template $\bar{\mathbf{T}}$ in $M$ (Equation 2.1) by the personalized template, and fix the $\vec{\beta}$ to zero. For each frame, we initialize the model parameters from the previous frame and use the single-frame registration (Section 2.4.2). Given the registered sequences, we train a new FLAME model as described below and then iterate the registration procedure. We stop after four iterations as the visual improvement, compared to the registrations after three iterations, is only minor.
2.5 Data
FLAME is trained from two large publicly available datasets and our self-captured sequences.
2.5.1 Capture Setup
For our self-captured sequences we use a multi-camera active stereo system (3dMD LLC, Atlanta). The capture system consists of three pairs of stereo cameras, three color cameras, three speckle projectors, and three white light LED panels. The system generates 3D meshes with an average of 45K vertices at 60fps. The color images are used to create a UV texture map for each frame and we use them to find image-based facial landmarks.
Figure 2.6: Sample registrations. Top: shape data extracted from the CAESAR body database. Middle: sample registrations of the self-captured pose data with head rotations around the neck (left) and mouth articulations (right). Bottom: sample registrations of the expression data from D3DFACS (left) and self-captured sequences (right). Appendix A shows further registrations.
2.5.2 Training Data
The identity shape parameters $\{\bar{\mathbf{T}}, \mathcal{S}\}$ are trained on the 3800 registered heads from the US and European CAESAR body scan database [165]. The CAESAR database contains 2100 female and 1700 male static full-body scans, capturing large variations in shape (see Figure 2.6 top). The CAESAR scans are registered with a full-body SMPL model combined with our revised head template using a two-step registration approach. First, the global shape is initialized by a model-only registration with the initial model, followed by a coupled refinement (Section 2.4.2). The shape parameters are then trained on these registrations.
Training the pose parameters $\{\mathcal{P}, \mathcal{W}, \mathcal{J}\}$ requires training data that represent the full range of possible head motions, i.e. neck and jaw motions. As neither CAESAR, nor the existing 3D face databases, provide sufficient head pose articulation, we captured neck rotation and jaw motions of 10 subjects (see Figure 2.6 middle) to fill this gap. The jaw and mouth sequences are registered as described in Section 2.4. The head rotation sequences are registered using a coupled alignment, where only the vertices in the neck region are allowed to leave the model space, coupled to the model, while all other vertices stay in model space. This adds robustness to inevitable large facial occlusions when the head is turned. Overall, the pose parameters are trained on about 8000 registered heads.
The expression model $\mathcal{E}$ uses two sources of training data: registrations of D3DFACS [46] and self-captured sequences. All motion sequences are fully automatically registered with the registration approach described in Section 2.4, leading to a total number of 69,000 registered frames (see Figure 2.6 bottom). In these 3D sequences, neighboring frames can be very similar. For efficiency in training, we consequently sample a subset of 21,000 registered frames to train the model.
2.5.3 Test Data
FLAME is evaluated quantitatively on three datasets. First, we use the neutral scans of the BU-3DFE [232] database with its 3D face scans of 100 subjects with a large variety in ethnicity. Second, we use self-captured sequences of seven subjects, performing different facial expressions, including the six prototypical expressions, talking sequences, and different facial action units. Note that the training and test subjects are fully disjoint. Third, we use the 347 registered frames of the Beeler et al. [18] sequence.
2.5.4 Implementation Details
The registration framework is written in Python, using Numpy and Scikit-learn [153] to compute PCA. All other model parameters are optimized by a gradient-based dogleg method [140], where all gradients are computed using Chumpy [125] for automatic differentiation.
Parameter Settings: Our registrations are obtained by a bootstrapping framework that alternates between model training and registration. During each iteration, we choose the parameters as follows:
We generally choose the shape and expression prior weights $\lambda_{\vec{\beta}} = \lambda_{\vec{\psi}} = 0.03$. For model-only registration, we set $\lambda_D \in \{100, 300\}$ and $\lambda_L = 0.002$; for coupled and texture-based registration, we choose $\lambda_k = 10.0$. The coupling to the model varies depending on the regions of the face to deal with noise. We set $\lambda_e = 3.0$ for the face region (Figure 2.7) and $\lambda_e = 30.0$ for all other vertices. For coupled registration, we further use $\lambda_D = 1000$, for texture-based registration $\lambda_D = 700$ and $\lambda_T = 0.1$. For the third iteration, $\lambda_e$ is reduced to $1.0$ in the face region, for the fourth iteration to $0.3$. For the fourth iteration, we further choose $\lambda_k = 100.0$ for the non-facial regions shown in the right of Figure 2.7.
Figure 2.7: Visualization of the coupling weight. Head regions with higher coupling edge weight (left) and higher Laplacian weight (right).
A high coupling weight effectively prevents vertices leaving the model space and hence increases the robustness to noise. As the noise within a scan differs for different regions, i.e. it is significantly higher in hair regions, we use higher coupling weights for the back of the head, back of the neck, and the eyeballs (Figure 2.7 left). For regions like the forehead, a high coupling weight prevents the registration from effectively capturing the motion (e.g. when raising the eyebrows). A higher Laplacian weight (Figure 2.7 right), however, adds some smoothness and hence lowers the influence of noise, while allowing tangential motion to be captured.
Performance: Our registration takes about 155 s for one frame (model-only (Eq. 2.6): 25 s; coupled (Eq. 2.9): 50 s; texture-based (Eq. 2.12): 80 s) on a single thread on a quad-core 3.2 GHz Intel Core i5 with 32 GB RAM.
2.6 Model Training
Given registered datasets for identity (Figure 2.6 top), pose (Figure 2.6 middle), and expression (Figure 2.6 bottom), the goal of training FLAME is to decouple shape, pose, and expression variations to compute the set of parameters $\Phi = \{\bar{\mathbf{T}}, \mathcal{S}, \mathcal{P}, \mathcal{E}, \mathcal{W}, \mathcal{J}\}$. To achieve this decoupling, the pose parameters $\{\mathcal{P}, \mathcal{W}, \mathcal{J}\}$, expression parameters $\mathcal{E}$, and shape parameters $\{\bar{\mathbf{T}}, \mathcal{S}\}$ are optimized one at a time using an iterative optimization approach that minimizes the reconstruction error of the training data. We use gender specific models, $\Phi_f$ for female and $\Phi_m$ for male, respectively.
2.6.1 Pose Parameter Training
There are two types of pose parameters in our model. First, there are parameters specific to each subject (indexed by $i \in \{1, \ldots, P_{subj}\}$) such as personalized rest-pose templates $\mathbf{T}_i^P$ and person specific joints $\mathbf{J}_i^P$. Second, there are parameters spanning across subjects such as blendweights $\mathcal{W}$ and the pose blendshapes $\mathcal{P}$. The joint regressor $\mathcal{J}$ is learned to regress person specific joints $\mathbf{J}_i^P$ of all subjects from the personalized rest-pose templates $\mathbf{T}_i^P$.
The optimization of these parameters is done by alternating between solving for the pose parameters $\vec{\theta}_j$ of each registration $j$, optimizing the subject specific parameters $\{\mathbf{T}_i^P, \mathbf{J}_i^P\}$, and optimizing the global parameters $\{\mathcal{W}, \mathcal{P}, \mathcal{J}\}$. The objective function being optimized consists of a data term $E_D$ that penalizes the squared Euclidean reconstruction error of the training data, a regularization term $E_{\mathcal{P}}$ that penalizes the Frobenius norm of the pose blendshapes, and a regularization term $E_{\mathcal{W}}$ that penalizes large deviations of the blendweights from their initialization. The weighting of the regularizers $\{E_{\mathcal{P}}, E_{\mathcal{W}}\}$ is a tradeoff between closely resembling the training data
and keeping the parameters general. Hence, the regularizers prevent FLAME from overfitting to the training data, and make it more general. The method and objectives used for the optimization of joint regressors, pose and shape parameters are described in more detail by the SMPL body model [124], as we adapted their approach to represent pose and shape for FLAME.
In absence of a subject specific template $\mathbf{T}_i^P$, the initial estimation of the pose coefficients $\vec{\theta}$ while training the pose space is done using an initial average template. To be robust with respect to large variations in shape, this is done by minimizing the edge differences between the template and each registration.
To avoid $\mathbf{T}_i^P$ and $\mathbf{J}_i^P$ being affected by strong facial expressions, expression effects are removed when solving for $\mathbf{T}_i^P$ and $\mathbf{J}_i^P$. This is done by jointly solving for pose $\vec{\theta}$ and expression parameters $\vec{\psi}$ for each registration, subtracting $B_E$ (Equation 2.5), and solving for $\mathbf{T}_i^P$ and $\mathbf{J}_i^P$ on those residuals.
2.6.2 Expression Parameter Training
Training the expression space $\mathcal{E}$ requires expressions to be decoupled from pose and shape variations. This is achieved by first solving for the pose parameters $\vec{\theta}_j$ of each registration, and removing the pose influence by applying the inverse transformation entailed by $M(\vec{0}, \vec{\theta}, \vec{0})$ (Equation 2.1), where $\vec{0}$ is a vector of zero-valued coefficients. We call this step “unposing” and call the vertices resulting from unposing the registration $j$ as $\mathbf{V}_j^U$. As we want to model expression variations from a neutral expression, we assume that a registration defining the neutral expression is given for each subject. Let $\mathbf{V}_i^{NE}$ denote the vertices of the neutral expression of subject $i$, also unposed. To decouple the expression variations from the shape variations, we compute
expression residuals $\mathbf{V}_j^U - \mathbf{V}_{s(j)}^{NE}$ for each registration $j$, where $s(j)$ is the subject index of registration $j$. We then compute the expression space $\mathcal{E}$ by applying PCA to these expression residuals.
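A minimal sketch of this training step, assuming the unposed registrations and unposed neutral shapes are already available as arrays, could use scikit-learn's PCA as follows; whether the principal components are additionally scaled by their standard deviations is an implementation detail not shown here.
```python
import numpy as np
from sklearn.decomposition import PCA

def train_expression_space(unposed_regs, neutral_regs, subject_ids, n_components=100):
    """Sketch of Section 2.6.2: PCA on expression residuals.

    unposed_regs: (M, N, 3) registrations with pose effects removed ("unposed")
    neutral_regs: dict subject_id -> (N, 3) unposed neutral-expression registration
    subject_ids:  length-M list mapping each registration to its subject
    """
    residuals = np.stack([
        (unposed_regs[j] - neutral_regs[subject_ids[j]]).reshape(-1)
        for j in range(len(subject_ids))
    ])                                       # (M, 3N) expression residuals
    pca = PCA(n_components=n_components)
    pca.fit(residuals)
    return pca.components_.T                 # (3N, n_components) expression basis E
```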
2.6.3 Shape Parameter Training
Training the shape parameters consists of computing the template $\bar{\mathbf{T}}$ and shape blendshapes $\mathcal{S}$ for the registrations in the shape dataset. Similarly as before, effects of pose and expression are removed from all training data, to ensure the decoupling of pose, expression, and shape. The template $\bar{\mathbf{T}}$ is then computed as the mean of these expression- and pose-normalized registrations; the shape blendshapes $\mathcal{S}$ are formed by the first $|\vec{\beta}|$ principal components computed using PCA.
2.6.4 Optimization Structure
The training of FLAME is done iteratively by solely optimizing pose, expression, or shape parameters, while keeping the other parameters fixed. Due to the high capacity and flexibility of the expression space formulation, pose blendshapes should be trained before expression parameters in order to avoid expression overfitting.
2.7 Experiments
We evaluate the quality of our sequence registration process and the FLAME models learned from these registrations. Comparisons to the Basel Face Model and the FaceWarehouse model show that FLAME is significantly more expressive. Additionally, we show how FLAME can be used to fit 2D image data and for expression transfer. Please see the supplemental video at http://flame.is.tue.mpg.de for more details.
Visualization: We use a common color coding to present all results throughout the entire document. Input data such as static or dynamic 3D face scans are shown in a light red color. Meshes that are within the space of a statistical model, obtained by model-only registration (Section 2.4.2) or by sampling the latent space of a model, are shown in blue. For comparison, we use the same color to visualize results of FLAME, the Basel Face Model, or the FaceWarehouse model. Meshes obtained by leaving the shape space in a coupled or texture-based alignment (Section 2.4.2) are visualized in light green.
FLAME is a fully articulated head model (see Figure 2.2). Nevertheless, most training and test scans only capture the face region. To facilitate comparison between methods, in such cases we show registrations of similar facial regions only. For comparisons to scans with a clean outer boundary and without holes (e.g. Figure 2.16), we use the background of the scan images to mask the region of interest. For scans with a noisy outline and holes (e.g. Figure 2.11) we use a common pre-defined vertex mask to visualize all registrations.
2.7.1 Registration Quality
Registration Process: Our registration process contains three steps: a model-only fit, a coupled fit, and a texture-based refinement. Figure 2.8 visualizes the registration results of each optimization step. The model-only step serves as the initialization of the expression, but it is unable to capture all personalized details. After coupled alignment, the registration tightly fits the surface of the scan but the synthesized texture reveals misalignments at the mouth, nose, and eyebrows. While the texture-based registration slightly raises the geometric error across the face, it visually improves the registration around the mouth, nose, and eyebrow regions while reducing the sliding within the surface.
Note, we do not explicitly model lighting for the synthesized image, which causes visual differences compared to the original image due to cast shadows (e.g. seen at the cheeks). Using a Ratio of Gaussians for filtering alleviates the influence of lighting changes in our optimization setup.
Figure 2.8: Results of the model-only, coupled, and texture-based registration steps for one scan. Top: scan, registrations, and scan-to-mesh distance for each registration visualized color-coded on the scan. Bottom: original texture image, synthesized texture image for each step, and the corresponding photometric errors.
Alternating Registration: Figure 2.9 shows representative results for each of the alternating registration iterations. While the registration is unable to capture the facial expressions properly in the first iteration, after more iterations, the quality of the registration improves.
Quantitative Evaluation: Figure 2.10 (left) visualizes the median per-vertex distance to the scan. The distance is measured across all 69,000 registered frames of the D3DFACS database and our self-captured sequences (left) and the 347 registered frames of the Beeler et al. [18] sequence.
For the registered training data (Figure 2.10 left), within the face region (excluding the eyeballs), 60% of the vertices have a median distance less than 0.2 mm, and 90% are closer than 0.5 mm.
Figure 2.9: Results of the alternating registration approach (scan and registrations after iterations 1-4).
Visible regions of higher distance are mostly caused by missing data (at the neck, below the chin, or at the ears) or noise in the scans (at the eyebrows, around the eyes). As described in Section 2.5, our registration framework uses higher Laplacian weights in non-face regions to increase the robustness to noise and partial occlusions in the scans. While not causing visual artifacts in the registrations, this transition between the face and non-face part causes a slightly enlarged error at the boundary of the mask, noticeable at the forehead.
The goal of our registration framework is to fully automatically register a large set of sequences (> 600) from different sources (i.e. D3DFACS and self-captured sequences). For robustness to self-cast shadows and lighting changes, the influence of the photometric error (Equation 2.12) has a low weight ($\lambda_T = 0.1$). Due to this, our registrations are not entirely free of within-surface drift, especially in regions without salient features (i.e. forehead, cheeks, neck). Figure 2.10 (right) evaluates the within-surface drift of our registration on the publicly available Beeler et al. sequence. While the distance between our registrations and the Beeler et al. scans is small (c), measuring the distance between our registrations and their ground-truth registration reveals some within-surface drift (d). Note, since the Beeler et al. data are with uniform lighting, one could use our registration method with a higher weighted photometric error, potentially further lowering the drift error.
Figure 2.10: Median per-vertex distance between registration and the scan surface. Left: distance measured across all frames of all female (a) and male (b) training sequences. Right: distance measured across all registered frames for the Beeler et al. [18] sequence (c) and the ground-truth error (d) measuring the within-surface drift. The supplemental video shows the full registration sequence.
Qualitative Evaluation: Figure 2.11 shows sample registrations of the D3DFACS dataset (top) and our self-captured sequences (bottom). For all sequences, the distance between our registration and the scan surface is small, and our registration captures the expression. Note that our registration is able to track even subtle motions such as eye blinks well, as can be seen in the top row of Figure 2.11.
2.7.2 Model Quality
A good statistical model should ideally be compact and generalize well to new data, while staying specific to the object class of the model. A common way to quantify these attributes is to measure the compactness, generalization, and specificity (please refer to Chapter 9.2 of [49]) of the model. These measurements have previously been used to evaluate statistical models of various classes of objects, including 3D faces (e.g. [27, 28, 35]). These evaluations provide a principled way to determine the model dimensions that preserve a large amount of the data variability without overfitting to the training data.
Figure 2.11: Registration quality. Sample frames, registrations, and scan-to-mesh distance of one sequence of the D3DFACS database (top) and one sequence of our self-captured sequences (bottom). Appendix A shows further registrations.
Compactness: A statistical model should describe the training data with few parameters. Compactness measures the amount of variability present in the training data that is captured by the model. The compactness for a given number of $k$ components is $C(k) = \sum_{i=1}^{k} \lambda_i / \sum_{i=1}^{\mathrm{rank}(D)} \lambda_i$, where $\lambda_i$ is the $i$-th eigenvalue of the data covariance matrix $D$. The compactness of FLAME is independently evaluated for identity and expression, by computing $C(k)$ for a varying number of components.
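As a concrete illustration, the compactness curve can be computed directly from the singular values of the mean-centered training data, since the covariance eigenvalues are proportional to the squared singular values; the sketch below assumes one flattened mesh per row.
```python
import numpy as np

def compactness(data_matrix, k):
    """C(k): fraction of training-data variance captured by the first k components.

    data_matrix: (M, D) training shapes, one flattened mesh per row.
    """
    centered = data_matrix - data_matrix.mean(axis=0)
    # Covariance eigenvalues are proportional to the squared singular values.
    s = np.linalg.svd(centered, compute_uv=False)
    eigvals = s ** 2
    return float(eigvals[:k].sum() / eigvals.sum())
```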
Generalization: A statistical model ideally generalizes from the samples in the training data to arbitrary valid samples of the same class of objects. Generalization measures the ability of the model to represent unseen shapes of the same object class. The generalization ability is commonly quantified by fitting the model with a varying number of components to data excluded from the model training, and measuring the fitting error. The identity space of FLAME is evaluated on the neutral BU-3DFE data, registered using a coupled alignment. The expression space is evaluated on self-captured test sequences, registered with the texture-based registration framework. During evaluation of the identity space, i.e. for a varying number of identity shape components, the number of expression components is fixed to 100. For evaluation of the expression space, the number of shape parameters is fixed to 300, accordingly. For each model fit, the average vertex distance to the registration is reported as the fitting error.
Specificity: A statistical model is required to be specific to the modeled class of objects, by only representing valid samples of this object class. To evaluate the specificity of the identity and expression space, we randomly draw 1000 samples from a Gaussian distribution for a varying number of identity or expression coefficients, and reconstruct the sample shape using Equation 2.1. The specificity error is measured as the average distance to the closest training shape. For identity space evaluation, the expression parameters are kept at zero; for expression evaluation, the identity parameters are zero, accordingly.
Figure 2.12: Quantitative evaluation of the identity shape space (top) and expression space (bottom) of the female and male FLAME models. From left to right: compactness, generalization female, generalization male, specificity female, and specificity male.
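To make the specificity protocol above concrete, it can be sketched as follows; the shape-sampling callback, the choice of a mean per-vertex distance, and whether the Gaussian samples are scaled by per-component standard deviations are assumptions of this illustration.
```python
import numpy as np

def specificity(sample_shape_fn, training_shapes, num_coeffs, n_samples=1000, seed=0):
    """Draw random Gaussian coefficients, reconstruct a shape for each draw, and
    report the mean distance to the closest training shape.

    sample_shape_fn: maps a coefficient vector (num_coeffs,) to vertices (N, 3);
                     assumed to wrap Eq. (2.1) with the remaining coefficients at zero.
    training_shapes: (M, N, 3) registered training meshes.
    """
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_samples):
        coeffs = rng.standard_normal(num_coeffs)   # may need scaling by component stds
        sample = sample_shape_fn(coeffs)           # (N, 3)
        # mean per-vertex distance to every training shape, keep the smallest
        dists = np.linalg.norm(training_shapes - sample[None], axis=2).mean(axis=1)
        errors.append(dists.min())
    return float(np.mean(errors))
```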
Quantitative Evaluation: Figure 2.12 shows compactness, generalization, and specificity, independently evaluated for the identity and expression space. With 90 identity components our model captures 98% of the data variability, and with 300 components effectively 100%. The generalization error gradually declines for up to 300 identity components, while specificity does not increase significantly. Consequently, we use models with 90 and 300 identity components throughout our evaluations. We denote these with FLAME 90 and FLAME 300, respectively. For expression, we choose 100 components, representing 98% of the data variability.
Figure 2.13: Expressiveness of the FLAME identity space for fitting neutral scans of the BU-3DFE face database with a varying number of identity components (FLAME 49, 90, and 300). Appendix A shows further examples.
Qualitative Evaluation: Figure 2.13 qualitatively evaluates the influence of a varying number of identity components for fitting the neutral BU-3DFE face scans (Appendix A shows more samples). The error measures, for each scan vertex, the distance to the closest point in the surface of the registration. While FLAME 49 fits the global shape of the scan well, it is unable to capture localized person-specific details. Increasing the number of components increases the ability of the model to reconstruct localized details. FLAME 300 leads to registrations with an error that is close to zero millimeters in most facial regions.
FLAME models head and jaw motions as joint rotations. Figure 2.14 shows the influence of the trained pose blendshapes. The pose blendshapes recreate realistic neck details when turning the head and stretch the cheeks when opening the mouth. The learned pose blendshapes result in significantly more realism than LBS.
2.7.3 Comparison to State-of-the-art
We compare FLAME to the Basel Face Model (BFM) [152] and the FaceWarehouse model (FW) [38]. We evaluate the ability of each model to account for unseen data by fitting them to static and
Figure 2.14: Influence of the pose blendshapes for different actuations of the neck and jaw joints in a rotational manner. Visualization of FLAME without (top) and with (bottom) activated pose blendshapes.
dynamic 3D data not part of the training; in all cases we use the same model-fitting framework. BFM is trained from 200 neutral expression shapes and all 199 identity components are available. FW is learned from 150 shapes but the model only includes 50 identity components plus 46 expression components.
Identity: The identity space is evaluated by fitting the models to the neutral BU-3DFE scans, initializing with the landmarks provided with the database. For a fair comparison to FW and BFM, FLAME is constrained to use comparable dimensions. Consequently we only make use of 49 FLAME shape components for comparison to FW and 198 components for comparison with BFM (we subtract one component since we select the appropriate gender). We further show the expressiveness of FLAME with 90 and 300 components.
Figure 2.15: Cumulative scan-to-mesh distance computed over all model fits of the neutral BU-3DFE scans (curves for FLAME 300, FLAME 198, FLAME 90, FLAME 49, FW, BFM Full, BFM 91, and BFM 50).
Figure 2.15 shows the cumulative scan-to-mesh distance computed over all model fits to the neutral BU-3DFE scans. With the same number of parameters, for FLAME 49, 74% of the scan vertices have a distance lower than 0.5 mm, compared to 69% for BFM 50 or 67% for FW. Compared to BFM with all components, for FLAME 198, 94% of the vertices have a distance less than 0.5 mm, compared to 92% for BFM Full. With 300 components, FLAME 300 fits 96% of the vertices with a distance of less than 0.5 mm.
Figure 2.16 compares the models visually (Appendix A shows more examples). Compared to FLAME, BFM introduces high-frequency details that make the fits look more realistic. Nevertheless, the comparison with the scans reveals that these details are hallucinated and spurious, as they come from people in the dataset, rather than from the scans. While lower-resolution and less detailed, FLAME is actually more accurate. Note, since FLAME contains modeled eyeballs, the eye region looks more realistic than the closed surface of BFM or the empty space of FW.
Figure 2.16: Comparison on the identity space of the Basel Face Model (BFM) [152], the FaceWarehouse model [38], and FLAME for fitting neutral scans of the BU-3DFE database. Appendix A shows further examples.
Expression: The ability to capture real facial expressions is evaluated by fitting FW and FLAME to our self-captured high-resolution dynamic test sequences (see Section 2.5). For comparison, we first compute a personalized shape space for each model per sequence by only optimizing the identity parameters, keeping the expression fixed to a neutral expression. For the rest of the sequence, only the expression and pose are optimized, initialized by landmarks, while the identity parameters are kept fixed. To remove one source of error caused by noisy landmarks, we register all test sequences with our texture-based registration framework and extract the same set of landmarks as shown in Figure 2.5. As for the identity evaluation, we constrain FLAME to be of comparable dimension to FW for a fair comparison. We use 49 components for identity and, as for FW, 46 components for expression and pose; i.e. we use 43 components for expression, and 3 degrees of freedom for the jaw rotation.
Figure 2.17 compares the median of the per-vertex distance to the scans, measured across all registered frames of the test data. For FW, 50% of all vertices in the face region have a distance lower than 1.0 mm, compared to 67% for FLAME 49, 73% for FLAME 90, and 75% for FLAME 300. With the same number of parameters, FLAME fits the data closer than FW.
Figure 2.17: Median per-vertex distance between the registration and the scan surface, measured across all frames of the test data. Top: female data. Bottom: male data.
Figure 2.18 visualizes examples from this experiment. While FW is able to perform the expression for the first sequence (top row), FLAME gives a more natural looking result with a lower error. For the second sequence (bottom row), FW is unable to reconstruct the widely open mouth. As FLAME models the mouth opening with a rotation, it better fits this extreme expression. As Figure 2.17 shows, if we used more components, FLAME would significantly outperform FW.
2.7.4 Shape Reconstruction from Images
FLAME is readily usable to reconstruct 3D faces from single 2D images. For comparison to FaceWarehouse, we fit both models to 2D image landmarks by optimizing the L2-norm distance between image landmarks and corresponding model vertices, projected into the image using the known camera calibration. Unlike other facial landmarks, the face contour does not correspond to specific 3D points. Therefore, the correspondences are updated based on the silhouette of the projected 3D face as described in Cao et al. [38]. The input landmarks are manually labeled in the same format as in FaceWarehouse. As in Section 2.7.3, we use 49 components for identity and 46 components for expression and pose (43 for expression and 3 for jaw pose), for a fair comparison.
Figure 2.18: Reconstruction quality from high-resolution motion sequences compared to FaceWarehouse (FW). Intermediate frames of three motion sequences. FLAME is restricted to have the same number of parameters as FW.
Figure 2.19: Comparison of the FaceWarehouse model (top) and FLAME (bottom) for 3D face fitting from a single 2D image. Note that the scan (pink) is only used for evaluation. Appendix A shows further examples.
Figure 2.19 shows the 2D landmark fitting using both models. FLAME better reconstructs the identity and expression. To quantify the error in the fit shown in Figure 2.19, we measure the distance from the fitted mesh to the ground truth scan. Due to the challenges in estimating depth from merely 2D landmarks, we first rigidly align the fitted mesh to the scan using precomputed 3D landmarks, and then measure the distances. For qualitative comparison, we further show the fitted mesh from a novel view for better comparison to the ground truth scan. As shown in Figure 2.19, FLAME has lower 3D error, suggesting that FLAME may provide a better prior for estimating 3D facial shape from 2D image features.
2.7.5 Expression Transfer
FLAME can easily be used to synthesize new motion sequences, e.g. by transferring the facial expression from a source actor to a target actor, while preserving the subject-specific details of the target face. This transfer is performed in three steps. First, the source sequence is registered with the proposed registration framework (Section 2.4.3) to compute the pose and expression coefficients $\{\vec{\theta}_s, \vec{\psi}_s\}$ for each frame of the source sequence. Second, a coupled registration (Section 2.4.2) is used to compute a personalized template $\bar{\mathbf{T}}_t$ for the target scan. Finally, replacing the average model template $\bar{\mathbf{T}}$ by the personalized target template $\bar{\mathbf{T}}_t$ results in a personalized FLAME model $M_t(\vec{\beta}, \vec{\theta}, \vec{\psi})$ of the target actor. The result of the expression transfer is then the model reconstruction $M_t(\vec{0}, \vec{\theta}_s, \vec{\psi}_s)$ using Equation 2.1.
Figure 2.20 shows the expression transfer between two subjects in our test dataset, while
Figure 2.1 shows transfer to a high-resolution scan from Beeler et al. [18]. Appendix A shows
additional results.
Figure 2.20: Expression transfer from a source sequence (blue) to a static target scan (pink).
The aligned personalized template for the scan is shown in green, the transferred expression in
yellow. Appendix A shows further examples.
2.7.6 Discussion
While FLAME moves closer to custom head models in realism, it still lacks the detail needed for high-quality animation. Fine-scale details such as wrinkles and pores are subject-specific and hence (i.e. due to the missing inter-subject correspondence) are not well modeled by a generic face model. A different approach (e.g. via deep learning) could be used to infer high-frequency and non-linear details, but this is beyond the scope of this work.
The surface-based decoupling of shape, pose, and expression variations (Sec. 2.6) requires a lot of diverse training data (Sec. 2.5). Exploiting anatomical constraints, i.e. by using a rigid stabilization method [17], could further improve the decoupling, but this would require a significant amount of work to handle the large amounts of training data as reasoning about the underlying skull is needed.
Here we learned expression blendshapes and showed that they capture real facial expressions
better than those of FaceWarehouse. We argue that these capture important correlations across
the face and result in natural looking expressions. Still animators may prefer more semantic, or
localized, controls. Consequently one could learn a mapping from our space to semantic attributes
as shown in other works [5, 83, 212] or train a localized space as proposed by Neumann et al. [138]
from our provided expression registration, and replace the global expression space of FLAME with
a local one.
Here we found that modeling the eyes improved alignment and the final model; we plan to do something similar for mouths by explicitly modeling them. FLAME is connected to a neck, which has the same topology as the SMPL body model. In the future we will combine the models, which will enable us to capture both the body and face together. Since we have eyes in the model, we also plan to integrate eye tracking.
One could also personalize our model to a particular actor, restricting the expression space based on past performance. Our model could also be fit to sparse marker data, enabling facial performance capture using standard methods. Future work should also fit the model to images and video sequences by replacing simpler models in standard methods. Finally, images can be used to add more shape detail from shading cues as in recent work [72].
2.8 Conclusion
Here we trained a new model of the face from around 33; 000 3D scans from the CAESAR body
dataset, the D3DFACS dataset, and self captured sequences. To do so, we precisely aligned a
template mesh to all static and dynamic scans and will make the alignments of the D3DFACS
52
dataset available for research purposes. We defined the FLAME model using a PCA space for identity shape, simple rotational degrees of freedom and linear blend skinning for the neck, jaw, and eyeballs, corrective blendshapes for these rotations, and global expression blendshapes. We show that the learned model is significantly more expressive and realistic than the popular FaceWarehouse model and the Basel Face Model. We compare the models by fitting to static 3D scans and dynamic 3D sequences of novel subjects using the same optimization method. While significantly more accurate, FLAME has many fewer vertices, which also makes it more appropriate for real-time applications. Unlike over-complete representations associated with standard manual blendshapes, ours are easier to optimize because they are orthogonal. The model is designed to be compatible with existing rendering systems and is available for research purposes [142].
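As a compact illustration of how these components compose, the following NumPy sketch assembles a FLAME-style mesh from the template plus identity, expression, and pose-corrective offsets, followed by linear blend skinning; the array shapes and the precomputed joint transforms are assumptions made for illustration, not the released implementation.

import numpy as np

def assemble_flame_like_mesh(T_bar, shape_dirs, exp_dirs, pose_dirs,
                             betas, psi, pose_feature,
                             joint_transforms, skin_weights):
    # T_bar: (N, 3) template; shape_dirs: (N, 3, n_id) identity PCA basis;
    # exp_dirs: (N, 3, n_exp) expression blendshapes; pose_dirs: (N, 3, 9*K) pose correctives.
    # joint_transforms: (J, 4, 4) world transforms of the neck/jaw/eyeball joints (the kinematic
    # chain is omitted here); skin_weights: (N, J) linear-blend-skinning weights.
    T_shaped = (T_bar
                + shape_dirs @ betas          # identity shape offset
                + exp_dirs @ psi              # global expression offset
                + pose_dirs @ pose_feature)   # corrective offset for the joint rotations
    V_h = np.concatenate([T_shaped, np.ones((len(T_shaped), 1))], axis=1)  # homogeneous coords
    blended = np.einsum('nj,jab->nab', skin_weights, joint_transforms)     # per-vertex transform
    return np.einsum('nab,nb->na', blended, V_h)[:, :3]                    # posed vertices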
Chapter 3
Efficient Inference for Topologically Consistent Face Meshes
(a) Input images (9 of 15 views) (b) Mesh in correspondence (c) Skin details and appearances (d) Animation with fully rigged face model
Figure 3.1: ToFu examples. Given (a) multi-view images, our face modeling framework ToFu uses volumetric sampling to predict (b) accurate base meshes in consistent topology as well as (c) high-resolution details and appearances. Our efficient pipeline enables (d) rapid creation of production-quality avatars for animation.
3.1 Introduction
Creating high-fidelity digital humans is not only highly sought after in the film and gaming industry, but is also gaining interest in consumer applications, ranging from telepresence in AR/VR to virtual fashion models and virtual assistants. While fully automated single-view avatar digitization solutions exist [87, 89, 137, 201, 223], professional studios still opt for high-resolution multi-view images as input, to ensure the highest possible fidelity and surface coverage in a controlled setting [18, 76, 80, 123, 126, 164, 178] instead of unconstrained input data. Typically, high-resolution geometric details (< 1 mm error) are desired along with high-resolution physically-based material properties (at least 4K). Furthermore, to build a fully rigged face model for animation, a large number of facial scans and alignments (often over 30) are performed, typically following some conventions based on the Facial Action Coding System (FACS).
A typical approach used in production consists of a multi-view stereo acquisition process to capture detailed 3D scans of each facial expression, after which a non-rigid registration [18, 109] or inference method [113] is used to warp a 3D face model to each scan in order to ensure consistent mesh topology. Between these two steps, manual clean-up is often necessary to remove artifacts and unwanted surface regions, especially those with facial hair (beards, eyebrows) as well as teeth and neck regions. The registration process is often assisted with manual labeling of correspondences and parameter tweaking to ensure accurate fitting. In a production setting, a completed rig of a person can easily take up to a week to finalize.
Several recent techniques have been introduced to automate this process by fitting a 3D model directly to a calibrated set of input images. The multi-view stereo face modeling method of [68] is not only particularly slow, but also relies on dynamic sequences and carefully tuned parameters for each subject to ensure consistent parameterization between expressions. In particular, topological consistency cannot be guaranteed for facial expressions that are not captured as a continuous sequence. More recent deep learning approaches [11, 223] use 3D morphable model (3DMM) inference to obtain a coarse initial facial expression, but require an optimization-based refinement step to improve fitting accuracy. These methods are limited in fitting extreme expressions due to the constraints of linear 3DMMs, and in fitting tightly to the ground-truth face surfaces due to the global nature of their regression architectures. The additional photometric refinement also tends to fit unwanted regions like facial hair.
We propose ToFu (Topologically consistent Face from multi-view), a geometry inference framework that can produce topologically consistent meshes across facial identities and expressions. Instead of relying explicitly on a mesh-based face model such as a 3DMM, our volumetric approach is more general, allowing it to capture a wider range of expressions and subtle deformation details on the face. Our method is also three orders of magnitude faster than conventional methods, taking only 0.385 seconds to generate a dense 3D mesh (10K vertices), and it additionally produces assets for high-fidelity production use cases, such as albedo, specular, and high-resolution displacement maps.
To this end, we propose a progressive mesh generation network that can infer a topologically consistent mesh directly. Our volumetric architecture predicts vertex locations as probability distributions, along with volumetric features that are extracted using the underlying multi-view geometry. The topological structure of the face is embedded into this architecture using a hierarchical mesh representation and a coarse-to-fine network.
Our experiments show that ToFu is capable of automatically producing highly accurate geometry in consistent topology, while existing methods either rely on manual clean-up and parameter tuning, or are less accurate, especially for subjects with facial hair. Since we can ensure a consistent parameterization across facial identities and expressions without any human input, our solution is suitable for scaled digitization of high-fidelity facial avatars. We not only reduce the turnaround time for production, but also provide a critical solution for generating large facial datasets, which is otherwise associated with excessive manual labor. Our main contributions are:
• A novel volumetric feature sampling and refinement model for topologically consistent 3D mesh reconstruction from multi-view images.
• An appearance capture network to infer high-resolution skin details and appearance maps, which, combined with the base mesh, forms a complete package suitable for production in animation and photorealistic rendering.
• We demonstrate state-of-the-art performance for combined geometry and correspondence accuracy, while achieving mesh inference at near-interactive rates.
• Code and model are publicly available at https://tianyeli.github.io/tofu.
3.2 RelatedWork
Face Capture: Traditionally, face acquisition is separated into two steps, 3D face reconstruc-
tion and registration [55]. Facial geometry can be captured with laser scanners [107], passive
Multi-View Stereo (MVS) capture systems [16], dedicated active photometric stereo systems [76,
126], or depth sensors based on structured light or time-of-ight sensors [183, 218]. Among
these, MVS is the most commonly used [57, 66, 77, 101, 156, 213]. Although these approaches
produce high-quality geometry, they suer from heavy computation due to the pairwise features
matching across views, and they tend to fail in case of sparse view inputs due to the lack of
overlapping neighboring views. More recently, deep neural networks learn multi-view feature
57
matching for 3D geometry reconstruction [81, 91, 97, 185, 230]. Compared to classical MVS meth-
ods, these learning based methods represent a trade-o between accuracy and ecacy. All these
MVS methods output unstructured meshes, while our method produces meshes in dense vertex
correspondence.
Most registration methods use a template mesh and fit it to the scan surface by minimizing the distance between the scan surface and the template. For optimization, the template mesh is commonly parameterized with a statistical shape space [7, 21, 23, 115] or a general blendshape basis [172]. Other approaches directly optimize the vertices of the template mesh using a non-rigid Iterative Closest Point (ICP) [109], with a statistical model as regularizer [116], or jointly optimize correspondence across an entire dataset in a groupwise fashion [27, 235]. For a more thorough review of face acquisition and registration, see Egger et al. [55]. All these registration methods solve for facial correspondence independently of the data acquisition. Therefore, errors in the raw scan data propagate into the registration.
Only a few methods exist that, like ours, directly output high-quality registered 3D faces from calibrated multi-view input [18, 30, 68]. While sharing a similar goal, our method goes beyond these approaches in several significant ways. Unlike our method, they require calibrated multi-view image sequence input, contain multiple optimization steps (e.g. for building a subject-specific template [68], or anchor frame meshes [18]), and are computationally slow (e.g. 25 minutes per frame for the coarse mesh reconstruction [68]). ToFu instead takes calibrated multi-view images as input (i.e. static) and directly outputs a high-quality mesh in dense vertex correspondence in 0.385 seconds. Nevertheless, our method achieves stable reconstruction and registration results for sequence input.
Model-based Reconstruction: A large body of work aims at reconstructing 3D faces from unconstrained images or monocular videos. To constrain the problem, most methods estimate the coefficients of a statistical 3D morphable model (3DMM) in an optimization-based [2, 14, 22, 23, 203] or learning-based framework [40, 61, 75, 163, 173, 201, 204]. Due to the use of over-simplified, mostly linear statistical models, the reconstructed meshes only capture the coarse geometric shape while subtle details are missing. For better generalization to unconstrained conditions, [197, 205] jointly learn a 3D prior and reconstruct 3D faces from images. Although monocular reconstruction methods can provide visually appealing 3D face reconstructions, their accuracy and quality are not suitable for applications which require metrically accurate geometry. Recently published work indicates that existing state-of-the-art monocular 3D face reconstructions are metrically worse or only marginally better than a static model mean face, when compared to ground-truth 3D scans [173]. This comes as little surprise, as inferring 3D geometry from a single image is an ill-posed problem due to the inherent ambiguity of focal length, scale and shape [13]: under perspective projection, different shapes result in the same image at different object-camera distances. Our method instead leverages explicit calibrated multi-view information to reconstruct metrically accurate 3D geometry.
3.3 Multi-View Face Inference
As shown in Fig. 3.2, given images $\{I_i\}_{i=1}^{K}$ in $K$ views with known camera calibration $\{P_i\}_{i=1}^{K}$, together denoted as $\mathcal{I} = \{I_i, P_i\}_{i=1}^{K}$, the goal of ToFu is two-fold: (1) to reconstruct an accurate base mesh in an artist-designed topology, and (2) to estimate pore-level geometric details and high-quality facial appearance in the form of albedo and specular reflectance maps.
Figure 3.2: Overview of our end-to-end face modeling system. Given images captured from multiple views, the progressive mesh generation network predicts an accurate face mesh in consistent topology. Then the appearance and detail capture network synthesizes high-resolution skin detail and attribute maps, which enables highly detailed geometry and photo-realistic renderings.
Formally, an output base mesh $M$ contains a list of vertices $V \in \mathbb{R}^{N \times 3}$ and a fixed triangulation $T$. The base meshes are required to (1) tightly fit the face surfaces, (2) share a common artist-designed mesh topology, where each vertex encodes the same semantic interpretation across all meshes, and (3) have a sufficient triangle or quad density (with $N > 10^4$ vertices).
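To make the fixed-topology requirement concrete, a small illustrative sketch is given below; the file name and wrapper are hypothetical, not part of ToFu, and only show that the triangulation $T$ is a shared constant while the vertex array $V$ is what the network predicts per capture.

import numpy as np

# Hypothetical artist-designed topology, loaded once and shared by every predicted mesh,
# so vertex i carries the same semantic meaning (e.g. a specific point on the nose) everywhere.
FIXED_FACES = np.load("artist_topology_faces.npy")  # assumed file, shape (F, 3), integer indices

def make_base_mesh(predicted_vertices):
    # predicted_vertices: (N, 3) float array from the network, with N > 10**4.
    assert predicted_vertices.ndim == 2 and predicted_vertices.shape[1] == 3
    return {"vertices": predicted_vertices, "faces": FIXED_FACES}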
The key to dense mesh prediction is a coarse-to-fine network architecture, as shown in Fig. 3.3. The desired semantic mesh correspondence is naturally embedded in the hierarchical architecture. Based on that, the geometry is inferred in two stages: (1) a coarse mesh prediction $M_0$, by the global stage $V_0 = F_g(\mathcal{I})$; and (2) iterative upsampling and refinement into the denser meshes $\{M_1, M_2, \ldots, M_L\}$, by the local stage $V_{k+1} = F_l(\mathcal{I}, V_k)$. $M_L$ is the final prediction of the base mesh $M$.
Conceptually, the global stage mimics a learning-based MVS, while the local stage provides “updates” as if in an iterative mesh registration. In contrast to these two traditional methods, our two stages share consistent correspondence in a fixed topology and use volumetric features for geometry inference and surface refinement.
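The two stages can be summarized by a short inference loop; in the sketch below, global_stage, local_stage, and upsample are stand-ins for the networks $F_g$, $F_l$ and the fixed topological upsampling of the mesh hierarchy, not the actual implementation.

def progressive_mesh_inference(images, cams, global_stage, local_stage, upsample, num_levels):
    # images, cams: the calibrated multi-view input I = {(I_i, P_i)}.
    vertices = global_stage(images, cams)               # V_0 = F_g(I): coarse mesh M_0
    for k in range(num_levels):                         # walk up the mesh hierarchy
        vertices = upsample(vertices, level=k)          # densify M_k toward M_{k+1}
        vertices = local_stage(images, cams, vertices)  # V_{k+1} = F_l(I, V_k): local refinement
    return vertices                                     # vertices of the final base mesh M_L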
Figure 3.3: Overview of the progressive mesh generation network.
3.3.1 Global Geometry Stage
Volumetric Feature Sampling: In order to extract salient features to predict surface points in correspondence, we deploy a shared U-Net convolutional network to extract local 2D feature maps $F_i$ for each input image $I_i$. We sample volumetric features $L$ by bilinearly sampling and fusing image features at projected coordinates in all images, for each local point $\mathbf{v} \in \mathbb{R}^3$ in the 3D grid $G$:

$L(\mathbf{v}) = \psi\big(\{F_i(\pi(\mathbf{v}; P_i))\}_{i=1}^{K}\big),$   (3.1)

where $\pi(\cdot)$ is the perspective projection function and $\psi(\cdot)$ is a view-wise fusion function, for which common choices are max, mean, or standard deviation. The 3D grid $G$ is a set of points on a regular 3D grid, which can be defined at arbitrary locations with arbitrary shapes. Here we choose cube grids (shown as green cubes in Fig. 3.3) to feed into 3D convolution networks.
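A minimal PyTorch-style sketch of Eq. 3.1 is given below; it assumes the per-view feature maps have already been extracted, and project_points (perspective projection into normalized image coordinates) as well as the concatenated mean/standard-deviation fusion are illustrative choices rather than the exact implementation.

import torch
import torch.nn.functional as F

def sample_volumetric_features(feature_maps, cams, grid_points, project_points):
    # feature_maps: (K, C, H, W) per-view feature maps F_i; cams: (K, 3, 4) projection matrices P_i;
    # grid_points: (M, 3) points v of the grid G; project_points returns (K, M, 2) coords in [-1, 1].
    uv = project_points(grid_points, cams)                    # pi(v; P_i) for every view
    sampled = F.grid_sample(feature_maps, uv.unsqueeze(2),    # bilinear sampling: (K, C, M, 1)
                            align_corners=True).squeeze(-1)   # -> (K, C, M)
    # View-wise fusion psi: here, concatenated mean and standard deviation over the K views.
    return torch.cat([sampled.mean(dim=0), sampled.std(dim=0)], dim=0)  # L(v) for all grid points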
Global Geometry Network: To enable vertex flexibility, we design the network to predict vertex locations free of the constraints of 3DMMs. To encourage better generalization, we design a volumetric network architecture to learn a probability distribution, instead of the absolute location, for each vertex. We define a canonical global grid $G_g$ that covers the whole captured volume for subject heads. We apply the volumetric feature sampling (Eq. 3.1) on the global grid $G_g$ to obtain the global volumetric feature $L_g$, similar to [90, 93]. We deploy the global geometry network $\Phi_g$, a 3D convolutional network with skip connections, to predict a probability volume $C_g = \Phi_g(L_g)$, in which each channel encodes the probability distribution for the location of a corresponding vertex in the initial mesh $M_0$. The vertex locations are extracted by a per-channel soft-argmax operation, $V_0 = \Phi_E(C_g)$, similar to that in [93].
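The per-channel soft-argmax extraction can be written in a few lines; the sketch below assumes the probability volume is stored as raw per-vertex scores of shape (N, D, H, W) alongside a flattened list of grid-cell coordinates, which may differ in detail from the operator of [93].

import torch

def soft_argmax_3d(prob_volume, grid_coords):
    # prob_volume: (N, D, H, W), one channel per vertex of the initial mesh M_0;
    # grid_coords: (D*H*W, 3) world-space coordinates of the cells of the global grid G_g.
    flat = prob_volume.reshape(prob_volume.shape[0], -1)
    weights = torch.softmax(flat, dim=1)   # turn each channel into a distribution (skip if already normalized)
    return weights @ grid_coords           # expectation over grid locations = soft-argmax, giving V_0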
3.3.2 Local Geometry Stage
(Figure: the mesh hierarchy used for progressive refinement, with levels of 341, 1194, 3412, and 10495 vertices.)
AAAB53icdVDLSsNAFJ3UV62vqks3g0VwFSYxxWZXcOOyBWMLbSiTybQdO5mEmYlQQr/AjQsVt/6SO//G6UNQ0QMXDufcy733RBlnSiP0YZXW1jc2t8rblZ3dvf2D6uHRrUpzSWhAUp7KboQV5UzQQDPNaTeTFCcRp51ocjX3O/dUKpaKGz3NaJjgkWBDRrA2UjseVGvI9n3keXWI7DpyXbdhCLpwG74DHRstUAMrtAbV936ckjyhQhOOleo5KNNhgaVmhNNZpZ8rmmEywSPaM1TghKqwWBw6g2dGieEwlaaEhgv1+0SBE6WmSWQ6E6zH6rc3F//yerkeNsKCiSzXVJDlomHOoU7h/GsYM0mJ5lNDMJHM3ArJGEtMtMmmYkL4+hT+TwLX9m2n7dWa3iqNMjgBp+AcOOASNME1aIEAEEDBA3gCz9ad9Wi9WK/L1pK1mjkGP2C9fQKbyYz3
AAAB53icdVDLSsNAFJ3UV62vqks3g0VwFSYxxWZXcOOyBWMLbSiTybQdO5mEmYlQQr/AjQsVt/6SO//G6UNQ0QMXDufcy733RBlnSiP0YZXW1jc2t8rblZ3dvf2D6uHRrUpzSWhAUp7KboQV5UzQQDPNaTeTFCcRp51ocjX3O/dUKpaKGz3NaJjgkWBDRrA2UjseVGvI9n3keXWI7DpyXbdhCLpwG74DHRstUAMrtAbV936ckjyhQhOOleo5KNNhgaVmhNNZpZ8rmmEywSPaM1TghKqwWBw6g2dGieEwlaaEhgv1+0SBE6WmSWQ6E6zH6rc3F//yerkeNsKCiSzXVJDlomHOoU7h/GsYM0mJ5lNDMJHM3ArJGEtMtMmmYkL4+hT+TwLX9m2n7dWa3iqNMjgBp+AcOOASNME1aIEAEEDBA3gCz9ad9Wi9WK/L1pK1mjkGP2C9fQKbyYz3
AAAB53icdVDLSsNAFJ3UV62vqks3g0VwFSYxxWZXcOOyBWMLbSiTybQdO5mEmYlQQr/AjQsVt/6SO//G6UNQ0QMXDufcy733RBlnSiP0YZXW1jc2t8rblZ3dvf2D6uHRrUpzSWhAUp7KboQV5UzQQDPNaTeTFCcRp51ocjX3O/dUKpaKGz3NaJjgkWBDRrA2UjseVGvI9n3keXWI7DpyXbdhCLpwG74DHRstUAMrtAbV936ckjyhQhOOleo5KNNhgaVmhNNZpZ8rmmEywSPaM1TghKqwWBw6g2dGieEwlaaEhgv1+0SBE6WmSWQ6E6zH6rc3F//yerkeNsKCiSzXVJDlomHOoU7h/GsYM0mJ5lNDMJHM3ArJGEtMtMmmYkL4+hT+TwLX9m2n7dWa3iqNMjgBp+AcOOASNME1aIEAEEDBA3gCz9ad9Wi9WK/L1pK1mjkGP2C9fQKbyYz3
V
(j)
k+1
AAAB/nicdVDLSgMxFM3UV62vquDGTbAIFWHIaAc7u4IblxXsA9paMmmmjc08SDJCGWfhr7hxoeLW73Dn35hpK6jogcDhnHu5J8eNOJMKoQ8jt7C4tLySXy2srW9sbhW3d5oyjAWhDRLyULRdLClnAW0opjhtR4Ji3+W05Y7PM791S4VkYXClJhHt+XgYMI8RrLTUL+51faxGrpc0034yPrbS66R8c5T2iyVkItuxLQSRaSPLOc2I41Qrtg0tE01RAnPU+8X37iAksU8DRTiWsmOhSPUSLBQjnKaFbixphMkYD2lH0wD7VPaSaf4UHmplAL1Q6BcoOFW/byTYl3Liu3oySyt/e5n4l9eJlVftJSyIYkUDMjvkxRyqEGZlwAETlCg+0QQTwXRWSEZYYKJ0ZQVdwtdP4f+kcWI6pnVZKdUq8zbyYB8cgDKwwBmogQtQBw1AwB14AE/g2bg3Ho0X43U2mjPmO7vgB4y3T3DeleY=
AAAB/nicdVDLSgMxFM3UV62vquDGTbAIFWHIaAc7u4IblxXsA9paMmmmjc08SDJCGWfhr7hxoeLW73Dn35hpK6jogcDhnHu5J8eNOJMKoQ8jt7C4tLySXy2srW9sbhW3d5oyjAWhDRLyULRdLClnAW0opjhtR4Ji3+W05Y7PM791S4VkYXClJhHt+XgYMI8RrLTUL+51faxGrpc0034yPrbS66R8c5T2iyVkItuxLQSRaSPLOc2I41Qrtg0tE01RAnPU+8X37iAksU8DRTiWsmOhSPUSLBQjnKaFbixphMkYD2lH0wD7VPaSaf4UHmplAL1Q6BcoOFW/byTYl3Liu3oySyt/e5n4l9eJlVftJSyIYkUDMjvkxRyqEGZlwAETlCg+0QQTwXRWSEZYYKJ0ZQVdwtdP4f+kcWI6pnVZKdUq8zbyYB8cgDKwwBmogQtQBw1AwB14AE/g2bg3Ho0X43U2mjPmO7vgB4y3T3DeleY=
AAAB/nicdVDLSgMxFM3UV62vquDGTbAIFWHIaAc7u4IblxXsA9paMmmmjc08SDJCGWfhr7hxoeLW73Dn35hpK6jogcDhnHu5J8eNOJMKoQ8jt7C4tLySXy2srW9sbhW3d5oyjAWhDRLyULRdLClnAW0opjhtR4Ji3+W05Y7PM791S4VkYXClJhHt+XgYMI8RrLTUL+51faxGrpc0034yPrbS66R8c5T2iyVkItuxLQSRaSPLOc2I41Qrtg0tE01RAnPU+8X37iAksU8DRTiWsmOhSPUSLBQjnKaFbixphMkYD2lH0wD7VPaSaf4UHmplAL1Q6BcoOFW/byTYl3Liu3oySyt/e5n4l9eJlVftJSyIYkUDMjvkxRyqEGZlwAETlCg+0QQTwXRWSEZYYKJ0ZQVdwtdP4f+kcWI6pnVZKdUq8zbyYB8cgDKwwBmogQtQBw1AwB14AE/g2bg3Ho0X43U2mjPmO7vgB4y3T3DeleY=
AAAB/nicdVDLSgMxFM3UV62vquDGTbAIFWHIaAc7u4IblxXsA9paMmmmjc08SDJCGWfhr7hxoeLW73Dn35hpK6jogcDhnHu5J8eNOJMKoQ8jt7C4tLySXy2srW9sbhW3d5oyjAWhDRLyULRdLClnAW0opjhtR4Ji3+W05Y7PM791S4VkYXClJhHt+XgYMI8RrLTUL+51faxGrpc0034yPrbS66R8c5T2iyVkItuxLQSRaSPLOc2I41Qrtg0tE01RAnPU+8X37iAksU8DRTiWsmOhSPUSLBQjnKaFbixphMkYD2lH0wD7VPaSaf4UHmplAL1Q6BcoOFW/byTYl3Liu3oySyt/e5n4l9eJlVftJSyIYkUDMjvkxRyqEGZlwAETlCg+0QQTwXRWSEZYYKJ0ZQVdwtdP4f+kcWI6pnVZKdUq8zbyYB8cgDKwwBmogQtQBw1AwB14AE/g2bg3Ho0X43U2mjPmO7vgB4y3T3DeleY=
Step 1: upsampling operator Step 2: local refinement network
M0
AAAB83icbVDLSgMxFL1TX7W+qi7dBIvgqsyI4GNVcONGqODYQjuUTJppQzPJmGQKZeh3uHGh4tafceffmGlnoa0HAodz7uWenDDhTBvX/XZKK6tr6xvlzcrW9s7uXnX/4FHLVBHqE8mlaodYU84E9Q0znLYTRXEcctoKRze53xpTpZkUD2aS0CDGA8EiRrCxUtCNsRkSzLO7ac/tVWtu3Z0BLROvIDUo0OxVv7p9SdKYCkM41rrjuYkJMqwMI5xOK91U0wSTER7QjqUCx1QH2Sz0FJ1YpY8iqewTBs3U3xsZjrWexKGdzEPqRS8X//M6qYkug4yJJDVUkPmhKOXISJQ3gPpMUWL4xBJMFLNZERlihYmxPVVsCd7il5eJf1a/qnv357XGddFGGY7gGE7BgwtowC00wQcCT/AMr/DmjJ0X5935mI+WnGLnEP7A+fwBFfyR0g==
AAAB83icbVDLSgMxFL1TX7W+qi7dBIvgqsyI4GNVcONGqODYQjuUTJppQzPJmGQKZeh3uHGh4tafceffmGlnoa0HAodz7uWenDDhTBvX/XZKK6tr6xvlzcrW9s7uXnX/4FHLVBHqE8mlaodYU84E9Q0znLYTRXEcctoKRze53xpTpZkUD2aS0CDGA8EiRrCxUtCNsRkSzLO7ac/tVWtu3Z0BLROvIDUo0OxVv7p9SdKYCkM41rrjuYkJMqwMI5xOK91U0wSTER7QjqUCx1QH2Sz0FJ1YpY8iqewTBs3U3xsZjrWexKGdzEPqRS8X//M6qYkug4yJJDVUkPmhKOXISJQ3gPpMUWL4xBJMFLNZERlihYmxPVVsCd7il5eJf1a/qnv357XGddFGGY7gGE7BgwtowC00wQcCT/AMr/DmjJ0X5935mI+WnGLnEP7A+fwBFfyR0g==
AAAB83icbVDLSgMxFL1TX7W+qi7dBIvgqsyI4GNVcONGqODYQjuUTJppQzPJmGQKZeh3uHGh4tafceffmGlnoa0HAodz7uWenDDhTBvX/XZKK6tr6xvlzcrW9s7uXnX/4FHLVBHqE8mlaodYU84E9Q0znLYTRXEcctoKRze53xpTpZkUD2aS0CDGA8EiRrCxUtCNsRkSzLO7ac/tVWtu3Z0BLROvIDUo0OxVv7p9SdKYCkM41rrjuYkJMqwMI5xOK91U0wSTER7QjqUCx1QH2Sz0FJ1YpY8iqewTBs3U3xsZjrWexKGdzEPqRS8X//M6qYkug4yJJDVUkPmhKOXISJQ3gPpMUWL4xBJMFLNZERlihYmxPVVsCd7il5eJf1a/qnv357XGddFGGY7gGE7BgwtowC00wQcCT/AMr/DmjJ0X5935mI+WnGLnEP7A+fwBFfyR0g==
AAAB83icbVDLSgMxFL1TX7W+qi7dBIvgqsyI4GNVcONGqODYQjuUTJppQzPJmGQKZeh3uHGh4tafceffmGlnoa0HAodz7uWenDDhTBvX/XZKK6tr6xvlzcrW9s7uXnX/4FHLVBHqE8mlaodYU84E9Q0znLYTRXEcctoKRze53xpTpZkUD2aS0CDGA8EiRrCxUtCNsRkSzLO7ac/tVWtu3Z0BLROvIDUo0OxVv7p9SdKYCkM41rrjuYkJMqwMI5xOK91U0wSTER7QjqUCx1QH2Sz0FJ1YpY8iqewTBs3U3xsZjrWexKGdzEPqRS8X//M6qYkug4yJJDVUkPmhKOXISJQ3gPpMUWL4xBJMFLNZERlihYmxPVVsCd7il5eJf1a/qnv357XGddFGGY7gGE7BgwtowC00wQcCT/AMr/DmjJ0X5935mI+WnGLnEP7A+fwBFfyR0g==
M1
AAAB83icbVDLSgMxFL1TX7W+qi7dBIvgqsyI4GNVcONGqODYQjuUTJppQzPJmGQKZeh3uHGh4tafceffmGlnoa0HAodz7uWenDDhTBvX/XZKK6tr6xvlzcrW9s7uXnX/4FHLVBHqE8mlaodYU84E9Q0znLYTRXEcctoKRze53xpTpZkUD2aS0CDGA8EiRrCxUtCNsRkSzLO7ac/rVWtu3Z0BLROvIDUo0OxVv7p9SdKYCkM41rrjuYkJMqwMI5xOK91U0wSTER7QjqUCx1QH2Sz0FJ1YpY8iqewTBs3U3xsZjrWexKGdzEPqRS8X//M6qYkug4yJJDVUkPmhKOXISJQ3gPpMUWL4xBJMFLNZERlihYmxPVVsCd7il5eJf1a/qnv357XGddFGGY7gGE7BgwtowC00wQcCT/AMr/DmjJ0X5935mI+WnGLnEP7A+fwBF3+R0w==
AAAB83icbVDLSgMxFL1TX7W+qi7dBIvgqsyI4GNVcONGqODYQjuUTJppQzPJmGQKZeh3uHGh4tafceffmGlnoa0HAodz7uWenDDhTBvX/XZKK6tr6xvlzcrW9s7uXnX/4FHLVBHqE8mlaodYU84E9Q0znLYTRXEcctoKRze53xpTpZkUD2aS0CDGA8EiRrCxUtCNsRkSzLO7ac/rVWtu3Z0BLROvIDUo0OxVv7p9SdKYCkM41rrjuYkJMqwMI5xOK91U0wSTER7QjqUCx1QH2Sz0FJ1YpY8iqewTBs3U3xsZjrWexKGdzEPqRS8X//M6qYkug4yJJDVUkPmhKOXISJQ3gPpMUWL4xBJMFLNZERlihYmxPVVsCd7il5eJf1a/qnv357XGddFGGY7gGE7BgwtowC00wQcCT/AMr/DmjJ0X5935mI+WnGLnEP7A+fwBF3+R0w==
AAAB83icbVDLSgMxFL1TX7W+qi7dBIvgqsyI4GNVcONGqODYQjuUTJppQzPJmGQKZeh3uHGh4tafceffmGlnoa0HAodz7uWenDDhTBvX/XZKK6tr6xvlzcrW9s7uXnX/4FHLVBHqE8mlaodYU84E9Q0znLYTRXEcctoKRze53xpTpZkUD2aS0CDGA8EiRrCxUtCNsRkSzLO7ac/rVWtu3Z0BLROvIDUo0OxVv7p9SdKYCkM41rrjuYkJMqwMI5xOK91U0wSTER7QjqUCx1QH2Sz0FJ1YpY8iqewTBs3U3xsZjrWexKGdzEPqRS8X//M6qYkug4yJJDVUkPmhKOXISJQ3gPpMUWL4xBJMFLNZERlihYmxPVVsCd7il5eJf1a/qnv357XGddFGGY7gGE7BgwtowC00wQcCT/AMr/DmjJ0X5935mI+WnGLnEP7A+fwBF3+R0w==
AAAB83icbVDLSgMxFL1TX7W+qi7dBIvgqsyI4GNVcONGqODYQjuUTJppQzPJmGQKZeh3uHGh4tafceffmGlnoa0HAodz7uWenDDhTBvX/XZKK6tr6xvlzcrW9s7uXnX/4FHLVBHqE8mlaodYU84E9Q0znLYTRXEcctoKRze53xpTpZkUD2aS0CDGA8EiRrCxUtCNsRkSzLO7ac/rVWtu3Z0BLROvIDUo0OxVv7p9SdKYCkM41rrjuYkJMqwMI5xOK91U0wSTER7QjqUCx1QH2Sz0FJ1YpY8iqewTBs3U3xsZjrWexKGdzEPqRS8X//M6qYkug4yJJDVUkPmhKOXISJQ3gPpMUWL4xBJMFLNZERlihYmxPVVsCd7il5eJf1a/qnv357XGddFGGY7gGE7BgwtowC00wQcCT/AMr/DmjJ0X5935mI+WnGLnEP7A+fwBF3+R0w==
M2
AAAB83icbVDLSgMxFL3js9ZX1aWbYBFclZki+FgV3LgRKji20A4lk2ba0EwyJplCGfodblyouPVn3Pk3ZtpZaOuBwOGce7knJ0w408Z1v52V1bX1jc3SVnl7Z3dvv3Jw+Khlqgj1ieRStUOsKWeC+oYZTtuJojgOOW2Fo5vcb42p0kyKBzNJaBDjgWARI9hYKejG2AwJ5tndtFfvVapuzZ0BLROvIFUo0OxVvrp9SdKYCkM41rrjuYkJMqwMI5xOy91U0wSTER7QjqUCx1QH2Sz0FJ1apY8iqewTBs3U3xsZjrWexKGdzEPqRS8X//M6qYkug4yJJDVUkPmhKOXISJQ3gPpMUWL4xBJMFLNZERlihYmxPZVtCd7il5eJX69d1bz782rjumijBMdwAmfgwQU04Baa4AOBJ3iGV3hzxs6L8+58zEdXnGLnCP7A+fwBGQKR1A==
AAAB83icbVDLSgMxFL3js9ZX1aWbYBFclZki+FgV3LgRKji20A4lk2ba0EwyJplCGfodblyouPVn3Pk3ZtpZaOuBwOGce7knJ0w408Z1v52V1bX1jc3SVnl7Z3dvv3Jw+Khlqgj1ieRStUOsKWeC+oYZTtuJojgOOW2Fo5vcb42p0kyKBzNJaBDjgWARI9hYKejG2AwJ5tndtFfvVapuzZ0BLROvIFUo0OxVvrp9SdKYCkM41rrjuYkJMqwMI5xOy91U0wSTER7QjqUCx1QH2Sz0FJ1apY8iqewTBs3U3xsZjrWexKGdzEPqRS8X//M6qYkug4yJJDVUkPmhKOXISJQ3gPpMUWL4xBJMFLNZERlihYmxPZVtCd7il5eJX69d1bz782rjumijBMdwAmfgwQU04Baa4AOBJ3iGV3hzxs6L8+58zEdXnGLnCP7A+fwBGQKR1A==
AAAB83icbVDLSgMxFL3js9ZX1aWbYBFclZki+FgV3LgRKji20A4lk2ba0EwyJplCGfodblyouPVn3Pk3ZtpZaOuBwOGce7knJ0w408Z1v52V1bX1jc3SVnl7Z3dvv3Jw+Khlqgj1ieRStUOsKWeC+oYZTtuJojgOOW2Fo5vcb42p0kyKBzNJaBDjgWARI9hYKejG2AwJ5tndtFfvVapuzZ0BLROvIFUo0OxVvrp9SdKYCkM41rrjuYkJMqwMI5xOy91U0wSTER7QjqUCx1QH2Sz0FJ1apY8iqewTBs3U3xsZjrWexKGdzEPqRS8X//M6qYkug4yJJDVUkPmhKOXISJQ3gPpMUWL4xBJMFLNZERlihYmxPZVtCd7il5eJX69d1bz782rjumijBMdwAmfgwQU04Baa4AOBJ3iGV3hzxs6L8+58zEdXnGLnCP7A+fwBGQKR1A==
AAAB83icbVDLSgMxFL3js9ZX1aWbYBFclZki+FgV3LgRKji20A4lk2ba0EwyJplCGfodblyouPVn3Pk3ZtpZaOuBwOGce7knJ0w408Z1v52V1bX1jc3SVnl7Z3dvv3Jw+Khlqgj1ieRStUOsKWeC+oYZTtuJojgOOW2Fo5vcb42p0kyKBzNJaBDjgWARI9hYKejG2AwJ5tndtFfvVapuzZ0BLROvIFUo0OxVvrp9SdKYCkM41rrjuYkJMqwMI5xOy91U0wSTER7QjqUCx1QH2Sz0FJ1apY8iqewTBs3U3xsZjrWexKGdzEPqRS8X//M6qYkug4yJJDVUkPmhKOXISJQ3gPpMUWL4xBJMFLNZERlihYmxPZVtCd7il5eJX69d1bz782rjumijBMdwAmfgwQU04Baa4AOBJ3iGV3hzxs6L8+58zEdXnGLnCP7A+fwBGQKR1A==
M3
AAAB83icbVDLSgMxFL1TX7W+qi7dBIvgqsyo4GNVcONGqODYQjuUTJppQ5PMmGQKZeh3uHGh4tafceffmGlnodUDgcM593JPTphwpo3rfjmlpeWV1bXyemVjc2t7p7q796DjVBHqk5jHqh1iTTmT1DfMcNpOFMUi5LQVjq5zvzWmSrNY3ptJQgOBB5JFjGBjpaArsBkSzLPbae+0V625dXcG9Jd4BalBgWav+tntxyQVVBrCsdYdz01MkGFlGOF0WummmiaYjPCAdiyVWFAdZLPQU3RklT6KYmWfNGim/tzIsNB6IkI7mYfUi14u/ud1UhNdBBmTSWqoJPNDUcqRiVHeAOozRYnhE0swUcxmRWSIFSbG9lSxJXiLX/5L/JP6Zd27O6s1roo2ynAAh3AMHpxDA26gCT4QeIQneIFXZ+w8O2/O+3y05BQ7+/ALzsc3GoWR1Q==
AAAB83icbVDLSgMxFL1TX7W+qi7dBIvgqsyo4GNVcONGqODYQjuUTJppQ5PMmGQKZeh3uHGh4tafceffmGlnodUDgcM593JPTphwpo3rfjmlpeWV1bXyemVjc2t7p7q796DjVBHqk5jHqh1iTTmT1DfMcNpOFMUi5LQVjq5zvzWmSrNY3ptJQgOBB5JFjGBjpaArsBkSzLPbae+0V625dXcG9Jd4BalBgWav+tntxyQVVBrCsdYdz01MkGFlGOF0WummmiaYjPCAdiyVWFAdZLPQU3RklT6KYmWfNGim/tzIsNB6IkI7mYfUi14u/ud1UhNdBBmTSWqoJPNDUcqRiVHeAOozRYnhE0swUcxmRWSIFSbG9lSxJXiLX/5L/JP6Zd27O6s1roo2ynAAh3AMHpxDA26gCT4QeIQneIFXZ+w8O2/O+3y05BQ7+/ALzsc3GoWR1Q==
AAAB83icbVDLSgMxFL1TX7W+qi7dBIvgqsyo4GNVcONGqODYQjuUTJppQ5PMmGQKZeh3uHGh4tafceffmGlnodUDgcM593JPTphwpo3rfjmlpeWV1bXyemVjc2t7p7q796DjVBHqk5jHqh1iTTmT1DfMcNpOFMUi5LQVjq5zvzWmSrNY3ptJQgOBB5JFjGBjpaArsBkSzLPbae+0V625dXcG9Jd4BalBgWav+tntxyQVVBrCsdYdz01MkGFlGOF0WummmiaYjPCAdiyVWFAdZLPQU3RklT6KYmWfNGim/tzIsNB6IkI7mYfUi14u/ud1UhNdBBmTSWqoJPNDUcqRiVHeAOozRYnhE0swUcxmRWSIFSbG9lSxJXiLX/5L/JP6Zd27O6s1roo2ynAAh3AMHpxDA26gCT4QeIQneIFXZ+w8O2/O+3y05BQ7+/ALzsc3GoWR1Q==
AAAB83icbVDLSgMxFL1TX7W+qi7dBIvgqsyo4GNVcONGqODYQjuUTJppQ5PMmGQKZeh3uHGh4tafceffmGlnodUDgcM593JPTphwpo3rfjmlpeWV1bXyemVjc2t7p7q796DjVBHqk5jHqh1iTTmT1DfMcNpOFMUi5LQVjq5zvzWmSrNY3ptJQgOBB5JFjGBjpaArsBkSzLPbae+0V625dXcG9Jd4BalBgWav+tntxyQVVBrCsdYdz01MkGFlGOF0WummmiaYjPCAdiyVWFAdZLPQU3RklT6KYmWfNGim/tzIsNB6IkI7mYfUi14u/ud1UhNdBBmTSWqoJPNDUcqRiVHeAOozRYnhE0swUcxmRWSIFSbG9lSxJXiLX/5L/JP6Zd27O6s1roo2ynAAh3AMHpxDA26gCT4QeIQneIFXZ+w8O2/O+3y05BQ7+/ALzsc3GoWR1Q==
F
l
(·)
AAAB/HicdVDLSgMxFM34rPU1PnZugkWomyFTqu2Ai4IgLitYW2hLyWTSNjSTGZKMUIfir7hxoeLWD3Hn35hpK6jogcDhnHu5J8ePOVMaoQ9rYXFpeWU1t5Zf39jc2rZ3dm9UlEhCGyTikWz5WFHOBG1opjltxZLi0Oe06Y/OM795S6VikbjW45h2QzwQrM8I1kbq2fudEOshwTy9mPR4sUOCSB/37AJy0KlXRQgip4xKVTcjnndSrrjQddAUBTBHvWe/d4KIJCEVmnCsVNtFse6mWGpGOJ3kO4miMSYjPKBtQwUOqeqm0/QTeGSUAPYjaZ7QcKp+30hxqNQ49M1kllX99jLxL6+d6H61mzIRJ5oKMjvUTzjUEcyqgAGTlGg+NgQTyUxWSIZYYqJNYXlTwtdP4f+kUXI8x70qF2pn8zZy4AAcgiJwQQXUwCWogwYg4A48gCfwbN1bj9aL9TobXbDmO3vgB6y3TwBOlSA=
AAAB/HicdVDLSgMxFM34rPU1PnZugkWomyFTqu2Ai4IgLitYW2hLyWTSNjSTGZKMUIfir7hxoeLWD3Hn35hpK6jogcDhnHu5J8ePOVMaoQ9rYXFpeWU1t5Zf39jc2rZ3dm9UlEhCGyTikWz5WFHOBG1opjltxZLi0Oe06Y/OM795S6VikbjW45h2QzwQrM8I1kbq2fudEOshwTy9mPR4sUOCSB/37AJy0KlXRQgip4xKVTcjnndSrrjQddAUBTBHvWe/d4KIJCEVmnCsVNtFse6mWGpGOJ3kO4miMSYjPKBtQwUOqeqm0/QTeGSUAPYjaZ7QcKp+30hxqNQ49M1kllX99jLxL6+d6H61mzIRJ5oKMjvUTzjUEcyqgAGTlGg+NgQTyUxWSIZYYqJNYXlTwtdP4f+kUXI8x70qF2pn8zZy4AAcgiJwQQXUwCWogwYg4A48gCfwbN1bj9aL9TobXbDmO3vgB6y3TwBOlSA=
AAAB/HicdVDLSgMxFM34rPU1PnZugkWomyFTqu2Ai4IgLitYW2hLyWTSNjSTGZKMUIfir7hxoeLWD3Hn35hpK6jogcDhnHu5J8ePOVMaoQ9rYXFpeWU1t5Zf39jc2rZ3dm9UlEhCGyTikWz5WFHOBG1opjltxZLi0Oe06Y/OM795S6VikbjW45h2QzwQrM8I1kbq2fudEOshwTy9mPR4sUOCSB/37AJy0KlXRQgip4xKVTcjnndSrrjQddAUBTBHvWe/d4KIJCEVmnCsVNtFse6mWGpGOJ3kO4miMSYjPKBtQwUOqeqm0/QTeGSUAPYjaZ7QcKp+30hxqNQ49M1kllX99jLxL6+d6H61mzIRJ5oKMjvUTzjUEcyqgAGTlGg+NgQTyUxWSIZYYqJNYXlTwtdP4f+kUXI8x70qF2pn8zZy4AAcgiJwQQXUwCWogwYg4A48gCfwbN1bj9aL9TobXbDmO3vgB6y3TwBOlSA=
AAAB/HicdVDLSgMxFM34rPU1PnZugkWomyFTqu2Ai4IgLitYW2hLyWTSNjSTGZKMUIfir7hxoeLWD3Hn35hpK6jogcDhnHu5J8ePOVMaoQ9rYXFpeWU1t5Zf39jc2rZ3dm9UlEhCGyTikWz5WFHOBG1opjltxZLi0Oe06Y/OM795S6VikbjW45h2QzwQrM8I1kbq2fudEOshwTy9mPR4sUOCSB/37AJy0KlXRQgip4xKVTcjnndSrrjQddAUBTBHvWe/d4KIJCEVmnCsVNtFse6mWGpGOJ3kO4miMSYjPKBtQwUOqeqm0/QTeGSUAPYjaZ7QcKp+30hxqNQ49M1kllX99jLxL6+d6H61mzIRJ5oKMjvUTzjUEcyqgAGTlGg+NgQTyUxWSIZYYqJNYXlTwtdP4f+kUXI8x70qF2pn8zZy4AAcgiJwQQXUwCWogwYg4A48gCfwbN1bj9aL9TobXbDmO3vgB6y3TwBOlSA=
F
l
(·)
AAAB/HicdVDLSgMxFM34rPU1PnZugkWomyFTqu2Ai4IgLitYW2hLyWTSNjSTGZKMUIfir7hxoeLWD3Hn35hpK6jogcDhnHu5J8ePOVMaoQ9rYXFpeWU1t5Zf39jc2rZ3dm9UlEhCGyTikWz5WFHOBG1opjltxZLi0Oe06Y/OM795S6VikbjW45h2QzwQrM8I1kbq2fudEOshwTy9mPR4sUOCSB/37AJy0KlXRQgip4xKVTcjnndSrrjQddAUBTBHvWe/d4KIJCEVmnCsVNtFse6mWGpGOJ3kO4miMSYjPKBtQwUOqeqm0/QTeGSUAPYjaZ7QcKp+30hxqNQ49M1kllX99jLxL6+d6H61mzIRJ5oKMjvUTzjUEcyqgAGTlGg+NgQTyUxWSIZYYqJNYXlTwtdP4f+kUXI8x70qF2pn8zZy4AAcgiJwQQXUwCWogwYg4A48gCfwbN1bj9aL9TobXbDmO3vgB6y3TwBOlSA=
AAAB/HicdVDLSgMxFM34rPU1PnZugkWomyFTqu2Ai4IgLitYW2hLyWTSNjSTGZKMUIfir7hxoeLWD3Hn35hpK6jogcDhnHu5J8ePOVMaoQ9rYXFpeWU1t5Zf39jc2rZ3dm9UlEhCGyTikWz5WFHOBG1opjltxZLi0Oe06Y/OM795S6VikbjW45h2QzwQrM8I1kbq2fudEOshwTy9mPR4sUOCSB/37AJy0KlXRQgip4xKVTcjnndSrrjQddAUBTBHvWe/d4KIJCEVmnCsVNtFse6mWGpGOJ3kO4miMSYjPKBtQwUOqeqm0/QTeGSUAPYjaZ7QcKp+30hxqNQ49M1kllX99jLxL6+d6H61mzIRJ5oKMjvUTzjUEcyqgAGTlGg+NgQTyUxWSIZYYqJNYXlTwtdP4f+kUXI8x70qF2pn8zZy4AAcgiJwQQXUwCWogwYg4A48gCfwbN1bj9aL9TobXbDmO3vgB6y3TwBOlSA=
AAAB/HicdVDLSgMxFM34rPU1PnZugkWomyFTqu2Ai4IgLitYW2hLyWTSNjSTGZKMUIfir7hxoeLWD3Hn35hpK6jogcDhnHu5J8ePOVMaoQ9rYXFpeWU1t5Zf39jc2rZ3dm9UlEhCGyTikWz5WFHOBG1opjltxZLi0Oe06Y/OM795S6VikbjW45h2QzwQrM8I1kbq2fudEOshwTy9mPR4sUOCSB/37AJy0KlXRQgip4xKVTcjnndSrrjQddAUBTBHvWe/d4KIJCEVmnCsVNtFse6mWGpGOJ3kO4miMSYjPKBtQwUOqeqm0/QTeGSUAPYjaZ7QcKp+30hxqNQ49M1kllX99jLxL6+d6H61mzIRJ5oKMjvUTzjUEcyqgAGTlGg+NgQTyUxWSIZYYqJNYXlTwtdP4f+kUXI8x70qF2pn8zZy4AAcgiJwQQXUwCWogwYg4A48gCfwbN1bj9aL9TobXbDmO3vgB6y3TwBOlSA=
AAAB/HicdVDLSgMxFM34rPU1PnZugkWomyFTqu2Ai4IgLitYW2hLyWTSNjSTGZKMUIfir7hxoeLWD3Hn35hpK6jogcDhnHu5J8ePOVMaoQ9rYXFpeWU1t5Zf39jc2rZ3dm9UlEhCGyTikWz5WFHOBG1opjltxZLi0Oe06Y/OM795S6VikbjW45h2QzwQrM8I1kbq2fudEOshwTy9mPR4sUOCSB/37AJy0KlXRQgip4xKVTcjnndSrrjQddAUBTBHvWe/d4KIJCEVmnCsVNtFse6mWGpGOJ3kO4miMSYjPKBtQwUOqeqm0/QTeGSUAPYjaZ7QcKp+30hxqNQ49M1kllX99jLxL6+d6H61mzIRJ5oKMjvUTzjUEcyqgAGTlGg+NgQTyUxWSIZYYqJNYXlTwtdP4f+kUXI8x70qF2pn8zZy4AAcgiJwQQXUwCWogwYg4A48gCfwbN1bj9aL9TobXbDmO3vgB6y3TwBOlSA=
F
l
(·)
AAAB/HicdVDLSgMxFM34rPU1PnZugkWomyFTqu2Ai4IgLitYW2hLyWTSNjSTGZKMUIfir7hxoeLWD3Hn35hpK6jogcDhnHu5J8ePOVMaoQ9rYXFpeWU1t5Zf39jc2rZ3dm9UlEhCGyTikWz5WFHOBG1opjltxZLi0Oe06Y/OM795S6VikbjW45h2QzwQrM8I1kbq2fudEOshwTy9mPR4sUOCSB/37AJy0KlXRQgip4xKVTcjnndSrrjQddAUBTBHvWe/d4KIJCEVmnCsVNtFse6mWGpGOJ3kO4miMSYjPKBtQwUOqeqm0/QTeGSUAPYjaZ7QcKp+30hxqNQ49M1kllX99jLxL6+d6H61mzIRJ5oKMjvUTzjUEcyqgAGTlGg+NgQTyUxWSIZYYqJNYXlTwtdP4f+kUXI8x70qF2pn8zZy4AAcgiJwQQXUwCWogwYg4A48gCfwbN1bj9aL9TobXbDmO3vgB6y3TwBOlSA=
AAAB/HicdVDLSgMxFM34rPU1PnZugkWomyFTqu2Ai4IgLitYW2hLyWTSNjSTGZKMUIfir7hxoeLWD3Hn35hpK6jogcDhnHu5J8ePOVMaoQ9rYXFpeWU1t5Zf39jc2rZ3dm9UlEhCGyTikWz5WFHOBG1opjltxZLi0Oe06Y/OM795S6VikbjW45h2QzwQrM8I1kbq2fudEOshwTy9mPR4sUOCSB/37AJy0KlXRQgip4xKVTcjnndSrrjQddAUBTBHvWe/d4KIJCEVmnCsVNtFse6mWGpGOJ3kO4miMSYjPKBtQwUOqeqm0/QTeGSUAPYjaZ7QcKp+30hxqNQ49M1kllX99jLxL6+d6H61mzIRJ5oKMjvUTzjUEcyqgAGTlGg+NgQTyUxWSIZYYqJNYXlTwtdP4f+kUXI8x70qF2pn8zZy4AAcgiJwQQXUwCWogwYg4A48gCfwbN1bj9aL9TobXbDmO3vgB6y3TwBOlSA=
AAAB/HicdVDLSgMxFM34rPU1PnZugkWomyFTqu2Ai4IgLitYW2hLyWTSNjSTGZKMUIfir7hxoeLWD3Hn35hpK6jogcDhnHu5J8ePOVMaoQ9rYXFpeWU1t5Zf39jc2rZ3dm9UlEhCGyTikWz5WFHOBG1opjltxZLi0Oe06Y/OM795S6VikbjW45h2QzwQrM8I1kbq2fudEOshwTy9mPR4sUOCSB/37AJy0KlXRQgip4xKVTcjnndSrrjQddAUBTBHvWe/d4KIJCEVmnCsVNtFse6mWGpGOJ3kO4miMSYjPKBtQwUOqeqm0/QTeGSUAPYjaZ7QcKp+30hxqNQ49M1kllX99jLxL6+d6H61mzIRJ5oKMjvUTzjUEcyqgAGTlGg+NgQTyUxWSIZYYqJNYXlTwtdP4f+kUXI8x70qF2pn8zZy4AAcgiJwQQXUwCWogwYg4A48gCfwbN1bj9aL9TobXbDmO3vgB6y3TwBOlSA=
AAAB/HicdVDLSgMxFM34rPU1PnZugkWomyFTqu2Ai4IgLitYW2hLyWTSNjSTGZKMUIfir7hxoeLWD3Hn35hpK6jogcDhnHu5J8ePOVMaoQ9rYXFpeWU1t5Zf39jc2rZ3dm9UlEhCGyTikWz5WFHOBG1opjltxZLi0Oe06Y/OM795S6VikbjW45h2QzwQrM8I1kbq2fudEOshwTy9mPR4sUOCSB/37AJy0KlXRQgip4xKVTcjnndSrrjQddAUBTBHvWe/d4KIJCEVmnCsVNtFse6mWGpGOJ3kO4miMSYjPKBtQwUOqeqm0/QTeGSUAPYjaZ7QcKp+30hxqNQ49M1kllX99jLxL6+d6H61mzIRJ5oKMjvUTzjUEcyqgAGTlGg+NgQTyUxWSIZYYqJNYXlTwtdP4f+kUXI8x70qF2pn8zZy4AAcgiJwQQXUwCWogwYg4A48gCfwbN1bj9aL9TobXbDmO3vgB6y3TwBOlSA=
Local Stage § 3.3.2
Figure 3.4: The iterative upsampling and refinement process in the local geometry stage.
Based on the coarse mesh $M_0$ obtained from the global stage, the local stage progressively produces meshes at higher resolution and with finer details, $\{M_k\}_{k=1}^{L}$. At each level $k$, this process is done in two steps, as shown in Fig. 3.4: (1) a fixed and differentiable upsampling operator to provide a reliable initialization for upsampled meshes, and (2) a local refinement network to further improve the surface details based on the input images.
Upsampling Operator: Ranjan et al. [161] propose a mesh upsampling technique based on the barycentric embedding of vertices in the lower-resolution version of the mesh. Directly using this upsampling scheme results in unsmooth artifacts, as the barycentric embedding constrains the upsampled vertices to lie on the surface of the lower-resolution mesh. Instead, we use additional normal displacement weights, as shown in step 1 of Fig. 3.4. Given a sparser mesh $M_k = (V_k, T_k)$ and its per-vertex normal vectors $N_k$, we upsample the mesh by
$$\tilde{V}_{k+1} = Q_k V_k + D_k N_k , \qquad (3.2)$$
where $Q_k \in \mathbb{R}^{N_{k+1} \times N_k}$ is the barycentric weight matrix as in [161] and $D_k \in \mathbb{R}^{N_{k+1} \times N_k}$ is the additional coefficient matrix that applies displacement vectors along the normal directions. The normal displacements encode additional surface details that allow vertices to lie outside of the input surface.
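To make Eq. 3.2 concrete, here is a minimal NumPy sketch of the upsampling operator, assuming the barycentric weight matrix and the normal-displacement coefficient matrix have already been precomputed offline (the function and variable names are illustrative, not taken from the released implementation):

```python
import numpy as np

def upsample_mesh(V_k, N_k, Q_k, D_k):
    """Upsampling operator of Eq. 3.2: V~_{k+1} = Q_k V_k + D_k N_k.

    V_k : (N_k, 3) vertex positions of the sparser mesh.
    N_k : (N_k, 3) per-vertex unit normals of the sparser mesh.
    Q_k : (N_{k+1}, N_k) fixed barycentric weight matrix.
    D_k : (N_{k+1}, N_k) fixed normal-displacement coefficient matrix.
    Returns the (N_{k+1}, 3) initialization of the denser mesh.
    """
    # Barycentric embedding on the coarse surface plus an offset along the
    # interpolated normal, so upsampled vertices may leave the coarse surface.
    return Q_k @ V_k + D_k @ N_k
```

Because $Q_k$ and $D_k$ are fixed, this step is trivially differentiable with respect to $V_k$, which is what allows end-to-end training through the hierarchy.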
For a hierarchy with $L$ levels, we first downsample the full-resolution template mesh $T = (V, T) := T_L$ by isotropic remeshing and non-rigid registration into a series of meshes with decreasing resolution, while still preserving the geometry and topology of the original mesh: $\{T_{L-1}, T_{L-2}, \dots, T_0\}$. Next, we embed the vertices at the higher resolution in the surface of the lower-resolution meshes via the barycentric coordinates $Q_k$, as in [161]. We then project the remaining residual vectors onto the normal direction and obtain $D_k$.
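The following sketch illustrates one way these coefficients could be derived offline, by projecting the residual between each fine template vertex and its barycentric embedding onto the interpolated coarse normal. The final factorization of $D_k$ is an assumption for illustration; the thesis does not spell out this exact construction.

```python
import numpy as np

def build_displacement_matrix(V_fine, V_coarse, N_coarse, Q_k):
    """Estimate D_k so that Q_k @ V_coarse + D_k @ N_coarse approximates V_fine.

    V_fine   : (N_{k+1}, 3) template vertices at the finer level.
    V_coarse : (N_k, 3) template vertices at the coarser level.
    N_coarse : (N_k, 3) per-vertex unit normals at the coarser level.
    Q_k      : (N_{k+1}, N_k) barycentric weight matrix.
    """
    embedded = Q_k @ V_coarse                      # point on the coarse surface
    normal = Q_k @ N_coarse                        # interpolated normal direction
    normal /= np.linalg.norm(normal, axis=1, keepdims=True) + 1e-12
    residual = V_fine - embedded                   # part not explained by Q_k
    d = np.sum(residual * normal, axis=1)          # signed normal displacement
    # One possible factorization: reuse the barycentric weights so that
    # D_k @ N_coarse approximately reproduces d along the interpolated normal.
    D_k = d[:, None] * Q_k
    return D_k
```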
Local Refinement Network: Around each vertex (indexed with $j$) of the upsampled mesh $\tilde{V}^{(j)}_{k+1}$, we define a grid $G^{(j)}_l$ in the local neighborhood, smaller than the grid $G_g$ used in the global stage. We sample local volumetric features $L^{(j)}_l$ by Eq. 3.1. For each local feature volume, we apply the local refinement network, a 3D convolutional network with skip connections, to predict a per-vertex probability volume $C^{(j)}_l$ from $L^{(j)}_l$. Then we compute the corrective vector by the expectation operator, $\Delta V^{(j)}_{k+1} = E(C^{(j)}_l)$. This process is applied to all vertices independently, and can therefore be parallelized in batches. Finally, the upsampled and refined mesh vertices are
$$V_{k+1} = \tilde{V}_{k+1} + \Delta V_{k+1} . \qquad (3.3)$$
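The expectation operator amounts to a soft-argmax over the local probability volume. A hedged PyTorch sketch, assuming the per-vertex probabilities are already normalized and the local grid cell offsets (relative to the upsampled vertex) are known:

```python
import torch

def corrective_vectors(prob_volume, grid_offsets):
    """Expectation operator used in the local refinement step.

    prob_volume  : (B, G, G, G) per-vertex probability volumes (each sums to 1).
    grid_offsets : (G, G, G, 3) offsets of the local grid cells, in world units,
                   relative to the corresponding upsampled vertex.
    Returns (B, 3) corrective vectors (Delta V) for a batch of B vertices.
    """
    B = prob_volume.shape[0]
    p = prob_volume.reshape(B, -1, 1)          # (B, G^3, 1)
    offsets = grid_offsets.reshape(1, -1, 3)   # (1, G^3, 3)
    return (p * offsets).sum(dim=1)            # expectation over the grid

# Eq. 3.3 then simply adds the corrective vectors to the upsampled vertices:
# V_refined = V_upsampled + corrective_vectors(prob_volume, grid_offsets)
```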
Given $M_0$, we iteratively apply the local stage at all levels until we reach the highest resolution and obtain $M_L$.

The volumetric feature sampling and the upsampling operator, along with the networks, are fully differentiable, making the progressive geometry network trainable end-to-end from input images to dense registered meshes.
3.3.3 Appearance and Detail Capture
Skin detail and appearance maps are commonly used in photo-realistic rendering, yet they are often difficult to estimate without special capture hardware, such as the Light Stage capture system [51]. We propose a simple yet effective architecture to estimate high-resolution detail and appearance maps, potentially without the dependency on special appearance capture systems.
Albedo Maps Generation: The base meshes are reconstructed for a smaller head region. We augment the base meshes by additionally fitting the back of the head using Laplacian deformation [186]. We then perform standard texturing given the completed mesh and multi-view images and obtain the albedo reflectance map in the UV domain. Furthermore, by applying the same texturing process to sample vertex locations instead of RGB colors, we obtain another map in the UV domain, which we call the geometry map.
Detail Maps Synthesis: To further augment the representation, we adopt an image-to-image translation strategy to infer finer-level details. Using a network similar to [214], our synthesis network infers specular reflectance and displacement maps given both the albedo and the geometry map. We then upscale all texture maps to 4K resolution using the super-resolution strategy of [215]. We obtain the detailed high-resolution mesh by applying the displacement maps to the base mesh, as shown in Fig. 3.2. The reconstructed skin detail and appearance maps are directly usable in standard graphics pipelines for photo-realistic rendering.
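As a rough illustration of how a displacement map yields the detailed mesh, the sketch below offsets each base-mesh vertex along its normal by the displacement value looked up at its UV coordinate. The nearest-neighbor lookup and the scale convention are simplifications; the actual pipeline applies the maps with standard graphics shaders, typically at render time.

```python
import numpy as np

def apply_displacement_map(V_base, N_base, uv, disp_map, scale=1.0):
    """Offset base-mesh vertices along their normals by a scalar displacement map.

    V_base   : (N, 3) base mesh vertices.
    N_base   : (N, 3) per-vertex unit normals.
    uv       : (N, 2) per-vertex UV coordinates in [0, 1].
    disp_map : (H, W) scalar displacement map.
    scale    : assumed conversion from map units to world units.
    """
    H, W = disp_map.shape
    # Nearest-neighbor UV lookup for brevity; bilinear sampling would be smoother.
    px = np.clip(np.round(uv[:, 0] * (W - 1)).astype(int), 0, W - 1)
    py = np.clip(np.round((1.0 - uv[:, 1]) * (H - 1)).astype(int), 0, H - 1)
    d = disp_map[py, px] * scale
    return V_base + d[:, None] * N_base
```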
3.4 Experiments
3.4.1 Datasets
We evaluate our method on datasets captured with the Light Stage system [76, 126], with 3D scans from MVS, ground truth base meshes from a traditional mesh registration pipeline [115], and ground truth skin attributes from the traditional light stage pipeline [51]. In particular, we correct the ground truth base meshes (registrations) with optical flow and manual work by a professional artist, to ensure high quality and high accuracy of registration. The dataset contains 64 subjects (45 for training and 19 for testing), covering a wide diversity in gender, age, and ethnicity. Each capture set contains a neutral face and 26 expressions, including some extreme face deformations (e.g., mouth widely open), asymmetrical motions (jaw to left/right), and subtle expressions (e.g., concave cheek or eye motions).
3.4.2 Implementation Details
For the progressive mesh generation network, our feature extraction network adopts a pre-trained UNet [167] with ResNet34 [85] as its backbone, which predicts feature maps at half the resolution of the input image with 8 channels. The volumetric features of the global stage are sampled from a $32^3$ grid with a grid size of 10 millimeters; the local stage uses an $8^3$ grid with a grid size of 2.5 millimeters. We randomly rotate the grids for the volumetric feature sampling as data augmentation during training. The mesh hierarchy with $L = 3$ contains meshes with 341, 1194, 3412, and 10495 vertices. Both the global geometry network and the local refinement network use an architecture similar to the V2V network in [93]. Both stages are trained separately. The global stage trains for 400K iterations with an $\ell_2$ loss $\| V_0 - V_0^* \|_2^2$; the local stage trains for 150K iterations with an $\ell_2$ loss combined across mesh hierarchy levels with equal weights, $\sum_{k=0}^{L} \| V_k - V_k^* \|_2^2$, where $V_k^*$ denotes the ground truth base mesh vertices for the predicted $V_k$ at level $k$. We train the progressive mesh generation network using the Adam optimizer with a learning rate of $1 \times 10^{-4}$ and a batch size of 2 on a single NVIDIA V100 GPU. For the detail maps synthesis, we adopt the synthesis network from [214] and the super-resolution network from ESRGAN [215]. For more details, please see Appendix B.
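A compact PyTorch sketch of the multi-level training objective described above; the per-level averaging and the variable names are illustrative, since the thesis only specifies squared $\ell_2$ vertex losses combined with equal weights across levels.

```python
import torch

def hierarchy_vertex_loss(pred_vertices, gt_vertices):
    """Squared l2 vertex losses summed over all mesh hierarchy levels.

    pred_vertices : list of (N_k, 3) predicted vertex tensors, k = 0..L.
    gt_vertices   : list of (N_k, 3) ground-truth registered vertex tensors.
    """
    loss = 0.0
    for V_k, V_k_gt in zip(pred_vertices, gt_vertices):
        # Mean over vertices keeps the levels comparable despite different N_k.
        loss = loss + ((V_k - V_k_gt) ** 2).sum(dim=-1).mean()
    return loss

# Optimization as described above (assumed model name):
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```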
3.4.3 Results
Baselines: We evaluate the performance of our base mesh prediction and compare to the following existing methods:

• Traditional MVS and Registration: we run the commercial photogrammetry software AliceVision [4], followed by non-rigid ICP surface registration.

• 3DMM Regression: we adopt a network architecture similar to [199, 200, 223] for a multi-view setting.

• DFNRMVS [11]: a method that learns an adaptive model space and on-the-fly iterative refinements on top of 3DMM regression.

We argue that the two-step approach of MVS followed by registration is susceptible to MVS errors and requires manually tweaking optimization parameters for different inputs, which makes it less robust. Our method shows robustness and generalizability for challenging cases, outperforms the existing learning-based method, and achieves state-of-the-art geometry and correspondence quality. Our method also has efficient run-time. We show various ablation studies to validate the effectiveness of our design. We provide more comparisons and results in Appendix B.
Robustness: Fig. 3.5 shows the results from various methods given challenging inputs. Note that when the nose of the subject (top case) is specularly reflective (due to oily skin) or has facial hair, the traditional MVS fails to reconstruct the true surface, producing artifacts that affect the subsequent surface registration step. With conservative optimization parameters (e.g., strong reliance on the 3DMM), the result is more robust. However, with the same parameters, it affects the
Figure 3.5: Evaluation on method robustness.
flexibility for fitting detailed shape and motion in other input cases (e.g., the bottom case). Furthermore, the extreme and asymmetrical motion is challenging to fit within the morphable model alone. Such cases require “aggressive” fitting, in which fewer regularizations are applied. We therefore point out that this dilemma of general parameters in the traditional MVS and registration pipeline hinders automation and requires much manual work for high-quality results. The learning-based method DFNRMVS [11] shows potential for robustness and generalizability. However, it cannot output
Figure 3.6: Qualitative comparison on geometric accuracy with the existing methods. The scan-to-mesh distance is visualized as a heatmap (red means > 5 mm). Note that 3DMM regression and DFNRMVS [11] need rigid ICP as post-processing. Our outputs require no post-processing, while outperforming the existing learning-based methods in geometric accuracy.
meshes with accurate shapes and expressions. On the contrary, our model shows superior performance in predicting a reliable mesh given such challenging inputs. Note that details such as closed eyelids and asymmetrical mouth motion are faithfully captured.
Geometric Accuracy: Fig. 3.6 shows the inferred meshes given images from 15 views, along with error visualizations against the reference scans. The 3DMM regression method cannot fit extreme or subtle expressions (wide mouth opening, concave cheek, and eye shut). The adaptive space and the online refinement improve DFNRMVS [11] towards a better fit, but it still lacks the accuracy to cover the geometric details. Our method is capable of predicting base meshes that closely fit the ground truth surfaces. The results recover the identities of the subjects and capture challenging expressions such as extreme mouth opening or the subtle non-linearity of small muscle movements (concave cheek), which cannot be modeled by linear 3DMMs. The overlay and error visualizations indicate that our reconstruction fits the ground truth scan closely, with fitting errors significantly below 5 millimeters. Since they cannot utilize the true projection parameters, the results of 3DMM regression and DFNRMVS [11] lack accuracy in absolute coordinates and need a Procrustes analysis (scale and rigid pose) as post-processing for further fitting to the target. In contrast, our method outperforms these methods without post-processing.

As a quantitative evaluation, we measure the distribution of scan-to-mesh distances. 78.3% of the vertices produced by our method have a scan-to-mesh distance lower than 1 mm. This result outperforms the 3DMM regression, which achieves 27.0% and 33.1% (without and with post-processing, respectively). The median scan-to-mesh distance for our results is 0.584 mm, achieving sub-millimeter performance. We show cumulative scan-to-mesh distance curves in Appendix B.
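For reference, a minimal sketch of how such scan-to-mesh statistics could be tabulated, approximating the scan-to-mesh distance by the distance to the nearest mesh vertex (a point-to-triangle distance would be more faithful; this is only meant to illustrate the reported numbers):

```python
import numpy as np
from scipy.spatial import cKDTree

def scan_to_mesh_stats(scan_points, mesh_vertices, threshold_mm=1.0):
    """Median distance and fraction of points below a threshold, in millimeters.

    scan_points   : (M, 3) reference scan points.
    mesh_vertices : (N, 3) predicted mesh vertices.
    """
    tree = cKDTree(mesh_vertices)
    dists, _ = tree.query(scan_points)   # nearest-vertex approximation
    return float(np.median(dists)), float((dists < threshold_mm).mean())
```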
Figure 3.7: Visualization of correspondence accuracy (vertex-to-vertex distance and texture map error).
Correspondence Accuracy: We provide a quantitative measure of correspondence accuracy for the generated base meshes by comparing them to the ground truth aligned meshes (generated by artists) and computing the vertex-to-vertex (v2v) distances on a test set. The 3DMM regression method achieves a median v2v distance of 3.66 mm / 2.88 mm (without and with post-processing). Our method achieves 1.97 mm, outperforming the existing method. The v2v distances are also visualized on the ground truth mesh in Fig. 3.7. We additionally evaluate our aligned meshes by their median errors to the ground truth 3D landmarks. Our method achieves 2.02 mm, while the 3DMM regression method achieves 3.92 mm / 3.21 mm (without and with post-processing). We provide more quantitative evaluations in Appendix B.
We compute the photometric errors between the texture maps of the output meshes and those of the ground truth meshes. Lower photometric errors indicate that the UV textures match the pre-designed UV parametrization (i.e., better correspondence). Our method has significantly lower errors, especially in the eyebrow region, around the jaw, and for wrinkles around the eyes and
Figure 3.8: Qualitative evaluation of correspondence compared to optical flow.
nose. Note that the 3DMM regression method without post-processing performs worse, while
our method requires no post-processing.
In Fig. 3.8, we further evaluate the correspondence quality by projecting it onto 2D images and warping the reference image (extreme expression) back to the target image (neutral expression). The ideal warping output would be as close to the target image as possible, except for shading such as in wrinkles. We compare the performance with the traditional pipeline of MVS and registration (with manual adjustment) and a traditional optical flow method. Our method recovers better 2D correspondence than optical flow, which relies on local matching that tends to fail under occlusion and large motion, as shown in Fig. 3.8 (see the lip regions). Furthermore, optical flow takes 30 seconds at an image resolution of 1366×1003, compared to within 1 second based on our base meshes. The traditional method achieves good results, but at the cost of three orders of magnitude longer processing time and possibly manual adjustment.
Figure 3.9: Effect of ToFu-inferred appearance and detail maps. Based on our reliable base meshes, our appearance and detail capture network predicts realistic face skin details and attributes, without special hardware such as the Light Stage at test time, enabling photo-realistic rendering.
Inference Speed: The traditional pipeline takes more than 10 minutes and potentially more time for manual adjustments. DFNRMVS [11] infers faces without tuning at test time but is still slower at 4.5 seconds, due to its online optimization step and heavy computation on the dense photometric term. Our global and local stages take 0.081 seconds and 0.304 seconds, respectively. As shown in Table 3.1, our method produces a high-quality registered base mesh in 0.385 seconds and achieves sub-second performance, while being fully automatic without manual tweaking.
Methods                 Time (s)   Automatic
Traditional pipeline    600+       ✗
DFNRMVS [11]            4.5        ✓
ToFu (base mesh)        0.385      ✓
Table 3.1: Comparison of runtime on base mesh, given images from 15 views and measured in seconds.
Appearance Capture: In Fig. 3.1 and Fig. 3.9, we show rendering results with the inferred displacement, albedo, and specular maps, enabling photo-realistic renderings.
Ablation Studies: In Fig. 3.10 (left), we evaluate the robustness of our network for various numbers of input views. The resulting quality degrades gracefully as the number of views decreases. Our
Figure 3.10: Ablation studies. Left: number of input camera views; Right: normal displacement weights in the mesh upsampling function.
Figure 3.11: Generalization to new capture setups. Results on the CoMA [161] dataset.
method produces reasonable results with as few as 4 views, which is extremely difficult for standard MVS due to the large baselines and little overlap. Fig. 3.10 (right) demonstrates that the normal displacement in the upsampling function contributes to capturing fine shape details. We provide more ablation studies in Appendix B.
Generalization to New Capture Setups: We finetune our system on the CoMA [161] dataset, which contains a different camera setup, significantly fewer views (4) and subjects (12), different lighting conditions, and special make-up patterns painted on the subjects’ faces. The results in Fig. 3.11 show that our system can in principle be applied to different capture setups. However, we observe some artifacts around the jaws and slightly protruding eyebrow bones. This is potentially due to the limited number of subjects and insufficient camera coverage (e.g., the third image misses the jaw region).
Discussions: While our system achieves promising progress in automating facial performance capture, it currently requires supervised learning, i.e., training with well-labeled facial registration meshes. One promising direction is to explore weakly-supervised or self-supervised techniques, such as using differentiable/neural rendering [121, 122, 139, 231]. Although trained and tested frame-by-frame, our system can produce dynamic facial capture results with decent temporal stability (see results in Appendix B). It would be interesting to further explore video sequences and investigate methods to ensure temporal coherency in surface deformation and appearance at a finer scale. Another direction is extending our approach beyond the facial skin region, adding teeth, tongue, eyes, and hair. Neural networks such as our volumetric framework show the potential to predict non-parametric facial surfaces directly from input images. Furthermore, recent developments in neural rendering, such as NeRF [134], show the possibility of jointly modeling complicated geometry and appearance with a compact representation. Exploring hybrid modeling of surface, detail, and appearance would be interesting, potentially eliminating the need for specialized shaders and merging multiple steps in digital human creation pipelines.
3.5 Conclusion
We introduced a 3D face inference approach from multi-view input images that can produce high-fidelity 3D face meshes with consistent topology using a volumetric sampling approach. We have shown that, given multi-view inputs, implicitly learning a shape variation and deformation field can produce superior results compared to methods that use an underlying 3DMM, even if they refine the resulting inference with an optimization step. We have demonstrated sub-millimeter surface reconstruction accuracy and state-of-the-art correspondence performance, while achieving up to 3 orders of magnitude of speed improvement over conventional techniques. Most importantly, our approach is fully automated and eliminates the need for data clean-up after MVS, or any parameter tweaking for conventional non-rigid registration techniques. Our experiments also show that the volumetric feature sampling can effectively aggregate features across views at various scales and can also provide salient information for predicting accurate alignment without the need for any manual post-processing.
Chapter 4
General Capture and Modeling with Dynamic Neural Radiance Fields
Figure 4.1: Neural 3D video synthesis. We propose a novel method for representing and rendering high quality 3D video. Our method trains a novel and compact dynamic neural radiance field (DyNeRF) in an efficient way, and demonstrates near-photorealistic dynamic novel view synthesis for complex scenes, including challenging scene motions and strong view-dependent effects. We demonstrate three synthesized 3D videos, and show the associated high quality geometry in the heatmap visualization in each top right corner. The embedded animations only play in Adobe Reader or KDE Okular. Please see the full video for the high-quality renderings and additional information.
4.1 Introduction
Photorealistic representation and rendering of dynamic real-world scenes are highly challeng-
ing research topics, yet with many important applications that range from movie production to
virtual and augmented reality. Dynamic real-world scenes are notoriously hard to model using
classical mesh-based representations, since they often contain thin structures, semi-transparent
objects, specular surfaces, and topology that constantly evolves over time due to the often com-
plex scene motion of multiple objects and people.
In theory, the 6D plenoptic function $P(\mathbf{x}, \mathbf{d}, t)$ is a suitable representation for this rendering problem, as it completely explains our visual reality and enables rendering every possible view at every moment in time [1]. Here, $\mathbf{x} \in \mathbb{R}^3$ is the camera position in 3D space, $\mathbf{d} = (\theta, \phi)$ is the viewing direction, and $t$ is time. Thus, fully measuring the plenoptic function requires placing an omnidirectional camera at every position in space at every possible time.
Neural radiance fields (NeRF) [134] offer a way to circumvent this problem: instead of directly encoding the plenoptic function, they encode the radiance field of the scene in an implicit, coordinate-based function, which can be sampled through ray casting to approximate the plenoptic function. However, the ray casting, which is required to train and to render a neural radiance field, involves hundreds of MLP evaluations for each ray. While this might be acceptable for a static snapshot of a scene, directly reconstructing a dynamic scene as a sequence of per-frame neural radiance fields would be prohibitive, as both storage and training time increase linearly with time. For example, to represent a 10 second, 30 FPS multi-view video recording by 18 cameras, which we later demonstrate with our method, a per-frame NeRF would require about 15,000 GPU hours in training and about 1 GB in storage. More importantly, the representations so obtained would only reproduce the world as a discrete set of snapshots, lacking any means to reproduce the world in-between. On the other hand, Neural Volumes [123] is able to handle dynamic objects and even renders at interactive frame rates. Its limitation is the dense uniform voxel grid that limits the resolution and/or size of the reconstructed scene due to the inherent $O(n^3)$ memory complexity.
In this chapter, we propose a novel approach for 3D video synthesis of complex, dynamic
real-world scenes that enables high-quality view synthesis and motion interpolation while being
compact. Videos typically consist of a time-invariant component under stable lighting and a con-
tinuously changing time-variant component. This dynamic component typically exhibits locally
correlated geometric deformations and appearance changes between frames. By exploiting this
fact, we propose to reconstruct a dynamic neural radiance field based on two novel contributions.
First, we extend neural radiance fields to the space-time domain. Instead of directly using time as input, we parameterize scene motion and appearance changes by a set of compact latent codes. Compared to the more obvious choice of an additional “time coordinate”, the learned latent codes show more expressive power, allowing for recording the vivid details of moving geometry and texture. They also allow for smooth interpolation in time, which enables visual effects such as slow motion or “bullet time”. Second, we propose novel importance sampling strategies for dynamic radiance fields. Ray-based training of neural scene representations treats each pixel as an independent training sample and requires thousands of iterations to go through all pixels observed from all views. However, captured dynamic video often exhibits a small amount of pixel change between frames. This opens up an opportunity to significantly boost the training progress by selecting the pixels that are most important for training. Specifically, in the time dimension, we schedule training with coarse-to-fine hierarchical sampling in the frames. In the ray/pixel dimension, our design tends to sample those pixels that are more time-variant than others. These strategies allow us to shorten the training time of long sequences significantly, while retaining high quality reconstruction results. We demonstrate our approach using a multi-view rig based on 18 GoPro cameras. We show results on multiple challenging dynamic environments with highly complex view-dependent and time-dependent effects. Compared to the naïve per-frame NeRF baseline, we show that with our combined temporal and spatial importance sampling we achieve one order of magnitude acceleration in training speed, with a model that is 40 times smaller in size for 10 seconds of a 30 FPS 3D video. In summary, we make the following contributions:
• We propose a novel dynamic neural radiance field based on temporal latent codes that achieves high quality 3D video synthesis of complex, dynamic real-world scenes.

• We present novel training strategies based on hierarchical training and importance sampling in the spatiotemporal domain, which boost training speed significantly and lead to higher quality results for longer sequences.

• We provide our datasets of time-synchronized and calibrated multi-view videos that cover challenging 4D scenes for research purposes at https://github.com/facebookresearch/Neural_3D_Video.
4.2 Related Work
Our work is related to several research domains, such as novel view synthesis for static scenes,
3D video synthesis for dynamic scenes, image-based rendering, and neural rendering approaches.
For a detailed discussion of neural rendering applications and neural scene representations, we
refer to the surveys [196] and [198].
Novel View Synthesis for Static Scenes: Novel view synthesis has been tackled by explicitly reconstructing textured 3D models of the scene and rendering from arbitrary viewpoints. Multi-view stereo [67, 176] and visual hull reconstructions [57, 104] have been successfully employed. Complex view-dependent effects can be captured by light transport acquisition methods [51, 220]. Learning-based methods have been proposed to relax the high number of required views and to accelerate the inference speed for geometry reconstruction [81, 97, 230] and appearance capture [20, 129], or combined reconstruction techniques [139, 231]. Novel view synthesis can also be achieved by reusing input image pixels. Early works using this approach interpolate the viewpoints [41]. The Light Field/Lumigraph methods [50, 79, 106, 145] resample input image rays to generate novel views. One drawback of these approaches is that they require dense sampling for high quality rendering of complex scenes. More recently, [65, 95, 133, 187, 239] learn to fuse and resample pixels from reference views using neural networks. Neural Radiance Fields (NeRFs) [134] train an MLP-based radiance and opacity field and achieve state-of-the-art quality for novel view synthesis. Other approaches [130, 219] employ an explicit point-based scene representation combined with a screen space neural network for hole filling. [103] push this further and encode the scene appearance in a differentiable sphere-based representation. [185] employs a dense voxel grid of features in combination with a screen space network for view synthesis. All these methods are excellent at interpolating views for static scenes, but it is unclear how to extend them to the dynamic setting.
3D Video Synthesis for Dynamic Scenes: Techniques in this category enable view synthesis for dynamic scenes and might also enable interpolation across time. For video synthesis, [96] pioneers in showing the possibility of explicitly capturing geometry and textures. [241] proposes a temporal layered representation that can be compressed and replayed at an interactive rate. Reconstruction and animation is particularly well studied for humans [39, 82, 188], but is usually performed model-based and/or only works with high-end capture setups. [110] captures temporally consistent surfaces by tracking and completion. [45] proposes a system for capturing and compressing streamable 3D video with high-end hardware. More recently, learning-based methods such as [88] achieve volumetric video capture of human performances from sparse camera views. [12] focus on more general scenes. They decompose them into a static and a dynamic component, re-project information based on estimated coarse depth, and employ a U-Net in screen space to convert the intermediate result to realistic imagery. [19] uses a neural network for space-time and illumination interpolation. [233] uses a model-based step for merging the estimated depth maps into a unified representation that can be rendered from novel views. Neural Scene Flow Fields [119] incorporates a static background model. Space-time Neural Irradiance Fields [224] employs video depth estimation to supervise a space-time radiance field. [70] recently proposes a time-conditioned radiance field, supervised by its own predicted flow vectors. These works have a limited view angle due to their single-view setting and require additional supervision, such as depth or flow. [53, 147, 158, 207] explicitly model dynamic scenes by a warp field or velocity field that deforms a canonical radiance field. STaR [234] models scenes of rigidly moving objects using several canonical radiance fields that are rigidly transformed. These methods cannot model challenging dynamic events such as topology changes. Several radiance field approaches have been proposed for modeling digital humans [69, 120, 141, 154, 159], but they cannot directly be applied to general non-rigid scenes. Furthermore, there have been efforts in improving neural radiance fields for in-the-wild scenes [128] and generalization across scenes. HyperNeRF [148] is a concurrent work on dynamic novel view synthesis, but they focus on monocular video of a short sequence. Neural Volumes [123] employs volume rendering in combination with a view-conditioned decoder network to parameterize dynamic sequences of single objects. Their results are limited in resolution and scene complexity due to the inherent $O(n^3)$ memory complexity. [34] enable 6DoF video for VR applications based on independent alpha-textured meshes that can be streamed at the rate of hundreds of Mb/s. This approach employs a capture setup with 46 cameras and requires a large training dataset to construct a strong scene-prior. In contrast, we seek a unified space-time representation that enables continuous viewpoint and time interpolation, while being able to represent an entire multi-view video sequence of 10 seconds in as little as 28 MB.
4.3 DyNeRF: Dynamic Neural Radiance Fields
We address the problem of reconstructing dynamic 3D scenes from time-synchronized multi-view videos with known intrinsic and extrinsic parameters. The representation we aim to reconstruct from such multi-camera recordings should allow us to render photorealistic images from a wide range of viewpoints at arbitrary points in time.

Building on NeRF [134], we propose dynamic neural radiance fields (DyNeRF) that are directly optimized from input videos captured with multiple video cameras. DyNeRF is a novel continuous space-time neural radiance field representation, controllable by a series of temporal latent embeddings that are jointly optimized during training. Our representation compresses a
Figure 4.2: Dynamic Neural Radiance Fields (DyNeRF). We learn the 6D plenoptic function
by our novel dynamic neural radiance field that conditions on position, view direction and a
compact, yet expressive time-variant latent code.
huge volume of input videos from multiple cameras to a compact 6D representation that can be
queried continuously in both space and time. The learned embedding faithfully captures detailed
temporal variations of the scene, such as complex photometric and topological changes, without
explicit geometric tracking.
4.3.1 Representation
The problem of representing 3D video comprises learning the 6D plenoptic function that maps
a 3D position x ∈ R^3, a view direction d ∈ R^2, and a time t ∈ R to an RGB radiance c ∈ R^3 and an
opacity σ ∈ R. Based on NeRF [134], which approximates the 5D plenoptic function of a static scene with
a learnable function, a potential solution would be to add a time dependency to the function:

    F_Θ : (x, d, t) → (c, σ) ,    (4.1)

which is realized by a Multi-Layer Perceptron (MLP) with trainable weights Θ. The 1-dimensional
time variable t can be mapped via positional encoding [194] to a higher dimensional space, in a
manner similar to how NeRF handles the inputs x and d. However, we empirically found it
challenging for this design to capture complex dynamic 3D scenes with difficult topological
changes and time-dependent volumetric effects, such as flames.
Dynamic Neural Radiance Fields: We model the dynamic scene by time-variant latent codes
z_t ∈ R^D, as shown in Fig. 4.2. We learn a set of time-dependent latent codes, indexed by a discrete
time variable t:

    F_Θ : (x, d, z_t) → (c, σ) .    (4.2)

The latent codes provide a compact representation of the state of a dynamic scene at a certain time,
which can handle various complex scene dynamics, including deformation, topological and radiance
changes. We apply positional encoding [194] to the input position coordinates to map them
to a higher-dimensional vector. However, no positional encoding is applied to the time-dependent
latent codes. Before training, the latent codes {z_t} are randomly initialized independently across
all frames.
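As a concrete illustration of Eq. 4.2, the following PyTorch sketch shows a latent-conditioned radiance field MLP: positional encoding is applied to the position and view direction but not to the latent code. The layer widths, band counts, and class/function names here are illustrative assumptions for exposition, not the exact architecture used in our experiments (which follows NeRF with 512-wide layers, see Section 4.4.2).

import torch
import torch.nn as nn

def positional_encoding(x, num_bands):
    # Map each coordinate to [x, sin(2^k pi x), cos(2^k pi x)] for k = 0..num_bands-1.
    feats = [x]
    for k in range(num_bands):
        feats.append(torch.sin((2.0 ** k) * torch.pi * x))
        feats.append(torch.cos((2.0 ** k) * torch.pi * x))
    return torch.cat(feats, dim=-1)

class DyNeRFMLP(nn.Module):
    def __init__(self, latent_dim=1024, pos_bands=10, dir_bands=4, width=512):
        super().__init__()
        self.pos_bands, self.dir_bands = pos_bands, dir_bands
        pos_dim = 3 * (1 + 2 * pos_bands)   # encoded 3D position
        dir_dim = 3 * (1 + 2 * dir_bands)   # encoded view direction
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim + latent_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU())
        self.sigma_head = nn.Linear(width, 1)          # opacity is view-independent
        self.color_head = nn.Sequential(               # color depends on view direction
            nn.Linear(width + dir_dim, width // 2), nn.ReLU(),
            nn.Linear(width // 2, 3), nn.Sigmoid())

    def forward(self, x, d, z_t):
        # x: (N, 3) positions, d: (N, 3) unit view directions, z_t: (N, latent_dim) latent codes.
        # Note that the latent code enters the network without positional encoding.
        h = self.trunk(torch.cat([positional_encoding(x, self.pos_bands), z_t], dim=-1))
        sigma = self.sigma_head(h)
        rgb = self.color_head(torch.cat([h, positional_encoding(d, self.dir_bands)], dim=-1))
        return rgb, sigma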
Rendering: We use volume rendering techniques to render the radiance field given a query
view in space and time. Given a ray r(s) = o + s d, with origin o and direction d defined by
the specified camera pose and intrinsics, the rendered color of the pixel corresponding to this ray,
C(r), is an integral over the radiance weighted by the accumulated opacity [134]:

    C^(t)(r) = ∫_{s_n}^{s_f} T(s) σ(r(s), z_t) c(r(s), d, z_t) ds ,    (4.3)

where s_n and s_f denote the bounds of the volume depth range, and the accumulated opacity is
T(s) = exp(−∫_{s_n}^{s} σ(r(p), z_t) dp). We apply a hierarchical sampling strategy as in [134], with stratified
sampling on the coarse level followed by importance sampling on the fine level.
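In practice, the integral in Eq. 4.3 is approximated with the standard quadrature from NeRF [134]: per-sample opacities are converted to alpha values and composited front to back. The sketch below illustrates this for a single ray; variable names are illustrative assumptions.

import torch

def render_ray_color(rgb, sigma, s_vals):
    # rgb: (S, 3) colors and sigma: (S, 1) opacities predicted at S samples along one ray,
    # s_vals: (S,) sample depths between s_n and s_f.
    deltas = s_vals[1:] - s_vals[:-1]                        # spacing between samples
    deltas = torch.cat([deltas, deltas.new_tensor([1e10])])  # pad the last interval
    alpha = 1.0 - torch.exp(-sigma.squeeze(-1) * deltas)     # per-sample alpha
    # Discrete transmittance T_i = prod_{j < i} (1 - alpha_j), the analog of T(s).
    trans = torch.cumprod(torch.cat([alpha.new_ones(1), 1.0 - alpha + 1e-10]), dim=0)[:-1]
    weights = trans * alpha
    return (weights.unsqueeze(-1) * rgb).sum(dim=0)          # composited pixel color C^(t)(r)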
Loss Function: The network parameters Θ and the latent codes {z_t} are simultaneously optimized
by minimizing the ℓ2-loss between the rendered colors Ĉ(r) and the ground truth colors
C(r), summed over all rays r that correspond to the image pixels from all training camera
views R and over all time frames t ∈ T of the recording:

    L = Σ_{t∈T, r∈R} Σ_{j∈{c,f}} ‖ Ĉ_j^(t)(r) − C^(t)(r) ‖₂² .    (4.4)

We evaluate the loss at both the coarse and the fine level, denoted by Ĉ_c^(t) and Ĉ_f^(t) respectively,
similar to NeRF. We train with a stochastic version of this loss function, by randomly sampling
ray data and optimizing the loss of each ray batch. Please note that our dynamic radiance field is
trained with this plain ℓ2-loss without any special regularization.
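For a randomly sampled batch of rays, the stochastic counterpart of Eq. 4.4 reduces to a plain squared error on the coarse and fine renderings, as in the minimal sketch below (tensor names are assumptions).

import torch

def dynerf_loss(pred_coarse, pred_fine, target_rgb):
    # pred_coarse, pred_fine: (B, 3) rendered colors at the two levels; target_rgb: (B, 3).
    loss_c = ((pred_coarse - target_rgb) ** 2).sum(dim=-1).mean()
    loss_f = ((pred_fine - target_rgb) ** 2).sum(dim=-1).mean()
    return loss_c + loss_f  # no additional regularization terms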
4.3.2 Efficient Training
An additional challenge of ray casting–based neural rendering on video data is the large amount
of training time required. The number of training iterations per epoch scales linearly with the
total number of pixels in the input multi-view videos. For a 10-second, 30 FPS, 1 MP multi-view
video sequence from 18 cameras, there are about 7.4 billion ray samples in one epoch, which
would take about half a week to process using 8 NVIDIA Volta class GPUs. Given that each ray
Figure 4.3: Overview of our efficient training strategies. We perform hierarchical training
first using keyframes (b) and then on the full sequence (c). At both stages, we apply the ray
importance sampling technique to focus on the rays with high time-variant information based
on weight maps that measure the temporal appearance changes (a). We show a visualized example
of the sampling probability based on the global median map using a heatmap (red and opaque means
high probability).
needs to be re-visited several times to obtain high quality results, this sampling process is one of
the biggest bottlenecks for ray-based neural reconstruction methods to train 3D videos at scale.
However, for a natural video a large proportion of the dynamic scene is either time-invariant
or only contains a small time-variant radiance change at a particular timestamp across the entire
observed video. Hence, uniformly sampling rays causes an imbalance between time-invariant ob-
servations and time-variant ones. This means it is highly inefficient and impacts reconstruction
quality: time-invariant regions reach high reconstruction quality sooner and are uselessly over-
sampled, while time-variant regions require additional sampling, increasing the training time.
To explore temporal redundancy in the context of 3D video, we propose two strategies to
accelerate the training process (see Fig. 4.3): (1) hierarchical training that optimizes data over a
coarse-to-fine frame selection and (2) importance sampling that prefers rays around regions of
higher temporal variance. In particular, these strategies form a different loss function by paying
more attention to the “important” rays in a time frame set S and a pixel set I for training:

    L_efficient = Σ_{t∈S, r∈I} Σ_{j∈{c,f}} ‖ Ĉ_j^(t)(r) − C^(t)(r) ‖₂² .    (4.5)

These two strategies combined can be regarded as an adaptive sampling approach, contributing
to significantly faster training and improved rendering quality.
Hierarchical Training: Instead of training DyNeRF on all video frames, we first train it on
keyframes, sampled equidistantly at fixed time intervals K, i.e., S = {t | t = nK, n ∈ Z+, t ∈ T}.
Once the model converges with keyframe supervision, we use it to initialize
the final model, which has the same temporal resolution as the full video. Since the per-frame
motion of the scene within each segment (divided by neighboring keyframes) is smooth, we
initialize the fine-level latent embeddings by linearly interpolating between the coarse embeddings.
Finally, we train using data from all the frames jointly, S = T, further optimizing the network
weights and the latent embeddings. The coarse keyframe model has already captured an approximation
of the time-invariant information across the video. Therefore, the fine full-frame training
only needs to learn the time-variant information per frame.
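The sketch below illustrates this schedule under the stated assumptions: keyframes are taken every K frames, and the full-sequence latent codes are initialized by linearly interpolating the two surrounding keyframe codes before joint fine-tuning. Function names are hypothetical.

import torch

def keyframe_indices(num_frames, K=30):
    # S = {t | t = nK} restricted to the available frames.
    return list(range(0, num_frames, K))

def init_full_latents(keyframe_latents, num_frames, K=30):
    # keyframe_latents: (num_keyframes, D) coarse codes learned in the keyframe stage.
    keyframe_latents = keyframe_latents.detach()
    num_keyframes, D = keyframe_latents.shape
    full = torch.empty(num_frames, D)
    for t in range(num_frames):
        i, offset = divmod(t, K)                     # surrounding keyframes and in-segment offset
        z0 = keyframe_latents[i]
        z1 = keyframe_latents[min(i + 1, num_keyframes - 1)]
        w = offset / K
        full[t] = (1.0 - w) * z0 + w * z1            # linear interpolation between coarse codes
    return torch.nn.Parameter(full)                  # optimized further on all frames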
Ray Importance Sampling: We propose to sample the rays I across time with different importance
based on the temporal variation in the input videos. For each observed ray r at time t,
we compute a weight ω^(t)(r). In each training iteration we pick a time frame t at random. We
first normalize the weights of the rays across all input views for frame t, and then apply inverse
transform sampling to select rays based on these weights.
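Concretely, once the per-ray weights of a frame are normalized into a categorical distribution, a ray batch can be drawn by inverse transform sampling, e.g. as in the sketch below (torch.multinomial performs the inverse-transform draw; names are illustrative).

import torch

def sample_ray_indices(weights, batch_size):
    # weights: (num_rays,) non-negative importance weights pooled over all views of frame t.
    probs = weights / weights.sum()                                 # normalize to a distribution
    return torch.multinomial(probs, batch_size, replacement=True)   # sampled ray indices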
To calculate the weight of each ray, we propose three implementations based on different
insights.
• Global-Median (DyNeRF-ISG): We compute the weight of each ray based on the residual
difference of its color to its global median value across time.
• Temporal-Difference (DyNeRF-IST): We compute the weight of each ray based on the
color difference in two consecutive frames.
• Combined Method (DyNeRF-IS*): We combine both strategies above.
We empirically observed that training DyNeRF-ISG with a high learning rate leads to very
quick recovery of dynamic detail, but results in some jitter across time. On the other hand, training
DyNeRF-IST with a low learning rate produces a smooth temporal sequence which is still
somewhat blurry. Thus, we combine the benefits of both methods in our final strategy, DyNeRF-IS*
(referred to as DyNeRF in later sections), which first obtains sharp details via DyNeRF-ISG and
then smooths the temporal motion via DyNeRF-IST. We explain the details of the three strategies
in Appendix C. All importance sampling methods assume a static camera rig.
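The exact weighting functions (robust norms, clamping, and thresholds) are given in Appendix C; the simplified sketch below only illustrates the two underlying signals, the residual to the global median image (ISG) and the difference between consecutive frames (IST), and should be read as an assumption-laden approximation rather than the precise formulation.

import torch

def isg_style_weights(frames):
    # frames: (T, H, W, 3) video from one camera. Weight each pixel by the residual of
    # its color to the per-pixel median over time (the "global median" signal).
    median = frames.median(dim=0).values            # (H, W, 3) global median image
    return (frames - median).abs().mean(dim=-1)     # (T, H, W) per-frame weight maps

def ist_style_weights(frames):
    # Weight each pixel by the color difference between two consecutive frames.
    diff = (frames[1:] - frames[:-1]).abs().mean(dim=-1)  # (T-1, H, W)
    return torch.cat([diff[:1], diff], dim=0)             # pad so every frame has a map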
4.4 Experiments
We demonstrate our approach on a large variety of captured daily events with challenging scene
motions, varying illuminations and self-cast shadows, view-dependent appearances and highly
volumetric effects. We performed detailed ablation studies and comparisons to various baselines
on our multi-view data and immersive video data [34].
Supplemental Materials: We strongly recommend that the reader watch our supplemental video
at the project website https://neural-3d-video.github.io/ to better judge the photorealism of
our approach at high resolution, which cannot be represented well by the metrics. We demonstrate
interactive playback of our 3D videos on a commodity VR headset (Quest 2) in the supplemental
video. We further provide comprehensive details of our capture setup, dataset descriptions,
comparison settings, more ablation studies on parameter choices, and failure case discussions in
Appendix C.
4.4.1 Datasets
Plenoptic Video Datasets: We build a mobile multi-view capture system using 21 GoPro Black
Hero 7 cameras. We capture videos at a resolution of 2028×2704 (2.7K) and a frame rate of 30 FPS.
The multi-view inputs are time-synchronized. We obtain the camera intrinsic and extrinsic parameters
using COLMAP [177]. We employ 18 views for training and 1 view for qualitative and
quantitative evaluations for all datasets, except one sequence observing multiple people moving,
which uses 14 training views. For more details on the capture setup, please refer to Appendix C.

Our captured data demonstrates a variety of challenges for video synthesis, including (1) objects
of high specularity, translucency and transparency, (2) scene changes and motions with
changing topology (poured liquid), (3) self-cast moving shadows, (4) volumetric effects (flame),
(5) an entangled moving object with strong view-dependent effects (the torch gun and the pan),
(6) various lighting conditions (daytime, night, spotlight from the side), and (7) multiple people
moving around in an open living room with outdoor scenes seen through transparent
windows under relatively dark indoor illumination. Our collected data provides sufficient synchronized
camera views for high-quality 4D reconstruction of challenging dynamic objects and
view-dependent effects in a natural daily indoor environment, which, to our knowledge, did not
exist in public 4D datasets. We will release the datasets for research purposes.
Immersive Video Datasets: We also demonstrate the generality of our method using the
multi-view videos from [34], training directly on their fisheye video input.
4.4.2 Evaluation Settings
Baselines: We compare to the following baselines:
• Multi-View Stereo (MVS): frame-by-frame rendering of the reconstructed and textured
3D meshes using the commercial software RealityCapture∗.
• Local Light Field Fusion (LLFF) [133]: frame-by-frame rendering of the LLFF-produced
multiplane images with the pretrained model†.
• Neural Volumes (NV) [123]: a prior-art volumetric video rendering method using a
warped canonical model. We follow the same setting as the original paper.
• NeRF-T: a temporal NeRF baseline as described in Eq. 4.1.
• DyNeRF†: an ablation setting of DyNeRF without our proposed hierarchical training and
importance sampling.
∗ https://www.capturingreality.com/
† https://github.com/Fyusion/LLFF
We provide more ablation analysis of our importance sampling strategies and latent code dimension
in Appendix C.
Metrics: We evaluate the rendering quality on the test view using the following quantitative metrics:
• Peak signal-to-noise ratio (PSNR);
• Mean square error (MSE);
• Structural dissimilarity index measure (DSSIM) [174, 211], which remaps the structural
similarity index (SSIM) [216] into the range [0, 1] via DSSIM(x, y) = (1 − SSIM(x, y))/2 for pixel (x, y) [174, 211];
• Perceptual quality measure LPIPS [236];
• Perceived error difference FLIP [9];
• Just-Objectionable-Difference (JOD) [127], a video-quality metric that measures the photometric
quality of video and detects temporal artifacts such as flickering or jittering.
Higher PSNR indicates better reconstruction quality, and higher JOD represents less
visual difference compared to the reference video. For all other metrics, lower numbers indicate
better quality.
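For reference, PSNR and DSSIM follow directly from MSE and SSIM for images normalized to [0, 1]; a minimal sketch (assuming an external SSIM implementation such as [216]) is:

import math

def psnr_from_mse(mse):
    return 10.0 * math.log10(1.0 / mse)   # peak value is 1.0 for normalized images

def dssim_from_ssim(ssim):
    return (1.0 - ssim) / 2.0             # maps SSIM into the DSSIM range [0, 1]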
For any video shorter than 60 frames, we evaluate the model frame-by-frame on
the complete video. Considering the significant amount of time required for high-resolution rendering,
we evaluate the model every 10 frames to calculate the frame-by-frame metrics reported in
Tab. 4.1 for any video of length equal to or longer than 300 frames. For the video metric JOD, which requires
a stack of continuous video frames, we evaluate the model on the whole sequence, reported in
Tab. 4.2. We verified on 2 video sequences with a frame length of 300 that the PSNR differs by
at most 0.02 between evaluating every 10th frame and evaluating all frames. We evaluate all the
models at 1K resolution, and report the average of the results from every evaluated frame.
Implementation Details: We implement our approach in PyTorch [150]. We use the same
MLP architecture as in NeRF [134], except that we use 512 activations for the first 8 MLP layers
instead of 256. We employ 1024-dimensional latent codes. In the hierarchical training we first
only train on keyframes that are K = 30 frames apart. We employ the Adam optimizer [99] with
parameters β1 = 0.9 and β2 = 0.999. In the keyframe training stage, we set a learning rate of
5e−4 and train for 300K iterations. We include the details of the importance sampling scheme in
Appendix C. We set the latent code learning rate to be 10× higher than that of the other network
parameters. The per-frame latent codes are initialized from N(0, 0.01/√D), where D = 1024. The
total training takes about a week with 8 NVIDIA V100 GPUs and a total batch size of 24576 rays.
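A minimal sketch of the corresponding initialization and optimizer setup, reusing the DyNeRFMLP sketch from Section 4.3.1 and assuming a 300-frame sequence; the learning-rate schedule and importance-sampling details are omitted:

import math
import torch

D, num_frames, base_lr = 1024, 300, 5e-4
# Per-frame latent codes drawn from N(0, 0.01/sqrt(D)), as described above.
latent_codes = torch.nn.Parameter(torch.randn(num_frames, D) * (0.01 / math.sqrt(D)))
model = DyNeRFMLP(latent_dim=D)   # the MLP sketched in Section 4.3.1

optimizer = torch.optim.Adam(
    [{"params": model.parameters(), "lr": base_lr},
     {"params": [latent_codes], "lr": 10.0 * base_lr}],   # 10x higher rate for the latents
    betas=(0.9, 0.999))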
4.4.3 Results
We demonstrate our novel view rendering results on different sequences in Fig. 4.1 and Fig. 4.4.
Our method can represent a 30 FPS multi-view video of up to 10 seconds in length at
high quality. Our reconstructed model enables near-photorealistic continuous novel-view
rendering at 1K resolution. In the supplemental video hosted on the project website
https://neural-3d-video.github.io/, we render special visual effects such as slow motion, by interpolating
sub-frame latent codes between two discrete time-dependent latent codes, and the “bullet time”
effect with view-dependent effects, by querying any latent code at any continuous time within the
video. Rendering with interpolated latent codes results in a smooth and plausible representation
Figure 4.4: High-quality novel view videos synthesized by our approach for dynamic real-world
scenes. We visualize the normalized depth in color space in the last column of each row.
Our representation is compact, yet expressive, and even handles complex specular reflections and
translucency.
of dynamics between the two neighboring input frames. Please refer to our supplemental video
for the 3D video visualizations.
Ours: FLIP 0.130 | MVS: FLIP 0.206 | LLFF: FLIP 0.186 | NV: FLIP 0.207
Figure 4.5: Comparison of our final model to existing methods, including Multi-View Stereo
(MVS), Local Light Field Fusion (LLFF) [133] and Neural Volumes (NV) [123]. The first row shows
novel view rendering on a test view. The second row visualizes the FLIP error compared to the ground
truth image. Compared to the alternative methods, our method achieves the best visual quality.
Quantitative Comparison to the Baselines: Tab. 4.1 shows the quantitative comparison of
our methods to the baselines using an average of single-frame metrics, and Tab. 4.2 shows the
comparison to the baselines using a perceptual video metric. We train all the neural radiance field
based baselines and our method for the same number of iterations for a fair comparison. Compared
to the existing methods, MVS, Neural Volumes and LLFF, our method is able to capture and render
significantly more photo-realistic images, in all the quantitative measures. Compared to the
time-variant NeRF baseline NeRF-T and our basic DyNeRF model without our proposed training
strategy (DyNeRF†), our DyNeRF model variants trained with our proposed training strategy
perform significantly better in all metrics.
Qualitative Comparison to the Baselines: We highlight visual comparisons of our methods
to the baselines in Fig. 4.5 and Fig. 4.6. The visual results of the rendered images and FLIP error
Table 4.1: Quantitative comparison of our proposed method to baselines of existing methods
and radiance field baselines trained at 200K iterations on a 10-second sequence.
Method          PSNR↑     MSE↓      DSSIM↓    LPIPS↓    FLIP↓
MVS             19.1213   0.01226   0.1116    0.2599    0.2542
NeuralVolumes   22.7975   0.00525   0.0618    0.2951    0.2049
LLFF            23.2388   0.00475   0.0762    0.2346    0.1867
NeRF-T          28.4487   0.00144   0.0228    0.1000    0.1415
DyNeRF†         28.4994   0.00143   0.0231    0.0985    0.1455
DyNeRF          29.5808   0.00110   0.0197    0.0832    0.1347
Table 4.2: Quantitative comparison of our proposed method to baselines using the perceptual video
quality metric Just-Objectionable-Difference (JOD) [127]. A higher number (maximum 10) indicates
less noticeable visual difference to the ground truth.
Method   NeuralVolumes   LLFF   NeRF-T   DyNeRF
JOD↑     6.50            6.48   7.73     8.07
maps highlight the advantages of our approach in terms of photorealism that are not well quantified
by the metrics. In Fig. 4.5 we compare to the existing methods. MVS with texturing suffers
from incomplete reconstruction, especially at occlusion boundaries, such as image boundaries
and the window regions. The baked-in textures also cannot capture specular and transparent
effects properly, e.g., the window glasses. LLFF [133] produces blurred images with ghosting artifacts
and less consistent novel views across time, especially for objects at occlusion boundaries
and at greater distances from the foreground, e.g., the trees through the windows behind the actor. The results
from Neural Volumes [123] contain cloudy artifacts and suffer from inconsistent colors and
brightness (which can be better observed in the supplemental video). In contrast, our method
achieves clear images, unobstructed by “cloud artifacts”, and produces the best results compared
to the existing methods. In particular, the details of the actor (e.g., hat, hands) and important
details (e.g., the flame torch, which consists of a highly reflective surface as well as the volumetric
NeRF-T: DSSIM 0.0531, FLIP 0.1487 | DyNeRF†: DSSIM 0.0392, FLIP 0.1294 | DyNeRF: DSSIM 0.0235, FLIP 0.1144
Figure 4.6: Qualitative comparisons of DyNeRF variants on one image of the sequence whose
averages are reported in Tab. 4.1. From left to right we show the RGB rendering by each method, then
zoom onto the moving flame gun, then visualize DSSIM and FLIP for this region using the viridis
colormap (dark blue is 0, yellow is 1, lower is better).
flame appearance) are faithfully captured by our method. Furthermore, MVS, LLFF and Neural
Volumes cannot model the scene as a compact and continuous spatio-temporal representation like
our DyNeRF representation. In Fig. 4.6, we compare various settings of the dynamic neural radiance
fields. NeRF-T can only capture a blurry motion representation, which loses all appearance
details in the moving regions and cannot capture view-dependent effects. Though DyNeRF† has
a similar quantitative performance as NeRF-T, it has significantly improved visual quality in the
moving regions compared to NeRF-T, but still struggles to recover the sharp appearance details.
DyNeRF with our proposed training strategy can recover sharp details in the moving regions,
including the torch gun and the flames.
Figure 4.7: Snapshots of novel view rendered videos on immersive video datasets [34].
Comparisons on Training Time: Our proposed method is computationally more efficient
compared to alternative solutions. Training a NeRF model frame-by-frame is the only baseline
that can achieve the same photorealism as DyNeRF. However, we find that training a single frame
NeRF model to achieve the same photorealism requires about 50 GPU hours, which in total requires
15K GPU hours for a 30 FPS video of 10 seconds length. Our method only requires 1.3K
GPU hours for the same video, which reduces the required compute by one order of magnitude.
Results on Immersive Video Datasets [34]: We further demonstrate that our DyNeRF model
can create reasonably good 3D immersive video from non-forward-facing and spherically distorted
multi-view videos with the same parameter settings and the same training time. Fig. 4.7 shows
a few novel views rendered from our trained models. We include the video results in the supplementary
video. DyNeRF is able to generate an immersive coverage of the whole dynamic
space with a compact model. Compared to the frame-by-frame multi-spherical image (MSI) representation
used in [34], DyNeRF represents the video as one spatial-temporal model which is
more compact in size (28MB for a 5s 30 FPS video) and can better represent the view-dependent
effects in the scene. Given the same amount of training time, we also observe there are some
Figure 4.8: Limitation. A few examples of failed outdoor reconstruction using DyNeRF.
challenges, particularly blurriness in the fast-moving regions given the same compute budget
as above. We estimate that one epoch of training would take 4 weeks, while we only trained
all models using 1/4 of all pixels for a week. Longer training is required to gain sharpness,
which remains a computational challenge for our current method.
Discussions: There are a few scenarios that are challenging for our method. (1) Highly dynamic
scenes with large and fast motions are challenging to model and learn, which might lead
to blur in the moving regions. As shown in Fig. 4.8, we observe that it is particularly difficult to tackle
fast motion in a complex environment, e.g., outdoors with forest structure behind. An adaptive
sampling strategy during the hierarchical training that places more keyframes during the
challenging parts of the sequence, or more explicit motion modeling, could help to further improve
results. (2) While we already achieve a significant improvement in terms of training speed
compared to the baseline approaches, training still takes a lot of time and compute resources.
Finding ways to further decrease training time and to speed up rendering at test time is required.
(3) Viewpoint extrapolation beyond the bounds of the training views is challenging and
might lead to artifacts in the rendered imagery. We hope that, in the future, we can learn strong
scene priors that will be able to fill in the missing information. (4) We discussed the importance
sampling strategy and its effectiveness based on the assumption of videos observed from static
cameras. We leave the study of this strategy on videos from moving cameras as future work. We
believe these current limitations are good directions to explore in follow-up work and that our
approach is a stepping stone in this direction.
4.5 Conclusion
We have proposed a novel neural 3D video synthesis approach that is able to represent real-world
multi-view video recordings of dynamic scenes in a compact, yet expressive representation. As
we have demonstrated, our approach is able to represent a 10 second long multi-view recording
by 18 cameras in under 28MB. Our model-free representation enables both high-quality view
synthesis as well as motion interpolation. At the core of our approach is an efficient algorithm to
learn dynamic latent-conditioned neural radiance fields that significantly boosts training speed,
leads to fast convergence, and enables high quality results. We see our approach as a first step
forward in efficiently training dynamic neural radiance fields and hope that it will inspire follow-up
work in the exciting and emerging field of neural scene representations.
Chapter 5
Conclusion and Outlook
This dissertation investigates the fundamental algorithms and computational frameworks for
scalable pipelines of realistic digital human creation. Scalable creation pipelines for digital hu-
mans involve many sub-areas of computer vision and computer graphics. In this vast problem
space, we chose to explore the following three “gradients”: (1) generic modeling; (2) efficient and
automated processing; (3) general methodology. In Section 5.1, we summarize the contributions
of this dissertation. In Section 5.2, we list the contributions and compare them to the traditional
pipeline. With this, we discuss the insights that led to the contributions. We also discuss the
“missing components”, which suggest future opportunities, in Section 5.3.
5.1 Summary of Contributions
Generic Face Modeling: We propose a generic face model named FLAME (Faces Learned with
an Articulated Model and Expressions). The generic model is able to accurately fit novel subjects
(identities) as well as challenging expressions. In order to achieve wide coverage of variations, we
propose an automated pipeline to process a vast amount of high-quality facial capture data.
We design a system to reconstruct and register 4D faces with a carefully curated coarse-to-fine
optimization solver, exploiting geometric, photometric, and motion cues. Since facial deformations
due to identity, expression, and pose present different characteristics, we further propose a
generic model structure that disentangles the three factors. In particular, we represent the articulated
parts (jaw, neck, and eyeballs) with linear blend skinning. This design separates non-linear
deformations from the linear expression model (a common design choice) and leads to great advantages
in representing extreme expressions. Furthermore, we present pose-dependent corrective
blendshapes which alleviate artifacts and capture additional deformation details. We show
experimentally that the resulting generic face model is a lightweight yet expressive model that
covers the shape variations in a wide range of populations as well as realistic deformations due to
expressions and pose changes. The FLAME model is compatible with existing graphics software
and is easy to fit to data.
Efficient Inference for Topologically Consistent Face Meshes: We propose ToFu (Topologically
consistent Face inference from multi-view), an inference framework that produces topologically
consistent meshes across facial identities and expressions. The key design is a neural
network architecture that (1) incorporates known camera calibration information, (2) extracts and
processes volumetric features that encode the three-dimensional information of the scene, and (3)
is free of the constraints of a 3DMM. We show that the ToFu framework can produce high-quality
face registration meshes without the need for traditional photogrammetry and mesh registration.
We demonstrate state-of-the-art geometric and correspondence accuracy. The system takes
0.385 seconds to compute a mesh with 10K vertices, which is three orders of magnitude faster than
traditional techniques. Furthermore, we show that ToFu infers appearance maps for pore-level
geometric details which enable high-quality rendering. The resulting assets are readily applicable
in production studios for avatar creation, animation, and physically-based skin rendering.
General Capture and Modeling with Dynamic Neural Radiance Fields: We explore capture
and modeling systems that work beyond human faces (e.g., clothed human bodies, general
objects, and scenes). We propose a system to capture and model dynamic humans in arbitrary
dynamic scenes. Given multi-view video recordings of a dynamic real-world scene, the system
compresses the visual observations (radiance) into a neural representation that supports high-quality
3D video synthesis. One key contribution is a novel time-conditioned neural radiance
field, named DyNeRF (Dynamic Neural Radiance Field), that represents scene dynamics using a
set of compact latent codes. This representation, while being compact, supports features such as
view synthesis and motion interpolation. As ray-based training procedures tend to be slow,
we further propose a novel hierarchical training scheme and a ray importance sampling strategy to better
exploit the smooth nature of the spatial-temporal data. The two strategies significantly accelerate
the training progress and improve the perceptual quality of the generated video synthesis. We
demonstrate that our method can render high-fidelity wide-angle novel views at over 1K resolution,
even for complex and dynamic scenes. We show that our learned representation is highly
compact and able to represent a 10-second 30 FPS multi-view video recording by 18 cameras with
a model size of only 28MB.
Publications: This dissertation comprises three full-length (co-)first-author publications [116,
117, 118], published in top-tier computer vision and computer graphics journals and conferences. The
papers and the co-authors are listed as follows. Note: the asterisk (*) denotes equal contributions.
• Tianye Li*, Timo Bolkart*, Michael J. Black, Hao Li, and Javier Romero, Learning a Model
of Facial Shape and Expression from 4D Scans, ACM Transactions on Graphics, Volume 36,
Issue 6, December 2017 (Proceedings of ACM SIGGRAPH Asia 2017)
• Tianye Li, Shichen Liu, Timo Bolkart, Jiayi Liu, Hao Li, and Yajie Zhao, Topologically Con-
sistent Multi-View Face Inference Using Volumetric Sampling, Proceedings of the IEEE In-
ternational Conference on Computer Vision (ICCV) 2021
• Tianye Li*, Mira Slavcheva*, Michael Zollhoefer, Simon Green, Christoph Lassner, Changil
Kim, Tanner Schmidt, Steven Lovegrove, Michael Goesele, Richard Newcombe, and Zhaoy-
ang Lv, Neural 3D Video Synthesis from Multi-view Video, Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition (CVPR) 2022
The supplemental materials (including videos, data and code releases) are hosted at their corresponding
project websites [142, 144, 143].
Besides these three publications, the author has further published the following papers [171,
88, 237, 121, 122] during the Ph.D. program:
• Shunsuke Saito, Tianye Li, and Hao Li, Real-Time Facial Segmentation and Performance
Capture from RGB Input, Proceedings of the European Conference on Computer Vision (ECCV)
2016
• Zeng Huang, Tianye Li, Weikai Chen, Yajie Zhao, Jun Xing, Chloe LeGendre, Linjie Luo,
Chongyang Ma, and Hao Li, Deep Volumetric Video from Very Sparse Multi-view Perfor-
mance Capture, Proceedings of the European Conference on Computer Vision (ECCV) 2018
• Yajie Zhao*, Zeng Huang*, Tianye Li, Weikai Chen, Chloe LeGendre, Xinglei Ren, Jun Xing,
Ari Shapiro, and Hao Li, Learning Perspective Undistortion of Portraits, Proceedings of the
IEEE International Conference on Computer Vision (ICCV) 2019
• Shichen Liu, Tianye Li, Weikai Chen, and Hao Li, Soft Rasterizer: A Differentiable Renderer
for Image-based 3D Reasoning, Proceedings of the IEEE International Conference on Computer
Vision (ICCV) 2019
• Shichen Liu, Tianye Li, Weikai Chen, and Hao Li, A General Differentiable Mesh Renderer
for Image-based 3D Reasoning, IEEE Transactions on Pattern Analysis and Machine Intelli-
gence (TPAMI) 2020
5.2 Discussion
In Fig. 5.1, we revisit the contributions in this dissertation and align them to the existing animation
production pipeline in computer graphics. The existing pipeline contains multiple steps [15],
including (1) modeling (for character shape), (2) retopology∗ (aligning surfaces, which provides
dense surface correspondences for character rigging and supports efficient texturing), (3) texturing
(for character appearance), and (4) rigging (building an animation model). These steps are traditionally
finished by professional artists manually. Although computational methods have been
automating several components, such as photogrammetry to replace manual modeling or mesh
∗ While some texts include retopology as a part of modeling, we make this distinction as consistent topology is
especially crucial for dynamic digital humans.
[Figure 5.1 layout: the traditional pipeline stages Modeling, Retopology, Texturing, and Rigging turn images into raw geometry, registered geometry, appearance, and an animation model. Aligned to these stages are FLAME (Chapter 2: MVS, coarse-to-fine registration, FLAME training, texturing), ToFu (Chapter 3: mesh inference and appearance with global and local stages, 3DMM training), and Neural 3D Video Synthesis (Chapter 4: DyNeRF with efficient training as a synthesis network); the rigging counterpart is still to be addressed.]
Figure 5.1: Summary of contributions in this thesis from the perspective of traditional content
creation pipelines. The red blocks denote manual or semi-automated components. The blue
blocks denote fully automated components. The half-transparent components were not the major
emphasis of the respective chapter.
warping for automatic retopology, these systems cannot easily scale up to massive data capture
and processing, or to scalable production of digital humans.
Generic Face Modeling: In Chapter 2, we propose the FLAME pipeline, which contains two major
contributions. The first contribution is an expressive yet compact modeling technique for generic face
modeling. The model effectively summarizes high variations of identities as well as expressions and pose changes. The
effectiveness is thanks to a model structure that disentangles variations of three natures: shape
(identity), expression (mostly muscle activation), and pose (articulation due to bones). This suggests
the importance of a proper model structure that respects the nature of the data. The
performance of the FLAME model also depends on high-quality data, which is obtained by curating
a heterogeneous set of raw datasets. In total, FLAME is trained from over 33,000 registered
meshes. That leads to the second contribution: an automated facial data capture and processing
pipeline, which is shown to be a scalable system to capture massive 4D facial performance data
at the level of 10^4 to 10^5 meshes. The registration data accurately capture the facial shape details
and vivid performances when the subject’s face is deforming due to expressions and pose
changes. We show the registration is accurate in both geometry (it fits the true surface) and
correspondence (it tracks the deforming surface). The success indicates the importance of an
automated pipeline for high-quality data.
The FLAME model is shown to be expressive in terms of fitting to unseen data (either 3D
or 2D inputs), which shows great potential in applications such as facial performance transfer
(retargeting) and single-view face reconstruction. Furthermore, the FLAME model is lightweight
and compatible with existing graphics pipelines (such as Blender). Compared to the traditional
pipeline in Fig. 5.1, we show that the contributions in this chapter scale up the modeling,
retopology, and rigging steps with a well-designed computational pipeline. While we did not
explore FLAME for appearance in this dissertation, FLAME is able to support modeling facial appearance,
as shown in later work [59, 60]. It is worth noting that, since the publication [116],
FLAME has empowered several state-of-the-art methods in computer vision and computer graphics,
including expressive body capture and modeling (the SMPL-X model) [151], single-view detailed
face modeling [61], emotional face reconstruction [48], audio-driven facial animation [47],
as well as neural implicit models [238].
Efficient Inference for Topologically Consistent Face Meshes: In Chapter 3, we re-evaluate
the traditional capture and processing systems (including the one used for FLAME in Chapter 2). These
systems contain two key components: multi-view stereo reconstruction (MVS) and mesh registration.
We observe that these systems, though already automated and efficient given the help of parallel
processing, still present several potential issues in their efficiency and scalability: First, MVS
and mesh registration are slow to compute (at the level of minutes); Second, MVS tends to suffer
from imperfect image input such as specularity (due to facial oils) or outliers (e.g., facial hair).
Consequently, the quality of the mesh registration can be affected by the errors from MVS. Third,
when errors occur (which is almost inevitable in massive data capture and processing), it often
requires skilled artists to manually fix the results or adjust the optimization parameters.
We observe that there is a synergy between MVS and mesh registration, in which the
two steps are computed for a similar goal: MVS reconstructs 3D points by matching pixels from
multiple images, establishing dense correspondence across views. Mesh registration builds dense
correspondence among the meshes across identities and expressions. Both tasks aim to solve
dense correspondence. Given this insight, we design a neural network architecture that exploits
this synergy: we design a 2D neural network to predict dense feature maps that encode the
dense correspondence (across subjects and expressions). Then, volumetric sampling and volumetric
networks further exploit the multi-view features with the known perspective projection
(as opposed to merging them by arbitrary network connections). These designs are shown to be
effective when compared to existing work, which requires a global regression architecture and an
additional 3D morphable model.
This success suggests the importance of system thinking. Previously, we tended to think
of each sub-problem as an individual task, and each task evolved within its own family of research
progress. This chapter argues that, in order to improve a complicated system, we must rethink
the components jointly. Further, when designing a learning-based system, we should consider
the nature of the data or problem and design the architecture correspondingly. Compared to the
traditional pipeline and FLAME in Fig. 5.1, we see great potential for a well-designed learning-based
component to replace the existing components. The new design can have great advantages
in speed (ToFu achieves a three-orders-of-magnitude speed improvement) and higher robustness
to noise and outliers. The improvement in efficiency and robustness will further enhance the
scalability of the realistic digital human pipeline.
General Capture and Modeling with Dynamic Neural Radiance Fields: In Chapter 4, we
explore capture and modeling systems for general scenes, which contain dynamic humans as
well as arbitrary dynamic scenes. This scenario is highly challenging as the objects can have
highly detailed geometry, view-dependent appearance (e.g., specularity and transparency), volumetric
effects, and varying topology. Note that it is still possible to build morphable mesh models
on some general objects, but this requires the object instances to share some structure (as in dense
surface correspondence), and it requires great effort to build morphable models for each object
category that we are interested in. As general objects rarely share one consistent structure, the
diversity is multiplied compared to that of humans. Therefore, the methodology of the traditional
mesh-based pipeline (as in the face modeling we studied in the previous two chapters)
is difficult to scale up and generalize to objects and scenes.
Instead, we take a different approach: we utilize the power of neural implicit representations
(such as neural radiance fields) and differentiable rendering (volume rendering, in this case). To
capture the dynamic effects, we propose a dynamic neural radiance field (DyNeRF) representation
based on a set of compact latent codes. Facing the issue of long training times, we propose
efficient training schemes, including hierarchical training and ray importance sampling, to accelerate
the training as well as improve result quality. We show qualitative and quantitative results
of our method to demonstrate the state-of-the-art performance in capturing dynamic scenes with
humans in terms of 3D video synthesis. The trained dynamic NeRF representation further supports
visual effects such as bullet time and novel view synthesis, and can support editing with view
changes and temporal manipulation. We also show the trained representation can be viewed at
interactive speed with a commodity VR Quest 2 headset.
Compared to the traditional pipelines (including those in Chapter 2 and Chapter 3), we are
observing a paradigm shift. The traditional pipelines break the final task into small steps: modeling
(for raw geometry), registration (for aligned geometry), texturing (for appearance), and
rigging (for a manipulable model for animation). The pipeline has matured thanks to the development
over the last few decades, and the results are compatible with existing graphics software
and tools. In Chapter 3, we see a trend that a well-designed neural network can replace individual
traditional components in some settings (in our case, a lab/studio capture setup). Chapter 4
suggests an even more ambitious vision: neural/implicit representations combined with
neural/differentiable rendering show potential as a versatile solution that can cover arbitrary objects
and scenes with one “universal” representation (in our case, dynamic neural radiance fields).
However, the cost is that the current neural models cannot support full editability at the same
level as the classical mesh models. The dynamic neural radiance fields in Chapter 4 support the
playback of 3D videos. Although the spatial-temporal continuous representation supports frame
interpolation and view manipulation, which enable slow-motion video and bullet-time effects, it
Figure 5.2: Future directions. (a) Hybrid modeling [108, 168]; (b) Unconstrained capture [131];
(c) & (d) Cross-modality, including speech animation [240] and text-driven synthesis [157].
currently does not support scene editing beyond the captured physical history (e.g., adding a new
character or editing the events or background).
5.3 Outlook
Hybrid Modeling: As discussed in Section 5.2, neural representations and pipelines present
great potential in modeling arbitrary objects with challenging details and appearances, while the
classical graphics pipelines still hold advantages for particular objects (such as humans) in
being efficient, mature (in terms of computing and software infrastructure), and supporting full
controllability and editability. Therefore, hybrid solutions deserve attention, as it is possible to
combine the merits of the two worlds for future digital humans. This direction has also been
demonstrated by recent work to model eyes [108], hair [168], and clothing [62, 189]. For
future content creation, such as in game design and film-making, it is important to support flexible
scene editing for motion, geometry, appearance, and lighting conditions. Therefore, it would also
be interesting to push controllability and editability for general scenes and objects.
Unconstrained Capture: For capturing data at the highest possible quality, in this dissertation
we utilize high-end hardware, including multi-view capture systems such as 3DMD [92] and
the Light Stage system [76, 126]. To further democratize digital human and 3D/4D capture
technology, it is promising to pursue a more unconstrained capture system with fewer, potentially
dynamic, cameras. While single-view reconstruction methods have made progress in digitizing
human faces [37, 61, 238], hands [84], full bodies [170, 226], animals [169, 229], and general objects
[121, 139, 146], their reconstruction fidelity is no match for the high-end requirements of realism
(e.g., in film-making). It would be interesting to keep pushing for higher quality. The potential
directions include flexible representations, effective supervision (including differentiable/neural
rendering), and learning better object/human priors. Another way to view this problem is from
a system point of view. It would also be interesting to innovate on capture devices and combine
signals from multiple sensors [131, 132].
Cross-modality: Human social interactions involve more modalities than only visual cues.
It would be interesting to explore human modeling with speech [47, 162, 195, 240], sign/body
language [175], group interactions [94] and human-object interaction [43, 58, 193]. One in-
teresting recent development is the text-driven synthesis powered by large language models
(LLMs) [33, 52], including Text-to-Image synthesis [160, 166] and Text-to-3D synthesis [157].
The photo-realistic synthesis results suggest an interesting direction for future content creation
(including digital humans) with assistance from AI-driven agents.
Closing Remark: Any sufficiently advanced technology is indistinguishable from magic. This
famous quote from Arthur C. Clarke appeared, interestingly, in a footnote in his revision of the
essay "Hazards of Prophecy: The Failure of Imagination" [44]. In this essay, he discussed the importance
(and scarcity) of imagination. Too often, our minds are clogged by existing mindsets.
We should be free to rethink the old problems and strive to discover new problems and solutions.
As Clarke said, the only way of discovering the limits of the possible is to venture a little way past
them into the impossible. See you in the future.
Bibliography
[1] Edward H. Adelson and James R. Bergen. The plenoptic function and the elements of early
vision. In Computational Models of Visual Processing, pages 3–20. MIT Press, 1991.
[2] Oswald Aldrian and William AP Smith. Inverse rendering of faces on a cloudy day. In Eur.
Conf. Comput. Vis., pages 201–214, 2012.
[3] Oleg Alexander, Mike Rogers, William Lambeth, Matt Chiang, and Paul Debevec. The
digital Emily project: Photoreal facial modeling and animation. In SIGGRAPH 2009 Courses,
pages 12:1–12:15, 2009.
[4] Alicevision. Alicevision. https://alicevision.org . Accessed: 2020-04-28.
[5] Brett Allen, Brian Curless, and Zoran Popović. The space of human body shapes: recon-
struction and parameterization from range scans. ACM transactions on graphics (TOG),
22(3):587–594, 2003.
[6] Brett Allen, Brian Curless, Zoran Popović, and Aaron Hertzmann. Learning a correlated
model of identity and pose-dependent body shape variation for real-time synthesis. In
ACM SIGGRAPH/Eurographics Symposium on Computer Animation, SCA ’06, pages 147–
156, 2006.
[7] Brian Amberg, Reinhard Knothe, and Thomas Vetter. Expression invariant 3D face recog-
nition with a morphable model. In International Conference on Automatic Face Gesture
Recognition, pages 1–6, 2008.
[8] Brian Amberg, Sami Romdhani, and Thomas Vetter. Optimal step nonrigid ICP algorithms
for surface registration. In Conference on Computer Vision and Pattern Recognition, pages
1–8, 2007.
[9] Pontus Andersson, Jim Nilsson, Tomas Akenine-Möller, Magnus Oskarsson, Kalle Åström,
and Mark D Fairchild. Flip: a difference evaluator for alternating images. Proceedings of
the ACM on Computer Graphics and Interactive Techniques (HPG 2020), 3(2), 2020.
[10] Dragomir Anguelov, Praveen Srinivasan, Daphne Koller, Sebastian Thrun, Jim Rodgers,
and James Davis. SCAPE: Shape completion and animation of people. Transactions on
Graphics (Proceedings of SIGGRAPH), 24(3):408–416, 2005.
[11] Ziqian Bai, Zhaopeng Cui, Jamal Ahmed Rahim, Xiaoming Liu, and Ping Tan. Deep facial
non-rigid multi-view stereo. In Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 5850–5860, 2020.
[12] Aayush Bansal, Minh Vo, Yaser Sheikh, Deva Ramanan, and Srinivasa Narasimhan. 4d
visualization of dynamic events from unconstrained multi-view videos. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5366–5375,
2020.
[13] Anil Bas and William A. P. Smith. What does 2d geometric information really tell us about
3d face shape? Int. J. Comput. Vis., 127, 2019.
[14] Anil Bas, William A. P. Smith, Timo Bolkart, and Stefanie Wuhrer. Fitting a 3D morphable
model to edges: A comparison between hard and soft correspondences. In ACCVW, pages
377–391, Cham, 2017. Springer International Publishing.
[15] Andy Beane. 3D Animation Essentials. John Wiley & Sons, 2012.
[16] Thabo Beeler, Bernd Bickel, Paul Beardsley, Bob Sumner, and Markus Gross. High-quality
single-shot capture of facial geometry. ACM Trans. Graph., 29(4), 2010.
[17] Thabo Beeler and Derek Bradley. Rigid stabilization of facial expressions. Transactions on
Graphics (Proceedings of SIGGRAPH), 33(4):44:1–44:9, 2014.
[18] Thabo Beeler, Fabian Hahn, Derek Bradley, Bernd Bickel, Paul Beardsley, Craig Gotsman,
Robert W Sumner, and Markus Gross. High-quality passive facial performance capture
using anchor frames. Transactions on Graphics (Proceedings of SIGGRAPH), 30(4):75:1–75:10,
2011.
[19] Mojtaba Bemana, Karol Myszkowski, Hans-Peter Seidel, and Tobias Ritschel. X-fields:
implicit neural view-, light- and time-image interpolation. ACM Transactions on Graphics
(TOG), 39(6):1–15, 2020.
[20] Sai Bi, Zexiang Xu, Kalyan Sunkavalli, Miloš Hašan, Yannick Hold-Geoffroy, David Krieg-
man, and Ravi Ramamoorthi. Deep reflectance volumes: Relightable reconstructions from
multi-view photometric images. In Computer Vision – ECCV 2020: 16th European Confer-
ence, Glasgow, UK, August 23–28, 2020, Proceedings, Part III, page 294–311, Berlin, Heidel-
berg, 2020. Springer-Verlag.
[21] Volker Blanz, Curzio Basso, Tomaso Poggio, and Thomas Vetter. Reanimating faces in
images and video. Computer Graphics Forum, 22(3):641–650, 2003.
[22] Volker Blanz, Sami Romdhani, and Thomas Vetter. Face identification across different poses
and illuminations with a 3D morphable model. In Proceedings of fth IEEE international
conference on automatic face gesture recognition, pages 202–207. IEEE, 2002.
[23] Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3D faces. In ACM
Transactions on Graphics (TOG), SIGGRAPH ’99, 1999.
[24] Federica Bogo, Michael J Black, Matthew Loper, and Javier Romero. Detailed full-body re-
constructions of moving people from monocular RGB-D sequences. In International Con-
ference on Computer Vision, pages 2300–2308, 2015.
[25] Federica Bogo, Javier Romero, Matthew Loper, and Michael J. Black. FAUST: Dataset and
evaluation for 3D mesh registration. In Conference on Computer Vision and Pattern Recog-
nition, pages 3794–3801, 2014.
[26] Federica Bogo, Javier Romero, Gerard Pons-Moll, and Michael J Black. Dynamic faust:
Registering human bodies in motion. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 6233–6242, 2017.
[27] Timo Bolkart and Stefanie Wuhrer. A groupwise multilinear correspondence optimization
for 3D faces. In Int. Conf. Comput. Vis., pages 3604–3612, 2015.
[28] James Booth, Anastasios Roussos, Allan Ponniah, David Dunaway, and Stefanos Zafeiriou.
Large scale 3D morphable models. International Journal of Computer Vision, pages 1–22,
2017.
[29] James Booth, Anastasios Roussos, Stefanos Zafeiriou, Allan Ponniah, and David Dunaway.
A 3D morphable model learnt from 10,000 faces. In Conference on Computer Vision and
Pattern Recognition, pages 5543–5552, 2016.
[30] George Borshukov, Dan Piponi, Oystein Larsen, J. P. Lewis, and Christina Tempelaar-Lietz.
Universal capture - image-based facial animation for "the matrix reloaded". In ACM SIG-
GRAPH 2005 Courses, SIGGRAPH ’05, page 16–es, New York, NY, USA, 2005. Association
for Computing Machinery.
[31] Sofien Bouaziz, Yangang Wang, and Mark Pauly. Online modeling for realtime facial ani-
mation. Transactions on Graphics (Proceedings of SIGGRAPH), 32(4):40:1–40:10, 2013.
[32] Alexander M Bronstein, Michael M Bronstein, and Ron Kimmel. Numerical geometry of
non-rigid shapes. Springer Science & Business Media, 2008.
[33] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhari-
wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language
models are few-shot learners. Advances in neural information processing systems, 33:1877–
1901, 2020.
[34] Michael Broxton, John Flynn, Ryan Overbeck, Daniel Erickson, Peter Hedman, Matthew
Duvall, Jason Dourgarian, Jay Busch, Matt Whalen, and Paul Debevec. Immersive light field
video with a layered mesh representation. ACM Transactions on Graphics (TOG), 39(4):86–1,
2020.
[35] Alan Brunton, Timo Bolkart, and Stefanie Wuhrer. Multilinear wavelets: A statistical shape
space for human faces. In European Conference on Computer Vision, pages 297–312, 2014.
[36] Chen Cao, Derek Bradley, Kun Zhou, and Thabo Beeler. Real-time high-fidelity facial per-
formance capture. Transactions on Graphics (Proceedings of SIGGRAPH), 34(4):46:1–46:9,
2015.
[37] Chen Cao, Tomas Simon, Jin Kyu Kim, Gabe Schwartz, Michael Zollhoefer, Shun-Suke
Saito, Stephen Lombardi, Shih-En Wei, Danielle Belko, Shoou-I Yu, et al. Authentic volu-
metric avatars from a phone scan. ACM Transactions on Graphics (TOG), 41(4):1–19, 2022.
[38] Chen Cao, Yanlin Weng, Shun Zhou, Yiying Tong, and Kun Zhou. Facewarehouse: A 3D fa-
cial expression database for visual computing. Transactions on Visualization and Computer
Graphics, 20(3):413–425, 2014.
[39] Joel Carranza, Christian Theobalt, Marcus A Magnor, and Hans-Peter Seidel. Free-
viewpoint video of human actors. ACM Transactions on Graphics (TOG), 22(3):569–577,
2003.
[40] Feng-Ju Chang, Anh Tuan Tran, Tal Hassner, Iacopo Masi, Ram Nevatia, and Gerard
Medioni. Expnet: Landmark-free, deep, 3D facial expressions. In 2018 13th IEEE Interna-
tional Conference on Automatic Face & Gesture Recognition (FG 2018), pages 122–129. IEEE,
2018.
[41] Shenchang Eric Chen and Lance Williams. View interpolation for image synthesis. In
Proceedings of the 20th annual conference on Computer graphics and interactive techniques,
pages 279–288, 1993.
[42] Yu Chen, Duncan P Robertson, and Roberto Cipolla. A practical system for modelling
body shapes from single view measurements. In British Machine Vision Conference, pages
82.1–82.11, 2011.
[43] Sammy Christen, Muhammed Kocabas, Emre Aksan, Jemin Hwangbo, Jie Song, and Otmar
Hilliges. D-grasp: Physically plausible dynamic grasp synthesis for hand-object interac-
tions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pages 20577–20586, 2022.
[44] Arthur C Clarke. Profiles of the Future: An Inquiry into the Limits of the Possible. Harper &
Row; Revised edition (January 1, 1973), 1973.
[45] Alvaro Collet, Ming Chuang, Pat Sweeney, Don Gillett, Dennis Evseev, David Calabrese,
Hugues Hoppe, Adam Kirk, and Steve Sullivan. High-quality streamable free-viewpoint
video. ACM Transactions on Graphics (ToG), 34(4):1–13, 2015.
[46] Darren Cosker, Eva Krumhuber, and Adrian Hilton. A FACS valid 3D dynamic action unit
database with applications to 3D dynamic morphable facial modeling. In International
Conference on Computer Vision, pages 2296–2303, 2011.
[47] Daniel Cudeiro, Timo Bolkart, Cassidy Laidlaw, Anurag Ranjan, and Michael J Black. Cap-
ture, learning, and synthesis of 3D speaking styles. In Proceedings of the IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition, pages 10101–10111, 2019.
[48] Radek Danecek, Michael J. Black, and Timo Bolkart. EMOCA: Emotion driven monocular
face capture and animation. In Conference on Computer Vision and Pattern Recognition
(CVPR), pages 20311–20322, 2022.
[49] Rhodri Davies, Carole Twining, and Chris Taylor. Statistical Models of Shape: Optimisation
and Evaluation. Springer, 2008.
[50] Abe Davis, Marc Levoy, and Fredo Durand. Unstructured light fields. Comput. Graph.
Forum, 31(2pt1):305–314, may 2012.
[51] Paul Debevec, Tim Hawkins, Chris Tchou, Haarm-Pieter Duiker, Westley Sarokin, and
Mark Sagar. Acquiring the reflectance field of a human face. In Proceedings of the 27th
annual conference on Computer graphics and interactive techniques, pages 145–156. ACM
Press/Addison-Wesley Publishing Co., 2000.
[52] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-
training of deep bidirectional transformers for language understanding. arXiv preprint
arXiv:1810.04805, 2018.
[53] Yilun Du, Yinan Zhang, Hong-Xing Yu, Joshua B Tenenbaum, and Jiajun Wu. Neural ra-
diance flow for 4d view synthesis and video processing. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, pages 14324–14334, 2021.
[54] Ludovic Dutreve, Alexandre Meyer, and Sada Bouakaz. Easy acquisition and real-time
animation of facial wrinkles. Computer Animation and Virtual Worlds, 22(2-3):169–176,
2011.
[55] Bernhard Egger, William A. P. Smith, Ayush Tewari, Stefanie Wuhrer, Michael Zollhöfer,
Thabo Beeler, Florian Bernard, Timo Bolkart, Adam Kortylewski, Sami Romdhani, Chris-
tian Theobalt, Volker Blanz, and Thomas Vetter. 3D morphable face models - past, present
and future. ACM Trans. Graph., 2020.
[56] Paul Ekman and Wallace V Friesen. Facial action coding system. Environmental Psychology
& Nonverbal Behavior, 1978.
[57] Carlos Hernández Esteban and Francis Schmitt. Silhouette and stereo fusion for 3D object
modeling. Computer Vision and Image Understanding, 96(3):367–392, 2004.
[58] Zicong Fan, Omid Taheri, Dimitrios Tzionas, Muhammed Kocabas, Manuel Kaufmann,
Michael J Black, and Otmar Hilliges. Articulated objects in free-form hand interaction.
arXiv preprint arXiv:2204.13662, 2022.
[59] Haiwen Feng. Photometric FLAME fitting. https://github.com/HavenFeng/photometric_
optimization, 2020.
[60] Haiwen Feng, Timo Bolkart, Joachim Tesch, Michael J. Black, and Victoria Abrevaya. To-
wards racially unbiased skin tone estimation via scene disambiguation. In European Con-
ference on Computer Vision, 2022.
[61] Yao Feng, Haiwen Feng, Michael J. Black, and Timo Bolkart. Learning an animatable de-
tailed 3D face model from in-the-wild images. ACM Transactions on Graphics (ToG), Proc.
SIGGRAPH, 40(4):88:1–88:13, 2021.
[62] Yao Feng, Jinlong Yang, Marc Pollefeys, Michael J. Black, and Timo Bolkart. Capturing and
animation of body and clothing from monocular video. In SIGGRAPH Asia 2022 Conference
Papers, SA ’22, New York, NY, USA, 2022. Association for Computing Machinery.
[63] Claudio Ferrari, Giuseppe Lisanti, Stefano Berretti, and Alberto Del Bimbo. Dictionary
learning based 3D morphable model construction for face recognition with varying ex-
pression and pose. In International Conference on 3D Vision, pages 509–517, 2015.
[64] Barbara Flueckiger. Computer-generated characters in avatar and benjamin button. Digi-
talitat und Kino. Translation from German by B. Letzler, 1:2, 2011.
[65] John Flynn, Michael Broxton, Paul Debevec, Matthew DuVall, Graham Fyffe, Ryan Over-
beck, Noah Snavely, and Richard Tucker. Deepview: View synthesis with learned gradient
descent. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 2367–2376, 2019.
[66] Yasutaka Furukawa. High-fidelity image-based modeling. University of Illinois at Urbana-
Champaign, 2008.
[67] Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multiview stereopsis.
IEEE transactions on pattern analysis and machine intelligence, 32(8):1362–1376, 2009.
[68] Graham Fyffe, Koki Nagano, Loc Huynh, Shunsuke Saito, Jay Busch, Andrew Jones, Hao Li,
and Paul Debevec. Multi-view stereo on consistent face topology. Comput. Graph. Forum,
36(2):295–309, May 2017.
[69] Guy Gafni, Justus Thies, Michael Zollhofer, and Matthias Nießner. Dynamic neural radi-
ance fields for monocular 4d facial avatar reconstruction. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 8649–8658, 2021.
[70] Chen Gao, Ayush Saraf, Johannes Kopf, and Jia-Bin Huang. Dynamic view synthesis from
dynamic monocular video. In Proceedings of the IEEE International Conference on Computer
Vision, 2021.
[71] Pablo Garrido, Levi Valgaerts, Chenglei Wu, and Christian Theobalt. Reconstructing de-
tailed dynamic face geometry from monocular video. Transactions on Graphics (Proceedings
of SIGGRAPH Asia), 32(6):158:1–158:10, 2013.
[72] Pablo Garrido, Michael Zollhöfer, Dan Casas, Levi Valgaerts, Kiran Varanasi, Patrick Pérez,
and Christian Theobalt. Reconstruction of personalized 3D face rigs from monocular video.
Transactions on Graphics (Presented at SIGGRAPH 2016), 35(3):28:1–28:15, 2016.
[73] Donald Geman, Stuart Geman, Neil Hallonquist, and Laurent Younes. Visual turing test for
computer vision systems. Proceedings of the National Academy of Sciences, 112(12):3618–
3623, 2015.
[74] Stuart Geman and D McClure. Bayesian image analysis: An application to single photon
emission tomography. Amer. Statist. Assoc, pages 12–18, 1985.
[75] Kyle Genova, Forrester Cole, Aaron Maschinot, Aaron Sarna, Daniel Vlasic, and William T.
Freeman. Unsupervised training for 3D morphable model regression. In IEEE Conf. Comput.
Vis. Pattern Recog., pages 8377–8386, 2018.
[76] Abhijeet Ghosh, Graham Fyffe, Borom Tunwattanapong, Jay Busch, Xueming Yu, and Paul
Debevec. Multiview face capture using polarized spherical gradient illumination. ACM
Trans. Graph., 30(6), 2011.
[77] Michael Goesele, Brian Curless, and Steven M Seitz. Multi-view stereo revisited. In 2006
IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06),
volume 2, pages 2402–2409. IEEE, 2006.
[78] Google. Project Starline: Feel like you’re there, together. https://blog.google/technology/
research/project-starline/. Accessed: 2022-11-28.
[79] Steven J Gortler, Radek Grzeszczuk, Richard Szeliski, and Michael F Cohen. The lumigraph.
In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques,
pages 43–54, 1996.
[80] Paulo Gotardo, Jérémy Riviere, Derek Bradley, Abhijeet Ghosh, and Thabo Beeler. Practical
dynamic facial appearance modeling and acquisition. ACM Transactions on Graphics (Proc.
SIGGRAPH Asia), 37(6):232:1–232:13, 2018.
[81] Xiaodong Gu, Zhiwen Fan, Siyu Zhu, Zuozhuo Dai, Feitong Tan, and Ping Tan. Cascade
cost volume for high-resolution multi-view stereo and stereo matching. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[82] Kaiwen Guo, Peter Lincoln, Philip Davidson, Jay Busch, Xueming Yu, Matt Whalen, Geoff
Harvey, Sergio Orts-Escolano, Rohit Pandey, Jason Dourgarian, et al. The relightables:
Volumetric performance capture of humans with realistic relighting. ACM Transactions on
Graphics (TOG), 38(6):1–19, 2019.
[83] Nils Hasler, Carsten Stoll, Martin Sunkel, Bodo Rosenhahn, and H-P Seidel. A Statistical
Model of Human Pose and Body Shape. Computer Graphics Forum, 2009.
[84] Yana Hasson, Gul Varol, Dimitrios Tzionas, Igor Kalevatykh, Michael J Black, Ivan Laptev,
and Cordelia Schmid. Learning joint reconstruction of hands and manipulated objects. In
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages
11807–11816, 2019.
[85] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 770–778, 2016.
[86] David A. Hirshberg, Matthew Loper, Eric Rachlin, and Michael J. Black. Coregistration:
Simultaneous alignment and modeling of articulated 3D shape. In European Conference on
Computer Vision, pages 242–255, 2012.
[87] Liwen Hu, Shunsuke Saito, Lingyu Wei, Koki Nagano, Jaewoo Seo, Jens Fursund, Iman
Sadeghi, Carrie Sun, Yen-Chun Chen, and Hao Li. Avatar digitization from a single image
for real-time rendering. ACM Trans. Graph., 36(6), 2017.
[88] Zeng Huang, Tianye Li, Weikai Chen, Yajie Zhao, Jun Xing, Chloe LeGendre, Linjie Luo,
Chongyang Ma, and Hao Li. Deep volumetric video from very sparse multi-view perfor-
mance capture. In Proceedings of the European Conference on Computer Vision (ECCV), pages
336–354, 2018.
[89] Alexandru Eugen Ichim, Sofien Bouaziz, and Mark Pauly. Dynamic 3D avatar creation
from hand-held video input. ACM Transactions on Graphics (ToG), 34(4):1–14, 2015.
[90] Sunghoon Im, Hyowon Ha, Hae-Gon Jeon, Stephen Lin, and In So Kweon. Deep depth
from uncalibrated small motion clip. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 2019.
[91] Sunghoon Im, Hae-Gon Jeon, Stephen Lin, and In-So Kweon. Dpsnet: End-to-end deep
plane sweep stereo. In 7th International Conference on Learning Representations, ICLR 2019.
International Conference on Learning Representations, ICLR, 2019.
[92] 3dMD Inc. http://www.3dmd.com/.
[93] Karim Iskakov, Egor Burkov, Victor Lempitsky, and Yury Malkov. Learnable triangulation
of human pose. In Proceedings of the IEEE International Conference on Computer Vision,
pages 7718–7727, 2019.
[94] Hanbyul Joo, Tomas Simon, Mina Cikara, and Yaser Sheikh. Towards social artificial in-
telligence: Nonverbal social signal prediction in a triadic interaction. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10873–10883, 2019.
[95] Nima Khademi Kalantari, Ting-Chun Wang, and Ravi Ramamoorthi. Learning-based view
synthesis for light field cameras. ACM Transactions on Graphics (TOG), 35(6):1–10, 2016.
[96] Takeo Kanade, Peter Rander, and PJ Narayanan. Virtualized reality: Constructing virtual
worlds from real scenes. IEEE Multimedia, 4(1):34–47, 1997.
[97] Abhishek Kar, Christian Häne, and Jitendra Malik. Learning a multi-view stereo machine.
In Advances in neural information processing systems, pages 365–376, 2017.
[98] Ira Kemelmacher-Shlizerman and Steven M Seitz. Face reconstruction in the wild. In
International Conference on Computer Vision, pages 1746–1753, 2011.
[99] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua
Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations,
ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
[100] Leif Kobbelt, Swen Campagna, Jens Vorsatz, and Hans-Peter Seidel. Interactive multi-
resolution modeling on arbitrary meshes. In SIGGRAPH, pages 105–114, 1998.
[101] Vladimir Kolmogorov and Ramin Zabih. Multi-camera scene reconstruction via graph cuts.
In European conference on computer vision, pages 82–96. Springer, 2002.
[102] Yeara Kozlov, Derek Bradley, Moritz Bächer, Bernhard Thomaszewski, Thabo Beeler, and
Markus Gross. Enriching facial blendshape rigs with physical simulation. Computer Graph-
ics Forum, 2017.
[103] Christoph Lassner and Michael Zollhofer. Pulsar: Efficient sphere-based neural rendering.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages
1440–1449, 2021.
[104] Aldo Laurentini. The visual hull concept for silhouette-based image understanding. IEEE
Transactions on pattern analysis and machine intelligence, 16(2):150–162, 1994.
[105] Jason Lawrence, Dan B Goldman, Supreeth Achar, Gregory Major Blascovich, Joseph G.
Desloge, Tommy Fortes, Eric M. Gomez, Sascha Häberling, Hugues Hoppe, Andy Huibers,
Claude Knaus, Brian Kuschak, Ricardo Martin-Brualla, Harris Nover, Andrew Ian Russell,
Steven M. Seitz, and Kevin Tong. Project starline: A high-fidelity telepresence system.
ACM Transactions on Graphics (Proc. SIGGRAPH Asia), 40(6), 2021.
[106] Marc Levoy and Pat Hanrahan. Light field rendering. In Proceedings of the 23rd annual
conference on Computer graphics and interactive techniques, pages 31–42, 1996.
[107] Marc Levoy, Kari Pulli, Brian Curless, Szymon Rusinkiewicz, David Koller, Lucas Pereira,
Matt Ginzton, Sean Anderson, James Davis, Jeremy Ginsberg, et al. The digital michelan-
gelo project: 3D scanning of large statues. In Proceedings of the 27th annual conference on
Computer graphics and interactive techniques, pages 131–144, 2000.
[108] Gengyan Li, Abhimitra Meka, Franziska Mueller, Marcel C Buehler, Otmar Hilliges, and
Thabo Beeler. Eyenerf: a hybrid representation for photorealistic synthesis, animation and
relighting of human eyes. ACM Transactions on Graphics (TOG), 41(4):1–16, 2022.
[109] Hao Li, Bart Adams, Leonidas J Guibas, and Mark Pauly. Robust single-view geometry and
motion reconstruction. ACM Transactions on Graphics (ToG), 28(5):1–10, 2009.
[110] Hao Li, Linjie Luo, Daniel Vlasic, Pieter Peers, Jovan Popović, Mark Pauly, and Szymon
Rusinkiewicz. Temporally coherent completion of dynamic shapes. ACM Transactions on
Graphics (TOG), 31(1):1–11, 2012.
[111] Hao Li, Thibaut Weise, and Mark Pauly. Example-based facial rigging. Transactions on
Graphics (Proceedings of SIGGRAPH), 29(4):32:1–32:6, 2010.
[112] Hao Li, Jihun Yu, Yuting Ye, and Chris Bregler. Realtime facial animation with on-the-fly
correctives. Transactions on Graphics (Proceedings of SIGGRAPH), 32(4):42:1–42:10, 2013.
[113] Jiaman Li, Zhengfei Kuang, Yajie Zhao, Mingming He, Karl Bladin, and Hao Li. Dynamic
facial asset and rig generation from a single scan. ACM Transactions on Graphics (TOG),
39(6):1–18, 2020.
[114] Jun Li, Weiwei Xu, Zhiquan Cheng, Kai Xu, and Reinhard Klein. Lightweight wrinkle
synthesis for 3D facial modeling and animation. Computer-Aided Design, 58:117–122, 2015.
[115] Ruilong Li, Karl Bladin, Yajie Zhao, Chinmay Chinara, Owen Ingraham, Pengda Xiang,
Xinglei Ren, Pratusha Prasad, Bipin Kishore, Jun Xing, and Hao Li. Learning formation of
physically-based face attributes. In IEEE Conf. Comput. Vis. Pattern Recog., 2020.
[116] Tianye Li, Timo Bolkart, Michael J. Black, Hao Li, and Javier Romero. Learning a model
of facial shape and expression from 4D scans. ACM Transactions on Graphics (TOG), 36(6),
2017.
[117] Tianye Li, Shichen Liu, Timo Bolkart, Jiayi Liu, Hao Li, and Yajie Zhao. Topologically con-
sistent multi-view face inference using volumetric sampling. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, pages 3824–3834, 2021.
[118] Tianye Li, Mira Slavcheva, Michael Zollhoefer, Simon Green, Christoph Lassner, Changil
Kim, Tanner Schmidt, Steven Lovegrove, Michael Goesele, Richard Newcombe, and
Zhaoyang Lv. Neural 3D video synthesis from multi-view video. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5521–5531, 2022.
[119] Zhengqi Li, Simon Niklaus, Noah Snavely, and Oliver Wang. Neural scene flow fields for
space-time view synthesis of dynamic scenes. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 6498–6508, 2021.
[120] Lingjie Liu, Marc Habermann, Viktor Rudnev, Kripasindhu Sarkar, Jiatao Gu, and Christian
Theobalt. Neural actor: Neural free-view synthesis of human actors with pose control.
ACM Trans. Graph.(ACM SIGGRAPH Asia), 2021.
[121] Shichen Liu, Tianye Li, Weikai Chen, and Hao Li. Soft rasterizer: A differentiable renderer
for image-based 3D reasoning. In Proceedings of the IEEE/CVF International Conference on
Computer Vision, pages 7708–7717, 2019.
[122] Shichen Liu, Tianye Li, Weikai Chen, and Hao Li. A general differentiable mesh renderer for
image-based 3D reasoning. IEEE Transactions on Pattern Analysis and Machine Intelligence,
2020.
[123] Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann,
and Yaser Sheikh. Neural volumes: Learning dynamic renderable volumes from images.
ACM Trans. Graph., 38(4):65:1–65:14, July 2019.
[124] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J.
Black. Smpl: A skinned multi-person linear model. Transactions on Graphics (Proceedings
of SIGGRAPH Asia), 34(6):248:1–248:16, 2015.
[125] Matthew M Loper and Michael J Black. OpenDR: An approximate differentiable renderer.
In European Conference on Computer Vision, pages 154–169, 2014.
[126] Wan-Chun Ma, Tim Hawkins, Pieter Peers, Charles-Felix Chabert, Malte Weiss, Paul E De-
bevec, et al. Rapid acquisition of specular and diffuse normal maps from polarized spherical
gradient illumination. Rendering Techniques, 2007(9):10, 2007.
[127] Rafał K. Mantiuk, Gyorgy Denes, Alexandre Chapiro, Anton Kaplanyan, Gizem Rufo, Ro-
main Bachy, Trisha Lian, and Anjul Patney. FovVideoVDP: A visible difference predictor
for wide field-of-view video. ACM Transactions on Graphics (TOG), 2021.
[128] Ricardo Martin-Brualla, Noha Radwan, Mehdi S. M. Sajjadi, Jonathan T. Barron, Alexey
Dosovitskiy, and Daniel Duckworth. NeRF in the Wild: Neural Radiance Fields for Uncon-
strained Photo Collections. In CVPR, 2021.
[129] Abhimitra Meka, Christian Haene, Rohit Pandey, Michael Zollhöfer, Sean Fanello, Graham
Fyffe, Adarsh Kowdle, Xueming Yu, Jay Busch, Jason Dourgarian, et al. Deep reflectance
fields: high-quality facial reflectance field inference from color gradient illumination. ACM
Transactions on Graphics (TOG), 38(4):1–12, 2019.
[130] Moustafa Meshry, Dan B Goldman, Sameh Khamis, Hugues Hoppe, Rohit Pandey, Noah
Snavely, and Ricardo Martin-Brualla. Neural rerendering in the wild. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pages 6878–6887, 2019.
[131] Meta. Project aria. https://about.meta.com/realitylabs/projectaria/ . Accessed: 2022-12-
01.
[132] Microsoft. Microsoft hololens 2. https://www.microsoft.com/en-us/hololens. Accessed:
2022-12-01.
[133] Ben Mildenhall, Pratul P Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi
Ramamoorthi, Ren Ng, and Abhishek Kar. Local light eld fusion: Practical view synthesis
with prescriptive sampling guidelines. ACM Transactions on Graphics (TOG), 38(4):1–14,
2019.
[134] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoor-
thi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In
ECCV, 2020.
[135] Masahiro Mori, Karl F MacDorman, and Norri Kageki. The uncanny valley [from the field].
IEEE Robotics & automation magazine, 19(2):98–100, 2012.
[136] MPC. MPC Blade Runner 2049 VFX breakdown. https://youtu.be/x8ZnqCKZABY , 2018. Ac-
cessed: 2022-12-07.
[137] Koki Nagano, Jaewoo Seo, Jun Xing, Lingyu Wei, Zimo Li, Shunsuke Saito, Aviral Agar-
wal, Jens Fursund, Hao Li, Richard Roberts, et al. paGAN: real-time avatars using dynamic
textures. In ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), page 258. ACM, 2018.
[138] Thomas Neumann, Kiran Varanasi, Stephan Wenger, Markus Wacker, Marcus Magnor, and
Christian Theobalt. Sparse localized deformation components. Transactions on Graphics
(Proceedings of SIGGRAPH Asia), 32(6):179:1–179:10, 2013.
[139] Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. Differentiable
volumetric rendering: Learning implicit 3d representations without 3d supervision. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages
3504–3515, 2020.
[140] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, 2006.
[141] Atsuhiro Noguchi, Xiao Sun, Stephen Lin, and Tatsuya Harada. Neural articulated radiance
field. In International Conference on Computer Vision, 2021.
[142] Project Website of FLAME. http://flame.is.tue.mpg.de , 2017.
[143] Project Website of Neural 3D Video Synthesis. https://neural-3d-video.github.io/, 2022.
[144] Project Website of ToFu. https://tianyeli.github.io/tofu , 2021.
[145] Ryan S. Overbeck, Daniel Erickson, Daniel Evangelakos, and Paul Debevec. The making of
welcome to light fields vr. In ACM SIGGRAPH 2018 Talks, SIGGRAPH ’18, New York, NY,
USA, 2018. Association for Computing Machinery.
[146] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove.
Deepsdf: Learning continuous signed distance functions for shape representation. In Pro-
ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 165–
174, 2019.
[147] Keunhong Park, Utkarsh Sinha, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman,
Steven M Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields. In
Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5865–5874,
2021.
[148] Keunhong Park, Utkarsh Sinha, Peter Hedman, Jonathan T. Barron, Sofien Bouaziz, Dan B
Goldman, Ricardo Martin-Brualla, and Steven M. Seitz. Hypernerf: A higher-dimensional
representation for topologically varying neural radiance fields. ACM Trans. Graph., 40(6),
dec 2021.
[149] Frederick Ira Parke. A parametric model for human faces. Technical report, UTAH UNIV
SALT LAKE CITY DEPT OF COMPUTER SCIENCE, 1974.
[150] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan,
Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative
style, high-performance deep learning library. Advances in neural information processing
systems, 32, 2019.
[151] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed AA Osman,
Dimitrios Tzionas, and Michael J Black. Expressive body capture: 3D hands, face, and
body from a single image. In Proceedings of the IEEE/CVF conference on computer vision and
pattern recognition, pages 10975–10985, 2019.
[152] Pascal Paysan, Reinhard Knothe, Brian Amberg, Sami Romdhani, and Thomas Vetter. A 3D
face model for pose and illumination invariant face recognition. In International Conference
on Advanced Video and Signal Based Surveillance, pages 296–301, 2009.
[153] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel,
P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher,
M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine
Learning Research, 12:2825–2830, 2011.
[154] Sida Peng, Yuanqing Zhang, Yinghao Xu, Qianqian Wang, Qing Shuai, Hujun Bao, and
Xiaowei Zhou. Neural body: Implicit neural representations with structured latent codes
for novel view synthesis of dynamic humans. In CVPR, 2021.
[155] Leonid Pishchulin, Stefanie Wuhrer, Thomas Helten, Christian Theobalt, and Bernt Schiele.
Building statistical shape spaces for 3D human modeling. Pattern Recognition, 67(C):276–
286, July 2017.
[156] J-P Pons, Renaud Keriven, and Olivier Faugeras. Modelling dynamic scenes by registering
multi-view image sequences. In 2005 IEEE Computer Society Conference on Computer Vision
and Pattern Recognition (CVPR’05), volume 2, pages 822–827. IEEE, 2005.
[157] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3D
using 2D diffusion. arXiv preprint arXiv:2209.14988, 2022.
[158] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-NeRF:
Neural Radiance Fields for Dynamic Scenes. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, 2020.
[159] Amit Raj, Michael Zollhöfer, Tomas Simon, Jason M. Saragih, Shunsuke Saito, James Hays,
and Stephen Lombardi. Pixel-aligned volumetric avatars. 2021 IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), pages 11728–11737, 2021.
[160] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical
text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
[161] Anurag Ranjan, Timo Bolkart, Soubhik Sanyal, and Michael J. Black. Generating 3D faces
using convolutional mesh autoencoders. In Eur. Conf. Comput. Vis., pages 725–741, 2018.
[162] Alexander Richard, Michael Zollhöfer, Yandong Wen, Fernando De la Torre, and Yaser
Sheikh. Meshtalk: 3D face animation from speech using cross-modality disentanglement.
In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1173–
1182, 2021.
[163] Elad Richardson, Matan Sela, and Ron Kimmel. 3D face reconstruction by learning from
synthetic data. In International Conference On 3d Vision (3DV), pages 460–469, 2016.
[164] Jérémy Riviere, Paulo Gotardo, Derek Bradley, Abhijeet Ghosh, and Thabo Beeler. Single-
shot high-quality facial geometry and skin appearance capture. ACM Trans. Graph., 39(4),
aug 2020.
[165] K. Robinette, S. Blackwell, H. Daanen, M. Boehmer, S. Fleming, T. Brill, D. Hoeferlin, and
D. Burnsides. Civilian American and European Surface Anthropometry Resource (CAE-
SAR) final report. Technical Report AFRL-HE-WP-TR-2002-0169, US Air Force Research
Laboratory, 2002.
[166] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Om-
mer. High-resolution image synthesis with latent diffusion models. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
[167] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for
biomedical image segmentation. In International Conference on Medical image computing
and computer-assisted intervention, pages 234–241. Springer, 2015.
[168] Radu Alexandru Rosu, Shunsuke Saito, Ziyan Wang, Chenglei Wu, Sven Behnke, and Giljoo
Nam. Neural strands: Learning hair geometry and appearance from multi-view images. In
European Conference on Computer Vision, pages 73–89. Springer, 2022.
[169] Nadine Rüegg, Silvia Zuffi, Konrad Schindler, and Michael J Black. Barc: Learning to regress
3D dog shape from images by exploiting breed information. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 3876–3884, 2022.
[170] Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and
Hao Li. Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization.
In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2304–
2314, 2019.
[171] Shunsuke Saito, Tianye Li, and Hao Li. Real-time facial segmentation and performance cap-
ture from RGB input. In European conference on computer vision, pages 244–261. Springer,
2016.
[172] Augusto Salazar, Stefanie Wuhrer, Chang Shu, and Flavio Prieto. Fully automatic
expression-invariant face correspondence. Machine Vision and Applications, 25(4):859–879,
2014.
[173] Soubhik Sanyal, Timo Bolkart, Haiwen Feng, and Michael Black. Learning to regress 3D
face shape and expression from an image without 3D supervision. In Proceedings IEEE Conf.
on Computer Vision and Pattern Recognition (CVPR), pages 7763–7772, 2019.
[174] Umme Sara, Morium Akter, and Mohammad Shorif Uddin. Image quality assessment
through FSIM, SSIM, MSE and PSNR - a comparative study. Journal of Computer and Commu-
nications, 7(3):8–18, 2019.
[175] Ben Saunders, Necati Cihan Camgoz, and Richard Bowden. Signing at scale: Learning to
co-articulate signs for large-scale photo-realistic sign language production. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5141–5151,
2022.
[176] Johannes L Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise
view selection for unstructured multi-view stereo. In European Conference on Computer
Vision, pages 501–518. Springer, 2016.
[177] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In
Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[178] Yeongho Seol, Wan-Chun Ma, and JP Lewis. Creating an actor-specific facial rig from
performance capture. In Proceedings of the 2016 Symposium on Digital Production, pages
13–17, 2016.
[179] Mike Seymour. The curious case of aging visual effects. https://www.fxguide.com/
fxfeatured/the_curious_case_of_aging_visual_effects/. Accessed: 2022-11-28.
[180] Mike Seymour. Real time mike! https://www.fxguide.com/fxfeatured/real-time-mike/,
2017. Accessed: 2022-12-08.
[181] Mike Seymour, Chris Evans, and Kim Libreri. Meet mike: Epic avatars. In ACM SIGGRAPH
2017 VR Village, SIGGRAPH ’17, New York, NY, USA, 2017. Association for Computing
Machinery.
[182] Qi Shan, Riley Adams, Brian Curless, Yasutaka Furukawa, and Steven M Seitz. The visual
turing test for scene reconstruction. In 2013 International Conference on 3D Vision-3DV
2013, pages 25–32. IEEE, 2013.
[183] Ari Shapiro, Andrew Feng, Ruizhe Wang, Hao Li, Mark Bolas, Gerard Medioni, and Evan
Suma. Rapid avatar capture and simulation using commodity depth sensors. Computer
Animation and Virtual Worlds, 25(3-4):201–211, 2014.
[184] Fuhao Shi, Hsiang-Tao Wu, Xin Tong, and Jinxiang Chai. Automatic acquisition of high-
fidelity facial performances using monocular videos. Transactions on Graphics (Proceedings
of SIGGRAPH Asia), 33(6):222:1–222:13, 2014.
[185] Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein, and
Michael Zollhofer. Deepvoxels: Learning persistent 3D feature embeddings. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2437–2446,
2019.
[186] Olga Sorkine, Daniel Cohen-Or, Yaron Lipman, Marc Alexa, Christian Rössl, and H-P Sei-
del. Laplacian surface editing. In Proceedings of the 2004 Eurographics/ACM SIGGRAPH
symposium on Geometry processing, pages 175–184, 2004.
[187] Pratul P Srinivasan, Richard Tucker, Jonathan T Barron, Ravi Ramamoorthi, Ren Ng, and
Noah Snavely. Pushing the boundaries of view extrapolation with multiplane images. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 175–
184, 2019.
[188] Jonathan Starck and Adrian Hilton. Surface capture for performance-based animation.
IEEE computer graphics and applications, 27(3):21–31, 2007.
[189] Zhaoqi Su, Tao Yu, Yangang Wang, and Yebin Liu. Deepcloth: Neural garment repre-
sentation for shape and style editing. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 2022.
[190] Robert W. Sumner and Jovan Popović. Deformation transfer for triangle meshes. Transac-
tions on Graphics (Proceedings of SIGGRAPH), 23(3):399–405, 2004.
[191] Supasorn Suwajanakorn, Ira Kemelmacher-Shlizerman, and Steven M Seitz. Total moving
face reconstruction. In European Conference on Computer Vision, pages 796–812, 2014.
[192] Supasorn Suwajanakorn, Steven M Seitz, and Ira Kemelmacher-Shlizerman. What makes
Tom Hanks look like Tom Hanks. In International Conference on Computer Vision, pages
3952–3960, 2015.
[193] Omid Taheri, Vasileios Choutas, Michael J Black, and Dimitrios Tzionas. Goal: Generating
4d whole-body motion for hand-object grasping. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 13263–13273, 2022.
[194] Matthew Tancik, Pratul Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan,
Utkarsh Singhal, Ravi Ramamoorthi, Jonathan Barron, and Ren Ng. Fourier features let
networks learn high frequency functions in low dimensional domains. Advances in Neural
Information Processing Systems, 33:7537–7547, 2020.
[195] Sarah Taylor, Taehwan Kim, Yisong Yue, Moshe Mahler, James Krahe, Anastasio Garcia
Rodriguez, Jessica Hodgins, and Iain Matthews. A deep learning approach for generalized
speech animation. ACM Transactions on Graphics (TOG), 36(4):1–11, 2017.
[196] A. Tewari, O. Fried, J. Thies, V. Sitzmann, S. Lombardi, K. Sunkavalli, R. Martin-Brualla,
T. Simon, J. Saragih, M. Nießner, R. Pandey, S. Fanello, G. Wetzstein, J.-Y. Zhu, C. Theobalt,
M. Agrawala, E. Shechtman, D. B Goldman, and M. Zollhöfer. State of the Art on Neural
Rendering. Computer Graphics Forum (EG STAR 2020), 2020.
[197] Ayush Tewari, Florian Bernard, Pablo Garrido, Gaurav Bharaj, Mohamed Elgharib, Hans-
Peter Seidel, Patrick Pérez, Michael Zollhoefer, and Christian Theobalt. Fml: Face model
learning from videos. In IEEE Conf. Comput. Vis. Pattern Recog., 2019.
[198] Ayush Tewari, Ohad Fried, Justus Thies, Vincent Sitzmann, Stephen Lombardi, Zexiang Xu,
Tomas Simon, Matthias Nießner, Edgar Tretschk, L. Liu, Ben Mildenhall, Pratul P. Srini-
vasan, Rohit Pandey, Sergio Orts-Escolano, Sean Ryan Fanello, M. Guo, Gordon Wetzstein,
Jun-Yan Zhu, Christian Theobalt, Maneesh Agrawala, Daniel B. Goldman, and Michael
Zollhöfer. Advances in neural rendering. In SIGGRAPH 2021: Special Interest Group on
Computer Graphics and Interactive Techniques Conference, Courses, Virtual Event, USA, Au-
gust 9-13, 2021. ACM, 2021.
[199] Ayush Tewari, Michael Zollhoefer, Florian Bernard, Pablo Garrido, Hyeongwoo Kim,
Patrick Perez, and Christian Theobalt. High-fidelity monocular face reconstruction based
on an unsupervised model-based face autoencoder. IEEE Trans. Pattern Anal. Mach. Intell.,
pages 1–1, 2018.
[200] Ayush Tewari, Michael Zollhoefer, Pablo Garrido, Florian Bernard, Hyeongwoo Kim,
Patrick Pérez, and Christian Theobalt. Self-supervised multi-level face model learning for
monocular reconstruction at over 250 hz. In IEEE Conf. Comput. Vis. Pattern Recog., pages
2549–2559, 2018.
[201] Ayush Tewari, Michael Zollhoefer, Hyeongwoo Kim, Pablo Garrido, Florian Bernard,
Patrick Perez, and Theobalt Christian. MoFA: Model-based Deep Convolutional Face Au-
toencoder for Unsupervised Monocular Reconstruction. In Int. Conf. Comput. Vis., 2017.
[202] Justus Thies, Michael Zollhöfer, Matthias Nießner, Levi Valgaerts, Marc Stamminger, and
Christian Theobalt. Real-time expression transfer for facial reenactment. Transactions on
Graphics (Proceedings of SIGGRAPH Asia), 34(6):183:1–183:14, 2015.
[203] Justus Thies, Michael Zollhöfer, Marc Stamminger, Christian Theobalt, and Matthias
Nießner. Face2face: Real-time face capture and reenactment of rgb videos. In CVPR, 2016.
[204] Anh Tuan Tran, Tal Hassner, Iacopo Masi, Eran Paz, Yuval Nirkin, and Gérard G Medioni.
Extreme 3d face reconstruction: Seeing through occlusions. In IEEE Conf. Comput. Vis.
Pattern Recog., pages 3935–3944, 2018.
[205] Luan Tran, Feng Liu, and Xiaoming Liu. Towards high-fidelity nonlinear 3d face morphable
model. In IEEE Conf. Comput. Vis. Pattern Recog., 2019.
[206] David Traum, Andrew Jones, Kia Hays, Heather Maio, Oleg Alexander, Ron Artstein, Paul
Debevec, Alesia Gainer, Kallirroi Georgila, Kathleen Haase, et al. New dimensions in testi-
mony: Digitally preserving a holocaust survivor’s interactive storytelling. In International
Conference on Interactive Digital Storytelling, pages 269–281. Springer, 2015.
[207] Edgar Tretschk, Ayush Tewari, Vladislav Golyanik, Michael Zollhöfer, Christoph Lassner,
and Christian Theobalt. Non-rigid neural radiance fields: Reconstruction and novel view
synthesis of a dynamic scene from monocular video. In IEEE International Conference on
Computer Vision (ICCV). IEEE, 2021.
[208] Ed Ulbrich. TED-Ed talk: How Benjamin Button got his face. https://youtu.be/52JqQkx_
VDc. Accessed: 2022-12-11.
[209] Ed Ulbrich. TED talk: How Benjamin Button got his face. https://youtu.be/jUIfiLplNqQ .
Accessed: 2022-11-28.
[210] Unreal Engine. The Matrix Awakens: An Unreal Engine 5 experience. https://youtu.be/
WU0gvPcc3jQ, 2021. Accessed: 2022-12-07.
[211] Paul Upchurch, Noah Snavely, and Kavita Bala. From a to z: supervised transfer of style
and content using deep neural network generators. arXiv preprint arXiv:1603.02003, 2016.
[212] Daniel Vlasic, Matthew Brand, Hanspeter Pfister, and Jovan Popović. Face transfer with
multilinear models. Transactions on Graphics (Proceedings of SIGGRAPH), 24(3):426–433,
2005.
[213] George Vogiatzis, Philip HS Torr, and Roberto Cipolla. Multi-view stereo via volumet-
ric graph-cuts. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR’05), volume 2, pages 391–398. IEEE, 2005.
[214] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catan-
zaro. High-resolution image synthesis and semantic manipulation with conditional gans.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[215] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and
Chen Change Loy. Esrgan: Enhanced super-resolution generative adversarial networks.
In The European Conference on Computer Vision Workshops (ECCVW), September 2018.
[216] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assess-
ment: from error visibility to structural similarity. IEEE transactions on image processing,
13(4):600–612, 2004.
[217] Thibaut Weise, Sofien Bouaziz, Hao Li, and Mark Pauly. Realtime performance-based facial
animation. Transactions on Graphics (Proceedings of SIGGRAPH), 30(4):77:1–77:10, 2011.
[218] Thibaut Weise, Bastian Leibe, and Luc Van Gool. Fast 3D scanning with automatic motion
compensation. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages
1–8. IEEE, 2007.
[219] Olivia Wiles, Georgia Gkioxari, Richard Szeliski, and Justin Johnson. Synsin: End-to-end
view synthesis from a single image. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 7467–7477, 2020.
[220] Daniel N Wood, Daniel I Azuma, Ken Aldinger, Brian Curless, Tom Duchamp, David H
Salesin, and Werner Stuetzle. Surface light fields for 3d photography. In Proceedings of
the 27th annual conference on Computer graphics and interactive techniques, pages 287–296,
2000.
[221] Erroll Wood, Tadas Baltrušaitis, Louis-Philippe Morency, Peter Robinson, and Andreas
Bulling. A 3D morphable eye region model for gaze estimation. In European Conference on
Computer Vision, pages 297–313, 2016.
[222] Chenglei Wu, Derek Bradley, Markus Gross, and Thabo Beeler. An anatomically-
constrained local deformation model for monocular face capture. Transactions on Graphics
(Proceedings of SIGGRAPH), 35(4):115:1–115:12, 2016.
[223] Fanzi Wu, Linchao Bao, Yajing Chen, Yonggen Ling, Yibing Song, Songnan Li, King Ngi
Ngan, and Wei Liu. Mvf-net: Multi-view 3D face morphable model regression. In Proceed-
ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June
2019.
[224] Wenqi Xian, Jia-Bin Huang, Johannes Kopf, and Changil Kim. Space-time neural irradiance
fields for free-viewpoint video. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 9421–9431, 2021.
[225] Xuehan Xiong and Fernando De la Torre. Supervised descent method and its applications to
face alignment. In Conference on Computer Vision and Pattern Recognition, pages 532–539,
2013.
[226] Yuliang Xiu, Jinlong Yang, Dimitrios Tzionas, and Michael J Black. Icon: Implicit clothed
humans obtained from normals. In 2022 IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (CVPR), pages 13286–13296. IEEE, 2022.
[227] Feng Xu, Jinxiang Chai, Yilong Liu, and Xin Tong. Controllable high-fidelity facial per-
formance transfer. Transactions on Graphics (Proceedings of SIGGRAPH), 33(4):42:1–42:11,
2014.
[228] Fei Yang, Jue Wang, Eli Shechtman, Lubomir Bourdev, and Dimitri Metaxas. Expression
flow for 3D-aware face component transfer. Transactions on Graphics (Proceedings of SIG-
GRAPH), 30(4):60:1–10, 2011.
[229] Gengshan Yang, Minh Vo, Natalia Neverova, Deva Ramanan, Andrea Vedaldi, and Han-
byul Joo. Banmo: Building animatable 3D neural models from many casual videos. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages
2863–2873, 2022.
[230] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. Mvsnet: Depth inference for
unstructured multi-view stereo. In Proceedings of the European Conference on Computer
Vision (ECCV), pages 767–783, 2018.
[231] Lior Yariv, Yoni Kasten, Dror Moran, Meirav Galun, Matan Atzmon, Basri Ronen, and Yaron
Lipman. Multiview neural surface reconstruction by disentangling geometry and appear-
ance. Advances in Neural Information Processing Systems, 33, 2020.
[232] Lijun Yin, Xiaozhou Wei, Yi Sun, Jun Wang, and M.J. Rosato. A 3D facial expression
database for facial behavior research. In International Conference on Automatic Face and
Gesture Recognition, pages 211–216, 2006.
[233] Jae Shin Yoon, Kihwan Kim, Orazio Gallo, Hyun Soo Park, and Jan Kautz. Novel view
synthesis of dynamic scenes with globally coherent depths from a monocular camera. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages
5336–5345, 2020.
[234] Wentao Yuan, Zhaoyang Lv, Tanner Schmidt, and Steven Lovegrove. Star: Self-supervised
tracking and reconstruction of rigid objects in motion with neural rendering. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13144–13152,
2021.
[235] Chao Zhang, William Smith, Arnaud Dessein, Nick Pears, and Hang Dai. Functional faces:
Groupwise dense correspondence using functional maps. In IEEE Conf. Comput. Vis. Pattern
Recog., 2016.
[236] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The un-
reasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 586–595, 2018.
[237] Yajie Zhao, Zeng Huang, Tianye Li, Weikai Chen, Chloe LeGendre, Xinglei Ren, Ari
Shapiro, and Hao Li. Learning perspective undistortion of portraits. In Proceedings of
the IEEE/CVF International Conference on Computer Vision, pages 7849–7859, 2019.
[238] Yufeng Zheng, Victoria Fernández Abrevaya, Marcel C Bühler, Xu Chen, Michael J Black,
and Otmar Hilliges. Im avatar: Implicit morphable head avatars from videos. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13545–13555,
2022.
[239] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo mag-
nification: Learning view synthesis using multiplane images. ACM Trans. Graph., 37(4), jul
2018.
[240] Yang Zhou, Zhan Xu, Chris Landreth, Evangelos Kalogerakis, Subhransu Maji, and Karan
Singh. Visemenet: Audio-driven animator-centric speech animation. ACM Transactions on
Graphics (TOG), 37(4):1–10, 2018.
[241] C Lawrence Zitnick, Sing Bing Kang, Matthew Uyttendaele, Simon Winder, and Richard
Szeliski. High-quality video view interpolation using a layered representation. ACM trans-
actions on graphics (TOG), 23(3):600–608, 2004.
Appendix A
Generic Face Modeling
Figure A.1: Sample registrations of the shape data extracted from the CAESAR body database.
Figure A.2: Sample registrations of the self-captured pose data. Top: Head rotations around
the neck. Bottom: Mouth articulations.
Figure A.3: Sample registrations of the expression data from D3DFACS (top) and self-cap-
tured sequences (bottom).
Data: FLAME is built from three heterogeneous sources, using more than 33,000 3D scans in
total. This comprises shape data (3800 shapes), pose data (8000 shapes), and 21,000 registered
expression frames sampled from the 69,000 registered expression frames.
Figure A.1 shows sample head registrations from the CAESAR database [165], illustrating the large
variation in shape present in the database. Figure A.2 shows samples of the captured neck rotation
(top) and jaw motions (bottom) used as pose data. Figure A.3 shows the expression data, namely
registrations of D3DFACS [46] (top) and self-captured sequences (bottom).
Registration Quality: Figure A.4 shows further sample registrations of the D3DFACS dataset
(top) and our self-captured sequences (bottom). Our registration is able to track subtle motions
such as raising the eyebrows (Figure A.4 top) or extreme facial expressions such as a wide open
mouth (Figure A.4 bottom).
Model Quality: Figure A.5 gives further qualitative evaluations on the influence of a varying
number of identity components for fitting the neutral BU-3DFE face scans. Increasing the number
of components increases the ability of the model to reconstruct localized details. FLAME 300 leads
to registrations with an error that is close to zero millimeters in most facial regions. Figure A.6
gives further qualitative comparisons of BFM [152], FW [38], and FLAME. Compared to FLAME,
BFM has many high-frequency details that make the fits look more realistic. Nevertheless, the
comparison with the scans reveals that these details are hallucinated and spurious, as they come
from people in the original training dataset, rather than from the scans. While lower-resolution
and less detailed, FLAME is actually more accurate.
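To make the "FLAME 49 / 90 / 300" labels in Figure A.5 concrete, the sketch below illustrates how a neutral face is reconstructed from only the first k identity components of a linear PCA shape space, which is what truncating the identity model amounts to. The array names (mean_shape, shape_basis, coeffs) are illustrative placeholders, not the released model's exact fields.

```python
# Hedged sketch: reconstruct a neutral shape from the first k identity components.
import numpy as np

def reconstruct_identity(mean_shape, shape_basis, coeffs, k):
    """mean_shape: (V, 3); shape_basis: (3*V, K); coeffs: (K,); k <= K."""
    offsets = shape_basis[:, :k] @ coeffs[:k]      # keep only the first k components
    return mean_shape + offsets.reshape(-1, 3)     # reconstructed neutral shape, (V, 3)
```

Using more components (larger k) lets the reconstruction capture more localized shape detail, which is exactly the trend visible in Figure A.5.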
Shape Reconstruction from Images: Figure A.7 shows the 2D landmark fitting using FW
(top) and FLAME (bottom). FLAME better fits the identity and produces a lower 3D scan distance.
Expression Transfer: Figure A.8 shows the expression transfer between a subject in our test
dataset and a high-resolution static scan of Beeler et al. [18]. The synthetic sequence looks real-
istic despite the large face shape difference between the source and the target.
(Error color scale: 0 mm to >1 mm.)
Figure A.4: Registration quality. Sample frames, registrations, and scan-to-mesh distance of
one sequence of the D3DFACS database (top) and one of our self-captured sequences
(bottom). The texture-based registration allows tracking of subtle motions such as raising eyebrows
(top).
(Panels, left to right: Scan, FLAME 49, Error FLAME 49, FLAME 90, Error FLAME 90, FLAME 300, Error FLAME 300. Error color scale: 0 mm to >1 mm.)
Figure A.5: Expressiveness of the FLAME identity space for fitting neutral scans of the BU-
3DFE face database with a varying number of identity components.
(Panels, left to right: Scan, BFM Full, Error BFM Full, FW, Error FW, FLAME 198, Error FLAME 198. Error color scale: 0 mm to >1 mm.)
Figure A.6: Additional comparison on identity space of Basel Face Model (BFM) [152], Face-
Warehouse model [38] and FLAME for fitting neutral scans of the BU-3DFE database.
(Panels, left to right: Image / Scan, Fitting, Fitting Error. Error color scale: 0 mm to >10 mm.)
Figure A.7: Additional comparison on 3D face fitting from a single 2D image of the FaceWare-
house model (top) and FLAME (bottom). Note that the scan is only used for evaluation.
Figure A.8: Additional results on expression transfer from a source sequence (blue) to a
static target scan (pink). The aligned personalized template for the scan is shown in green, the
transferred expression in yellow.
Appendix B
Efficient Inference for Topologically Consistent Face
Meshes
Additional Quantitative Results: Tab. B.1 provides additional quantitative comparisons to
other learning-based methods, namely 3DMM regression and DFNRMVS [11]. Fig. B.1 shows
the cumulative error curves for scan-to-mesh distances among the methods. All methods are
evaluated on a common held-out test set with 499 ground truth 3D scans; no data from the test
subjects is used during training. The geometric reconstruction accuracy is evaluated using the scan-to-
mesh distance (s2m), which measures the distance between each vertex of a ground truth scan
and the closest point on the surface of the reconstructed mesh. The correspondence accuracy is
evaluated using the vertex-to-vertex distance (v2v), which measures the distance between each vertex
of a registered ground truth mesh and the semantically corresponding point in the reconstructed
mesh.
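The two metrics can be summarized with the minimal sketch below. It is not the evaluation code used here: the true scan-to-mesh distance uses the closest point on the mesh surface, which the sketch approximates by the nearest point among densely sampled surface points, and all array names are illustrative.

```python
# scan_points: (S, 3) scan vertices; surface_samples: (P, 3) points densely sampled
# on the reconstructed mesh surface (approximation of the point-to-surface distance);
# gt_vertices / recon_vertices: (V, 3) meshes in vertex correspondence.
import numpy as np
from scipy.spatial import cKDTree

def scan_to_mesh(scan_points, surface_samples):
    """s2m: distance from each scan vertex to its nearest reconstructed surface sample."""
    dists, _ = cKDTree(surface_samples).query(scan_points, k=1)
    return dists                                                  # (S,)

def vertex_to_vertex(gt_vertices, recon_vertices):
    """v2v: distance between semantically corresponding vertices."""
    return np.linalg.norm(gt_vertices - recon_vertices, axis=1)   # (V,)

# Median errors, in the same units as the inputs (millimeters in Tab. B.1):
# median_s2m = np.median(scan_to_mesh(scan_points, surface_samples))
# median_v2v = np.median(vertex_to_vertex(gt_vertices, recon_vertices))
```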
Our method (without post-processing) outperforms the existing methods (without and with
post-processing) in terms of geometric reconstruction quality and the quality of the correspon-
dence. Note that while the distance of DFNRMVS [11] is higher than for the 3DMM regression,
Methods              median s2m    median v2v
3DMM Regr.           2.104         3.662
3DMM Regr. (PP)      1.659         2.890
DFNRMVS [11] (PP)    1.885         4.565
Our Method           0.585         1.973
Table B.1: Comparison of geometry accuracy (median s2m) and correspondence accuracy (median
v2v) among the learning-based methods, measured in millimeters. “PP” denotes the result after a
post-processing Procrustes alignment that solves for the optimal rigid pose (i.e. 3D rotation and
translation) and scale to best align the reconstructed mesh with the ground truth. Note that our
method requires no post-processing.
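For reference, the "PP" alignment in Tab. B.1 can be computed in closed form. The sketch below is a standard similarity Procrustes (Umeyama-style) solution for two corresponding vertex sets, shown for illustration only and not the exact implementation used for the baselines.

```python
# Minimal sketch: similarity Procrustes alignment (scale, rotation, translation)
# between two (V, 3) vertex arrays in row-wise correspondence.
import numpy as np

def similarity_procrustes(source, target):
    """Return s, R, t minimizing ||s * R @ x + t - y|| over corresponding rows x, y."""
    mu_s, mu_t = source.mean(axis=0), target.mean(axis=0)
    src, tgt = source - mu_s, target - mu_t
    cov = tgt.T @ src / len(source)                            # 3x3 cross-covariance
    U, S, Vt = np.linalg.svd(cov)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])    # guard against reflection
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / src.var(axis=0).sum()       # optimal isotropic scale
    t = mu_t - s * (R @ mu_s)
    return s, R, t

# aligned = (s * (R @ recon_vertices.T)).T + t   # reconstructed mesh after "PP"
```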
(Plot: cumulative error curves. Horizontal axis: distance (mm), 0–7; vertical axis: percentage (%), 0–100. Curves: 3DMM (w/o rigid ICP), 3DMM (with rigid ICP), Ours.)
Figure B.1: Quantitative evaluation by cumulative error curves for scan-to-mesh distances
among learning-based methods.
(Panels: DFNRMVS [11] output mesh; scan-to-mesh distance.)
Figure B.2: Example results from DFNRMVS [11].
(Rows: Camera A, Camera B, Camera C; columns: Frame 0, 20, 40, 60, 80, 100, 120, 140.)
Figure B.3: Dynamic facial performance capture using ToFu. Base mesh reconstruction
for a multi-view video sequence overlaid on the video frames. Our method captures the facial
performance well. The resulting meshes are temporally stable and accurately align with the input
images. Visualizing with a shared checkerboard texture indicates good tracking quality. Please
see the supplemental video for better visualization.
DFNRMVS [11] is visually better in most regions. Their reconstructed meshes tend to have large
errors in the forehead and in the jaw areas, as shown in Fig. B.2, due to a different mask defini-
tion for their on-the-fly deep photometric refinement. Fig. 3.6 shows that our method produces
significantly better reconstructions than DFNRMVS [11] across the entire face.
Results on Dynamic Facial Capture: We evaluate our trained model on a multi-view video
sequence with 8 calibrated and synchronized views, captured at 30 fps. We apply our pro-
gressive mesh generation network in a frame-by-frame manner, without applying any tempo-
ral smoothing. Fig. B.3 shows that our base mesh captures the extreme expressions well, and
it aligns well with the input images. Despite being trained on static images only, the result-
ing reconstruction is temporally stable, as shown in the supplemental video, which is hosted
on https://tianyeli.github.io/tofu. Fig. B.9 shows additional base mesh reconstructions for
different static multi-view images of varying subjects in different expressions. Our method re-
constructs the face shape and expression well, closely matching the ground truth scans. We show more
visualizations in the supplemental video.
Impact of Local Refinements: Fig. B.5 shows the cumulative error curves for scan-to-mesh
distances among the local stages. Given the coarse mesh M0 as output of the global stage, each
local stage successively increases the mesh resolution and refines the vertex locations. Fig. B.4
demonstrates the effect of each local refinement step. As shown in Fig. B.4, the quality of the
reconstructed mesh improves after each local stage, while the scan-to-mesh distance
decreases. Note that details such as nose corners and lips gradually improve through the local
stages.
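The coarse-to-fine structure described above can be sketched as follows. This is a hedged, self-contained illustration: upsample and refine are stand-ins for the network's mesh-upsampling and local-refinement stages, not its real layers, and the dummy data only exists to make the sketch runnable.

```python
# Illustrative coarse-to-fine loop: start from the coarse global-stage mesh and
# repeatedly upsample and refine it, mirroring the progression M0 -> M1 -> M2 -> M3 = M.
import numpy as np

def upsample(vertices):
    """Stub: increase mesh resolution, e.g. by inserting edge midpoints."""
    midpoints = 0.5 * (vertices + np.roll(vertices, -1, axis=0))
    return np.concatenate([vertices, midpoints], axis=0)

def refine(vertices):
    """Stub: predict small per-vertex corrections (zeros here)."""
    return np.zeros_like(vertices)

mesh = np.random.rand(16, 3)        # stands in for the coarse global-stage mesh M0
for _ in range(3):                  # three local stages
    mesh = upsample(mesh)           # successively increase mesh resolution
    mesh = mesh + refine(mesh)      # refine vertex locations
```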
More Ablation on Number of Views: Fig. B.6 shows the cumulative error curves for scan-
to-mesh distances for networks with different numbers of input views.
More Results on Appearance and Detail Capture: Fig. B.10 shows additional results of the
appearance enhancement network, which predicts normal displacements and additional albedo
(Figure B.4 panels: progressively generated meshes M0, M1, M2, M3 = M and the corresponding scan-to-mesh distance.)
AAAB93icbVBNS8NAFHypX7V+tOrRy2IRPJVEBBU8FLx4ESoYW2hD2Gw37dLNJuxuhBryS7x4UPHqX/Hmv3HT5qCtAwvDzHu82QkSzpS27W+rsrK6tr5R3axtbe/s1ht7+w8qTiWhLol5LHsBVpQzQV3NNKe9RFIcBZx2g8l14XcfqVQsFvd6mlAvwiPBQkawNpLfqA8irMcE8+w29zMn9xtNu2XPgJaJU5ImlOj4ja/BMCZpRIUmHCvVd+xEexmWmhFO89ogVTTBZIJHtG+owBFVXjYLnqNjowxRGEvzhEYz9fdGhiOlplFgJouYatErxP+8fqrDCy9jIkk1FWR+KEw50jEqWkBDJinRfGoIJpKZrIiMscREm65qpgRn8cvLxD1tXbacu7Nm+6psowqHcAQn4MA5tOEGOuACgRSe4RXerCfrxXq3PuajFavcOYA/sD5/AFuIkxI=
M
2
AAAB93icbVBNS8NAFHzxs9aPVj16WSyCp5IUQQUPBS9ehArGFtoQNttNu3SzCbsboYb8Ei8eVLz6V7z5b9y0OWjrwMIw8x5vdoKEM6Vt+9taWV1b39isbFW3d3b3avX9gwcVp5JQl8Q8lr0AK8qZoK5mmtNeIimOAk67weS68LuPVCoWi3s9TagX4ZFgISNYG8mv1wYR1mOCeXab+1kr9+sNu2nPgJaJU5IGlOj49a/BMCZpRIUmHCvVd+xEexmWmhFO8+ogVTTBZIJHtG+owBFVXjYLnqMTowxRGEvzhEYz9fdGhiOlplFgJouYatErxP+8fqrDCy9jIkk1FWR+KEw50jEqWkBDJinRfGoIJpKZrIiMscREm66qpgRn8cvLxG01L5vO3VmjfVW2UYEjOIZTcOAc2nADHXCBQArP8Apv1pP1Yr1bH/PRFavcOYQ/sD5/AF0MkxM=
AAAB93icbVBNS8NAFHzxs9aPVj16WSyCp5IUQQUPBS9ehArGFtoQNttNu3SzCbsboYb8Ei8eVLz6V7z5b9y0OWjrwMIw8x5vdoKEM6Vt+9taWV1b39isbFW3d3b3avX9gwcVp5JQl8Q8lr0AK8qZoK5mmtNeIimOAk67weS68LuPVCoWi3s9TagX4ZFgISNYG8mv1wYR1mOCeXab+1kr9+sNu2nPgJaJU5IGlOj49a/BMCZpRIUmHCvVd+xEexmWmhFO8+ogVTTBZIJHtG+owBFVXjYLnqMTowxRGEvzhEYz9fdGhiOlplFgJouYatErxP+8fqrDCy9jIkk1FWR+KEw50jEqWkBDJinRfGoIJpKZrIiMscREm66qpgRn8cvLxG01L5vO3VmjfVW2UYEjOIZTcOAc2nADHXCBQArP8Apv1pP1Yr1bH/PRFavcOYQ/sD5/AF0MkxM=
AAAB93icbVBNS8NAFHzxs9aPVj16WSyCp5IUQQUPBS9ehArGFtoQNttNu3SzCbsboYb8Ei8eVLz6V7z5b9y0OWjrwMIw8x5vdoKEM6Vt+9taWV1b39isbFW3d3b3avX9gwcVp5JQl8Q8lr0AK8qZoK5mmtNeIimOAk67weS68LuPVCoWi3s9TagX4ZFgISNYG8mv1wYR1mOCeXab+1kr9+sNu2nPgJaJU5IGlOj49a/BMCZpRIUmHCvVd+xEexmWmhFO8+ogVTTBZIJHtG+owBFVXjYLnqMTowxRGEvzhEYz9fdGhiOlplFgJouYatErxP+8fqrDCy9jIkk1FWR+KEw50jEqWkBDJinRfGoIJpKZrIiMscREm66qpgRn8cvLxG01L5vO3VmjfVW2UYEjOIZTcOAc2nADHXCBQArP8Apv1pP1Yr1bH/PRFavcOYQ/sD5/AF0MkxM=
AAAB93icbVBNS8NAFHzxs9aPVj16WSyCp5IUQQUPBS9ehArGFtoQNttNu3SzCbsboYb8Ei8eVLz6V7z5b9y0OWjrwMIw8x5vdoKEM6Vt+9taWV1b39isbFW3d3b3avX9gwcVp5JQl8Q8lr0AK8qZoK5mmtNeIimOAk67weS68LuPVCoWi3s9TagX4ZFgISNYG8mv1wYR1mOCeXab+1kr9+sNu2nPgJaJU5IGlOj49a/BMCZpRIUmHCvVd+xEexmWmhFO8+ogVTTBZIJHtG+owBFVXjYLnqMTowxRGEvzhEYz9fdGhiOlplFgJouYatErxP+8fqrDCy9jIkk1FWR+KEw50jEqWkBDJinRfGoIJpKZrIiMscREm66qpgRn8cvLxG01L5vO3VmjfVW2UYEjOIZTcOAc2nADHXCBQArP8Apv1pP1Yr1bH/PRFavcOYQ/sD5/AF0MkxM=
M
3
=M
AAACBXicbVBNS8NAEJ3Ur1q/oh5FWCyCp5KooIJCwYsXoYKxhTaUzXbTLt18sLsRSsjJi3/FiwcVr/4Hb/4bN20OtfXBwOO9GWbmeTFnUlnWj1FaWFxaXimvVtbWNza3zO2dBxklglCHRDwSLQ9LyllIHcUUp61YUBx4nDa94XXuNx+pkCwK79Uopm6A+yHzGcFKS11zvxNgNSCYp7dZNz3J0BWaVsyqVbPGQPPELkgVCjS65nenF5EkoKEiHEvZtq1YuSkWihFOs0onkTTGZIj7tK1piAMq3XT8RoYOtdJDfiR0hQqN1emJFAdSjgJPd+YnylkvF//z2onyz92UhXGiaEgmi/yEIxWhPBPUY4ISxUeaYCKYvhWRARaYKJ1cRYdgz748T5zj2kXNvjut1i+LNMqwBwdwBDacQR1uoAEOEHiCF3iDd+PZeDU+jM9Ja8koZnbhD4yvX1qVmJg=
AAACBXicbVBNS8NAEJ3Ur1q/oh5FWCyCp5KooIJCwYsXoYKxhTaUzXbTLt18sLsRSsjJi3/FiwcVr/4Hb/4bN20OtfXBwOO9GWbmeTFnUlnWj1FaWFxaXimvVtbWNza3zO2dBxklglCHRDwSLQ9LyllIHcUUp61YUBx4nDa94XXuNx+pkCwK79Uopm6A+yHzGcFKS11zvxNgNSCYp7dZNz3J0BWaVsyqVbPGQPPELkgVCjS65nenF5EkoKEiHEvZtq1YuSkWihFOs0onkTTGZIj7tK1piAMq3XT8RoYOtdJDfiR0hQqN1emJFAdSjgJPd+YnylkvF//z2onyz92UhXGiaEgmi/yEIxWhPBPUY4ISxUeaYCKYvhWRARaYKJ1cRYdgz748T5zj2kXNvjut1i+LNMqwBwdwBDacQR1uoAEOEHiCF3iDd+PZeDU+jM9Ja8koZnbhD4yvX1qVmJg=
AAACBXicbVBNS8NAEJ3Ur1q/oh5FWCyCp5KooIJCwYsXoYKxhTaUzXbTLt18sLsRSsjJi3/FiwcVr/4Hb/4bN20OtfXBwOO9GWbmeTFnUlnWj1FaWFxaXimvVtbWNza3zO2dBxklglCHRDwSLQ9LyllIHcUUp61YUBx4nDa94XXuNx+pkCwK79Uopm6A+yHzGcFKS11zvxNgNSCYp7dZNz3J0BWaVsyqVbPGQPPELkgVCjS65nenF5EkoKEiHEvZtq1YuSkWihFOs0onkTTGZIj7tK1piAMq3XT8RoYOtdJDfiR0hQqN1emJFAdSjgJPd+YnylkvF//z2onyz92UhXGiaEgmi/yEIxWhPBPUY4ISxUeaYCKYvhWRARaYKJ1cRYdgz748T5zj2kXNvjut1i+LNMqwBwdwBDacQR1uoAEOEHiCF3iDd+PZeDU+jM9Ja8koZnbhD4yvX1qVmJg=
AAACBXicbVBNS8NAEJ3Ur1q/oh5FWCyCp5KooIJCwYsXoYKxhTaUzXbTLt18sLsRSsjJi3/FiwcVr/4Hb/4bN20OtfXBwOO9GWbmeTFnUlnWj1FaWFxaXimvVtbWNza3zO2dBxklglCHRDwSLQ9LyllIHcUUp61YUBx4nDa94XXuNx+pkCwK79Uopm6A+yHzGcFKS11zvxNgNSCYp7dZNz3J0BWaVsyqVbPGQPPELkgVCjS65nenF5EkoKEiHEvZtq1YuSkWihFOs0onkTTGZIj7tK1piAMq3XT8RoYOtdJDfiR0hQqN1emJFAdSjgJPd+YnylkvF//z2onyz92UhXGiaEgmi/yEIxWhPBPUY4ISxUeaYCKYvhWRARaYKJ1cRYdgz748T5zj2kXNvjut1i+LNMqwBwdwBDacQR1uoAEOEHiCF3iDd+PZeDU+jM9Ja8koZnbhD4yvX1qVmJg=
Figure B.4: InferredmeshesforeachleveloftheToFupipeline. global stageM
0
and after
upsampling and renement for each local stageM
i
(1i 3).
Figure B.5: Quantitative evaluation by cumulative error curves for scan-to-mesh distances (in mm, up to 3.0 mm) among the local refinement stages, from Stage 0 (init) to Stage 3 (final).
and specular maps on top of the predicted base mesh M (see Fig. 3.2). Our reconstruction pipeline (i.e., base mesh reconstruction plus appearance and detail capture) enables us to reconstruct a 3D face with high-quality assets, two to three orders of magnitude faster than existing methods, and the results can readily be used for photorealistic rendering.
Figure B.6: Quantitative evaluation by cumulative error curves for scan-to-mesh distances (in mm, up to 3.0 mm) for varying numbers of input views (4, 8, and 15 views).
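To make the metric behind these curves concrete, the following is a minimal sketch (with assumed inputs and names) of how such a cumulative error curve can be computed from per-point scan-to-mesh distances; it is an illustration, not the exact evaluation code used here.

import numpy as np

def cumulative_error_curve(distances_mm, thresholds_mm):
    # distances_mm: (N,) per-point scan-to-mesh distances in millimeters.
    # Returns, for each threshold, the percentage of points with distance <= threshold.
    d = np.sort(np.asarray(distances_mm))
    return [100.0 * np.searchsorted(d, t, side="right") / len(d) for t in thresholds_mm]

# e.g. cumulative_error_curve(d, np.linspace(0.0, 3.0, 31)) for the 0-3 mm range plotted above.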
Results on Clothed Human Body Datasets: While we focus on face meshes in correspondence, we find that our method can also predict clothed full-body meshes in correspondence. We test our method on a dataset of human bodies, as shown in Fig. B.7. Human bodies are challenging due to large pose variations and occlusions. Given these challenging inputs, our method still outputs detailed geometry that closely fits the ground truth surfaces with small scan-to-mesh distances, as shown in Fig. B.7. The checkerboard projection also shows the accuracy of semantic correspondence across extreme poses. These results demonstrate the flexibility of our method for highly articulated and diverse surfaces.
Figure B.7: ToFu results on clothed human bodies. Our system can also infer clothed human body surfaces in a consistent topology (columns: scan, mesh, overlay, scan-to-mesh distance, checkerboard rendering).

Albedo: While the input images in our datasets are diffuse albedo images obtained with polarized lighting and cameras [76, 126], the results in Fig. 3.11 indicate that our system can be adapted to non-lightstage setups, e.g., the capture system of CoMA [161]. The appearance capture network learns the mapping between albedo images and the details of specular reflectance and fine geometry, as an "image-to-image translation". This synthesis is reasonable since the input images contain pore-level details and the outputs are pixel-aligned. However, imperfect albedo images can potentially contain more information on specularity, which in principle could guide the synthesis network to better recover details. This is an interesting direction for future work.
The E operator: Let B be the batch size and N the number of vertices. Given a feature volume L_g from the global volumetric feature sampling, the global geometry network (a 3D ConvNet) predicts a probability volume C_g of size (B, N, 32, 32, 32), whose N channels follow a predefined vertex order. Finally, the soft arg-max operator E computes the expectation over C_g per channel and outputs vertices of shape (B, N, 3) in the same predefined order.
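As a concrete illustration, here is a minimal PyTorch sketch of such a soft arg-max over a per-vertex probability volume; the softmax normalization and the normalized output coordinates are assumptions made to keep the snippet self-contained, not the exact thesis code.

import torch

def soft_argmax_3d(prob: torch.Tensor) -> torch.Tensor:
    # prob: probability volume of shape (B, N, D, H, W), e.g. (B, N, 32, 32, 32).
    B, N, D, H, W = prob.shape
    # Normalize each (B, N) channel so it sums to 1 over the grid (assumed here).
    p = prob.flatten(2).softmax(dim=-1).view(B, N, D, H, W)
    # Coordinate grids in [0, 1] along each axis of the volume.
    zs = torch.linspace(0.0, 1.0, D, device=prob.device)
    ys = torch.linspace(0.0, 1.0, H, device=prob.device)
    xs = torch.linspace(0.0, 1.0, W, device=prob.device)
    z, y, x = torch.meshgrid(zs, ys, xs, indexing="ij")
    grid = torch.stack([x, y, z], dim=-1)                # (D, H, W, 3)
    # Per-channel expectation: sum of probability times coordinate.
    return (p.unsqueeze(-1) * grid).sum(dim=(2, 3, 4))   # (B, N, 3)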
Figure B.8: Visualization of cross-subject dense correspondence of the base meshes inferred by ToFu, shown with a shared checkerboard texture.
On Dense Correspondence: Dense correspondence across identities and expressions is a challenging task [116, 26]. Cross-identity dense correspondence is fundamentally difficult to define beyond significant landmarks, especially in texture-less regions. State-of-the-art methods rely on landmarks and propagate the dense correspondence via statistical (e.g., 3DMM) or physical constraints (e.g., Laplacian regularization) in a carefully designed optimization process with manual adjustments. Cross-expression correspondence, however defined, can be enforced by photometric consistency (optical flow or differentiable rendering). Our ground truth datasets utilize all of these state-of-the-art strategies and can therefore be regarded as among the best curated datasets available. With the best ground truth currently obtainable, we trained our network in a supervised manner against the ground truth meshes (same topology) with equal vertex weights. Measuring distances to the ground truth (v2v and landmark errors) gives informative and reliable cross-expression evaluations of dense correspondence quality. Furthermore, photometric error visualizations on a shared UV map and the stable rendering of the reconstructed sequence, as in Fig. B.3, both qualitatively show the high quality of the cross-expression correspondence.
However, quantitatively evaluating cross-identity dense correspondence is by nature difficult. The two metrics above only indirectly measure cross-subject correspondence. Here we show additional visualizations by rendering the inferred meshes with a shared checkerboard texture and highlighting some facial landmarks in Fig. B.8. The meshes inferred by ToFu preserve dense semantic correspondences across subjects and expressions, as shown by the landmarks and the uniquely textured regions.
Implementation Details: The appearance enhancement synthesis network uses an architecture and losses similar to those proposed by Wang et al. [214]. We train the global generator and 2 multi-scale discriminators at a resolution of 512 × 512. The main difference is that we extract features from the two inputs separately before concatenating them and feeding them into the convolutional back-end, so that we can better encode useful features from each input. The network is trained using an Adam optimizer with a learning rate of 2e−4 (decayed after epoch 100) and a batch size of 32 on an NVIDIA GeForce GTX 1080 GPU. For further enhancement, we trained a separate super-resolution network, upsampling attribute maps from 512 to 4K resolution. We modify the network design of ESRGAN [215] by expanding the number of Residual-in-Residual Dense Blocks (RRDB) from 23 to 32, raising the upsampling capacity from 4× to 8× in a single pass. The super-resolution network is trained with a learning rate of 1e−4 (halved at 50K, 100K, and 200K iterations) and a batch size of 16 on two NVIDIA GeForce GTX 1080 GPUs.
Figure B.9: More results of reconstructed meshes in dense correspondence. For each example we show the input images (4 out of 15), the reference scan, the output mesh, the overlay, and the scan-to-mesh distance, visualized color-coded on the reference scan, where red denotes an error above 5 millimeters.
Figure B.10: ToFu-inferred facial appearances. Our method generates reliable base alignment meshes, on top of which a comprehensive face modeling pipeline can be built. Here we show additional renderings of the base mesh, with inferred normal displacements, and of the full model with albedo and specular details, along with zoom-ins to the fine details.
Appendix C
General Capture and Modeling with Dynamic Neural Radiance Fields
C.1 Supplemental Video
We strongly recommend that the reader watch our supplemental video, hosted at the project website https://neural-3d-video.github.io/, to better judge the photorealism of our approach at high resolution, which cannot be represented well by the metrics. The supplemental video includes:
• 3D video synthesis results on various dynamic scenes, including challenging dynamic topology changes, fast motion, view-dependent effects such as specularity and transparency, varying illuminations and shadows, and volumetric effects such as steam and fire;
• A short presentation of the method (the DyNeRF representation and the efficient training method);
• Video comparisons to baseline methods, including NeRF-T, DyNeRF-noIS, LLFF [133], and NeuralVolumes [123];
• Visualization of the estimated geometry (rendered as depth maps);
• Slow-motion and bullet-time effects produced by our DyNeRF;
• More results on more challenging indoor scenes;
• Results on the immersive video datasets of [34];
• Demonstration of interactive playback of our 3D videos in a commodity VR headset (Quest 2) using layered meshes distilled from our pretrained DyNeRF model;
• Limitations of our results on more challenging outdoor scenes.
C.2 Datasets
Figure C.1: Our multi-view capture setup using synchronized GoPro Black Hero 7 cameras.
Details on the Capture Setup: We build a mobile multi-view capture system using 21 GoPro Black Hero 7 cameras, as shown in Fig. C.1. For all results discussed in Chapter 4, we capture videos using the linear camera mode at a resolution of 2028 × 2704 (2.7K) and a frame rate of 30 FPS. The multi-view inputs are synchronized by a timecode system, and the camera intrinsic and extrinsic parameters are obtained by COLMAP [177] and are kept the same throughout the capture.
Figure C.2: Frames from our captured multi-view video, the flame salmon sequence (top). We use 18 camera views for training (downsized on the right) and hold out the upper-row center view of the rig as the novel view for quantitative evaluation. We captured sequences at different physical locations, at different times, and under varying illumination conditions. Our data presents a large variety of challenges for high-quality, wide-angle 3D video synthesis.
Our collected data provides sufficient synchronized camera views for high-quality 4D reconstruction of challenging dynamic objects and view-dependent effects in a natural, everyday indoor environment, which did not previously exist in public 4D datasets. Our captured data exhibits a variety of challenges for video synthesis, including objects with high specularity, translucency, and transparency. It also contains scene changes and motions with changing topology (poured liquid), self-cast moving shadows, volumetric effects (fire flame), an entangled moving object with strong view-dependent effects (the torch gun and the pan), various lighting conditions (daytime, night, spotlight from the side), and multiple people moving around in an open living-room space with outdoor scenes seen through transparent windows under relatively dark indoor illumination. We visualize one snapshot of the sequence in Fig. C.2. Unless otherwise stated, we use keyframes that are 30 frames apart. In total, we trained our methods on a 60-second video sequence (flame salmon) in 6 chunks of 10 seconds each, five other 10-second cooking videos captured at different times with different motion and lighting, and one 25-second indoor video in 5 chunks. We also trained a few additional videos of outdoor scenes in chunks of 5 seconds with denser keyframes, which are 10 frames apart. In the end, we employ a subset of 18 camera views for training and 1 view for quantitative evaluation for all datasets, except one sequence observing multiple people moving, which uses only 14 camera views for training. We calculate a continuous interpolated spiral trajectory based on the training camera views, which we employ for qualitative novel-view evaluation.
We found that the GoPro linear FOV mode compensates sufficiently well for fisheye effects, so we employ a pinhole camera model for all our experiments. For all training, we hold out the top center camera for testing and use the rest of the cameras for training. For each captured multi-view sequence, we removed a particular camera if its time synchronization did not work. We also noticed inconsistent appearance in some video streams, caused by different lighting sources observed from different view angles, which we excluded from training.
Additional Immersive Videos from [34]: We also demonstrate our method on the multi-view captured videos from [34], which have recently been made publicly available. Due to time constraints, we train DyNeRF models individually on a few 5-second video clips from "Welder", "Flames", and "Alexa Meade Face Paint" to validate our algorithm. There are a few differences in their capture setup which pose different opportunities and challenges for our method. First, unlike our captured linear-camera videos, which are front-facing, their videos are captured on a half-spherical inside-out rig with heavy distortion in each view. Second, their rig is composed of 46 cameras per scene, more than twice as many training cameras as ours. Successfully training on these scenes with DyNeRF requires us to compress a larger dynamic view space and to utilize all training video pixels more efficiently. During training, we sample the rays directly from the raw resolution of the distorted multi-view videos and render the novel-view video using a pinhole camera. We demonstrate that our algorithm can work on this type of data to create an immersive 3D video experience without any change to the representation.
C.3 Importance Sampling Schemes
Sampling Based on Global Median Maps (DyNeRF-ISG). For each ground truth video, we first calculate the global median value of each ray over all time stamps, C̄(r) = median_{t∈T} C^(t)(r), and cache the global median image. During training, we compare each frame to the global median image and compute the residual. We choose a robust norm of the residuals to balance the contrast of the weights. The norm is applied to values transformed by a non-linear transfer function ψ(·), parameterized by γ to adjust the sensitivity at various ranges of variance:

W^{(t)}(\mathbf{r}) = \frac{1}{3} \left\| \psi\!\left( C^{(t)}(\mathbf{r}) - \bar{C}(\mathbf{r});\, \gamma \right) \right\|_1 .   (C.1)

Here, ψ(x; γ) = x² / (x² + γ²) is the Geman-McClure robust function [74], applied element-wise. Intuitively, a larger γ leads to a higher probability of sampling the time-variant regions, while γ approaching zero approximates uniform sampling. C̄(r) is a representative image across time, which could also take other forms such as a mean image. We empirically validated that using a median image is more effective at handling the high-frequency signal of moving regions across time, which helps us approach sharp results faster during training.
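A minimal sketch of the DyNeRF-ISG weight computation in Eq. (C.1), assuming per-view frames and a precomputed median image stored as (H, W, 3) tensors in [0, 1]; names and shapes are illustrative, not the exact training code.

import torch

def geman_mcclure(x: torch.Tensor, gamma: float) -> torch.Tensor:
    # Element-wise Geman-McClure robust function psi(x; gamma) = x^2 / (x^2 + gamma^2).
    return x ** 2 / (x ** 2 + gamma ** 2)

def isg_weights(frame: torch.Tensor, median_img: torch.Tensor, gamma: float) -> torch.Tensor:
    # frame, median_img: (H, W, 3); returns per-ray weights of shape (H, W), as in Eq. (C.1).
    residual = frame - median_img
    return geman_mcclure(residual, gamma).mean(dim=-1)  # (1/3) * L1 norm over the 3 channels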
Sampling Based on Temporal Difference (DyNeRF-IST): An alternative strategy, DyNeRF-IST, calculates the residuals by considering two nearby frames at times t_i and t_j. In each training iteration we load two frames within a 25-frame distance, |t_i − t_j| ≤ 25. In this strategy, we focus on sampling the pixels with the largest temporal difference. We calculate the residuals between the two frames, averaged over the 3 color channels:

W^{(t_i)}(\mathbf{r}) = \min\!\left( \frac{1}{3} \left\| C^{(t_i)}(\mathbf{r}) - C^{(t_j)}(\mathbf{r}) \right\|_1 ,\ \alpha \right) .   (C.2)

To ensure that we do not sample pixels whose values changed due to spurious artifacts, we clamp W^(t_i)(r) with a lower bound, which is a hyper-parameter. Intuitively, a small value of α would favor highly dynamic regions, while a large value would assign similar importance to all rays.
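A minimal sketch of the DyNeRF-IST weights, assuming two (H, W, 3) frames within 25 time steps of each other; the clamp direction here follows the lower-bound description in the text and is an assumption, not the exact training code.

import torch

def ist_weights(frame_i: torch.Tensor, frame_j: torch.Tensor, alpha: float) -> torch.Tensor:
    # frame_i, frame_j: (H, W, 3); returns per-ray weights of shape (H, W).
    residual = (frame_i - frame_j).abs().mean(dim=-1)   # (1/3) * L1 norm over the color channels
    # Clamp with the hyper-parameter alpha; shown as the lower bound described in the text
    # (use max=alpha instead if the min in Eq. (C.2) is read as an upper bound).
    return residual.clamp(min=alpha)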
Combined Method (DyNeRF-IS*): We empirically observed that training DyNeRF-ISG with a high learning rate leads to very quick recovery of dynamic detail, but results in some jitter across time. On the other hand, training DyNeRF-IST with a low learning rate produces a smooth temporal sequence which is still somewhat blurry. Thus, we combine the benefits of both methods in our final strategy, DyNeRF-IS*, which first obtains sharp details via DyNeRF-ISG and then smoothens the temporal motion via DyNeRF-IST.
Training Details with the Importance Sampling Schemes: We apply global-median-map importance sampling (DyNeRF-ISG) in both the keyframe training and the full video training stage, and subsequently refine with temporal-difference importance sampling (DyNeRF-IST) only for the full video. For faster computation in DyNeRF-ISG, we calculate the temporal median maps and pixel weights for each view at 1/4 of the resolution, and then upsample the median image map to the input resolution. For γ in the Geman-McClure robust norm, we set 1e−3 during keyframe training and 2e−2 in the full video training stage. Empirically, this samples the background more densely in the keyframe training stage than in the following full video training. We also found that importance sampling has a larger impact in the full video training, as the keyframes are highly different from one another. We set α = 0.1 in DyNeRF-IST. In the full video training stage we first train for 250K iterations of DyNeRF-ISG with learning rate 1e−4 and then for another 100K iterations of DyNeRF-IST with learning rate 1e−5.
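For completeness, here is a minimal sketch (assumed names and shapes) of how such a per-view weight map can drive ray selection, drawing pixels with probability proportional to their importance weights rather than uniformly.

import torch

def sample_rays(weights: torch.Tensor, num_rays: int) -> torch.Tensor:
    # weights: (H, W) non-negative importance weights (e.g. from DyNeRF-ISG or DyNeRF-IST);
    # returns (num_rays, 2) integer pixel coordinates (row, col) of the sampled rays.
    H, W = weights.shape
    probs = weights.flatten()
    idx = torch.multinomial(probs, num_rays, replacement=True)
    rows = torch.div(idx, W, rounding_mode="floor")
    cols = idx % W
    return torch.stack([rows, cols], dim=-1)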
C.4 More Results
Details on the Baseline Methods:
• Multi-View Stereo (MVS): We reconstruct textured 3D meshes using the commercial photogrammetry software RealityCapture (https://www.capturingreality.com/) and render the novel view from the textured 3D meshes frame by frame. This baseline demonstrates the challenges of traditional geometry-based approaches.
• Local Light Field Fusion (LLFF) [133]: LLFF is one of the state-of-the-art multiplane-image based methods, tailored to front-facing scenes. We apply the pre-trained network from LLFF to produce the multiplane images and render the novel views using the default parameters. To work with the videos in our datasets, we produce the novel view frame by frame by querying the inputs at each corresponding time.
• Neural Volumes (NV) [123]: NV is one of the state-of-the-art learning-based volumetric methods that can generate novel-view videos. We use the same training videos and apply the default parameters to train the network. We set the bounding volume according to the geometry of the scene. We use a 128^3 voxel grid for the RGB volume and 32^3 for the warping grid. It renders a novel-view image via ray marching through a warped voxel grid at each timestamp.
• NeRF-T: Refers to the version in Eq. 4.1 in Chapter 4, which is a straightforward temporal extension of NeRF. We implement it following the details in [134], with only one difference in the input: the input concatenates the original positionally-encoded location, the view direction, and the time. We choose a positional-encoding bandwidth of 4 for the time variable and do not find that increasing the bandwidth further improves results (a minimal encoding sketch follows after Table C.1).
• DyNeRF†: We compare to DyNeRF without our proposed hierarchical training strategy and without importance sampling, i.e., this variant uses per-frame latent codes that are trained jointly from scratch.
• DyNeRF with varying hyper-parameters: We vary the dimension of the employed latent codes (8, 64, 256, 1024, 8192). We also run ablation studies on the different versions of DyNeRF with importance sampling: DyNeRF-ISG, DyNeRF-IST, and DyNeRF-IS*.

Table C.1: Quantitative comparison of our proposed method to baselines of existing methods and radiance field baselines trained for 200K iterations on a 10-second sequence. DyNeRF-IS* uses both sampling strategies (ISG and IST) and thus runs for more iterations: 250K iterations of ISG, followed by 100K of IST; it is shown here only for completeness.

Method          PSNR↑    MSE↓     DSSIM↓   LPIPS↓   FLIP↓
MVS             19.1213  0.01226  0.1116   0.2599   0.2542
NeuralVolumes   22.7975  0.00525  0.0618   0.2951   0.2049
LLFF            23.2388  0.00475  0.0762   0.2346   0.1867
NeRF-T          28.4487  0.00144  0.0228   0.1000   0.1415
DyNeRF†         28.4994  0.00143  0.0231   0.0985   0.1455
DyNeRF-ISG      29.4623  0.00113  0.0201   0.0854   0.1375
DyNeRF-IST      29.7161  0.00107  0.0197   0.0885   0.1340
DyNeRF-IS*      29.5808  0.00110  0.0197   0.0832   0.1347
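As referenced in the NeRF-T description above, here is a minimal sketch (assumed function name) of the positional encoding applied to the time input, with a bandwidth of 4 frequency bands as stated; it is an illustration, not the exact baseline implementation.

import math
import torch

def positional_encoding(t: torch.Tensor, num_bands: int = 4) -> torch.Tensor:
    # t: (..., 1) normalized time values; returns (..., 2 * num_bands) encoded features.
    freqs = (2.0 ** torch.arange(num_bands, device=t.device)) * math.pi
    angles = t * freqs                      # broadcasts to (..., num_bands)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)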
Quantitative Comparison to the Baselines: Tab. C.1 shows the quantitative comparison of our methods to the baselines using an average of single-frame metrics. We train all the neural radiance field based baselines and our method for the same number of iterations for a fair comparison. Compared to the existing methods MVS, NeuralVolumes, and LLFF, our method is able to capture and render significantly more photorealistic images in all the quantitative measures. Compared to the time-variant NeRF baseline NeRF-T and our basic DyNeRF model without our proposed training strategy (DyNeRF†), our DyNeRF model variants trained with the proposed training strategy perform significantly better in all metrics. DyNeRF-ISG and DyNeRF-IST both achieve high quantitative performance, with DyNeRF-IST slightly more favorable in terms of the metrics. Our complete strategy DyNeRF-IS* requires more iterations and is added to the table only for completeness.
The Impact of Importance Sampling: In Fig. C.3 we evaluate the effect of our importance sampling strategies, DyNeRF-ISG, DyNeRF-IST, and DyNeRF-IS*, against a baseline DyNeRF-noIS that also employs the hierarchical training strategy with latent codes initialized from trained keyframes but, instead of selecting rays based on importance, selects them at random as in standard NeRF [134]. The figure shows zoomed-in crops of the dynamic region for better visibility. We clearly see that all the importance sampling strategies recover the moving flame gun better than DyNeRF-noIS in half as many iterations. At 100K iterations DyNeRF-ISG and DyNeRF-IST look similar, though they converge differently, with DyNeRF-IST being blurrier in early iterations and DyNeRF-ISG recovering moving details slightly faster. The visualizations of the final results upon convergence in Fig. C.3 demonstrate the superior photorealism that DyNeRF-IS* achieves, as DyNeRF-noIS remains much blurrier in comparison. We notice that without importance sampling, the system cannot reach an acceptable visual quality within an extended training time, indicating the necessity of the importance sampling scheme.

Figure C.3: Comparison of importance sampling strategies over training iterations (DyNeRF-noIS, DyNeRF-ISG, and DyNeRF-IST at 5,000, 25,000, 50,000, and 100,000 iterations, plus the final DyNeRF-noIS and DyNeRF-IS* results).

In Fig. C.4, we compare various settings of the dynamic neural radiance fields. NeRF-T can only capture a blurry motion representation, which loses all appearance details in the moving regions and cannot capture view-dependent effects. Though DyNeRF† has a similar quantitative performance to NeRF-T, it has significantly improved visual quality in the moving regions compared to NeRF-T, but still struggles to recover the sharp appearance details. DyNeRF with our proposed training strategy, i.e., DyNeRF-ISG, DyNeRF-IST, and DyNeRF-IS*, can recover sharp details in the moving regions, including the torch gun and the flames.

Figure C.4: Qualitative comparisons of DyNeRF variants on one image of the sequence whose averages are reported in Tab. C.1. From left to right we show the rendering by each method, a zoom onto the moving flame gun, and visualizations of DSSIM and FLIP for this region using the viridis colormap (dark blue is 0, yellow is 1, lower is better). The three hierarchical DyNeRF variants outperform these baselines: DyNeRF-ISG has sharper details than DyNeRF-IST, but DyNeRF-IST recovers more of the flame, while DyNeRF-IS* combines both of these benefits.
Table C.2: Comparison of model storage size between our method (DyNeRF) and alternative solutions. For HEVC, we use the default GoPro 7 video codec. For JPEG, we employ a compression rate that maintains the highest image quality. For NeRF, we use a set of the original NeRF networks [134] reconstructed frame by frame. For HEVC, PNG, and JPEG, the required memory may vary within a factor of 3 depending on the video appearance. For NeuralVolumes (NV), we only count the neural network size, without its dependency on additional input streams. For NeRF, NeuralVolumes, and DyNeRF, the required memory is constant. All calculations are based on 10 seconds of 30 FPS video captured by 18 cameras.

           HEVC    PNG     JPEG    NeRF    NV     DyNeRF
Size (MB)  1,406   21,600  3,143   1,080   773    28
Comparisons of Model Compression: Our model is compact in terms of model size. In Tab. C.2, we compare our DyNeRF model to the alternatives in terms of storage size. Compared to the raw videos stored as individual images, e.g., PNG or JPEG, our representation is more than two orders of magnitude smaller. Compared to a highly compact 2D video codec (HEVC), which is the default video codec of the GoPro camera, our model is still 50 times smaller. It is worth noting that these compressed 2D representations do not provide a 6D continuous representation as we do. Though NeRF is a compact model for a single static frame, representing the whole captured video without dropping frames requires a stack of frame-by-frame reconstructed NeRF networks, which is more than 30 times larger in size than our single DyNeRF model. Compared to the convolutional model used in Neural Volumes, DyNeRF is more compact in size and can represent the dynamic scene with better quality.
Impact of Latent Embedding Size on DyNeRF: We run an ablation on the latent code length on 60 continuous frames and present the results in Tab. C.3. In this experiment, we do not include keyframe training or importance sampling. We ran the experiments until 300K iterations, which is when most models start to converge in rendering quality. Note that with a code length of 8,192 we cannot fit the same number of samples in GPU memory as in the other cases, so we report a score from a later iteration when roughly the same number of samples has been used. We use 4 16GB GPUs and a network width of 256 for the experiments on this short sequence. From the metrics we clearly conclude that a code of length 8 is insufficient to represent the dynamic scene well. Moreover, we have visually observed that results with such a short code are typically blurry. With increasing latent code size, the performance also increases, but it saturates at a dimension of 1024. A latent code size of 8192 has a longer training time per iteration. Taking capacity and speed jointly into consideration, we choose 1024 as our default latent code size for all the sequences in Chapter 4 and the supplementary video.

Table C.3: Ablation study on the latent code dimension on a sequence of 60 consecutive frames. Codes of dimension 8 are insufficient to capture sharp details, while codes of dimension 8,192 take too long to be processed by the network. We use 1,024 for our experiments, which allows for high quality while converging fast. *With a code length of 8,192 we cannot fit the same number of samples in GPU memory as in the other cases, so we report a score from a later iteration when roughly the same number of samples has been used.

Dimension  PSNR↑    MSE↓     DSSIM↓   LPIPS↓   FLIP↓
8          26.4349  0.00228  0.0438   0.2623   0.1562
64         27.1651  0.00193  0.0401   0.2476   0.1653
256        27.3823  0.00184  0.0421   0.2669   0.1500
1,024      27.6286  0.00173  0.0408   0.2528   0.1556
8,192*     27.4100  0.00182  0.0348   0.1932   0.1616
Additional Discussion on the Latent Codes: Besides the above findings, we also observe some failure cases when manipulating the latent codes. Extrapolating the latent codes in time does not directly create high-quality extrapolated views. We have extensively investigated latent code optimization with various combinations of parameter learnability for the latent codes (keyframes / remaining frames) and the network. With frozen keyframe latent codes, we observe blurrier results than in the all-learnable case. Therefore, learning both the latent codes (keyframes / remaining frames) and the network is necessary for producing sharp and high-quality renderings.
View-dependent Effects in Dark Indoor Scenes: DyNeRF can represent view-dependent effects as well as motion in one continuous representation. When the input camera streams exhibit slightly different appearance, we find that DyNeRF models this difference as part of the view-dependent effects when generating novel views. We observe this artifact in all of our dark indoor scenes, where there is more obvious color inconsistency across the wide-angle input video streams. Incorporating more careful color calibration, or learning the color calibration, may address this problem, which we leave for future work.
3D Video Editing via Manipulating the Latent Codes: DyNeRF represents a continuous spatio-temporal dynamic scene, which supports rendering any view within the interpolation boundary of space and time. We can create a latent code at a sub-frame time via interpolation and render a "slow motion" 3D video at any given frame rate; DyNeRF enables smooth interpolation from 30 FPS to 60 FPS or even 150 FPS. Furthermore, our method can render a "bullet time" effect by freezing the latent code at any arbitrary time and manipulating the camera view in space. We include the "slow motion" and "bullet time" video effects from arbitrary times in our supplementary videos.
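A minimal sketch of the sub-frame latent code interpolation described above; variable names are assumptions, and the interpolated code would be fed to the trained DyNeRF together with the desired camera pose.

import torch

def interpolate_latent(codes: torch.Tensor, t: float) -> torch.Tensor:
    # codes: (num_frames, D) learned per-frame latent codes; t: continuous time in frame units.
    i = int(t)
    j = min(i + 1, codes.shape[0] - 1)
    w = t - i
    return (1.0 - w) * codes[i] + w * codes[j]

# e.g. for 2x slow motion, query t = 0.0, 0.5, 1.0, 1.5, ... and render each interpolated code;
# for a bullet-time effect, keep t fixed and vary the camera pose instead.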
Rendering Time: The rendering time of our method is on par with NeRF due to the structural similarity of the approaches. Our current, not fully optimized version achieves a rendering time of 45 seconds for one 1080p frame using two V100 GPUs with 16 GB memory.