LEARNING TO OPTIMIZE THE GEOMETRY AND APPEARANCE FROM IMAGES
by
Shichen Liu
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
May 2023
Copyright 2023 Shichen Liu
Dedication
In memory of my grandfather Haiqiao Liu.
Acknowledgements
First of all, I would like to thank my advisors Prof. Hao Li and Prof. Yajie Zhao for mentoring me in
conducting cutting-edge research, and for their generous support that allowed me to freely explore my areas of interest.
I would also like to thank my thesis committee: Prof. Randall Hill, Prof. Stefanos Nikolaidis, Prof. Ai-
ichiro Nakano, and Prof. Andrew Nealen for their insightful suggestions, and Prof. Andrew Gordon for
joining my qualifying exam committee. I also want to thank Kathleen Haase and Christina Trejo for their
supportive administration, without whom none of my research would be possible.
I feel grateful for being able to collaborate with many talented researchers: Tianye Li, Weikai Chen,
Shunsuke Saito, Yichao Zhou, Timo Bolkart, Haiwei Chen, Jiayi Liu, and Yunxuan Cai. Their positive and
constructive feedback always helped me proceed with confidence. I would like to extend my gratitude to
Linjie Luo, Alex Ma, Tony Tung, Yuanlu Xu, Nikolaos Sarafianos, Simon Yuen, Koki Nagano, and Jaewoo
Seo for the excellent guidance and support during my internships at Bytedance, Meta Reality Labs, and
Nvidia. I would also like to take this opportunity to thank my undergraduate supervisors and mentors,
Prof. Mingsheng Long, Dr. Yue Cao, Prof. Gao Huang, and Prof. Kilian Weinberger, who introduced me
to computer vision and machine learning research.
I would like to acknowledge my lab mates at USC for being amazing collaborators and sharing the fun
times throughout many deadlines: Jun Xing, Mingming He, Zeng Huang, Bipin Kishore, Pratusha Prasad,
Marcel Ramos, Yi Zhou, Ruilong Li, Yuliang Xiu, Kyle Olszewski, Zimo Li, Pengda Xiang, Sitao Xiang,
Jing Yang, and Yuming Gu. Special thanks also go to my friends who have helped me during my
Ph.D. program, especially during the hardest times: Qiangeng Xu, Youwei Zhuo, Haiwen Feng, Yufeng Yin,
and Yujing Xue.
Finally, I would like to thank my father, mother, and my family for their unwavering support and
unconditional love.
Table of Contents
Dedication ii
Acknowledgements iii
List of Tables vii
List of Figures ix
Abstract xiv
Chapter 1: Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Chapter 2: Image-based 3D Optimization with Inverse Graphics 6
2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Soft Rasterizer: A General Differentiable Mesh Renderer . . . . . . . . . . . . . . . . . . . 12
2.2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.2 Probability Map Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.3 Aggregation Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.4 Texture Handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.5 Comparisons with Prior Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.6 Image-based 3D Reasoning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.6.1 Single-view Mesh Reconstruction . . . . . . . . . . . . . . . . . . . . . . 25
2.2.6.2 Image-based Shape Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2.7 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.2.8 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2.8.1 Forward Rendering Results . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2.8.2 Qualitative Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2.8.3 Quantitative Evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2.8.4 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.2.8.5 Image-based Shape Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.2.9 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.3 Learning to Infer the Implicit Surfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.3.2 Sampling-Based 2D Supervision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.3.3 Geometric Regularization on Implicit Surfaces . . . . . . . . . . . . . . . . . . . . . 46
2.3.4 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.3.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.3.5.1 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.3.5.2 Ablation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.3.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Chapter 3: Learning to Optimize Vanishing Points for Scene Structure Analysis 54
3.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.2 VaPiD: A Rapid Vanishing Point Detector . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.2.2 Network Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.2.2.1 Vanishing Point Proposals . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.2.2.2 Learning to Optimize Vanishing Points . . . . . . . . . . . . . . . . . . . 62
3.2.2.3 Loss Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.2.3 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.2.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.2.4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.2.4.2 Experiment Setups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.2.4.3 Results on Synthetic Datasets . . . . . . . . . . . . . . . . . . . . . . . . 66
3.2.4.4 Results on Real-World Datasets . . . . . . . . . . . . . . . . . . . . . . . 67
3.2.4.5 Ablations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Chapter 4: Learning to Optimize Face Avatars 73
4.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.2 Rapid Face Asset Acquisition with Recurrent Feature Alignment . . . . . . . . . . . . . . . 80
4.2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.2.1.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.2.2 Feature Extraction Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.2.3 Recurrent Face Geometry Optimizer . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.2.3.1 Visual-Semantic Correlation (VSC) Networks . . . . . . . . . . . . . . . 85
4.2.3.2 Geometry Decoding Network . . . . . . . . . . . . . . . . . . . . . . . . 86
4.2.4 Texture Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.2.5 Training Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.2.6 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.2.7 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.2.7.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.2.7.2 Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.2.7.3 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.2.7.4 Devices and Real-world Captures . . . . . . . . . . . . . . . . . . . . . . 106
4.2.7.5 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.2.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Chapter 5: Conclusion and Future Directions 111
5.1 Summary of Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.2 Open Questions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Bibliography 114
List of Tables
2.1 Comparison of mean IoU with other 3D unsupervised reconstruction methods on 13
categories of ShapeNet datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2 Ablation study of the regularizer and various forms of distance and aggregate functions.
A_N is the aggregation function implemented as a neural network. A_S and A_O are defined
in Equations 2.8 and 2.13, respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.3 Comparison of cube rotation estimation error with NMR, measured in mean relative
angular error. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.4 Comparison of 3D IoU with other unsupervised reconstruction methods. . . . . . . . . . . 50
2.5 Quantitative evaluations of our approach on chair category using different regularizer
configurations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
2.6 Quantitative evaluations on table category with different ∆ d . . . . . . . . . . . . . . . . . 52
2.7 Quantitative measurements for the ablation analysis of importance sampling and
boundary-aware assignment on the chair category as shown in Figure 2.25. . . . . . . . . . 53
3.1 Comparisons of mean, median angle errors and the angular accuracies of 0.2°, 0.5°, 1° with
baseline methods on SU3 dataset [180]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.2 Comparisons of mean, median angle errors and the angular accuracies of 1°, 2°, 10° with
baseline methods on Natural Scene dataset [181]. . . . . . . . . . . . . . . . . . . . . . . . 69
3.3 Comparisons of mean, median angle errors and the angular accuracies of 1°, 2°, 10° with
baseline methods on HoliCity dataset [178]. . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.4 Comparisons of AUC values at 10° with baseline methods on NYU-VP dataset [78].
Supervised methods are noted as “yes” in the last column. † The method requires an additional
line segment detector such as LSD [157]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.5 Ablation study on the efficient conic convolutions. “VaPiD-ECC" denotes the baseline
without using our proposed efficient conic convolutions. . . . . . . . . . . . . . . . . . . . 71
3.6 Ablation study on the number of refinement steps. (×T) indicates T refinement steps
during the inference. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.1 Symbol table. By default, we set H = W = H_t = W_t = 512. . . . . . . . . . . . . . . . . . 81
4.2 Quantitative comparison on our Light stage captured dataset. The table measures the
percentage of points that are within Xmm to the ground truth surface (column 1-3), mean
and median scan-to-mesh errors (column 4-5), and a comparison of the supported features
(column 6-8). “Consistency” denotes whether the reconstructed mesh has consistent
connectivity. “Dense” denotes whether the model reconstructs geometry with more than
500k vertices [107], and “Texture” denotes whether the network output includes texture
information. Although the original work of ToFU includes texture inference, the module
is separate from its main architecture and thus not learned end-to-end. . . . . . . . . . . . 91
4.3 Inference time comparison. Our model is both more efficient and more effective than the
baselines, running at 4.5 FPS. A lighter model that achieves accuracy similar to ToFu runs
at 9 FPS, which is close to real-time performance. . . . . . . . . . . . . . . . . . . . . . . . 99
4.4 Ablation study on our dataset. Underlined items are our default settings. Correlation
feature: whether the correlation feature is used as in our default setting (“correlation”) or
the semantic and visual features are simply concatenated (“concat”). View aggr. func.:
choice of the pooling function. Grid size: the total side length of the 3D grid built for
computing correlation. Search radius: the search radius used in computing the visual-semantic
correlation. Recurrent Layer: whether a GRU is used or replaced by convolution layers.
UV-space Feature: components of the UV-space features: UV coordinates (U), position map (P),
and face region map (R). UV-space Embedding: whether the UV-space feature g is learned by
a neural network (“Network”) or directly set as learnable parameters (“Parameter”). Input
View: number of views used as input during inference. Notably, decreasing the number of
views for inference only results in a slight decrease in performance. Our model’s performance
with only 4 views still achieves the best accuracy when compared to the best baseline, which
utilizes 15 views. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
List of Figures
2.1 We propose Soft Rasterizer R (upper), a naturally differentiable renderer, which
formulates rendering as a differentiable aggregating process A(·) that fuses per-triangle
contributions {D_i} in a “soft” probabilistic manner. Our approach attacks the core problem
of differentiating the standard rasterizer, which cannot propagate gradients from pixels to
geometry due to the discrete sampling operation (below). . . . . . . . . . . . . . . . . . . . 12
2.2 Forward rendering (left): various rendering effects generated by SoftRas by tuning the
degree of transparency and blurriness. Applications based on the backward gradients
provided by SoftRas: (1) 3D unsupervised mesh reconstruction from a single input image
(middle) and (2) 3D pose fitting to the target image by flowing gradient to the occluded
triangles (right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Comparisons between the standard rendering pipeline (upper branch) and our rendering
framework (lower branch). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 Probability maps of a triangle under Euclidean (upper) and barycentric (lower) distances. . 17
2.5 An illustration of our proposed texture map handling approach when texture resolution is 3. 22
2.6 Comparisons with prior differentiable renderers in terms of gradient flow. . . . . . . . . . 23
2.7 The proposed framework for single-view mesh reconstruction. . . . . . . . . . . . . . . . 25
2.8 Network structure for color reconstruction. . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.9 3D mesh reconstruction from a single image. From left to right, we show input image,
ground truth, the results of our method (SoftRas), Neural Mesh Renderer [73] and
Pixel2mesh [158] – all visualized from 2 different views. Along with the results, we also
visualize mesh-to-scan distances measured from reconstructed mesh to ground truth. . . . 30
2.10 Results of colorized mesh reconstruction. The learned principal colors and their usage
histogram are visualized on the right. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.11 Different rendering effects achieved by our SoftRas renderer. We show how a colorized
cube can be rendered in various ways by tuning the parameters of SoftRas. In particular,
by increasing γ, SoftRas can render the object with more transparency, while blurrier
renderings can be achieved by increasing σ. As γ → 0 and σ → 0, the rendering effect
approaches that of standard rendering. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.12 Single-view reconstruction results on real images. . . . . . . . . . . . . . . . . . . . . . . . 32
2.13 Visualization of loss function landscapes of NMR and SoftRas for pose optimization given
target image (a) and initialization (f). SoftRas achieves global minimum (b) with loss
landscape (g). NMR is stuck in local minimum (c) with loss landscape (h). At this local
minimum, SoftRas produces the smooth and partially transparent rendering (d)(e), which
smoothens the loss landscape (i)(j) with larger σ and γ, and consequently leads to a better
minimum. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.14 Intermediate process of fitting a color cube (second row) to a target pose shown in the
input image (first row). The smoothened rendering (third row) that is used to escape local
minimum, as well as the colorized fitting errors (fourth row), are also demonstrated. . . . . 36
2.15 Results for optimizing human pose given single image. . . . . . . . . . . . . . . . . . . . . 37
2.16 Results for optimizing facial identity, expression, skin reflectance, lighting and rigid pose
given single 2D image along with 2D landmarks. . . . . . . . . . . . . . . . . . . . . . . . . 38
2.17 While explicit shape representations may suffer from poor visual quality due to limited
resolutions or fail to handle arbitrary topologies (a), implicit surfaces handle arbitrary
topologies with high resolutions in a memory efficient manner (b). However, in contrast
to the explicit representations, it is not feasible to directly project an implicit field onto a
2D domain via perspective transformation. Thus, we introduce a field probing approach
based on efficient ray sampling that enables unsupervised learning of implicit surfaces
from image-based supervision. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.18 Ray-based field probing technique. (a) A sparse set of 3D anchor points are distributed to
sense the field by sampling the occupancy value at its location. (b) Each anchor is assigned
a spherical supporting region to enable ray-point intersection. The anchor points that
have higher probability to stay inside the object surface are marked with deeper blue. (c)
Rays are cast passing through the sampling points {x_i} on the 2D silhouette under the
camera views {π_k} (blue indicates object interior and white otherwise). (d) By aggregating
the information from the intersected anchor points via max pooling, one can obtain the
prediction for each ray. (e) The silhouette loss is obtained by comparing the prediction
with the ground-truth label in the image space. . . . . . . . . . . . . . . . . . . . . . . . . 43
2.19 Network architecture for unsupervised learning of implicit surfaces. The input image I
is first mapped to a latent feature z by an image encoder g, while the implicit decoder
f consumes both the latent code z and a query point p_j and predicts its occupancy
probability ϕ(p_j). With a trained network, one can generate an implicit field whose
iso-surface at 0.5 depicts the inferred geometry. . . . . . . . . . . . . . . . . . . . . . . . . 46
2.20 2D illustration of importance weighted geometric regularization. . . . . . . . . . . . . . . . 47
2.21 Qualitative results of single-view reconstruction using different surface representations.
For point cloud representation, we also visualize the meshes reconstructed from the
output point cloud. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.22 Qualitative comparisons with mesh-based approach [101] in term of modeling capability
of capturing varying topologies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.23 Qualitative evaluations of geometric regularization by using different configurations. . . . 52
2.24 Qualitative results of reconstruction using our approach with different regularizer
sampling step∆ d. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
2.25 Qualitative analysis of importance sampling and boundary-aware assignment for
single-view reconstruction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.1 We propose a novel vanishing point detection network VaPiD, which runs in real-time
with high accuracy. Speed-accuracy curves compare with state-of-the-art methods on
the SU3 dataset [180]. Dotted horizontal lines labeled with nϵ represent the nth smallest
angle errors that can numerically be represented by 32-bit floating point numbers when
computing the angle between two normalized direction vectors, i.e., arccos⟨d_1, d_2⟩. . . . . 55
3.2 The architecture of our proposed VaPiD. It incorporates three major components: (1)
a backbone network for feature extraction from the input image; (2) a vanishing point
proposal network (VPPN) to generate reliable vanishing point proposals with efficient
conic convolutions; (3) a weight sharing neural vanishing point optimizer (NVPO) to
refine each vanishing point proposal to achieve high accuracy. Note that our network is
trained in an end-to-end fashion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.3 Illustration of point-based non-maximum suppression (PNMS). (a) The confidence score
map of a dense vanishing point anchor grid predicted by our efficient conic convolution
networks. Higher scores are visualized as solid spheres with larger radius. (b) Top-3
vanishing point proposals after PNMS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.4 Illustration of our update operator. (a) Given the camera system (X, Y, Z) and the current
vanishing point position d^(t), we define the local system (X′, Y′, Z′). (b) We obtain the
refined vanishing point d^(t+1) by applying the update vector ∆^(t) in the local system. . . . 63
3.5 Qualitative comparison with baseline methods on SU3 Wireframe dataset [180]. Line
group in the same color indicates the same vanishing point. We highlight all prediction
errors in red. Best viewed in color. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.6 Angle accuracy curves and speed-accuracy comparisons for different methods on SU3
wireframe dataset [180]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.7 Angle accuracy curves and speed-accuracy comparisons for different methods on Natural
Scene dataset [181] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.8 Visualization of our vanishing point detection results on various types of scenes. For each
of the vanishing points, we visualize it using a group of 2D lines in the same color. Best
viewed in color. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.1 Composition of the UV-space feature G. G is a concatenation of (a) UV-space coordinates,
(b) the position map of a mean shape, and (c) a carefully crafted face region map (a
31-dimensional one-hot vector). The composition serves to encode the facial semantics and
the geometry priors necessary for the future steps. . . . . . . . . . . . . . . . . . . . . . . . 82
4.2 An illustration of visual-semantic correlation (VSC). A 3D local grid is built around the 3D
position of each pixel in the UV-space position map. The volume of correlation feature is
then constructed by taking the inner product between each UV-space feature in the local
grid and its projected features in the multi-view image space. The correlation feature is
a local representation of the alignment between the observed visual information and the
semantic priors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.3 An example set of subject data used for training. (a) Selected views of the captured
images as input. (b) Processed geometry in the form of a 3D mesh. In addition to the
face, head, and neck, our model represents teeth, gums, eyeballs, eye blending, lacrimal
fluid, eye occlusion, and eyelashes. The green region denotes the face area that our model
aims to reconstruct. The other parts are directly adopted from a template. (c) 4K×4K
physically-based skin properties, including albedo (bottom-left), specular (top-left) and
displacement (top-right) maps used for texture supervision, and the 512×512 position
map (bottom-right), converted from the 3D mesh in (b), used for geometry supervision. . . 90
4.4 Network architecture of ReFA. Our model recurrently optimizes for the facial geometry
and the head pose based on computation of visual-semantic correlation (VSC) and utilizes
the pixel-aligned signals learned thereof for high-resolution texture inference. . . . . . . . 91
4.5 Qualitative comparison with the baselines on our testing dataset. As the released code
of the baseline methods [67] and [93] does not produce appearance maps, the results
presented here are the networks’ direct output geometry rendered with a basic shader in
Blender. Visual inspection suffices to reveal the improvement our model has achieved:
ReFA produces results that are more robust to challenging expressions (row 1,4,5), facial
shapes (row 6,7) and reconstructs a wider face area including the ears and forehead when
compared to [93] and [4]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.6 Qualitative comparison with NeRF-based method. . . . . . . . . . . . . . . . . . . . . . . . 93
4.7 A comparison between our method (c) and a traditional MVS and fitting pipeline (b). The
traditional pipeline incorrectly reconstructs the two challenging input examples as shown
in the figure: pointy ear in the upper case due to hair occlusion and closed eyes in the
lower case. Our system not only correctly reconstructs the fine geometry details, but also
at a significantly faster speed. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.8 Cumulative density function (CDF) curves of scan-to-mesh distance comparison on our
testing dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.9 Testing result on the Triple Gangers [152] dataset, whose capturing setup contains
different camera placements, no polarizer and a lighting condition that is not seen in our
training dataset. The result demonstrated here shows that our model generalizes well to
unseen multi-view datasets that have been captured in different settings. . . . . . . . . . . .
4.10 Speed-accuracy graph of ReFA with varying numbers of the inference iterations against
ToFu. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.11 Images rendered from our reconstructed face assets. Geometry constructed from the input
images and the inferred appearance maps are used in physically-based renderings with Maya
Arnold under lighting environments provided by HDRI images. The renderings achieve
photo-realistic quality that faithfully recovers the appearance and expression captured in
the input photos. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.12 Detailed results for the texture map inference. The even rows display zoomed-in views
of the 4096×4096 texture maps. Our texture inference network constructs texture
maps from the multi-view images with high-frequency details that essentially allow for
photo-realistic renderings of the face assets. . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.13 Reconstruction of a video sequence at 4.5 FPS, where the expression and the head pose of
the subject change over time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.14 Ablation study on the UV-space embedding network: (a) using our proposed UV-space
features along with a neural network; (b) directly setting the UV-space feature as a
learnable parameter. To visualize the features, we use T-SNE [155] to embed the feature
to the 3-dimension space. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.15 Given valid UV mappings, the position map representation is amenable to conversions to
various representations, as shown in each column. These include 3D meshes at different
subdivision levels, which enable Level of Detail (LOD) rendering, as well as point cloud,
landmark, and region map representations that are commonly used in mobile applications. . . 105
4.16 Our capture devices. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.17 Our capture devices. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Abstract
By extracting geometry and appearance from images, 3D computer vision algorithms have impacted a va-
riety of applications ranging from entertainment and communication to manufacturing. These algorithms
provide a fundamental 3D understanding of the scenes by offering tools for object shape, scene proper-
ties, and digital human reconstruction. While these applications were once limited to specialized devices,
recent advancements in computer vision research and the availability of sensor-packed mobile devices
have created the environments to make them accessible to the general public. Furthermore, deep learn-
ing advances leverage large amounts of data to train a model that functions effectively with even limited
monocular input, further facilitating this trend. However, transferring the success of 2D deep learning tasks
such as image recognition and object detection to 3D applications is not straightforward and requires addressing
practical issues such as acquiring large-scale 3D training data, developing fast and accurate models that
can handle the real-time nature of the applications, and enabling the neural networks to effectively work
with complex 3D representations.
This thesis addresses the practical challenges in the 3D understanding of several domains including
general objects, scene properties, and digital humans by introducing highly efficient deep learning models
that are architecturally inspired by the optimization-based techniques. Specifically, we first introduce a
novel differentiable renderer that can render a colored 3D mesh as well as provide the gradient from the
rendered images. This operator enables pose, shape, and texture optimization directly from 2D images.
Moreover, by incorporating it into the neural networks, the gradients can be back-propagated from the
2D images to the network parameters, enabling single-view reconstruction for general objects without the
need for supervision with large-scale 3D data. We then propose a novel neural optimization framework for
vanishing point detection. It learns a recurrent neural network to perform vanishing point optimization
on top of a low-level feature extractor. By supervising the model in an end-to-end fashion, it jointly learns
the 3D priors into the feature extractor and the optimizer resulting in unprecedented speed and accuracy.
The proposed framework allows high-quality scene property analysis at near real-time speeds. Finally, we
explore the efficient acquisition of face avatars. To leverage the efficiency of the neural optimization
framework, we extend it to accommodate dense and semantically-aligned geometries such as surfaces. To
this end, we develop an efficient module that computes the correlation between a predefined shape space
and the image space. The proposed approach demonstrates state-of-the-art accuracy and speed, and the
direct output can be readily used in a physically-based rendering pipeline for photo-realistic rendering.
Chapter 1
Introduction
1.1 Motivation
During the past decades, there has been a significant increase in the use of computer vision technolo-
gies. The ability to understand object pose, shape and appearance from images is crucial for developing
virtual reality (VR), augmented reality (AR), and autonomous driving applications. For instance, object
reconstruction and camera localization enable realistic and interactive virtual shopping and navigation
experiences. On the other hand, photo-realistic human digitization techniques, including the generation
of face, hand, or even full-body avatars, create an immersive way of collaboration, entertainment, and
communication, driving innovation in fields such as film-making, gaming, and fashion design. Recent
developments in mobile devices and vehicle systems, which now provide sensors and powerful computation
resources to general users, have greatly expanded the potential reach of these applications. However,
to ensure high-quality user experiences, computer vision algorithms must be designed to take images
from a single view or sparse views as input and operate in an accurate, robust and efficient manner.
With minimal input, the problem becomes naturally challenging, sometimes even ill-posed. This is
because a 2D image captures only a limited amount of information about the 3D structure, resulting in
infinitely many possible 3D models that could have generated the same image. Specifically, the ambiguity
in terms of depth, and the entanglement of shape and appearance, make it challenging to determine the true
3D structure from a single image. To overcome this ill-posed nature, traditional computer vision algorithms
often rely on priors that take the form of optimization objectives based on certain assumptions about the
3D world, built on top of hand-crafted features. However, to infer robust and accurate 3D information,
designing the hand-crafted features and choosing the optimization algorithm is non-trivial, as they are
sensitive to noise and outliers and require manual tweaking to be effective.
On the other hand, deep learning models such as convolutional neural networks (CNNs) have demon-
strated their capability to learn priors automatically from large amounts of labeled data, and have shown
success in various applications such as image classification, object detection, and face recognition. How-
ever, while CNNs succeed in many 2D tasks, they have not yet had as much success in 3D vision tasks.
The challenge begins with data collection. Collecting and annotating a high-quality 3D database is time-
consuming and difficult, because it requires either specialized capturing devices or 3D modeling expertise.
This poses a considerable obstacle for tasks involving a large number of categories, such as daily ob-
jects. Furthermore, the foundation of 2D CNNs is the 2D convolution operator, which is designed to learn
feature hierarchies from 2D grids of pixels by exploiting the translation-invariant property in 2D images.
However, as it is not yet well understood how to incorporate the 3D representations into 2D CNNs in a
principled manner, adapting these models to handle 3D data remains challenging. Finally, unlike 2D im-
ages where objects are represented by 2D pixels, there are different ways to represent 3D objects, such as
voxels, triangular meshes, and point clouds, each of which has its own advantages and limitations.
The choice of combination between network architecture and 3D representation adds an additional layer
of difficulty to accurate and efficient shape learning for each specific task.
1.2 Contributions
In this dissertation, I take inspiration from traditional methods and propose to incorporate optimization
algorithms into neural networks to address the challenges in 3D vision tasks.
Inverse Graphics for Image-based 3D Optimization
In Chapter 2, we design a novel differentiable renderer that can render a colored mesh to 2D images and
back-propagate the gradient from the 2D images to the mesh vertices and texture maps. The differentiable ren-
derer can directly optimize the object shape and pose from 2D images by comparing the rendered images
with the input images, allowing for 3D reasoning without 3D supervision. Unlike previous methods that use
a traditional renderer in the forward pass and approximate the rendering gradient in the back-propagation,
we propose to replace the discrete functions in the traditional renderer with continuous ones. More
specifically, the key to our framework is a novel formulation that views rendering as an aggregation func-
tion that fuses the probabilistic contributions of all mesh triangles with respect to the rendered pixels. Our
formulation enables our framework to flow gradients to the occluded and distant vertices, which cannot
be achieved by previous state-of-the-art methods. We demonstrate that our method can optimize the shape
of a wide range of objects from multi-view images and the challenging pose of rigid, non-rigid, and even
occluded body parts. Furthermore, we develop a single-view shape reconstruction neural network by con-
catenating a convolutional neural network and our proposed differentiable renderer. Our reconstruction
network can learn from 2D image datasets without the need for 3D ground truth, and perform single-image
3D reconstruction during test time.
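To make the idea concrete, the following is a minimal sketch of the analysis-by-synthesis loop that such a differentiable renderer enables. It assumes a hypothetical `soft_render` function that is differentiable with respect to its inputs; it is illustrative only and not the implementation described in Chapter 2.

```python
# Hypothetical analysis-by-synthesis fitting loop; `soft_render` stands in for a
# differentiable renderer and is assumed to map (vertices, faces, pose) to an image.
import torch

def fit_mesh_to_image(vertices, faces, pose, target_image, soft_render,
                      steps=200, lr=1e-2):
    """Optimize vertex positions and pose so the rendering matches the target image."""
    vertices = vertices.detach().clone().requires_grad_(True)
    pose = pose.detach().clone().requires_grad_(True)
    optimizer = torch.optim.Adam([vertices, pose], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        rendered = soft_render(vertices, faces, pose)        # (H, W, 3), differentiable
        loss = torch.mean((rendered - target_image) ** 2)    # dense pixel-level loss
        loss.backward()       # gradients flow from pixels back to vertices and pose
        optimizer.step()
    return vertices.detach(), pose.detach()
```

Because the loss is computed purely in image space, the same loop applies whether the free variables are vertex positions, pose parameters, or texture values.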
Learning to Optimize Vanishing Points for Scene Structure Analysis
By exploiting prominent visual cues in images, namely line features, vanishing points tie scene
structures such as the camera intrinsic and extrinsic parameters closely together. Traditional methods, however, are not robust to noise.
Recent works have attempted to address this problem by introducing 3D-aware structures into the network
architecture. While they have successfully improved the accuracy and generalizability, they tend to be
computationally expensive due to the increased complexity in the network architecture. In Chapter 3, I
instead combine the advantages of optimization-based approaches and neural networks, and develop a neural
optimization scheme, which greatly improves the model efficiency while achieving superior performance.
Specifically, I describe how to divide the neural network into three main modules: feature extraction,
vanishing point proposal generation, and vanishing point refinement. We explore computation
sharing techniques and the neural optimizer design in these modules to boost the speed to near
real-time, while achieving state-of-the-art accuracy on four benchmarks.
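The control flow of such a neural optimizer can be sketched as follows. The module below is a simplified, hypothetical stand-in for the vanishing point refinement stage (the backbone and proposal modules are omitted); the 2-DoF update in a local tangent frame mirrors the parameterization illustrated in Figure 3.4, but the layer sizes and helper functions are illustrative assumptions.

```python
# Schematic of the refinement stage only; layer sizes, the local-frame helper, and
# the update parameterization are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def local_basis(d):
    """Two unit vectors spanning the plane orthogonal to each direction d of shape (B, 3).
    Assumes d is not parallel to the z-axis (a sketch-level simplification)."""
    up = torch.tensor([0.0, 0.0, 1.0], device=d.device).expand_as(d)
    x = F.normalize(torch.cross(up, d, dim=-1), dim=-1)
    y = F.normalize(torch.cross(d, x, dim=-1), dim=-1)
    return torch.stack([x, y], dim=-1)                       # (B, 3, 2)

class NeuralVPOptimizer(nn.Module):
    """Learned optimizer that iteratively refines vanishing-point directions."""
    def __init__(self, feat_dim=64, hidden=128):
        super().__init__()
        self.update_net = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, 2))                            # 2-DoF update step

    def forward(self, feat, d, steps=4):
        # feat: (B, feat_dim) image features; d: (B, 3) unit direction proposals.
        for _ in range(steps):
            delta = self.update_net(torch.cat([feat, d], dim=-1))        # (B, 2)
            d = d + torch.einsum('bij,bj->bi', local_basis(d), delta)
            d = F.normalize(d, dim=-1)                       # stay on the unit sphere
        return d
```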
Learning to Optimize Face Avatars
To achieve high-quality and rapid inference of digital humans, it is necessary to handle complex geometry
representations, such as surfaces. In Chapter 4, we propose an end-to-end neural network for the rapid
creation of production-grade face assets from multi-view images, built on a novel formulation of a recurrent
geometry optimizer that operates on UV-space geometry features and provides an effective solution to
high-quality face asset creation. Our model is on par with industrial pipelines in quality, producing
accurate, complete, registered, and textured assets directly applicable to physically-based rendering, but
produces the assets end-to-end, fully automatically, and at a significantly faster speed of 4.5 FPS. Specifically,
we use position maps to represent the face geometry to achieve three crucial goals: (1) a dense and
expressive representation that supports high-resolution geometry; (2) a semantically aligned UV space; and
(3) position maps, being 2D images, are far more efficient to process than a sparse representation such as a
mesh. Given multi-view images, our model first uses a feature extraction network to extract features from the
input images and a predefined UV-space feature map. Starting from an initial guess, it then uses a learned
neural optimizer to iteratively refine the geometry. Finally, a texture inference network takes the inferred
geometry and the multi-view texture features to infer the high-resolution texture maps. Our proposed method
is a comprehensive, neural-based, animation-ready face capturing solution that produces production-grade
dense geometry, the complete texture maps required by most PBR skin shaders, and consistent mesh
connectivity across subjects and expressions. Our evaluation shows that the model is highly efficient and
fully automatic, and our real-world demonstration shows that it is robust even to novel multi-view capturing
rigs, including light-weight systems with sparse views.
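At a high level, the inference loop described above can be summarized by the following sketch. Every module referenced here (`image_encoder`, `uv_encoder`, `correlate`, `gru_update`, `texture_net`) is a placeholder name standing in for the components detailed in Chapter 4, not the actual implementation.

```python
# Control-flow sketch only; every entry in `modules` is a placeholder callable,
# not the actual architecture described in Chapter 4.
import torch

def recurrent_face_fit(images, cams, pos_map, modules, iters=3):
    """images: (V, 3, H, W) multi-view input; pos_map: (3, Ht, Wt) initial UV-space geometry."""
    img_feats = modules['image_encoder'](images)       # per-view visual features
    uv_feats = modules['uv_encoder']()                  # UV-space semantic features
    hidden = torch.zeros_like(uv_feats)                 # recurrent state
    for _ in range(iters):
        # Correlate the current geometry estimate with the multi-view features.
        corr = modules['correlate'](pos_map, img_feats, uv_feats, cams)
        hidden, delta = modules['gru_update'](hidden, corr)
        pos_map = pos_map + delta                       # refine the position map
    texture = modules['texture_net'](pos_map, img_feats, cams)
    return pos_map, texture
```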
Chapter 2
Image-based 3D Optimization with Inverse Graphics
Understanding and reconstructing 3D scenes and structures from 2D images has been one of the funda-
mental goals in computer vision. The key to image-based 3D reasoning is to find sufficient supervisions
flowing from the pixels to the 3D properties. To obtain image-to-3D correlations, prior approaches mainly
rely on the matching losses based on 2D key points/contours [16, 128, 98, 113] or shape/appearance pri-
ors [13, 103, 31, 86, 174]. However, the above approaches are either limited to task-specific domains or
can only provide weak supervision due to the sparsity of the 2D features. In contrast, as the process of
producing 2D images from 3D assets, rendering relates each pixel with the 3D parameters by simulat-
ing the physical mechanism of image formation. Hence, by inverting a renderer, one can obtain dense
pixel-level supervision for general-purpose 3D reasoning tasks, which cannot be achieved by conventional
approaches.
However, the rendering process is not differentiable in conventional graphics pipelines. In particu-
lar, a standard mesh renderer involves a non-differentiable sampling operation, called rasterization, which
prevents the gradient from being backpropagated into the mesh vertices. To achieve differentiable rendering,
recent advances [104, 73] only approximate the backward gradient with hand-crafted functions while di-
rectly employing a standard graphics renderer in the forward pass. While promising results have been
shown for the task of image-based 3D reconstruction, they are not able to propagate gradient to distant
vertices in all directions in the image space and fail to handle occlusions. We show in the experiments that
these limitations would cause problematic situations in image-based shape fitting where the 3D parameters
cannot be efficiently optimized. In Section 2.2, we propose a naturally differentiable rendering framework
that is able to (1) directly render colorized mesh using differentiable functions and (2) back-propagate effi-
cient supervisions to mesh vertices and their attributes from various forms of image representations. The
key to our framework is a novel formulation that views rendering as an aggregation function that fuses
the probabilistic contributions of all mesh triangles with respect to the rendered pixels. Such formulation
enables our framework to flow gradients to the occluded and distant vertices, which cannot be achieved
by the previous state-of-the-arts. We show that by using the proposed renderer, one can achieve signif-
icant improvement in 3D unsupervised single-view reconstruction both qualitatively and quantitatively.
Experiments also demonstrate that our approach can handle the challenging tasks in image-based shape
fitting, which remain nontrivial for existing differentiable renderers.
Aside from the optimization method, the 3D representation also contributes largely to the high-quality
shape reconstruction. While explicit representations, such as point clouds and voxels, can span a wide
range of shape variations, their resolutions are often limited. Mesh-based representations are more efficient
but are limited in their ability to handle varying topologies. Implicit surfaces, however, can robustly handle
complex shapes, topologies, and also provide flexible resolution control. We address the fundamental
problem of learning implicit surfaces for shape inference without the need for 3D supervision. Despite their
advantages, it remains nontrivial to (1) formulate a differentiable connection between implicit surfaces
and their 2D renderings, which is needed for image-based supervision; and (2) ensure precise geometric
properties and control, such as local smoothness. In particular, sampling implicit surfaces densely is also
known to be a computationally demanding and very slow operation. To this end, in Section 2.3 we propose
a novel ray-based field probing technique for efficient image-to-field supervision, as well as a general
geometric regularizer for implicit surfaces, which provides natural shape priors in unconstrained regions.
We demonstrate the effectiveness of our framework on the task of single-view image-based 3D shape
digitization and show how we outperform state-of-the-art techniques both quantitatively and qualitatively.
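As a rough illustration of image-to-field supervision, the sketch below scores each silhouette pixel by the maximum occupancy sampled along its camera ray and compares it against the ground-truth silhouette label. It densely samples points along rays rather than using the sparse anchor-based probing of Section 2.3, and `occupancy_net` with the signature used here is an assumption.

```python
# Simplified silhouette supervision for an implicit occupancy field. Points are
# sampled densely along rays instead of using sparse anchors; `occupancy_net`
# and its (latent, points) signature are assumptions.
import torch
import torch.nn.functional as F

def silhouette_loss(occupancy_net, latent, ray_origins, ray_dirs, gt_inside,
                    n_samples=64, near=0.5, far=2.5):
    """ray_origins, ray_dirs: (R, 3); gt_inside: (R,) with 1 if the pixel is inside the silhouette."""
    t = torch.linspace(near, far, n_samples)                                   # (S,)
    pts = ray_origins[:, None, :] + t[None, :, None] * ray_dirs[:, None, :]    # (R, S, 3)
    occ = occupancy_net(latent, pts.reshape(-1, 3)).reshape(pts.shape[:2])     # (R, S) in [0, 1]
    # A ray is predicted "inside" if any sample along it is occupied (max pooling).
    pred_inside = occ.max(dim=1).values
    return F.binary_cross_entropy(pred_inside.clamp(0.0, 1.0), gt_inside.float())
```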
2.1 Related Work
Differentiable Rendering. To relate changes in the observed image with those in the 3D shape manipu-
lation, a number of existing techniques have utilized the derivatives of rendering [52, 51, 110, 68]. Recently,
Loper and Black [104] introduced an approximate differentiable renderer which generates derivatives from
projected pixels to the 3D parameters. Kato et al. [73] propose to approximate the backward gradient of
rasterization with a hand-crafted function to achieve differentiable rendering for triangular mesh. Rhodin
et al. [133] pioneered leveraging transparency and spatial smoothness for pose estimation in the context of
point cloud rendering. By modeling an object as a collection of translucent Gaussian blobs whose color,
transparency, center and magnitude can be optimized to fit a given silhouette and appearance, their ap-
proach scales well to a variety of representations, e.g. mesh, skeleton, etc., for the task of pose estimation.
However, due to the loss of inter-connectivity between the Gaussians, it is non-trivial for their method
to model the fine-scale deformations of a mesh surface. Hence, such a limitation hampers it from handling
3D mesh reconstruction and deformation tasks. Our work pushes the idea of transparency and spatial
smoothing further and extends it to mesh rendering. With the proposed SoftRas framework, we are able to handle
general image-based 3D mesh reasoning tasks without the aforementioned limitation.
More recently, Li et al. [94] introduce a novel edge sampling algorithm that is able to compute deriva-
tives of scalar functions over ray-traced images. Built on Monte Carlo ray tracing, they propose new spa-
tial acceleration and importance sampling solutions to efficiently sample derivatives for arbitrary bounces
of light transport, including secondary effects such as shadows and global illumination. While Li et
al. [94] focus on images rendered using the ray tracing technique, our work aims to differentiate the
rasterization-based rendering approach. Loubet et al. [105] propose an alternative differentiable path tracer that
achieves much lower variance at the cost of some bias. They avoid the explicit sampling of discontinuities
by applying carefully chosen changes and re-parameterizations of variables that remove the dependence
of the discontinuities on scene parameters. Nimier-David et al. [125] introduce Mitsuba 2, a versatile render-
ing framework that implements differentiable rendering along with other advanced rendering features,
by leveraging modern C++ and template metaprogramming that enables retargetable implementation to
a variety of application domains at compile time. To compute the derivatives of radiometric measures
with respect to arbitrary scene parameterizations, Zhang et al. [173] introduce a new differential theory
of radiative transfer that allows differentiation of the radiative transfer equation while handling a wide range
of light transport phenomena. Recent advances in 3D face reconstruction [134, 148, 149, 151, 47], material
inference [99, 34] and other 3D reconstruction tasks [182, 132, 122, 63, 81, 124] have also leveraged some
other forms of differentiable rendering layers to obtain gradient flows in the neural networks. However,
these rendering layers are usually designed for special purposes and thus cannot be generalized to other
applications. In this chapter, we focus on a general-purpose differentiable rendering framework that is able
to directly render a given mesh using differentiable functions instead of only approximating the backward
derivatives.
Image-based 3D Reasoning. 2D images are widely used as the media for reasoning about 3D proper-
ties. In particular, image-based reconstruction has received the most attention. Conventional approaches
mainly leverage the stereo correspondence based on the multi-view geometry [60, 43] but are restricted
to the coverage provided by the multiple views. With the availability of large-scale 3D shape dataset [23],
learning-based approaches [158, 57, 65] are able to consider single or few images thanks to the shape prior
learned from the data. To simplify the learning problem, recent works reconstruct 3D shape via predicting
intermediate 2.5D representations, such as depth map [97], image collections [71], displacement map [66]
or normal map [129, 161]. Pose estimation is another key task to understanding the visual environment.
For 3D rigid pose estimation, while early approaches attempt to cast it as a classification problem [153], re-
cent approaches [74, 164] can directly regress the 6D pose by using deep neural networks. Estimating the
pose of non-rigid objects, e.g. human face or body, is more challenging. By detecting the 2D key points,
great progress has been made to estimate the 2D poses [111, 20, 163]. To obtain 3D pose, shape priors [13,
103] have been incorporated to minimize the shape fitting errors in recent approaches [16, 20, 70, 14]. Our
proposed differentiable renderer can provide dense rendering supervision to 3D properties, benefiting a
variety of image-based 3D reasoning tasks. Also, while existing methods focus on learning shapes from 2D
supervision using explicit shape representations (i.e., voxels, point clouds, and meshes), we further
present the first framework for unsupervised learning of implicit surface representations by differentiating
the implicit field rendering. With this framework, one can reconstruct shapes with arbitrary topology at
arbitrary resolution from a single image without requiring any 3D supervision.
Rendering transparency and smoothness. Alpha blending [21] is widely used in computer graphics
to create the appearance of partial or full transparency. Geometric elements are rendered in separate
passes or layers and then composited into a single final image guided by the alpha channel. To avoid the
expensive operations of alpha blending that require rendering geometry in sorted order, order-independent
transparency (OIT) was later proposed to sort geometry per-pixel after rasterization and store all fragments
for exact computing. While the spirit of OIT has inspired a number of subsequent works [37, 121, 114, 69]
focusing on avoiding the cost of storing and sorting primitives or fragments, we focus on the branch of
blended order-independent transparency, which leverages a similar idea to ours. Meshkin [117] was
the first to introduce blended OIT by formulating a compositing operator based on weighted sum. Bavoil
and Myers [9] improve Meshkin’s operator using a better approximation of both coverage and color with
a weighted average operator. Closer to our approach, McGuire and Bavoil [115] propose weighted blended
OIT that leverages depth-based weights which decrease with the distance from the camera. Our aggre-
gation function uses a similar idea and shows that the depth weights play a key role in achieving depth-
based transparency and rendering differentiability. Smoothness-wise, the conservative rasterization technique
in graphics introduces an uncertainty region around the boundary of triangles to handle rounding errors
and other issues that could add uncertainty to the exact dimensions of the triangle. In contrast, we extend
this concept of uncertainty to handling the non-differentiability of rasterization. Our probability map can
be viewed as a more general extension of uncertainty which allows the communication between distant
pixels and triangles.
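For intuition, depth-weighted blending of the fragments covering a single pixel can be written as below. The exponential weight is only a stand-in; the specific weighting functions used by weighted blended OIT [115] and by our aggregation function (Section 2.2.3) differ.

```python
# Illustrative depth-weighted compositing in the spirit of blended OIT: each
# fragment of a pixel contributes with a weight that decays with its camera
# distance, so no per-fragment sorting is needed. The weight function is a stand-in.
import torch

def blended_oit_pixel(colors, alphas, depths, gamma=0.1):
    """colors: (K, 3); alphas, depths: (K,) fragments covering one pixel."""
    w = alphas * torch.exp(-depths / gamma)      # closer fragments weigh more
    w = w / (w.sum() + 1e-8)                     # normalize the contributions
    return (w[:, None] * colors).sum(dim=0)      # composited pixel color
```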
Geometric Representation for 3D Deep Learning. A 3D surface can be represented either explicitly
or implicitly. Explicit representations mainly consist of three categories: voxel-, point- and mesh-based.
Due to their uniform spatial structures, voxel-based representations [23, 112, 154] have been extensively
explored to replicate the success of 2D convolutional networks onto the 3D regular domain. Such volu-
metric representations can be easily generalized across shape topologies, but are often restricted to low
resolutions due to large memory requirements. Progress has also been made in reconstructing point clouds
from single images using point feature learning [169, 38, 96, 1, 68]. While being able to describe arbitrary
topologies, point-based representations are also restricted by their resolution capabilities since dense sam-
ples are needed. Mesh representations can be more efficient since they naturally describe mesh connectiv-
ity and are hence suitable for 2-manifold representations. Recent advances have focused on reconstructing
mesh geometry from point clouds [56] or even a single image [158]. AtlasNet [56] learns an implicit rep-
resentation that maps and assembles 2D squares to 3D surface patches. Despite the compactness of mesh
representations, it remains challenging to modify the vertex connections, making it unsuitable for model-
ing shapes with arbitrary topology.
Unlike explicit surfaces, implicit surface representations [22, 139] depict a 3D shape by extracting the
iso-surface from a continuous field. For implicit surfaces, a generative model can have more flexibility and
expressiveness for capturing complex topologies. Furthermore, multi-resolution representations and con-
trol enable them to capture fine geometric details at arbitrary resolution and also reduce the memory
footprint during training. Recent works [95, 27, 127, 118, 116, 65] have shown promising results on su-
pervised learning for 3D shape inference based on implicit representations. Our approach further pushes
the envelope by achieving 3D-unsupervised learning of implicit generative shape modeling solely from 2D
images.
2.2 Soft Rasterizer: A General Differentiable Mesh Renderer
Figure 2.1: We propose Soft Rasterizer R (upper), a naturally differentiable renderer, which formulates
rendering as a differentiable aggregating process A(·) that fuses per-triangle contributions {D_i} in a “soft”
probabilistic manner. Our approach attacks the core problem of differentiating the standard rasterizer,
which cannot propagate gradients from pixels to geometry due to the discrete sampling operation (below).
In this section, instead of studying a better form of rendering gradient, we attack the key problem of
differentiating the forward rendering function. Specifically, we propose a naturally differentiable render-
ing framework that is able to render a colorized mesh in the forward pass (Figure 2.1). In addition, our
framework can consider texture and a variety of 3D properties, including mesh geometry, vertex attributes
(color, normals, etc.), camera parameters, and illumination, and is able to propagate efficient gradients from
pixels to mesh vertices and their attributes. Our renderer can be plugged into either a neural network or
a non-learning optimization framework for 3D reasoning.
The key to our approach is the formulation that views rendering as a “soft” probabilistic process. Unlike
the popular z-buffer rendering approach, which only selects the color of the closest triangle in the viewing
direction (Figure 2.1 below), we apply a differentiable, order-independent transparency rendering pipeline.
In particular, we propose that all triangles have probabilistic contributions to each rendered pixel, which
can be modeled as probability maps in the screen space. While conventional OpenGL rendering pipelines
merge shaded fragments by selecting the closest triangle, we propose a differentiable aggregation function
that fuses the per-triangle color maps based on the probability maps and the triangles’ relative depths to
obtain the final rendering result (Figure 2.1 upper). The proposed aggregating mechanism enables our ren-
derer to propagate gradients to all mesh triangles, including the occluded ones. In addition, our framework
can pass supervision signals from pixels to distant triangles in the image space because of its probabilistic
formulation. We call our framework Soft Rasterizer (SoftRas) as it “softens" the discrete rasterization to
enable differentiability.
SoftRas is able to provide high-quality gradients that supervise a variety of tasks on image-based 3D
reasoning. To evaluate the performance of SoftRas, we show applications in 3D unsupervised single-view
mesh reconstruction and image-based shape fitting (Figure 2.2, Section 2.2.8.2 and 2.2.8.5). In particular,
as SoftRas provides strong error signals to the mesh generator simply based on the rendering loss, one
can achieve mesh reconstruction from a single image without any 3D supervision.

Figure 2.2: Forward rendering (left): various rendering effects generated by SoftRas by tuning the degree of transparency and blurriness. Applications based on the backward gradients provided by SoftRas: (1) 3D unsupervised mesh reconstruction from a single input image (middle) and (2) 3D pose fitting to the target image by flowing gradients to the occluded triangles (right).

To faithfully texture the mesh, we further propose to extract representative colors from the input image and formulate the color
regression as a classification problem. Regarding the task of image-based shape fitting, we show that our approach is able to (1) handle occlusions using the aggregating mechanism that considers the probabilistic contributions of all triangles; and (2) provide a much smoother energy landscape than other differentiable renderers, which avoids local minima thanks to the smooth rendering (Figure 2.2 left). Experimental results demonstrate that our approach significantly outperforms the state of the art both quantitatively and qualitatively. The code of our paper is available at https://github.com/ShichenLiu/SoftRas.
The contributions of this section can be summarized as follows:
• We propose a differentiable mesh rendering technique by introducing smoothing operations in both
the spatial and depth extent to make rasterization differentiable.
• The proposed SoftRas renderer can directly render colorized mesh using differentiable functions
with the ability of tuning the sharpness and blurriness of the rendering results. It enables gradients
to be back-propagated from rendered image pixels to far-range and even occluded vertices, making
challenging image-based 3D reasoning tasks (see Section 2.2.8.5) possible.
• With specific illumination models, e.g. the Phong model or spherical harmonics, our framework is general enough to reason about all 3D properties, e.g. geometry, camera, texture, material and lighting, by back-propagating supervision signals from the image. Our approach can also handle a variety of image representations, including silhouette, shading and color images.
Figure 2.3: Comparisons between the standard rendering pipeline (upper branch) and our rendering framework (lower branch).
2.2.1 Overview
As shown in Figure 2.3, we consider both extrinsic variables (camera $P$ and lighting conditions $L$) that define the environmental settings, and intrinsic properties (triangle meshes $M$ and per-vertex appearance $A$, including color, material, etc.) that describe the model-specific properties. Following the standard rendering pipeline, one can obtain the mesh normal $N$, image-space coordinate $U$ and view-dependent depths $Z$ by transforming input geometry $M$ based on camera $P$. With specific assumptions of illumination (e.g. spherical harmonics) and material models (e.g. the Phong model), we can compute color $C$ given $\{A, N, L\}$. These two modules are differentiable with automatic differentiation. However, the subsequent operations, rasterization and z-buffering, in the standard graphics pipeline (Figure 2.3, red blocks) are not differentiable with respect to $U$ and $Z$ due to the discrete sampling operations.
Analytically speaking, following a similar spirit to [104], according to the computation graph in Figure 2.3, the gradient from the rendered image $I$ to the vertices of the mesh $M$ is obtained by
$$\frac{\partial I}{\partial M} = \frac{\partial I}{\partial U}\frac{\partial U}{\partial M} + \frac{\partial I}{\partial Z}\frac{\partial Z}{\partial M} + \frac{\partial I}{\partial N}\frac{\partial N}{\partial M}. \qquad (2.1)$$
While $\frac{\partial U}{\partial M}$, $\frac{\partial Z}{\partial M}$, $\frac{\partial I}{\partial N}$ and $\frac{\partial N}{\partial M}$ can be easily obtained by inverting the projection matrix and the illumination models, $\frac{\partial I}{\partial U}$ and $\frac{\partial I}{\partial Z}$ do not exist in conventional rendering pipelines. Our framework introduces an intermediate representation, the probability map $D$, that factorizes the gradient $\frac{\partial I}{\partial U}$ into $\frac{\partial I}{\partial D}\frac{\partial D}{\partial U}$, enabling the differentiability of $\frac{\partial I}{\partial U}$. Further, we obtain $\frac{\partial I}{\partial Z}$ via the proposed aggregation function. We detail the gradient $\frac{\partial D}{\partial U}$ in Section 2.2.2 and the gradients $\frac{\partial I}{\partial D}$ and $\frac{\partial I}{\partial Z}$ in Section 2.2.3, respectively.
Our differentiable formulation. We take a different perspective that the rasterization can be viewed as binary masking that is determined by the relative positions between the pixels and triangles, while z-buffering merges the rasterization results $F$ in a pixel-wise one-hot manner (only selecting the closest triangle) based on the relative depths of triangles. The problem is then formulated as modeling the discrete binary masks and the one-hot merging operation in a soft and differentiable fashion. We therefore propose two major components that smooth the operations over the spatial and depth extent, respectively. Spatial-wise, we propose probability maps $D = \{D_j\}$ that model the probability of each pixel staying inside a specific triangle $f_j$. Depth-wise, an aggregation function $\mathcal{A}(\cdot)$ is introduced to fuse per-triangle color maps based on $\{D_j\}$ and the relative depths among triangles. With such a formulation, all 3D properties, e.g. camera, texture, material, lighting and geometry, could receive gradients from the image.
2.2.2 Probability Map Computation
We model the influence of a triangle $f_j$ on the image plane by a probability map $D_j$. To estimate the probability of $D_j$ at pixel $p_i$, the function is required to take into account both the relative position and the distance between $p_i$ and $f_j$. To this end, we define $D_j$ at pixel $p_i$ as follows:
$$D^i_j = \mathrm{sigmoid}\Big(\delta^i_j \cdot \frac{d^2(i,j)}{\sigma}\Big), \qquad (2.2)$$
where $\sigma$ is a positive scalar that controls the sharpness of the probability distribution, while $\delta^i_j$ is a sign indicator, $\delta^i_j = \{+1, \text{if } p_i \in f_j;\ -1, \text{otherwise}\}$. We set $\sigma$ to $1 \times 10^{-4}$ unless otherwise specified.
Figure 2.4: Probability maps of a triangle under the Euclidean (upper) and barycentric (lower) metrics. (a) Definition of the pixel-to-triangle distance; (b)-(d) probability maps generated with different σ.
Here, $d(i,j)$ is the closest distance from $p_i$ to the edges of $f_j$. A natural choice for $d(i,j)$ is the Euclidean distance. However, other metrics, such as the barycentric or $l_1$ distance, can be used in our approach.
Intuitively, by using the sigmoid function, Equation 2.2 normalizes the output to $(0,1)$, which is a faithful continuous approximation of a binary mask whose boundary lies at 0.5. In addition, the sign indicator maps pixels inside and outside $f_j$ to the ranges $(0.5, 1)$ and $(0, 0.5)$ respectively. Figure 2.4 shows $D_j$ of a triangle with varying $\sigma$ using the Euclidean distance. A smaller $\sigma$ leads to a sharper probability distribution, while a larger $\sigma$ tends to blur the outcome. This design allows controllable influence for triangles on the image plane. As $\sigma \rightarrow 0$, the resulting probability map converges to the exact shape of the triangle, making our probability map computation a generalized form of traditional rasterization.
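To make Equation 2.2 concrete, the following is a minimal PyTorch sketch that evaluates the probability map of a single screen-space triangle over a pixel grid using the Euclidean metric. It is only an illustration of the formula (the pixel grid, the inside test and the helper names are our own choices), not the custom CUDA kernels described in Section 2.2.5.

```python
import torch

def point_to_segment_dist2(p, a, b):
    # Squared distance from pixels p (N, 2) to the segment a-b (each of shape (2,)).
    ab = b - a
    t = ((p - a) @ ab) / ab.dot(ab).clamp(min=1e-12)
    proj = a + t.clamp(0.0, 1.0).unsqueeze(1) * ab
    return ((p - proj) ** 2).sum(dim=1)

def soft_probability_map(tri, pixels, sigma=1e-4):
    # tri: (3, 2) screen-space vertices; pixels: (N, 2). Implements Equation 2.2 (Euclidean metric).
    v0, v1, v2 = tri
    def edge_sign(a, b):
        return (b[0] - a[0]) * (pixels[:, 1] - a[1]) - (b[1] - a[1]) * (pixels[:, 0] - a[0])
    s0, s1, s2 = edge_sign(v0, v1), edge_sign(v1, v2), edge_sign(v2, v0)
    inside = ((s0 >= 0) & (s1 >= 0) & (s2 >= 0)) | ((s0 <= 0) & (s1 <= 0) & (s2 <= 0))
    delta = inside.float() * 2.0 - 1.0                       # +1 inside, -1 outside
    d2 = torch.stack([point_to_segment_dist2(pixels, v0, v1),
                      point_to_segment_dist2(pixels, v1, v2),
                      point_to_segment_dist2(pixels, v2, v0)]).min(dim=0).values
    return torch.sigmoid(delta * d2 / sigma)                 # D_j evaluated at every pixel

# Example: evaluate the probability map on a 64 x 64 grid in normalized coordinates.
ys, xs = torch.meshgrid(torch.linspace(0, 1, 64), torch.linspace(0, 1, 64), indexing="ij")
pixels = torch.stack([xs.reshape(-1), ys.reshape(-1)], dim=1)
tri = torch.tensor([[0.2, 0.2], [0.8, 0.3], [0.5, 0.9]])
D = soft_probability_map(tri, pixels, sigma=1e-2).reshape(64, 64)
```

Increasing sigma widens the band of pixels with non-negligible probability, which is what produces the blurrier renderings shown later in Figure 2.11.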
In the remainder of this subsection, we provide details of the computation of $\frac{\partial D}{\partial U}$ for different metric choices. In particular, we introduce two candidate metrics for $d(i,j)$, namely the signed Euclidean distance and the barycentric metric. To correlate $p_i$ with $f_j$, we represent $p_i$ using the barycentric coordinate $\mathbf{b}^i_j \in \mathbb{R}^3$ defined by $f_j$:
$$\mathbf{b}^i_j = U_j^{-1} p_i, \qquad (2.3)$$
where
$$U_j = \begin{bmatrix} x_1 & x_2 & x_3 \\ y_1 & y_2 & y_3 \\ 1 & 1 & 1 \end{bmatrix}, \qquad p_i = \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}$$
are the homogeneous screen-space coordinates of $f_j$'s vertices and of the pixel $p_i$, respectively.
Euclidean Distance. Let $\mathbf{t}^i_j \in \mathbb{R}^3$ be the barycentric coordinate of the point on the edges of $f_j$ that is closest to $p_i$. The signed Euclidean distance $D_E(i,j)$ from $p_i$ to the edges of $f_j$ can be computed as:
$$D_E(i,j) = \delta^i_j \, \big\| U_j \mathbf{t}^i_j - p_i \big\|_2^2, \qquad (2.4)$$
where $\delta^i_j$ is a sign indicator defined as $\delta^i_j = \{+1, \text{if } p_i \in f_j;\ -1, \text{otherwise}\}$.

Then the partial gradient $\frac{\partial D_E(i,j)}{\partial U_j}$ can be obtained via:
$$\frac{\partial D_E(i,j)}{\partial U_j} = 2\,\delta^i_j \big( U_j \mathbf{t}^i_j - p_i \big) {\mathbf{t}^i_j}^{\top}. \qquad (2.5)$$
Barycentric Distance. We define the barycentric metric $D_B(i,j)$ as the minimum of the barycentric coordinate:
$$D_B(i,j) = \min\{\mathbf{b}^i_j\}. \qquad (2.6)$$
Let $s = \arg\min_k (\mathbf{b}^i_j)^{(k)}$; then the gradient from $D_B(i,j)$ to $U_j$ can be obtained through:
$$\frac{\partial D_B(i,j)}{\partial (U_j)^{(k,l)}} = \frac{\partial \min\{\mathbf{b}^i_j\}}{\partial (U_j)^{(k,l)}} = \frac{\partial (\mathbf{b}^i_j)^{(s)}}{\partial U_j^{-1}} \, \frac{\partial U_j^{-1}}{\partial (U_j)^{(k,l)}} = -\sum_t (p_i)^{(t)} \, (U_j^{-1})^{(s,k)} (U_j^{-1})^{(l,t)}, \qquad (2.7)$$
where $k$ and $l$ are the indices of $U_j$'s elements.
We show the probability maps of a triangle constructed using the Euclidean and barycentric metrics under different parameter settings in Figure 2.4. In general, the Euclidean metric generates a uniformly decaying influence as the distance increases, regardless of the triangle size and shape. Hence, it is more robust to varying densities of triangular meshes. In contrast, the probability distribution generated by the barycentric metric is more sensitive to the shape of the triangle. However, compared to the Euclidean metric, which only passes gradients through the edge closest to the query point, the barycentric metric is able to pass gradients to all three vertices of the triangle. We provide a more detailed performance comparison of these two metrics in the ablation study in Section 2.2.8.4.
2.2.3 Aggregation Function
For each mesh triangle $f_j$, we define its color map $C_j$ at pixel $p_i$ on the image plane by interpolating color using barycentric coordinates. We clip the barycentric coordinates to $[0, 1]$ and normalize them so that they sum to 1, which prevents negative barycentric coordinates in the color computation. We then propose to use an aggregation function $\mathcal{A}(\cdot)$ to merge the color maps $\{C_j\}$ into the rendering output $I$ based on $\{D_j\}$ and the relative depths $\{z_j\}$. Inspired by the softmax operator, we define an aggregation function $\mathcal{A}_S$ as follows:
$$I^i = \mathcal{A}_S(\{C_j\}) = \sum_j w^i_j C^i_j + w^i_b C_b, \qquad (2.8)$$
where $C_b$ is the background color; the weights $\{w^i_j\}$ satisfy $\sum_j w^i_j + w^i_b = 1$ and are defined as:
$$w^i_j = \frac{D^i_j \exp(z^i_j / \gamma)}{\sum_k D^i_k \exp(z^i_k / \gamma) + \exp(\epsilon / \gamma)}. \qquad (2.9)$$
In particular, $z^i_j$ denotes the normalized depth of the 3D point on $f_j$ whose 2D projection is $p_i$. We normalize the depth so that a closer triangle receives a larger $z^i_j$:
$$z^i_j = \frac{Z_{far} - Z^i_j}{Z_{far} - Z_{near}}, \qquad (2.10)$$
where $Z^i_j$ denotes the actual clipped depth of $f_j$ at $p_i$, while $Z_{near}$ and $Z_{far}$ denote the near and far cut-off distances of the viewing frustum respectively. $\epsilon$ is a small constant that enables the background color, while $\gamma$ (set to $1 \times 10^{-4}$ unless otherwise specified) controls the sharpness of the aggregation function. Note that $w^i_j$ is a function of two major variables: $D^i_j$ and $z^i_j$. Specifically, $w^i_j$ assigns higher weights to closer triangles, which have a larger $z^i_j$. As $\gamma \rightarrow 0$, the color aggregation function only outputs the color of the nearest triangle, which exactly matches the behavior of z-buffering. In addition, $w^i_j$ is robust to z-axis translations. $D^i_j$ modulates $w^i_j$ along the x and y directions such that triangles closer to $p_i$ in screen space receive higher weights. Equation 2.8 also works for shading images when the intrinsic vertex colors are set to constant ones.
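For reference, a dense (batched-over-triangles) PyTorch sketch of the aggregation in Equations 2.8-2.10 can be written as below; the tensor layout is an assumption for illustration, and the max-subtraction is only there to keep the exponentials numerically stable without changing the result.

```python
import torch

def softmax_aggregate(colors, probs, z, bg_color, gamma=1e-4, eps=1e-3,
                      z_near=0.1, z_far=100.0):
    """
    colors: (T, N, 3) per-triangle color maps C_j at every pixel
    probs:  (T, N)    per-triangle probability maps D_j (Equation 2.2)
    z:      (T, N)    clipped view-space depths Z_j of each triangle at every pixel
    bg_color: (3,)    background color C_b
    Returns the aggregated image I of shape (N, 3), following Equations 2.8-2.10.
    """
    z_norm = (z_far - z.clamp(z_near, z_far)) / (z_far - z_near)   # Eq. 2.10: closer => larger
    logits = z_norm / gamma                                        # exponents of Eq. 2.9
    bg_logit = torch.full_like(logits[:1], eps / gamma)
    m = torch.maximum(logits.max(dim=0, keepdim=True).values, bg_logit)
    w = probs * torch.exp(logits - m)                              # unnormalized foreground weights
    w_b = torch.exp(bg_logit - m)                                  # unnormalized background weight
    denom = w.sum(dim=0, keepdim=True) + w_b                       # shared normalizer of Eq. 2.9
    fg = (w.unsqueeze(-1) * colors).sum(dim=0)                     # (N, 3)
    return (fg + w_b.squeeze(0).unsqueeze(-1) * bg_color) / denom.squeeze(0).unsqueeze(-1)
```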
The gradients $\frac{\partial I}{\partial D^i_j}$ and $\frac{\partial I}{\partial z^i_j}$ can be obtained as follows:
$$\frac{\partial I^i}{\partial D^i_j} = \sum_k \frac{\partial I^i}{\partial w^i_k}\frac{\partial w^i_k}{\partial D^i_j} + \frac{\partial I^i}{\partial w^i_b}\frac{\partial w^i_b}{\partial D^i_j} = \sum_{k \neq j} \Big(-C^i_k \frac{w^i_j w^i_k}{D^i_j}\Big) + C^i_j \Big(\frac{w^i_j}{D^i_j} - \frac{w^i_j w^i_j}{D^i_j}\Big) - C^i_b \frac{w^i_j w^i_b}{D^i_j} = \frac{w^i_j}{D^i_j}\big(C^i_j - I^i\big) \qquad (2.11)$$
$$\frac{\partial I^i}{\partial z^i_j} = \sum_k \frac{\partial I^i}{\partial w^i_k}\frac{\partial w^i_k}{\partial z^i_j} + \frac{\partial I^i}{\partial w^i_b}\frac{\partial w^i_b}{\partial z^i_j} = \sum_{k \neq j} \Big(-C^i_k \frac{w^i_j w^i_k}{\gamma}\Big) + C^i_j \Big(\frac{w^i_j}{\gamma} - \frac{w^i_j w^i_j}{\gamma}\Big) - C^i_b \frac{w^i_j w^i_b}{\gamma} = \frac{w^i_j}{\gamma}\big(C^i_j - I^i\big) \qquad (2.12)$$
Note that we only clamp the barycentric coordinates for the color computation ($C$ in Equations 2.8, 2.11 and 2.12). The gradients with respect to spatial location and per-vertex normal still exist, i.e. they flow through $w$ to $D$ and then to the mesh vertices in Equations 2.11 and 2.12, where the barycentric weights are not clamped. In addition, our framework does not provide gradients for UV coordinates. However, we argue that this gradient may not be necessary: given a fixed mesh topology and UV mapping, we are able to optimize the colors of the UV texture map, and directly deforming the mesh in 3D space has greater capability than deforming UV coordinates in 2D UV space. Further, since the texture itself might be changing during the learning process, keeping the UV coordinates fixed leads to more stable network training.
Occupancy Aggregation Function. While Equation 2.8 works for colors, it can also be extended to aggregate the alpha channel, i.e. silhouette images. However, the continuous interpolation in Equation 2.8 may not fit the binary nature of silhouettes. In addition, the silhouette of an object is independent of its color and depth map. Hence, we propose a dedicated aggregation function $\mathcal{A}_O$ for silhouettes based on the binary occupancy:
$$I^i_s = \mathcal{A}_O(\{D_j\}) = 1 - \prod_j \big(1 - D^i_j\big). \qquad (2.13)$$
Intuitively, Equation 2.13 models the silhouette as the probability of having at least one triangle cover the pixel $p_i$. The partial gradient $\frac{\partial I^i_s}{\partial D^i_j}$ can be computed as follows:
$$\frac{\partial I^i_s}{\partial D^i_j} = \frac{1 - I^i_s}{1 - D^i_j}. \qquad (2.14)$$
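A direct translation of Equation 2.13 is shown below; evaluating the product in log space (our addition, not part of the original formulation) avoids underflow when many near-certain triangles cover a pixel.

```python
import torch

def occupancy_aggregate(probs, eps=1e-7):
    """probs: (T, N) per-triangle probability maps D_j; returns the silhouette I_s of shape (N,)."""
    log_one_minus = torch.log((1.0 - probs).clamp(min=eps))   # log(1 - D_j)
    return 1.0 - torch.exp(log_one_minus.sum(dim=0))          # 1 - prod_j (1 - D_j)
```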
Note that there might exist other forms of occupancy aggregation functions. One alternative is a universal aggregation function $\mathcal{A}_N$ implemented as a neural network. We provide an ablation study in this regard in Section 2.2.8.4.
2.2.4 Texture Handling
Figure 2.5: An illustration of our proposed texture map handling approach when the texture resolution is 3. (a) Texture map; (b) resampled texture; (c) texture tensor.
Our method can handle mesh color in the form of both vertex colors and texture maps. In the case of vertex colors, we compute the color at a given position using barycentric interpolation, $c = \sum_i t_i c_i$, where $c_i$ is the per-vertex color and $(t_0, t_1, t_2)$ are the barycentric weights at the query point. Note that as negative colors are not acceptable, we clip the barycentric weights to $(0, 1)$ to avoid invalid configurations. For texture maps, with a fixed UV mapping, our approach can propagate gradients from pixels to the color values in the texture map in a way akin to the vertex-color case. Implementation-wise, as a single mesh may come with multiple texture maps, it is not trivial to retrieve and update texture colors stored in distributed files, especially in tensor-based deep learning frameworks. We therefore propose a resampling mechanism that first resamples colors from the texture image and rearranges them into a fixed-size tensor, as shown in Figure 2.5. The size of the tensor is defined by a texture resolution $R$, and the texture size of each face is $R^2$. Resampling the texture image into a fixed-size tensor indexed by faces brings two benefits: 1) it is straightforward for such a tensor to fit into mini-batches for learning; 2) it is independent of the number of texture maps, leading to a unified memory management system that makes training easier.
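The resampling step in Figure 2.5 can be sketched with `torch.nn.functional.grid_sample`: for each face, a fixed lattice of barycentric samples inside its UV triangle is mapped to texture coordinates, and the texel colors are gathered into a per-face tensor. The lattice pattern and tensor layout below are illustrative simplifications, not the exact layout of the released implementation.

```python
import torch
import torch.nn.functional as F

def resample_face_textures(texture, face_uvs, R=3):
    """
    texture:  (3, H, W) texture image with values in [0, 1]
    face_uvs: (F, 3, 2) UV coordinates (in [0, 1]) of each face's three vertices
    Returns a per-face texture tensor of shape (F, R * R, 3).
    """
    device = face_uvs.device
    # A simple R x R lattice of barycentric weights inside the unit triangle.
    u, v = torch.meshgrid(torch.linspace(0, 1, R, device=device),
                          torch.linspace(0, 1, R, device=device), indexing="ij")
    w0 = u.reshape(-1)
    w1 = (1 - u).reshape(-1) * v.reshape(-1)
    w2 = 1.0 - w0 - w1
    bary = torch.stack([w0, w1, w2], dim=1)                 # (R*R, 3), rows sum to 1
    uv = bary @ face_uvs                                    # (F, R*R, 2) sample points in UV space
    grid = (uv * 2.0 - 1.0).unsqueeze(0)                    # grid_sample expects [-1, 1], shape (1, F, R*R, 2)
    sampled = F.grid_sample(texture.unsqueeze(0), grid, align_corners=True)  # (1, 3, F, R*R)
    return sampled.squeeze(0).permute(1, 2, 0)              # (F, R*R, 3)
```

Because the output has a fixed shape, it batches trivially and is agnostic to how many texture images the original asset referenced.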
2.2.5 Comparisons with Prior Works
Figure 2.6: Comparisons with prior differentiable renderers in terms of gradient flow.
In this section, we compare our approach with the state-of-the-art rasterization-based differentiable renderers, OpenDR [104] and NMR [73], in terms of gradient flows, as shown in Figure 2.6. In particular, NMR leverages a hand-crafted function to approximate the gradient of rasterization while directly using a standard graphics renderer in the forward pass. OpenDR approximates rendering using a function with respect to vertex locations, camera parameters and per-vertex brightness. When computing the gradient from image intensity to 2D image coordinates, OpenDR applies image filtering with limited width to smooth the image and pass gradients to boundary and interior pixels. Due to the filtering operation, pixels close to or inside the triangle can receive gradients, but the range is limited to the bandwidth of the filter, as demonstrated in Figure 2.6.
Gradient from pixels to triangles. Since both OpenDR and NMR utilize a standard graphics renderer in the forward pass, they have no control over the intermediate rendering process and thus cannot flow gradients into triangles that are occluded in the final rendered image (Figure 2.6(a), left and middle). In addition, as their gradients only operate on the image plane, neither OpenDR nor NMR is able to optimize the depth value $z$ of the triangles. In contrast, our approach has full control over the internal variables and is able to flow gradients to invisible triangles and to the $z$ coordinates of all triangles through the aggregation function (Figure 2.6(a), right).
Screen-space gradient from pixels to vertices. Thanks to our continuous probabilistic formulation, the gradient from a pixel $p_i$ in screen space can flow to all distant vertices in our approach (Figure 2.6(b), right). For OpenDR, however, a vertex can only receive gradients from neighboring pixels within a close distance due to the local filtering operation (Figure 2.6(b), left). Regarding NMR, there is no gradient defined from the pixels inside the white regions with respect to the triangle vertices (Figure 2.6(b), middle). In contrast, our approach does not have such an issue thanks to our orientation-invariant formulation.
Figure 2.7: The proposed framework for single-view mesh reconstruction.

Comparison of time and space complexity. In our implementation, we do not use PyTorch autograd functions, considering their large memory footprint and computation overhead. Instead, we argue that it is necessary to customize the CUDA kernel functions, and propose two strategies for optimization: 1) we clip the computation for pixels that are too far away from the triangle (negligible influence), so that the footprint of each triangle is not infinite; 2) we customize our aggregation function implementation so that the memory consumption for each pixel is O(1). The resulting complexity is the same as that of prior works (NMR and OpenDR), i.e. O(HWN) time and O(HW) memory. Theoretically, it is possible to further reduce the time complexity by introducing more advanced hierarchical data structures. However, due to the restrictions on data management and workflow streaming, it is very challenging to implement techniques like hierarchical z-buffering in current deep learning frameworks, making it difficult to reach the lower bound of theoretical complexity.
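The O(1)-per-pixel memory claim follows from Equation 2.8 being a weighted average: the per-pixel numerator and denominator can be accumulated while streaming over triangles, so nothing proportional to the triangle count needs to be stored. The Python sketch below mimics that accumulation (the actual CUDA kernels organize the loop differently; the running-max rescaling is just one standard way of keeping the exponentials bounded):

```python
import torch

def streaming_aggregate(colors, probs, z_norm, bg_color, gamma=1e-4, eps=1e-3):
    """Same result as the dense softmax aggregation, with O(1) accumulation state per pixel.
    colors: (T, N, 3); probs, z_norm: (T, N); bg_color: (3,)."""
    n = probs.shape[1]
    m = torch.full((n, 1), eps / gamma)        # running maximum exponent (starts at the background term)
    numer = bg_color.expand(n, 3).clone()      # numerator, stored divided by exp(m)
    denom = torch.ones(n, 1)                   # denominator, stored divided by exp(m)
    for c, d, zn in zip(colors, probs, z_norm):            # one triangle at a time
        logit = (zn / gamma).unsqueeze(-1)
        new_m = torch.maximum(m, logit)
        scale = torch.exp(m - new_m)                       # rescale previous accumulators
        w = d.unsqueeze(-1) * torch.exp(logit - new_m)     # this triangle's unnormalized weight
        numer = numer * scale + w * c
        denom = denom * scale + w
        m = new_m
    return numer / denom
```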
2.2.6 Image-based 3D Reasoning

Our SoftRas can compute gradients with respect to both extrinsic (e.g. camera and lighting) and intrinsic (e.g. geometry, texture, material) properties, enabling a variety of 3D reasoning tasks. In Section 2.2.8.2, we evaluate SoftRas for single-view mesh reconstruction by fixing the extrinsic parameters.
2.2.6.1 Single-view Mesh Reconstruction
Figure 2.8: Network structure for color reconstruction.

Image-based 3D reconstruction plays a key role in a variety of tasks in computer vision and computer graphics, such as scene understanding, VR/AR, and autonomous driving. Reconstructing 3D objects in either a mesh [158, 126] or voxel [167] representation from a single RGB image has been actively studied thanks
to the advent of deep learning technologies. While most approaches to mesh reconstruction rely on supervised learning, methods working on the voxel representation have strived to leverage rendering losses [29, 154, 81] to mitigate the lack of 3D data. However, the reconstruction quality of voxel-based approaches is limited, primarily due to the high computational expense and the discrete nature of voxels. Nonetheless, unlike voxels, which can be easily rendered via differentiable projection, rendering a mesh in a differentiable fashion is non-trivial, as discussed above. By introducing a naturally differentiable mesh renderer, SoftRas combines the merits of both worlds: the ability to harness abundant multi-view images and the high reconstruction quality of the mesh representation.
To demonstrate the effectiveness of SoftRas, we fix the extrinsic variables and evaluate its performance on single-view 3D reconstruction by combining it with a mesh generator. The direct gradient from image pixels to the shape and color generators enables us to achieve 3D-unsupervised mesh reconstruction. Our framework is shown in Figure 2.7. Given an input image, our shape and color generators produce a triangle mesh $M$ and its corresponding colors $C$, which are then fed into the soft rasterizer. The SoftRas layer renders both the silhouette $I_s$ and the color image $I_c$ and provides rendering-based error signals by comparing them with the ground truths. Inspired by the latest advances in mesh learning [73, 158], we leverage a similar idea of synthesizing a 3D model by deforming a template mesh. To validate the performance of SoftRas, the shape generator employs an encoder-decoder architecture identical to that of [73, 167].
Losses. The reconstruction networks are supervised by three losses: silhouette loss $\mathcal{L}_s$, color loss $\mathcal{L}_c$ and geometry loss $\mathcal{L}_g$. Let $\hat{I}_s$ and $I_s$ denote the predicted and the ground-truth silhouette respectively. The silhouette loss is defined as $\mathcal{L}_s = 1 - \frac{\|\hat{I}_s \otimes I_s\|_1}{\|\hat{I}_s \oplus I_s - \hat{I}_s \otimes I_s\|_1}$, where $\otimes$ and $\oplus$ are the element-wise product and sum operators respectively. The color loss is measured as the $l_1$ norm between the rendered and input images: $\mathcal{L}_c = \|\hat{I}_c - I_c\|_1$. To achieve appealing visual quality, we further impose a geometry loss $\mathcal{L}_g$ that regularizes the Laplacian of both the shape and color predictions. The final loss is a weighted sum of the three losses:
$$\mathcal{L} = \mathcal{L}_s + \lambda \mathcal{L}_c + \mu \mathcal{L}_g. \qquad (2.15)$$
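A PyTorch sketch of the training objective in Equation 2.15 is given below. The silhouette term follows the IoU-style definition above; the geometry term is shown only for the mesh vertices with a precomputed uniform graph Laplacian, which is a simplification of regularizing both the shape and color predictions.

```python
import torch

def silhouette_iou_loss(pred_sil, gt_sil, eps=1e-6):
    """L_s = 1 - ||I_hat * I||_1 / ||I_hat + I - I_hat * I||_1 (element-wise products and sums)."""
    inter = (pred_sil * gt_sil).sum()
    union = (pred_sil + gt_sil - pred_sil * gt_sil).sum()
    return 1.0 - inter / (union + eps)

def laplacian_loss(verts, laplacian):
    """verts: (V, 3); laplacian: (V, V) uniform graph Laplacian built from the template topology."""
    return (laplacian @ verts).pow(2).sum(dim=1).mean()

def total_loss(pred_sil, gt_sil, pred_img, gt_img, verts, laplacian, lam=1.0, mu=1e-3):
    l_s = silhouette_iou_loss(pred_sil, gt_sil)
    l_c = (pred_img - gt_img).abs().mean()      # l1 color loss, averaged over pixels
    l_g = laplacian_loss(verts, laplacian)
    return l_s + lam * l_c + mu * l_g           # Equation 2.15
```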
Color Reconstruction. Recent advances in novel view synthesis [177] have demonstrated that, instead of learning to synthesize pixels from scratch, learning to copy from the input image can achieve results with even higher fidelity. Though directly regressing colors is conceptually simpler, training a reliable regression model is inherently challenging and prone to over-fitting due to the difficulty of predicting a continuous variable lying in a high-dimensional space. Hence, we propose to formulate color reconstruction as a classification problem that learns to reuse the pixel colors in the input image for each sampling point. Let $N_c$ denote the number of sampling points on $M$, and let $H$ and $W$ be the height and width of the input image respectively. The computational cost of a naive color selection approach is prohibitive, i.e. $O(HWN_c)$. To address this challenge, we propose to colorize the mesh using a color palette, as shown in Figure 2.8. Specifically, after passing the input image through a neural network, the extracted features are fed into (1) a sampling network that samples representative colors for building the palette; and (2) a selection network that combines colors from the palette for texturing the sampling points. The color prediction is obtained by multiplying the color selections with the learned color palette. Our approach reduces the computational complexity to $O(N_p(HW + N_c))$, where $N_p$ is the size of the color palette. With a proper setting of $N_p$, one can significantly reduce the computational cost while achieving sharp and accurate color recovery.
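A sketch of the palette-based color head in Figure 2.8 is shown below, with the image feature extractor abstracted away; the two linear heads and the feature dimension are illustrative stand-ins, not the exact architecture.

```python
import torch
import torch.nn as nn

class PaletteColorHead(nn.Module):
    def __init__(self, feat_dim, n_points, n_palette, image_hw):
        super().__init__()
        self.n_palette = n_palette
        # (1) sampling head: attention of each palette entry over the H*W input pixels
        self.sample_logits = nn.Linear(feat_dim, n_palette * image_hw)
        # (2) selection head: mixture of palette entries for each of the N_c sampling points
        self.select_logits = nn.Linear(feat_dim, n_points * n_palette)

    def forward(self, feat, image):
        # feat: (B, feat_dim); image: (B, H*W, 3) flattened input pixels
        b, hw, _ = image.shape
        attn = self.sample_logits(feat).view(b, self.n_palette, hw).softmax(dim=-1)
        palette = attn @ image                                   # (B, N_p, 3) representative colors
        select = self.select_logits(feat).view(b, -1, self.n_palette).softmax(dim=-1)
        return select @ palette                                  # (B, N_c, 3) per-point colors
```

The two matrix products cost O(N_p · HW) and O(N_c · N_p) respectively, which matches the O(N_p(HW + N_c)) complexity stated above.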
2.2.6.2 Image-based Shape Fitting
Image-based shape fitting has a fundamental impact on various tasks, such as pose estimation, shape alignment and model-based reconstruction. Yet without a direct correlation between the image and the 3D parameters, conventional approaches have to rely on coarse correspondences, e.g. 2D joints [16] or feature points [128], to obtain supervision signals for optimization. In contrast, SoftRas can directly back-propagate pixel-level errors to 3D properties, enabling dense image-to-3D correspondences for high-quality shape fitting. However, a differentiable renderer has to resolve two challenges in order to be readily applicable. (1) Occlusion awareness: the occluded portion of the 3D model should be able to receive gradients in order to handle large pose changes. (2) Far-range impact: the loss at a pixel should have influence on distant mesh vertices, which is critical for dealing with local minima during optimization. While prior differentiable renderers [73, 104] fail to satisfy these two criteria, our approach handles both challenges simultaneously. (1) Our aggregation function fuses the probability maps from all triangles, enabling gradients to flow to all vertices, including the occluded ones. (2) Our soft approximation based on a probability distribution allows the gradient to be propagated to the far end while the size of the receptive field can be well controlled (Figure 2.4). To this end, our approach can faithfully solve the image-based shape fitting problem by minimizing the following energy objective:
$$\operatorname*{arg\,min}_{\rho, \theta, t} \; \big\| \mathcal{R}\big(M(\rho, \theta, t)\big) - I_t \big\|^2, \qquad (2.16)$$
where $\mathcal{R}(\cdot)$ is the rendering function that generates a rendered image $I$ from the mesh $M$ parametrized by its pose $\theta$, translation $t$, and non-rigid deformation parameters $\rho$. The difference between $I$ and the target image $I_t$ provides strong supervision to solve for the unknowns $\{\rho, \theta, t\}$.
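Since SoftRas is differentiable end to end, the objective in Equation 2.16 can be minimized with a standard gradient-based optimizer. The loop below is schematic: `build_mesh` stands in for the parametric model (e.g. SMPL) and `soft_render` for the SoftRas forward pass; both are placeholders, as is the size of the deformation vector.

```python
import torch

def fit_to_image(target, build_mesh, soft_render, n_iters=500, lr=1e-2):
    """Minimize || R(M(rho, theta, t)) - I_t ||^2 over pose, translation and deformation."""
    rho = torch.zeros(10, requires_grad=True)     # non-rigid deformation coefficients (placeholder size)
    theta = torch.zeros(3, requires_grad=True)    # axis-angle rotation
    t = torch.zeros(3, requires_grad=True)        # translation
    optim = torch.optim.Adam([rho, theta, t], lr=lr)
    for _ in range(n_iters):
        optim.zero_grad()
        verts, faces, colors = build_mesh(rho, theta, t)
        rendered = soft_render(verts, faces, colors)       # differentiable forward pass
        loss = ((rendered - target) ** 2).mean()
        loss.backward()                                    # gradients reach occluded triangles too
        optim.step()
    return rho.detach(), theta.detach(), t.detach()
```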
2.2.7 Implementation Details
Datasets and Evaluation Metrics. We use the dataset provided by [73], which contains 13 categories of objects from ShapeNet [23]. Each object is rendered in 24 different views with an image resolution of 64 × 64. For a fair comparison, we employ the same train/validation/test split on the same dataset as in [73, 167]. For quantitative evaluation, we adopt the standard reconstruction metric, 3D intersection over union (IoU), to compare with baseline methods. Specifically, we voxelize our mesh to 32 × 32 × 32 and compare it with the ground truth.
Implementation Details. We use the same structure as [73, 167] for mesh generation. Our network is optimized using Adam [75] with $\alpha = 1 \times 10^{-4}$, $\beta_1 = 0.9$ and $\beta_2 = 0.999$. Training takes 12 hours per category on a single NVIDIA 1080Ti GPU. We set $\lambda = 1$ and $\mu = 1 \times 10^{-3}$ across all experiments unless otherwise specified. We train the network with multi-view images at a batch size of 64 and implement it in PyTorch. The template mesh used in single-view reconstruction has 642 vertices and 1280 triangles. We use the Phong model with a 0.5 ambient rate and a 0.5 diffuse rate, and intrinsic colors for materials with an ambient reflection rate of 1.0 and a diffuse reflection rate of 1.0. Note that we do not use the autograd functions provided by PyTorch: they explicitly store all intermediate variables in CUDA memory, which leads to a prohibitive memory complexity of $O(HWN)$, where $H$ and $W$ are the image height and width, respectively, and $N$ is the number of triangles. Instead, we implement customized CUDA kernel functions that reduce the memory complexity to $O(HW)$ by aggregating all triangles at a single pixel on the fly. The closed-form gradients derived in Equations 2.11 and 2.12 are crucial for implementing these kernels.
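For reference, the stated optimizer and loss-weight settings translate directly into standard PyTorch calls; the tiny `generator` module below is only a placeholder for the actual encoder-decoder of [73, 167].

```python
import torch
import torch.nn as nn

generator = nn.Sequential(nn.Linear(512, 642 * 3))   # placeholder for the mesh generator
optimizer = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.9, 0.999))
LAMBDA, MU = 1.0, 1e-3      # color and geometry loss weights in Equation 2.15
BATCH_SIZE = 64             # multi-view images per training batch
```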
Figure 2.9: 3D mesh reconstruction from a single image. From left to right, we show the input image, ground truth, and the results of our method (SoftRas), Neural Mesh Renderer [73] and Pixel2Mesh [158], all visualized from two different views. Along with the results, we also visualize mesh-to-scan distances measured from the reconstructed mesh to the ground truth.
2.2.8 Results
2.2.8.1 Forward Rendering Results
Our proposed SoftRas can directly render a given mesh using differentiable functions, while previous rasterization-based differentiable renderers [73, 104] have to rely on off-the-shelf renderers for the forward pass. In addition, compared to a standard graphics renderer, SoftRas can achieve different rendering effects in a continuous manner thanks to its probabilistic formulation.

By increasing σ, the key parameter that controls the sharpness of the screen-space probability distribution, we are able to generate more blurry rendering results. Furthermore, with an increased γ, one can assign more weight to the triangles on the far end, naturally achieving more transparency in the rendered image. We demonstrate these rendering effects in Figure 2.11. We will show in Section 2.2.8.5 that the blurring and transparency effects are the key to reshaping the energy landscape in order to avoid local minima.
Figure 2.10: Results of colorized mesh reconstruction. The learned principal colors and their usage histograms are visualized on the right.
2.2.8.2 Qualitative Results
Category Airplane Bench Dresser Car Chair Display Lamp Speaker Rifle Sofa Table Phone Vessel Mean
retrieval [167] 0.5564 0.4875 0.5713 0.6519 0.3512 0.3958 0.2905 0.4600 0.5133 0.5314 0.3097 0.6696 0.4078 0.4766
voxel [167] 0.5556 0.4924 0.6823 0.7123 0.4494 0.5395 0.4223 0.5868 0.5987 0.6221 0.4938 0.7504 0.5507 0.5736
NMR [73] 0.6172 0.4998 0.7143 0.7095 0.4990 0.5831 0.4126 0.6536 0.6322 0.6735 0.4829 0.7777 0.5645 0.6015
Ours (sil.) 0.6419 0.5080 0.7116 0.7697 0.5270 0.6156 0.4628 0.6654 0.6811 0.6878 0.4487 0.7895 0.5953 0.6234
Ours (full) 0.6670 0.5429 0.7382 0.7876 0.5470 0.6298 0.4580 0.6807 0.6702 0.7220 0.5325 0.8127 0.6145 0.6464
Table 2.1: Comparison of mean IoU with other 3D unsupervised reconstruction methods on 13 categories
of ShapeNet datasets.
Figure 2.11: Different rendering effects achieved by our SoftRas renderer. We show how a colorized cube can be rendered in various ways by tuning the parameters of SoftRas. In particular, by increasing γ, SoftRas can render the object with more transparency, while more blurry renderings can be achieved by increasing σ. As γ → 0 and σ → 0, one can achieve a rendering effect closer to standard rendering.

Figure 2.12: Single-view reconstruction results on real images.

Comparisons with the state of the art. We compare the qualitative results of our approach with those of the state-of-the-art supervised [158] and 3D-unsupervised [73] mesh reconstruction approaches in Figure 2.9. Though NMR [73] can recover the rough shape, the mesh surface is discontinuous and suffers from a considerable amount of self-intersections. In contrast, our method can faithfully reconstruct fine details of the object, such as the airplane tail and the rifle barrel, while ensuring smoothness of the surface.
Though trained without 3D supervision, our approach achieves results on par with the supervised method Pixel2Mesh [158]. In some cases, our approach generates even more appealing details than [158], e.g. the bench legs, the airplane engine and the side of the car. The mesh-to-scan distance visualization also shows that our results achieve much higher accuracy than [73] and accuracy comparable to that of [158].
Color Reconstruction. Our method is able to faithfully recover the mesh color based on the input image. Figure 2.10 presents colorized reconstructions from a single image and the learned color palettes. Though the resolution of the input image is rather low (64 × 64), our approach is still able to achieve sharp color recovery and accurately restore fine details, e.g. the subtle color transition on the body of the airplane and the shadow on the phone screen.
Single-view Reconstruction from Real Images. We further evaluate our approach on real images.
As demonstrated in Figure 2.12, though only trained on synthetic data, our model generalizes well to real
images and novel views with faithful reconstructions and fine-scale details, e.g. the tail fins of the fighter
aircraft and thin structures in the rifle and table legs.
2.2.8.3 Quantitative Evaluations

We compare 3D IoU scores with the state-of-the-art approaches in Table 2.1. We test our approach under two settings: one trained with the silhouette loss only (sil.) and the other with both silhouette and shading supervision (full). Our approach significantly outperforms all other 3D-unsupervised methods on all categories. In addition, the mean score of our best setting surpasses the state-of-the-art NMR [73] by more than 4.5 points. As we use the identical mesh generator and the same training settings as [73], this indicates that it is the proposed SoftRas renderer that leads to the superior performance.
distance func. | aggregate func. (α) | aggregate func. (color) | $\mathcal{L}_{lap}$ | mIoU (%)
Barycentric | $\mathcal{A}_O$ | - |  | 60.8
Euclidean | $\mathcal{A}_O$ | - |  | 62.0
Euclidean | $\mathcal{A}_O$ | - | ✓ | 62.4
Euclidean | $\mathcal{A}_N$ | - | ✓ | 63.2
Euclidean | $\mathcal{A}_O$ | $\mathcal{A}_S$ | ✓ | 64.6

Table 2.2: Ablation study of the regularizer and various forms of distance and aggregation functions. $\mathcal{A}_N$ is the aggregation function implemented as a neural network. $\mathcal{A}_S$ and $\mathcal{A}_O$ are defined in Equations 2.8 and 2.13 respectively.
2.2.8.4 Ablation Study

Loss Terms and Alternative Functions. In Table 2.2, we investigate the impact of the Laplacian regularizer and of various forms of the distance function (Section 2.2.2) and the aggregation function. As the RGB color channels and the α channel (silhouette) have different candidate aggregation functions, we list them separately in Table 2.2. First, adding the Laplacian constraint increases our performance by 0.4 points (62.4 vs. 62.0). In contrast, NMR [73] reported a negative effect of its geometry regularizer on quantitative results; that performance drop may be due to the fact that its ad-hoc gradient is not compatible with the regularizer. Color supervision for mesh generation is optional; however, we show that adding a color loss significantly improves performance (64.6 vs. 62.4), as more information is leveraged to reduce the ambiguity of using the silhouette loss only. In addition, we also show that the Euclidean metric usually outperforms the barycentric distance, while the aggregation function based on a neural network, $\mathcal{A}_N$, performs slightly better than the non-parametric counterpart $\mathcal{A}_O$ at the cost of more computation.
2.2.8.5 Image-based Shape Fitting
Rigid Pose Fitting. We compare our approach with NMR on the task of rigid pose fitting. In particular, given a colorized cube and a target image, the pose of the cube needs to be optimized so that its rendered result matches the target image. Despite the simple geometry, the discontinuity of the face colors, the non-linearity of rotation and the large occlusions make this problem particularly difficult to optimize. As shown in Figure 2.13, NMR is stuck in a local minimum while our approach succeeds in obtaining the correct pose. The key is that our method produces smooth and partially transparent renderings which "soften" the loss landscape. Such smoothness can be controlled by σ and γ, which allows us to avoid the local minimum. We also demonstrate the intermediate process of how the proposed SoftRas renderer manages to fit the color cube to the target image in Figure 2.14. By rendering the cube with stronger blurring at the earlier stages, our approach is able to avoid local minima and gradually reduce the rendering loss until an accurate pose is fitted.
Further, we evaluate the rotation estimation accuracy on synthetic data given 100 randomly sampled initializations and targets. We compare methods with and without scheduling schemes and summarize the mean relative angle error in Table 2.3. Without optimization scheduling, our method outperforms the best baseline by 10.60°, demonstrating the effectiveness of the gradient flows provided by our method and the benefit of handling largely occluded triangles. Scheduling is a commonly used technique for solving non-linear optimization problems. For the other methods, we solve with multi-resolution images at 5 levels, while for our method we schedule σ and γ to decay in 5 steps. While scheduling improves all methods, our approach still achieves better accuracy than the best baseline, by 14.99°, indicating its consistent superiority.
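The scheduling used for our method amounts to decaying σ and γ over a fixed number of stages; a minimal version of such a coarse-to-fine schedule is shown below (the initial values, decay factor and stage count are illustrative, not the exact settings used in the experiments).

```python
def sigma_gamma_schedule(sigma0=3e-2, gamma0=1e-1, n_stages=5, decay=0.3):
    """Yield progressively sharper (sigma, gamma) settings, coarse to fine."""
    sigma, gamma = sigma0, gamma0
    for _ in range(n_stages):
        yield sigma, gamma
        sigma *= decay
        gamma *= decay

# Usage: re-run the fitting of Equation 2.16 at each stage, warm-starting from the
# previous stage's estimate, with the renderer configured by the current (sigma, gamma).
for sigma, gamma in sigma_gamma_schedule():
    pass  # call the pose-fitting routine here
```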
Figure 2.13: Visualization of loss function landscapes of NMR and SoftRas for pose optimization given target image (a) and initialization (f). SoftRas achieves global minimum (b) with loss landscape (g). NMR is stuck in local minimum (c) with loss landscape (h). At this local minimum, SoftRas produces the smooth and partially transparent rendering (d)(e), which smoothens the loss landscape (i)(j) with larger σ and γ, and consequently leads to better minimum.
Figure 2.14: Intermediate process of fitting a color cube (second row) to the target pose shown in the input image (first row). The smoothed renderings (third row) used to escape local minima, as well as the colorized fitting errors (fourth row), are also shown.
Method | w/o scheduling | w/ scheduling
random guess | 126.48°* | 126.48°
NMR [73] | 93.40° | 80.94°
Li et al. [94] | 95.02° | 78.56°
SoftRas | 82.80° | 63.57°

Table 2.3: Comparison of cube rotation estimation error, measured in mean relative angular error. (*The expectation of the uniformly sampled SO(3) rotation angle is π/2 + 2/π.)

Figure 2.15: Results for optimizing human pose given a single image.

Non-rigid Shape Fitting. In Figure 2.15, we show that SoftRas can provide stronger supervision for non-rigid shape fitting even in the presence of part occlusions. We optimize a human body parametrized by the SMPL model [103]. As the right hand (textured in red) is completely occluded in the initial view, it is extremely challenging to fit the body pose to the target image. To obtain correct parameters, the optimization should be able to (1) consider the impact of the occluded part on the rendered image and (2)
back-propagate the error signals to the occluded vertices. NMR [73] fails to move the hand to the right position due to its inability to handle occlusions. In comparison, our approach can faithfully complete the task, as our probabilistic formulation and aggregation mechanism take all triangles into account while being able to optimize the z coordinates (depth) of the mesh vertices. In the supplemental materials, we further compare the intermediate fitting processes of NMR [73] and SoftRas.

NMR fails to provide efficient signals to advance the optimization of the hand pose. In contrast, our approach is able to obtain the correct pose within 320 iterations thanks to the occlusion-aware formulation. We wish to emphasize that a proper handling of transparency is the key to passing gradients to occluded vertices. As Rhodin et al. [133] share a similar idea of viewing points as Gaussian blobs with tunable visibility, their method is also able to handle pose estimation in the presence of occlusion. However, it is not trivial to approximate an arbitrary mesh using a large number of small Gaussians and to accurately produce fine-scale mesh deformations with that approach.
Figure 2.16: Results for optimizing facial identity, expression, skin reflectance, lighting and rigid pose given a single 2D image along with 2D landmarks.
Face Reconstruction. To demonstrate that SoftRas is able to provide efficient gradients for multiple properties in an inverse rendering problem, we show further experiments that fit face shape, expression, skin reflectance, lighting and rigid pose. Given a single 2D image and facial landmarks, we optimize the coefficients of a shape, expression and skin reflectance model similar to the one used in MoFA [148], along with lighting modeled by spherical harmonics and rigid pose parameters. The optimization is scheduled in two stages: (1) fitting to the 2D landmarks by solving for shape, expression and rigid pose; (2) fitting to the photometric loss and 2D landmarks by solving for all parameters (with a lowered learning rate for the rigid pose). We show results in comparison to [148] in Figure 2.16: our method achieves comparable or better results than [148]. We show more results on single-view hand fitting and an analysis of the smoothing in the supplemental materials.
2.2.9 Discussion
In this section, we have presented a truly differentiable rendering framework (SoftRas) that is able to directly render a given mesh in a fully differentiable manner. SoftRas considers both extrinsic and intrinsic variables in a unified rendering framework and generates efficient gradients flowing from pixels to mesh vertices and their attributes (color, normal, etc.). We achieve this goal by re-formulating the discrete operations, including rasterization and z-buffering, as differentiable probabilistic processes. Such a formulation enables our renderer to flow gradients to unseen vertices and to optimize the z coordinates of mesh triangles, leading to significant improvements in the tasks of single-view mesh reconstruction and image-based shape fitting. However, our approach, in its current form, cannot handle shadows and topology changes, which are worth investigating in the future. In addition, our approach does not provide gradients to the UV coordinates; however, given a fixed UV mapping, we are able to pass gradients to the colors in the texture map. It would be interesting to explore simultaneous optimization of the UV mapping and texture map in future work.
2.3 Learning to Infer the Implicit Surfaces
The efficient learning of 3D deep generative models is the key to achieving high-quality shape reconstruc-
tion and inference algorithms. While supervised learning with direct 3D supervision has shown promising
results, its modeling capabilities are constrained by the quantity and variations of available 3D datasets. In
contrast, far more 2D photographs are being taken and shared over the Internet than can ever be viewed. To exploit this abundance of image data, various differentiable rendering techniques [73, 101, 68, 167] were recently introduced to learn 3D generative models directly from massive amounts of 2D pictures. While several types of shape representations have been adopted, most techniques are based on explicit surfaces, which either suffer from poor visual quality due to limited resolution (e.g., point clouds, voxels) or fail to handle arbitrary topologies (e.g., polygonal meshes).
Implicit surfaces, on the other hand, describe a 3D shape using an iso-surface of an implicit field and can
therefore handle arbitrary topologies, as well as support multi-resolution control to ensure high-fidelity
modeling. As demonstrated by several recent 3D supervised learning methods [27, 127, 118, 116], implicit
representations are particularly advantageous over explicit ones, and naturally encode a 3D surface at
infinite resolution with minimal memory footprint.
Despite these benefits, it remains challenging to achieve unsupervised learning of implicit surfaces
only from 2D images. First, it is non-trivial to relate the changes of the implicit surface with those of the
observed images. An explicit surface, on the other hand, can be easily projected and shaded onto an image
plane (Figure 2.17 right). By inverting such process, one can obtain gradient flows that supervise the gen-
eration of the 3D shape. However, it is infeasible to directly project an implicit field onto a 2D domain via
transformation. Instead, rendering an implicit surface relies on ray sampling techniques to densely evalu-
ate the field, which may lead to very high computational cost, especially for objects with thin structures.
Second, it is challenging to ensure precise geometric properties such as local smoothness of an implicit
surface. This is critical to generating plausible shapes in unconstrained regions, especially when only image-based supervision is available.

Figure 2.17: While explicit shape representations may suffer from poor visual quality due to limited resolutions or fail to handle arbitrary topologies (a), implicit surfaces handle arbitrary topologies with high resolutions in a memory-efficient manner (b). However, in contrast to the explicit representations, it is not feasible to directly project an implicit field onto a 2D domain via perspective transformation. Thus, we introduce a field probing approach based on efficient ray sampling that enables unsupervised learning of implicit surfaces from image-based supervision.

Unlike mesh-based surface representations, it is not straightforward
to obtain geometric properties, e.g. normal, curvature, etc., for an implicit surface, as the shape is implicitly
encoded as the level set of a scalar field.
We address the above challenges and propose the first framework for learning implicit surfaces with
only 2D supervision. In contrast to 3D supervised learning, where a signed distance field can be computed
from the 3D training data, 2D images can only provide supervision on the binary occupancy of the field.
Hence, we formulate the unsupervised learning of implicit fields as a classification problem such that the
occupancy probability at an arbitrary 3D point can be predicted. The key to our approach is a novel field
probing approach based on efficient ray sampling that achieves image-to-field supervision. Unlike con-
ventional sampling methods [141], which excessively cast rays passing through all image pixels and apply
binary search along the ray to detect the surface boundary, we propose a much more efficient approach
by leveraging sparse sets of 3D anchor points and rays. In particular, the anchor points probe the field by
evaluating the occupancy probability at its location, while the rays aggregate the information from the
anchor points that it intersects with. We assign a spherical supporting region to each anchor point to en-
able the ray-point intersection. To further improve the boundary modeling accuracy, we apply importance
sampling in both 2D and 3D space to allocate more rays and anchor points around the image and surface
boundaries respectively.
While geometric regularization for implicit fields is largely unexplored, we propose a new method for
constraining geometric properties of an implicit surface using the approximated derivatives of the field
with a finite difference method. Since we only care about the decision boundary of the field, regularizing
the entire 3D space would introduce scarcity of constraints in the region of interest. Hence, we further
propose an importance weighting technique to draw more attention to the surface region. We validate
our approach on the task of single-view surface reconstruction. Experimental results demonstrate the
superiority of our method over state-of-the-art unsupervised 3D deep learning techniques that are based on alternative shape representations, in terms of both quantitative and qualitative measures. Comprehensive ablation studies also verify the efficacy of the proposed probing-based sampling technique and the implicit geometric regularization.
Our contributions can be summarized as follows: (1) the first framework that enables learning of im-
plicit surfaces for shape modeling without 3D supervision; (2) a novel field probing approach based on
anchor points and probing rays that efficiently correlates the implicit field and the observed images; (3) an
efficient point and ray sampling method for implicit surface generation from image-based supervision; (4)
a general formulation of geometric regularization that can constrain the geometric properties of a contin-
uous implicit surface.
2.3.1 Overview
Our goal is to learn a generative model for implicit surfaces that infers 3D shapes solely from 2D images.
Unlike direct supervision with 3D ground truth, which supports the computation of a continuous signed
distance field with respect to the surface, 2D observations can only provide guidance on the occupancy
of the implicit field. Hence, we formulate the unsupervised learning of implicit surfaces as a classification
Figure 2.18: Ray-based field probing technique. (a) A sparse set of 3D anchor points is distributed to sense the field by sampling the occupancy value at each location. (b) Each anchor is assigned a spherical supporting region to enable ray-point intersection. The anchor points that have a higher probability of lying inside the object surface are marked in deeper blue. (c) Rays are cast through the sampling points $\{x_i\}$ on the 2D silhouette under the camera views $\{\pi_k\}$ (blue indicates object interior and white otherwise). (d) By aggregating the information from the intersected anchor points via max pooling, one obtains the prediction for each ray. (e) The silhouette loss is obtained by comparing the prediction with the ground-truth label in the image space.
problem. Given $N_K$ images $\{I_k\}_{k=1}^{N_K}$ of an object $O$ captured from different views $\{\pi_k\}_{k=1}^{N_K}$ as supervision signals, we train a neural network that takes a single image $I_k$ and produces a continuous occupancy probability field whose iso-surface at 0.5 depicts the shape of $O$. Our pipeline is based on a novel ray-based field probing technique as illustrated in Figure 2.18. Instead of excessively casting rays to detect the surface boundary, we probe the field using a sparse set of 3D anchor points and rays. The anchor points sense the field by sampling the occupancy probability at their locations, and each is assigned a spherical supporting region to ease the computation of ray-point intersection. We then correlate the field and the observed images by casting probing rays, which originate from the viewpoint and pass through the sampling points of the images. The ray that passes through the image pixel $x_i$, given the camera parameters $\pi_k$, obtains its prediction $\psi(\pi_k, x_i)$ by aggregating the occupancy values from the anchor points whose supporting regions intersect with it. By comparing $\psi(\pi_k, x_i)$ with the ground-truth label of $x_i$, we obtain error signals that supervise the generation of the implicit field. Note that when detecting ray-point intersections, we apply a boundary-aware assignment to remove ambiguity, which is detailed in Section 2.3.2.
Network Architecture. We demonstrate our network architecture in Figure 2.19. Following the recent advances in unsupervised shape learning [167, 73], we use 2D silhouettes of the objects as the supervision for network training. Our framework consists of two components: (1) an image encoder $g$ that maps the input image $I$ to a latent feature $z$; and (2) an implicit decoder $f$ that consumes $z$ and a 3D query point $p_j$ and infers its occupancy probability $\phi(p_j)$. Note that the implicit decoder generates a continuous prediction ranging from 0 to 1, where the estimated surface can be extracted at the decision boundary of 0.5 (Figure 2.19 right).
2.3.2 Sampling-Based 2D Supervision
To compute the prediction loss of the implicit decoder, a key step is to properly aggregate the information collected throughout the field probing process for each ray. Given a continuous occupancy field and a set of anchor points along a ray $r$, the probability that $r$ hits the object interior can be considered as an aggregation function:
$$\psi(\pi_k, x_i) = \mathcal{G}\left(\{\phi(c + r(\pi_k, x_i)\cdot t_j)\}_{j=1}^{N_p}\right), \qquad (2.17)$$
where $r(\pi_k, x_i)$ denotes the direction of the ray that intersects the image pixel $x_i$ under the viewing direction $\pi_k$; $c$ is the camera location; $N_p$ is the number of 3D anchor points; $t_j$ indicates the sampled location along the ray for each anchor point; $\phi(\cdot)$ is the occupancy function that returns the occupancy probability of the input point; and $\psi$ denotes the predicted occupancy for the ray $r(\pi_k, x_i)$. Since whether the ray $r$ hits the object interior is determined by the maximum occupancy value detected along the ray, in this work we adopt $\mathcal{G}$ as a max-pooling operation due to its computational efficiency and the effectiveness demonstrated in [167]. By considering the $\ell_2$ differences between the predictions and the ground-truth silhouette, we obtain the silhouette loss $\mathcal{L}_{sil}$:
$$\mathcal{L}_{sil} = \frac{1}{N_r}\sum_{i=1}^{N_r}\sum_{k=1}^{N_K} \left\|\psi(\pi_k, x_i) - S_k(x_i)\right\|^2, \qquad (2.18)$$
where $S_k(x_i)$ is the bilinearly interpolated silhouette at $x_i$ under the $k$-th viewpoint; $N_r$ and $N_K$ denote the number of 2D sampling points and camera views, respectively.
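To make the aggregation and loss concrete, the following is a minimal PyTorch-style sketch of Equations 2.17 and 2.18; the tensor shapes and the occupancy callable phi are illustrative assumptions rather than the actual implementation.

import torch

def silhouette_loss(phi, rays_o, rays_d, t_samples, labels):
    # phi:       callable mapping (..., 3) points to occupancy probabilities in [0, 1]
    # rays_o:    (R, 3) ray origins (camera center c)
    # rays_d:    (R, 3) unit ray directions r(pi_k, x_i) through the sampled pixels
    # t_samples: (R, P) depths t_j of the intersected anchors along each ray
    # labels:    (R,)   bilinearly interpolated silhouette values S_k(x_i)
    pts = rays_o[:, None, :] + rays_d[:, None, :] * t_samples[..., None]  # (R, P, 3)
    occ = phi(pts)                                                        # (R, P)
    psi = occ.max(dim=-1).values           # max-pooling aggregation G (Eq. 2.17)
    return ((psi - labels) ** 2).mean()    # squared l2 silhouette loss (Eq. 2.18)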
Boundary-Aware Assignment. To facilitate the computation of ray-point intersections, we model each
anchor point as a sphere with a non-zero radius. While such a strategy works well in most cases, erroneous
labeling may occur in the vicinity of the decision boundary. For instance, a ray that has no intersection
with the target object may still have a chance to hit the supporting region of an anchor point whose center
lies inside the object. Since we use max-pooling as the aggregating function, the ray may be wrongly
labeled as an intersecting ray. To resolve this issue, we use 2D silhouettes as additional prior by filtering
out the anchor points on the wrong side. In particular, if a ray is passing through a pixel belonging to the
inside/outside of the silhouette, the anchor points lying outside/inside of the 3D object are ignored when
detecting intersections (Figure 2.18 (c)). This boundary-aware assignment can significantly improve the
quality and reconstructed details, which is demonstrated in the ablation study in Section 2.3.5.
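As a rough sketch with hypothetical tensor names, the boundary-aware filter can be folded into the max-pooling aggregation by masking anchors that lie on the wrong side of the silhouette:

import torch

def boundary_aware_max(occ, inside_hull, labels):
    # occ:         (R, P) occupancy probabilities of the anchors intersected by each ray
    # inside_hull: (R, P) bool, True if the anchor center lies inside the visual hull
    # labels:      (R,)   silhouette label of each ray (1 = inside, 0 = outside)
    keep = torch.where(labels[:, None] > 0.5, inside_hull, ~inside_hull)
    # Anchors on the wrong side of the silhouette cannot win the max-pooling.
    return occ.masked_fill(~keep, 0.0).max(dim=-1).values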
Figure 2.19: Network architecture for unsupervised learning of implicit surfaces. The input image $I$ is first mapped to a latent feature $z$ by an image encoder $g$, while the implicit decoder $f$ consumes both the latent code $z$ and a query point $p_j$ and predicts its occupancy probability $\phi(p_j)$. With a trained network, one can generate an implicit field whose iso-surface at 0.5 depicts the inferred geometry.

Importance Sampling. A naive solution for distributing anchor points and probing rays is to apply random sampling. However, as the occupancy of the target object may be highly sparse over the 3D space, random sampling can be extremely inefficient. We propose an importance sampling approach based on shape cues obtained from the 2D images for efficient sampling of rays and anchor points. The main idea is to draw more samples around the surface boundary, which is equivalent to the 2D contour of the object in image space. For ray sampling, we first obtain the contour map $W_r(x)$ by applying a Laplacian operator over the input silhouette. We then generate a Gaussian mixture distribution by positioning an individual kernel at each pixel of $W_r(x)$ and setting the kernel height to the pixel intensity at that location. The rays are then generated by sampling from the resulting distribution. Similarly, to generate the 3D contour map $W_p(p)$, we apply mean filtering to the 3D visual hulls computed from the multi-view silhouettes. The anchor points are then sampled from a 3D Gaussian mixture distribution model created in a similar fashion to the 2D case, which yields the probabilistic density functions of the sampling as:
$$P_r(x) = \int_{x'} \kappa(x', x; \sigma)\, W_r(x')\, dx', \qquad P_p(p) = \int_{p'} \kappa(p', p; \sigma)\, W_p(p')\, dp', \qquad (2.19)$$
where $x'$ is a pixel in the image domain and $p'$ is a point in 3D space, $\kappa(\cdot, \cdot; \sigma)$ denotes the Gaussian kernel with bandwidth $\sigma$, and $P_r(x)$ and $P_p(p)$ denote the probabilistic density functions at pixel $x$ and point $p$, respectively.
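A possible NumPy sketch of the 2D ray sampling is shown below (the 3D anchor sampling is analogous); the function name, the use of scipy.ndimage.laplace for the contour map, and the pixel-space bandwidth are assumptions for illustration.

import numpy as np
from scipy.ndimage import laplace

def sample_ray_pixels(silhouette, n_rays, sigma):
    # Contour map W_r: magnitude of the Laplacian of the binary silhouette.
    w = np.abs(laplace(silhouette.astype(np.float32)))
    ys, xs = np.nonzero(w)
    probs = w[ys, xs] / w[ys, xs].sum()
    # Pick Gaussian-mixture components (one kernel per contour pixel, height = intensity).
    idx = np.random.choice(len(xs), size=n_rays, p=probs)
    centers = np.stack([xs[idx], ys[idx]], axis=1).astype(np.float32)
    # Sample from the selected Gaussian kernels with bandwidth sigma (here in pixels).
    return centers + np.random.normal(scale=sigma, size=centers.shape)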
2.3.3 Geometric Regularization on Implicit Surfaces
Regularizing geometric surface properties is critical to achieving desirable shapes, especially in uncon-
strained regions. While such constraints can be easily realized with explicit shape representations, a con-
trolled regularization of an implicit surface is not straightforward, since the surface is implicitly encoded
Figure 2.20: 2D illustration of importance weighted geometric regularization.
as the level set of a scalar field. Here, we introduce a general formulation of geometric regularization for
implicit surfaces using a new importance weighting scheme.
Since computing geometric properties of a surface, e.g. normal, curvature, etc., requires access to the
derivatives of the field, we propose a finite difference method-based approach. In particular, we compute
the $n$-th order derivative of the implicit field at a point $p_j$ with a central difference approximation:
$$\frac{\delta^n \phi}{\delta p_j^n} = \frac{1}{\Delta d^n} \sum_{l=0}^{n} (-1)^l \binom{n}{l}\, \phi\!\left(p_j + \left(\frac{n}{2} - l\right)\Delta d\right), \qquad (2.20)$$
where $\Delta d$ is the spacing distance between $p_j$ and its adjacent sample points (Figure 2.20). When $n$ equals 1, the surface normal $n(p_j)$ at $p_j$ can be obtained via $n(p_j) = \frac{\delta\phi}{\delta p_j} \big/ \big\|\frac{\delta\phi}{\delta p_j}\big\|$.
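A short sketch of the first-order case used for normals is given below; it takes central differences with steps of plus/minus delta_d per axis (rather than the half-spacing of Equation 2.20) purely for readability, and assumes phi maps batches of 3D points to occupancy probabilities.

import torch
import torch.nn.functional as F

def implicit_normals(phi, p, delta_d):
    # p: (N, 3) query points; delta_d: finite-difference spacing.
    offsets = delta_d * torch.eye(3, device=p.device)     # (3, 3) axis-aligned steps
    plus = phi(p[:, None, :] + offsets)                   # (N, 3) phi(p + dd * e_a)
    minus = phi(p[:, None, :] - offsets)                  # (N, 3) phi(p - dd * e_a)
    grad = (plus - minus) / (2.0 * delta_d)               # central differences per axis
    return F.normalize(grad, dim=-1)                      # n(p) = grad / ||grad||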
Importance weighting. As we focus on the geometric properties on the surface, applying the regularizer over the entire 3D space would lead to an overly loose constraint in the regions of interest. Hence, we propose an importance weighting approach that assigns more attention to the sampling points closer to the surface. Here, we leverage the prior learned by our network: surface points should have an occupancy probability close to the decision boundary, which is 0.5 in our implementation. Therefore, we propose a weighting function $W(x) = \mathbb{1}(|x - 0.5| < \epsilon)$ and formulate the loss of geometric regularization as follows:
$$\mathcal{L}_{geo} = \frac{1}{N_p} \sum_{j=1}^{N_p} \frac{W(\phi(s_j)) \sum_{l=1}^{6} W(\phi(q_j^l))\, \big\| n(s_j) - n(q_j^l) \big\|_p^p}{\sum_{l=1}^{6} W(\phi(q_j^l))}. \qquad (2.21)$$
In particular, as shown in Figure 2.20, for each anchor point $s_j$ we uniformly sample two neighboring samples $\{q_j^l\}$ with spacing $\Delta d$ along each of the $x$, $y$, and $z$ axes (six neighbors in total). We feed the weighting function $W(\cdot)$ with the predicted occupancy probability $\phi(s_j)$, such that anchor points closer to the surface (with $\phi(s_j)$ closer to 0.5) receive higher weights and vice versa. By minimizing $\mathcal{L}_{geo}$, we encourage the normals at the 3D anchors to stay close to those of their adjacent points. Notice that we use the $\ell_p$ norm rather than the commonly used $\ell_2$ norm for generality. We show that various geometric properties can be achieved by taking $p$ as a hyper-parameter (see Section 2.3.5.2).
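A hedged PyTorch sketch of Equation 2.21 follows, reusing the implicit_normals helper sketched above; the function name, the indicator threshold eps, and the small constant guarding the denominator are assumptions.

import torch

def geometric_regularization(phi, s, delta_d, eps=0.1, p_norm=0.8):
    # s: (N, 3) anchor points s_j; their six axis-aligned neighbors q_j^l use spacing delta_d.
    steps = delta_d * torch.cat([torch.eye(3), -torch.eye(3)], dim=0).to(s.device)  # (6, 3)
    q = s[:, None, :] + steps                                  # (N, 6, 3)
    w_s = ((phi(s) - 0.5).abs() < eps).float()                 # W(phi(s_j)), shape (N,)
    w_q = ((phi(q) - 0.5).abs() < eps).float()                 # W(phi(q_j^l)), shape (N, 6)
    n_s = implicit_normals(phi, s, delta_d)                    # (N, 3)
    n_q = implicit_normals(phi, q.reshape(-1, 3), delta_d).reshape(q.shape)
    diff = ((n_s[:, None, :] - n_q).abs() ** p_norm).sum(dim=-1)  # ||n(s_j) - n(q_j^l)||_p^p
    per_anchor = w_s * (w_q * diff).sum(dim=-1) / (w_q.sum(dim=-1) + 1e-8)
    return per_anchor.mean()                                   # Eq. (2.21)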
The total loss for network training is a weighted sum of the silhouette loss $\mathcal{L}_{sil}$ and the geometric regularization loss $\mathcal{L}_{geo}$ with a trade-off factor $\lambda$, as shown below:
$$\mathcal{L} = \mathcal{L}_{sil} + \lambda\, \mathcal{L}_{geo}. \qquad (2.22)$$
2.3.4 Implementation Details
Datasets. We evaluate our method on the ShapeNet [23] dataset. We focus on 6 commonly used categories with complex topologies: plane, bench, table, car, chair, and boat. We use the same train/validate/test split as in [167, 73, 101] and the rendered images (64$\times$64 resolution) provided by [73], which consist of 24 views for each object.
Implementation details. We adopt a pre-trained ResNet18 as the encoder, which outputs a latent code of 128 dimensions. The decoder is realized using 6 fully-connected layers (with output channels 2048, 1024, 512, 256, 128, and 1, respectively) followed by a sigmoid activation function. We sample $N_p = 16{,}000$ anchor points in 3D space and $N_r = 4{,}096$ rays for each view. The sampling bandwidth $\sigma$ is set to $7\times 10^{-3}$. The radius $\tau$ of the supporting region is set to $3\times 10^{-2}$. For the regularizer, we set $\Delta d = 3\times 10^{-2}$, $\lambda = 1\times 10^{-2}$, and the norm $p = 0.8$. We train the network using the Adam optimizer with a learning rate of $1\times 10^{-4}$ and a batch size of 8 on a single 1080Ti GPU.
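For concreteness, a minimal sketch of this configuration is given below; the use of torchvision's ResNet-18 and the exact way the latent code is broadcast to the query points are assumptions, while the layer widths follow the text.

import torch
import torch.nn as nn
import torchvision

class OccupancyNet(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        backbone = torchvision.models.resnet18()              # the thesis uses a pre-trained ResNet18
        backbone.fc = nn.Linear(backbone.fc.in_features, latent_dim)  # 128-d latent code z
        self.encoder = backbone
        dims = [latent_dim + 3, 2048, 1024, 512, 256, 128, 1]
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.ReLU(inplace=True)]
        layers[-1] = nn.Sigmoid()                              # occupancy probability in [0, 1]
        self.decoder = nn.Sequential(*layers)

    def forward(self, image, points):
        # image: (B, 3, H, W); points: (B, M, 3) query points p_j.
        z = self.encoder(image)                                # (B, 128)
        z = z[:, None, :].expand(-1, points.shape[1], -1)      # broadcast z to every query
        return self.decoder(torch.cat([z, points], dim=-1)).squeeze(-1)  # (B, M) phi(p_j)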
2.3.5 Results
2.3.5.1 Comparison
We validate the effectiveness of our framework in the task of unsupervised shape digitization from a sin-
gle image. Figure 2.21 and Table 2.4 compare the performance of our approach with the state-of-the-art
unsupervised methods that are based on explicit surface representations, including voxels [167], point
clouds [68], and triangle meshes [73, 101]. We provide both qualitative and quantitative measures. Note
that all the methods are trained with the same training data for fair comparison. While the explicit surface
representations either suffer from visually unpleasant reconstruction due to limited resolution and expres-
siveness (voxels, point clouds), or fail to capture complex topology from a single template (meshes), our
approach produces visually appealing reconstructions for complex shapes with arbitrary topologies. Com-
pared to mesh-based representations, we are able to achieve higher resolution output, which is reflected by
the even sharper local geometric details, e.g. the engine of plane (first row) and the wheels of the vehicle
(fourth row). The performance of our method is also demonstrated in the quantitative comparisons, where
we achieve state-of-the-art reconstruction accuracy using 3D IoU with large margins.
In Figure 2.22, we further illustrate the importance of supporting arbitrary topologies, compared to
existing mesh-based reconstruction techniques [101]. Since real-world objects can exhibit a wide range
of varying topologies even for a single object category (e.g., chairs), mesh-based approaches often lead
to deteriorated results. In contrast, our approach is able to faithfully infer complex shapes and arbitrary
topologies from very limited visual cues, e.g. the chair and the table on the third row, thanks to the flexi-
bility of the implicit representation and the strong shape prior enabled through the geometric regularizer.
[Figure 2.21 columns: input images; PTN (voxels); DPC (point clouds); NMR (mesh); ours (implicit occupancy field); SoftRas (mesh); ground truths.]
Figure 2.21: Qualitative results of single-view reconstruction using different surface representations. For
point cloud representation, we also visualize the meshes reconstructed from the output point cloud.
Category Airplane Bench Table Car Chair Boat Mean
PTN [167] 0.5564 0.4875 0.4938 0.7123 0.4494 0.5507 0.5417
NMR [73] 0.6172 0.4998 0.4829 0.7095 0.4990 0.5953 0.5673
SoftRas [101] 0.6419 0.5080 0.4487 0.7697 0.5270 0.6145 0.5850
Ours 0.6510 0.5360 0.5150 0.7820 0.5480 0.6080 0.6067
Table 2.4: Comparison of 3D IoU with other unsupervised reconstruction methods.
2.3.5.2 Ablation Analysis
We provide a comprehensive ablation study to assess the effectiveness of each algorithmic component. For
all the experiments, we use the same data and parameters as before unless otherwise noted.
Geometric Regularization. In Table 2.5 and Figure 2.23, we demonstrate that our proposed geometric
regularization enables a flexible control over various geometric properties by varying the value of norm p.
To validate the effectiveness of geometric regularization, we train the same network using different config-
urations: 1) without using any geometry regularizers; 2) applying our proposed geometric regularization
withp norm equals to 0.8, 1.0, 2.0, respectively. As shown in the results, the lack of geometry regularizer
Figure 2.22: Qualitative comparisons with the mesh-based approach [101] in terms of the capability of
capturing varying topologies.
would lead to ambiguity in the reconstructed geometry (e.g., first row in Figure 2.23), as an unexpected shape could appear identical to the ground truth under an accordingly optimized texture map, which makes the generation of flat surfaces rather difficult. The proposed regularizer can effectively enhance the regularity of the reconstructed objects, especially for man-made objects, while providing flexible control. In particular, when $p = 2.0$, the surface normal difference is minimized in a least-squares manner, leading to a smooth reconstruction. When $p \to 0$, sparsity is enforced in the surface normal consistency, which encourages the reconstructed surface to be piecewise linear and is often desirable for man-made objects. We also perform an ablation study on the effect of the sampling step $\Delta d$ for the regularizer, as shown in Table 2.6 and Figure 2.24. We observe that a larger $\Delta d$ leads to flatter surfaces at the cost of fewer fine details.
Importance Sampling. To fully explore the effect of importance sampling, we compare two different configurations of the sampling scheme: 1) "-Imp. sampling": drawing both 3D anchor points and rays from a normal distribution with mean and standard deviation set to 0 and 0.4, respectively; and 2) "Full model": using the importance sampling approach for both anchor points and rays with the bandwidth set to 0.007. We show sampled rays and results in Table 2.7 and Figure 2.25.
Configuration    3D IoU
norm p = 2.0     0.502
norm p = 1.0     0.524
norm p = 0.8     0.548
-Regularizer     0.503
Table 2.5: Quantitative evaluations of our approach on the chair category using different regularizer configurations.
Figure 2.23: Qualitative evaluations of geometric regularization using different configurations (input images; no regularizer; norm p = 2.0; p = 1.0; p = 0.8; ground truths).
In terms of visual quality, the importance sampling based approach achieves much more detailed reconstructions than its counterpart. The quantitative measurements lead to a consistent observation: our proposed importance sampling outperforms normal sampling by a large margin.
Configuration                      3D IoU
$\Delta d = 1\times 10^{-2}$       0.482
$\Delta d = 3\times 10^{-2}$       0.515
$\Delta d = 1\times 10^{-1}$       0.507
Table 2.6: Quantitative evaluations on the table category with different $\Delta d$.
Figure 2.24: Qualitative results of reconstruction using our approach with different regularizer sampling steps $\Delta d$ ($3\times 10^{-2}$, $1\times 10^{-2}$, $1\times 10^{-1}$), shown alongside input images and ground truths.
Boundary-Aware Assignment. We also compare the performance with and without boundary-
aware assignment in Table 2.7 and Figure 2.25. When boundary-aware assignment is disabled, the sampling
rays around the decision boundary may be assigned with incorrect labels. As a result, the reconstructions
lack sufficient accuracy, especially around the thin surface regions, and thus may not be able to capture
holes and thin structures as demonstrated in the rightmost examples in Figure 2.25.
Configuration       3D IoU
Full model          0.548
-Imp. sampling      0.482
-Boundary aware     0.524
Table 2.7: Quantitative measurements for the ablation analysis of importance sampling and boundary-aware assignment on the chair category, as shown in Figure 2.25.
Figure 2.25: Qualitative analysis of importance sampling and boundary-aware assignment for single-view reconstruction (input images, sampled rays, full model, -Imp. sampling, -Boundary-aware, and ground truths).
2.3.6 Discussion
We introduced a learning framework for implicit surface modeling of general objects without 3D supervision. An occupancy field is learned from a set of 2D silhouettes using an efficient field probing algorithm, and the desired local smoothness of the implicit field is achieved using a novel geometric regularizer based on finite differences. Our experiments show that high-fidelity implicit surface modeling is possible from 2D images alone, even for unconstrained regions. Our approach produces more visually pleasing and higher-resolution results than both voxels and point clouds. In addition, unlike mesh representations, our approach can handle arbitrary topologies spanning various object categories. We believe that the use of implicit surfaces and our proposed algorithms opens up new frontiers for learning limitless shape variations from in-the-wild images. Future work includes unsupervised learning of textured geometries, which has recently been addressed with an explicit mesh representation [101], and eliminating the need for silhouette segmentation to further increase the scalability of image-based learning. It would also be interesting to investigate the use of anisotropic kernels for shape modeling and hierarchical implicit representations with advanced data structures, e.g., octrees, to further improve the modeling efficiency. Furthermore, we would like to explore learning from texture cues in addition to binary masks.
Chapter 3
Learning to Optimize Vanishing Points for Scene Structure Analysis
Vanishing points are defined as the intersection points of 3D parallel lines when projected onto a 2D image.
By providing geometry-based cues to infer the 3D structures, they underpin a variety of applications,
such as camera calibration [80, 30], facade detection [100], 3D reconstruction [59], 3D scene structure
analysis [62, 159], 3D lifting of lines [131], SLAM [176], and autonomous driving [84].
Efforts have been made on vanishing point detection in the past decades. Traditionally, vanishing
points are detected in two stages. In the first stage, a line detection algorithm, such as the probabilistic Hough transform [76] or LSD [157], is used to extract a set of line segments. In the second stage, a line
clustering algorithm [109] or a voting procedure [7] is used to estimate the final positions of vanishing
points from detected line segments. The main weakness of this pipeline is that the extracted lines might
be noisy, leading to spurious results after clustering or voting when there are too many outliers. To make
algorithms more robust, priors of the underlying scenes can be used, such as Manhattan worlds [10] or At-
lanta worlds [137], which are common in man-made environments. Nevertheless, additional assumptions
complicate the problem setting, and the algorithms might not work well when these hard assumptions do
not hold.
Figure 3.1: We propose a novel vanishing point detection network, VaPiD, which runs in real time with high accuracy. Speed-accuracy curves are compared with state-of-the-art methods on the SU3 dataset [180]. Dotted horizontal lines labeled with $n\epsilon$ represent the $n$-th smallest angle errors that can be represented numerically by 32-bit floating point numbers when computing the angle between two normalized direction vectors, i.e., $\arccos\langle d_1, d_2\rangle$.

Recent CNN-based deep learning approaches [24, 18, 175, 171, 77, 179] have demonstrated the robustness of the data-driven approach. In particular, NeurVPS [179] provides a framework to detect vanishing points in an end-to-end fashion without relying on external heuristic line detectors. It proposes conic convolution to exploit the geometric properties of vanishing points by enforcing feature extraction and aggregation along the structural lines of vanishing point candidates. This approach achieves satisfactory performance, but it is inefficient as it requires evaluating all possible vanishing points in an image (1 FPS is reported in [179]). In contrast, most vanishing point applications must run online in order to be useful in a practical setting.
To this end, we introduce VaPiD, a novel end-to-end rapid vanishing point detector that significantly boosts model efficiency using learned optimizers. VaPiD consists of two components: (1) a vanishing point proposal network (VPPN) that takes an image and returns a set of vanishing point proposals; it harnesses a computation sharing scheme to efficiently process dense vanishing point anchors; and (2) a neural vanishing point optimizer (NVPO) that takes each proposal as input and optimizes its position with a neural network in an iterative fashion. In each iteration, it refines the vanishing points by regressing the residuals and updating the estimates. Our approach can be considered as learning to optimize. Compared to the previous coarse-to-fine method in [179], our optimizing scheme avoids enumerating all possible vanishing point candidate positions, which largely improves the inference speed.

Figure 3.2: The architecture of our proposed VaPiD. It incorporates three major components: (1) a backbone network for feature extraction from the input image; (2) a vanishing point proposal network (VPPN) that generates reliable vanishing point proposals with efficient conic convolutions; (3) a weight-sharing neural vanishing point optimizer (NVPO) that refines each vanishing point proposal to achieve high accuracy. Note that our network is trained in an end-to-end fashion.
We comprehensively evaluate our method on four public datasets, including one synthetic dataset and three real-world datasets. VaPiD significantly outperforms previous works in terms of efficiency, while achieving competitive, if not better, accuracy compared with the baselines. Remarkably, on the synthetic dataset, one minus the cosine of the median angle error (0.088°) is close to the machine epsilon of 32-bit floating-point numbers*, which indicates that VaPiD pushes the detection accuracy to the limit of numerical precision. With fewer refinement iterations, VaPiD runs at 26 frames per second while maintaining a median angle error of 0.145° for 512$\times$512 images with 3 vanishing points.
* The cosine error is the dot product of the directions represented by the predicted and ground-truth vanishing points. It limits how accurately a vanishing point can be represented with floating-point numbers.
3.1 Related Work
Vanishing Point Detection. Early works represent vanishing points with unit vectors on a sphere (the Gaussian sphere), which reveals the link between the 2D vanishing point and the 3D line direction [7, 130]. Modern line-based vanishing point detection approaches first detect line segments, which are
then used to cluster vanishing points [165, 109, 108, 78]. Among them, LSD [157] with J-linkage [146,
39] is probably one of the most widely used algorithms. These methods work well on images with strong
line signals, but are not robust to noise and outliers [140]. Therefore, structural constraints such as or-
thogonality properties are often used to increase the robustness. For example, the “Manhattan world”
assumes three mutually orthogonal vanishing directions [32, 119, 90]. Similarly, under the “Atlanta world”
assumption [137], vanishing points are detected in a common horizon line [6, 88, 165].
Recently, we have seen the success of CNN-based research on vanishing point detection. Chang et
al. [24] detects vanishing points in the image frame by classifying over image patches. Zhai et al. [171]
learns the prior of the horizon and its associated vanishing points in human-made environments. Kluger et al. [77] projects lines in the image to the Gaussian sphere and regresses directly on the spherical image.
Zhou et al. [179] introduces the conic convolutions that can learn the vanishing point-related geometry
features. Our approach also builds upon conic convolutions. We propose a computation sharing scheme
that enables conic convolutions to process large-scale vanishing points.
Learning to Optimize. Using neural network layers to mimic the steps of optimization algorithms has been shown to be effective in many computer vision tasks. Gregor et al. [55] first explored training a neural network as the approximation of an optimizer for sparse coding. This idea has been further applied to image super-resolution [35], novel view synthesis [42], and optical flow [147]. We follow this line of work and train a neural network that iteratively estimates the residuals and updates the vanishing points. In contrast to previous works, we target the problem of vanishing point detection, and our network optimizes the vanishing points in the semi-spherical space.
3.2 VaPiD: A Rapid Vanishing Point Detector
3.2.1 Background
Geometry Representation of Vanishing Points. We adopt the Gaussian sphere representation of vanishing points [7]. The position of a vanishing point $v = [v_x, v_y]^T \in \mathbb{R}^2$ in an image encodes a set of parallel 3D lines with direction $d = [v_x - c_x, v_y - c_y, f]^T \in \mathbb{R}^3$, where $[c_x, c_y]^T \in \mathbb{R}^2$ is the optical center and $f$ is the focal length of the camera. Representing vanishing points with $d$ instead of $v$ avoids the degenerate cases where the projected lines are parallel in 2D. In addition, we are now able to use the angle between two 3D unit vectors as the distance between two vanishing points. In this paper, we use both the 3D direction $d \in \mathbb{R}^3$ and its 2D counterpart $v \in \mathbb{R}^2$ to represent a vanishing point.
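A small NumPy sketch of this representation and the induced angular distance; the intrinsics (optical center c and focal length f) are assumed to be known.

import numpy as np

def vp_to_direction(v, c, f):
    # 2D vanishing point v = [vx, vy] -> unit 3D line direction d.
    d = np.array([v[0] - c[0], v[1] - c[1], f], dtype=np.float64)
    return d / np.linalg.norm(d)

def direction_to_vp(d, c, f):
    # Inverse mapping; degenerates when d is parallel to the image plane (d[2] ~ 0).
    return np.array([c[0] + f * d[0] / d[2], c[1] + f * d[1] / d[2]])

def vp_angle(d1, d2):
    # Angular distance (degrees) between two vanishing points on the Gaussian sphere.
    return np.degrees(np.arccos(np.clip(abs(np.dot(d1, d2)), -1.0, 1.0)))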
Conic Convolutions. The conic convolution [179] is designed to extract point-related features. A conic convolution takes a feature map and a convolution center (the coordinates of a vanishing point candidate) as input, and outputs another feature map. Mathematically, a $3\times 3$ conic convolution operator "$*$" is defined as:
$$(F * w)(p \mid v) = \sum_{i=-1}^{1}\sum_{j=-1}^{1} w(i, j) \cdot F\!\left(p + R_{v-p} \cdot \begin{bmatrix} i \\ j \end{bmatrix}\right), \qquad (3.1)$$
where $F$ is the input feature map, $w$ is a $3\times 3$ trainable convolution filter, $p \in \mathbb{R}^2$ is the coordinate of the output pixel, $v$ is the convolution center that is set to the candidate positions of vanishing points, and $R_{v-p}$ is a 2D rotation matrix that rotates the x-axis to the direction of $v - p$. In other words, conic convolution is a structured convolution operator that always rotates the filters towards the vanishing point $v$ regardless of the output pixel coordinate $p$. Intuitively, conic convolution can be seen as a way to check whether there are enough lines shooting from a vanishing point.
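The following deliberately naive, single-channel NumPy sketch of Equation 3.1 uses nearest-neighbor sampling of F at the rotated offsets; it is meant only to make the operator concrete and does not reflect the actual batched, bilinear implementation.

import numpy as np

def conic_conv_naive(feat, w, v):
    # feat: (H, W) single-channel feature map; w: (3, 3) filter; v: (x, y) convolution center.
    H, W = feat.shape
    out = np.zeros_like(feat)
    for y in range(H):
        for x in range(W):
            ang = np.arctan2(v[1] - y, v[0] - x)          # direction from p towards v
            R = np.array([[np.cos(ang), -np.sin(ang)],
                          [np.sin(ang),  np.cos(ang)]])
            for i in (-1, 0, 1):
                for j in (-1, 0, 1):
                    dx, dy = R @ np.array([i, j], dtype=np.float64)
                    sx = int(round(np.clip(x + dx, 0, W - 1)))   # nearest-neighbor sample
                    sy = int(round(np.clip(y + dy, 0, H - 1)))
                    out[y, x] += w[i + 1, j + 1] * feat[sy, sx]
    return out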
3.2.2 Network Architecture
Fig. 3.2 illustrates our overall workflow. VaPiD takes an image as input and predicts the associated vanishing points. Specifically, a backbone network first extracts a feature map from the input image. Our vanishing point proposal network (VPPN) then generates a set of coarse vanishing point proposals using the feature map. Finally, the neural vanishing point optimizer (NVPO) optimizes each proposal individually for a fixed number of iterations. We introduce the designs of the VPPN and NVPO in Sec. 3.2.2.1 and Sec. 3.2.2.2, respectively. In the end, we describe the loss functions for training both modules in Sec. 3.2.2.3.
3.2.2.1 Vanishing Point Proposals
The goal of the vanishing point proposal network (VPPN) is to produce a set of vanishing point proposals efficiently. Let $\{v_i\}_{i=1}^{N}$ be an anchor point grid of size $N$ on a unit sphere. The vanishing point proposal network classifies each anchor point to determine whether there is a vanishing point around it. We employ a point-based non-maximum suppression (PNMS) on the score map of the anchor grid to generate the final candidates.
Efficient Conic Convolutions. Given a vanishing point anchor $v_i$, the conic convolution centered at $v_i$ relates the vanishing point to its line features. However, the conic convolution needs to process each vanishing point anchor separately, which is slow when $N$ is large. To solve this problem, we propose the efficient conic convolution operator to quickly compute $N$ feature maps by reusing some internal results with approximations.
Our key observation is that the rotation matrix $R_{(\cdot)}$ in Equ. 3.1 is the only factor that varies for the same pixel $p$. For a dense anchor point grid, the computation is redundant as multiple anchors may share similar rotation angles. Therefore, our method first approximates $R_{v-p}$ by $K$ rotation matrices $\{R_k\}_{k=0}^{K-1}$, where $R_k$ is a 2D rotation matrix that rotates by $\frac{2\pi k}{K}$ rad. We then pre-compute the feature map $G_k$ by convolving the input feature map with the kernel rotated by $R_k^{-1}$. After that, we can efficiently approximate the vanilla conic convolutions by retrieving the features from the pre-computed feature maps with the closest rotation angles. This process can be described with the following formulas:
$$
\begin{aligned}
(F * w)(p \mid v) &= \sum_{i=-1}^{1}\sum_{j=-1}^{1} w(i,j)\cdot F\!\left(p + R_{v-p}\cdot\begin{bmatrix} i \\ j\end{bmatrix}\right) && (3.2)\text{--}(3.3)\\
&\approx \sum_{i=-1}^{1}\sum_{j=-1}^{1} w(i,j)\cdot F\!\left(p + R_{k_{v,p}}\cdot\begin{bmatrix} i \\ j\end{bmatrix}\right) && (3.4)\\
&\approx \sum_{i=-1}^{1}\sum_{j=-1}^{1} w\!\left(R_{k_{v,p}}^{-1}\cdot\begin{bmatrix} i \\ j\end{bmatrix}\right) F\!\left(p + \begin{bmatrix} i \\ j\end{bmatrix}\right) && (3.5)\\
&\approx \left(F \otimes \mathrm{RotateKernel}\!\left(w, R_{k_{v,p}}^{-1}\right)\right)(p) && (3.6)\\
&\doteq G_{k_{v,p}}(p),
\end{aligned}
$$
where $k_{v,p} = \arg\max_{k} \mathrm{Tr}\!\left(R_k \cdot R_{v-p}^{T}\right)$ is the index of the closest rotation matrix, $\mathrm{RotateKernel}(w, R_{k_{v,p}}^{-1})$ rotates the convolutional kernel $w$ with the 2D rotation matrix $R_{k_{v,p}}^{-1}$, "$\otimes$" is the symbol of a regular convolution, and $\{G_k\}_{k=1}^{K}$ is the set of pre-computed feature maps obtained with regular 2D convolutions. Here, (3.2)-(3.3) is the definition of conic convolution, (3.3)-(3.4) is the step of rotation discretization, (3.4)-(3.5) is integration by substitution, and (3.5)-(3.6) follows from the definitions of rotation and convolution. Once we obtain the feature maps for all the anchor points, we pass them into a fully connected layer with a Sigmoid activation to obtain the classification scores.
Figure 3.3: Illustration of point-based non-maximum suppression (PNMS). (a) The confidence score map of a dense vanishing point anchor grid predicted by our efficient conic convolution network. Higher scores are visualized as solid spheres with larger radii. (b) Top-3 vanishing point proposals after PNMS.

It takes $N$ network forward passes for the vanilla conic convolutions to compute the feature maps for a vanishing point set of size $N$, while the proposed efficient conic convolutions only require $K$ network forward passes followed by a pooling operator with negligible computational cost, where $K$ is the number of discretized rotation matrices. We find that setting $K = 64 \ll N$ already gives good approximations. In addition, efficient conic convolution only requires regular 2D convolutions instead of the deformable convolutions used in [179], which in practice is much faster thanks to the extensive engineering efforts in modern deep learning frameworks.
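A hedged PyTorch sketch of the computation-sharing idea follows: pre-compute K regular convolutions with rotated copies of the kernel, then gather, per pixel, from the map whose discretized rotation is closest to the direction towards v. Rotating the kernel with affine_grid/grid_sample and the single-vanishing-point interface are simplifying assumptions.

import math
import torch
import torch.nn.functional as F

def efficient_conic_conv(feat, weight, v, K=64):
    # feat: (1, Cin, H, W) feature map; weight: (Cout, Cin, 3, 3) filter; v: (vx, vy) in pixels.
    Cout, Cin, kh, kw = weight.shape
    _, _, H, W = feat.shape
    maps = []
    for k in range(K):
        ang = -2.0 * math.pi * k / K                       # rotate the kernel by (approx.) R_k^{-1}
        theta = torch.tensor([[math.cos(ang), -math.sin(ang), 0.0],
                              [math.sin(ang),  math.cos(ang), 0.0]])
        theta = theta.unsqueeze(0).repeat(Cout * Cin, 1, 1)
        grid = F.affine_grid(theta, (Cout * Cin, 1, kh, kw), align_corners=False)
        w_rot = F.grid_sample(weight.reshape(Cout * Cin, 1, kh, kw), grid,
                              align_corners=False).reshape(Cout, Cin, kh, kw)
        maps.append(F.conv2d(feat, w_rot, padding=1))      # pre-computed map G_k
    G = torch.stack(maps, dim=0)                           # (K, 1, Cout, H, W)
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    dy = float(v[1]) - ys.float()
    dx = float(v[0]) - xs.float()
    ang = torch.atan2(dy, dx) % (2.0 * math.pi)            # per-pixel direction towards v
    k_idx = (ang / (2.0 * math.pi / K)).round().long() % K # closest discretized rotation
    out = G[k_idx, 0, :, ys, xs]                           # (H, W, Cout) per-pixel gather
    return out.permute(2, 0, 1).unsqueeze(0)               # (1, Cout, H, W)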
Vanishing Point Non-maximum Suppression (PNMS). The dense score maps tend to be locally smooth. To remove duplicated proposals, we adopt a point-based non-maximum suppression (PNMS) approach, inspired by the widely adopted NMS techniques in object detection [50]. PNMS keeps the locally highest-scoring vanishing points and suppresses their neighboring vanishing points within an angle threshold $\Gamma$. Figure 3.3 illustrates the effect of PNMS. After PNMS, we select the top-$K$ ranked anchors as our proposals.
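A brief NumPy sketch of the greedy suppression on the sphere; the function name, the degree-valued threshold, and the use of the absolute dot product (treating antipodal directions as the same vanishing point) are assumptions.

import numpy as np

def pnms(directions, scores, angle_thresh_deg=15.0, top_k=6):
    # directions: (N, 3) unit anchor directions; scores: (N,) classification scores.
    order = np.argsort(-scores)
    cos_thresh = np.cos(np.radians(angle_thresh_deg))
    keep = []
    for i in order:
        # Keep an anchor only if it is farther than the angle threshold from all kept ones.
        if all(abs(np.dot(directions[i], directions[j])) < cos_thresh for j in keep):
            keep.append(i)
        if len(keep) == top_k:
            break
    return keep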
3.2.2.2 Learning to Optimize Vanishing Points
The goal of the neural vanishing point optimizer (NVPO) is to fine-tune the vanishing point positions starting from the initial proposal $d^{(0)}$. Our NVPO emulates the process of iterative optimization and produces a sequence of estimates $d^{(1)}, \ldots, d^{(T)}$. In each iteration, it uses the image feature map and the current vanishing point position $d^{(t)}$ to regress an update vector $\delta^{(t)} = (\delta\theta^{(t)}, \delta\phi^{(t)})$ using the conic convolution network [179]. It then applies the update vector to the current vanishing point and obtains the next estimate (as shown in Equ. (3.7)). For each vanishing point proposal, we use $T$ iterations. The network weights of the NVPO are shared across all refinement iterations. As we only process a small number of vanishing points, we adopt the vanilla conic convolutions in our NVPO. We provide the network structure details in the supplementary materials.
We note that the update is applied in a local system defined by the position of the vanishing point to avoid the problem of gimbal lock. We first write $d^{(t)}$ in spherical coordinates and construct $\Delta^{(t)} \in \mathbb{R}^3$ from the regressed vector $\delta^{(t)}$:
$$d^{(t)} = \left[\cos\theta^{(t)}\sin\phi^{(t)},\ \sin\theta^{(t)}\sin\phi^{(t)},\ \cos\phi^{(t)}\right]^T,$$
$$\Delta^{(t)} = \left[\cos\delta\theta^{(t)}\sin\delta\phi^{(t)},\ \sin\delta\theta^{(t)}\sin\delta\phi^{(t)},\ \cos\delta\phi^{(t)}\right]^T.$$
We then define the local system $(X', Y', Z')$, where the $Z'$-axis corresponds to $d^{(t)}$ while keeping the $Z$-axis in the $Y'Z'$ plane. The update vector is then applied to the current estimate in $(X', Y', Z')$. This process can be viewed as a rotation transformation:
$$d^{(t+1)} = d^{(t)} \oplus \delta^{(t)} := \left[\, e^{(t)},\ d^{(t)} \times e^{(t)},\ d^{(t)} \,\right] \cdot \Delta^{(t)}, \qquad (3.7)$$
Figure 3.4: Illustration of our update operator. (a) Given the camera system $(X, Y, Z)$ and the current vanishing point position $d^{(t)}$, we define the local system $(X', Y', Z')$. (b) We obtain the refined vanishing point $d^{(t+1)}$ by applying the update vector $\Delta^{(t)}$ in the local system.
where $e^{(t)} = [-\sin\theta^{(t)}, \cos\theta^{(t)}, 0]^T$. This process is illustrated in Fig. 3.4. An important property of this update scheme is rotational equivariance: if one rotates the vanishing points with respect to the optical center, the refined vanishing points rotate in the same manner. This property is guaranteed by our method, as the conic convolutions centered at the vanishing points are by nature rotation invariant.
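A NumPy sketch of the update operator in Equation 3.7; the spherical parameterization follows the definitions above, and the function name is illustrative.

import numpy as np

def apply_update(d, d_theta, d_phi):
    # d: (3,) current unit estimate d^(t); (d_theta, d_phi): regressed update delta^(t).
    theta = np.arctan2(d[1], d[0])
    e = np.array([-np.sin(theta), np.cos(theta), 0.0])        # X'-axis of the local frame
    R_local = np.stack([e, np.cross(d, e), d], axis=1)        # columns [e, d x e, d]
    delta = np.array([np.cos(d_theta) * np.sin(d_phi),
                      np.sin(d_theta) * np.sin(d_phi),
                      np.cos(d_phi)])                         # Delta^(t) in local coordinates
    d_next = R_local @ delta                                  # Eq. (3.7)
    return d_next / np.linalg.norm(d_next)                    # re-normalize for safety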
Although the NVPO still uses conic convolution to compute the features, our approach is more efficient
compared to NeurVPS [179]. Specifically, NeurVPS samples candidate positions near the current estimates
and uses conic convolutional networks to determine if each candidate is near a real vanishing point. Even
with a coarse-to-fine strategy, it still needs to forward the conic convolutional networks 144 times per
vanishing point in order to reach high precision levels, which largely limits the model efficiency. We instead
adopt a “learning to optimize” methodology and directly regress the residuals of the vanishing points with
our conic convolutional networks. Such a design greatly accelerates the overall process. For instance, we
can now achieve better performance than NeurVPS with only a few network forwards. Alternatively, our
approach can be viewed as solving for equilibrium points: our update formulation $d^{(t+1)} = d^{(t)} \oplus \delta^{(t)} = f(d^{(t)})$ can be viewed as a fixed-point iteration method, where the function $f$ is learned to fit our objective.
3.2.2.3 Loss Functions
For training the VPPN, we assign a binary class label to each vanishing point anchor, where only the anchors with the closest angle to a ground truth are assigned positive labels. This gives the classification loss $\mathcal{L}_{cls} = \sum_{i=1}^{N} \mathrm{BCE}(l_i, l_i^*)$, where $l_i$ is the classification score for the $i$-th anchor and $l_i^*$ is the assigned label. For training the NVPO, we sample $M$ anchors around the ground truths as the initial states and supervise the NVPO with an angular loss between the estimates and the ground truths at each step: $\mathcal{L}_{ref} = \sum_{i=1}^{M}\sum_{t=1}^{T} \arccos\left(\left|\langle v_i^t, v_i^* \rangle\right|\right)$. We jointly train both modules with the final loss $\mathcal{L} = \mathcal{L}_{cls} + \lambda\, \mathcal{L}_{ref}$, where $\lambda$ is a trade-off hyper-parameter.
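A hedged PyTorch sketch of the two losses; the tensor shapes and the pre-computed anchor-to-ground-truth assignment are simplified assumptions.

import torch
import torch.nn.functional as F

def vapid_losses(anchor_scores, anchor_labels, refined_dirs, gt_dirs, lam=1.0):
    # anchor_scores: (N,) VPPN scores after the sigmoid; anchor_labels: (N,) binary targets.
    # refined_dirs:  (M, T, 3) unit estimates v_i^t for each proposal and refinement step.
    # gt_dirs:       (M, 3)    matched ground-truth directions v_i^*.
    l_cls = F.binary_cross_entropy(anchor_scores, anchor_labels.float(), reduction="sum")
    cos = (refined_dirs * gt_dirs[:, None, :]).sum(dim=-1).abs().clamp(max=1.0 - 1e-7)
    l_ref = torch.arccos(cos).sum()          # angular refinement loss over all i and t
    return l_cls + lam * l_ref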
3.2.3 Implementation Details
We implement our network in PyTorch. We resize the input images to 512$\times$512. For generating the vanishing point proposals, we first uniformly sample an anchor grid of size $N = 1{,}024$ using the Fibonacci lattice [53]. We then set the PNMS threshold $\Gamma = 15°$ and keep the top-$K$ proposals if the dataset assumes a fixed number of vanishing points $K$; otherwise $K = 6$. For vanishing point refinement, we set the estimation cap $S = 20°$ and compute losses for $T = 8$ refinement steps. We use the Adam optimizer [75] with a learning rate of $3\times 10^{-4}$ to train the network. We set the trade-off parameter $\lambda = 1$.
3.2.4 Experiments
3.2.4.1 Datasets
We conduct empirical studies on the following datasets:
SU3 Wireframe [180]. SU3 Wireframe is a photo-realistic synthetic urban scene dataset generated with a procedural building generator. It contains 22,500 training images and 500 validation images. The dataset assumes "Manhattan world" scenes, where each image has exactly three mutually perpendicular vanishing points. The ground truths are calculated from the CAD models, which are accurate enough for a systematic
investigation of vanishing point detection.
Natural Scene [181]. Collected from AVA and Flickr, this dataset contains images of natural scenes, where the authors pick only one dominant vanishing point as the ground truth. We adopt the data split from [179]
that divides the images into 2,000 training samples and 275 test samples.
HoliCity [178]. HoliCity is a city-scale real-world dataset with rich structural annotations. The ground
truths are accurately aligned with the CAD model of downtown London. There are various numbers of
vanishing points for each scene. We adopt the standard split that contains 45,032 training samples and
2,504 validation samples.
NYU-VP [78]. This dataset is manually labeled based on the NYU Depth V2 dataset [142]. It contains
1,449 indoor images. While most images show three ground-truth vanishing points, the number ranges from one to eight. We follow [78] and split the dataset into 1,000 training samples, 224 validation samples, and 225
testing samples.
3.2.4.2 Experiment Setups
Evaluation Metrics. We evaluate our method using mean and median angle errors. To better inspect our method under various precision levels, we also make use of the angle accuracy (AA) metrics [179], where $AA_\alpha$ is defined as the area under the angle accuracy curve between $[0, \alpha]$ divided by $\alpha$. For the NYU-VP dataset, we follow [78] and adopt the AUC metric.
Baselines. We compare our method against robust fitting methods J-Linkage [39], T-Linkage [109], Se-
quential RANSAC [156], Multi-X [5], MCT [108], and learning-based method CONSAC [78], based on
line segments extracted with LSD [157]. For vanishing point detection methods, we compare our method
Figure 3.5: Qualitative comparison with baseline methods on the SU3 Wireframe dataset [180]: (a) ground truth, (b) VaPiD, (c) NeurVPS, (d) J-Linkage. Line groups in the same color indicate the same vanishing point. We highlight all prediction errors in red. Best viewed in color.
against traditional methods VPDet [181] and Simon et al. [143]. For learning-based vanishing point detec-
tion methods, we compare our method against Zhai et al. [171], Kluger et al. [77], direct CNN regression
and classification [179], and previous state-of-the-art method NeurVPS [179].
3.2.4.3 Results on Synthetic Datasets
We show our results on SU3 Wireframe [180] in Tab. 3.1, and plot the AA curves in Fig. 3.6. With a similar
inference speed, VaPiD achieves an order of magnitude improvement over the naive CNN classification and
regression baselines. VaPiD also significantly outperforms the traditional line-based J-Linkage clustering
method [39]. Note that the synthetic dataset contains many sharp edges and long lines, which by nature
favors the line-based detectors. The improvement of the accuracy validates the efficacy of geometry-
inspired network designs such as conic convolutions and efficient conic convolutions. We also observe
that our method runs 17 times faster than our strong baseline NeurVPS [179], while achieving better
accuracies on all metrics. This indicates that our learning to optimize scheme can greatly improve the
model efficiency upon the coarse-to-fine enumerating strategy used in NeurVPS. Remarkably, VaPiD is
Figure 3.6: Angle accuracy curves (AA@0.5 and AA@2) and speed-accuracy comparisons for different methods on the SU3 wireframe dataset [180].
trained with angular metrics, but achieves a median angle error of 0.089° with float32. This is because $1 - \cos(0.089°) \approx 1.2\times 10^{-6}$, which is already close to the machine epsilon of 32-bit floating point numbers, $\epsilon \approx 1.2\times 10^{-7}$. This fact indicates that VaPiD is able to push the detection precision close to the numerical precision. This is also reflected in the stepped curves at high precision levels (i.e., below 0.1°) in Fig. 3.6.
QualitativeResults. We provide the visual comparison of the vanishing point detection with state-of-
the-art method NeurVPS [179] and J-Linkage [39] in Figure 3.5. The detection errors are shown in red
color. We observe that both the learning-based methods outperform the traditional method J-Linkage in
prediction accuracy. Although the synthetic scenes contain sharp edges and long lines, the performance
of J-Linkage is affected by clouds (second-row right panel), shadows (second-row left panel), and occluded
lines (third-row right panel). Compared to NeurVPS, our method is more robust to occlusion when one of
the vanishing points is occluded (third-row right panel). We believe this benefit comes from our design of using a dense vanishing point anchor grid for initialization.
3.2.4.4 Results on Real-World Datasets
Comparisons on Natural Scene. We show the comparisons on Natural Scene [181] in Tab. 3.2 and
Fig. 3.7. Our method significantly outperforms the naive CNN classification and regression baselines as
well as the contour-based clustering algorithm VPDet [181] in all metrics. It also outperforms the strong
Figure 3.7: Angle accuracy curves (AA@2 and AA@12) and speed-accuracy comparisons for different methods on the Natural Scene dataset [181].
Method            AA.2°    AA.5°    AA1°     Mean     Median
CNN-reg           2.03     6.48     15.02    2.077°   1.481°
CNN-cls           2.17     9.10     23.71    1.766°   0.984°
J-Linkage [39]    27.89    48.07    62.34    3.888°   0.209°
NeurVPS [179]     47.59    74.26    86.35    0.147°   0.090°
VaPiD             48.33    74.79    86.66    0.145°   0.088°
Table 3.1: Comparisons of mean and median angle errors and the angular accuracies at 0.2°, 0.5°, and 1° with baseline methods on the SU3 dataset [180].
baseline NeurVPS [179] in most of the metrics. We note that the Natural Scene [181] is captured by cameras
with different focal lengths. Such data favors the enumeration-based methods over the optimization-based
methods, especially at a tighter angle threshold (i.e. below 1°). Nonetheless, we highlight that for images
with one dominant vanishing point, VaPiD can run in real time (43 FPS) while maintaining competitive
performance.
Comparisons on HoliCity. We compare our method and NeurVPS [179] on the challenging real-world dataset HoliCity [178]. The dataset mostly embraces the Atlanta world assumption and contains images with a variable number of vanishing points. Our method outperforms NeurVPS on HoliCity. We believe the gain originates from more vanishing point proposals being retrieved, which shows that our VPPN can adapt to complex scenes. Thanks to our efficient conic convolutions, we can process a denser anchor grid to produce fine-grained proposals.
Method            AA1°    AA2°    AA10°    Mean     Median
CNN-reg           2.4     9.9     58.8     5.09°    3.20°
CNN-cls           4.4     14.5    62.4     5.80°    2.79°
VPDet [181]       18.5    33.0    60.0     12.6°    1.56°
NeurVPS [179]     29.1    50.3    85.5     1.83°    0.87°
VaPiD             24.6    49.5    87.4     1.26°    0.87°
Table 3.2: Comparisons of mean and median angle errors and the angular accuracies at 1°, 2°, and 10° with baseline methods on the Natural Scene dataset [181].
Method            AA1°    AA2°    AA10°    Mean     Median
NeurVPS [179]     18.2    31.7    62.1     8.32°    1.78°
VaPiD             22.1    39.6    75.4     3.00°    1.19°
Table 3.3: Comparisons of mean and median angle errors and the angular accuracies at 1°, 2°, and 10° with baseline methods on the HoliCity dataset [178].
Comparisons on NYU-VP. We compare against recent robust fitting methods CONSAC [78], T-Linkage [109],
Sequential RANSAC [156], Multi-X [5], MCT [108], and vanishing point detection methods Simon et
al. [143], Zhai et al. [171] and Kluger et al. [77] on the NYU-VP dataset [78], and show the results in
Tab. 3.4. To fairly compare with the baselines, we follow [78] and use the Hungarian method to match the
predictions and the ground truths. We find that in general the supervised methods perform better than tra-
ditional methods, and our method outperforms all baselines by a large margin. Compared to robust fitting
methods, VaPiD does not rely on prior line detectors. Instead, thanks to our geometry-inspired structures,
VaPiD can extract meaningful and robust line features from raw images intrinsically via end-to-end su-
pervised learning. Compared to recent learning-based vanishing point detectors, VaPiD can make use of
rich geometry cues, i.e. vanishing point-related line features, to accurately locate the vanishing points.
Qualitative Results. We visualize our detected vanishing points on HoliCity [178] in Figure 3.8, which
shows that VaPiD is able to generalize well to different types of scenes and is robust to the perspective
distortions. Thanks to the VPPN, our method can handle the input images with a variable number of
vanishing points (2 for the first-row left panel, 3 for the first-row middle panel, and 4 for the second-row
Figure 3.8: Visualization of our vanishing point detection results on various types of scenes (input, ground truth, and VaPiD result). For each vanishing point, we visualize a group of 2D lines in the same color. Best viewed in color.
Method                       AUC10°    Supervised
Multi-X [5]†                 41.3      no
MCT [108]†                   47.0      no
Sequential RANSAC [156]†     53.6      no
T-Linkage [109]†             57.8      no
Kluger et al. [77]           61.7      yes
Simon et al. [143]           62.1      no
Zhai et al. [171]            63.0      yes
CONSAC [78]†                 65.0      yes
VaPiD                        69.1      yes
Table 3.4: Comparisons of AUC values at 10° with baseline methods on the NYU-VP dataset [78]. Supervised methods are noted as "yes" in the last column. † Method requires an additional line segment detector such as LSD [157].
left panel) without relying on assumptions on the scene. In some cases, our predictions are even more
reasonable than ground truths (orange in the second-row right panel).
3.2.4.5 Ablations
In this section, we show ablation studies to investigate the effect of each component in our model. All
experiments are conducted on SU3 Wireframe [180], as we can eliminate the labeling errors with the
synthetic images.
The Effect of Efficient Conic Convolutions. The efficient conic convolutions are the core of our
VPPN. In Tab. 3.5, we demonstrate the effectiveness of the efficient conic convolutions by investigating
Method       Rec2°    Rec4°    Rec6°    Mean     Median
VPPN-ECC     33.20    72.13    89.53    0.554°   0.101°
VPPN         56.20    95.67    99.00    0.146°   0.089°
Table 3.5: Ablation study on the efficient conic convolutions. "VPPN-ECC" denotes the baseline without our proposed efficient conic convolutions.
Method       AA.1°    AA.5°    AA2°    Mean     Med.     FPS
NVPO×4       15.1     62.1     89.0    0.227°   0.145°   26
NVPO×6       24.3     72.7     92.4    0.157°   0.096°   22
NVPO×8       26.6     74.8     93.0    0.145°   0.088°   17
NVPO×12      27.5     75.2     93.2    0.141°   0.086°   13
NVPO×16      27.6     75.3     93.2    0.140°   0.086°   10
NVPO×24      28.1     75.6     93.3    0.139°   0.085°   7
Table 3.6: Ablation study on the number of refinement steps. (×T) indicates T refinement steps during inference.
a variant of VPPN (VPPN-ECC) that replaces the efficient conic convolutions with vanilla conic convolutions but has a similar computation cost. As the goal of our VPPN is to pinpoint all of the vanishing point proposals, we adopt a recall metric, where $\mathrm{Rec}_\alpha$ is the fraction of the ground truths that are successfully retrieved by one of the vanishing point proposals within the threshold $\alpha$. In Tab. 3.5, we observe that VPPN outperforms the VPPN-ECC baseline on all metrics. We note that the average closest-neighbor angle of our anchor grid is 4.3°. We find that VPPN achieves 99% recall with a threshold of 6°, whereas the VPPN-ECC variant still struggles at 90%. This validates the efficiency of the computation sharing scheme in our efficient conic convolutions.
Convergence of the Learned Optimizer. In the training stage, we compute losses for a fixed refine-
ment step of 8. To investigate the convergence of our NVPO, we apply different refinement steps during
inference and show the results in Tab. 3.6. As the refinement step increases, it is clear that our NVPO
gradually produces better vanishing point estimates, and can converge to a fixed point. We also make two
key observations: (1) our method runs at nearly real-time (26FPS) with 4 refinement steps, yet is accurate
enough (with a 0.15° median error) for many downstream applications; (2) even with 24 refinement steps,
where the performance is saturated, our method still runs 7 times faster than the previous state-of-the-art method [179], while being more accurate.
3.2.5 Discussion
This paper presents a novel neural network-based vanishing point detection approach that achieves state-of-the-art performance while being significantly faster than previous works. Our method contains two dedicated modules: a novel vanishing point proposal network and a neural vanishing point optimizer. Our key insight is to use computation sharing to accelerate massive convolution operations, and to embrace a learning-to-optimize methodology that progressively learns the residuals of the objective. In future work, we will study how to combine VaPiD with downstream applications such as scene understanding, camera calibration, and camera pose estimation.
Chapter 4
Learning to Optimize Face Avatars
Photo-realistic face avatar capture has become a key element in entertainment media due to the realism and
immersion it enables. As the digital assets created from photos of human faces surpass their artist-created
counterparts in both diversity and naturalness, there are increasing demands for the digitized face avatars
in the majority of the sectors in the digital industry: movies, video games, teleconference, and social media
platforms, to name a few. In a studio setting, the term “avatar” encompasses several production standards
for a scanned digital face, including high-resolution geometry (with pore-level details), high-resolution
facial textures (4K) with skin reflectance measurements, as well as a digital format that is consistent in
mesh connectivity and ready to be rigged and animated. These standards together are oftentimes referred
to as a production-ready face avatar.
In this paper, we consider a common face acquisition setting where a collection of calibrated cameras
capture the color images that are processed into a full set of assets for a face avatar. In general, today’s
professional setting employs a two-step approach to the creation of the face assets. The first step com-
putes a middle-frequency geometry of the face (with noticeable wrinkles and facial muscle movements) from multi-view stereo (MVS) algorithms. A second registration step is then taken to register the geometries to a template mesh connectivity, commonly of lower resolution with around 10k to 50k vertices.
For production use, the registered base mesh is augmented by a set of texture maps, composed of albedo,
specular and displacement maps, which are computed via photogrammetry cues and specially designed devices (e.g. polarizers and gradient light patterns in [107, 48]). The lower-resolution base mesh is combined with high-resolution displacement maps to represent geometry with pore- and freckle-level details. Modern physically-based rendering engines further utilize the albedo and specularity maps to render the captured face in photo-realistic quality.
While the avatars acquired thereby achieve satisfactory realism, many difficulties in this setting inevitably make high-quality face avatar capturing a costly operation that is far from mass production and easy accessibility. More specifically, traditional MVS algorithms and registration take hours to run for a single scan frame. The registration process is also error-prone, oftentimes requiring manual adjustment of the initialization and clean-up by professional artists. In addition, special devices (e.g. polarizers) are needed for capturing skin reflectance properties. The long production cycle, intensive labor, and equipment cost for special devices hold back a wider availability of high-quality facial capturing.
The growing demands for face acquisition in both research and digital production would greatly benefit from a much faster and fully automatic system that produces professional-grade face avatars. Fortunately, efforts towards this end are found in recently proposed neural-learning-based techniques. Model-based approaches such as DFNRMVS [4] incorporate a 3D morphable model as the prior to reconstruct face geometry from a sequence of input images. Despite achieving a vast increase in efficiency, they have yet to succeed in matching the quality and completeness of production-ready avatars, due to the limited expressiveness and flexibility of the parametric space of the morphable model. On the other hand, deep stereo matching approaches, such as [67], achieve higher accuracy in 3D reconstruction by accurately regressing depth under geometric priors. Our adaptation of these methods to the facial reconstruction setting has revealed that the best performing deep MVS framework [67] infers shapes with a median error of 0.88mm, with an inference time within a second. However, nontrivial steps are still required to
obtain the registered meshes and the corresponding texture maps. Recently, ToFu [93] has shown dedicated designs for neural face acquisition, achieving state-of-the-art accuracy in face reconstruction and providing a solution that combines reconstruction with registration in an end-to-end manner. ToFu learns the probability distributions of the individual vertices in a volumetric space, posing the reconstruction as a coarse-to-fine landmark classification problem. However, the formulation limits ToFu to a relatively low-resolution geometry representation, which is, in addition, incompatible with texture inference.
In light of the progress that remains to be made, our goal is a comprehensive, animation-ready neural face capturing solution that can produce production-grade dense geometry (combining a mid-frequency mesh with high-resolution displacement maps), complete texture maps (high-resolution albedo and specular maps) required by most PBR skin shaders, and consistent mesh connectivity∗ across subjects and expressions. Most importantly, our proposed model aims to be highly efficient, creating the comprehensive face assets within a second; fully automatic, as an end-to-end system without the need for manual editing and post-processing; and device-agnostic, easily adaptable to any novel multi-view capturing rig with minimal fine-tuning, including light-weight systems with sparse views.
Our proposed model is Recurrent Feature Alignment (ReFA), the first end-to-end neural-based system designed to faithfully capture both the geometry and the skin assets of a human face from multi-view image input and fully automatically create a face avatar that is production-ready. Compared to the state-of-the-art method [93], ReFA boosts the accuracy by 20%, to a median error of 0.608mm, and the speed by 40%, inferring high-quality textured shapes at 4.5FPS. The face geometries inferred by ReFA not only outperform the best deep MVS method [67], but are also reconstructed in a representation consistent in mesh connectivity that provides dense correspondences across subjects and expressions. In addition, a parallel texture inference network that shares the same representation with the geometry produces a full set of high-resolution appearance maps that allow for photo-realistic renderings of the reconstructed face.
∗ We use mesh connectivity to describe a mesh with well-defined faces "f" and vertex texture coordinates "vt" but without vertex coordinates "v".
ReFA is based on two key designs to realize the aforementioned goals. The first is the adoption of the position map [41] for representing geometry in a UV space. Such a representation is not only amenable to effective processing with image convolution networks, but also offers an efficient way to encode dense, registered shape information (a 128×128 position map encodes up to 16K vertices) across subjects and expressions, and organically aligns the geometry and texture space for the inference of high-frequency displacement maps and high-resolution textures. The position map also provides pixel-level flexibility for geometry optimization, which allows modeling of extreme expressions, non-linear muscle movements and other challenging cases. In this paper, we adopt a position map of 512×512 size, with a capacity of around 260K vertices, which is well capable of modeling middle-frequency details directly using a neural network. The second design is a learned recurrent face geometry optimizer that effectively aligns UV-space semantic features with the multi-view visual features for reconstruction with consistent mesh connectivity. The recurrent optimization is centered around a per-pixel visual-semantic correlation (VSC) that serves to iteratively refine the face geometry and a canonical head pose. The refined geometry then provides pixel-aligned signals to a texture inference network that accurately infers albedo, specular and displacement maps in the same UV space.
Experiments in Section 4.2.7 have validated that our system ReFA meets its goal in fast and accurate
multi-view face reconstruction, outperforming the previous state-of-the-art methods in both visual and
numerical measurements. We further show in an ablation study that our design choices are effective and
our model is robust to sparse-view input. As ReFA utilizes a flexible shape representation and produces a full set of face assets that is ready for production-level animation, we demonstrate applications in avatar creation, 4D capture and the adaptation of our model to the production of other digital formats.
In summary, our contributions are:
• ReFA, the first neural-based comprehensive face capturing system that faithfully reconstructs both the geometry and the skin assets of a human face from multi-view image input and fully automatically creates a 3D face avatar that is production-ready. Our model outperforms previous neural-based approaches in both accuracy and speed, with a median error of 0.6mm and a speed of 4.5FPS.
• Novel formulations of a recurrent geometry optimizer that operates on UV-space geometry features
and provides an effective solution to high-quality face asset creation.
• The proposed system has great application value in many downstream tasks including rapid avatar creation and 4D performance capture. We believe the improvements in speed and accuracy brought by our system will greatly facilitate the accessibility of face capturing and support an emerging industrial field that is data-hungry.
4.1 Related Work
Multi-view Stereo Face Capture Today's high-quality performance capture of the human face is commonly done with passive or active MVS capture systems (e.g. [11, 49, 106]). The complete procedure to acquire 3D avatars from the captured data involves two major steps, from multi-view stereopsis to registration, and each of them has been studied as an individual problem.
Multi-view stereopsis is commonly the first step for acquiring dense 3D geometry, and the algorithms proposed in the past have emphasized various designs for both joint view selection [72, 145, 138] and normal/depth estimation [138, 15, 45]. Neural-based MVS approaches proposed in recent years [67, 64, 58, 25] have significantly increased the efficiency and generalize well to as few as a pair of stereo images. Since our focus is on digital face reconstruction, we refer interested readers to [82, 2] for more comprehensive reviews.
The output of the multi-view stereopsis is, in general, in the form of dense depth maps, point clouds,
or 3D surfaces. Regardless of the specific representations, the geometries are processed into 3D meshes,
and a follow-up registration process aligns all captured shapes to a predefined template mesh connectivity.
The registration process is done either by explicitly regressing coefficients of a parametric face model [3,
13, 14, 92], directly optimizing the shape with a non-rigid Iterative Closest Point registration algorithm [89], or globally optimizing over a dataset of scanned faces to find groupwise correspondences [17, 172].
Learned Face Reconstruction from Images Settings where the face geometries are reconstructed
from a monocular image or a sparse set of views are in general ill-posed. Efforts in this direction are thus
mainly data-driven, where a popular line of methods can be considered as fitting parametric models to
the target image space, as seen in [46, 87, 150]. Deep neural networks have been utilized in most recent
works [135, 134, 136, 47, 148, 40, 149] for the regression of the parameters that drive a morphable model.
The quality and accuracy of monocular face reconstruction, albeit appealing in some circumstances, are not suitable for production use in professional settings. The inherent ambiguity of focal length, scale, and shape oftentimes leads a monocular reconstruction network to produce different shapes for the same face viewed at different angles [8].
Few works prior to us have attempted a data-driven approach to MVS face reconstruction. When the
camera views are abundant, modern face capture pipelines, e.g. [12, 19, 44], have demonstrated highly detailed and precise face reconstruction with pore-level appearance without the need for a learned mapping
in their computations. However, as introduced in the previous section, the manual costs and computation
overhead of these pipelines have at least inspired many to propose neural-based frameworks that auto-
mate and accelerate key steps in face capture applications, e.g. deep stereo matching and registration. The
recent work ToFu [93] is a notable neural framework that offers an end-to-end solution for registered face
geometry reconstruction, based on the prediction of the probabilistic distributions of individual vertices of
a template face mesh. ReFA expands its setting by including texture inference in the end-to-end network,
while our formulation, compared to ToFu, is able to infer denser geometry at an even faster speed.
Learned Optimizers for Geometry Inference. Our method is related to a broader trend of solving geometric optimization problems with recurrent neural networks, where feature correlations are computed iteratively to refine optical flow [147], depth [170] or vanishing points [102]. Motivated by the success of
neural optimizers in geometric refinement, we consider a novel reformulation of the face reconstruction as
an iterative refinement to a UV position map. Different from past literature that computes correlations in
image spaces, our method aligns vertices embedded in the UV-space position map to pixels from multiple
image-space views.
Texture Inference for Photo-realistic Rendering. Controlled environments are usually needed to collect the ground-truth photo-realistic appearance of a human face, exemplified by the Light Stage [33, 54, 48]. Neural-based reconstruction networks trained on the captured appearance information generally employ an encoder-decoder structure to simultaneously infer skin reflectance and illumination alongside the geometry [26, 168], where the quality of the inferred textures was limited due to either a reliance on synthetic data or an oversimplified reflectance model. The recent works [83] and [91] both utilized generative adversarial training and an image translation network to perform texture inference that is photo-realistic and render-ready, where high-quality albedo, displacement and specular maps were decoupled from the input face images.
4.2 Rapid Face Asset Acquisition with Recurrent Feature Alignment
4.2.1 Overview
As shown in Figure 4.4, our end-to-end system takes multi-view images and a predefined template UV
position map in a canonical space as input and produces 1) an updated position map, 2) estimated head
pose (3D rotation and translation) parameters to rigidly align the updated position map in camera space
to the canonical template space and 3) texture maps including the albedo map, the specular map, and the
displacement map. To support direct use for animation, the position map and the texture maps form the
entire face assets for realistic rendering and are all conveniently defined in the same (or up-sampled) UV
space. In the following Section 4.2.1.1, we detail the representations of the aforementioned entities as well
as the camera model.
The subsequent sections are dedicated to the three main components of our system: (1) the feature
extraction networks (Section 4.2.2) that extract features for the input images and a predefined UV-space
feature map; (2) the recurrent face geometry networks (Section 4.2.3) that take the output of the feature
extraction network and use a learned neural optimizer to iteratively refine the geometry from an initial
condition to a finer shape; and (3) the texture inference networks (Section 4.2.4) that take the inferred
geometry and the multi-view texture features to infer the high-resolution texture maps.
4.2.1.1 Preliminaries
Data Format. Table 4.1 specifies the symbols and formats of the input and output data involved in our pipeline. In addition to the details provided in the table, the input multi-view images {I_i}_{i=1}^K are indexed by the camera view, coming from K views with known camera calibrations {P_i | P_i ∈ R^{3×4}}_{i=1}^K. All feature maps are bilinearly sampled given real-valued coordinates. Specifically, the displacement map is designed to be added along the normal direction of the position map to provide high-frequency geometry details.
Name | Symbol | Dimension
Input multi-view images | I | R^{H×W×3}
Camera parameters | P | R^{3×4}
Head pose | [R, t] | R^{3×4}
UV-space position map | M | R^{Ht×Wt×3}
UV-space albedo map | A | R^{8Ht×8Wt×3}
UV-space specular map | S | R^{8Ht×8Wt}
UV-space displacement map | D | R^{8Ht×8Wt}
Table 4.1: Symbol table. By default, we set H = W = Ht = Wt = 512.
Geometry Representation. The position map M is our representation of the face geometry. M comes with a UV mapping from a template mesh connectivity, and thus each pixel on M is mapped to a vertex or a surface point of a 3D mesh. All the scanned meshes with different identities and expressions share the same UV mapping. Furthermore, each pixel in M stores the R^3 coordinates of its location in the canonical space. It therefore suffices to define a high-resolution geometry given a dense mesh and a UV mapping, as converting the position map to a 3D mesh simply amounts to setting the vertex positions of the mesh.
The UV-space representation of the geometry is in particular amenable to shape inference with a neural network, as the position map links the geometry space to a texture space that can be processed effectively by 2D convolutional neural networks. Since each pixel in M corresponds to a mesh vertex, a position map M of 512×512 resolution supports a dense geometry of up to roughly 260K vertices. Thus we believe that the position map is a powerful geometry representation that enables inference of highly detailed face assets.
Our system uses a common UV space across all the subjects and the expressions. This ensures that all the inferred geometries are registered. An additional advantage is that we can use any mesh connectivity that embraces the same UV mapping to sample from the position map and recover the vertex coordinates.
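As a concrete illustration of this sampling step, the minimal PyTorch sketch below recovers vertex coordinates from a position map for any mesh that shares the same UV mapping. The function name, tensor layouts and UV convention are illustrative assumptions rather than our actual implementation.

```python
import torch
import torch.nn.functional as F

def position_map_to_vertices(position_map, vertex_uvs):
    """Recover 3D vertex positions of a mesh from a UV-space position map.

    position_map: (Ht, Wt, 3) tensor storing canonical-space XYZ per texel.
    vertex_uvs:   (V, 2) per-vertex texture coordinates in [0, 1].
    Returns:      (V, 3) vertex positions, bilinearly sampled from the map."""
    pm = position_map.permute(2, 0, 1).unsqueeze(0)       # (1, 3, Ht, Wt)
    grid = vertex_uvs * 2.0 - 1.0                         # to [-1, 1] for grid_sample
    grid = grid.view(1, 1, -1, 2)                         # (1, 1, V, 2)
    verts = F.grid_sample(pm, grid, mode='bilinear', align_corners=True)
    return verts.view(3, -1).t()                          # (V, 3)

# Example with a random 512x512 position map and a toy set of UVs.
M = torch.rand(512, 512, 3)
uvs = torch.rand(1000, 2)
vertices = position_map_to_vertices(M, uvs)   # (1000, 3)
```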
Camera Model. We follow the pinhole camera model. For a 3D point X = [X, Y, Z]^T ∈ R^3 in the world space, its projection on the image plane x = [x, y]^T ∈ R^2 can be computed as z · [x, y, 1]^T = P · [X, Y, Z, 1]^T, where P ∈ R^{3×4} is the camera parameter matrix, including the intrinsic and extrinsic matrices.
Figure 4.1: Composition of the UV-space feature G. G is a concatenation of (a) UV-space coordinates, (b) the position map of a mean shape and (c) a carefully crafted face region map (31-dimensional one-hot vector). The composition serves to encode the facial semantics and the geometry priors necessary for the subsequent steps.
For convenience, we denote this relationship as
x = Π_P(X). (4.1)
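For reference, Equation (4.1) corresponds to the following minimal NumPy sketch (the function name is illustrative); it assumes P already folds the intrinsic and extrinsic matrices together as described above:

```python
import numpy as np

def project(P, X):
    """Pinhole projection x = Pi_P(X) of Equation (4.1).

    P: (3, 4) camera matrix (intrinsics times extrinsics).
    X: (N, 3) 3D points in world space.
    Returns (N, 2) image-plane coordinates."""
    X_h = np.concatenate([X, np.ones((X.shape[0], 1))], axis=1)  # homogeneous coords
    x_h = X_h @ P.T                                              # (N, 3): [z*x, z*y, z]
    return x_h[:, :2] / x_h[:, 2:3]                              # divide by depth z

# Example: identity rotation, unit focal length, points 2m in front of the camera.
P = np.hstack([np.eye(3), np.zeros((3, 1))])
pts = np.array([[0.1, -0.2, 2.0], [0.0, 0.0, 2.0]])
print(project(P, pts))
```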
4.2.2 Feature Extraction Networks
Image Space Features. From the input multi-view images {I_i}_{i=1}^K, we use a ResNet-like [61] backbone network to extract 2D features at 1/8 of the image resolution. The output is split into two branches: the geometry feature f_i ∈ R^{(W/8)×(H/8)×C} and the texture feature f_i^text ∈ R^{(W/8)×(H/8)×C_t}, given the view index i. The geometry feature map is used for estimating the head pose, represented as a 6-DoF rigid transformation [R, t], and the position map M (Section 4.2.3). The texture feature map is used for generating high-resolution texture maps including albedo maps A, specular maps S, and displacement maps D (Section 4.2.4).
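A minimal sketch of such a two-branch extractor is given below, assuming a truncated ResNet-18 trunk and placeholder channel widths; the actual backbone, channel dimensions and training details of our network are not specified by this snippet.

```python
import torch
import torch.nn as nn
import torchvision

class TwoBranchImageEncoder(nn.Module):
    """Illustrative encoder: a truncated ResNet-18 trunk produces features at
    1/8 of the input resolution, which are split into a geometry branch and a
    texture branch by two 1x1 convolutions. Channel widths are placeholders."""

    def __init__(self, c_geo=128, c_tex=64):
        super().__init__()
        r = torchvision.models.resnet18(weights=None)
        # conv1 + maxpool give 1/4 resolution; layer2 brings it to 1/8.
        self.trunk = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool,
                                   r.layer1, r.layer2)
        self.geo_head = nn.Conv2d(128, c_geo, kernel_size=1)
        self.tex_head = nn.Conv2d(128, c_tex, kernel_size=1)

    def forward(self, image):                 # image: (B, 3, H, W)
        feat = self.trunk(image)              # (B, 128, H/8, W/8)
        return self.geo_head(feat), self.tex_head(feat)

# Example: a single 512x512 view yields 64x64 feature maps.
f_geo, f_tex = TwoBranchImageEncoder()(torch.rand(1, 3, 512, 512))
```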
UV Space Features. As shown in Figure 4.1, from the template mesh and its UV mapping, we assemble the UV-space feature map G ∈ R^{Wt×Ht×36} by concatenating the following features for each pixel u: (1) the 2D coordinates of u itself, normalized to [−1, 1]^2 (Figure 4.1a); (2) the corresponding 3D coordinates
Figure 4.2: An illustration of visual-semantic correlation (VSC). A 3D local grid is built around the 3D position of each pixel in the UV-space position map. The volume of correlation features is then constructed by taking the inner product between each UV-space feature in the local grid and its projected features in the multi-view image space. The correlation feature is a local representation of the alignment between the observed visual information and the semantic priors.
of u in the mean face mesh (Figure 4.1b); (3) the one-hot encoding of its face region, where we manually create a semantic face region map including 31 regions (Figure 4.1c).
We process the feature G using a convolutional neural network and get the resulting UV-space feature map g ∈ R^{(Wt/8)×(Ht/8)×C}. Since G is a constant, the UV feature map g can also be understood as a trainable parameter, which is regularized by the CNN architecture and the construction of G. Once trained, we discard the neural network and set g as a fixed constant.
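The assembly of G itself is a simple concatenation; a minimal PyTorch sketch is shown below, where the function name, the input layouts and the use of a precomputed integer region-label map are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def build_uv_feature_G(mean_position_map, region_map, num_regions=31):
    """Assemble the 36-channel UV-space feature G described above:
    2 channels of normalized UV coordinates, 3 channels of the mean-shape
    position map, and a 31-dimensional one-hot face-region encoding.

    mean_position_map: (Ht, Wt, 3)
    region_map:        (Ht, Wt) integer labels in [0, num_regions)
    Returns:           (Ht, Wt, 36)"""
    Ht, Wt, _ = mean_position_map.shape
    v, u = torch.meshgrid(torch.linspace(-1, 1, Ht),
                          torch.linspace(-1, 1, Wt), indexing='ij')
    uv = torch.stack([u, v], dim=-1)                              # (Ht, Wt, 2)
    one_hot = F.one_hot(region_map.long(), num_regions).float()   # (Ht, Wt, 31)
    return torch.cat([uv, mean_position_map, one_hot], dim=-1)    # (Ht, Wt, 36)

G = build_uv_feature_G(torch.rand(512, 512, 3),
                       torch.randint(0, 31, (512, 512)))
```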
4.2.3 Recurrent Face Geometry Optimizer
Our network tackles the reconstruction task by solving an alignment problem. A UV-space position map
that represents the geometry is first initialized to be the mean face shape. In a practical face capturing
setting, the pose of the head relative to the geometry defined by the position map is unknown, so we
initialize the head pose as a random pose that is visible in all cameras. The initialized face geometry,
when projected to the image space, will show misalignment with the underlying geometry depicted in
the multi-view images. For instance, a projection of the eye on the initialized face geometry is likely not
aligned with the actual eye location in the image space. Our framework thus optimizes the face geometry
iteratively, such that the projection of the face in the UV space gradually converges to the ground-truth
locations in all image views. In order to solve the alignment problem, the features in the UV space and
the image space (Section 4.2.2) are joined in a unified feature space, such that the corresponding locations
in both spaces are trained to encode similar features. We compute a dense all-pair correlation between
the UV space and the image space and use a recurrent neural network to find the optimal matching in the
correlation tensor. Once the optimal matching is found in this process, the shape depicted by the position
map naturally reconstructs the shape depicted in the multi-view images.
In each network step, we update the position map M as well as the head pose [R, t] separately, given the correlation tensor between the two misaligned feature maps of interest, namely the UV feature map g and the image-space feature f. We term the optimizer that performs such actions the Recurrent Face Geometry Optimizer. In the following paragraphs, we describe in detail how our optimizer initializes, updates, and finalizes the corrections in order to recover M and [R, t].
Initialization. We initialize the head pose with a randomly selected rotation and a translation at the mean camera distance (≈ 1.3 meters). We also initialize the position map M^(0) as the mean shape. Such a design is due to the fact that, in a more practical setting, the captured subject's head may deviate from an upright position. In other words, we do not assume that the absolute pose of the head is known.
Compute Gradient. The Recurrent Face Geometry Optimizer is based on a recurrent neural network (RNN) composed of Gated Recurrent Units (GRU) [28], which computes the gradient on the pose (rotation R, translation t) and the geometry (position map M). At the t-th step, the neural network process could be written as:
y^(t) ← VSC({f_i}_{i=1}^K, g, R^(t−1), t^(t−1), M^(t−1)), (4.2)
h^(t) ← GRU(y^(t), h^(t−1)), (4.3)
Δ^(t) ← Decoder(h^(t)). (4.4)
In Equation (4.2), our Visual-Semantic Correlation (VSC) network (Section 4.2.3.1) matches the UV-space feature and the image-space feature, and produces a correlation feature map y^(t) ∈ R^{(Wt/8)×(Ht/8)×C_VSC}. Next, y^(t) is fed to a GRU-based RNN [28] and the hidden state h^(t) is updated from the previous h^(t−1) in Equation (4.3). Then, the Geometry Decoding Network (Section 4.2.3.2) processes the hidden vector h^(t) and computes the geometry update tuple Δ^(t) = (δR^(t), δt^(t), δM^(t)). The update tuple is applied by
R^(t) ← R^(t−1) · δR^(t),
t^(t) ← t^(t−1) + δt^(t), (4.5)
M^(t) ← M^(t−1) + δM^(t).
Given the total iterations T, the final output of the optimizer is simply [R, t]^(T) and M^(T).
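The control flow of Equations (4.2)-(4.5) amounts to the following skeleton (PyTorch-style, with stand-in callables for the VSC, GRU and decoder modules; names and shapes are illustrative and do not reproduce our exact implementation):

```python
import torch

def recurrent_geometry_optimization(vsc, gru, decoder, f_views, g,
                                    R0, t0, M0, h0, T=10):
    """Skeleton of the refinement loop in Equations (4.2)-(4.5). The vsc, gru
    and decoder arguments are stand-ins for the actual network modules; only
    the update logic is shown."""
    R, t, M, h = R0, t0, M0, h0
    for _ in range(T):
        y = vsc(f_views, g, R, t, M)    # Eq. (4.2): visual-semantic correlation
        h = gru(y, h)                   # Eq. (4.3): update the hidden state
        dR, dt, dM = decoder(h)         # Eq. (4.4): decode the update tuple
        R = R @ dR                      # Eq. (4.5): compose the rotation
        t = t + dt                      #            add the translation update
        M = M + dM                      #            add the position-map update
    return R, t, M

# Toy run with trivial stand-ins, just to exercise the control flow.
identity_update = lambda h: (torch.eye(3), torch.zeros(3), torch.zeros(8, 8, 3))
R, t, M = recurrent_geometry_optimization(
    vsc=lambda *args: None, gru=lambda y, h: h, decoder=identity_update,
    f_views=None, g=None, R0=torch.eye(3), t0=torch.zeros(3),
    M0=torch.zeros(8, 8, 3), h0=None, T=2)
```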
4.2.3.1 Visual-Semantic Correlation (VSC) Networks
To predict the update tuple, we construct a 2D feature map containing the signals toward which δM^(t)(u) and [δR^(t), δt^(t)] should orient. Our method is illustrated in Figure 4.2.
The 2D feature map computes the similarity between the multi-view geometry features f and the UV features g by constructing a correlation volume. Specifically, let M̂^(t) = R^(t) M^(t) + t^(t) be the transformed position map at the t-th step. We first enumerate a 3D regular grid of size (2r+1)×(2r+1)×(2r+1) around M̂^(t)(u) for each pixel u in the UV space, where r ∈ N is the grid resolution. We then project the grid points to the image space using the camera parameters P_i, and compare the features between the corresponding points in the image space of f and the UV space of g.
We use bilinear sampling to sample the features at non-integer indices in both spaces, and calculate the similarity as the inner product between two features: the UV features that contain semantic information, and the image features that contain visual information. We therefore call this process Visual-Semantic Correlation (VSC). Mathematically, this process is represented as
ỹ_i^(t)(u, Δi, Δj, Δk) = ⟨ f_i( Π_{P_i}( M̂^(t)(u) + c · [Δi − r, Δj − r, Δk − r]^T ) ), g(u) ⟩, (4.6)
where ỹ_i^(t) is the constructed 5D correlation tensor for the i-th camera view, f_i and g are the feature maps introduced in Section 4.2.2, Π_{P_i} is the projection operator introduced in Equation (4.1), u is the 2D coordinate in the UV space, r is the grid resolution, c is the searching radius controlling the grid size, Δi, Δj, Δk ∈ {1, 2, ..., 2r+1} are the offsets along the x-axis, y-axis, and z-axis, respectively, and "⟨·,·⟩" is the inner-product operator. The constructed 5D correlation tensor can be understood as guidance features for drawing the 3D points, represented by M^(t)(u), to new locations. After the correlation tensor ỹ_i^(t) is computed, we flatten it along the dimensions of Δi, Δj, and Δk. Finally, the flattened features at each view are fused by a chosen aggregation function σ to produce the input feature to the decoder, for which we have chosen the max pooling function:
y^(t) = σ(ỹ_1^(t), ..., ỹ_K^(t)). (4.7)
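A single-view version of Equation (4.6) can be sketched as follows (PyTorch; batching, boundary handling and the exact sampling conventions are simplified, and all names are illustrative rather than our actual code). Fusing the resulting per-view tensors with max pooling, as in Equation (4.7), then yields the decoder input.

```python
import torch
import torch.nn.functional as F

def visual_semantic_correlation(f_img, g_uv, M_hat, P, img_hw, r=3, c=1.0):
    """Sketch of Eq. (4.6) for one camera view.

    f_img: (C, Hf, Wf)  image-space geometry feature of this view
    g_uv:  (C, Hu, Wu)  UV-space feature map
    M_hat: (Hu, Wu, 3)  transformed position map
    P:     (3, 4)       camera matrix; img_hw: (H, W) input image size
    Returns a correlation tensor of shape (Hu, Wu, (2r+1)**3)."""
    H, W = img_hw
    Hu, Wu, _ = M_hat.shape
    offs = torch.arange(-r, r + 1, dtype=torch.float32) * c
    di, dj, dk = torch.meshgrid(offs, offs, offs, indexing='ij')
    grid3d = torch.stack([di, dj, dk], dim=-1).view(-1, 3)           # (G, 3)
    pts = M_hat.view(-1, 1, 3) + grid3d                              # (Hu*Wu, G, 3)
    # Project the 3D grid points into the image (pinhole model).
    pts_h = torch.cat([pts, torch.ones_like(pts[..., :1])], dim=-1)  # homogeneous
    xy_h = pts_h @ P.t()
    xy = xy_h[..., :2] / xy_h[..., 2:3].clamp(min=1e-6)
    # Normalize pixel coordinates to [-1, 1] and bilinearly sample the image features.
    norm = torch.stack([xy[..., 0] / (W - 1), xy[..., 1] / (H - 1)], dim=-1) * 2 - 1
    sampled = F.grid_sample(f_img.unsqueeze(0), norm.unsqueeze(0),
                            mode='bilinear', align_corners=True)     # (1, C, Hu*Wu, G)
    sampled = sampled.squeeze(0).permute(1, 2, 0)                    # (Hu*Wu, G, C)
    g = g_uv.permute(1, 2, 0).reshape(-1, 1, g_uv.shape[0])          # (Hu*Wu, 1, C)
    corr = (sampled * g).sum(-1)                                     # inner product
    return corr.view(Hu, Wu, -1)
```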
4.2.3.2 Geometry Decoding Network
The decoder, termed the Geometry Decoding Network, decodes the hidden state h^(t) into 1) a 7D vector representing the correction to the head pose: a 4D quaternion, which is then converted to a rotation matrix δR^(t) ∈ R^{3×3}, and a 3D translation δt^(t) ∈ R^3, and 2) the correction to the position map δM^(t). To compute the updates to the head pose, h^(t) is down-sampled with three 2-stride convolutions, followed by a global average pooling and two fully-connected layers. Updates to the position map are processed by a standard stacked hourglass network [123].
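An illustrative version of the pose branch is sketched below; channel widths and the exact layer configuration are placeholders, and the position-map branch (a stacked hourglass network) is omitted.

```python
import torch
import torch.nn as nn

class PoseHead(nn.Module):
    """Illustrative head-pose branch: three stride-2 convolutions, global
    average pooling and two fully connected layers regress a 7D vector
    (quaternion + translation). Channel sizes are placeholders."""

    def __init__(self, c_in=128, c_mid=128):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(c_mid, c_mid, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(c_mid, c_mid, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(nn.Linear(c_mid, c_mid), nn.ReLU(),
                                nn.Linear(c_mid, 7))

    def forward(self, h):                       # h: (B, c_in, Hh, Wh)
        x = self.convs(h).mean(dim=(2, 3))      # global average pooling
        out = self.fc(x)
        quat = torch.nn.functional.normalize(out[:, :4], dim=1)  # 4D quaternion
        return quat, out[:, 4:]                                   # 3D translation

quat, dt = PoseHead()(torch.rand(2, 128, 64, 64))
```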
4.2.4 Texture Inference
The goal of the texture inference is to predict the UV-space albedo map A, specular map S and displacement map D from the input images and the inferred geometry. Being able to predict geometry in the UV space, our formulation offers a direct benefit to the texture inference module, as the pixel-aligned signals between the UV space and the multi-view inputs are already prepared in the previous geometry inference step. The high-resolution texture maps are inferred based on the image texture features reprojected back to the UV space. Given the coordinates u in the UV space, the multi-view camera poses {P_i}_{i=1}^K, the inferred position map M^(T), and the inferred head pose [R, t]^(T), the pixel-aligned features for each view can be obtained as:
ỹ_i^text(u) = f_i^text( Π_{P_i}( R^(T) M^(T)(u) + t^(T) ) ), (4.8)
where f^text is the texture feature map introduced in Section 4.2.2. We index the feature map using bilinear sampling for non-integer coordinates. Similar to our face geometry module, we aggregate the UV-space features with the aggregation function:
y^text(u) = σ( ỹ_1^text(u), ..., ỹ_K^text(u) ), (4.9)
where σ is the aggregation function that aggregates the pixel-wise features across all views, which could be max, min, var, etc. Once the reprojected feature is obtained, three independent decoders regress A, S, and D simultaneously in the UV space in a coarse-to-fine fashion. Specifically, we first employ stacked hourglass networks to regress the diffuse albedo, specular and displacement maps at 512×512 resolution. We then use three consecutive image upsampling networks [162] to upsample the texture maps to 1024×1024, 2048×2048, and 4096×4096, respectively. For the diffuse albedo network, tanh is used as the activation function, while we do not add any activation functions for the specular and displacement networks. The distribution discrepancy is large for different texture map representations, although they are
defined in the same UV space. Thus the network parameters for the decoders are not shared for different
map representations. In order to produce sharp and high-fidelity textures, we follow [160, 166] to use an
adversarial loss in addition to the reconstruction loss for the training of the texture reconstruction.
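Putting Equations (4.8) and (4.9) together, the pixel-aligned sampling and max-pooling aggregation can be sketched as follows (PyTorch; per-view batching, visibility handling and normalization details are simplified, and all names are illustrative):

```python
import torch
import torch.nn.functional as F

def pixel_aligned_texture_features(f_tex_views, M, R, t, Ps, H, W):
    """Sketch of Eqs. (4.8)-(4.9): project every texel of the final position
    map into each view, bilinearly sample the texture features and fuse them
    with max pooling.

    f_tex_views: list of (Ct, Hf, Wf) per-view texture features
    M: (Hu, Wu, 3) final position map; R: (3, 3); t: (3,)
    Ps: list of (3, 4) camera matrices; H, W: input image size.
    Returns (Hu, Wu, Ct)."""
    Hu, Wu, _ = M.shape
    X = M.view(-1, 3) @ R.t() + t                                  # apply head pose
    X_h = torch.cat([X, torch.ones_like(X[:, :1])], dim=1)
    per_view = []
    for f_tex, P in zip(f_tex_views, Ps):
        xy_h = X_h @ P.t()
        xy = xy_h[:, :2] / xy_h[:, 2:3].clamp(min=1e-6)
        grid = torch.stack([xy[:, 0] / (W - 1), xy[:, 1] / (H - 1)], dim=1) * 2 - 1
        feat = F.grid_sample(f_tex.unsqueeze(0), grid.view(1, 1, -1, 2),
                             mode='bilinear', align_corners=True)   # (1, Ct, 1, N)
        per_view.append(feat.view(f_tex.shape[0], -1))              # (Ct, N)
    fused = torch.stack(per_view, dim=0).max(dim=0).values          # max over views
    return fused.t().reshape(Hu, Wu, -1)
```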
4.2.5 Training Loss
The training of the face geometry is supervised using the ground-truth head pose [R, t]^gt and position map M^gt. Both are supervised with an L1 loss between the prediction and the ground truth, summed over all iterations. For the head pose, we compute the loss function as:
L_P = Σ_t ( ||R^(t) − R^gt||_1 + ||t^(t) − t^gt||_1 ).
For the position map, we supervise the network with a dense L1 loss computed between the predicted position map and the ground truth after applying the corresponding head pose transformation:
L_M = Σ_t Σ_u || ( R^(t) M^(t)(u) + t^(t) ) − ( R^gt M^gt(u) + t^gt ) ||_1.
In order to learn accurate and photo-realistic textures, we supervise our texture inference network with L1 and adversarial losses (adv) on all texture maps including A, S, and D:
L_t = Σ_{T ∈ {A, S, D}} ( λ_adv · adv(T, T^gt) + Σ_u || T(u) − T^gt(u) ||_1 ).
Overall, we jointly train all modules using a multi-task loss:
L = λ_P L_P + λ_M L_M + λ_t L_t.
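A compact sketch of this multi-task objective is given below (PyTorch; the adversarial term is left as a stand-in callable, loss reductions are simplified to means, and the data layout of preds/gts is an assumption for illustration only):

```python
import torch

def refa_training_loss(preds, gts, lam_P=0.1, lam_M=1.0, lam_t=1.0, lam_adv=0.01,
                       adv_loss=lambda pred, gt: torch.tensor(0.0)):
    """Sketch of the multi-task loss. `preds['poses']` is a list of per-iteration
    (R, t) tuples, `preds['position_maps']` the matching list of position maps,
    and `preds`/`gts` also hold the texture maps."""
    l1 = lambda a, b: (a - b).abs().mean()
    # Pose loss summed over refinement iterations.
    L_P = sum(l1(R, gts['R']) + l1(t, gts['t']) for R, t in preds['poses'])
    # Dense position-map loss after applying the corresponding head pose.
    L_M = sum(l1(M @ R.t() + t, gts['M'] @ gts['R'].t() + gts['t'])
              for (R, t), M in zip(preds['poses'], preds['position_maps']))
    # Texture reconstruction + adversarial losses over albedo/specular/displacement.
    L_t = sum(lam_adv * adv_loss(preds[k], gts[k]) + l1(preds[k], gts[k])
              for k in ('albedo', 'specular', 'displacement'))
    return lam_P * L_P + lam_M * L_M + lam_t * L_t
```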
4.2.6 Implementation Details
Our system is fully implemented in PyTorch. All training and testing is performed on NVIDIA V100 graphics cards. All network parameters are randomly initialized and are trained using the Adam optimizer for 200,000 iterations with a learning rate of 3×10^−4. For the recurrent face geometry optimizer, we set the inference step to T = 10, the grid resolution to r = 3, the search radius to c = 1mm, and the loss weights of the head pose (λ_P) and the position map (λ_M) to 0.1 and 1, respectively. For the texture inference network, we use three separate discriminators for A, S and D. The loss weights of the L1 term (λ_t) and the discriminators (λ_adv) are set to 1 and 0.01, respectively. The dimensions of the input image, H and W, and of the UV maps, H_t and W_t, are set to 512. During training, we randomly select 8 camera views for each scan. We found it sufficient to train the network without data augmentation. The training process takes approximately 30 hours using 4 graphics cards. During inference, an arbitrary number of camera views can be used as input since our view aggregation function is not sensitive to the number of views. For inference with 8 camera views, our network consumes approximately 2GB of GPU memory.
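For reference, the hyper-parameters above roughly correspond to the following configuration sketch (the dictionary layout and the stand-in model are purely illustrative; they do not reproduce our actual training script):

```python
import torch

# Hypothetical training-setup sketch reflecting the hyper-parameters above;
# `model` is a placeholder for the full network.
model = torch.nn.Linear(8, 8)                       # stand-in module
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

config = dict(
    iterations=200_000,   # total training iterations
    T=10,                 # recurrent refinement steps
    r=3, c=1.0,           # VSC grid resolution and search radius (mm)
    lam_P=0.1, lam_M=1.0, lam_t=1.0, lam_adv=0.01,  # loss weights
    image_size=512, uv_size=512,
    views_per_scan=8,     # randomly selected camera views per training scan
)
```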
4.2.7 Results
4.2.7.1 Data Collection
Capture System Setup. Our training data is acquired by a Light Stage scan system, which is able to
capture at pore-level accuracy in both geometry and reflectance maps by combining photometric stereo
reconstruction [49] and polarization promotion [85]. The camera setup consists of 25 Ximea machine
vision cameras, including 17 monochrome and 8 color cameras. The monochrome cameras, compared
to their color counterparts, support more efficient and higher-resolution capturing, which is essential for sub-millimeter geometry details, albedo, and specular reflectance reconstruction. The additional color
cameras aid in stereo-based mesh reconstruction. The RGB colors in the captured images are obtained
by adding successive monochrome images recorded under different illumination colors as shown in [85].
Figure 4.3: An example set of subject data used for training. (a) Selected views of the captured images as input. (b) Processed geometry in the form of a 3D mesh. In addition to the face, head, and neck, our model represents teeth, gums, eyeballs, eye blending, lacrimal fluid, eye occlusion, and eyelashes. The green region denotes the face area that our model aims to reconstruct; the other parts are directly adopted from a template. (c) 4K×4K physically-based skin properties, including albedo (bottom-left), specular (top-left) and displacement maps (top-right) used for texture supervision, and the 512×512 position map (bottom-right), converted from the 3D mesh in (b), used for geometry supervision.
We selected a FACS set [36] which combines 40 action units into a condensed set of 26 expressions for each subject to perform. A total of 64 subjects, ranging from age 18 to 67, were scanned.
Data Preparation. Starting from the multi-view images, we first reconstruct the geometry of the scan
with neutral expression of the target subject using a multi-view stereo (MVS) algorithm. Then the recon-
structed scans are registered using a linear fitting algorithm based on a 3D face morphable model, similar
to the method in [13]. In particular, we fit the scan by estimating the morphable model coefficients using
Figure 4.4: Network architecture of ReFA. Our model recurrently optimizes for the facial geometry and the
head pose based on computation of visual-semantic correlation (VSC) and utilizes the pixel-aligned signals
learned thereof for high-resolution texture inference.
Method | <0.2mm (%) | <1mm (%) | <2mm (%) | Mean (mm) | Med. (mm) | Consistency | Dense | Texture
DFNRMVS [4] | 5.266 | 25.900 | 48.345 | 2.817 | 2.084 | ✓ | ✓ | ✗
DPSNet [67] | 12.645 | 55.042 | 82.171 | 1.197 | 0.882 | ✗ | ✓ | ✗
ToFu [93] | 15.245 | 61.493 | 83.162 | 1.273 | 0.742 | ✓ | ✗ | ✗∗
ReFA (Ours) | 18.382 | 70.547 | 91.605 | 0.848 | 0.603 | ✓ | ✓ | ✓
Table 4.2: Quantitative comparison on our Light Stage captured dataset. The table measures the percentage of points that are within the given distance of the ground-truth surface, the mean and median scan-to-mesh errors, and a comparison of the supported features. "Consistency" denotes whether the reconstructed mesh has consistent mesh connectivity, "Dense" denotes whether the model reconstructs geometry with more than 500k vertices [107], and "Texture" denotes whether the network output includes texture information. ∗Although the original work of ToFu includes texture inference, the module is separate from its main architecture and thus not learned end-to-end.
Figure 4.5: Qualitative comparison with the baselines (ToFu, DPSNet, DFNRMVS) on our testing dataset, alongside the input images and the ground truth. As the released code of the baseline methods [67] and [93] does not produce appearance maps, the results presented here are the networks' direct output geometry rendered with a basic shader using Blender. Visual inspection suffices to reveal the improvement our model has achieved: ReFA produces results that are more robust to challenging expressions (rows 1, 4, 5) and facial shapes (rows 6, 7), and reconstructs a wider face area including the ears and forehead when compared to [93] and [4].
Figure 4.6: Qualitative comparison with a NeRF-based method. Panels: (a) input, (b) our geometry, (c) our rendering, (d) NeRF geometry, (e) NeRF rendering.
Figure 4.7: A comparison between our method (c) and a traditional MVS and fitting pipeline (b), given the input (a). The traditional pipeline incorrectly reconstructs the two challenging input examples shown in the figure: a pointy ear in the upper case due to hair occlusion, and closed eyes in the lower case. Our system not only reconstructs the fine geometry details correctly, but also does so at a significantly faster speed.
Figure 4.8: Cumulative density function (CDF) curves of scan-to-mesh distance comparison on our testing
dataset.
linear regression to obtain an initial shape in the template topology. Then a non-rigid Laplacian deforma-
tion is performed to further minimize the surface-to-surface distance. We deform all the vertices on the
initially fitted mesh by setting the landmarks to match their correspondence on the scan surface as data
terms and use the Laplacian of the mesh as a regularization term. We adopt and implement a variation
of [144] to solve this system. Once the neutral expression of the target person is registered, the rest of
the expressions are processed based on it. We first adopted a set of generic blendshapes (a set of vertex differences computed between each expression and the neutral, with 54 predefined orthogonal expressions) and the fitted neutral base mesh to fit the scanned expressions, and then performed the same non-rigid mesh registration step to further minimize the fitting error. Additionally, to ensure the cross-expression consistency for the same identity, optical flow from the neutral to other expressions is added as a dense consistency constraint in the non-rigid Laplacian deformation step. This 2D optical flow will be further used as
a projection constraint when solving for the 3D location of a vertex on the target expression mesh during
non-linear deformation.
Figure 4.9: Testing result on the Triple Gangers [152] dataset (input multi-view images, captured geometry, rendering), whose capturing setup contains different camera placements, no polarizer, and lighting conditions that are not seen in our training dataset. The result demonstrated here shows that our model generalizes well to unseen multi-view datasets captured in different settings.
All the processed geometries and textures share the same mesh connectivity and thus have dense
vertex-level correspondence. The diffuse-specular separation is computed under a known spherical illu-
mination [106]. The pore-level details of the geometry are computed by employing albedo and normal
maps in the stereo reconstruction [49] and represented as displacement maps to the base mesh. The full
set of the generic model consists of a base geometry, a head pose, and texture maps (albedo, specular intensity, and displacement) encoded in 4K resolution. 3D vertex positions are rasterized to a three-channel
HDR bitmap of 256× 256 pixels resolution to enable joint learning of the correlation between geome-
try and albedo. 15 camera views are used for the default setting to infer the face assets with our neural
network. Figure 4.3 shows an example of captured multi-view images and a full set of our processed face
asset that is used for training. In addition to the primary assets generated using our proposed network,
Figure 4.10: Speed-accuracy graph of ReFA with varying numbers of inference iterations, compared against ToFu.
we may also assemble secondary components (e.g., eyeballs, lacrimal fluid, eyelashes, teeth, and gums) onto the network-created avatar. Based on a set of handcrafted blendshapes with all the primary and secondary parts, we linearly fit the reconstructed mesh by computing the blending weights that drive the secondary components to travel with the primary parts, such that the eyelashes will travel with the eyelids. Except for the eyeballs, the other secondary parts share a set of generic textures for all the subjects. For the eyeballs, we adopt an eyeball asset database [79] with 90 different pupil patterns to match the input subjects. Note that all the eyes share the same shape as in [79] and in our database. For visualization purposes, we manually pick the matching eye color. The dataset is split into 45 subjects for training and 19 for evaluation. Each set of captures contains a neutral face and 26 expressions, including extreme face deformations, asymmetrical motions, and subtle expressions.
Figure 4.11 shows the rendered results using the complete set of assets produced by our system from
randomly selected testing data, including the input reference images, the directly inferred texture maps,
and the renderings under different illuminations. In addition, Figure 4.12 shows a detailed visualization of
the inferred high-resolution texture maps: diffuse albedo, specular and displacement maps. All results are rendered using the reconstructed geometries and texture maps in Maya Arnold with a physically-based shader under environment illumination provided by HDRI images.
Figure 4.11: Images rendered from our reconstructed face assets. For each example we show the reference image and inferred maps, followed by renderings under two illuminations. Geometry constructed from the input images and the inferred appearance maps are used in the physical renderings with Maya Arnold under lighting environments provided by HDRI images. The renderings achieve photo-realistic quality that faithfully recovers the appearance and expression captured in the input photos.
Figure 4.12: Detailed results for the texture map inference. For each example we show the reference image, diffuse albedo, specular map and displacement map. The even rows display zoomed-in crops of the 4096×4096 texture maps. Our texture inference network constructs texture maps from the multi-view images with high-frequency details that essentially allow for photo-realistic renderings of the face assets.
In the following sections, we provide a comparative evaluation against directly related baseline methods (Section 4.2.7.2) as well as an ablation study (Section 4.2.7.3). In addition, we also demonstrate three meaningful applications that ReFA enables in Section 4.2.7.5.
To quantitatively evaluate the geometry reconstruction, we first convert our inferred position map to a mesh representation as described in Section 4.2.1.1. We then compute the scan-to-mesh errors following the protocol of [93], with the exception that the errors are computed on a full face region including the ears. We measure both the mean and median errors as the main evaluation metrics, given that the two statistics capture the overall accuracy of the reconstruction models. To better quantify the errors for analysis, we additionally show the Cumulative Density Function (CDF) curves of the errors, which measure the percentage of point errors that fall below a given error threshold.
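As a simplified illustration of these metrics, the sketch below computes point-to-nearest-point distances between a scan and a sampled reconstruction and evaluates the CDF at the thresholds used in Table 4.2. A production evaluation would measure true point-to-surface distance on the mesh faces; the names and the brute-force nearest-neighbor search are purely illustrative.

```python
import numpy as np

def scan_to_mesh_errors(scan_points, mesh_points):
    """Simplified scan-to-mesh error: distance from every scan point to its
    nearest point sampled on the reconstructed mesh."""
    d2 = ((scan_points[:, None, :] - mesh_points[None, :, :]) ** 2).sum(-1)
    return np.sqrt(d2.min(axis=1))          # (N_scan,) distances in mm

def cdf_at(errors, thresholds_mm=(0.2, 1.0, 2.0)):
    """Percentage of scan points whose error falls below each threshold."""
    return {t: 100.0 * float((errors <= t).mean()) for t in thresholds_mm}

errors = scan_to_mesh_errors(np.random.rand(1000, 3) * 10,
                             np.random.rand(5000, 3) * 10)
print(np.mean(errors), np.median(errors), cdf_at(errors))
```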
Method | Time | FPS | Med. (mm)
Traditional | ≈ 30 min | - | -
DFNRMVS [4] | 4.5 s | - | 2.084
ToFu [93] | 0.39 s | 2.6 | 0.742
Ours | 0.22 s | 4.5 | 0.603
Ours (small) | 0.11 s | 9 | 0.763
Table 4.3: Inference time comparison. Our model is both more efficient and more accurate than the baselines, running at 4.5FPS. A lighter model that achieves similar accuracy to ToFu runs at 9FPS, which is close to real-time performance.
4.2.7.2 Comparisons
Baselines. We compare our method with respect to geometry accuracy against three strong deep-learning-based baselines from three different categories: (1) DFNRMVS [4], a topologically consistent face inference network operating on a sequence of images; (2) DPSNet [67], a representative MVS depth estimation network that achieved state-of-the-art results on several MVS benchmarks; and (3) ToFu [93], a topologically consistent multi-view face reconstruction network that most resembles our setting and, prior to our work, achieved state-of-the-art results on neural-based face reconstruction. The baseline results were obtained with their publicly released code using the 15-view input from our prepared dataset (discussed in Section 4.2.7.1).
Qualitative Comparisons. Figure 4.5 provides a visual comparison of the reconstructed geometries between our method and the baselines. Visual inspection suffices to show the qualitative improvements brought by our method. First, for certain examples (3rd, 5th and 6th rows), our reconstructed faces faithfully resemble the ground-truth appearances, whereas the model-based methods (DFNRMVS, ToFu) display apparent errors in the overall facial shape and in specific parts (eye size, jaw shape). Second, our reconstruction is more robust to challenging expressions: mouth funnel (1st row), cheek raising (4th row), lip stretching (7th row), and asymmetrical, non-linear mouth movement such as stretching the mouth to one side (5th row). Third, as we focus on a full-face reconstruction, we note that certain methods (DFNRMVS, ToFu) fail to reconstruct the ears of the subjects, whereas ours correctly infers the shape of the ears as seen from the input images. Last but not least, our results show the best geometry details, as our method captures the middle-frequency details where others fail, such as the wrinkles on the foreheads in the 2nd and 4th rows and the dimples on the faces in the 2nd and 7th rows.
In Figure 4.7, we show the comparison between our method and the traditional face reconstruction and registration pipeline (described in Section 4.2.7.1) given challenging inputs. In these cases of occlusion and noise, for example due to hair occlusion (upper case) or the specific eye pose (lower case), the traditional pipeline struggles to either reconstruct the accurate face shape or fit the template mesh connectivity to the correct expressions. In practice, the raw reconstruction from the MVS algorithm contains a certain amount of geometric noise, which requires manual clean-up by professional artists to remove the errors before the registration step. In contrast, despite not being trained on these examples, our network manages to infer the correct shape automatically. We believe this is attributed to the learned data priors, such as face semantic information, from the training dataset. This validates that our system is more robust than the traditional pipeline in challenging and noisy situations.
In Figure 4.6, we also compare our method with a NeRF-based method, representative of a recent stream of image-based rendering (IBR) works. The NeRF baseline is an implementation from [120]. With 15 views, NeRF fails to produce production-ready geometries due to the lack of visual correspondences and facial semantics, despite the decent rendered results, while our method achieves superior quality in geometry reconstruction. In addition, our inferred maps are amenable to renderings in different environments using established physically-based render engines.
Quantitative Comparisons. Table 4.2 and Figure 4.8 show our quantitative comparison and CDF curve comparison with the baseline methods on our test dataset, respectively. ReFA outperforms the baselines in all metrics. Remarkably, ReFA achieves a median error of 0.603mm, which outperforms the strongest baseline by 19%. In terms of recall, we observe that our model brings the largest improvement in the high-precision range, covering 20.6% more points within the 0.2mm precision when compared to the best baseline, and 14.7% and 10.1% more at the 1mm and 2mm thresholds, respectively. The increased accuracy of our model is complemented by the fact that, to our knowledge, our model is the only neural-based face asset creation system that simultaneously generates consistent, dense and textured assets in an end-to-end manner (see the right panel in Table 4.2).
Besides the accuracy, our system also runs significantly faster than previous works. We show the inference time comparison in Table 4.3 and Figure 4.10. The traditional method takes hours to process a single frame of a multi-view capture; despite being accurate, its time consumption becomes prohibitive for processing large-scale data and performance sequences. Compared to previous deep-learning-based works, our system achieves significantly better accuracy while being 40% faster.
To draw a controlled comparison that isolates the speed improvement, we have designed a smaller model by slightly modifying our network: (1) using a light-weight feature extraction network; (2) reducing the searching grid dimension from r = 3 to r = 2; and (3) reducing the UV-space resolution to 128×128. This smaller model achieves similar accuracy and model resolution as the previous state-of-the-art method [93], while achieving an inference speed of 9 frames per second (FPS).
We have also quantitatively evaluated the accuracy of the head pose and the quality of the inferred albedo maps. The estimated head poses have a mean rotation error of 1.429° and a median of 1.255°, and a mean translation error of 2.91mm and a median of 2.70mm. The inferred albedo maps, when compared to the ground truths, have a mean PSNR of 28.29dB and a mean SSIM of 0.75.
Generalization. Our model is tested on the Triple Gangers [152] dataset in Figure. 4.9, which contains
different illumination and camera placements from our training set. In this generalized setting, our model
produces high reconstruction quality comparable to the results on our testing set. In particular, our model
is able to infer the displacement and specular maps, which would not have been accessible using the
traditional approach, due to the hardware limits in its capturing system (without the use of polarizers, it
is hard to separate the skin reflectance). In addition, our model also achieves a 0.89mm median error on
images with random light sources (shadows), which indicates that our model can generalize to different lighting conditions to a certain extent.
4.2.7.3 Ablation Study
Figure 4.13: Reconstruction of a video sequence at 4.5 FPS (frames 1-8, each showing the reference image and the captured geometry), where the expression and the head pose of the subject change over time.
To validate our design choices, we conduct extensive ablation studies by altering our key building
modules, including 1) the choices of the correlation feature and the learned embedding feature, 2) the
choices of the view aggregation function and the search radius in computing the visual-semantic correlation, 3) the resolution of the UV map, 4) the removal of the GRU in the recurrent layer, and 5) the number of input views. The detailed statistics of the ablation study are shown in Table 4.4.
Correlation features. Based on designs explored in prior works [147, 67], we alter the correlation feature by directly concatenating the UV features and the multi-view image features (“concat” in the first row of Table 4.4), instead of computing their similarity through an inner product. This change increases training difficulty and is shown to be less effective in the position matching task.
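As an illustration, the following is a minimal sketch, under assumed tensor names and shapes, of the two fusion choices compared in this ablation: an inner-product visual-semantic correlation versus direct concatenation.

import torch

def correlation_feature(uv_feat, img_feat):
    # uv_feat:  (B, C, N)     per-texel UV-space (semantic) features
    # img_feat: (B, C, N, K)  image features sampled at K candidate offsets
    #                         around each texel's current position estimate
    # Inner product over the channel dimension gives a (B, N, K) correlation volume.
    return torch.einsum('bcn,bcnk->bnk', uv_feat, img_feat) / uv_feat.shape[1] ** 0.5

def concat_feature(uv_feat, img_feat):
    # The "concat" ablation: stack both features and leave the matching to later
    # layers, which the ablation shows is harder to train.
    K = img_feat.shape[-1]
    uv_rep = uv_feat.unsqueeze(-1).expand(-1, -1, -1, K)
    return torch.cat([uv_rep, img_feat], dim=1)  # (B, 2C, N, K)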
View aggregation function. Three different functions for fusing features across views, namely max pooling, mean pooling and variance pooling, are investigated, and the results are shown in the second row of Table 4.4. Somewhat to our surprise, the max aggregation function performs significantly better than the others, although mean pooling is commonly utilized in other multi-view frameworks (e.g., [67]). We speculate that the ability of max pooling to attend to the most salient views allows it to discard noise. This behavior also suggests a difference between our visual-semantic correlation features and typical MVS network features, which are based solely on visual similarity. Notably, max pooling is also more robust to scenarios where regions of the face are occluded in certain views.
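A minimal sketch of the three aggregation functions, assuming per-view correlation volumes stacked in a tensor of shape (B, V, N, K):

import torch

def aggregate_views(corr: torch.Tensor, mode: str = 'max') -> torch.Tensor:
    if mode == 'max':    # attend to the most salient view; robust to occlusion
        return corr.max(dim=1).values
    if mode == 'mean':   # the pooling commonly used in MVS-style cost volumes
        return corr.mean(dim=1)
    if mode == 'var':    # photo-consistency style variance pooling
        return corr.var(dim=1, unbiased=False)
    raise ValueError(f'unknown aggregation mode: {mode}')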
Search radius in VSC. According to the results, our model achieves the best performance at a reasonable
search radius of 3mm. We believe this is because a smaller search radius requires more update steps, while
a larger search radius leads to less precision.
UV-space resolution. By default, we set the UV-space resolution of the position map to 512× 512, which is equivalent to a dense mesh of approximately 260K faces. In many practical situations, inference speed may be preferred over precision. We thus investigate the effectiveness of our system under various UV-space resolutions, including 512× 512, 256× 256 (≈ 65K faces), and 128× 128 (≈ 16K faces). From the results we observe that decreasing the UV-space resolution slightly decreases the performance. However, even under the lowest UV-space resolution, our model still outperforms the baselines, as a 128× 128 position map still encodes a denser mesh than the parametric models utilized by the baselines (e.g., [93]). This validates the position map as a more advantageous representation in the face
asset inference task.
Figure 4.14: Ablation study on the UV-space embedding network: (a) our proposed UV-space features produced with the embedding network; (b) the UV-space feature directly set as a learnable parameter (without the embedding network). To visualize the features, we use t-SNE [155] to embed them into 3-dimensional space.
Recurrent layer. We investigate the effectiveness of the gated recurrent unit (GRU) module. The GRU layers take the features of the current step and the hidden state from the previous step as inputs, and output the updated hidden state. As an ablation, we replace the GRU layers with plain convolution layers. The results show that the GRU performs significantly better than convolution, indicating that the GRU layers better capture long-term memory across update steps.
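For reference, the following is a minimal sketch of the two recurrent update cells compared here, in the spirit of the convolutional GRU used in RAFT [147]; channel sizes are assumptions.

import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    def __init__(self, hidden_dim=96, input_dim=128):
        super().__init__()
        self.convz = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.convr = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.convq = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)

    def forward(self, h, x):
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz(hx))    # update gate
        r = torch.sigmoid(self.convr(hx))    # reset gate
        q = torch.tanh(self.convq(torch.cat([r * h, x], dim=1)))
        return (1 - z) * h + z * q           # gated blend carries long-term state

class ConvCell(nn.Module):
    # The "Conv." ablation: no gating, so information from earlier steps decays quickly.
    def __init__(self, hidden_dim=96, input_dim=128):
        super().__init__()
        self.conv = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)

    def forward(self, h, x):
        return torch.relu(self.conv(torch.cat([h, x], dim=1)))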
UV features. We remove different components of the UV features, including the UV coordinates (U), position map (P), and region map (R) (see the “UV-space Feature” row of Table 4.4), and find that removing any component decreases the performance, with removing the region map causing the largest drop. We speculate that this is because the region map, compared to the UV coordinate feature, explicitly encodes the semantics. In addition, we have tried to directly use an embedding tensor as a learnable parameter for the UV features (“Parameter” in the “UV-space Embedding” row of Table 4.4) instead of filtering the input UV maps with a neural network (“Network” in the “UV-space Embedding” row of Table 4.4). Since this drastically increases the number of parameters that must be trained without regularization, the performance under this design decreases slightly. To better understand the effectiveness of the learned embedding
Figure 4.15: Given valid UV mappings, the position map representation is amenable to conversion to various representations, shown column by column: coarse mesh, position map, reference image, dense mesh, landmarks, point cloud, and region map. This includes 3D meshes of different subdivisions, which enables Level of Detail (LOD) rendering, as well as point cloud, landmark and region map representations that are commonly used in mobile applications.
network network, we visualize the UV features of both designs with t-SNE [155] in Figure 4.14. We observe that both feature maps exhibit symmetry and a regional distribution that complies with the actual face regions. However, the learned embedding produces much better regularized feature maps, which validates the effectiveness of our UV-space embedding network.
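The two embedding designs can be summarized by the following minimal sketch, with assumed channel counts (2 for U, 3 for P, 1 for R) and layer sizes.

import torch
import torch.nn as nn

class UVEmbeddingNetwork(nn.Module):
    # "Network": filter the stacked UV inputs (U, P, R) with a small convolutional network.
    def __init__(self, out_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, out_dim, 3, padding=1),
        )

    def forward(self, uv, pos, region):
        return self.net(torch.cat([uv, pos, region], dim=1))  # (B, out_dim, H, W)

class UVEmbeddingParameter(nn.Module):
    # "Parameter": a free embedding tensor; many unregularized parameters, slightly worse.
    def __init__(self, out_dim=64, uv_res=512):
        super().__init__()
        self.embedding = nn.Parameter(torch.randn(1, out_dim, uv_res, uv_res) * 0.01)

    def forward(self, uv=None, pos=None, region=None):
        return self.embedding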
Number of views. Decreasing the number of capture views needed to faithfully reconstruct a face asset is essential for supporting a light-weight capturing system. The benefits of fewer views are twofold: (1) the capturing system needs fewer cameras; (2) the savings in storage space for the raw data are significant: approximately 2 terabytes of storage can be freed if only half of the 16 cameras are needed for a 10-minute video of a subject. The “Input View” row of Table 4.4 shows our model’s performance given different numbers of views as input. It is noteworthy that, while the performance decreases as fewer views are available, reducing 16 views to 6 views only results in a 9% increase in median error, and the achieved precision remains sufficient for use in a professional setting (<1mm). Reducing to 4 views comes with a 20% increase in median error. However, even the 4-view reconstruction with our model outperforms all compared baselines that utilize 15-view input (see Table 4.2). We believe that these results encourage a practical solution to reconstructing face assets from sparse views, where a traditional acquisition algorithm would struggle.
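As a back-of-the-envelope check of the storage figure above, the following sketch uses assumed capture parameters (the frame rate and per-frame raw size are not specified here and are illustrative only):

cameras_removed = 8          # half of a 16-camera rig
duration_s = 10 * 60         # a 10-minute video
fps = 30                     # assumed capture frame rate
bytes_per_frame = 14e6       # assumed ~14 MB per raw frame (e.g. ~12 MP, 12-bit)

saved_bytes = cameras_removed * duration_s * fps * bytes_per_frame
print(f'{saved_bytes / 1e12:.1f} TB freed')   # ~2.0 TB, consistent with the estimate above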
4.2.7.4 Devices and Real-world Captures
To demonstrate the capability of ReFA in handling real-world capture, we build a real-world face capturing system from scratch, spanning data preparation, model training, device building, and live capturing.
Synthetic data generation. In order to generalize to in-the-wild camera settings, we use our face asset dataset with controlled scene setups to generate photo-realistic synthetic data. We randomly generate the eye colors, and manually add glasses to 10% of the data. We also randomly decrease the roughness of the skin to simulate oily skin. For each face model, we randomly generate 24 views with a focal length between 1905px and 1910px, and perturb the principal point with a Gaussian distribution with a sigma of 3px. The cameras are placed uniformly at distances between 1.3 and 1.5 meters from the center of the face model, with random XY-axis and in-plane rotations, both drawn from Gaussian distributions with a sigma of 1.7°. In total, our synthetic dataset contains 300,000 images.
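A minimal sketch of the randomized camera sampling described above; the exact parameterization (units and how the rotations are composed) is an assumption for illustration.

import numpy as np

rng = np.random.default_rng(0)

def sample_camera():
    focal = rng.uniform(1905.0, 1910.0)                  # focal length in pixels
    principal = rng.normal(loc=0.0, scale=3.0, size=2)   # principal-point jitter (px)
    distance = rng.uniform(1.3, 1.5)                     # meters from the face center
    xy_rot_deg = rng.normal(0.0, 1.7, size=2)            # rotations about the X and Y axes
    inplane_deg = rng.normal(0.0, 1.7)                   # in-plane (roll) rotation
    return dict(focal=focal, principal=principal, distance=distance,
                xy_rot_deg=xy_rot_deg, inplane_deg=inplane_deg)

views = [sample_camera() for _ in range(24)]             # 24 views per face model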
Model training. We train our model with the synthetic dataset for 30 epochs. The model setup and the learning rate scheduling are the same as in the previous experiments.
Capturing devices. We build a prototype of a portable capturing device, shown in Figure 4.16. It contains 6 commercially available cameras (5 distributed horizontally and 1 mounted above) and 4 light boxes. It took 3 people 6 hours to build the device, and 15 minutes to perform the camera calibration.
Testing results. An example result on real-world images is shown in Figure 4.17. Surprisingly, ReFA generalizes well to facial hair, even though very few synthetic samples contain heavy facial hair.
Figure 4.16: Our capture devices.
4.2.7.5 Applications
Performance capture. Fast geometry inference is a notable strength of our model. In particular, our small model (128× 128 resolution) achieves close-to-real-time performance at 9 FPS. This efficiency reveals the potential application of our model in neural-based performance capture. We thus demonstrate two dynamic sequences in Figure 4.13, where only 8 camera views are used. Both the input images and the re-
constructed meshes (converted from inferred output) are visualized. Importantly, the reconstructed meshes
are color-mapped by the UV coordinates. The accurate reconstruction together with the color mapping
demonstrates that our system is capable of capturing accurate face geometry from a video sequence while
maintaining correspondences across the captured shapes. We believe that this application is a showcase
of the readiness of our system for performance capture in the digital industry.
Figure 4.17: Example reconstruction results on real-world images captured with our prototype device.

Extended representation. The UV-space position map offers significant flexibility in supporting various types of output. Figure 4.15 shows different use cases, where the same position map is converted to meshes of various densities, a point cloud, landmarks, and region segmentation maps. Specifically, our position map can be converted to different mesh topologies seamlessly as long as a valid UV mapping is provided. This opens up potential use in Level of Detail (LOD) rendering, where meshes of different detail levels are needed in real-time applications. In addition, by choosing specific points in the UV space, point cloud, landmark, and region map representations can be extracted from the position map.
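The conversions are straightforward to implement; the following is a minimal sketch (index selection and sampling details are assumptions, and nearest-neighbor UV lookup is used for brevity).

import numpy as np

def position_map_to_point_cloud(pos_map, valid_mask):
    # pos_map: (H, W, 3) 3D positions; valid_mask: (H, W) boolean face-region mask
    return pos_map[valid_mask]                           # (N, 3) point cloud

def position_map_to_mesh(pos_map, uv_coords, faces):
    # uv_coords: (V, 2) in [0, 1] and faces: (F, 3) of any target topology / LOD
    H, W, _ = pos_map.shape
    px = np.clip((uv_coords[:, 0] * (W - 1)).round().astype(int), 0, W - 1)
    py = np.clip((uv_coords[:, 1] * (H - 1)).round().astype(int), 0, H - 1)
    vertices = pos_map[py, px]                           # sample 3D positions at the UVs
    return vertices, faces

def position_map_to_landmarks(pos_map, landmark_uvs):
    # landmark_uvs: (L, 2) fixed UV locations of semantic landmarks
    verts, _ = position_map_to_mesh(pos_map, landmark_uvs, faces=None)
    return verts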
4.2.8 Discussion
We present an end-to-end neural face capturing framework, ReFA, that effectively and efficiently infers
dense, topologically consistent face geometry and high-resolution texture maps that are ready for pro-
duction use and animation. Our model tackles the challenging problem of multi-view face asset inference
by utilizing a novel geometry representation, the UV-space position map, and a recurrent face geome-
try optimizer that iteratively refines the shape and pose of the face through an alignment between the
input multi-view images and the UV-space features. Experimental results have demonstrated that our design choices allow ReFA to improve upon previous neural-based methods and achieve state-of-the-art results in accuracy, speed and completeness of shape inference. In addition, our model is shown to be device-agnostic across various capture settings, including sparse views and views under different lighting conditions, with little compromise in performance. We believe that this progress opens up ample opportunities for rapid and easily accessible face acquisition that meets the high demand for face
asset capturing in the digital industry.
Future work. Our current network is not originally designed for performance capture, as it is trained on a database of static scans. A natural next step of this work is to extend our design to specifically process video sequences for performance capture. We believe that features specifically designed for temporal integration may enhance the speed and temporal consistency beyond what our current framework achieves. Another meaningful direction to explore is to extend our approach to a single-view setting, or the more challenging setting where the input is in-the-wild. As occlusion, shadows, and noise may become major obstacles limiting the performance of a single-view reconstruction algorithm, we believe that leveraging additional priors, such as a symmetry assumption on the face, may help address these challenges.
Method                      Mean (mm)   Med. (mm)
Correlation Feature
  Correlation (default)     0.848       0.603
  Concat                    0.990       0.713
View Aggr. Func.
  Max (default)             0.848       0.603
  Mean                      1.083       0.827
  Var                       1.169       0.865
Search Radius
  1mm                       0.869       0.615
  3mm (default)             0.848       0.603
  5mm                       0.901       0.640
UV-space Resolution
  512 (default)             0.848       0.603
  256                       0.872       0.620
  128                       0.966       0.668
Recurrent Layer
  GRU (default)             0.848       0.603
  Conv.                     0.880       0.623
UV-space Feature
  U+P+R (default)           0.848       0.603
  U+P                       1.225       0.918
  U                         1.246       0.951
UV-space Embedding
  Network (default)         0.848       0.603
  Parameter                 0.935       0.692
Input View
  15 (default)              0.848       0.603
  8                         0.901       0.632
  6                         0.930       0.652
  4                         1.014       0.720
Table 4.4: Ablation study on our dataset. Items marked (default) are our default settings. Correlation feature: whether the visual-semantic correlation is used (“Correlation”) or the semantic and visual features are simply concatenated (“Concat”). View aggr. func.: choice of the pooling function. Grid size: the total side length of the 3D grid built for computing correlation. Search radius: the search radius in computing the visual-semantic correlation. Recurrent layer: whether a GRU is used or replaced by convolution layers. UV-space feature: components of the UV-space features: UV coordinates (U), position map (P), and face region map (R). UV-space embedding: whether the UV-space feature is learned by a neural network (“Network”) or directly set as learnable parameters (“Parameter”). Input view: number of views used as input at inference. Notably, decreasing the number of views at inference only results in a slight decrease in performance; even with only 4 views, our model still achieves better accuracy than the best baseline, which utilizes 15 views.
Chapter 5
Conclusion and Future Directions
5.1 Summary of Research
This dissertation addresses practical problems in real-world 3D computer vision applications. We present methods and algorithms for inferring the geometry and appearance of general objects without 3D ground truth, accurately estimating vanishing points, and rapidly acquiring human face avatars. More importantly, we demonstrate the generalizability of the proposed framework to other geometry tasks.
For general objects, we introduce a novel differentiable renderer that utilizes a truly differentiable
formulation. This approach provides analytically consistent forward and backward functions, enabling
high-quality shape and texture inference from 2D image supervision. We also plug the operator into a neural network that is equipped with a shape generator and a color generator. We render the inferred shape back to the 2D image domain and compute loss functions against the input images. This allows us to train a single-view reconstruction neural network using only 2D supervision. Due to our truly differentiable formulation, our method significantly outperforms previous works on the ShapeNet benchmark and on complex pose optimization tasks.
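A minimal sketch of this render-and-compare training loop; the generator, the differentiable soft_rasterize operator, and the single L1 image loss are placeholders for the actual networks and loss terms.

import torch

def training_step(generator, soft_rasterize, images, cameras, optimizer):
    optimizer.zero_grad()
    mesh, colors = generator(images)                  # predicted shape and appearance
    rendered = soft_rasterize(mesh, colors, cameras)  # differentiable rendering to 2D
    loss = torch.nn.functional.l1_loss(rendered, images)  # 2D supervision only
    loss.backward()                                   # gradients flow through the renderer
    optimizer.step()
    return loss.item()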
Next, we focus on scene structure understanding and investigate the vanishing point detection problem, since vanishing points can be converted to camera or scene parameters, making them a crucial element in understanding the underlying structure of a scene. We emphasize performance, especially the accuracy and the running speed of vanishing point detection. To enable high efficiency, we train a recurrent neural network to optimize the vanishing point position given a rough initialization. The recurrent neural network utilizes conic convolution as its basic building block, which is equivariant to rotations about the vanishing point and thereby aids accurate detection. The proposed method runs more than 10 times faster than the previous state of the art, with even better accuracy.
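The learned-optimizer view of this detector can be summarized by the following minimal sketch, where conic_feature_net and update_net are placeholders for the conic-convolution feature extractor and the recurrent update module.

import torch

def refine_vanishing_point(image_feat, vp_init, conic_feature_net, update_net, num_steps=4):
    vp, hidden = vp_init, None
    for _ in range(num_steps):
        # Features extracted along lines through the current estimate; conic
        # convolution makes this step equivariant to rotations about the vanishing point.
        feat = conic_feature_net(image_feat, vp)
        delta, hidden = update_net(feat, hidden)      # recurrent residual update
        vp = vp + delta
    return vp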
Finally, we address the human face avatar acquisition task, an important aspect of AR/VR applications. Traditional methods take around 20 minutes to process a single scan, which is unacceptable for large-scale face acquisition. We extend the neural optimization framework to more complex shape representations to boost the inference speed of face avatar acquisition. Specifically, we combine the strength of the semantic consistency of meshes with the continuity of implicit surfaces, and propose to use a UV position map as our shape representation. We then define a UV feature space and an image feature space, and convert the surface acquisition problem into a feature matching problem. To this end, we propose a visual-semantic correlation that computes correlation signals in the neighborhood of each point on the surface, and we use this correlation volume as the signal for the neural optimizer. Our model runs at 4.5 FPS with 15 input views, which is 1000 times faster than the traditional method and 50% faster than the previous learning-based method, with significantly better accuracy.
5.2 Open Questions and Future Work
In this section, we discuss open questions and potential future research opportunities.
Multi-task Geometry Supervision. One potential future direction for the research presented here is to explore ways to incorporate more complex and diverse data sources into the training process.
For example, for the 3D object reconstruction and vanishing point detection tasks, incorporating additional
modalities such as LiDAR or depth sensors could potentially improve performance and accuracy.
Temporal Information. Another direction could be to explore ways to incorporate temporal informa-
tion, such as video sequences, into the training process. This could potentially improve the accuracy and
robustness of the models, especially for tasks such as vanishing point detection or scene understanding
where the temporal information can provide additional context.
Dynamic Facial Movement Capturing. Additionally, for the human face avatar acquisition task, ex-
ploring ways to incorporate dynamic facial expressions and movements could be a potential direction for
future work. This could enable the creation of more realistic and expressive avatars for AR/VR applica-
tions.
Bibliography
[1] Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas Guibas. “Learning
representations and generative models for 3d point clouds”. In: arXiv preprint arXiv:1707.02392
(2017).
[2] Jens Ackermann and Michael Goesele. “A survey of photometric stereo techniques”. In:
Foundations and Trends® in Computer Graphics and Vision 9.3-4 (2015), pp. 149–254.
[3] Brian Amberg, Reinhard Knothe, and Thomas Vetter. “Expression invariant 3D face recognition
with a morphable model”. In: 2008 8th IEEE International Conference on Automatic Face & Gesture
Recognition. IEEE. 2008, pp. 1–6.
[4] Ziqian Bai, Zhaopeng Cui, Jamal Ahmed Rahim, Xiaoming Liu, and Ping Tan. “Deep facial
non-rigid multi-view stereo”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition. 2020, pp. 5850–5860.
[5] Daniel Barath and Jiri Matas. “Multi-class model fitting by energy minimization and
mode-seeking”. In: Proceedings of the European Conference on Computer Vision (ECCV). 2018,
pp. 221–236.
[6] Olga Barinova, Victor Lempitsky, Elena Tretiak, and Pushmeet Kohli. “Geometric image parsing
in man-made environments”. In: Proceedings of European conference on computer vision. Springer.
2010, pp. 57–70.
[7] Stephen T Barnard. “Interpreting perspective images”. In: Artificial intelligence 21.4 (1983),
pp. 435–462.
[8] Anil Bas and William AP Smith. “What does 2D geometric information really tell us about 3D
face shape?” In: International Journal of Computer Vision 127.10 (2019), pp. 1455–1473.
[9] Louis Bavoil and Kevin Myers. “Order independent transparency with dual depth peeling”. In:
NVIDIA OpenGL SDK (2008), pp. 1–12.
[10] Jean-Charles Bazin and Marc Pollefeys. “3-line RANSAC for orthogonal vanishing point
detection”. In: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE.
2012, pp. 4282–4287.
[11] Thabo Beeler, Bernd Bickel, Paul A. Beardsley, Bob Sumner, and Markus H. Gross. “High-quality
single-shot capture of facial geometry”. In: ACM Transactions on Graphics (TOG). 2010.
[12] Thabo Beeler, Fabian Hahn, Derek Bradley, Bernd Bickel, Paul Beardsley, Craig Gotsman,
Robert W Sumner, and Markus Gross. “High-quality passive facial performance capture using
anchor frames”. In: ACM SIGGRAPH 2011 papers. 2011, pp. 1–10.
[13] Volker Blanz and Thomas Vetter. “A morphable model for the synthesis of 3D faces”. In:
Proceedings of the 26th annual conference on Computer graphics and interactive techniques. 1999,
pp. 187–194.
[14] Volker Blanz and Thomas Vetter. “Face recognition based on fitting a 3d morphable model”. In:
IEEE Transactions on pattern analysis and machine intelligence 25.9 (2003), pp. 1063–1074.
[15] Michael Bleyer, Christoph Rhemann, and Carsten Rother. “Patchmatch stereo-stereo matching
with slanted support windows.” In: Bmvc. Vol. 11. 2011, pp. 1–11.
[16] Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and
Michael J Black. “Keep it SMPL: Automatic estimation of 3D human pose and shape from a single
image”. In: European Conference on Computer Vision. Springer. 2016, pp. 561–578.
[17] Timo Bolkart and Stefanie Wuhrer. “A groupwise multilinear correspondence optimization for 3d
faces”. In: Proceedings of the IEEE international conference on computer vision. 2015, pp. 3604–3612.
[18] Ali Borji. “Vanishing point detection with convolutional neural networks”. In: arXiv preprint
arXiv:1609.00967 (2016).
[19] George Borshukov, Dan Piponi, Oystein Larsen, John P Lewis, and Christina Tempelaar-Lietz.
“Universal capture: image-based facial animation for ‘The Matrix Reloaded’”. In: ACM SIGGRAPH 2005 Courses. 2005, 16–es.
[20] Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. “OpenPose: Realtime
Multi-Person 2D Pose Estimation using Part Affinity Fields”. In: arXiv preprint arXiv:1812.08008
(2018).
[21] Loren Carpenter. “The A-buffer, an antialiased hidden surface method”. In: Proceedings of the 11th
annual conference on Computer graphics and interactive techniques. 1984, pp. 103–108.
[22] Jonathan C Carr, Richard K Beatson, Jon B Cherrie, Tim J Mitchell, W Richard Fright,
Bruce C McCallum, and Tim R Evans. “Reconstruction and representation of 3D objects with
radial basis functions”. In: Proceedings of the 28th annual conference on Computer graphics and
interactive techniques. ACM. 2001, pp. 67–76.
[23] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li,
Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. “Shapenet: An information-rich 3d
model repository”. In: arXiv preprint arXiv:1512.03012 (2015).
[24] Chin-Kai Chang, Jiaping Zhao, and Laurent Itti. “DeepVP: Deep learning for vanishing point
detection on 1 million street view images”. In: 2018 IEEE International Conference on Robotics and
Automation (ICRA). IEEE. 2018, pp. 1–8.
[25] Jia-Ren Chang and Yong-Sheng Chen. “Pyramid stereo matching network”. In: Proceedings of the
IEEE conference on computer vision and pattern recognition. 2018, pp. 5410–5418.
[26] Anpei Chen, Zhang Chen, Guli Zhang, Kenny Mitchell, and Jingyi Yu. “Photo-realistic facial
details synthesis from single image”. In: Proceedings of the IEEE/CVF International Conference on
Computer Vision. 2019, pp. 9429–9439.
[27] Zhiqin Chen and Hao Zhang. “Learning Implicit Fields for Generative Shape Modeling”. In:
Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019).
[28] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares,
Holger Schwenk, and Yoshua Bengio. “Learning Phrase Representations using RNN
Encoder-Decoder for Statistical Machine Translation”. In: Proceedings of the 2014 Conference on
Empirical Methods in Natural Language Processing (EMNLP). Association for Computational
Linguistics, 2014.
[29] Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. “3D-R2N2: A
Unified Approach for Single and Multi-view 3D Object Reconstruction”. In: Proceedings of the
European Conference on Computer Vision (ECCV). 2016.
[30] Roberto Cipolla, Tom Drummond, and Duncan P Robertson. “Camera Calibration from Vanishing
Points in Image of Architectural Scenes.” In: BMVC. Vol. 99. 1999, pp. 382–391.
[31] Timothy F Cootes, Gareth J Edwards, and Christopher J Taylor. “Active appearance models”. In:
IEEE Transactions on Pattern Analysis & Machine Intelligence 6 (2001), pp. 681–685.
[32] James M Coughlan and Alan L Yuille. “Manhattan world: Compass direction from a single image
by bayesian inference”. In: Proceedings of the seventh IEEE international conference on computer
vision. Vol. 2. IEEE. 1999, pp. 941–947.
[33] Paul Debevec, Tim Hawkins, Chris Tchou, Haarm-Pieter Duiker, Westley Sarokin, and
Mark Sagar. “Acquiring the reflectance field of a human face”. In: Proceedings of the 27th annual
conference on Computer graphics and interactive techniques. 2000, pp. 145–156.
[34] Valentin Deschaintre, Miika Aittala, Fredo Durand, George Drettakis, and Adrien Bousseau.
“Single-image SVBRDF capture with a rendering-aware deep network”. In: ACM Transactions on
Graphics (TOG) 37.4 (2018), p. 128.
[35] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. “Image super-resolution using
deep convolutional networks”. In: IEEE transactions on pattern analysis and machine intelligence
38.2 (2015), pp. 295–307.
[36] Paul Ekman and Wallace V. Friesen. “Facial action coding system: a technique for the
measurement of facial movement”. In: Consulting Psychologists Press. 1978.
[37] Eric Enderton, Erik Sintorn, Peter Shirley, and David Luebke. “Stochastic transparency”. In: IEEE
transactions on visualization and computer graphics 17.8 (2010), pp. 1036–1047.
[38] Haoqiang Fan, Hao Su, and Leonidas J Guibas. “A Point Set Generation Network for 3D Object
Reconstruction from a Single Image.” In: CVPR. Vol. 2. 4. 2017, p. 6.
[39] Chen Feng, Fei Deng, and Vineet R Kamat. “Semi-automatic 3d reconstruction of piecewise planar
building models from single image”. In: CONVR (Sendai:) (2010).
[40] Yao Feng, Haiwen Feng, Michael J Black, and Timo Bolkart. “Learning an animatable detailed 3D
face model from in-the-wild images”. In: ACM Transactions on Graphics (TOG) 40.4 (2021),
pp. 1–13.
[41] Yao Feng, Fan Wu, Xiaohu Shao, Yanfeng Wang, and Xi Zhou. “Joint 3d face reconstruction and
dense alignment with position map regression network”. In: Proceedings of the European
Conference on Computer Vision (ECCV). 2018, pp. 534–551.
[42] John Flynn, Michael Broxton, Paul Debevec, Matthew DuVall, Graham Fyffe, Ryan Overbeck,
Noah Snavely, and Richard Tucker. “Deepview: View synthesis with learned gradient descent”.
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019,
pp. 2367–2376.
[43] Yasutaka Furukawa and Jean Ponce. “Accurate, dense, and robust multiview stereopsis”. In: IEEE
transactions on pattern analysis and machine intelligence 32.8 (2010), pp. 1362–1376.
[44] Graham Fyffe, Koki Nagano, Loc Huynh, Shunsuke Saito, Jay Busch, Andrew Jones, Hao Li, and
Paul Debevec. “Multi-View Stereo on Consistent Face Topology”. In: Computer Graphics Forum.
Vol. 36. 2. Wiley Online Library. 2017, pp. 295–309.
[45] David Gallup, Jan-Michael Frahm, Philippos Mordohai, Qingxiong Yang, and Marc Pollefeys.
“Real-time plane-sweeping stereo with multiple sweeping directions”. In: 2007 IEEE Conference on
Computer Vision and Pattern Recognition. IEEE. 2007, pp. 1–8.
[46] Pablo Garrido, Michael Zollhöfer, Dan Casas, Levi Valgaerts, Kiran Varanasi, Patrick Pérez, and
Christian Theobalt. “Reconstruction of personalized 3D face rigs from monocular video”. In: ACM
Transactions on Graphics (TOG) 35.3 (2016), pp. 1–15.
[47] Kyle Genova, Forrester Cole, Aaron Maschinot, Aaron Sarna, Daniel Vlasic, and
William T Freeman. “Unsupervised training for 3d morphable model regression”. In: Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 8377–8386.
[48] Abhijeet Ghosh, Graham Fyffe, Borom Tunwattanapong, Jay Busch, Xueming Yu, and
Paul Debevec. “Multiview face capture using polarized spherical gradient illumination”. In:
Proceedings of the 2011 SIGGRAPH Asia Conference. 2011, pp. 1–10.
[49] Abhijeet Ghosh, Graham Fyffe, Borom Tunwattanapong, Jay Busch, Xueming Yu, and
Paul E. Debevec. “Multiview face capture using polarized spherical gradient illumination”. In:
ACM Transactions on Graphics (TOG) 30 (2011), p. 129.
[50] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. “Rich feature hierarchies for
accurate object detection and semantic segmentation”. In: Proceedings of the IEEE conference on
computer vision and pattern recognition. 2014, pp. 580–587.
[51] Ioannis Gkioulekas, Anat Levin, and Todd Zickler. “An evaluation of computational imaging
techniques for heterogeneous inverse scattering”. In: European Conference on Computer Vision.
Springer. 2016, pp. 685–701.
[52] Ioannis Gkioulekas, Shuang Zhao, Kavita Bala, Todd Zickler, and Anat Levin. “Inverse volume
rendering with material dictionaries”. In: ACM Transactions on Graphics (TOG) 32.6 (2013), p. 162.
[53] Álvaro González. “Measurement of areas on a sphere using Fibonacci and latitude–longitude
lattices”. In: Mathematical Geosciences 42.1 (2010), p. 49.
[54] Paul Graham, Borom Tunwattanapong, Jay Busch, Xueming Yu, Andrew Jones, Paul Debevec,
and Abhijeet Ghosh. “Measurement-based synthesis of facial microgeometry”. In: Computer
Graphics Forum. Vol. 32. 2pt3. Wiley Online Library. 2013, pp. 335–344.
[55] Karol Gregor and Yann LeCun. “Learning fast approximations of sparse coding”. In: Proceedings of
the 27th international conference on international conference on machine learning. 2010,
pp. 399–406.
[56] Thibault Groueix, Matthew Fisher, Vladimir G. Kim, Bryan Russell, and Mathieu Aubry.
“AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation”. In: Proceedings IEEE
Conf. on Computer Vision and Pattern Recognition (CVPR). 2018.
[57] Thibault Groueix, Matthew Fisher, Vladimir G. Kim, Bryan C. Russell, and Mathieu Aubry.
“AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation”. In: computer vision
and pattern recognition (2018).
[58] Xiaodong Gu, Zhiwen Fan, Siyu Zhu, Zuozhuo Dai, Feitong Tan, and Ping Tan. “Cascade cost
volume for high-resolution multi-view stereo and stereo matching”. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 2495–2504.
[59] Erwan Guillou, Daniel Meneveaux, Eric Maisel, and Kadi Bouatouch. “Using vanishing points for
camera calibration and coarse 3D reconstruction from a single image”. In: The Visual Computer
16.7 (2000), pp. 396–410.
[60] Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision. Cambridge
university press, 2003.
[61] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep residual learning for image
recognition”. In: Proceedings of the IEEE conference on computer vision and pattern recognition.
2016, pp. 770–778.
[62] Varsha Hedau, Derek Hoiem, and David Forsyth. “Recovering the spatial layout of cluttered
rooms”. In: 2009 IEEE 12th international conference on computer vision. IEEE. 2009, pp. 1849–1856.
[63] Paul Henderson and Vittorio Ferrari. “Learning to Generate and Reconstruct 3D Meshes with
only 2D Supervision”. In: British Machine Vision Conference (BMVC). 2018.
[64] Po-Han Huang, Kevin Matzen, Johannes Kopf, Narendra Ahuja, and Jia-Bin Huang. “Deepmvs:
Learning multi-view stereopsis”. In: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition. 2018, pp. 2821–2830.
[65] Zeng Huang, Tianye Li, Weikai Chen, Yajie Zhao, Jun Xing, Chloe LeGendre, Linjie Luo,
Chongyang Ma, and Hao Li. “Deep volumetric video from very sparse multi-view performance
capture”. In: European Conference on Computer Vision. Springer. 2018, pp. 351–369.
[66] Loc Huynh, Weikai Chen, Shunsuke Saito, Jun Xing, Koki Nagano, Andrew Jones, Paul Debevec,
and Hao Li. “Mesoscopic facial geometry inference using deep neural networks”. In: Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 8407–8416.
[67] Sunghoon Im, Hae-Gon Jeon, Stephen Lin, and In So Kweon. “DPSNet: End-to-end Deep Plane
Sweep Stereo”. In: International Conference on Learning Representations. 2018.
[68] Eldar Insafutdinov and Alexey Dosovitskiy. “Unsupervised learning of shape and pose with
differentiable point clouds”. In: Advances in Neural Information Processing Systems. 2018,
pp. 2802–2812.
[69] Jon Jansen and Louis Bavoil. “Fourier opacity mapping”. In: Proceedings of the 2010 ACM
SIGGRAPH symposium on Interactive 3D Graphics and Games. 2010, pp. 165–172.
[70] Angjoo Kanazawa, Michael J Black, David W Jacobs, and Jitendra Malik. “End-to-end recovery of
human shape and pose”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. 2018, pp. 7122–7131.
[71] Angjoo Kanazawa, Shubham Tulsiani, Alexei A Efros, and Jitendra Malik. “Learning
Category-Specific Mesh Reconstruction from Image Collections”. In: arXiv preprint
arXiv:1803.07549 (2018).
[72] Sing Bing Kang, Richard Szeliski, and Jinxiang Chai. “Handling occlusions in dense multi-view
stereo”. In: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and
Pattern Recognition. CVPR 2001. Vol. 1. IEEE. 2001, pp. I–I.
Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. “Neural 3d mesh renderer”. In: Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 3907–3916.
[74] Alex Kendall, Matthew Grimes, and Roberto Cipolla. “Posenet: A convolutional network for
real-time 6-dof camera relocalization”. In: Proceedings of the IEEE international conference on
computer vision. 2015, pp. 2938–2946.
[75] Diederik P Kingma and Jimmy Ba. “Adam: A method for stochastic optimization”. In: arXiv
preprint arXiv:1412.6980 (2014).
[76] Nahum Kiryati, Yuval Eldar, and Alfred M Bruckstein. “A probabilistic Hough transform”. In:
Pattern recognition 24.4 (1991), pp. 303–316.
[77] Florian Kluger, Hanno Ackermann, Michael Ying Yang, and Bodo Rosenhahn. “Deep learning for
vanishing point detection using an inverse gnomonic projection”. In: German Conference on
Pattern Recognition. Springer. 2017, pp. 17–28.
[78] Florian Kluger, Eric Brachmann, Hanno Ackermann, Carsten Rother, Michael Ying Yang, and
Bodo Rosenhahn. “Consac: Robust multi-model fitting by conditional sample consensus”. In:
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020,
pp. 4634–4643.
[79] Andor Kollar. Realistic Human Eye. http://kollarandor.com/gallery/3d-human-eye/. Online;
Accessed: 2022-3-30. 2019.
[80] Jana Košecká and Wei Zhang. “Video compass”. In: European conference on computer vision.
Springer. 2002, pp. 476–490.
[81] Abhijit Kundu, Yin Li, and James M Rehg. “3D-RCNN: Instance-Level 3D Object Reconstruction
via Render-and-Compare”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. 2018, pp. 3559–3568.
[82] Hamid Laga, Laurent Valentin Jospin, Farid Boussaid, and Mohammed Bennamoun. “A survey on
deep learning techniques for stereo-based depth estimation”. In: IEEE Transactions on Pattern
Analysis and Machine Intelligence (2020).
[83] Alexandros Lattas, Stylianos Moschoglou, Baris Gecer, Stylianos Ploumpis,
Vasileios Triantafyllou, Abhijeet Ghosh, and Stefanos Zafeiriou. “AvatarMe: Realistically
Renderable 3D Facial Reconstruction ‘In-the-Wild’”. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition. 2020, pp. 760–769.
[84] Seokju Lee, Junsik Kim, Jae Shin Yoon, Seunghak Shin, Oleksandr Bailo, Namil Kim, Tae-Hee Lee,
Hyun Seok Hong, Seung-Hoon Han, and In So Kweon. “Vpgnet: Vanishing point guided network
for lane and road marking detection and recognition”. In: Proceedings of the IEEE international
conference on computer vision. 2017, pp. 1947–1955.
[85] Chloe LeGendre, Kalle Bladin, Bipin Kishore, Xinglei Ren, Xueming Yu, and Paul Debevec.
“Efficient multispectral facial capture with monochrome cameras”. In: Color and Imaging
Conference. 2018.
[86] Hendrik Lensch, Jan Kautz, Michael Goesele, Wolfgang Heidrich, and Hans-Peter Seidel.
“Image-based reconstruction of spatial appearance and geometric detail”. In: ACM Transactions on
Graphics (TOG) 22.2 (2003), pp. 234–257.
[87] Martin D Levine and Yingfeng Chris Yu. “State-of-the-art of 3D facial reconstruction methods for
face recognition based on a single 2D training image per person”. In: Pattern Recognition Letters
30.10 (2009), pp. 908–913.
[88] José Lezama, Rafael Grompone von Gioi, Gregory Randall, and Jean-Michel Morel. “Finding
vanishing points via point alignments in image primal and dual domains”. In: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition. 2014, pp. 509–515.
[89] Hao Li, Bart Adams, Leonidas J Guibas, and Mark Pauly. “Robust single-view geometry and
motion reconstruction”. In: ACM Transactions on Graphics (ToG) 28.5 (2009), pp. 1–10.
[90] Haoang Li, Ji Zhao, Jean-Charles Bazin, Wen Chen, Zhe Liu, and Yun-Hui Liu. “Quasi-Globally
Optimal and Efficient Vanishing Point Estimation in Manhattan World”. In: Proceedings of the
IEEE International Conference on Computer Vision. 2019, pp. 1646–1654.
[91] Jiaman Li, Zhengfei Kuang, Yajie Zhao, Mingming He, Karl Bladin, and Hao Li. “Dynamic facial
asset and rig generation from a single scan.” In: ACM Trans. Graph. 39.6 (2020), pp. 215–1.
[92] Ruilong Li, Karl Bladin, Yajie Zhao, Chinmay Chinara, Owen Ingraham, Pengda Xiang,
Xinglei Ren, Pratusha Prasad, Bipin Kishore, Jun Xing, et al. “Learning formation of
physically-based face attributes”. In: Proceedings of the IEEE/CVF conference on computer vision
and pattern recognition. 2020, pp. 3410–3419.
[93] Tianye Li, Shichen Liu, Timo Bolkart, Jiayi Liu, Hao Li, and Yajie Zhao. “Topologically Consistent
Multi-View Face Inference Using Volumetric Sampling”. In: Proceedings of the IEEE/CVF
International Conference on Computer Vision. 2021, pp. 3824–3834.
[94] Tzu-Mao Li, Miika Aittala, Frédo Durand, and Jaakko Lehtinen. “Differentiable Monte Carlo Ray
Tracing through Edge Sampling”. In: ACM Trans. Graph. (Proc. SIGGRAPH Asia) 37.6 (2018),
222:1–222:11.
[95] Yiyi Liao, Simon Donné, and Andreas Geiger. “Deep marching cubes: Learning explicit surface
representations”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. 2018, pp. 2916–2925.
[96] Chen-Hsuan Lin, Chen Kong, and Simon Lucey. “Learning efficient point cloud generation for
dense 3D object reconstruction”. In: Thirty-Second AAAI Conference on Artificial Intelligence . 2018.
[97] Fayao Liu, Chunhua Shen, Guosheng Lin, and Ian D Reid. “Learning Depth from Single
Monocular Images Using Deep Convolutional Neural Fields.” In: IEEE Trans. Pattern Anal. Mach.
Intell. 38.10 (2016), pp. 2024–2039.
[98] Feng Liu, Dan Zeng, Qijun Zhao, and Xiaoming Liu. “Joint face alignment and 3d face
reconstruction”. In: European Conference on Computer Vision. Springer. 2016, pp. 545–560.
[99] Guilin Liu, Duygu Ceylan, Ersin Yumer, Jimei Yang, and Jyh-Ming Lien. “Material editing using a
physically based rendering network”. In: Computer Vision (ICCV), 2017 IEEE International
Conference on. IEEE. 2017, pp. 2280–2288.
[100] Jingchen Liu and Yanxi Liu. “Local regularity-driven city-scale facade detection from aerial
images”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014,
pp. 3778–3785.
[101] Shichen Liu, Tianye Li, Weikai Chen, and Hao Li. “Soft Rasterizer: A Differentiable Renderer for
Image-based 3D Reasoning”. In: IEEE International Conference on Computer Vision (ICCV). 2019.
[102] Shichen Liu, Yichao Zhou, and Yajie Zhao. “Vapid: A rapid vanishing point detector via learned
optimizers”. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021,
pp. 12859–12868.
[103] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black.
“SMPL: A skinned multi-person linear model”. In: ACM Transactions on Graphics (TOG) 34.6 (2015),
p. 248.
[104] Matthew M Loper and Michael J Black. “OpenDR: An approximate differentiable renderer”. In:
European Conference on Computer Vision. Springer. 2014, pp. 154–169.
[105] Guillaume Loubet, Nicolas Holzschuch, and Wenzel Jakob. “Reparameterizing discontinuous
integrands for differentiable rendering”. In: ACM Transactions on Graphics (TOG) 38.6 (2019),
pp. 1–14.
[106] Wan-Chun Ma, Tim Hawkins, Pieter Peers, Charles-Félix Chabert, Malte Weiss, and
Paul E. Debevec. “Rapid Acquisition of Specular and Diffuse Normal Maps from Polarized
Spherical Gradient Illumination”. In: Rendering Techniques. 2007.
[107] Wan-Chun Ma, Andrew Jones, Tim Hawkins, Jen-Yuan Chiang, and Paul Debevec. “A
high-resolution geometry capture system for facial performance”. In: ACM SIGGRAPH 2008 talks.
2008, pp. 1–1.
[108] Luca Magri and Andrea Fusiello. “Fitting multiple heterogeneous models by multi-class cascaded
t-linkage”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
2019, pp. 7460–7468.
[109] Luca Magri and Andrea Fusiello. “T-Linkage: A continuous relaxation of J-Linkage for
multi-model fitting”. In: Proceedings of the IEEE conference on computer vision and pattern
recognition. 2014, pp. 3954–3961.
[110] Vikash K Mansinghka, Tejas D Kulkarni, Yura N Perov, and Josh Tenenbaum. “Approximate
bayesian image interpretation using generative probabilistic graphics programs”. In: Advances in
Neural Information Processing Systems. 2013, pp. 1520–1528.
[111] Iacopo Masi, Stephen Rawls, Gérard Medioni, and Prem Natarajan. “Pose-aware face recognition
in the wild”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016,
pp. 4838–4846.
[112] Daniel Maturana and Sebastian Scherer. “Voxnet: A 3d convolutional neural network for
real-time object recognition”. In: 2015 IEEE/RSJ International Conference on Intelligent Robots and
Systems (IROS). IEEE. 2015, pp. 922–928.
[113] Wojciech Matusik, Chris Buehler, Ramesh Raskar, Steven J Gortler, and Leonard McMillan.
“Image-based visual hulls”. In: Proceedings of the 27th annual conference on Computer graphics and
interactive techniques. ACM Press/Addison-Wesley Publishing Co. 2000, pp. 369–374.
[114] Marilena Maule, João Comba, Rafael Torchelsen, and Rui Bastos. “Hybrid transparency”. In:
Proceedings of the ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games. 2013,
pp. 103–118.
[115] Morgan McGuire and Louis Bavoil. “Weighted blended order-independent transparency”. In:
Journal of Computer Graphics Techniques (2013).
[116] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger.
“Occupancy Networks: Learning 3D Reconstruction in Function Space”. In: Proceedings of IEEE
Conference on Computer Vision and Pattern Recognition (CVPR) (2019).
[117] Houman Meshkin. “Sort-independent alpha blending”. In: GDC Talk (2007).
[118] Mateusz Michalkiewicz, Jhony K Pontes, Dominic Jack, Mahsa Baktashmotlagh, and
Anders Eriksson. “Deep Level Sets: Implicit Surface Representations for 3D Shape Inference”. In:
arXiv preprint arXiv:1901.06802 (2019).
[119] Faraz M Mirzaei and Stergios I Roumeliotis. “Optimal estimation of vanishing points in a
manhattan world”. In: 2011 International Conference on Computer Vision. IEEE. 2011,
pp. 2454–2461.
[120] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. “Instant Neural Graphics
Primitives with a Multiresolution Hash Encoding”. In: ACM Trans. Graph. 41.4 (July 2022),
102:1–102:15. doi: 10.1145/3528223.3530127.
[121] Kevin Myers and Louis Bavoil. “Stencil routed A-buffer.” In: SIGGRAPH Sketches. 2007, p. 21.
[122] Oliver Nalbach, Elena Arabadzhiyska, Dushyant Mehta, H-P Seidel, and Tobias Ritschel. “Deep
shading: convolutional neural networks for screen space shading”. In: Computer graphics forum.
Vol. 36. 4. Wiley Online Library. 2017, pp. 65–78.
[123] Alejandro Newell, Kaiyu Yang, and Jia Deng. “Stacked hourglass networks for human pose
estimation”. In: European conference on computer vision. Springer. 2016, pp. 483–499.
[124] Thu Nguyen-Phuoc, Chuan Li, Stephen Balaban, and Yongliang Yang. “RenderNet: A deep
convolutional network for differentiable rendering from 3D shapes”. In: arXiv preprint
arXiv:1806.06575 (2018).
[125] Merlin Nimier-David, Delio Vicini, Tizian Zeltner, and Wenzel Jakob. “Mitsuba 2: a retargetable
forward and inverse renderer”. In: ACM Transactions on Graphics (TOG) 38.6 (2019), p. 203.
[126] Junyi Pan, Xiaoguang Han, Weikai Chen, Jiapeng Tang, and Kui Jia. “Deep Mesh Reconstruction
From Single RGB Images via Topology Modification Networks”. In: The IEEE International
Conference on Computer Vision (ICCV). Oct. 2019.
[127] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove.
“DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation”. In:
Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019).
[128] Georgios Pavlakos, Xiaowei Zhou, Konstantinos G Derpanis, and Kostas Daniilidis.
“Coarse-to-fine volumetric prediction for single-image 3D human pose”. In: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 7025–7034.
[129] Xiaojuan Qi, Renjie Liao, Zhengzhe Liu, Raquel Urtasun, and Jiaya Jia. “GeoNet: Geometric
Neural Network for Joint Depth and Surface Normal Estimation”. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. 2018, pp. 283–291.
[130] Long Quan and Roger Mohr. “Determining perspective structures using hierarchical Hough
transform”. In: Pattern Recognition Letters 9.4 (1989), pp. 279–286.
[131] Srikumar Ramalingam and Matthew Brand. “Lifting 3d manhattan lines from a single image”. In:
Proceedings of the IEEE International Conference on Computer Vision. 2013, pp. 497–504.
[132] Danilo Jimenez Rezende, SM Ali Eslami, Shakir Mohamed, Peter Battaglia, Max Jaderberg, and
Nicolas Heess. “Unsupervised learning of 3d structure from images”. In: Advances in Neural
Information Processing Systems. 2016, pp. 4996–5004.
[133] Helge Rhodin, Nadia Robertini, Christian Richardt, Hans-Peter Seidel, and Christian Theobalt. “A
Versatile Scene Model with Differentiable Visibility Applied to Generative Pose Estimation”. In:
Proceedings of the 2015 International Conference on Computer Vision (ICCV 2015). 2015. url:
http://gvv.mpi-inf.mpg.de/projects/DiffVis.
[134] Elad Richardson, Matan Sela, Roy Or-El, and Ron Kimmel. “Learning detailed face reconstruction
from a single image”. In: Proceedings of the IEEE conference on computer vision and pattern
recognition. 2017, pp. 1259–1268.
[135] Elad Richardson, Matan Sela, and Ron Kimmel. “3D face reconstruction by learning from synthetic
data”. In: 2016 fourth international conference on 3D vision (3DV). IEEE. 2016, pp. 460–469.
[136] Soubhik Sanyal, Timo Bolkart, Haiwen Feng, and Michael J Black. “Learning to regress 3D face
shape and expression from an image without 3D supervision”. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition. 2019, pp. 7763–7772.
[137] Grant Schindler and Frank Dellaert. “Atlanta world: An expectation maximization framework for
simultaneous low-level edge grouping and camera calibration in complex man-made
environments”. In: Proceedings of CVPR. Vol. 1. IEEE. 2004.
[138] Johannes L Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. “Pixelwise view
selection for unstructured multi-view stereo”. In: European Conference on Computer Vision.
Springer. 2016, pp. 501–518.
[139] Chen Shen, James F O’Brien, and Jonathan R Shewchuk. “Interpolating and approximating
implicit surfaces from polygon soup”. In: ACM Siggraph 2005 Courses. ACM. 2005, p. 204.
[140] Jefferey A Shufelt. “Performance evaluation and analysis of vanishing point detection techniques”.
In: IEEE transactions on pattern analysis and machine intelligence 21.3 (1999), pp. 282–288.
[141] Christian Sigg. “Representation and rendering of implicit surfaces”. PhD thesis. ETH Zurich, 2006.
[142] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. “Indoor segmentation and
support inference from rgbd images”. In: European conference on computer vision. Springer. 2012,
pp. 746–760.
[143] Gilles Simon, Antoine Fond, and Marie-Odile Berger. “A-contrario horizon-first vanishing point
detection using second-order grouping laws”. In: Proceedings of the European Conference on
Computer Vision (ECCV). 2018, pp. 318–333.
[144] Olga Sorkine, Daniel Cohen-Or, Yaron Lipman, Marc Alexa, Christian Rössl, and H-P Seidel.
“Laplacian surface editing”. In: Proceedings of the 2004 Eurographics/ACM SIGGRAPH symposium
on Geometry processing. 2004, pp. 175–184.
[145] Christoph Strecha, Rik Fransens, and Luc Van Gool. “Combined depth and outlier estimation in
multi-view stereo”. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR’06). Vol. 2. IEEE. 2006, pp. 2394–2401.
[146] Jean-Philippe Tardif. “Non-iterative approach for fast and accurate vanishing point detection”. In:
2009 IEEE 12th International Conference on Computer Vision. IEEE. 2009, pp. 1250–1257.
[147] Zachary Teed and Jia Deng. “Raft: Recurrent all-pairs field transforms for optical flow”. In:
European conference on computer vision. Springer. 2020, pp. 402–419.
[148] Ayush Tewari, Michael Zollhofer, Hyeongwoo Kim, Pablo Garrido, Florian Bernard,
Patrick Perez, and Christian Theobalt. “Mofa: Model-based deep convolutional face autoencoder
for unsupervised monocular reconstruction”. In: Proceedings of the IEEE International Conference
on Computer Vision Workshops. 2017, pp. 1274–1283.
[149] Ayush Tewari, Michael Zollhöfer, Pablo Garrido, Florian Bernard, Hyeongwoo Kim,
Patrick Pérez, and Christian Theobalt. “Self-supervised multi-level face model learning for
monocular reconstruction at over 250 hz”. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition. 2018, pp. 2549–2559.
[150] Justus Thies, Michael Zollhofer, Marc Stamminger, Christian Theobalt, and Matthias Nießner.
“Face2face: Real-time face capture and reenactment of rgb videos”. In: Proceedings of the IEEE
conference on computer vision and pattern recognition. 2016, pp. 2387–2395.
[151] Luan Tran and Xiaoming Liu. “Nonlinear 3d face morphable model”. In: Proceedings of the IEEE
conference on computer vision and pattern recognition. 2018, pp. 7346–7355.
[152] Triplegangers. Triplegangers Face Models. https://triplegangers.com/. Online; Accessed:
2021-12-05. 2021.
[153] Shubham Tulsiani and Jitendra Malik. “Viewpoints and keypoints”. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. 2015, pp. 1510–1519.
[154] Shubham Tulsiani, Tinghui Zhou, Alexei A Efros, and Jitendra Malik. “Multi-view supervision for
single-view reconstruction via differentiable ray consistency”. In: Proceedings of the IEEE
conference on computer vision and pattern recognition. 2017, pp. 2626–2634.
[155] Laurens Van der Maaten and Geoffrey Hinton. “Visualizing data using t-SNE.” In: Journal of
machine learning research 9.11 (2008).
[156] Etienne Vincent and Robert Laganiére. “Detecting planar homographies in an image pair”. In:
ISPA 2001. Proceedings of the 2nd International Symposium on Image and Signal Processing and
Analysis. In conjunction with 23rd International Conference on Information Technology Interfaces
(IEEE Cat.). IEEE. 2001, pp. 182–187.
[157] Rafael Grompone Von Gioi, Jeremie Jakubowicz, Jean-Michel Morel, and Gregory Randall. “LSD:
A fast line segment detector with a false detection control”. In: IEEE transactions on pattern
analysis and machine intelligence 32.4 (2008), pp. 722–732.
[158] Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. “Pixel2Mesh:
Generating 3D Mesh Models from Single RGB Images”. In: ECCV. 2018.
[159] Rui Wang, David Geraghty, Kevin Matzen, Richard Szeliski, and Jan-Michael Frahm. “VPLNet:
Deep Single View Normal Estimation With Vanishing Points and Lines”. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 689–698.
[160] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro.
“High-resolution image synthesis and semantic manipulation with conditional gans”. In:
Proceedings of the IEEE conference on computer vision and pattern recognition. 2018, pp. 8798–8807.
[161] Xiaolong Wang, David Fouhey, and Abhinav Gupta. “Designing deep networks for surface
normal estimation”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. 2015, pp. 539–547.
[162] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and
Chen Change Loy. “ESRGAN: Enhanced super-resolution generative adversarial networks”. In:
The European Conference on Computer Vision Workshops (ECCVW). Sept. 2018.
[163] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. “Convolutional pose
machines”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
2016, pp. 4724–4732.
[164] Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox. “PoseCNN: A Convolutional
Neural Network for 6D Object Pose Estimation in Cluttered Scenes”. In: 2018.
[165] Yiliang Xu, Sangmin Oh, and Anthony Hoogs. “A minimum error vanishing point detection
approach for uncalibrated monocular images of man-made environments”. In: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition. 2013, pp. 1376–1383.
[166] Shugo Yamaguchi, Shunsuke Saito, Koki Nagano, Yajie Zhao, Weikai Chen, Kyle Olszewski,
Shigeo Morishima, and Hao Li. “High-fidelity facial reflectance and geometry inference from an
unconstrained image”. In: ACM Transactions on Graphics (TOG) 37.4 (2018), pp. 1–14.
[167] Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. “Perspective transformer
nets: Learning single-view 3d object reconstruction without 3d supervision”. In: Advances in
Neural Information Processing Systems. 2016, pp. 1696–1704.
[168] Haotian Yang, Hao Zhu, Yanru Wang, Mingkai Huang, Qiu Shen, Ruigang Yang, and Xun Cao.
“Facescape: a large-scale high quality 3d face dataset and detailed riggable 3d face prediction”. In:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020,
pp. 601–610.
[169] Yaoqing Yang, Chen Feng, Yiru Shen, and Dong Tian. “Foldingnet: Point cloud auto-encoder via
deep grid deformation”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. 2018, pp. 206–215.
[170] Yao Yao, Zixin Luo, Shiwei Li, Tianwei Shen, Tian Fang, and Long Quan. “Recurrent mvsnet for
high-resolution multi-view stereo depth inference”. In: Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition. 2019, pp. 5525–5534.
[171] Menghua Zhai, Scott Workman, and Nathan Jacobs. “Detecting vanishing points using global
image context in a non-manhattan world”. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition. 2016, pp. 5657–5665.
[172] Chao Zhang, William AP Smith, Arnaud Dessein, Nick Pears, and Hang Dai. “Functional faces:
Groupwise dense correspondence using functional maps”. In: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition. 2016, pp. 5033–5041.
[173] Cheng Zhang, Lifan Wu, Changxi Zheng, Ioannis Gkioulekas, Ravi Ramamoorthi, and
Shuang Zhao. “A Differential Theory of Radiative Transfer”. In: ACM Trans. Graph. 38.6 (2019).
[174] Ruo Zhang, Ping-Sing Tsai, James Edwin Cryer, and Mubarak Shah. “Shape-from-shading: a
survey”. In: IEEE transactions on pattern analysis and machine intelligence 21.8 (1999), pp. 690–706.
[175] Xiaodan Zhang, Xinbo Gao, Wen Lu, Lihuo He, and Qi Liu. “Dominant vanishing point detection
in the wild with application in composition analysis”. In: Neurocomputing 311 (2018), pp. 260–269.
[176] Huizhong Zhou, Danping Zou, Ling Pei, Rendong Ying, Peilin Liu, and Wenxian Yu.
“StructSLAM: Visual SLAM with building structure lines”. In: IEEE Transactions on Vehicular
Technology 64.4 (2015), pp. 1364–1375.
[177] Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Malik, and Alexei A Efros. “View synthesis
by appearance flow”. In: European Conference on Computer Vision. Springer, 2016, pp. 286–301.
[178] Yichao Zhou, Jingwei Huang, Xili Dai, Linjie Luo, Zhili Chen, and Yi Ma. “HoliCity: A City-Scale
Data Platform for Learning Holistic 3D Structures”. arXiv:2008.03286 [cs.CV]. 2020.
[179] Yichao Zhou, Haozhi Qi, Jingwei Huang, and Yi Ma. “NeurVPS: Neural Vanishing Point Scanning
via Conic Convolution”. In: Advances in Neural Information Processing Systems. 2019, pp. 866–875.
[180] Yichao Zhou, Haozhi Qi, Yuexiang Zhai, Qi Sun, Zhili Chen, Li-Yi Wei, and Yi Ma. “Learning to
reconstruct 3D Manhattan wireframes from a single image”. In: Proceedings of the IEEE
International Conference on Computer Vision. 2019, pp. 7698–7707.
[181] Zihan Zhou, Farshid Farhat, and James Z Wang. “Detecting dominant vanishing points in natural
scenes with application to composition-sensitive image retrieval”. In: IEEE Transactions on
Multimedia 19.12 (2017), pp. 2651–2665.
[182] Jacek Zienkiewicz, Andrew Davison, and Stefan Leutenegger. “Real-time height map fusion using
differentiable rendering”. In: 2016 IEEE/RSJ International Conference on Intelligent Robots and
Systems (IROS). IEEE, 2016, pp. 4280–4287.