Single-Image Geometry Estimation
for Various Real-World Domains
by
Cho-Ying Wu
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
May 2023
Copyright 2023 Cho-Ying Wu
Acknowledgements
First and foremost, I am sincerely grateful to Prof. Ulrich Neumann for his support and care
during my Ph.D. journey. His insightful opinions always guide me and inspire me to complete
every research work in my studies. He always emphasizes working on fundamental and practical
problems, which leads me to think about the usefulness of each research work. It was my great
honor to be advised by him at USC.
I want to thank my co-authors, Qiangeng Xu, Yiqi Zhong, Junying Wang, Chin-Cheng Hsu,
Jialiang Wang, Shuochen Su, Michael Hall, Jingjing Zheng, Jim Thomas, Cheng-Hao Kuo, Xiaoyan
Hu, Michael Happold, Riccardo de Lutio, Zan Gojcic, Ling Huan, Or Litany, and Sanja Fidler,
whom I appreciate working with. I want to thank my labmates at USC, Qiangeng Xu, Yiqi Zhong,
Weiyue Wang, Junying Wang, Quankai Gao, Tianye Li, Mianlun Zheng, Danyong Zhao, and Bohan
Wang, for all your friendship and support. Special thanks to Te-Lin Wu and Chin-Cheng Hsu for
their valuable discussions on various interesting research problems. I want to thank Meta Inc.,
Amazon Inc., Nvidia Inc., and Argo AI LLC for providing valuable internship experiences. I want
to thank Prof. Jian-Jiun Ding from National Taiwan University for leading me into the realm of
computer vision and machine learning during my master's studies. I want to thank my family for their support,
care, and encouragement along my Ph.D. journey.
Last but not least, I would like to thank Prof. C. -C. Jay Kuo, Prof. Andrew Nealen, Prof.
Laurent Itti, Prof. Ram Nevatia, and Prof. Antonio Ortega for being my qualification exam or
dissertation defense committee members. I appreciate their insightful and helpful feedback.
Table of Contents
Acknowledgements ii
List of Tables v
List of Figures viii
Abstract xii
Chapter 1: Overview: Geometry Estimation from Images 1
Chapter 2: Human Faces: Geometry from Images 3
2.1 3D Morphable Models (3DMM) . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 SynergyNet: Synergy between 3D landmarks and 3DMM parameters . . . 6
2.1.2 From 3DMM to Refined 3D Landmarks . . . . . . . . . . . . . . . . . . . 6
2.1.3 From Refined Landmarks to 3DMM . . . . . . . . . . . . . . . . . . . . . 8
2.1.4 Representation Cycle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Chapter 3: Human Faces: Geometry from Voices 12
3.1 Cross-Modal Perceptionist . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 Supervised Learning with Voice/Mesh Pairs . . . . . . . . . . . . . . . . . . . . 15
3.3 Unsupervised Learning with KD . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.4.1 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.4.2 Subjective Evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Chapter 4: Outdoor Driving Scenes: A Sparse-to-Dense Approach 27
4.1 Scene Completeness-Aware Depth Completion . . . . . . . . . . . . . . . . . . . 29
4.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2.1 Outdoor RGBD Semantic Segmentation . . . . . . . . . . . . . . . . . . . 33
4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Chapter 5: Indoor Scenes: Practical Indoor Depth Estimation 34
5.1 DistDepth: Structure Distillation from Expert . . . . . . . . . . . . . . . . . . . . 36
5.2 Experiments and Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.2.1 Experiments on synthetic data . . . . . . . . . . . . . . . . . . . . . . . . 40
5.2.2 Experiments on real data . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Chapter 6: Meta-Learning for Single-Image Depth Prediction: A Data-Efficient Learning
Approach 46
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6.2 Meta-Learning Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
6.3 Fine-Grained Task Meta-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 50
6.3.1 Difficulty in Accurate Depth Estimation . . . . . . . . . . . . . . . . . . . 50
6.3.2 Single RGB-D Pair as Fine-Grained Task . . . . . . . . . . . . . . . . . . 51
6.3.3 Meta-Initialization on Depth from Single Image . . . . . . . . . . . . . . . 52
6.3.4 Strategy and Explanation . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.4 Experiments and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
6.4.1 Meta-Learning on Limited Scene Variety . . . . . . . . . . . . . . . . . . 58
6.4.2 Meta-Initialization vs. ImageNet-Initialization . . . . . . . . . . . . . . . 59
6.4.3 Zero-Shot Cross-Dataset Evaluation . . . . . . . . . . . . . . . . . . . . . 60
6.4.4 Depth Supervision in NeRF . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.4.5 How is fine-grained task related to other meta-learning studies? . . . . . . . 62
6.5 Campus Data with Meta-Labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.5.1 Analysis and Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Chapter 7: Conclusion 81
Bibliography 82
List of Tables
2.1 Benchmark on AFLW2000-3D for facial alignment. The original annotation
version is used. Our performance is the best with a gap over others on large poses. . 10
2.2 3D face modeling comparison on AFLW2000-3D. . . . . . . . . . . . . . . . . . 11
3.1 ARE metric study. Compared with baseline, results from CMP show that
cross-modal joint training with voice input can obtain around 20% improvements.
We also highlight the largest improvement, ER, which answers Q4. . . . . . . . . . 24
4.1 Evaluation on KITTI Depth Completion val set. . . . . . . . . . . . . . . . . . 32
4.2 Comparison on KITTI Semantic Segmentation Dataset. Our depth could
enhance SSMA performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.1 Quantitative comparison on the VA dataset. Our DistDepth attains much lower
errors than prior works of left-right consistency. DistDepth-M further uses the
test-time multi-frame strategy in ManyDepth. See the main text. . . . . . . . . . . 39
5.2 Study on the choice of the expert network for distillation. Different versions
of DPT [90] that vary in network sizes (# of params) are adopted as the expert to
teach the student. DPT-legacy localizes occluding contours better and leads to a
better-performing student network. The results of supervised learning are provided
as a reference. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.3 Evaluation on NYUv2. Sup: ✓ - supervised learning using groundtruth depth, ✗ -
not using groundtruth depth, and △ - semi-supervised learning (we use the expert
finetuned on NYUv2, where we have indirect access to the groundtruth). We
achieve the best results among all self-supervised methods, and our semi-supervised
and self-supervised finetuned on NYUv2 even outperform many supervised
methods. The last two rows show results without groundtruth supervision and
without training on NYUv2. In this challenging zero-shot cross-dataset evaluation,
we still achieve comparable performances to many methods trained on NYUv2.
Error and accuracy (yellow/green) metrics are reported. . . . . . . . . . . . . . . . 45
6.1 Generalizability with different scene variety. We compare single-stage
meta-learning (only prior learning) and supervised learning. ConvNeXt-Base
backbone is used. a→ b means training on a- and testing on b-dataset. Replica and
HM3D respectively hold lower and higher scene variety for training. Meta-Learning
has much larger improvements especially trained on low scene-variety Replica. . . 56
6.2 Effects of Meta-Initialization on intra-dataset evaluation. We train and test
meta-initialization (full Algorithm 1) on the same dataset. Hypersim and NYUv2
of higher scene variety are used. Using the same architecture, meta-initialization
(+Meta) consistently outperforms ImageNet-initialization (no marks). Both error
(in orange) and accuracy (in green) are reported. . . . . . . . . . . . . . . . . . . . 58
6.3 Zero-Shot cross-dataset evaluation using meta-initialization (Algorithm
1). Comparison is drawn between without meta-initialization (no marks,
ImageNet-initialization) and with our meta-initialization (Meta) using different
sizes of ConvNeXt. Results of "+Meta" are consistently better. . . . . . . . . . . . . 64
6.4 Zero-Shot cross-dataset evaluation on dedicated depth estimation architecture.
Plugging our meta-initialization (Algorithm 1) into those frameworks can stably
improve results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.5 More results on depth-supervised NeRF. We test on Replica 'room-0', 'room-1',
'room-2', 'office-0', 'office-1', and 'office-2' environments. We train a NeRF on
each environment with 180 views. The comparison between using depth from
meta-initialization and w/o meta-initialization for supervision is drawn. PSNR and
SSIM are image quality metrics; the higher, the better. . . . . . . . . . . . . . . . . 65
6.6 Results of pretrained SOTA methods on Campus Test. The best number is in
bold, and the second-best is underlined. . . . . . . . . . . . . . . . . . . . . . . . 67
6.7 Results on space categories. The model is MIM-SwinTransformer pretrained on
NYUv2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.8 Self-Supervised DistDepth performance trained on SimSIN. We use the
pretrained ResNet152 from [123]. . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.9 Self-Supervised DistDepth performance trained on UniSIN. We use the
pretrained ResNet152 from [123]. . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.10 Supervised learning performance trained on Hypersim. Results across space
types are shown. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.11 Supervised learning performance trained on NYUv2. Results across space types
are shown. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.12 Comparison between meta-learning (Meta) and supervised learning (SL),
trained on NYUv2. We adopt ConvNeXt-small (conv-sml) and ConvNeXt-base
(conv-b) as backbones. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.13 Comparison between meta-learning (Meta) and supervised learning (SL),
trained on Hypersim. We adopt ConvNeXt-small (conv-sml) and ConvNeXt-base
(conv-b) as backbones. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.14 Statistics for supervised learning (SL) vs. meta-learning (Meta). The mean and
standard deviation (STD) are computed across categories. Models are trained on
Hypersim. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.15 Results of training and testing on Campus Train/ Test. ConvNeXt-small is used. 72
6.16 Comparison between with and without class weights. ConvNeXt-small
(Conv-sml) and ConvNeXt-base (Conv-b) are used. . . . . . . . . . . . . . . . . . 73
6.17 Statistics for supervised learning vs. meta-learning. Mean and standard
deviation are computed across categories. Models are trained on Campus Train. . . 73
6.18 Comparison between with and without using a balanced task-loader. . . . . . 74
6.19 Comparison between with and without meta-learning. Models are trained on
Campus Train and evaluated on NYUv2. . . . . . . . . . . . . . . . . . . . . . . . 75
6.20 Performance trained on different groups and tested on Campus Test. SL:
supervised learning; Meta: meta-learning. See the text for the definition of groups. 76
6.21 Performance trained on different meta-groups and tested on Campus Test.
Results are reported by ranges. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.22 Multiple training dataset experiment. Four settings, including using meta-
learning and using balanced task-loader, are examined. See text for the setting
details. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
List of Figures
2.1 Results from our SynergyNet with monocular image inputs. Note that 3D
landmarks can predict hidden face outlines in 3D rather than follow visible outlines
on images. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Framework of our SynergyNet. The backbone network learns to regress 3DMM
parameters (α_p, α_s, and α_e) and reconstruct 3D face meshes from monocular face
images. A self-constraining consistency is applied to 3DMM parameters regressed
from different sources. This synergy process includes a forward representation
direction, from 3DMM parameters to refined 3D landmarks, and a reverse direction,
from 3D landmarks back to 3DMM parameters. . . . . . . . . . . . . . . . . . . . 4
2.3 Illustration of representation cycle. . . . . . . . . . . . . . . . . . . . . . . . . 9
3.1 Cross-Modal Perceptionist. We study the correlations between voices and face
geometry under both supervised and unsupervised learning settings. This work
targets at more explainable human-centric cross-modal learning for biometric
applications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2 Supervised learning framework. Given a speech input, a voice embedding is
extracted by φ_v. φ_dec then estimates 3DMM parameters α for 3D face modeling.
The supervision is computed with the groundtruth α*. . . . . . . . . . . . . . . . . 15
3.3 Unsupervised learning with KD. The unsupervised framework contains a GAN
for face image synthesis with voice encoder φ_v, generator φ_g, discriminator φ_dis,
and classifier φ_c. Then, knowledge distillation is used to achieve unsupervised
learning. The 2D face is a latent representation in this fashion. . . . . . . . . . . . . 18
3.4 Distance illustration for our ARE metric. AB: ear-to-ear distance. CD: forehead
width. EF: outer-interocular distance. GH: midline distance. IJ: cheek-to-cheek
distance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.5 Evidence for positive response to Q1. Our unsupervised framework predicts
intermediate 2D images and 3D meshes. This answers Q1: 3D face models
exhibiting similar face shapes to the references can be predicted from only voice
inputs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.6 A collection of results supports our positive response to Q1. This figure extends
Fig. 3.5. Top to bottom for each two-row chunk: predicted intermediate face images,
predicted 3D models, real faces for references. . . . . . . . . . . . . . . . . . . . . 21
3.7 Illustration for our positive response to Q2. Consistent intermediate images and
3D faces can be predicted from the same speaker with different time-step utterances. 22
3.8 Shape variation statistics in response to Q2. Mean and std of per-vertex variation
w.r.t. the center frame are shown, calculated in frontal pose. 3D shapes recovered
from different utterances are consistent with only sub-pixel differences. . . . . . . 22
3.9 Comparison of intermediate images and meshes in response to Q3. The
cross-modal joint training strategy in our unsupervised CMP produces better-quality
images than the baseline. More reliable images as latent representations from our
CMP can facilitate mesh prediction. We include real faces for face shape references. 23
3.10 Results of subjective preference tests. The blue bars are the preference for
our method, while the red bars are the preference for the baseline method. The
percentages are labeled on the bar, and the total number of votes is enclosed in the
parentheses. The x-axis on the bottom labels the total number of responses, and
that on the top denotes the percentage. The p-values of the statistical significance
tests are provided under the bar. ∼ shows the value's order of magnitude. . . . . . 25
4.1 Comparison of depth from stereo matching network, depth completion
network, and our SCADC. Our results leverage advantages of both stereo
matching, which have more structured upper scenes, and lidars, which have more
precise depth measurements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.2 Network pipeline of our SCADC. . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.3 Depth completion comparison. Left: Qualitative results of stereo matching
(PSMNet), lidar completion (SSDC), and our SCADC on the KITTI Depth Completion
validation set. Right: Comparison on the KITTI Depth Completion test set. Results
of other works are taken directly from the KITTI website. Ours is the only one that reconstructs
upper scene structures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.4 Semantic segmentation results. SSMA with depth from our SCADC is used on
KITTI Semantic Segmentation dataset. . . . . . . . . . . . . . . . . . . . . . . . . 32
5.1 Advantages of our framework. (A) We attain zero-shot cross-dataset inference.
(B) Our framework trained on simulation data produces on-par results with the one
trained on real data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.2 DistDepth overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.3 Intra-/ Inter-Dataset inference. Prior self-supervised works can fit training data
(SimSIN), shown in the first row, but they generalize poorly to an unseen testing
dataset (VA), shown in the second and third rows. Our DistDepth can produce more
structured and accurate ranges. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.4 Qualitative results on a VA sequence. Depth and error maps are shown for
DistDepth and MonoDepth2 for comparison. These examples demonstrate that our
DistDepth predicts geometrically structured depth for common indoor objects. . . . 41
5.5 Results on Hypersim. Depth map and textured pointcloud comparison of
MonoDepth2 and our DistDepth. With structure distillation, DistDepth attains
better object structure predictions, such as tables and paintings on the wall shown
in (A) and much less distortion for the large bookcase in (B). . . . . . . . . . . . . 42
5.6 Qualitative study for depth-domain structure improvement. Two examples (A)
and (B) are shown to study the effects of distillation (dist) losses and turn-on level
α in spatial refinement to validate our design. . . . . . . . . . . . . . . . . . . . . 43
5.7 Comparison on UniSIN. Geometric shapes produced from DistDepth are better
than MonoDepth2. DistDepth concretely reduces the gap for sim-to-real: (3) and
(4) attain on-par results and sometimes training on simulation shows better structure
than training on real. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.8 Results on real data (UniSIN) using our DistDepth only trained on simulation
(SimSIN). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
6.1 Geometry structure comparison in 3D point cloud view. We back-project
the predicted depth maps from images into textured 3D point cloud to show the
geometry. The proposed Meta-Initialization has better domain generalizability that
leads to more accurate depth prediction hence better 3D structures. (zoom in for the
best view). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.2 Meta-Initialization for learning image-to-depth mappings. The prior learning
stage adopts a base-optimizer and a meta-optimizer. Inside each meta-iteration,
K fine-grained tasks are sampled and used to minimize regression loss. L steps
are taken by the base-optimizer to search for weight update directions for these
K tasks. Then, the meta-optimizer follows the explored inner trends to update
meta-parameters in the Reptile style [75]. The image-to-depth prior θ_prior is output
at the end of the stage. θ_prior is then used as the initialization for the subsequent
supervised learning for the final model θ*. . . . . . . . . . . . . . . . . . . . . . . 49
6.3 Fitting to training environments. var shows depth variance in the highlighted
regions. We show comparison of fitting to training environments between
pure meta-learning (Meta) and direct supervised learning (DSL) on limited
scene-variety dataset, Replica. Meta produces smooth and more precise depth.
Depth-irrelevant textures on planar regions can be resolved more correctly. In
contrast, DSL produces irregularities affected by local high-frequency details,
especially ResNet50. See Sec. 6.4.1 for details and 6.3.4 for the explanation. . . . 51
6.4 Loss curve for MAML v.s. Reptile. . . . . . . . . . . . . . . . . . . . . . . . . 53
6.5 Analysis on scene variety and model generalizability. (A) shows limited training
scenes constrain learning image-to-depth mappings, with an extreme case (A2)
for only one training image. (B) shows though a model (A4) fits well on training
scenes, it still cannot generalize to unseen scenes, especially wall paintings with
many depth-irrelevant cues. Meta-initialization attains better model generalizability.
See Sec. 6.3.4 for explanation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.6 Depth map qualitative comparison. Results of our meta-initialization have better
object shapes with clearer boundaries. Depth-irrelevant textures are suppressed,
and flat planes are predicted, as shown in Hypersim- Row 2 ceiling and 3 textured
wall examples. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.7 Image quality comparison for NeRF rendering. We show the quality metrics
(the higher the better) under each image. Zoom in for the best view. . . . . . . . . 61
6.8 Statistics of Campus Test and NYUv2-Depth. We map NYUv2’s spaces to our
defined Campus Data. One can see the distribution of NYU’s training data and
Campus Test data are distinct. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
6.9 Comparison for training on sub-groups. Meta-Learning shows better depth
accuracy of object outlines and shapes. . . . . . . . . . . . . . . . . . . . . . . . . 77
6.10 Comparison for training on multiple training datasets. Meta-Learning shows
better depth accuracy of object outlines and shapes. . . . . . . . . . . . . . . . . . 79
Abstract
Predicting geometry from images is a fundamental and popular task in computer vision and has
multiple applications. For example, predicting ranges from ego-view images can help robots
navigate through indoor spaces and avoid collisions. In addition to physical applications, one can
synthesize novel views from single images with the help of depth by warping pixels to different
camera positions. Further, one can fuse depth estimates from multiple views and create a complete
3D environment for AR/VR uses.
It is difficult for traditional methods to attain the goal of geometry prediction from single
images due to the domain gap between imagery and geometry. It is an intrinsically ill-posed
problem without multi-view constraints. Thanks to recent advances in deep learning and efficient
computation, various data-driven approaches emerge that learn to regress depth maps from images
as geometry representation. Recent high-performing research works are primarily based on deep
learning to achieve the goal of bridging the domain gap between imagery and geometry.
This dissertation includes several works on the topic of estimating geometry from single images
in various data domains, including human faces, outdoor driving, and indoor scenes. Difficulties
and our solutions to solve domain-specific problems are introduced in each chapter. Then, in
the last chapter, we go beyond solutions of network architecture and loss function developments
and discover a better learning strategy, meta-learning, to learn a higher-level representation. The
learned representation more accurately characterizes the depth domain. Our meta-learning
approach attains better performance without involving extra data or pretrained models, focusing
instead on the learning schedule. We then closely evaluate generalizability on our collected
Campus Data and demonstrate meta-learning's ability at the sub-, single-, and multi-dataset levels.
Chapter 1
Overview: Geometry Estimation from Images
Going beyond 2D computer vision, a fundamental task lying in 3D vision is reconstructing 3D
geometry from imagery. Geometry estimation helps understand the 3D world from images. Prior
published works cope with a variety of data domains: human data, outdoor driving scene data, and
indoor scene data. These data domains have broad applications, and different data types encounter
distinct difficulties in estimating geometry from images. For instance, variations for faces or human
bodies are subtle, and thus standard approaches attempt to reconstruct 3D faces in parametric
spaces associated with mesh templates. Scene data have much greater geometric changes, and
directly estimating geometry requires the help of further assumptions, inductive biases, or auxiliary
information. For indoor scene data, most scene images show ranges within 5 or 10 meters, which
poses an inductive bias on the range estimation so that models trained on indoor scenes would not
produce far range values. For outdoor driving data, scenes are associated with priors where the upper
parts are mostly sky and the lower parts are mostly road, while trees, buildings, and sidewalks
usually lie on the sides. These priors are beneficial for geometry estimation. Further, suppose sparse
depth measurements from sensors are available as auxiliary information. The sparse measurements
hint at how far the elements in the scene are; thus, they can be integrated into supervision terms and
enhance estimation precision.
The following chapters will elaborate on details of prior publications for geometry estimation
from single images for various data domains: human faces, outdoor driving scenes, and indoor
scenes. These domains are accompanied by wide and popular applications where geometry plays
crucial roles, such as AR/VR creation, robot navigation, human-robot interaction, and autonomous
driving. We will introduce each data domain along with its difficulty and further present our
solutions to address these problems.
Chapters 2 and 3 introduce reconstructing 3D face models from single images and from voices,
with images as a latent representation. 3D face reconstruction from images relies on similar facial
structures as a prior; thus, we can adopt a 3D template model as prior knowledge and only learn
the deformation. Reconstructing faces from voices is subtler. Intuitively, voice is correlated with
facial structure through the vocal apparatus. In that chapter, we attempt to answer how strong such a correlation is.
Chapter 4 introduces reconstructing scene depth for outdoor driving scenes using a sparse-to-dense
approach. We can more precisely estimate metric depth based on available sparse measurements
as clues. Further, we cope with the scene-completeness issue in regions where groundtruth points from
lidar sensors are not accessible. Chapter 5 focuses on indoor scenes and attempts to attain practical
indoor depth estimation with the advantages: learning without curated depth groundtruth, learning
from synthetic data, high generalization and accurate performance, and fast inference. Chapter 6
goes beyond advances in network architecture, loss, and data and further focuses on designing a
better learning strategy for performance gains. We focus on meta-learning and borrow its ability
for higher generalization and few-shot learning. We introduce a fine-grained task setting to address
the issue that little affinity exists among scene images. To closely evaluate generalization from
meta-learning, we collect the Campus Dataset with space-type and range labels. These meta-labels
provide a breakdown of how a trained model works across different environments. Further, we
find meta-learning can alleviate the imbalance in the training set for each scene type, and it works
from sub-dataset to multiple datasets.
This dissertation covers different data domains for learning geometry from images and different
development approaches, including neural network architectures, loss function design, and
learning paradigms and strategies.
Chapter 2
Human Faces: Geometry from Images
Faces are special targets for studying geometry estimation from images. Variations in
facial geometry are subtle, and thus a template face model can serve as a good prior that
penalizes unlikely face geometry. 3D Morphable Models (3DMM) [26] are widely used for 3D face
estimation. 3DMM creates a face mesh with a mean-face template and deforms this template
mesh with estimated identity and expression variation in parameter spaces. 3D face meshes can be
built from 3DMM, and other attributes such as face pose or 3D landmarks can be extracted from
the mesh. There are different tasks related to facial geometry prediction: 3D facial alignment, face
orientation estimation, and 3D face modeling. The first two aim at the registration of face images
for applications such as feature extraction from canonical poses or human engagement with face
orientation. The last focuses on the recovery of the holistic face mesh to create fine-grained face
models.
To learn 3DMM parameters, 3D landmarks are widely used to guide 3D facial geometry
learning. Previous works [39, 106, 148] only directly extract coarse landmarks from fitted 3D
faces and compute supervised alignment losses with groundtruth landmarks. These works utilize a
representation direction, from 1D parameters to 3D landmarks. However, though 3D landmarks
are very sparse (a 68-point definition is commonly used), they compactly and efficiently describe
facial outlines in 3D space. We think 3D landmarks can be further exploited to predict underlying
3D faces as supportive information. Hence, in addition to only going from 1D parameters to 3D
landmarks, we propose a further step to reversely regress 3DMM parameters from 3D landmarks
and establish a representation cycle. The advantage is that predicting a 3D face using 3DMM from
2D images is naturally an ill-posed problem, but prediction from 3D landmarks can alleviate the
intrinsic ill-posedness.

Figure 2.1: Results from our SynergyNet with monocular image inputs. Note that 3D landmarks
can predict hidden face outlines in 3D rather than follow visible outlines on images.

Figure 2.2: Framework of our SynergyNet. The backbone network learns to regress 3DMM parameters
(α_p, α_s, and α_e) and reconstruct 3D face meshes from monocular face images. A self-constraining
consistency is applied to 3DMM parameters regressed from different sources. This synergy process
includes a forward representation direction, from 3DMM parameters to refined 3D landmarks, and
a reverse direction, from 3D landmarks back to 3DMM parameters.
We propose SynergyNet, a synergy process network that includes two stages. The first stage
contains a backbone network to regress 3DMM parameters from images and construct 3D face
meshes. After landmark extraction by querying associated indices, we propose a landmark refine-
ment module that aggregates 3DMM semantics and incorporates them into point-based features to
produce refined 3D landmarks. We closely validate how each information source contributes to 3D
landmark refinement. From the representation perspective, the first stage goes from 1D parameters
to 3D landmarks. Next, the second stage contains a landmark-to-3DMM module that predicts
3DMM parameters from 3D landmarks, which is a reverse representation direction compared with
the first stage. We leverage this step to regress embedded facial geometry lying in sparse landmarks.
The overall framework is in Fig. 2.2.
Our SynergyNet contains only simple and widely-used network operations in the whole synergy
process. We quantitatively analyze the performance gains introduced by each adopted information
source and each regression target with extensive experiments. We evaluate our SynergyNet on the
tasks of facial alignment, face orientation estimation, and 3D face modeling using the standard
datasets for each task. Our SynergyNet attains superior performance to other related work. Fig. 2.1 demonstrates
the ability of our SynergyNet.
2.1 3D Morphable Models (3DMM)
3DMM reconstructs face meshes using principal component analysis (PCA). Given a mean face
M ∈ R^{3N_v} with N_v 3D vertices, 3DMM deforms M into a target face mesh by predicting the shape
and expression variations. U_s ∈ R^{3N_v × 40} is the basis of the shape variation manifold that represents
different identities, U_e ∈ R^{3N_v × 10} is the basis of the expression variation manifold, and α_s ∈ R^{40} and
α_e ∈ R^{10} are the associated basis coefficients. The 3D face reconstruction is formulated in Eq. 2.1:

S_f = Mat(M + U_s α_s + U_e α_e),   (2.1)

where S_f ∈ R^{3 × N_v} represents a reconstructed frontal face model after the vector-to-matrix operation
(Mat). To align S_f with the input view, a 3×3 rotation matrix R ∈ SO(3), a translation vector t ∈ R^3,
and a scale τ are predicted to transform S_f by Eq. 2.2:

S_v = τ R S_f + t,   (2.2)

where S_v ∈ R^{3 × N_v} aligns with the input view. τR and t are included as 3DMM parameters in most
works [39, 125, 146], and thus we use α_p ∈ R^{12} instead. We follow the current best work 3DDFA-V2
[39] to predict 62-dim 3DMM parameters α for pose, shape, and expression.

We follow 3DDFA-V2 to adopt MobileNet-V2 as the backbone network to encode input
images and use fully-connected (FC) layers as decoders for predicting 3DMM parameters from the
bottleneck image feature z. We separate the decoder into several heads by 3DMM semantics, which
jointly predict the whole 62-dim parameters. The advantage of separate heads is that disentangling
pose, shape, and expression controls secures better information flow. The illustration in Fig. 2.2
shows the encoder-decoder structure. The decoding is formulated as α_m = Dec_m(z), m ∈ {p, s, e},
denoting pose, shape, and expression. With the groundtruth notation * hereafter, the supervised
3DMM regression loss is:

L_3DMM = Σ_m ∥α_m − α*_m∥_2.   (2.3)
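To make Eqs. 2.1 and 2.2 concrete, the NumPy sketch below reconstructs and poses a mesh from the 62-dim parameter vector. The parameter packing (12 pose + 40 shape + 10 expression, with the pose flattened as a 3×4 [τR | t] matrix) and the vertex ordering are assumptions for illustration, not the exact conventions of the released implementation.

```python
import numpy as np

def reconstruct_face(alpha, M, U_s, U_e):
    """Build a posed 3D face mesh from 62-dim 3DMM parameters (Eqs. 2.1-2.2).

    alpha: (62,) parameters, assumed packed as 12 pose + 40 shape + 10 expression.
    M: (3*N_v,) mean face; U_s: (3*N_v, 40) shape basis; U_e: (3*N_v, 10) expression basis.
    """
    alpha_p, alpha_s, alpha_e = alpha[:12], alpha[12:52], alpha[52:]

    # Eq. 2.1: deform the mean face, then Mat(): vector -> 3 x N_v matrix
    # (assuming vertices are stored as consecutive (x, y, z) triplets).
    S_f = (M + U_s @ alpha_s + U_e @ alpha_e).reshape(-1, 3).T

    # Eq. 2.2: the 12 pose parameters pack a scaled rotation tau*R and a translation t
    # as a 3 x 4 matrix (an assumed but common convention).
    pose = alpha_p.reshape(3, 4)
    tau_R, t = pose[:, :3], pose[:, 3:]
    S_v = tau_R @ S_f + t   # (3, N_v), aligned with the input view
    return S_v
```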
2.1.1 SynergyNet: Synergy between 3D landmarks and 3DMM parameters
2.1.2 From 3DMM to Refined 3D Landmarks
After regressing the 3DMM parameters (α_p, α_s, α_e), the 3D face mesh for the input face can be constructed
by Eq. 2.1 and aligned with the input face by Eq. 2.2. We adopt the popular BFM [82], which includes
about 53K vertices, as the mean face M in Eq. 2.1. Then, 3D landmarks L_c ∈ R^{3 × N_l} are extracted by
landmark indices. N_l = 68 is used in 300W-LP [146], our training dataset.

Previous studies [39, 146, 148] directly use the extracted landmarks L_c to compute the alignment loss
for learning 3D facial geometry. However, these extracted landmarks are raw and unrefined.
Instead, we adopt a refinement module that aggregates multi-attribute features to produce finer
landmark structures. Landmarks can be seen as a sequence of 3D points. Weight-sharing multi-layer
perceptrons (MLPs) are commonly used for extracting features from structured points. PointNet-based
frameworks [70, 83, 84, 112, 122, 128] use an MLP-encoder to extract high-dimensional
embeddings. At the bottleneck, global point max-pooling is applied to obtain global point features.
Then an MLP-decoder is used to regress per-point attributes. An MLP-based refinement module
takes the sparse landmarks L_c as inputs and uses the MLP-encoder and MLP-decoder to produce finer
landmarks.

Instead of using L_c alone for the refinement, our refinement module adopts multi-attribute feature
aggregation (MAFA), including input images and 3DMM semantics that provide information from
different domains. For example, shape contains information about thinner/thicker faces, and expression
contains information about eyebrow or mouth movements. Therefore, these pieces of information can
help regress finer landmark structures. Specifically, our MAFA fuses information from the image, using
its bottleneck features z after global average pooling, and the shape and expression 3DMM parameters.
These features and parameters are global information without spatial dimensions. We first use FC
layers for domain adaptation. Then we concatenate them into a multi-attribute feature vector and
repeat this vector N_l times to make the multi-attribute features compatible with per-point features. We
last append the repeated features to the low-level point features and feed them to an MLP-decoder
to produce refined 3D landmarks.

We use groundtruth landmarks to guide the training. The alignment loss function is formulated
as follows:

L_lmk = Σ_n L_smL1(L_{r,n} − L*_n),  n ∈ [1, N_l],   (2.4)

where N_l is the number of landmarks, * denotes groundtruth, and L_smL1 is the smooth L1 loss. So far,
the operations of constructing 3D face meshes, landmark extraction, and refinement transform
3DMM parameters into refined 3D landmarks.
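The multi-attribute feature aggregation above boils down to adapting each global attribute with an FC layer, concatenating, repeating the result N_l times, and appending it to the per-point features before an MLP decoder. Below is a minimal PyTorch sketch; the layer widths, module names, and bottleneck feature size are illustrative assumptions rather than the actual SynergyNet configuration.

```python
import torch
import torch.nn as nn

class MAFARefiner(nn.Module):
    """Sketch of multi-attribute feature aggregation (MAFA) for landmark refinement."""
    def __init__(self, n_lmk=68, img_dim=1280, point_dim=64):
        super().__init__()
        self.img_fc = nn.Linear(img_dim, 128)    # domain adaptation of image feature z
        self.shape_fc = nn.Linear(40, 32)        # alpha_s -> compact shape code
        self.exp_fc = nn.Linear(10, 32)          # alpha_e -> compact expression code
        self.decoder = nn.Sequential(            # weight-sharing MLP over the points
            nn.Conv1d(point_dim + 128 + 32 + 32, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 3, 1),                # per-point 3D coordinates
        )
        self.n_lmk = n_lmk

    def forward(self, point_feat, z, alpha_s, alpha_e):
        # point_feat: (B, point_dim, N_l) low-level features of the coarse landmarks L_c
        attrs = torch.cat([self.img_fc(z), self.shape_fc(alpha_s), self.exp_fc(alpha_e)], dim=1)
        attrs = attrs.unsqueeze(-1).repeat(1, 1, self.n_lmk)   # repeat N_l times
        fused = torch.cat([point_feat, attrs], dim=1)          # append to per-point features
        return self.decoder(fused)                             # refined landmarks L_r: (B, 3, N_l)
```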
2.1.3 From Refined Landmarks to 3DMM
We next describe the reverse direction of representation that goes from refined landmarks to 3DMM
parameters.

Previous works only consider 3DMM parameter regression from images [23, 39, 47, 106, 125,
146, 148]. However, facial landmarks are sparse keypoints lying at the eyes, nose, mouth, and face
outlines, which are the principal areas that α_s and α_e control. We assume that approximate facial
geometry is embedded in the sparse landmarks. Thus, we further build a landmark-to-3DMM module
to regress 3DMM parameters from the refined landmarks L_r using the holistic landmark features.
To our knowledge, we are the first to study this reverse representation direction, from landmarks to
3DMM parameters.

The landmark-to-3DMM module also contains an MLP-encoder to extract high-dimensional
point features and uses global point max-pooling to obtain holistic landmark features. Separate
FC layers then transform the holistic landmark features into 3DMM parameters α̂, including pose,
shape, and expression. We refer to α̂ as the landmark geometry, since this 3DMM geometry is regressed
from landmarks. We adopt a supervised loss with groundtruth α* for α̂ as follows:

L_3DMM^lmk = Σ_m ∥α̂_m − α*_m∥_2,   (2.5)

where m covers pose, shape, and expression.

Furthermore, since α̂ regressed from the landmarks and α regressed from the face image describe
the same identity, they should be numerically similar. We further add a novel self-supervision
control as follows:

L_g = Σ_m ∥α_m − α̂_m∥_2,   (2.6)

where m ∈ {p, s, e}. L_g improves information flow by letting the 3DMM parameters regressed from
images obtain support from the landmark geometry.

The advantage of the self-supervision control (Eq. 2.6) is that, since images and sparse landmarks
are different data representations (2D grids and 3D points) processed by different network architectures and
operations, more descriptive and richer features can be extracted and aggregated under this multi-representation
strategy. Although sparse landmarks conceptually provide only rough face outlines, our
experiments show that this reverse representation direction further contributes to the performance
gain and attains superior results to related work.

Figure 2.3: Illustration of representation cycle.
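The landmark-to-3DMM module described above is essentially a PointNet-style encoder, a global point max-pooling, and separate FC heads for pose, shape, and expression. A minimal PyTorch sketch follows; the layer widths are assumptions, not the released configuration.

```python
import torch
import torch.nn as nn

class LandmarkTo3DMM(nn.Module):
    """Sketch: regress 3DMM parameters (pose/shape/expression) from 3D landmarks."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(        # weight-sharing MLP over the 68 points
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, feat_dim, 1), nn.ReLU(),
        )
        self.head_pose = nn.Linear(feat_dim, 12)   # alpha_p
        self.head_shape = nn.Linear(feat_dim, 40)  # alpha_s
        self.head_exp = nn.Linear(feat_dim, 10)    # alpha_e

    def forward(self, lmk):                  # lmk: (B, 3, 68) refined landmarks
        feat = self.encoder(lmk)             # (B, feat_dim, 68) per-point features
        holistic = feat.max(dim=-1).values   # global point max-pooling
        return torch.cat([self.head_pose(holistic),
                          self.head_shape(holistic),
                          self.head_exp(holistic)], dim=1)   # 62-dim landmark geometry
```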
Overall, the total loss combination is:

L_total = λ_1 L_3DMM + λ_2 L_lmk + λ_3 L_3DMM^lmk + λ_4 L_g,   (2.7)

where the λ terms are loss weights.
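As a concrete reading of Eq. 2.7, the sketch below combines the four terms; the λ weights are placeholders, and MSE/smooth-L1 stand in for the ℓ2 and smooth-L1 regression terms of Eqs. 2.3-2.6 rather than reproducing the exact implementation.

```python
import torch
import torch.nn.functional as F

def synergynet_total_loss(alpha, alpha_hat, alpha_gt, lmk_refined, lmk_gt,
                          lambdas=(1.0, 1.0, 1.0, 1.0)):
    """Combine the four SynergyNet losses as in Eq. 2.7 (illustrative weights)."""
    l_3dmm = F.mse_loss(alpha, alpha_gt)              # Eq. 2.3: image -> 3DMM
    l_lmk = F.smooth_l1_loss(lmk_refined, lmk_gt)     # Eq. 2.4: refined landmarks
    l_3dmm_lmk = F.mse_loss(alpha_hat, alpha_gt)      # Eq. 2.5: landmarks -> 3DMM
    l_g = F.mse_loss(alpha, alpha_hat)                # Eq. 2.6: self-consistency
    l1, l2, l3, l4 = lambdas
    return l1 * l_3dmm + l2 * l_lmk + l3 * l_3dmm_lmk + l4 * l_g
```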
2.1.4 Representation Cycle
Our overall framework creates a cycle of representations. First, the image encoder and separate
decoders regress 1D parameters from a face image input. Then, we construct 3D meshes from
parameters and refine extracted 3D landmarks—this is the forward direction that switches rep-
resentations from 1D parameters to 3D points. Next, the reverse representation direction adopts
a landmark-to-3DMM module to switch representations from 3D points back to 1D parameters.
Therefore it forms a representation cycle (Fig. 2.3), and we minimize the consistency loss to
facilitate the training. The forward and reverse representation direction between 3DMM parameters
and refined 3D landmarks form a synergy process that collaboratively improves the learning of
facial geometry. Landmarks are extracted and refined, and the refined landmarks and landmark
geometry further support better 3DMM parameter predictions using the self-supervised consistency
loss (Eq.2.6).
Table 2.1: Benchmark on AFLW2000-3D for facial alignment. The original annotation version is
used. Our performance is the best with a gap over others on large poses.
Method | 0° to 30° | 30° to 60° | 60° to 90° | All
ESR [16] | 4.60 | 6.70 | 12.67 | 7.99
3DDFA [146] | 3.43 | 4.24 | 7.17 | 4.94
Dense Corr [135] | 3.62 | 6.06 | 9.56 | 6.41
3DSTN [10] | 3.15 | 4.33 | 5.98 | 4.49
3D-FAN [15] | 3.16 | 3.53 | 4.60 | 3.76
3DDFA-PAMI [148] | 2.84 | 3.57 | 4.96 | 3.79
PRNet [29] | 2.75 | 3.51 | 4.61 | 3.62
2DASL [106] | 2.75 | 3.46 | 4.45 | 3.55
3DDFA-V2 (MR) [39] | 2.75 | 3.49 | 4.53 | 3.59
3DDFA-V2 (MRS) [39] | 2.63 | 3.42 | 4.48 | 3.51
SynergyNet (our) | 2.65 | 3.30 | 4.27 | 3.41
Compared with a simple baseline using only the forward representation, i.e., going from image to
3DMM and directly extracting 3D points from built meshes to compute alignment loss, our proposed
landmark refinement (MAFA) and the reverse representation (landmark-to-3DMM module) only
bring about 5% more time in average for a single feed-forward pass. This is because landmarks are
sparse and compact, and weight-sharing MLPs are lightweight.
We choose simple and widely-used network operations to show that without special operations,
landmarks and 3DMM parameters can still guide the 3D facial geometry learning better. Through
the following studies and experiments, we closely validate each module we introduce to the plain
3DMM regression from images, including MAFA for landmark refinement and landmark-to-3DMM
module.
2.2 Experiments
Benchmark comparison on 3D facial alignment. We benchmark performance on the widely-used
AFLW2000-3D. The two versions of annotations (original and reannotated) are used. Table 2.1
shows the comparison on the original annotation. Our SynergyNet holds the best performance among
all the related work on this standard dataset. From the breakdown, our performance gain mainly
comes from medium- and large-pose cases, with a clear gap over others on large poses.
Table 2.2: 3D face modeling comparison on AFLW2000-3D.
Protocol-1 [29, 106, 146] | 3DDFA [146] | DeFA [65] | PRNet [29] | 2DASL [106] | SynergyNet (our)
NME | 5.37 | 5.55 | 3.96 | 2.10 | 1.97

Protocol-2 [39] | 3DDFA [146] | DeFA [65] | 3DDFA-V2 [39] | SynergyNet (our)
NME | 6.56 | 6.04 | 4.18 | 4.06
Benchmark comparison on 3D face modeling. Following [29, 39, 106], we evaluate 3D face
modeling on AFLW2000-3D. Two protocols are used. Protocol 1, suggested by [29, 106, 146], uses
the iterative closest point (ICP) algorithm to register groundtruth 3D models and predicted models.
The NME of per-point errors normalized by interocular distances is calculated. Protocol 2, suggested
by [39] and also called dense alignment, calculates the per-point error normalized by bounding box
sizes with groundtruth models aligned with images. Since ICP is not used, pose estimation affects
the performance under this protocol, and the NME is higher. We present the numerical
comparison in Table 2.2. The results show the ability of SynergyNet to recover 3D face models
from monocular inputs and attain the best performance.
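Both protocols reduce to a normalized mean error over corresponding 3D points; only the normalizer differs (interocular distance for Protocol 1, bounding-box size for Protocol 2). A minimal sketch:

```python
import numpy as np

def nme(pred, gt, normalizer):
    """Normalized mean error between corresponding 3D points.

    pred, gt: (N, 3) point sets (registered by ICP under Protocol 1).
    normalizer: interocular distance (Protocol 1) or bounding-box size (Protocol 2).
    """
    return np.linalg.norm(pred - gt, axis=1).mean() / normalizer
```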
2.3 Summary
This work proposes a synergy process that utilizes the relation between 3D landmarks and 3DMM
parameters, and they collaboratively contribute to better performance. We establish a representation
cycle, including a forward direction, from 3DMM to 3D landmarks, and a reverse representation
direction, from 3D landmarks to 3DMM. Specifically, we propose two modules: multi-attribute
feature aggregation for landmark refinement and the landmark-to-3DMM module. Extensive
experiments validate our network design, and we show a detailed performance breakdown for each
included attribute and regression target. Our SynergyNet only adopts simple network operations
and attains superior performance, making it a fast, accurate, and easy-to-implement method.
Chapter 3
Human Faces: Geometry from Voices
The previous chapter introduces how to estimate accurate facial geometry from images; in this
chapter, we take a bolder approach and investigate whether it is possible to recover face geometry from
voice. From everyday experience, when we hear an unfamiliar voice, we can roughly envision what the
speaker looks like. This experience suggests potential correlations between face and voice, and it is
interesting to ask to what extent the two modalities are correlated.
This work aims at revealing the underlying correlation between voice and face geometry.
Many physiological attributes are embedded in voices. For example, speech is produced by
articulatory structures, such as vocal folds, facial muscles, and facial skeletons, which are all
densely connected. Such a fact intuitively indicates potential correlations between voices and face
shapes [40]. Experiments in cognitive science point out that audio cues are associated with visual
cues in human perception– especially in recognizing a person’s identity [9]. Recent neuroscience
research further shows that two parallel processing of low-level auditory and visual cues are
integrated in the cortex, where voice processing affects facial structural analysis for the perception
purpose [134].
Traditional research in the voice domain focuses on utilizing voice inputs for predicting more
conspicuous attributes which include speaker identity, age, gender, and emotion. A novel direction
in recent development goes beyond predicting these attributes and tries to reconstruct 2D face images
from voice [19,77,121]. Their research is built on an observation that one can approximately envision
how an unknown speaker looks when listening to the speaker’s voice. Attempts towards validating
Figure 3.1: Cross-Modal Perceptionist. We study the correlations between voices and face
geometry under both supervised and unsupervised learning settings. This work targets at more
explainable human-centric cross-modal learning for biometric applications.
this assumptive observation include the work [77] for image reconstruction and works [19,121] using
generative adversarial networks (GANs). They aim to output face images from only a speaker’s
voice.
However, generating face images from voices is inherently ill-posed: the task involves predicting
extraneous attributes that voices cannot hint at, including image backgrounds, hairstyles, headgear, and beards.
These are attributes that one can apparently change without altering one's voice. Similar concerns arise
regarding the correlations between voices and facial textures or ethnicity: [77] demonstrates a t-SNE
plot in which ethnicity is scattered across all samples, indicating its low correlation to voices. As a
result, quantifying the difference between an output face image and a reference is hard and less
grounded.
Instead of producing face images, our analysis moves to the 3D domain with mesh representa-
tions and predicts one’s face geometry or skull structures from voices, which is free from the
above issues. Working on 3D meshes is less ambiguous than images because the former includes less
noisy variations unrelated to a speaker’s voice, such as stylistic variations, hairstyles, background,
and facial textures. Moreover, meshes enable more straightforward quantification of differences
between prediction and groundtruth in the Euclidean space– unlike the case in using face images,
where sources of differences involve backgrounds and hairstyles.
From the perspective of 3D faces, much research attention has been paid to 3D reconstruction
from monocular images [39, 97, 124, 146] or video sequences [33, 55] for 3D face animation or
talking face synthesis. In contrast, we are the first to investigate the correlations between one’s 3D
face geometry and voices, and we focus on the analysis of the face geometry gleaned from one’s
voices. Our goal is to validate the correlations between voices and face geometry towards more
explainable human-centric cross-modal learning with neuroscience support.
The analysis inevitably involves acquiring large-scale 3D face scans with paired voices, which
is expensive and raises privacy concerns. To deal with this issue, we propose a novel Voxceleb-3D dataset
that includes paired voices and 3D face models. Voxceleb-3D is inherited from two widely used
datasets, Voxceleb [71] and VGGFace [79], which include voices and face images of celebrities,
respectively. The approach [147] we adopt to create Voxceleb-3D is inspired by 300W-LP-3D [146],
the most-used 3D face dataset.
Our analysis framework, Cross-Modal Perceptionist (CMP), investigates the feasibility of
predicting face meshes with 3D Morphable Models from voices under the following two scenarios (Fig.
3.1). We first train neural networks directly on Voxceleb-3D in a supervised manner
using the paired voices and 3DMM parameters. We further investigate an unsupervised learning
setting to inspect whether face geometry can still be gleaned without paired voices and 3D faces,
which is a more realistic scenario. In this case, we use knowledge distillation (KD) [42] to transfer
knowledge from the state-of-the-art method for 3D faces from images, SynergyNet [124], into our
student network and jointly train the speech-to-image and image-to-3D blocks.
We design a set of metrics to measure the geometric fitness based on points, lines, and regions
for both the supervised and the unsupervised scenarios. The evaluation attempts to show correla-
tions between 3D faces and voices with straightforward neural network-based approaches. The
analysis with CMP enables us to comprehend the correlations between face geometry and voices.
Our research lays explainable foundations for human-centric cross-modal learning and biometric
applications using voice-face correlations, such as security and surveillance when only voice is
given.
Figure 3.2: Supervised learning framework. Given a speech input, a voice embedding is extracted
by φ_v. φ_dec then estimates 3DMM parameters α for 3D face modeling. The supervision is computed
with the groundtruth α*.
Our goal is not to recover high-quality 3D face meshes from voices comparable to synthesis
from visual modalities such as image or video inputs, but we try to answer the core question under
our CMP framework: can face geometry be gleaned from voice?
3.1 Cross-Modal Perceptionist
Our goal is to analyze how a person’s voice relates to one’s face geometry in the 3D space. Thus,
we learn 3D face meshes using 3D Morphable Models (3DMM) from input speech and analyze the
correlations under supervised and unsupervised learning settings. The supervised setting learns the
correlation from a paired voice and 3D face dataset. The unsupervised setting studies a more realistic
case: when such a paired dataset is not available, is it still possible to predict face geometry from
voice?
3.2 Supervised Learning with Voice/Mesh Pairs
We first describe the supervised learning setting, illustrated in Fig. 3.2. Given a paired speech
sequence and 3DMM parameters for an identity, we build an encoder-decoder structure to first
extract a voice embedding v ∈ R^{64} from a mel-spectrogram [37] of the input speech, which is a
commonly used time-frequency representation for speech. Following [121], the voice encoder φ_v is
pretrained on a large-scale speaker recognition task. Then, we train a decoder φ_dec to estimate the
3DMM parameters α. We use groundtruth 3DMM parameters to supervise the training with an L_2 loss:

L_reg = ∥α − α*∥_2,   (3.1)

where α* denotes the groundtruth 3DMM parameters.

In addition, we adopt a triplet loss on the estimated 3DMM parameters α. The triplet loss
minimizes the difference of pairwise relations between (anchor, positive) and (anchor, negative)
pairs with a soft margin:

L_tri = max{∥α − α_p∥_2 − ∥α − α_n∥_2 + 1, 0},   (3.2)

where α serves as the anchor, α_p is a positive sample for the anchor, representing the same identity
but regressed from different images, and α_n, coming from a different identity, is a negative sample
for the anchor. The triplet loss encourages coherent 3DMM parameters for the anchor and positive
samples, which share the same identity, while simultaneously contrasting them with the negative sample
from a different identity. The overall loss function is L_sup = L_reg + L_tri.

The challenge of this supervised learning problem is how to obtain α*. Most large voice datasets,
such as Voxceleb [71], only contain speech, and most face datasets, such as VGGFace [79], only
consist of publicly scraped face images. We first follow [121] to fetch the intersection of voice and
image data from Voxceleb and VGGFace. Then, we propose to fit 3D faces from 2D images to create a
novel dataset, Voxceleb-3D, using an optimization-based approach adopted by 300W-LP-3D [146],
the most-used 3D face dataset. In detail, we use an off-the-shelf 3D landmark detector [15] to
extract facial landmarks from the collected face images and then optimize 3DMM parameters to fit
the extracted landmarks. Our Voxceleb-3D contains paired voice and 3D face data to fulfill our
supervised learning.
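A minimal PyTorch sketch of the supervised objective (Eqs. 3.1 and 3.2): the unit soft margin follows the text, while tensor shapes and the batched mean reduction are assumptions for illustration.

```python
import torch

def supervised_cmp_loss(alpha, alpha_gt, alpha_pos, alpha_neg):
    """L_sup = L_reg + L_tri over predicted 3DMM parameters (Eqs. 3.1-3.2).

    alpha = phi_dec(phi_v(mel)) is the prediction; alpha_gt its groundtruth;
    alpha_pos / alpha_neg are same-/different-identity parameter vectors. Shapes: (B, 62).
    """
    l_reg = torch.norm(alpha - alpha_gt, dim=-1).mean()                    # Eq. 3.1
    l_tri = torch.clamp(torch.norm(alpha - alpha_pos, dim=-1)
                        - torch.norm(alpha - alpha_neg, dim=-1) + 1.0,
                        min=0.0).mean()                                    # Eq. 3.2, margin 1
    return l_reg + l_tri
```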
3.3 Unsupervised Learning with KD
Obtaining real 3D face scans is very expensive and limited by privacy, and the workaround of
optimization-based 3DMM fitting with facial landmarks is time-consuming. An unsupervised
framework may serve real-world scenarios. As a result, we propose an unsupervised framework
with knowledge distillation. By leveraging a well-pretrained expert, it helps to validate whether face
geometry can still be gleaned with neither real 3D face scans nor optimized 3DMM parameters.
Our unsupervised framework, illustrated in Fig. 3.3, has two stages: (1) synthesizing 2D face
images from voices with GAN and (2) 3D face modeling from synthesized face images. The
motivation is that we first use the GAN to generate 2D faces from voices to obtain the speaker’s
appearance. However, 2D images contain variations in backgrounds, textures, and hairstyles that are
irrelevant to voice. Thus, the second-stage image-to-3D-face module disentangles geometry from
other variations.
Synthesizing face images from voices with GANs. Previous research develops a GAN-based
speech-to-image framework [121]. A voice encoderφ
v
extracts voice embeddings from input speech.
Then a generatorφ
g
synthesizes face images from the voice embeddings, and a discriminatorφ
dis
decides whether the synthesis is indistinguishable from a real face image. Last, a face classifier φ
c
learns to predict the identity of an incoming face, ensuring that the generator produces face images
that are truly close to the identity in interest. Here we overload notations ofφ
v
and other components
introduced later for 3D face modeling in both Sec.3.2 and 3.3 due to the same functionalities.
In detail, given a speech input S, its corresponding speaker ID id, and real face images I_r for
the speaker, the image synthesized from the generator is I_f = φ_g(φ_v(S)). The loss formulation
is divided into two parts: real and fake images. For real images, the discriminator learns to
assign them to "real" (r) and the classifier learns to assign them to id. The loss for real images is
L_r = L_d(φ_dis(I_r), r) + L_c(φ_c(I_r), id), showing the discriminator and classifier losses respectively.
For fake images, after producing I_f from φ_g, the discriminator learns to assign them to "fake"
(r̄) and the classifier also learns to assign them to id. The loss counterpart for fake images is
L_f = L_d(φ_dis(I_f), r̄) + L_c(φ_c(I_f), id).
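A hedged PyTorch sketch of this two-part loss; the component names (phi_v, phi_g, phi_dis, phi_c), the binary real/fake logit, and the cross-entropy identity classifier are assumptions for illustration rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def gan_losses(phi_v, phi_g, phi_dis, phi_c, speech, real_imgs, speaker_id):
    # Synthesize a face from the speech input: I_f = phi_g(phi_v(S)).
    fake_imgs = phi_g(phi_v(speech))
    real_lbl = torch.ones(real_imgs.size(0), 1, device=real_imgs.device)
    fake_lbl = torch.zeros(fake_imgs.size(0), 1, device=fake_imgs.device)
    # L_r: discriminator labels real images as "real"; classifier predicts the speaker id.
    loss_real = F.binary_cross_entropy_with_logits(phi_dis(real_imgs), real_lbl) \
              + F.cross_entropy(phi_c(real_imgs), speaker_id)
    # L_f: discriminator labels synthesized images as "fake"; classifier still targets the id.
    loss_fake = F.binary_cross_entropy_with_logits(phi_dis(fake_imgs), fake_lbl) \
              + F.cross_entropy(phi_c(fake_imgs), speaker_id)
    return loss_real, loss_fake
```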
Figure 3.3: Unsupervised learning with KD. The unsupervised framework contains a GAN for
face image synthesis with voice encoder φ_v, generator φ_g, discriminator φ_dis, and classifier φ_c. Then,
knowledge distillation is used to achieve unsupervised learning. 2D face is a latent representation in
this fashion.
3D face modeling from synthesized images. After image synthesis by GAN, we build a
network to estimate 3DMM parameters from fake images. The parameter estimation consists of an
encoder φ_I and a decoder φ_dec to obtain 3DMM parameters α = φ_dec(φ_I(I_f)).
Knowledge distillation for unsupervised learning. To fulfill the unsupervised training, we distill
the knowledge of image-to-3D-face reconstruction from a pretrained expert network. The expert,
consisting of encoder φ_I^E and decoder φ_dec^E, reconstructs 3D face models from synthesized face
images and produces pseudo-groundtruth of 3DMM parameters α^E. α^E is used to train the student
network by an L_2 loss:
\mathcal{L}_{p\text{-}gt} = \|\alpha^E - \alpha\|_2 .   (3.3)
This KD strategy circumvents the need for paired voice and 3D face data and helps us achieve
unsupervised learning.
In addition to pseudo-groundtruth, we also distill knowledge at intermediate layers and minimize
their distribution divergence between the expert and the student. We measure the distributions in the
feature spaces by the extracted image embeddings z^E ∈ R^{B×ν} and z ∈ R^{B×ν} of the expert and the
student network. We maintain the batch dimension B and collapse the rest to ν. Then, as in [80], we
calculate the conditional probability z_{i|j} between feature points as follows.
z_{i|j} = \frac{K(z_i, z_j)}{\sum_{k, k \neq j} K(z_k, z_j)}, \qquad z^E_{i|j} = \frac{K(z^E_i, z^E_j)}{\sum_{k, k \neq j} K(z^E_k, z^E_j)},   (3.4)
where K(·,·) is a scaled and shifted cosine similarity whose outputs lie in [0,1]. The Kullback-Leibler
(KL) divergence is then used to minimize the discrepancy between the two conditional distributions.
\mathcal{L}_{div} = \sum_i \sum_{j \neq i} z^E_{j|i} \log\!\left( \frac{z^E_{j|i}}{z_{j|i}} \right).   (3.5)
The KD loss is L_KD = L_{p-gt} + L_div. The overall unsupervised learning loss combines the
GAN losses and the triplet loss in Eq. 3.2:
\mathcal{L}_{unsuper} = \mathcal{L}_f + \mathcal{L}_r + \mathcal{L}_{KD} + \mathcal{L}_{tri}.   (3.6)
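The following sketch illustrates the distribution-based distillation term in Eqs. 3.4-3.5, assuming a cosine kernel rescaled to [0,1]; the exact kernel scaling and reduction may differ in the actual implementation.

```python
import torch
import torch.nn.functional as F

def kd_divergence(z_student, z_expert, eps=1e-8):
    # z_student, z_expert: (B, nu) embeddings from the student and expert networks.
    def cond_prob(z):
        # Scaled/shifted cosine similarity so kernel values lie in [0, 1].
        k = (F.cosine_similarity(z.unsqueeze(1), z.unsqueeze(0), dim=-1) + 1) / 2
        k = k * (1 - torch.eye(z.size(0), device=z.device))   # exclude k == j
        return k / (k.sum(dim=0, keepdim=True) + eps)         # column-normalize: P(i | j)
    p_s, p_e = cond_prob(z_student), cond_prob(z_expert)
    # KL divergence of the expert distribution against the student distribution.
    return (p_e * (torch.log(p_e + eps) - torch.log(p_s + eps))).sum()
```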
3.4 Experiments
Datasets. We use our created Voxceleb-3D dataset described in Sec. 3.2. There are about 150K
utterances and 140K frontal face images from 1225 subjects. The train/test split for Voxceleb-3D is
the same as [121]: names starting with A-E are used for testing, and the others are for training. We
manually pick the best-fit 3D face models for each identity as reference models for evaluations.
Data Processing and Training. We follow [121] and extract 64-dimensional log mel-spectrograms
with a window size of 25 ms, and perform normalization by mean and variance of each frequency
bin for each utterance. In the unsupervised setting, we adopt SynergyNet [124] as the expert. Face
images from the generator are 64× 64, and we bilinearly upsample them to 120× 120 to fit the
input size of the expert for 3D face reconstruction from images. Our framework is implemented
in PyTorch [81]. We use the Adam optimizer [56] and set the learning rate to 2×10^{-4}, the batch size to
64, and the total number of training steps to 50,000, which consumes about 16 hours to train on a
machine with a GeForce RTX 2080 GPU.
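For concreteness, a possible feature-extraction sketch with librosa; the 16 kHz sample rate and 10 ms hop length are assumptions not specified above.

```python
import librosa
import numpy as np

def log_mel_spectrogram(wav_path, sr=16000, n_mels=64, win_ms=25, hop_ms=10):
    y, sr = librosa.load(wav_path, sr=sr)
    win = int(sr * win_ms / 1000)                 # 25 ms analysis window
    hop = int(sr * hop_ms / 1000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=win, win_length=win,
                                         hop_length=hop, n_mels=n_mels)
    logmel = np.log(mel + 1e-6)
    # Per-utterance mean/variance normalization of each frequency bin.
    return (logmel - logmel.mean(axis=1, keepdims=True)) / (logmel.std(axis=1, keepdims=True) + 1e-6)
```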
Figure 3.4: Distance illustration for our ARE metric. AB: ear-to-ear distance. CD: forehead
width. EF: outer-interocular distance. GH: midline distance. IJ: cheek-to-cheek distance.
To train with the triplet loss, for each sample in a batch, we further uniformly sample one utterance
of the same person as the positive sample and one utterance from a different person as the
negative sample.
Metrics. We design ARE to evaluate 3D face deformation based onα. Absolute Ratio Error
(ARE, line-based): Distances between facial points are commonly used as measures related to
aesthetics or surgical purposes [1, 78, 94]. We pick point pairs (shown in Fig. 3.4) that are most
representative for evaluation and calculate the distance ratios to outer-interocular distance (OICD).
For example, ear ratio (ER) is AB/EF, and the same for forehead ratio (FR), midline ratio (MR), and
cheek ratio (CR). We evaluate our models by the absolute ratio error (ARE) between the predicted
and the reference face meshes because these ratios can capture face deformation. As an example,
ARE of ER is |ER − ER*|, where * denotes the ratios of reference models.
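A small sketch of the line-based ARE computation, assuming the facial points of Fig. 3.4 are available as a dictionary of 3D coordinates; this interface is hypothetical.

```python
import numpy as np

def absolute_ratio_error(pred_pts, ref_pts, pair=("A", "B"), oicd_pair=("E", "F")):
    # pred_pts / ref_pts: dicts mapping point names (Fig. 3.4) to 3D coordinates.
    def ratio(pts):
        d = np.linalg.norm(np.asarray(pts[pair[0]]) - np.asarray(pts[pair[1]]))
        oicd = np.linalg.norm(np.asarray(pts[oicd_pair[0]]) - np.asarray(pts[oicd_pair[1]]))
        return d / oicd                      # e.g. ER = AB / EF
    return abs(ratio(pred_pts) - ratio(ref_pts))
```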
Baseline. We build a straightforward baseline by directly cascading two separately pretrained
methods without joint training: the GAN-based speech-to-image block [121] and SynergyNet [124]
as the image-to-3D-face block, producing 3D meshes from voices as the baseline framework. In
addition, 3DDFA-V2 [39] is another method for 3D face modeling from monocular images using
BFM and holds a close performance to SynergyNet. Thus, we experiment with combinations of
speech-to-image block + 3DDFA-V2 (Base-1) and speech-to-image block + SynergyNet (Base-2).
Figure 3.5: Evidence for positive response to Q1. Our unsupervised framework predicts interme-
diate 2D images and 3D meshes. This answers Q1: 3D face models exhibiting similar face
shapes to the references can be predicted from only voice inputs.
Figure 3.6: A collection of results supports our positive response to Q1. This figure extends
Fig. 3.5. From top to bottom in each of the two row chunks: predicted intermediate face images, predicted 3D models,
and real faces for reference.
3.4.1 Analysis
We attempt to answer Q1-Q4 in this section and respond to each respective question in A1-A4.
In A1-A3, we show predictions using our unsupervised learning setting since the by-product
intermediate images help explain 3D mesh prediction for better comprehension of the mechanism.
A1: Meshes and intermediate images. In Fig. 3.5, we display intermediate 2D images, 3D
meshes, and real faces. Note that the real faces should only be treated for identification purposes in
terms of face shapes because those images include backgrounds or hairstyle variations that differ
Figure 3.7: Illustration for our positive response to Q2. Consistent intermediate images and 3D
faces can be predicted from the same speaker with different time-step utterances.
Figure 3.8: Shape variation statistics in response to Q2. Mean and std of per-vertex variation
w.r.t. the center frame are shown, calculated in frontal pose. 3D shapes recovered from different
utterances are consistent with only sub-pixel differences.
from references. Our end targets are the 3D face meshes that are free from these factors. Prediction
from our framework generates wider meshes in Column 2 and thinner meshes in Column 3 and 4,
which reflect the real face wideness. All the generated 3D meshes fit in 2D facial outlines well.
These results exemplify the ability to convert voices into plausible 3D face meshes. Although
meshes are rough compared with 3D synthesis from image or video modalities, the results
conform to our intuitions that when an unheard speech comes, one can roughly envision whether
the speaker’s face is overall wider or thinner. However, we cannot picture subtle details, such as
bumps or wrinkles on faces. The same trends can be observed in a vast result collection in Fig. 3.6.
The results are not cherry-picked.
Figure 3.9: Comparison of intermediate images and meshes in response to Q3. The cross-modal
joint training strategy in our unsupervised CMP produces better-quality images than the baseline.
More reliable images as latent representations from our CMP can facilitate mesh prediction. We
include real faces for face shape references.
A2: Prediction coherence of the same speaker. To address Q2, we showcase in Fig. 3.7 and
3.8 the coherence of the predicted face shapes across different utterances of the same speakers.
The 2D predictions exhibit face shape and outline consistency, though they are still plagued by
stylistic variations that are geometrically unrelated to our task. This not only confirms the ability
to produce coherent face meshes but also underlines why predicting face meshes from voices is
regarded as less noisy than face image synthesis.
A3: Gain from cross-modal joint training. For Q3, we compare results from our unsupervised
framework against those from the baseline in Fig. 3.9. Joint training for the speech-to-image and
image-to-3D sub-networks attains higher and more stable image synthesis quality, which benefits 3D
mesh prediction. In contrast, results from the baseline (Base-2) include more artifacts. This justifies
our CMP's cross-modal joint training strategy: letting the networks learn to predict 3D faces with
voice input during training improves over the baseline that is trained separately.
To this end, we understand that voices can help 3D face prediction and produce visually
reasonable meshes that are close to real face shapes.
Table 3.1: ARE metric study. Compared with baseline, results from CMP show that cross-modal
joint training with voice input can obtain around 20% improvements. We also highlight the largest
improvement, ER, that answers to Q4.
ARE    Base-1   Base-2   CMP-supervised   CMP-unsupervised
ER     0.0319   0.0311   0.0152           0.0181
FR     0.0184   0.0173   0.0186           0.0169
MR     0.0177   0.0173   0.0169           0.0174
CR     0.0562   0.0551   0.0457           0.0480
Mean   0.0311   0.0302   0.0241           0.0251
Gain   -        0%       -20.2%           -16.9%
A3-Quantification + A4. We numerically compare the supervised and unsupervised settings of our
analysis framework, CMP, against the baseline using the ARE. Both supervised and unsupervised
settings improve the line-based ARE over the baseline by around 20%, as exhibited in Table 3.1.
The results show that cross-modal joint training achieves better results than the direct cascade of
pretrained blocks. These improvements reveal underlying correlations between voices and face
shapes such that training face mesh prediction with joint voice information is helpful. Among all
metrics, the ear ratio (ER) shows the most prominent improvement, indicating that the attribute
voices hint at most strongly is head width, which answers Q4. This analysis aligns with the
findings in Sec. 3.4.1 that voice can indicate wider/thinner faces, which corresponds to our intuition
that we can roughly envision a speaker’s face width from voices. Through this study, we quantify the
improvements of cross-modal learning from voice inputs, and the findings echo human perception
intuitively.
3.4.2 Subjective Evaluations
We further conduct subjective preference tests over the outputs to quantify the difference of prefer-
ence. The test was divided into three sections, considering images, 3D models, and joint materials.
Though we favor face meshes over images because the former are free from irrelevant textures
or backgrounds, we included intermediate images from our unsupervised setting in the test and
Figure 3.10: Results of subjective preference tests. The blue bars are the preference for our
method, while the red bars are the preference for the baseline method. The percentages are labeled
on the bars, and the total number of votes is enclosed in the parentheses. The x-axis on the bottom
labels the total number of responses, and that on the top denotes the percentage. The p-values of
the statistical significance tests are provided under the bars; ∼ shows the value's order of magnitude.
Preference for our method: Image 60.52% (932 of 1540), 3D Model 59.35% (914 of 1540),
Image + 3D Model 57.79% (890 of 1540), Overall 59.22% (2736 of 4620); p-values ∼10^{-16},
∼10^{-14}, ∼10^{-10}, and ∼10^{-16}, respectively.
asked subjects to focus on face shapes since better-outlined shapes on images lead to better-shaped
meshes, as indicated in Fig. 3.9.
Evaluation design. Thirty questions were included in the test, and 154 subjects with no prior
knowledge of our work were invited to the test. In the first section, each of the ten questions
consisted of three images– a reference face image, a face image from our unsupervised CMP, and
a face image generated from Base-2 ([121] + [124]). The order of the generated images was
randomized. The subjects were asked to select the face image ”whose shape is geometrically more
similar to the reference face?”. In the second section (10 questions), a similar design was laid out,
but 3D face models from Base-2 and our CMP were used instead of images. Finally, in the third
section (10 questions), each of the two options comprised a face image and a 3D face model; the
subject was asked to jointly consider the two materials: "overall, whose shape geometrically
fits the given reference image better?”
Statistical significance test. Fig. 3.10 summarizes our subjective evaluation. We conduct a
statistical significance test with the following formulation. A subject’s response to a question is
considered as a Bernoulli random variable with a parameter p. The null hypothesis (H_0) assumes
p ≤ 0.5, meaning that the subjects do not prefer our model. The alternative hypothesis H_1 assumes
p > 0.5, meaning that the subjects prefer our model. For each section, there are 154 subjects and
ten responses per subject. For a significance level γ = 0.001, let b_{n,p}(γ) denote the quantile of order
γ for the binomial distribution with parameters p and n. We can decide whether the subjects prefer
our model by
\text{Reject } H_0 \text{ versus } H_1 \;\Leftrightarrow\; np \geq b_{n,p}(1-\gamma), \qquad H_0: p \leq 0.5,\; H_1: p > 0.5.   (3.7)
As shown in Fig. 3.10, np is well above the threshold b_{n=1540, p=0.5}(1−γ) = 831, rejecting H_0
and suggesting that the subjects significantly prefer our model over the baseline. The single-sided
p-values are displayed under the bar chart. A lower p-value means stronger rejection of H_0. The
p-values from our tests are much lower than the level 0.001, showing high statistical significance.
In conclusion, the hypothesis test verifies that the subjects indeed favor the predictions from our
method.
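The threshold and p-values above can be reproduced with a short SciPy sketch; the function below follows the one-sided binomial formulation in Eq. 3.7 and is illustrative only.

```python
from scipy.stats import binom

def preference_test(votes_for_ours, n_total, gamma=1e-3, p0=0.5):
    threshold = binom.ppf(1 - gamma, n_total, p0)        # b_{n,p0}(1 - gamma)
    p_value = binom.sf(votes_for_ours - 1, n_total, p0)  # P[X >= votes] under H0
    return votes_for_ours >= threshold, p_value

# e.g. the image section: 932 of 1540 votes -> threshold 831, H0 rejected
# reject, p = preference_test(932, 1540)
```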
3.5 Summary
In this work, we investigate a root question in human perception: can face geometry be gleaned
from voices? We first point out shortcomings in previous studies in which 2D faces are predicted:
such synthesis contains variations in hairstyles, backgrounds, and facial textures with controversial
correlations to voices. We instead focus on 3D faces whose correlations to voices have been
supported by neuroscience and cognitive science studies. As a pioneering work toward this direction,
we innovate a way to construct Voxceleb-3D that includes paired voices and 3D face models, devise
and test baseline methods and oracles, and propose a set of evaluation metrics. Our proposed main
framework, CMP, learns 3DMM parameters from voices under both supervised and unsupervised
settings. Based on CMP, we answer the core question with detailed analyses. We conclude that 3D faces
can be roughly reconstructed from voices.
Chapter 4
Outdoor Driving Scenes: A Sparse-to-Dense Approach
In this chapter, we shift from face data to scene data and start with geometry estimation from outdoor
driving scenes, which have popular applications on autonomous vehicles. Geometry estimation
in autonomous vehicles is a crucial task to measure distances from the ego car to surrounding
environments. Autonomous vehicles usually adopt lidars as the main depth acquisition sensors
due to their high precision and practicability in outdoor depth sensing. However, lidar scans are
limited to the number of scanlines and spatial resolutions, and thus they are sparse when aligned
with images. Recent research on lidar depth completion for autonomous driving tries to complete
sparse lidar depth into a dense map [68, 69], evaluated on the KITTI Depth Completion dataset [34].
However, the upper side of depth maps is always cropped out for processing and evaluation for two
reasons. First, these upper-side areas are usually sky or trees of low scene-understanding interest.
Second, lidars are active sensors with limited scanlines and a smaller vertical field-of-view than
cameras. Thus, most lidar scans do not span the whole image height and are concentrated on the
lower parts of images. For KITTI, the top 1/3 to 1/4 of each image is unscanned by lidars. Also, KITTI's
depth groundtruth is acquired by accumulating 3D point clouds with a 64-scanline lidar, and hence
the groundtruth is also concentrated on the lower parts of images. Both of KITTI's quantitative
and qualitative evaluations focus only on the lower parts.
Nevertheless, upper scenes are especially important in several autonomous driving scenarios;
for example, a huge truck beside or just in front of the ego car occupies a large area of the upper scene when close enough.
Traffic signs or lights are important road structures extending to the upper parts. Although more
Figure 4.1: Comparison of depth from stereo matching network, depth completion network,
and our SCADC. Our results leverage the advantages of both stereo matching, which has more
structured upper scenes, and lidars, which have more precise depth measurements.
and more research focuses on multi-modal learning from images and depth, the scene incompleteness
issue is mostly ignored for the following reasons. First, depth completion is treated as a standalone
task in previous works without validating the completed depth maps on other scene understanding
tasks such as semantic segmentation. Second, not enough data of large objects extending to the
upper scenes are collected, and thus the issue is generally omitted. In contrast, stereo matching
produces disparity maps with much more structured upper scenes because disparities are derived from
images. However, stereo matching is known for less reliable depth measurements in far-range
sensing and for the edge-bleeding artifact [110] that produces distorted shapes.
In this work, we leverage the completeness of disparity maps to help sparse depth completion.
To our knowledge, we are the first focusing on the scene completeness issue in depth completion.
Our Scene Completeness-Aware Depth Completion (SCADC) fuses depth estimations from a
stereo matching network and a lidar depth completion network. We propose Attentional Point
Confidence (APC) to regress confidence maps to fuse multi-modal information. Later, we use a
stacked hourglass network to refine estimations stage by stage with groundtruth. Output examples
are in Fig. 4.1. We use both quantitative and qualitative comparisons to validate that our SCADC
combines the advantages of stereo cameras and lidars, producing both scene completeness-aware
and precise depth maps.
4.1 Scene Completeness-Aware Depth Completion
Figure 4.2: Network pipeline of our SCADC.
The whole network design of our SCADC is in Fig. 4.2. Our goal is to construct a network
for sensor fusion, which takes advantage of depth from stereo matching with more structured
upper scenes, and depth from lidar completion with higher precision, to generate both scene
completeness-aware and precise depth maps.
PSMNet [17] and SSDC [68] are adopted as our base methods for stereo matching and lidar
completion respectively. We use the estimated depth maps from the two modalities, D_stereo and D_lidar,
as inputs to our SCADC. SCADC consists of two parts, multi-modal fusion and regression with a
stacked hourglass network.
At the multi-modal fusion stage, we utilize the early fusion strategy. Early fusion incorporates
multi-modal information before an encoder stage and has the advantage of retaining finer local
structures and neighborhood relationships. Opposed to early fusion, late fusion is usually adopted
for multi-modal learning with modalities from different domains to capture higher-level semantics,
such as fusing information of images and depth [49, 143]. Our SCADC operates information fusion
only on the depth domain, and thus early fusion of retaining local features and structures is more
desirable.
We propose a novel confidence regression module, Attentional Point Confidence (APC), to
estimate the pixel-level confidence of lidars, M_lidar ∈ [0,1]^{H×W}, where H and W are the height and
width of the inputs. APC decides for each pixel which modality is more likely to estimate more
reliable depth. Previous works [86, 108] also use confidence maps for RGBD fusion without direct
supervision on confidence regression. However, for stereo cameras/lidars fusion, since we have
priors that depth from stereo matching is more structured on upper scenes and depth from lidar
scans is more precise in general, using direct supervision on confidence regression could make
the network generate better confidence maps of maintaining and combining both advantages from
stereo cameras and lidars.
We create a guiding confidence M_g from the raw lidar scans. Lidar measurements are compar-
atively precise, so pixel positions at raw lidar points should have higher confidence. We set their
scores to 1. Next, the depths of neighboring pixels are generally similar, so we dilate the confidence
at each raw lidar point using Gaussian kernels to obtain M_g. The kernel size and variance are based
on the point density of raw lidar scans. We find the density along a scanline for KITTI is 44.6% at the
center and 30.6% near the left/right side. Thus, we use a 3×3 kernel and choose a variance which
makes confidence scores drop to half at a 1-pixel distance from the center. M_g provides better
priors of confidence maps for learning. Note that (1) M_g is not a hard constraint. The network could
still generate confidence maps learned from all loss combinations. (2) Estimating a confidence
map from one modality is enough for the probabilistic fusion of two modalities using the sum-to-1
constraint. We choose to estimate the lidars' confidence since their scanned points are generally more precise,
giving convenience to create M_g.
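A sketch of how such a guiding confidence map could be built in PyTorch; the tensor layout and the Gaussian sigma (chosen so confidence roughly halves one pixel away from a scanned point) are assumptions.

```python
import torch
import torch.nn.functional as F

def guiding_confidence(sparse_depth, kernel_size=3, sigma=0.85):
    # sparse_depth: (B, 1, H, W) raw lidar depth, zeros at unscanned pixels.
    hits = (sparse_depth > 0).float()                       # confidence 1 at lidar points
    ax = torch.arange(kernel_size, dtype=torch.float32) - kernel_size // 2
    yy, xx = torch.meshgrid(ax, ax, indexing="ij")
    g = torch.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))  # ~0.5 one pixel from the center
    g = (g / g.max()).view(1, 1, kernel_size, kernel_size)
    conf = F.conv2d(hits, g.to(hits.device), padding=kernel_size // 2)
    return conf.clamp(max=1.0)
```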
Sparse data are intrinsically hard for a CNN to extract effective features from. In APC, we utilize
Sparsity-Attentional Convolution (SAConv) [143] to extract features from sparse lidar maps.
SAConv attends to feature extraction at each nonzero point with an extra mask to keep track of
visibility. After regressing M_lidar, we calculate the confidence loss as L_c = ‖M_lidar − M_g‖²_2. The
confidence for stereo is M_stereo = 1 − M_lidar, and the fused depth is D_f = D_stereo × M_stereo + D_lidar × M_lidar.
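The confidence-based fusion reduces to a few lines; this sketch assumes per-pixel confidence and depth tensors of identical shape.

```python
import torch.nn.functional as F

def fuse_depths(conf_lidar, d_stereo, d_lidar, conf_guide=None):
    conf_stereo = 1.0 - conf_lidar                          # sum-to-1 constraint
    d_fused = d_stereo * conf_stereo + d_lidar * conf_lidar
    loss_conf = F.mse_loss(conf_lidar, conf_guide) if conf_guide is not None else None
    return d_fused, loss_conf
```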
The second stage is depth regression. We use a stacked hourglass network [73] with dense
connections for regressing depth. Our stacked hourglass network consists of 3 cascaded encoder-
decoder structures. It has the advantage of refining depth maps stage by stage when compared with
mostly used single encoder-decoder of FCN-like structure in other depth completion works [68, 69, 143, 49].
The stacked hourglass produces 3 stage outputs (S1, S2, and S3). We further use
skip connection and densely connect each corresponding layer of these hourglasses and feed the
regressed depth to every subsequent stage to enhance information flow. Finer depth is regressed at
later stages. At inference time, S3 is the final depth output. ReLU [72] and batch normalization [46]
are adopted after each convolution in stacked hourglass and APC.
We use the groundtruth, D_gt, to directly supervise the regression and calculate loss terms for each
stage output. The corresponding mean square error losses are computed as follows.
\mathcal{L}_i = \|D_{gt} - S_i\|_2^2, \quad \forall i \in [1, 3].   (4.1)
The total loss is L_1 + L_2 + L_3 + L_c. Note that D_gt from KITTI Depth Completion does not contain
points on the upper scenes. We find that using more stages or a deeper network would cause overfitting
on the lower parts and yield unstructured upper scenes.
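A compact sketch of the total training loss; masking the sparse KITTI groundtruth to valid pixels is an assumption consistent with common practice rather than a stated detail.

```python
def scadc_loss(stage_outputs, d_gt, conf_lidar, conf_guide):
    # stage_outputs: [S1, S2, S3] from the stacked hourglass; d_gt is sparse KITTI groundtruth.
    mask = (d_gt > 0).float()
    loss = sum((((s - d_gt) ** 2) * mask).sum() / mask.sum().clamp(min=1) for s in stage_outputs)
    return loss + ((conf_lidar - conf_guide) ** 2).mean()   # add the confidence loss L_c
```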
4.2 Experiments
Figure 4.3: Depth completion comparison. Left: Qualitative results of stereo matching (PSMNet),
lidar completion (SSDC), and our SCADC on the KITTI Depth Completion validation set. Right:
Comparison on the KITTI Depth Completion test set. Results of other works are directly from the KITTI
website. Ours is the only method that reconstructs upper scene structures.
A qualitative comparison is shown in Fig. 4.3 Left. From both numerical and visual results,
although PSMNet produces more structured upper scenes than SSDC, its depth estimation error is
larger on the lower part. By contrast, while SSDC has a smaller numerical error, it creates irregular
and unstructured depth estimations on the upper scenes.
Table 4.1: Evaluation on KITTI Depth Completion val set.
Methods RMSE Rel δ1 δ2 δ3
PSMNet 2.4107 0.1296 98.6 99.8 99.9
SSDC 1.0438 0.0191 99.3 99.8 99.9
SCADC 1.0096 0.0226 99.5 99.9 100.0
Table 4.2: Comparison on KITTI Semantic Segmentation Dataset. Our depth could enhance
SSMA performance.
Methods mIoU
SDNet [76] 51.15
SGDepth [57] 53.04
SSMA [107] 54.76
SSMA + Our SCADC depth 61.57
Our SCADC combines the advantages of
both stereo matching and depth completion to generate both scene completeness-aware and precise
depth estimations. We also compare with other depth completion methods on the KITTI test set. The
comparison is shown in Fig. 4.3 Right. Our SCADC is the only work in the comparison that successfully
reconstructs the upper scene structures.
The quantitative comparison of depth error on KITTI Depth Completion val set is in Table
4.1. Depth maps for comparison from PSMNet [17] and SSDC [68] are generated following their
steps. Note that the numerical results only evaluate depth estimations on the lower scenes.
Figure 4.4: Semantic segmentation results. SSMA with depth from our SCADC is used on KITTI
Semantic Segmentation dataset.
4.2.1 Outdoor RGBD Semantic Segmentation
Datasets. We next validate our SCADC on outdoor semantic segmentation. KITTI Semantic
Segmentation dataset contains only images, and we match the corresponding lidar frames in KITTI
Raw, which contains all public raw data. We split the fetched data into training and validation sets with a 6:1 ratio.
Although Cityscapes has more data for semantic segmentation, it only adopts stereo cameras for
depth acquisition.
Evaluations. SSMA [107] is a high-performing framework on outdoor RGBD semantic
segmentation. We follow SSMA’s setting and use their Cityscapes pretrains to perform fine-tuning
on KITTI. The standard mean intersection over union (mIoU) is adopted as the metric. Two other
RGBD outdoor semantic segmentation methods SDNet [76] and SGDepth [57] are included for
comparison. The quantitative and qualitative results are in Table 4.2 and Fig. 4.4. From the results,
our scene completeness-aware and precise depth could further help performance improvements. In
the visual results, finer structures such as road signs and traffic poles that extend to the upper scenes
could be clearly segmented.
4.3 Summary
Our SCADC leverages the scene completeness of stereo matching to help lidar
depth completion, obtaining both scene-complete and precise depth maps. Our APC module predicts
lidar confidences with a guiding supervision. With APC, information fusion from the two
modalities is successfully performed. We show that our depth maps have good upper-scene structure
for practical scenarios where objects of interest extend to the upper scenes. Such scene completeness
is important in autonomous driving systems because it extends the measurements to the upper scenes, and thus we
can feed RGBD detection or segmentation with better depth priors.
Figure 5.1: Advantages of our framework. (A) We attain zero-shot cross-dataset inference. (B)
Our framework trained on simulation data produces on-par results with the one trained on real data.
Chapter 5
Indoor Scenes: Practical Indoor Depth Estimation
In this chapter, we introduce geometry estimation from indoor scene images and its challenges.
Learning indoor depth is arguably more challenging for several reasons: (1) structure priors: depth
estimation for driving scenes imposes a strong scene structure prior on the learning paradigm. The
upper parts of images, commonly occupied by the sky or buildings, are typically farther away; on
the other hand, the lower parts are usually roads extending to the distance [24]. By contrast, the
structure priors are much weaker for indoor environments since objects can be cluttered and arranged
arbitrarily in the near field. (2) distribution: scene depth for driving scenarios tends to distribute
more evenly across near to far ranges on roads, whereas indoor depth can be concentrated in either
near or far ranges, such as zoom-in views of desks or ceilings. The uneven depth distribution makes
it challenging to predict accurate metric depth for indoor scenes. (3) camera pose: depth-sensing
devices can move in 6DoF for indoor captures, but they are typically anchored on cars for collecting
driving data where translations are usually without elevation and rotations are dominated by yaw
angle. Therefore, a desirable network needs to be more robust to arbitrary camera poses and complex
scene structures for indoor cases. (4) untextured surfaces: large untextured regions, such as walls,
make the commonly used photometric loss ambiguous.
In this work we propose DistDepth, a structure distillation approach to enhance depth accuracy
trained by self-supervised learning. DistDepth uses an off-the-shelf relative depth estimator, DPT
[90, 91] that produces structured but only relative depth (output values reflect depth-ordering
relations but are metric-agnostic). Our structure distillation strategy encourages depth structural
similarity both statistically and spatially. In this way, depth-ordering relations from DPT can be
effectively blended into metric depth estimation branch trained by left-right consistency. Our
learning paradigm only needs an off-the-shelf relative depth estimator and stereo image inputs
without their curated depth annotations. Given a monocular image at test time, our depth estimator
can predict structured and metric-accurate depth with high generalizability to unseen indoor scenes.
Distillation also helps downsize DPT’s large vision transformer to a smaller architecture, which
enables real-time inference on portable devices.
We create SimSIN, a novel dataset consisting of about 500K simulated stereo indoor images
across about 1K indoor environments. With SimSIN, we are able to investigate performances of prior
self-supervised frameworks on indoor scenes [35, 118, 119]. We show that we can fit on SimSIN
by directly training those models, but such models generalize poorly to heterogeneous domains
of unseen environments. Using our structure distillation strategy, however, can produce highly
structured and metric-accurate depth on unseen data. We experiment on several commercial-quality
simulations and real data. Further, we investigate the gap between training on simulation data v.s.
real data. We show that our DistDepth trained on simulation data only has on-par performance with
those trained on real data.
5.1 DistDepth: Structure Distillation from Expert
Basic Problem Setup. We describe the commonly adopted left-right and temporal photometric
consistency in self-supervised methods such as MonoDepth2, DepthHints, and ManyDepth in this
section. During training, I_t and I'_t are stereo pairs at timestep t. DepthNet f_d is used to predict
depth of I_t, D_t = f_d(I_t). With known camera intrinsic K and transformation T_t: I_t → I'_t using the stereo
baseline, one can back-project I_t into 3D space and then re-project to the imaging plane of I'_t
by utilizing K, D_t, and T_t. Î'_t = I_t⟨proj(D_t, T_t, K)⟩ denotes the reprojection. The objective is to
minimize the photometric loss L = pe(I'_t, Î'_t), where pe is shown as follows.
\mathrm{pe}(I'_t, \hat{I}'_t) = \kappa \, \frac{1 - \mathrm{SSIM}(I'_t, \hat{I}'_t)}{2} + (1 - \kappa)\, L_1(I'_t, \hat{I}'_t),   (5.1)
where κ is commonly set to 0.85, SSIM [116] is used to measure the image-domain structure similarity,
and L_1 is used to compute the pixel-wise difference. pe(I'_t, Î'_t) measures the photometric reconstruction
error of a stereo pair to attain left-right consistency.
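A minimal sketch of pe in Eq. 5.1, using a 3x3 average-pooling SSIM approximation in the MonoDepth style; the SSIM window and constants are assumptions.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    # Simplified per-pixel SSIM map computed with 3x3 average pooling.
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return num / den

def photometric_error(img_a, img_b, kappa=0.85):
    # pe in Eq. 5.1: weighted SSIM term plus L1 term, averaged over pixels.
    ssim_term = ((1 - ssim(img_a, img_b)) / 2).clamp(0, 1)
    l1_term = (img_a - img_b).abs()
    return (kappa * ssim_term + (1 - kappa) * l1_term).mean()
```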
Temporal neighboring frames are also utilized to compute photometric consistency. PoseNet
calculates the relative camera pose between timesteps t and t+k: T_{t+k→t} = f_p(I_t, I_{t+k}) with k ∈ {1, −1}.
Then, temporal consistency is attained by warping an image from t+k to t and calculating the photo-
metric consistency in Eq. 5.1. At inference time, depth is predicted from a monocular image via
D = f_d(I).
Applicability. We train MonoDepth2, DepthHints, and ManyDepth on the SimSIN dataset and
exemplify the scene fitting later in Fig. 5.3. Prior arts fit the training set but do not generalize well
for cross-dataset inference due to unseen complex object arrangements for indoor environments.
DistDepth: Structure Distillation from Expert. To overcome the generalizability issue when
applying self-supervised frameworks to indoor environments, we propose DistDepth (Fig. 5.2).
Figure 5.2: DistDepth overview.
DPT [90], using a dense vision transformer, can produce highly structured but only relative depth values
by D*_t = f*_d(I_t)¹. We extract the depth-domain structure of D*_t and transfer it to the self-supervised
learning branches, including DepthNet f_d and PoseNet f_p. The self-supervised branch learns metric
depth since it leverages stereo pairs with known camera intrinsic and baseline with the depth warping
operation I_t⟨proj(D_t, T_t, K)⟩. Our distillation enables f_d to produce both highly structured and
metric depth and still work in a fashion without groundtruth depth for training.
We first estimate rough alignment factors of scale a_s and shift a_t from DPT's output D*_t to the
predicted depth D_t by minimizing differences between D̄*_t = a_s D*_t + a_t and D_t with closed-form
expressions from the least-square optimization.
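A possible closed-form alignment in PyTorch via batched least squares; the per-image flattening and tensor shapes are assumptions.

```python
import torch

def align_scale_shift(d_rel, d_metric):
    # d_rel: inverted DPT output (relative depth); d_metric: student prediction. Shapes (B, H, W).
    x = d_rel.flatten(1)
    y = d_metric.flatten(1)
    A = torch.stack([x, torch.ones_like(x)], dim=-1)            # (B, N, 2)
    sol = torch.linalg.lstsq(A, y.unsqueeze(-1)).solution       # least-squares scale/shift
    a_s, a_t = sol[:, 0, 0], sol[:, 1, 0]
    return a_s.view(-1, 1, 1) * d_rel + a_t.view(-1, 1, 1)      # aligned depth
```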
Statistical loss. Compared with image-domain structures, depth-domain structures exclude
depth-irrelevant low-level cues such as textures and painted patterns on objects and show geometric
structures. Image structure similarity can be obtained by SSIM [115–117] w.r.t. statistical constraints.
Depth-domain structures also correlate to depth distribution represented by mean, variance, and
co-variance for similarity measures. Thus, we compute the SSIM with depth map inputs D̄*_t and D_t
and use the negative metric as the loss term
\mathcal{L}_{stat} = 1 - \mathrm{SSIM}(\bar{D}^*_t, D_t),   (5.2)
¹ DPT (and MiDaS) outputs relative relations in disparity (inverse depth) space since it trains on diverse data sources
(laser-based depth, depth from SfM, or stereo with unknown calibration). We invert its outputs and compute losses in
the depth space since our training data source is single.
Unlike the widely-used appearance loss that combines SSIM with an L_1 loss, we find that pixel-wise
difference measures lead to unstable training since inverting from disparity to depth magnifies
prediction uncertainty and produces much larger outliers in arbitrary ranges. In contrast, the SSIM
loss constrains the mean and variance terms of the two distributions instead of per-pixel differences
and becomes a desirable choice.
Spatial refinement loss. SSIM loss only constrains the statistical depth distribution but loses spatial
information. We next propose a spatial control using depth occluding boundary maps (Fig. 5.2 (B)).
The Sobel filter, which is a first-order gradient operator [53], is applied to compute depth-domain
gradients: g = (∂X/∂u, ∂X/∂v), where X ∈ {D̄*_t, D_t} and u, v represent the horizontal and vertical directions
on 2D grids. Then we calculate a turn-on level α = quantile(‖g‖_2, 95%) at the 95%-quantile level
of the gradient maps to determine the depth occluding boundaries, where gradients are larger than α.
We compute the 0/1 binary-value maps, E* and E, to represent occluding boundary locations by
thresholding D̄*_t and D_t with their respective α terms. Last, we calculate the Hamming distance, i.e.,
the bitwise difference, of E* and E, normalize it by the map size, and use it as the spatial loss term,
\mathcal{L}_{spat} = (E^* \oplus E)\,/\,|E|,   (5.3)
where ⊕ is the XOR operation for two boolean sets, and |E| computes the size of a set. In
implementation, to make the thresholding and binarization operations differentiable, we subtract the
respective α from D̄*_t and D_t and apply a soft-sign function, which resembles the sign function but
back-propagates smooth and non-zero gradients, to obtain maps with values in {-1, 1}. After the
division by 2, we arrive at the element-wise Hamming distance between the maps. The loss function
for structure distillation is L_dist = L_stat + 10^{-1} L_spat. The final loss function L_t for I_t combines
left-right consistency L_LR = pe(I'_t, Î'_t), temporal consistency L_temp = pe(I_t, Î_{t+k→t}), where
Î_{t+k→t} denotes forward and backward warping with k ∈ {1, −1}, and L_dist:
\mathcal{L}_t = \mathcal{L}_{LR} + \mathcal{L}_{temp} + 10^{-1}\, \mathcal{L}_{dist}.   (5.4)
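A sketch of the spatial refinement term, combining Sobel gradients, a 95%-quantile turn-on level, a tanh-based soft-sign surrogate, and a normalized Hamming-style distance; the soft-sign temperature is an assumption.

```python
import torch
import torch.nn.functional as F

def spatial_refinement_loss(d_expert, d_student, q=0.95, temperature=50.0):
    sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    sobel_y = sobel_x.transpose(2, 3)

    def soft_edges(d):                                   # d: (B, 1, H, W) depth map
        gx = F.conv2d(d, sobel_x.to(d.device), padding=1)
        gy = F.conv2d(d, sobel_y.to(d.device), padding=1)
        mag = torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)
        alpha = torch.quantile(mag.flatten(1), q, dim=1).view(-1, 1, 1, 1)  # turn-on level
        return torch.tanh(temperature * (mag - alpha))   # soft-sign surrogate, values near {-1, 1}

    e_exp, e_stu = soft_edges(d_expert), soft_edges(d_student)
    return ((e_exp - e_stu).abs() / 2).mean()            # normalized element-wise Hamming distance
```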
Figure 5.3: Intra-/Inter-Dataset inference. Prior self-supervised works can fit training data
(SimSIN), shown in the first row, but they generalize poorly to the unseen testing dataset (VA), shown in
the second and third rows. Our DistDepth can produce more structured and accurate ranges.
Table 5.1: Quantitative comparison on the VA dataset. Our DistDepth attains much lower errors
than prior works of left-right consistency. DistDepth-M further uses the test-time multi-frame
strategy in ManyDepth. See the main text.
              Test-Time Single-Frame                                          Test-Time Multi-Frame
Method        MonoDepth2 [35]   DepthHints [118]   DistDepth   Improvement    ManyDepth [119]   DistDepth-M   Improvement
MAE           0.295             0.291              0.253       -14.2%         0.275             0.239         -13.1%
AbsRel        0.203             0.197              0.175       -13.8%         0.189             0.166         -12.2%
RMSE          0.432             0.427              0.374       -13.4%         0.408             0.357         -12.5%
RMSE_log      0.251             0.248              0.213       -15.1%         0.241             0.210         -12.9%
The designed structure distillation is key to gearing up the self-supervised depth estimator
with high generalizability to unseen textures such that it better separates depth-relevant and depth-
irrelevant low-level cues. From another perspective, the student trained by left-right consistency
helps DPT learn ranges across different indoor scenes.
5.2 Experiments and Studies
5.2.1 Experiments on synthetic data
Prior self-supervised methods trained on SimSIN. We first directly train MonoDepth2, DepthHints,
and ManyDepth on SimSIN following the settings in their papers and show fitting on the training
data and inference on VA in Fig. 5.3 to investigate the generalizability. ManyDepth and DepthHints
attain better results than MonoDepth2. Our DistDepth produces highly regularized structures with
robustness to unseen examples, w.r.t. groundtruth. The range prediction also improves, which we
believe is due to better structure occluding boundary reasoning.
Error analysis on VA. We show a numerical comparison on the entire VA sequence in Table
5.1. All the methods in comparison are trained on SimSIN. We further equip DistDepth with the
test-time multi-frame strategy with cost-volume minimization introduced in ManyDepth and denote
this variant by DistDepth-M. Methods are categorized into test-time single-frame and test-time
multi-frame. In both cases, DistDepth attains lower errors than prior arts. This validates our network
design: with an expert used for depth-domain structure distillation, a student network f_d can produce
both structured and metric depth that is closer to the groundtruth.
Ablation study on VA. We first study the expert network and adopt different versions of DPT
(hybrid and legacy), whose network sizes are different. Table 5.2 shows that the student network
taught by the larger-size expert, DPT-legacy, achieves lower depth estimation errors. Without
distillation, results are worse because its estimation relies only on the photometric loss, which fails
on untextured areas like walls. As a sanity check, we also provide results of supervised training
using SimSIN’s groundtruth depth with pixel-wise MSE loss and test on the V A dataset, which
shows the gap between training on curated depth and depth from expert network’s predictions.
We next study the training strategy with different distillation losses and effects of turn-on level
α. We compare (1) w/o distillation, (2) distillation with statistical loss only, and (3) distillation with
statistical and spatial refinement loss. We demonstrate qualitative results in Fig. 5.6 to show the
depth-domain structure improvements. Without distillation, spatial structures cannot be reasoned
Figure 5.4: Qualitative results on the VA sequence. Depth and error maps are shown for DistDepth
and MonoDepth2 for comparison. These examples demonstrate that our DistDepth predicts geomet-
rically structured depth for common indoor objects.
Table 5.2: Study on the choice of the expert network for distillation. Different versions of
DPT [90] that vary in network sizes (# of params) are adopted as the expert to teach the student.
DPT-legacy localizes occluding contours better and leads to a better-performing student network.
The results of supervised learning are provided as a reference.
                         Self-Supervised                                    Supervised
Expert        w/o distillation   DPT - hybrid   DPT - legacy     with groundtruth
# of params   -                  123M           344M             -
MAE           0.295              0.276          0.253            0.221
AbsRel        0.203              0.188          0.175            0.158
RMSE          0.432              0.394          0.374            0.325
RMSE_log      0.251              0.227          0.213            0.188
crisply. With statistical refinement, depth structures are more distinct. Adding spatial refinement,
the depth-domain structures show fine-grained details. We further analyze the effects of different
turn-on levels of α. A low α makes structures blurry since the refinement does not focus on the
high-gradient occluding boundaries as a high α does, which identifies only high-gradient areas as
occluding boundaries and benefits structure knowledge transfer.
Figure 5.5: Results on Hypersim. Depth map and textured pointcloud comparison of MonoDepth2
and our DistDepth. With structure distillation, DistDepth attains better object structure predictions,
such as tables and paintings on the wall shown in (A) and much less distortion for the large bookcase
in (B).
Comparison on Hypersim. We next exhibit depth and textured pointcloud in Fig. 5.5 for some
scenes in Hypersim. Two different views are adopted for pointcloud visualization. One can find that
our DistDepth predicts better geometric shapes in both depth map and pointcloud.
5.2.2 Experiments on real data
Closing sim-to-real gap. We compare results of training on simulation (SimSIN) and real data
(UniSIN) to investigate the performance gap. We examine (1) training MonoDepth2 on simulation
and evaluate on real data, (2) training MonoDepth2 on real data and evaluate on real data, (3)
training DistDepth on simulation and evaluate on real data, and (4) training DistDepth on real and
evaluate on real data. Fig. 5.7 illustrates the results of the four settings. Comparing (1) and (2), one
can find that MonoDepth2 trained on real data produces more reliable results than on simulation.
By contrast, this gap becomes much less obvious when comparing (3) and (4) using DistDepth. Results of
(3) are on-par with (4) and sometimes even produce better geometric shapes like highlighted areas.
The results validate our proposals on both method and dataset levels. First, DistDepth utilizes
an expert network to distill knowledge to the student. The distillation substantially adds robustness to
models trained on simulation data and makes the results comparable to models trained on real data.
Figure 5.6: Qualitative study for depth-domain structure improvement. Two examples (A)
and (B) are shown to study the effects of distillation (dist) losses and turn-on level α in spatial
refinement to validate our design.
Figure 5.7: Comparison on UniSIN. Geometric shapes produced from DistDepth are better than
MonoDepth2. DistDepth concretely reduces the gap for sim-to-real: (3) and (4) attain on-par results
and sometimes training on simulation shows better structure than training on real.
This shows the ability of DistDepth for closing the gap between simulation and real data. Second,
stereo simulation data provide a platform for left-right consistency to learn metric depth from stereo
triangulation. We show a collection of results in Fig. 5.8 using DistDepth that is trained purely on
simulation.
Figure 5.8: Results on real data (UniSIN) using our DistDepth only trained on simulation
(SimSIN).
Evaluation on NYUv2. Table 5.3 shows evaluations on NYUv2. We first train our DistDepth
on SimSIN and finetune on NYUv2 with only temporal consistency. Note that one finetuned model
(Sup:△) is categorized as semi-supervised since it utilizes an expert that has been trained with
NYUv2’s curated depth. The finetuned models produce the best results among methods without
NYUv2’s depth supervision and even attain comparable results to many supervised methods. We
next train DistDepth only on simulation (SimSIN) or real data (UniSIN) and evaluate on NYUv2.
Performances of the model trained on SimSIN only drop a little compared with that trained on
UniSIN, which justifies our sim-to-real advantage again. Without involving any training data in
NYUv2, DistDepth still achieves comparable performances to many supervised and self-supervised
methods, which further validates our zero-shot cross-dataset advantage.
Table 5.3: Evaluation on NYUv2. Sup: ✓- supervised learning using groundtruth depth, ✗-
not using groundtruth depth, and △- semi-supervised learning (we use the expert finetuned on
NYUv2, where we have indirect access to the groundtruth). We achieve the best results among
all self-supervised methods, and our semi-supervised and self-supervised models finetuned on NYUv2
even outperform many supervised methods. The last two rows show results without groundtruth
supervision and without training on NYUv2. In this challenging zero-shot cross-dataset evaluation,
we still achieve comparable performances to many methods trained on NYUv2. Error and accuracy
metrics are reported.
Methods                  Sup   Train on NYUv2   AbsRel   RMSE    δ1     δ2     δ3
Make3D [95]              ✓     ✓                0.349    1.214   44.7   74.5   89.7
Li et al. [63]           ✓     ✓                0.143    0.635   78.8   95.8   99.1
Eigen et al. [27]        ✓     ✓                0.158    0.641   76.9   95.0   98.8
Laina et al. [58]        ✓     ✓                0.127    0.573   81.1   95.3   98.8
DORN [31]                ✓     ✓                0.115    0.509   82.8   86.5   99.2
AdaBins [11]             ✓     ✓                0.103    0.364   90.3   98.4   99.7
DPT [90]                 ✓     ✓                0.110    0.357   90.4   98.8   99.8
Zhou et al. [145]        ✗     ✓                0.208    0.712   67.4   90.0   96.8
Zhao et al. [141]        ✗     ✓                0.189    0.686   70.1   91.2   97.8
Bian et al. [13]         ✗     ✓                0.157    0.593   78.0   94.0   98.4
P²Net+PP [136]           ✗     ✓                0.147    0.553   80.4   95.2   98.7
StructDepth [61]         ✗     ✓                0.142    0.540   81.3   95.4   98.8
MonoIndoor [50]          ✗     ✓                0.134    0.526   82.3   95.8   98.9
DistDepth (finetuned)    ✗     ✓                0.130    0.517   83.2   96.3   99.0
DistDepth (finetuned)    △     ✓                0.113    0.444   87.3   97.4   99.3
DistDepth (SimSIN)       ✗     ✗                0.164    0.566   77.9   93.5   98.0
DistDepth (UniSIN)       ✗     ✗                0.158    0.548   79.1   94.2   98.5
5.3 Summary
This work targets a practical indoor depth estimation framework with the following features: training
without depth groundtruth, effective training on simulation, high generalizability, and accurate
and real-time inference. We first identify the challenges of indoor depth estimation and study the
applicability of existing self-supervised methods with left-right consistency on SimSIN. Geared up
with the depth-domain structure knowledge distilled from an expert, we see substantial improvement
in both inferring finer structures and more accurate metric depth. We show zero-shot cross-dataset
inference that proves its generalizability to work on heterogeneous data domains and attain a
broadly applicable depth estimator for indoor scenes. Even more, depth learned from simulation
data transfers well to real scenes, which shows the success of our distillation strategy.
Chapter 6
Meta-Learning for Single-Image Depth Prediction: A
Data-Efficient Learning Approach
In the prior DistDepth work, we attain robust indoor depth estimation by distilling knowledge
from a relative-depth pretrained model into a metric branch. However, this approach relies on
a model pretrained on a large dataset of mixed scenes to attain robustness. Then an interesting
question arises: can we attain a more robust mapping in a more data-efficient way, without using
extra data or pretrained knowledge?
In recent developments of learning algorithms, meta-learning formulates network optimization
as a bi-level optimization problem, i.e., a dual optimizer loop is created to regularize the
gradients, which helps attain more high-level and abstract representations describing the target
domains. In the literature, meta-learning has the benefits of few-shot learning from a few data
samples and higher domain generalization. The following sections introduce our work adopting
meta-learning for depth estimation.
6.1 Introduction
Learning precise image-to-depth mappings is challenging due to domain gaps. A model weakly
capturing such relations may produce vague depth maps or even fail to identify near and far fields.
An intuitive solution is to learn from large-scale data or employ side information such as normals
or a pretrained model as guidance [123]. However, these require extra information burdens. Without
those resources, when training on data of limited appearance and depth variation (referred to as scene
variety), with an extreme case where only sparse and irrelevant RGB-D pairs are available, networks
can barely learn a valid image-to-depth mapping (Fig. 6.5).
Figure 6.1: Geometry structure comparison in 3D point cloud view. We back-project the
predicted depth maps from images into textured 3D point clouds to show the geometry. The proposed
Meta-Initialization has better domain generalizability that leads to more accurate depth prediction
and hence better 3D structures (zoom in for the best view).
Inspired by meta-learning's advantage of domain generalizability, i.e., training robust models
that achieve better results on unknown domains, usually learned from limited-source data
[30, 75], we pioneer the study of how meta-learning applies to single-image depth prediction. The
commonly-used meta-learning problem setup follows the context of few-shot multitask settings,
where a task represents a distribution to sample data from, and most tasks are designed for image
classification. Unlike those works, we study a more complex problem of scene depth estimation: the
difficulties lie in per-pixel and continuous range values as outputs, in contrast to global and discrete
outputs for image classification. Even for the same environment, image and depth captures
can vary greatly; for example, frames adjacent to a close-up view of an object can capture large room spaces. This
observation indicates that our tasks are without clear task boundaries in meta-learning's context,
and thus we propose to treat each training sample as a fine-grained task.
We follow gradient-based meta-learning, which adopts a meta-optimizer and a base-optimizer.
The base-optimizer explores multiple inner steps to find weight-updating directions. Then the meta-
optimizer updates the meta-parameters following the explored trends. After a few epochs of bilevel
training, we learn a mapping function θ_prior from image to depth. It becomes a better initialization
for the subsequent supervised learning (Fig. 6.2). We attribute the improvements to this progressive
learning style. Note that meta-learning and the following supervised learning operate on the same
training set without using extra data.
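A Reptile-style sketch of the prior-learning stage described above; model/loader, the L1 regression loss, and all hyperparameter values are illustrative assumptions.

```python
import copy
import torch

def meta_initialization(model, loader, meta_iters=1000, inner_steps=4,
                        inner_lr=1e-4, meta_lr=0.5):
    # Explore L inner steps on K sampled fine-grained tasks (single RGB-D pairs),
    # then move the meta-parameters toward the explored weights (Reptile style).
    for _ in range(meta_iters):
        inner = copy.deepcopy(model)                       # inner copy of theta
        opt = torch.optim.Adam(inner.parameters(), lr=inner_lr)
        images, depths = next(iter(loader))                # a minibatch of K fine-grained tasks
        for _ in range(inner_steps):                       # L-step exploration
            loss = torch.nn.functional.l1_loss(inner(images), depths)
            opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():                              # meta update toward explored weights
            for p, p_in in zip(model.parameters(), inner.parameters()):
                p.add_(meta_lr * (p_in - p))
    return model   # theta_prior, used to initialize the subsequent supervised learning
```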
We show that meta-learning induces a prior with higher generalizability to unseen scenes
with better image-to-depth understanding, which can identify depth-relevant/-irrelevant cues
more robustly and suppress depth-irrelevant cues. To validate the generalizability brought by
meta-learning, we adopt multiple popular indoor datasets [89, 92, 101, 123] and devise protocols for
zero-shot cross-dataset evaluation. This greatly differs from most previous works focusing only on
intra-dataset evaluation, training and testing on a single dataset, which does not validate in-the-wild
performances for practical use, such as applying to user-collected data by different cameras. We
qualitatively and quantitatively show consistently superior performance by meta-learning on various
network structures, including general and dedicated depth estimation architecture.
This work does not only focus on improvements in depth estimation. From meta-learning's perspec-
tive, we introduce fine-grained tasks for a continuous, per-pixel, and real-valued regression problem
to advance meta-learning's study on practical problems.
6.2 Meta-Learning Review
Meta-Learning principles [43, 96] illustrate an oracle for learning how to learn. Popular gradient-
based algorithms such as MAML [30] and Reptile [75] are formulated in bilevel optimization
fashion with a base- and meta- optimizer. MAML uses gradients computed on the query set to
update the meta-parameters. Reptile does not distinguish support and query set and simply samples
Figure 6.2: Meta-Initialization for learning image-to-depth mappings. The prior learning stage
adopts a base-optimizer and a meta-optimizer. Inside each meta-iteration, K fine-grained tasks are
sampled and used to minimize regression loss. L steps are taken by the base-optimizer to search for
weight update directions for these K tasks. Then, the meta-optimizer follows the explored inner
trends to update meta-parameters in the Reptile style [75]. Image-to-depth prior θ_prior is output at
the end of the stage. θ_prior is then used as the initialization for the subsequent supervised learning
for the final model θ*.
data from task distribution and updates meta-parameters by differences between inner and meta-
parameters. We refer readers to [44] for algorithmic developments [3,4,20,30,74,75,87,88,130,132].
Meta-learning is usually used in domain adaption, generalization and few-shot learning for vision
problems. Most works focus on image classification [7, 8, 14, 18, 25, 48, 59, 62, 85, 98, 130, 142],
and few on human motion prediction [38], tracking [109], object detection [51, 113, 129, 138],
semantic segmentation [36]. A recent work [32] studies visual regression by meta-learning, but their
regression is considered simple since they only regress a global rotation angle for a synthetic object
in each image. Furthermore, their image background is usually all-white, or objects are simply
overlaid on out-of-context images, which are far from real cases. Geometry prediction from images
is arguably harder than the above problems since (1) it needs to regress per-pixel real-valued depth
rather than global class scores, and (2) it lifts information from imagery into geometry and becomes
inherently ill-posed. Only a few works study geometry regression with meta-learning, including
stereo- [105], NeRF- [103], and SDF- [100, 111] based methods with different focuses. We are the
first to fundamentally study how meta-learning helps indoor monocular scene depth estimation to
gain higher structure resolvability.
Some works use meta-learning for depth but only for driving scenes with a very different problem
setup. [104, 139, 140] are built under online learning and adaptation using stereo or monocular
videos for temporal consistency. They require affinity in nearby frames and meta-optimize within
a single sequence, while our method is non-online, purely for single images, and meta-optimizes
across multiple environments in a training set without temporal/ stereo frames. [102] groups several
driving datasets as tasks but requires multiple training sets. We are the first to study meta-learning
for depth from single images without assuming nearby frame affinity. The problem is arguably
harder due to high appearance variation for single images from various environments. We propose
fine-grained task to fulfill learning from pure single images.
6.3 Fine-Grained Task Meta-Learning
6.3.1 Difficulty in Accurate Depth Estimation
A model needs to distinguish depth-relevant and depth-irrelevant low-level cues to accurately
estimate depth from images. The former shows color or radiance changes at object boundaries
against the background. In the latter case, geometry is invariant to color changes, such as decorations or
object textures. Depth-irrelevant cues frequently appear in indoor scenes due to cluttered textured
objects in near fields.
For example, a painting triggers many depth-irrelevant features and confuses networks. Accurate
estimation then heavily relies on sufficient variety in the training data, which demonstrates mappings from images to
depth and enables learning from global context to suppress such local high-frequency details. For
example, [12, 90, 91] train on mixed data sources to robustly estimate depth in the wild. When
Figure 6.3: Fitting to training environments. var shows the depth variance in the highlighted regions. We compare fitting to training environments between pure meta-learning (Meta) and direct supervised learning (DSL) on the limited scene-variety dataset, Replica. Meta produces smooth and more precise depth. Depth-irrelevant textures on planar regions are resolved more correctly. In contrast, DSL produces irregularities affected by local high-frequency details, especially with ResNet50. See Sec. 6.4.1 for details and Sec. 6.3.4 for the explanation.
training set size is limited, there are insufficient examples to describe mappings between the
two domains. A model may either fail to estimate precise depth or simply memorize seen
image-depth pairs without generalizing to unseen examples [5, 28]. See Fig. 6.5. Thus, we exploit
meta-learning for its advantages: without extra data sources, it attains few-shot learning and
higher generalizability. To adapt meta-learning to single-image depth estimation, we propose
the fine-grained task as follows.
6.3.2 Single RGB-D Pair as Fine-Grained Task
Definition. Single-image depth prediction learns a function f_θ : I → D, parameterized by θ, to map from imagery to depth. A training set (I_train, D_train), containing images I ∈ I_train and associated depth maps D ∈ D_train, is used to train a model. In a minibatch of size K, each pair (I_i, D_i), ∀i ∈ [1, K], is treated as a fine-grained task. Fine-grained tasks are mutually exclusive: no two scenes sampled from the meta-distribution, i.e., the whole RGB-D dataset, share the same scene appearance and depth relation. Proof: assume we have two different scene images I_1 and I_2, each containing a set of regions, R_1 and R_2 respectively. The null set φ ∉ R_−, where R_− = (R_1 − R_2) ∪ (R_2 − R_1) contains the regions that appear only in either I_1 or I_2, since I_1 and I_2 are different frames and inevitably capture different regions. Thus, any two scenes have different appearance and depth relations.
Difference with tasks in meta-learning. Fine-grained tasks differ from tasks in common meta-learning or few-shot learning usage [30], where a task contains a data distribution and batches are sampled from it. Fine-grained tasks do not contain a data distribution but are themselves sampled from the meta-distribution, the whole RGB-D dataset. For example, a navigating agent captures image and depth pairs; these RGB-D pairs are sampled from the meta-distribution.
Design. Each fine-grained task is used to learn on its specific RGB-D pair. The design is motivated by the fact that appearance and depth variation can be high: a view looking at small desk objects and a view of a large room space are highly dissimilar in content and range. Mappings from their scene appearance to range values are different, yet they can be captured in the same environment or even in neighboring frames. This contrasts with image classification, where class samples share a common label. This observation explains why we treat each RGB-D pair, instead of each environment, as a fine-grained task.
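For illustration, a minimal sketch (not the dissertation's released code) of how fine-grained tasks can be exposed to a training loop: each dataset item is a single RGB-D pair, so a minibatch of size K is directly a batch of K fine-grained tasks sampled from the meta-distribution. The class name and tensor shapes below are hypothetical.

```python
# Minimal sketch: each RGB-D pair is one fine-grained task, so a minibatch of
# size K is simply a batch of K tasks drawn from the meta-distribution
# (the whole training set).
import torch
from torch.utils.data import Dataset, DataLoader

class FineGrainedTaskDataset(Dataset):
    """Each item is a single (image, depth) pair, i.e., one fine-grained task."""
    def __init__(self, images, depths):
        assert len(images) == len(depths)
        self.images, self.depths = images, depths

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        # One RGB-D pair == one task; no per-task support/query split is needed.
        return self.images[idx], self.depths[idx]

# Toy example with random tensors standing in for I_train and D_train.
images = torch.rand(100, 3, 256, 256)   # RGB
depths = torch.rand(100, 1, 256, 256)   # metric depth
loader = DataLoader(FineGrainedTaskDataset(images, depths),
                    batch_size=50, shuffle=True)  # K = 50 tasks per meta-iteration
```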
6.3.3 Meta-Initialization on Depth from Single Image
We describe our approach based on gradient-based meta-learning to learn a good initialization (Fig.
6.2).
Prior learning stage. In the first, prior learning stage, we adopt a meta-optimizer and a base-optimizer. In each meta-iteration, K fine-grained tasks are sampled as a minibatch from the whole training set: (I_i, D_i) ∼ (I_train, D_train), ∀i ∈ [1, K]. Then we take L steps to explore gradient directions that minimize the regression loss and obtain (θ_expl^1, θ_expl^2, ..., θ_expl^L):

θ_expl^i ← θ_expl^{i−1} − α (1/K) ∇_θ Σ_{k∈[1,K]} L_reg(I_k, D_k; θ_expl^{i−1}),  ∀i ∈ [1, L].   (6.1)
Figure 6.4: Loss curves for MAML vs. Reptile.
After the L-step exploration, we update the meta-parameters in the Reptile style [75], i.e., following the weight-update direction explored in the inner steps:

θ_meta^j ← θ_meta^{j−1} − β (θ_meta^{j−1} − θ_expl^L),   (6.2)

where α and β are the respective learning rates, and i and j denote inner and meta-iterations.
Compared with MAML [30], we find Reptile more suitable for training on fine-grained tasks. First, as mentioned in the Reptile paper [75], it is designed without a support/query split, and thus it inherently does not require multiple data samples per task, which matches our fine-grained task definition. Next, first-order MAML computes gradients on the query set at the last inner step θ_expl^L to update the meta-parameters. However, only one sample exists in each fine-grained task, and fine-grained tasks are mutually exclusive and can differ greatly, depending on R_−. Thus, if one explores on a support split and computes gradients on a query split whose samples share no common components, the gradients are nearly random and prevent convergence. By contrast, Reptile entails no support/query split and requires no common components between samples, so it stabilizes training towards convergence and becomes our choice. We show the loss curves in Fig. 6.4.
Supervised learning stage. Prior knowledge θ_prior is learned after the first stage. We treat it as the initialization for the subsequent supervised learning with conventional stochastic gradient descent to minimize the regression loss:

θ* ← min_θ L_reg(I_train, D_train | θ_prior).   (6.3)

Last, the test set (I_test, D_test) is used to evaluate the depth estimation performance of θ*. Algorithm 1 organizes the whole procedure. The implementation only needs a few lines of code as a plugin to depth estimation frameworks, and it brings higher model generalizability, as shown in later experiments.
Difference with other learning strategies. Compared with widely used pretraining, which requires multiple data sources to gain generalizability [90, 91, 133], both the prior learning and supervised learning stages operate on the same dataset without access to extra data or off-the-shelf models. Thus, they are free from those burdens.
Compared with simple gradient accumulation [93], where gradients are accumulated over several batches and then used to update parameters only once, the bilevel optimization keeps updating the inner parameters at every one of the L steps to find the local niche for the current batch. Besides, gradient accumulation has the effect of a large batch size, which might cause overfitting and degrade model generalizability.
Algorithm 1 Our Meta-Initialization Procedure
1: for epoch = 1 : N do
2:   for j = 1 : T (iterations) do
3:     θ_expl^0 ← θ_meta^j; (I_1, D_1), (I_2, D_2), ..., (I_K, D_K) ∼ (I_train, D_train).
4:     for i = 1 : L (steps) do
5:       θ_expl^i ← θ_expl^{i−1} − α (1/K) ∇_θ Σ_{k∈[1,K]} L_reg(I_k, D_k; θ_expl^{i−1}).
6:     end for
7:     θ_meta^j ← θ_meta^{j−1} − β (θ_meta^{j−1} − θ_expl^L).
8:   end for
9: end for
10: Prior θ_prior ← θ_meta^T at epoch N.
11: Use θ_prior as initialization. Supervised learning by θ* ← min_θ L_reg(I_train, D_train | θ_prior).
Figure 6.5: Analysis of scene variety and model generalizability. (A) shows that limited training scenes constrain learning image-to-depth mappings, with an extreme case (A2) of only one training image. (B) shows that although a model (A4) fits well on training scenes, it still cannot generalize to unseen scenes, especially wall paintings with many depth-irrelevant cues. Meta-initialization attains better model generalizability. See Sec. 6.3.4 for the explanation.
6.3.4 Strategy and Explanation
Meta-Initialization. We next analyze the meta-learning behavior with the fine-grained task. Inside each meta-iteration, the base-learner explores the neighborhood with L steps using K fine-grained tasks. Compared with a simple single-step update, the meta-update can be seen as first taking L steps of amortized gradient descent with a lower learning rate to delicately explore local loss manifolds, then updating the meta-parameters by the trends shown in the inner steps with a step size β towards θ_expl^L. θ_prior after the prior learning stage may underfit the training set, since the algorithm does not wholly follow the optimal gradients for each batch but moderates them with β. However, this avoids overfitting to seen RGB-D pairs and forces the inner exploration to reach a higher-level image-to-depth understanding. θ_prior then becomes a good initialization for downstream RGB-D learning.
Progressive learning perspective. Algorithm 1 can be seen as progressive learning on a training set. At the first stage, meta-learning benefits learning coarse but smooth depth from global context. In Fig. 6.3 we compare first-stage meta-learning with direct supervised learning on a dataset of limited scene variety. Meta-learning estimates smooth depth shapes and is free from the irregularity that direct supervised learning encounters. The irregularity indicates that the dataset does not provide sufficient scene variety to demonstrate how images map to depth in various environments, which is needed to learn smooth depth from global context; consequently, only local high-frequency cues show up. To illustrate, if only sparse and unrelated scene images are presented, finding a function that satisfactorily fits those scenes with smooth depth from global context is hard. See Fig. 6.5. The irregularity occurs especially at cluttered objects or textured surface areas, since those depth-irrelevant local cues are barely suppressed.
In summary, the progressive fashion first learns coarse but smooth depth via θ_prior. Then, the network learns finer depth in the second, supervised stage.
6.4 Experiments and Discussion
Table 6.1: Generalizability with different scene variety. We compare single-stage meta-learning (only prior learning) and direct supervised learning. The ConvNeXt-Base backbone is used. a → b means training on dataset a and testing on dataset b. Replica and HM3D respectively hold lower and higher scene variety for training. Meta-learning brings much larger improvements, especially when trained on the low scene-variety Replica.
Replica → VA | HM3D → VA
Method MAE AbsRel RMSE MAE AbsRel RMSE
Direct supervised learning 0.718 0.538 1.078 0.544 0.456 0.715
Meta-Learning 0.548 0.430 0.761 0.427 0.369 0.603
Improvement -23.6% -20.1% -29.4% -21.5% -19.1% -15.7%
Aims. We validate our meta-initialization with five questions. Q1 Can meta-learning improve
learning the image-to-depth mapping on limited scene-variety datasets? (Sec. 6.4.1) Q2 What improvements can meta-initialization bring compared with the most popular ImageNet-initialization?
(Sec. 6.4.2) Q3 How does meta-initialization help zero-shot cross-dataset generalization? (Sec. 6.4.3)
Q4 How does more accurate depth help learn better 3D representation? (Sec. 6.4.4) Q5 How is the
proposed fine-grained task related to other meta-learning findings? (Sec. 6.4.5)
Datasets:
• Hypersim [92] has high scene variety with 470 synthetic indoor environments, from small rooms
to large open spaces, with about 67K training and 7.7K testing images.
• HM3D [89] and Replica [101] are associated with 200K and 40K images that are taken from
SimSIN [123]. HM3D has 800 scenes with much higher scene variety than Replica, which only
has 18 overlapping scenes.
• NYUv2 [99] contains 654 real testing images. It uses older camera models with high imaging noise and limited camera viewing directions.
• VA [123] consists of 3.5K photorealistic renderings for testing under challenging lighting conditions and arbitrary camera viewing directions.
Training Settings. We use ResNet [41], ConvNeXt [66], and their variants as the network architecture to extract bottleneck features. Then we build a depth regression head following [35], containing 5 convolution blocks with skip connections from the encoder. Each convolution block contains a 3x3 convolution, an ELU activation, and a bilinear 2× upsampling layer, so that the final output recovers the input size. The channel sizes of the convolution blocks are (256, 128, 64, 32, 16). Last, a 3x3 convolution for 1-channel output with a sigmoid activation is used to get inverse depth, which is then converted to depth [24]. We set N = 5, L = 4, K = 50, (α, β) = (0.001, 0.5) for ResNet, and (α, β) = (0.0005, 0.5) for ConvNeXt. At the supervised learning stage, we train models for 15 epochs with a learning rate of 3×10^-4, optimized by AdamW [67] with a weight decay of 0.01. The input size to the network is 256×256. The L2 loss is used as L_reg.
Metrics. We adopt common monocular depth estimation evaluation metrics. Error metrics (in meters; the lower, the better): Mean Absolute Error (MAE), Absolute Relative Error (AbsRel), and Root Mean Square Error (RMSE). Threshold accuracy δ_C (in %; the higher, the better): the percentage of pixels whose ratio between prediction and groundtruth is within 1.25^C, C ∈ {1, 2, 3}; a higher percentage implies more structured, hence better, depth.
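A short sketch of how these metrics can be computed from predicted and groundtruth depth maps (in meters); the validity mask and the function name are assumptions for illustration.

```python
# Sketch: MAE, AbsRel, RMSE, and threshold accuracies delta_1..3 over valid pixels.
import torch

def depth_metrics(pred, gt, eps=1e-6):
    mask = gt > eps
    pred, gt = pred[mask], gt[mask]
    mae = (pred - gt).abs().mean()
    absrel = ((pred - gt).abs() / gt).mean()
    rmse = torch.sqrt(((pred - gt) ** 2).mean())
    ratio = torch.maximum(pred / gt, gt / pred)
    deltas = [(ratio < 1.25 ** c).float().mean() * 100 for c in (1, 2, 3)]
    return {"MAE": mae.item(), "AbsRel": absrel.item(), "RMSE": rmse.item(),
            "d1": deltas[0].item(), "d2": deltas[1].item(), "d3": deltas[2].item()}

print(depth_metrics(torch.rand(1, 1, 64, 64) + 0.5, torch.rand(1, 1, 64, 64) + 0.5))
```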
Table 6.2: Effects of meta-initialization on intra-dataset evaluation. We train and test meta-initialization (full Algorithm 1) on the same dataset. Hypersim and NYUv2, which have higher scene variety, are used. Using the same architecture, meta-initialization (+Meta) consistently outperforms ImageNet-initialization (no marks). Both error and accuracy metrics are reported.
Hypersim MAE AbsRel RMSE δ1 δ2 δ3
ResNet50 1.288 0.248 1.775 64.8 87.1 94.7
ResNet50 + Meta 1.205 0.239 1.680 66.7 87.9 95.0
ResNet101 1.197 0.234 1.671 67.4 88.5 95.3
ResNet101 + Meta 1.158 0.220 1.595 68.0 89.0 95.4
ConvNeXt-base 1.073 0.201 1.534 73.6 91.1 96.3
ConvNeXt-base + Meta 0.994 0.188 1.425 74.9 91.7 96.5
NYUv2 MAE AbsRel RMSE δ1 δ2 δ3
ResNet50 0.345 0.131 0.480 83.6 96.4 99.0
ResNet50 +Meta 0.325 0.122 0.454 85.4 96.8 99.3
ResNet101 0.318 0.120 0.448 85.6 97.1 99.3
ResNet101 +Meta 0.303 0.112 0.420 86.7 97.4 99.4
ConvNeXt-base 0.273 0.101 0.394 89.4 97.9 99.5
ConvNeXt-base+Meta 0.266 0.099 0.387 89.8 98.1 99.5
6.4.1 Meta-Learning on Limited Scene Variety
We first show how single-stage meta-learning (only the prior stage) performs. We train N = 15 epochs of meta-learning and compare with 15 epochs of direct supervised learning, where both runs have already converged. The other hyperparameters are the same as given in Training Settings. The Replica dataset, with its limited scene variety, is used to verify the gain under limited sources.
Fig. 6.3 shows fitting to training scenes. From the figure, meta-learning is capable of identifying near and far fields without the irregularity that direct supervised learning struggles with. Under limited training scenes, meta-learning induces a better image-to-depth mapping that delineates object shapes, separates depth-relevant and -irrelevant cues, and shows flat planes where rich depth-irrelevant textures exist. The observation follows the explanation in Sec. 6.3.4.
We next numerically examine generalizability to unseen scenes when training on data of different levels of scene variety. HM3D (high scene variety) and Replica (low scene variety) are used as training sets, and VA is used for testing. Table 6.1 shows that models trained by single-stage meta-learning substantially outperform direct supervised learning, with 15.7%-29.4% improvements. The advantage is more evident when trained on the lower scene-variety Replica.
Figure 6.6: Depth map qualitative comparison. Results of our meta-initialization have better object shapes with clearer boundaries. Depth-irrelevant textures are suppressed, and flat planes are predicted, as shown in the Hypersim Row 2 ceiling and Row 3 textured-wall examples.
6.4.2 Meta-Initialization v.s. ImageNet-Initialization
Sec. 6.4.1 shows that single-stage meta-learning induces much better depth regression, but the depth is not yet detailed. We next train the full Algorithm 1, using the meta-learned weights as initialization for the subsequent supervised learning. We go beyond limited sources and train on higher scene-variety datasets. Intuitively, higher scene variety helps supervised learning attain better depth prediction and might diminish meta-learning's advantages in few-shot and low-source learning. However, such studies are practical for validating meta-learning in real-world applications. Comparison is drawn with baselines of direct supervised learning without meta-initialization, which begin from ImageNet-initialization instead.
Table 6.2 shows intra-dataset evaluation that trains and tests on each official data split. Evaluation is capped at 20m depth for Hypersim and 10m for NYUv2.
Notably, using meta-initialization attains consistently lower errors and higher accuracy than the baselines, especially on AbsRel (on average +6.5%) and δ1 (on average +1.69 points), which indicates more accurate depth structure is predicted. We further display qualitative comparisons for depth and 3D point cloud views in Fig. 6.6 and Fig. 6.1. The gain comes purely from the better training scheme, without more data or constraints, advanced losses, or model design.
6.4.3 Zero-Shot Cross-Dataset Evaluation
To faithfully validate a trained model in the wild, we design protocols for zero-shot cross-dataset inference. High scene-variety and larger synthetic datasets, Hypersim and HM3D, are used as training sets. VA, Replica, and NYUv2 serve as test sets, and their evaluations are capped at 10m. In the protocol we median-scale the prediction to the groundtruth to compensate for different camera intrinsics.
In Table 6.3, compared with ImageNet-initialization, meta-initialization consistently improves nearly all the metrics, especially δ1 (on average +1.97 points). The gain comes from the meta-prior attaining a better image-to-depth mapping with coarse but smooth and reasonable depth. Conditioned on this initialization, the learning better calibrates to the open-world image-to-depth relation and hence generalizes better to unseen scenes.
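A sketch of the median-scaling step used in this protocol, under the assumption that the prediction is rescaled by the ratio of groundtruth to prediction medians over valid pixels and then capped at 10 m; the exact masking in the evaluation code may differ.

```python
# Sketch: median-scale a predicted depth map to the groundtruth before evaluation.
import torch

def median_scale_and_cap(pred, gt, cap=10.0, eps=1e-6):
    mask = (gt > eps) & (gt <= cap)                     # valid pixels within the cap
    scale = torch.median(gt[mask]) / torch.median(pred[mask])
    pred_scaled = (pred * scale).clamp(max=cap)
    return pred_scaled, mask

pred = torch.rand(1, 1, 64, 64) * 5 + 0.5
gt = torch.rand(1, 1, 64, 64) * 8 + 0.5
pred_scaled, valid = median_scale_and_cap(pred, gt)
```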
We further experiment with recent high-performing architectures dedicated to depth estimation, including BTS [60], DPT (hybrid and large) [90], and DepthFormer [64]. Table 6.4 displays the comparison and shows that meta-initialization consistently improves zero-shot cross-dataset inference with these dedicated architectures, bringing higher generalizability to existing models.
6.4.4 Depth Supervision in NeRF
We show that more accurate depth from meta-initialization can better supervise the distance d a ray travels in NeRF. d is determined by the volumetric rendering rule [22]. In addition to the pixel color loss, we use a monocular predicted distance map d*, converted from depth, to supervise the training by L_D = |d* − d|. The experiment is conducted on Replica's office-0 environment with 180 training views. After 30K training steps, we obtain NeRF-rendered views and calculate commonly used image quality metrics (PSNR and SSIM; the higher, the better). We use ConvNeXt-base to predict d*. The comparison is made between (A) without meta-initialization and (B) with meta-initialization. Results: (A): (38.67, 0.9629); (B): (39.29, 0.9680). The results show that better image quality is attained, induced by the better 3D representation in depth from meta-initialization.
We show more quantitative comparisons for depth-supervised NeRF on Replica in Table 6.5 and a qualitative comparison in Fig. 6.7.
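A minimal sketch of adding this distance term to a NeRF objective: the rendered per-ray color and distance are assumed to come from a standard volumetric renderer, and the loss weight is a hypothetical choice not specified in the text.

```python
# Sketch: NeRF photometric loss plus the monocular distance supervision L_D = |d* - d|.
import torch

def nerf_loss(rendered_rgb, gt_rgb, rendered_dist, d_star, lambda_d=0.1):
    color_loss = ((rendered_rgb - gt_rgb) ** 2).mean()   # standard photometric term
    dist_loss = (d_star - rendered_dist).abs().mean()    # L_D from predicted distance
    return color_loss + lambda_d * dist_loss

rays = 1024
loss = nerf_loss(torch.rand(rays, 3), torch.rand(rays, 3),
                 torch.rand(rays), torch.rand(rays))
```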
Figure 6.7: Image quality comparison for NeRF rendering. We show the quality metrics (the
higher the better) under each image. Zoom in for the best view.
6.4.5 How is the fine-grained task related to other meta-learning studies?
There are several previous findings on learning techniques or issues related to meta-learning. Here we discuss how those findings apply to fine-grained tasks.
Relation to domain-agnostic task augmentation. Domain-agnostic task augmentation densifies the sampled data points in each task to add robustness, e.g., label noise [87], image transformations [74], and MetaMix [130]. Since our fine-grained task contains only one sample per task, domain-agnostic augmentation reduces to data augmentation: we can simply inject mild depth label noise and image transformations, such as left-right flips and color jittering, to create derivative samples associated with each fine-grained task.
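A sketch of such reduced augmentation on a single fine-grained task, assuming torchvision-style transforms; the flip probability, jitter strengths, and noise level are illustrative choices rather than values from the experiments.

```python
# Sketch: joint flip for image and depth, photometric jitter for the image only,
# and mild Gaussian noise on the depth label of one fine-grained task.
import torch
import torchvision.transforms as T

def augment_fine_grained_task(image, depth, noise_std=0.01):
    if torch.rand(1) < 0.5:                                   # joint horizontal flip
        image, depth = torch.flip(image, [-1]), torch.flip(depth, [-1])
    image = T.ColorJitter(0.2, 0.2, 0.2, 0.05)(image)         # photometric only
    depth = depth + noise_std * torch.randn_like(depth)       # mild label noise
    return image, depth

img, dep = augment_fine_grained_task(torch.rand(3, 256, 256), torch.rand(1, 256, 256))
```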
Relation to task interpolation. Unlike domain-agnostic augmentation, task interpolation (MLTI [131]) is investigated to remedy the requirement of a large number of meta-training tasks. MLTI adopts the following interpolation to augment tasks:

H_cross = λ H_i + (1 − λ) H_j,   Y_cross = λ Y_i + (1 − λ) Y_j,   (6.4)

where subscripts i, j denote different tasks, H represents intermediate features, Y represents labels, and λ is a weight controlling the interpolation between the two tasks. Eq. 6.4 may not be valid for our fine-grained tasks. For example, mixing features from different scenes does not lead to the same interpolation of depth ranges, i.e., overlaying near- and far-range scenes can break the local region dependency within each scene for depth prediction, and thus does not naturally yield depth values in between. A feasible approach could be fusing geometry from one scene with texture from another, or performing 3D-aware scene interpolation, which not only interpolates intermediate depth ranges but also attends to geometry and appearance coherence.
Relation to meta-memorization and meta-overfitting. Prior meta-learning studies [87, 130, 132] point out that memorization and overfitting may also occur at the task level. However, they validate the claim on rather simple problems and settings, such as toy sinusoidal regression and pose prediction using 15-class Pascal3D objects [126] with chair or sofa CAD models rendered on flat grounds. Their task complexity is relatively low, and a network may easily memorize all tasks without generalizing, compared to our study on per-pixel real-valued depth prediction. Our fine-grained task setting shows that studying more practical and complex problems may be free from task memorization and overfitting. As shown in our experiments, without techniques such as meta-regularization and meta-augmentation to relieve meta-memorization and meta-overfitting, our network still serves as a good initialization in both intra-dataset and cross-dataset evaluation.
Table 6.3: Zero-shot cross-dataset evaluation using meta-initialization (Algorithm 1). Comparison is drawn between without meta-initialization (no marks, ImageNet-initialization) and with our meta-initialization (+Meta) using different sizes of ConvNeXt. Results with "+Meta" are consistently better.
HM3D → VA MAE AbsRel RMSE δ1 δ2 δ3
ConvNeXt-small 0.267 0.180 0.389 74.6 91.0 96.1
ConvNeXt-small + Meta 0.233 0.162 0.345 77.8 93.1 97.3
ConvNeXt-base 0.258 0.176 0.385 76.1 91.1 95.4
ConvNeXt-base + Meta 0.238 0.163 0.356 78.0 92.5 96.7
ConvNeXt-large 0.242 0.170 0.357 78.1 91.5 95.7
ConvNeXt-large + Meta 0.226 0.160 0.330 78.9 92.2 96.3
HM3D → NYUv2 MAE AbsRel RMSE δ1 δ2 δ3
ConvNeXt-small 0.540 0.213 0.728 69.2 88.7 95.8
ConvNeXt-small + Meta 0.527 0.206 0.710 70.7 89.0 95.9
ConvNeXt-base 0.529 0.208 0.717 70.1 89.4 96.0
ConvNeXt-base + Meta 0.505 0.199 0.691 71.6 89.8 96.3
ConvNeXt-large 0.501 0.192 0.690 72.0 90.4 96.4
ConvNeXt-large +Meta 0.481 0.190 0.660 73.2 90.6 96.6
HM3D → Replica MAE AbsRel RMSE δ1 δ2 δ3
ConvNeXt-small 0.222 0.138 0.321 84.5 93.9 96.6
ConvNeXt-small + Meta 0.200 0.126 0.287 85.6 95.7 98.1
ConvNeXt-base 0.217 0.134 0.316 84.6 94.2 96.6
ConvNeXt-base + Meta 0.192 0.117 0.277 87.1 96.4 98.5
ConvNeXt-large 0.214 0.137 0.307 84.3 94.0 96.6
ConvNeXt-large + Meta 0.191 0.117 0.275 87.1 96.5 98.5
Hypersim → VA MAE AbsRel RMSE δ1 δ2 δ3
ConvNeXt-small 0.291 0.215 0.404 68.5 90.8 96.7
ConvNeXt-small + Meta 0.280 0.207 0.398 70.4 91.3 97.0
ConvNeXt-base 0.275 0.201 0.393 71.3 91.8 97.3
ConvNeXt-base + Meta 0.259 0.194 0.365 72.8 92.8 97.8
ConvNeXt-large 0.263 0.198 0.369 73.0 92.0 97.1
ConvNeXt-large + Meta 0.248 0.183 0.355 74.6 93.5 97.8
Hypersim → NYUv2 MAE AbsRel RMSE δ1 δ2 δ3
ConvNeXt-small 0.434 0.165 0.598 75.7 94.3 98.5
ConvNeXt-small + Meta 0.415 0.155 0.575 77.8 95.1 98.8
ConvNeXt-base 0.396 0.150 0.549 79.6 95.6 98.9
ConvNeXt-base + Meta 0.386 0.141 0.524 80.3 96.0 99.0
ConvNeXt-large 0.389 0.149 0.542 79.8 95.6 98.8
ConvNeXt-large + Meta 0.375 0.140 0.517 81.2 96.2 99.1
Hypersim → Replica MAE AbsRel RMSE δ1 δ2 δ3
ConvNeXt-small 0.307 0.189 0.417 72.4 92.1 97.5
ConvNeXt-small + Meta 0.294 0.178 0.404 74.5 92.7 97.5
ConvNeXt-base 0.312 0.185 0.429 74.1 92.6 97.4
ConvNeXt-base + Meta 0.288 0.173 0.399 75.6 93.3 97.9
ConvNeXt-large 0.285 0.172 0.394 75.8 93.2 97.7
ConvNeXt-large + Meta 0.273 0.165 0.380 77.0 94.0 98.1
Table 6.4: Zero-shot cross-dataset evaluation on dedicated depth estimation architectures. Plugging our meta-initialization (Algorithm 1) into these frameworks stably improves results.
Hypersim → Replica MAE AbsRel RMSE δ1 δ2 δ3
BTS-ResNet50 [60] 0.417 0.226 0.588 69.4 87.2 94.0
BTS-ResNet50+Meta [60] 0.386 0.208 0.565 70.6 88.4 95.3
BTS-ResNet101 [60] 0.406 0.217 0.570 70.0 87.8 94.4
BTS-ResNet101+Meta [60] 0.382 0.206 0.559 70.8 88.5 95.4
DepthFormer [64] 0.355 0.192 0.525 72.9 90.8 96.4
DepthFormer + our Meta 0.339 0.181 0.499 74.4 92.1 96.9
DPT-hybrid [90] 0.370 0.197 0.549 71.3 89.0 95.7
DPT-hybrid + our Meta 0.324 0.169 0.487 75.6 92.8 97.8
DPT-large [90] 0.331 0.172 0.499 75.4 91.6 97.1
DPT-large + our Meta 0.314 0.164 0.474 77.2 92.7 97.5
Hypersim → NYUv2 MAE AbsRel RMSE δ1 δ2 δ3
BTS-ResNet50 [60] 0.487 0.196 0.654 71.8 90.4 95.6
BTS-ResNet50+Meta [60] 0.455 0.178 0.628 73.9 92.4 97.3
BTS-ResNet101 [60] 0.468 0.187 0.641 72.3 90.8 95.8
BTS-ResNet101+Meta [60] 0.450 0.175 0.623 74.2 92.6 97.5
DepthFormer [64] 0.442 0.169 0.608 75.1 93.9 98.2
DepthFormer + our Meta 0.416 0.157 0.580 77.8 94.3 98.2
DPT-hybrid [90] 0.409 0.149 0.580 78.9 94.8 98.3
DPT-hybrid + our Meta 0.395 0.140 0.559 81.0 96.4 99.1
DPT-large [90] 0.373 0.136 0.534 82.3 96.2 98.8
DPT-large + our Meta 0.364 0.131 0.520 83.2 96.6 99.1
Table 6.5: More results on depth-supervised NeRF. We test on the Replica 'room-0', 'room-1', 'room-2', 'office-0', 'office-1', and 'office-2' environments. We train a NeRF on each environment with 180 views. The comparison is drawn between using depth from meta-initialization and without meta-initialization for supervision. PSNR and SSIM are image quality metrics; the higher, the better.
w/o meta-initialization w/ meta-initialization
Environment PSNR SSIM PSNR SSIM
Room-0 29.988 0.8184 30.920 0.8373
Room-1 34.547 0.9279 34.871 0.9305
Room-2 36.680 0.9560 37.460 0.9609
Office-0 38.674 0.9629 39.290 0.9680
Office-1 36.196 0.9427 36.867 0.9460
Office-2 42.648 0.9638 42.665 0.9646
6.5 Campus Data with Meta-Labels
To further examine the generalizability of meta-learning, we collect Campus Data and provide meta-labels such as space type and range. The meta-labels hint at how a pretrained model performs for different space types. For example, a depth estimator trained on private spaces may perform poorly in offices due to textures and objects unseen in its training dataset.
The collected dataset, Campus Data, contains 38,000 images from 82 environments. We manually select 1,260 images with less motion blur and noise as the test set, and the rest serve as the training data.
Then we label the following attributes. (1) Category: 1: private room, 2: office, 3: hallway, 4: lounge, 5: meeting room, 6: large room, 7: classroom, 8: library, 9: kitchen, 10: playroom, 11: living room, 12: bathroom. (2) Maximal Range: 1: within 5 meters, 2: within 10 meters, 3: within 20 meters.
6.5.1 Analysis and Experiments
In what follows, we lay out ten problems and analyses on the collected Campus Data. These problems and analyses aim to benchmark existing methods, break down overall performance by meta-labels, and examine whether meta-learning and balanced sampling can reduce the revealed performance imbalance.
[Purpose I: Benchmarking] We first fetch open-sourced SOTA models pretrained on a popular indoor scene dataset, NYUv2. We analyze whether the performance reported on NYUv2 is also consistent on Campus Test.
Catalog of the SOTA models. We include DPT [90], GLPDepth [54], AdaBins [11], PixelFormer [2], NeWCRFs [137], BTS [60], MIM [127], and IronDepth [6].
Benchmark ranking on NYUv2 for reference: MIM, PixelFormer, NeWCRFs, GLPDepth, IronDepth, DPT, AdaBins, BTS.
[Analysis I]: From Table 6.6, the performance on Campus Test is consistent with that on NYUv2. MIM and PixelFormer are the best and second-best. Both use SwinTransformer-Large, showing that a large backbone can extract better features for performance enhancement.
Table 6.6: Results of pretrained SOTA methods on Campus Test. The best number is in bold,
and the second-best is underlined.
Method Architecture MAE AbsRel SqRel RMSE logRMSE δ1 δ2 δ3
DPT [90] DPT-Hybrid 0.3090 0.1224 0.0773 0.4616 0.1622 85.96 97.17 99.19
GLPDepth [54] Mit-b4 0.3068 0.1239 0.0788 0.4527 0.1635 86.05 97.36 99.16
AdaBins [11] Unet+AdaBins 0.3341 0.1333 0.0957 0.4922 0.1762 83.64 96.36 98.92
PixelFormer [2] Swin-Large 0.2982 0.1225 0.0761 0.4392 0.1650 86.08 97.03 99.10
NeWCRFs [137] Swin-Large 0.3028 0.1251 0.0823 0.4541 0.1699 86.04 96.68 98.94
BTS DenseNet-161 0.3602 0.1445 0.1162 0.5222 0.1880 81.65 95.57 98.54
IronDepth [6] - 0.3271 0.1276 0.1022 0.4894 - 85.30 96.37 98.84
MIM [127] SwinV2-Large 0.2807 0.1100 0.0679 0.4244 0.1506 88.58 97.59 99.28
Table 6.7: Results on space categories. The model is MIM-SwinTransformer pretrained on
NYUv2.
Category MAE AbsRel SqRel RMSE logRMSE δ1 δ2 δ3
Private room 0.1814 0.0927 0.0342 0.2556 0.1248 92.05 98.99 99.83
Office 0.2169 0.1106 0.0532 0.3313 0.1537 87.67 97.42 99.44
Hallway 0.3094 0.1229 0.0805 0.5463 0.1719 85.66 96.52 98.89
Lounge 0.4796 0.1316 0.1290 0.7447 0.1839 84.15 96.22 98.93
Meeting room 0.2609 0.0984 0.0483 0.3636 0.1349 91.70 98.69 99.64
Large room 1.0351 0.2683 0.4499 1.3915 0.3220 54.93 83.89 93.91
Classroom 0.2208 0.0781 0.0334 0.3071 0.1124 94.52 99.25 99.83
Library 0.3962 0.1342 0.0978 0.6281 0.1768 85.61 96.57 98.57
Kitchen 0.1975 0.1482 0.0791 0.3374 0.2056 82.31 95.42 98.32
Playroom 0.1689 0.0702 0.0276 0.2466 0.1066 94.54 98.25 99.79
Living room 0.2293 0.1033 0.0502 0.3448 0.1460 89.07 97.88 99.52
Bathroom 0.1644 0.1456 0.0772 0.2788 0.2014 83.45 96.07 98.13
However, DPT surpasses other methods on some metrics. This is because DPT uses much more data for training, whereas the others train only on NYUv2.
The above experiment shows overall statistics on the whole testing set. Next, we break down the results by the provided meta-labels.
[Purpose II: Performance breakdown]: We use the MIM model pretrained on NYUv2 and break down the overall results by the meta-labels of category and maximal range. This can potentially reveal the underlying distribution of space types and ranges in NYUv2's training set. Table 6.7 shows the results.
Top-5 categories in Table 6.7:
– Lower RMSE: playroom, private room, bathroom, classroom, office
– Higher δ1: playroom, classroom, private room, meeting room, living room
– Higher RMSE: large room, lounge, library, hallway, meeting room
Figure 6.8: Statistics of Campus Test and NYUv2-Depth. We map NYUv2's spaces to our defined Campus Data categories. One can see that the distributions of NYUv2's training data and Campus Test data are distinct.
– Lower δ1: large room, kitchen, bathroom, lounge, library
[Analysis II]: From Table 6.7 and the statistics, the easiest categories are playroom, private room, and classroom. The most challenging categories are large room, lounge, and library. The easy and hard categories match the distribution of NYUv2. Fig. 6.8 shows dataset statistics using our defined labels. NYUv2 has living rooms and private rooms, typical small rooms, as its most common scenes. Thus, in Table 6.7, where we use a model well pretrained on NYUv2, playroom, private room, and classroom are the easiest room types, which follows NYUv2's training set distribution.
We next use more models trained on different datasets and with different learning paradigms (supervised/self-supervised learning).
[Purpose III: Generalization]: We use models pretrained on other datasets and make inferences on Campus Test to examine zero-shot cross-dataset performance. Specifically, we include the following datasets and models: SimSIN (self-supervised by DistDepth), UniSIN (self-supervised by DistDepth), Hypersim (supervised), and NYUv2 (supervised). Tables 6.8 to 6.11 show the results.
Top-5 best categories for training datasets:
– SimSIN: private room, hallway, living room, playroom, library
– UniSIN: library, hallway, meeting room, bathroom
Table 6.8: Self-Supervised DistDepth performance trained on SimSIN. We use the pretrained
ResNet152 from [123].
Category MAE AbsRel SqRel RMSE logRMSE δ1 δ2 δ3
Private room 0.3253 0.1509 0.0986 0.4447 0.1812 79.36 96.48 99.53
Office 0.3915 0.1812 0.1606 0.5789 0.2265 74.01 93.47 97.80
Hallway 0.3758 0.1597 0.1239 0.6324 0.2037 78.12 94.92 98.64
Lounge 0.6261 0.1841 0.2148 0.9037 0.2328 73.86 93.11 97.94
Meeting room 0.6445 0.1962 0.2491 0.9417 0.2282 66.91 93.58 99.45
Large room 0.7378 0.1842 0.2619 1.0727 0.2303 72.79 91.87 97.88
Classroom 0.7099 0.2069 0.3288 1.0292 0.2374 67.11 91.66 98.51
Library 0.5313 0.1857 0.1913 0.8307 0.2236 75.14 93.11 97.44
Kitchen 0.3960 0.2524 0.2083 0.5649 0.2872 59.22 87.44 96.12
Playroom 0.4445 0.1597 0.1147 0.5946 0.1907 75.59 97.95 99.67
Living room 0.3745 0.1600 0.1166 0.5284 0.1986 77.05 94.90 98.89
All 0.4689 0.1746 0.1719 0.6877 0.2116 74.72 94.18 98.61
Table 6.9: Self-Supervised DistDepth performance trained on UniSIN. We use the pretrained
ResNet152 from [123].
Category MAE AbsRel SqRel RMSE logRMSE δ1 δ2 δ3
Private room 0.3458 0.1738 0.1102 0.4539 0.2172 72.81 93.65 98.68
Office 0.3362 0.1656 0.1113 0.5020 0.2115 75.04 94.86 98.98
Hallway 0.3044 0.1262 0.0765 0.4977 0.1667 84.79 96.91 99.37
Lounge 0.4932 0.1380 0.1428 0.7351 0.1854 82.41 96.77 98.95
Meeting room 0.3928 0.1300 0.1104 0.5937 0.1667 83.94 97.36 99.63
Large room 0.6555 0.1608 0.2112 0.9447 0.2130 75.95 94.83 98.68
Classroom 0.4233 0.1313 0.1166 0.6077 0.1625 84.12 97.96 99.74
Library 0.4115 0.1230 0.1146 0.6958 0.1699 85.61 96.48 98.91
Kitchen 0.3980 0.2741 0.1997 0.5740 0.3183 52.30 85.15 96.28
Playroom 0.3932 0.1486 0.0822 0.4755 0.1842 78.63 98.23 99.84
Living room 0.3706 0.1644 0.1153 0.5106 0.2081 76.37 93.66 98.48
Bathroom 0.1604 0.1384 0.0409 0.2168 0.1750 84.67 95.28 99.68
All 0.3826 0.1509 0.1143 0.5602 0.1932 78.96 95.38 98.97
Table 6.10: Supervised learning performance trained on Hypersim. Results across space types
are shown.
Category MAE AbsRel SqRel RMSE logRMSE δ1 δ2 δ3
Private room 0.2517 0.1321 0.0607 0.3376 0.1669 84.55 97.26 99.36
Office 0.3317 0.1678 0.1085 0.4916 0.2194 76.01 93.60 97.87
Hallway 0.5082 0.2126 0.1912 0.8347 0.2644 65.67 89.89 97.15
Lounge 0.6514 0.1937 0.2254 0.9403 0.2414 71.31 92.58 97.58
Meeting room 0.3830 0.1315 0.0837 0.5333 0.1710 81.49 98.23 99.74
Large room 1.0058 0.2525 0.4125 1.3897 0.3058 56.48 85.13 94.95
Classroom 0.3626 0.1258 0.0740 0.4700 0.1600 85.07 98.50 99.80
Library 0.5519 0.1766 0.1664 0.8654 0.2226 73.53 93.54 98.03
Kitchen 0.3069 0.2417 0.1343 0.4347 0.2888 62.81 90.25 96.43
Playroom 0.4188 0.1629 0.1132 0.5663 0.2057 77.09 93.87 99.15
Living room 0.3196 0.1485 0.0872 0.4519 0.1898 80.91 95.40 98.95
Bathroom 0.2152 0.1648 0.0627 0.2908 0.2071 76.11 95.80 98.89
All 0.4004 0.1592 0.1187 0.5803 0.2013 77.95 94.99 98.60
Table 6.11: Supervised learning performance trained on NYUv2. Results across space types are
shown.
Category MAE AbsRel SqRel RMSE logRMSE δ1 δ2 δ3
Private room 0.2300 0.1189 0.0502 0.3079 0.1541 86.28 97.98 99.73
Office 0.3190 0.1590 0.0952 0.4670 0.2067 76.91 94.49 98.74
Hallway 0.4253 0.1843 0.1385 0.6777 0.2309 72.03 93.46 98.12
Lounge 0.6154 0.1690 0.1987 0.9024 0.2238 74.86 93.24 98.35
Meeting room 0.3305 0.1227 0.0680 0.4446 0.1593 84.98 98.34 99.75
Large room 0.8392 0.2146 0.3210 1.1674 0.2654 64.28 89.39 95.90
Classroom 0.3011 0.1082 0.0531 0.3851 0.1439 88.82 98.83 99.89
Library 0.5323 0.1798 0.1758 0.8319 0.2253 74.89 93.30 97.52
Kitchen 0.2476 0.1908 0.0941 0.3608 0.2357 75.13 92.66 96.81
Playroom 0.3433 0.1413 0.0737 0.4165 0.1774 83.03 96.98 99.03
Living room 0.2914 0.1365 0.0739 0.4071 0.1764 83.13 96.63 99.22
Bathroom 0.2180 0.1825 0.0687 0.2925 0.2166 72.99 93.97 99.08
All 0.3615 0.1452 0.1011 0.5202 0.1866 80.84 95.94 98.93
– Hypersim: classroom, private room, meeting room, living room, bathroom
– NYUv2: classroom, playroom, private room, meeting room, living room
[Analysis III]: From Tables 6.8 to 6.11, SimSIN, Hypersim, and NYUv2 all have private room among their top-5 best categories. This is because these datasets mainly collect or render data in private and small rooms. UniSIN focuses on campus scenes; thus, it performs best in campus scenes such as library, hallway, or meeting room.
First, models trained by supervised learning perform much better than the self-supervised method due to its learning strategy.
Next, we find that performances across categories are imbalanced for these training datasets. For example, δ1 on SimSIN is only 59.22 for kitchen and 78.12 for hallway. SimSIN is generated from re-rendered 3D scans, and the scans are incapable of reconstructing cluttered near objects. Therefore, performance is poor for scenes like kitchen and bathroom, where near-field small objects, such as bottles, kitchenware, and utensils, are cluttered. In contrast, hallway is among the high-frequency scenes, and thus it has the best performance.
UniSIN focuses on campus scenes. Accordingly, it performs much more poorly in kitchen but has the best δ1 in library. Hypersim has its best performance in classroom and its worst in large room. NYUv2 also has its best in classroom and its worst in large room.
Table 6.12: Comparison between meta-learning (Meta) and supervised learning (SL), trained
on NYUv2. We adopt ConvNeXt-small (conv-sml) and ConvNeXt-base (conv-b) as backbones.
Model MAE AbsRel SqRel RMSE logRMSE δ1 δ2 δ3
Conv-sml (SL) 0.3756 0.1501 0.1088 0.5412 0.1934 79.78 95.42 98.70
Conv-sml (Meta) 0.3669 0.1484 0.1059 0.5252 0.1904 80.06 95.68 98.89
Conv-b (SL) 0.3715 0.1476 0.1054 0.5340 0.1901 79.92 95.70 98.88
Conv-b (Meta) 0.3615 0.1452 0.1011 0.5202 0.1866 80.84 95.94 98.93
Table 6.13: Comparison between meta-learning (Meta) and supervised learning (SL), trained
on Hypersim. We adopt ConvNeXt-small (conv-sml) and ConvNeXt-base (conv-b) as backbones.
Model MAE AbsRel SqRel RMSE logRMSE δ1 δ2 δ3
Conv-sml (SL) 0.4044 0.1607 0.1193 0.5821 0.2054 76.89 94.77 98.66
Conv-sml (Meta) 0.3991 0.1584 0.1164 0.5741 0.2009 77.71 95.10 98.78
Conv-b (SL) 0.4004 0.1592 0.1187 0.5803 0.2013 77.95 94.99 98.60
Conv-b (Meta) 0.3902 0.1532 0.1113 0.5627 0.1957 78.94 95.50 98.84
Table 6.14: Statistics for supervised learning (SL) vs. meta-learning (Meta). The mean and standard deviation (STD) are computed across categories. Models are trained on Hypersim.
Model Mean RMSE STD RMSE Mean δ1 STD δ1
Conv-sml (SL) 0.6221 0.3087 73.84 8.3322
Conv-sml (Meta) 0.6241 0.2974 74.51 7.9858
Conv-b (SL) 0.6337 0.3161 74.24 8.8631
Conv-b (Meta) 0.6108 0.3154 76.37 8.1388
Those results disclose the underlying scene distribution from which the data are collected. The scene imbalance affects the robustness of cross-dataset inference when one uses an off-the-shelf model to make inferences on self-collected data. Due to the imbalance, the model may not perform equally well in all categories, and this imbalance issue needs to be addressed before applying a model in the wild.
The above experiments adopt publicly released pretrained models. Next, we train our designed meta-learning algorithm and validate whether it can reduce the imbalance owing to its higher generalization ability.
[Purpose IV: Meta-Learning to reduce imbalance]: We investigate whether the meta-learning strategy can help mitigate scene imbalance through meta-learning's higher generalization ability.
Tables 6.12 to 6.14 compare supervised learning and meta-learning on NYUv2 and Hypersim. Further, the mean and standard deviation of RMSE and δ1 across room types are shown.
Table 6.15: Results of training and testing on Campus Train/ Test. ConvNeXt-small is used.
Category MAE AbsRel SqRel RMSE logRMSE δ1 δ2 δ3
Private room 0.0916 0.0492 0.0104 0.1344 0.0699 98.41 99.81 99.96
Office 0.1102 0.0616 0.0210 0.1729 0.0901 96.93 99.45 99.82
Hallway 0.1491 0.0715 0.0230 0.2354 0.0992 95.81 99.46 99.85
Lounge 0.1922 0.0609 0.0395 0.3185 0.0916 96.88 99.40 99.76
Meeting room 0.1184 0.0487 0.0144 0.1778 0.0736 98.11 99.64 99.89
Large room 0.1991 0.0515 0.0255 0.3153 0.0772 98.22 99.62 99.87
Classroom 0.1177 0.0438 0.0128 0.1725 0.0651 98.77 99.86 99.97
Library 0.1477 0.0544 0.0223 0.2543 0.0825 97.34 99.37 99.78
Kitchen 0.0935 0.0757 0.0310 0.1825 0.1155 94.63 98.86 99.57
Playroom 0.1207 0.0565 0.0184 0.1707 0.0825 96.86 98.38 99.88
Living room 0.1048 0.0504 0.0137 0.1556 0.0727 98.14 99.74 99.93
Bathroom 0.0899 0.0849 0.0241 0.1332 0.1143 94.17 97.72 99.22
All 0.1227 0.0542 0.0181 0.1918 0.0790 97.66 99.60 99.88
[Analysis IV]: Tables 6.12 and 6.13 show that meta-learning consistently outperforms supervised learning on the two experimented datasets. To investigate the source of the gain, in Table 6.14 we show the mean and standard deviation for RMSE and δ1. Using both Conv-sml and Conv-b architectures, the mean for RMSE and δ1 improves, and the standard deviation decreases with meta-learning. The source of the gain is that meta-learning has higher domain generalizability to unseen datasets; thus, in the zero-shot cross-dataset evaluation, meta-learning can effectively mitigate the gap between different environments and increase robustness.
The source of scene imbalance can also come from the dataset itself, such as higher noise in some scenes. To verify whether the imbalance comes from the data collection, we next train and test on Campus Train and Campus Test.
[Purpose V: Noise or bias in Campus Data]: We adopt Campus Train, which contains frames non-overlapping with Campus Test, with a total of 190 sequences. Then we supervised-learn on Campus Train and test on Campus Test. Results in Table 6.15 are reported across room types. The experiment aims to discover potential underlying biases towards specific room types or noise contained in the training set.
[Analysis V]: In Table 6.15, in terms of RMSE, lounge, large room, and library are the most challenging types. Lounge, large room, and library are typically associated with longer ranges and thus higher RMSE. Bathroom, kitchen, and hallway are the most challenging types in terms of accuracy δ1. The difficulty in those types is that they typically contain high complexity in object
Table 6.16: Comparison between with and without class weights. ConvNeXt-small (Conv-sml)
and ConvNeXt-base (Conv-b) are used.
Model Class weights MAE AbsRel SqRel RMSE logRMSE δ1 δ2 δ3
Conv-sml ✗ 0.1227 0.0542 0.0181 0.1918 0.0790 97.66 99.60 99.88
Conv-sml ✓ 0.1357 0.0606 0.0202 0.2071 0.0866 97.06 99.52 99.86
Conv-b ✗ 0.1164 0.0510 0.0174 0.1846 0.0752 97.96 99.63 99.89
Conv-b ✓ 0.1282 0.0567 0.0196 0.1986 0.0821 97.47 99.58 99.87
Table 6.17: Statistics for training with and without class weights. Standard deviations (STD) are computed across categories. Models are trained on Campus Train.
Model Class weights STD RMSE STD δ1
Conv-sml ✗ 0.06424 1.4850
Conv-sml ✓ 0.06295 1.4371
Conv-b ✗ 0.06734 1.2236
Conv-b ✓ 0.06186 1.1577
arrangements, especially for small objects. This difficulty makes occluding boundaries hard to locate precisely in those scenes. Further, near and far fields can be obfuscated by the arrangements. These room types are dissimilar to others and hard to fit into the image-to-depth relation learned from the head-class data, and thus they can be seen as tailed categories. The (mean, std) across types are (0.2019, 0.0642) for RMSE and (97.02, 1.4872) for δ1. The small standard deviation shows the statistics vary only slightly across types, and no apparent noise exists in the training dataset.
Still, tailed categories exist in Table 6.15, and in the next analysis, we study some widely used methods to alleviate the imbalance.
[Purpose VI: Class weights]: In the literature, there are methods to alleviate the gap between classes [21, 45, 114]. Class re-weighting was proposed to re-balance performance across different labels. In this method, the class weight assigned to each class is inversely proportional to its occurrence, which compensates for higher class frequency. We adopt this class re-weighting strategy on the training loss to examine whether it can achieve lower performance gaps between categories. Table 6.16 shows the overall performance, and Table 6.17 shows the mean and standard deviation.
[Analysis VI]: From Table 6.17, one can observe a smaller standard deviation for both RMSE and δ1, indicating that performances across types are more balanced. However, the class re-weighting strategy also leads to inferior overall performance. This is because head-class performance drops
Table 6.18: Comparison between with and without using a balanced task-loader.
Model Balanced loader MAE AbsRel SqRel RMSE logRMSE δ1 δ2 δ3
Convsml ✗ 0.1357 0.0606 0.0202 0.2071 0.0866 97.06 99.52 99.86
Convsml ✓ 0.1140 0.0501 0.0166 0.1816 0.0742 98.04 99.66 99.90
Convb ✗ 0.1164 0.0510 0.0174 0.1846 0.0752 97.96 99.63 99.89
Convb ✓ 0.1014 0.0440 0.0146 0.1667 0.0667 98.52 99.73 99.90
and affects the overall results. This effect is also observed and studied in the literature on class re-weighting [120, 144]. Therefore, this strategy might not be a desirable choice.
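A sketch of applying inverse-frequency class weights to the per-sample depth regression loss; the category counts, normalization, and function names are placeholders, since the exact weighting used in these experiments is not spelled out here.

```python
# Sketch: weight each RGB-D pair's L2 loss inversely to its room-type frequency.
import torch

def class_weights_from_counts(counts):
    counts = torch.tensor(counts, dtype=torch.float)
    w = 1.0 / counts
    return w / w.sum() * len(counts)              # normalize to mean weight 1

def reweighted_l2(pred, gt, labels, weights):
    per_sample = ((pred - gt) ** 2).flatten(1).mean(dim=1)   # L2 per RGB-D pair
    return (weights[labels] * per_sample).mean()

weights = class_weights_from_counts([500, 200, 120, 80])     # 4 hypothetical categories
loss = reweighted_l2(torch.rand(8, 1, 64, 64), torch.rand(8, 1, 64, 64),
                     torch.randint(0, 4, (8,)), weights)
```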
In the next analysis, we study another approach: a balanced data sampler.
[Purpose VII: Balanced data sampler]: Another useful method towards balanced performance is a class-balanced sampler, which samples training data with equal occurrences for each class. We implement a balanced task-loader (a group of categories as a task), so that each task is equally present in a minibatch. We set the task number to 8: the head class (private room) is split into two tasks, and the smaller categories (library, kitchen, playroom, bathroom) are grouped into one larger task. The 8-task setting makes each task approximately the same size, which samples training data more effectively without downsizing any larger task. Table 6.18 shows the results.
[Analysis VII]: In Table 6.18, we find that the balanced task-loader improves overall performance and attains a lower standard deviation for RMSE and δ1. The balanced task-loader satisfactorily reaches lower bias without sacrificing overall performance, indicating a better method than class re-weighting. The result matches findings in the literature [52], which also uses a balanced class sampler to alleviate the long-tailed class issue.
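One possible way to realize such a balanced task-loader is sketched below using PyTorch's WeightedRandomSampler, where sampling probabilities are inversely proportional to task-group sizes; the group assignment and batch size are placeholders rather than the actual 8-task split.

```python
# Sketch: sample RGB-D pairs so that each task group appears with roughly equal
# frequency in a minibatch.
import torch
from torch.utils.data import TensorDataset, DataLoader, WeightedRandomSampler

images = torch.rand(1000, 3, 64, 64)
depths = torch.rand(1000, 1, 64, 64)
groups = torch.randint(0, 8, (1000,))             # task-group id per RGB-D pair

group_counts = torch.bincount(groups, minlength=8).float()
sample_weights = 1.0 / group_counts[groups]       # rarer groups sampled more often

sampler = WeightedRandomSampler(sample_weights, num_samples=len(groups), replacement=True)
loader = DataLoader(TensorDataset(images, depths, groups), batch_size=50, sampler=sampler)
```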
Next, we are interested in whether the balanced task-loader can be combined with meta-learning.
[Purpose VIII: Balanced task-loader with meta-learning]: Purpose VII can be seen as grouping tasks and using a balanced sampler without the bilevel optimization of meta-learning. It is interesting whether a balanced task-loader can benefit meta-learning as well. Toward this goal, we apply meta-learning with fine-grained tasks and use a balanced task-loader to sample equally frequent data for each category. Campus Train is adopted as the training dataset, and to validate cross-dataset performance, NYUv2 is adopted as the testing dataset. Table 6.19 shows the results.
Table 6.19: Comparison between with and without meta-learning. Models are trained on Campus
Train and evaluated on NYUv2.
Model MAE AbsRel SqRel RMSE logRMSE δ1 δ2 δ3
Conv-sml (SL) 0.4184 0.1630 0.1357 0.5835 0.2050 77.30 94.31 98.36
Conv-sml (Meta) 0.4094 0.1621 0.1310 0.5665 0.2020 77.58 94.51 98.44
Conv-b (SL) 0.4045 0.1592 0.1291 0.5618 0.1995 78.66 94.58 98.44
Conv-b (Meta) 0.4016 0.1581 0.1257 0.5584 0.1981 78.82 94.65 98.52
[Analysis VIII]: In Table 6.19, one can observe that meta-learning combined with a balanced task-loader improves performance over direct supervised learning for both ConvNeXt-small and ConvNeXt-base. This is due to meta-learning's domain generalization advantage as well as the balanced task-loader being a better data-loading strategy. Together, they attain better generalization in this zero-shot cross-dataset evaluation.
Purposes I to VIII train on a single dataset and show zero-shot cross-dataset evaluation on unseen datasets. We are next interested in generalization to unseen types.
[Purpose IX: Generalization to unseen types]: In addition to generalization to unseen datasets, we are also curious about generalization to unseen types. We divide the whole Campus Train into different splits, train on each division, and then test on Campus Test. The whole training set is divided into three groups based on categories. Group 1: private room, kitchen, living room, bathroom; Group 2: office, hallway, meeting room, classroom; Group 3: lounge, large room, playroom, library. Unlike training on the whole dataset, these models have not seen samples from the other groups. For example, a model trained on Group 1 had no access to RGB-D pairs of classroom or hallway during training and may perform worse in those categories. However, these experiments hint at how such models generalize to other categories and validate whether meta-learning can induce a better depth prior for unseen types. Results in Table 6.20 (reported by group) and Table 6.21 (reported by range) show the comparison.
[Analysis IX]: In Table 6.20, we compare supervised learning and meta-learning. Meta-learning achieves consistently better performance than pure supervised learning on the different training groups. The results show that meta-learning yields higher generalization across groups and again verify meta-learning's generalizability. In Table 6.21, we show performance in different distance ranges. Groups 1, 2, and 3 have the most frequent samples within 5m, 10m,
Table 6.20: Performance trained on different groups and test on Campus Test. SL: supervised
learning; Meta: meta-learning. See the text for the definition of groups.
Group MAE AbsRel SqRel RMSE logRMSE δ1 δ2 δ3
Group 1 (SL) 0.3734 0.1419 0.1355 0.5408 0.1778 81.29 94.51 98.19
Group 1 (Meta) 0.3485 0.1325 0.1285 0.5075 0.1672 83.02 94.94 98.20
Group 2 (SL) 0.3129 0.1317 0.0961 0.4393 0.1661 83.85 96.16 98.87
Group 2 (Meta) 0.3063 0.1274 0.0958 0.4317 0.1617 84.41 96.12 98.77
Group 3 (SL) 0.4101 0.1925 0.1825 0.5740 0.2312 72.09 91.94 97.53
Group 3 (Meta) 0.3293 0.1555 0.1034 0.4527 0.1942 77.87 94.42 98.41
Table 6.21: Performance trained on different groups and tested on Campus Test. Results are reported by ranges.
Group 1 MAE AbsRel SqRel RMSE logRMSE δ1 δ2 δ3
SL< 5m 0.2133 0.0984 0.0476 0.2877 0.1270 89.48 98.08 99.66
SL< 10m 0.3741 0.1376 0.1241 0.5935 0.1801 80.47 94.59 98.37
SL< 20m 0.8959 0.2897 0.4375 1.3003 0.3412 55.57 82.74 93.16
Meta< 5m 0.1943 0.0880 0.0419 0.2656 0.1153 90.80 98.33 99.72
Meta< 10m 0.3363 0.1281 0.1110 0.5308 0.1672 82.64 95.19 98.40
Meta< 20m 0.8684 0.2836 0.4338 1.2685 0.3372 58.05 83.54 92.96
Group 2 MAE AbsRel SqRel RMSE logRMSE δ1 δ2 δ3
SL< 5m 0.2014 0.1063 0.0464 0.2720 0.1376 87.97 97.80 99.59
SL< 10m 0.2951 0.1169 0.0728 0.4263 0.1526 87.27 97.75 99.34
SL< 20m 0.7003 0.2336 0.2886 1.0028 0.2764 66.01 88.76 95.90
Meta< 5m 0.1902 0.1019 0.0459 0.2579 0.1316 88.76 97.85 99.53
Meta< 10m 0.2780 0.1092 0.0658 0.4060 0.1453 88.22 97.83 99.38
Meta< 20m 0.7222 0.2343 0.2971 1.0327 0.2812 65.32 88.24 95.54
Group 3 MAE AbsRel SqRel RMSE logRMSE δ1 δ2 δ3
SL< 5m 0.4354 0.2184 0.2070 0.5838 0.2569 66.78 90.83 97.38
SL< 10m 0.4805 0.2129 0.2243 0.7016 0.2562 68.46 89.54 96.39
SL< 20m 0.2372 0.0819 0.0491 0.3784 0.1153 94.08 98.65 99.50
Meta< 5m 0.3754 0.1856 0.1374 0.4953 0.2256 72.17 93.61 98.36
Meta< 10m 0.3903 0.1708 0.1312 0.5610 0.2120 74.65 92.87 97.87
Meta< 20m 0.1654 0.0554 0.0304 0.2815 0.0848 96.62 99.22 99.78
and 20m, respectively, and thus perform better in those ranges when trained on their specialized groups. Similarly, applying meta-learning improves generalizability to the other ranges. Fig. 6.9 shows the qualitative comparison of training on groups.
The above purposes (I to IX) all focus on a single-dataset or sub-dataset level. We next go beyond these scopes and examine meta-learning with multiple training datasets.
[Purpose X: Multiple training datasets]: The above experiments focus on training on a single dataset. It is interesting whether this meta-learning strategy can be extended to a multiple-training-dataset setting. We adopt three training datasets, Hypersim, NYUv2, and HM3D, and testing sets including VA, Replica, and Campus Test. Settings: (1) supervised learning; (2) supervised +
Figure 6.9: Comparison for training on sub-groups. Meta-Learning shows better depth accuracy
of object outlines and shapes.
Table 6.22: Multiple training dataset experiment. Four settings, including meta-learning and the balanced task-loader, are examined. See the text for setting details.
Eval on VA MAE AbsRel SqRel RMSE logRMSE δ1 δ2 δ3
Setting (1) 0.2732 0.1996 0.1006 0.3856 0.2386 71.62 92.07 97.18
Setting (2) 0.2686 0.1926 0.0926 0.3786 0.2319 71.92 92.36 97.84
Setting (3) 0.2662 0.1926 0.0948 0.3748 0.2312 72.62 92.29 97.54
Setting (4) 0.2536 0.1861 0.0864 0.3585 0.2246 74.33 93.12 97.69
Eval on Replica MAE AbsRel SqRel RMSE logRMSE δ1 δ2 δ3
Setting (1) 0.2784 0.1695 0.0838 0.3848 0.2083 76.84 93.55 97.91
Setting (2) 0.2621 0.1607 0.0738 0.3608 0.1984 78.51 94.63 98.42
Setting (3) 0.2592 0.1570 0.0737 0.3600 0.1968 79.60 94.42 98.22
Setting (4) 0.2489 0.1491 0.0672 0.3498 0.1851 80.83 95.30 98.64
Eval on Campus Test MAE AbsRel SqRel RMSE logRMSE δ1 δ2 δ3
Setting (1) 0.3539 0.1438 0.1010 0.5024 0.1856 81.17 96.11 98.99
Setting (2) 0.3459 0.1406 0.0949 0.4914 0.1814 82.01 96.37 99.11
Setting (3) 0.3433 0.1384 0.0938 0.4891 0.1800 82.29 96.48 99.09
Setting (4) 0.3413 0.1376 0.0936 0.4870 0.1784 82.55 96.51 99.15
balanced dataset sampler; (3) meta-learning with fine-grained tasks; (4) meta-learning at the dataset level with a balanced sampler. The purpose is to verify whether meta-learning still applies when rich training data exist. Further, we adopt the balanced task-loader from Purpose VII and examine whether the task-loader can also support meta-learning.
[Analysis X]: Table 6.22 shows that Settings (1)-(4) perform gradually better. First, comparing Setting (1) with (2) and Setting (3) with (4), the adopted task-loader improves performance and shows higher generalizability in zero-shot cross-dataset evaluation. Similarly, comparing Setting (1) with (3) and Setting (2) with (4), one can observe that meta-learning also brings performance improvement over not using it. This verifies that both meta-learning and the balanced task-loader consistently improve performance. The balanced task-loader induces less bias in the trained model toward one specific dataset. It helps learn a higher-level image-to-depth relation that benefits zero-shot cross-dataset evaluation. We visualize depth map comparisons in Fig. 6.10.
6.6 Summary
This chapter first validates that single-stage meta-learning estimates coarse but smooth depth from global context (Sec. 6.4.1). It serves as a better initialization for the subsequent supervised learning to obtain higher model generalizability in intra-dataset and zero-shot cross-dataset evaluation (Secs. 6.4.2, 6.4.3), as well as a better 3D representation for NeRF training (Sec. 6.4.4).
Figure 6.10: Comparison for training on multiple training datasets. Meta-learning shows better depth accuracy of object outlines and shapes.
From depth’s perspective,
• this work provides a simple learning scheme to gain generalizability without the needs of extra
data, constraints, advanced loss, or module design;
79
• is easy to plug into general architecture or dedicated depth framework and show concrete im-
provements;
• proposes zero-shot cross-dataset protocol to attend to in-the-wild performance that most prior
work overlooks.
From meta-learning’s perspective,
• this work chooses the challenging single-image setting and meta-optimizes across environments, unlike prior works on online video adaptation that meta-optimize within a sequence with access to multiple frames;
• proposes the fine-grained task to overcome the lack of affinity among sparsely sampled, unrelated images (see the sketch after this list);
• studies a complex single-image real-valued regression problem rather than the widely studied classification setting.
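One plausible way to picture the fine-grained task named in the second bullet is to group images that share a scene (or consecutive views of one environment) into a single meta-learning task, so that each task has internal affinity instead of consisting of unrelated images. The sketch below only illustrates that reading; the field name scene_id and the grouping granularity are assumptions, and the exact task construction is defined earlier in this chapter.

```python
import random
from collections import defaultdict


def build_fine_grained_tasks(samples, images_per_task=4, seed=0):
    """Group image/depth samples by scene so that each meta-learning task holds
    related views with shared context rather than sparsely sampled, unrelated
    images. Each element of `samples` is assumed to carry a 'scene_id' field."""
    rng = random.Random(seed)
    by_scene = defaultdict(list)
    for sample in samples:
        by_scene[sample["scene_id"]].append(sample)
    tasks = []
    for scene_samples in by_scene.values():
        rng.shuffle(scene_samples)
        # split each scene into fixed-size chunks; drop a leftover chunk that is too small
        for start in range(0, len(scene_samples) - images_per_task + 1, images_per_task):
            tasks.append(scene_samples[start:start + images_per_task])
    rng.shuffle(tasks)
    return tasks
```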
Next, we collect campus-wide data and provide space-type labels to further show the breakdown across space types. This helps us better understand a model’s performance in relation to its training data and reveals potential bias in the training data. We demonstrate that meta-learning and the task-balanced sampler can alleviate this training-data bias. With meta-learning’s higher generalizability, the model learns an image-to-depth mapping that better calibrates to the real distributions of the two domains and cancels bias toward specific scene types. Last, we show meta-learning’s ability to train on sub-/single-/multi-dataset settings with consistent improvements and validate its effects in extensive experiments.
Chapter 7
Conclusion
Estimating geometry from a single image is a fundamental computer vision problem with broad applications in AR/VR, robotics, and autonomous systems. A learned model can even serve as a 3D-representation extractor from images and support other computer vision tasks, such as novel view synthesis or object detection and segmentation. This dissertation visits several works on different real-world application domains, such as faces, outdoor driving, and indoor scenes. Last, going beyond contributions in model architectures or loss functions, we focus on a better learning scheme, meta-learning, and address the high per-image variation by introducing the fine-grained task. We then collect Campus Data to closely evaluate a pretrained model’s generalizability via zero-shot cross-dataset evaluation. From the perspectives of generalization and robustness, it is essential that pretrained models apply to unseen scenes or even unseen datasets rather than being evaluated only on the same dataset.
This dissertation covers several application data domains, identifies each domain’s difficulties, and introduces a method to address them. We then revisit the fundamental learning problem and present a better learning procedure that attains higher generalizability.
Abstract
Predicting geometry from images is a fundamental and popular task in computer vision with multiple applications. For example, predicting ranges from ego-view images can help robots navigate through indoor spaces and avoid collisions. In addition to physical applications, one can synthesize novel views from single images with the help of depth by warping pixels to different camera positions. Further, one can fuse depth estimates from multiple views to create a complete 3D environment for AR/VR uses.
It is difficult for traditional methods to predict geometry from single images due to the domain gap between imagery and geometry; the problem is intrinsically ill-posed without multi-view constraints. Thanks to recent advances in deep learning and efficient computation, various data-driven approaches have emerged that learn to regress depth maps from images as the geometry representation, and recent high-performing works are primarily based on deep learning to bridge this domain gap.
This dissertation includes several works on estimating geometry from single images in various data domains, including human faces, outdoor driving, and indoor scenes. Each chapter introduces the domain-specific difficulties and our solutions.
Then, in the last chapter, we go beyond network-architecture and loss-function developments and adopt a better learning strategy, meta-learning, to learn a higher-level representation that more accurately characterizes the depth domain. Our meta-learning approach attains better performance without extra data or pretrained models; it focuses directly on the learning schedule. We then closely evaluate generalizability on our collected Campus Data and demonstrate meta-learning’s effectiveness at the sub-/single-/multi-dataset levels.