SEMANTIC STRUCTURE IN UNDERSTANDING AND GENERATION OF THE 3D WORLD
by
Yuzhong Huang
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2024
Copyright 2025 Yuzhong Huang
Dedication
To those who believed.
Acknowledgements
It has been a privilege to dedicate these years to exploring topics driven purely by my intellectual curiosity.
Few are as fortunate to have the freedom to pursue what they truly love—a journey that, for me, has been
both a professional endeavor and a personal discovery.
I owe my deepest gratitude to my advisor, Prof. Fred Morstatter, for his unwavering support, guidance,
and belief in my potential. His mentorship granted me the freedom to explore diverse research directions,
while his wisdom and encouragement provided a steady anchor during challenging times. Beyond academic
mentorship, Prof. Morstatter has been a role model, demonstrating what it means to be not only a remarkable
researcher but also a compassionate and inspiring human being.
I am profoundly grateful to my collaborators and mentors from various internships and projects, whose
insights and experiences have enriched my academic journey and shaped my research.
• Dr. Rui Tang and Mr. Xiaohuang Huang at ExaCloud Inc.; Prof. Wenbin Li at the University of Bath.
• Dr. Yuanhang Su and Prof. C.-C. Jay Kuo at USC Media Communications Lab.
• Prof. Aram Galstyan, Prof. Kristina Lerman, Prof. Jay Pujara, Dr. Andrés Abeliuk, Dr. Sami Abu-El-Haija,
Ph.D. Candidate Zihao He, and Kexuan Sun at USC Information Sciences Institute.
• Dr. Xue Bai, Dr. Oliver Wang, Dr. Fabian Caba, and Dr. Aseem Agarwala at Adobe.
• Dr. Chen Liu, Dr. Ji Hou, Dr. Ke Huo, and Mr. Shiyu Dong at Meta.
• Dr. Zhong Li and Dr. Zhang Chen at OPPO US Research; Ph.D. Candidate Zhiyuan Ren at Michigan State
University; Ph.D. Candidate Zheng Chen at Indiana University; and Prof. Guosheng Lin at NUS.
• Dr. Dmytro Mishkin, Dr. Manlio Barajas Hernandez, Dr. Ian Endres, and Mr. Jack Langerman at Hover Inc.
I also extend my heartfelt thanks to my dissertation committee members: Prof. Yue Wang, Prof. Antonio
Ortega, and Prof. Aiichiro Nakano. Their constructive feedback and guidance have been instrumental in
refining this work. I am grateful to the USC Information Sciences Institute MINDS group and to all alumni,
staff, and colleagues who have offered their support and camaraderie throughout my academic journey.
To my family, my deepest appreciation:
• To my wife, Yihe Wang, for her constant encouragement, love, and unwavering belief in me.
• To my parents, Hui Huang and Hong Zheng, for providing me with the opportunity to explore the vast
and unknown world, and for their sacrifices that have made this journey possible.
This dissertation is not only the culmination of years of research but also a testament to the collective support
and inspiration I have been privileged to receive. To all who have been part of this journey, thank you.
Table of Contents
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 3D Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Bridging the Gap Between Implicit and Explicit 3D Representations . . . . . . . . . . . . . 3
1.3 Challenges in 3D Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.1 Implicit representation excels in visualization but lacks in geometry . . . . . . . . . 4
1.3.2 Implicit representation struggles with repetition issues and the Janus problem . . . . 4
1.3.3 Implicit representation is hard to edit . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Thesis Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 Outline of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Chapter 2: UniPlane: Unified Plane Detection and Reconstruction from Posed Monocular
Videos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 Single-view plane reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.2 Multi-view plane reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.3 Learning-based tracking and reconstruction . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.1 Feature volume construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.2 Transformers-based Plane Detection . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.3 Unifying Tracking with Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4.1 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4.2 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.3 3D Geometric Metric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.4 3D Segmentation Metric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.5 Qualitative Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.6 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.6.1 Adding standard deviation between views in feature volume . . . . . . . . 16
2.4.6.2 Query vs heuristic tracking . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.5 Conclusion and Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Chapter 3: PlanarNeRF: Online Learning of Planar Primitives with Neural Radiance Fields . 18
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3.2 Framework Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3.3 Plane Rendering Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3.4 Lightweight Plane Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3.5 Global Memory Bank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.4.1 Baselines and Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.4.2 Datasets and Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.4.3 Qualitative Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.4.4 Ablation Studies† . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Chapter 4: OrientDream: Streamlining Text-to-3D Generation with Explicit Orientation
Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2.1 NeRFs for Image-to-3D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2.2 Text-to-Image Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.2.3 Text-to-3D Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3.1 Camera Orientation Conditioned Diffusion Model . . . . . . . . . . . . . . . . . . 36
4.3.2 Text to 3D generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.3.3 Decoupled Back Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.4.1 Training Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.4.2 Orientation-Controlled Image Generation . . . . . . . . . . . . . . . . . . . . . . . 41
4.4.3 Orientation-Controlled Diffusion for Text to 3D . . . . . . . . . . . . . . . . . . . . 43
4.4.4 Quantitative Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.4.5 Evaluation of Orientation Conditioned Image Generation. . . . . . . . . . . . . . . 44
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Chapter 5: EA3D: Edit Anything for 3D Objects . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.2.1 Text-to-3D Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.2.2 3D Editing in NeRFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.2.3 Segment Anything Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.3.1 3D Masks Modeling Guided by SAM . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.3.2 3D Mask-Guided Dual DMTet Distillation . . . . . . . . . . . . . . . . . . . . . . 52
5.3.3 Versatile Editing Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.4.1 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.4.2 Editing Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.4.3 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Chapter 6: Learning Where to Cut from Edited Videos . . . . . . . . . . . . . . . . . . . . . . 59
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.3 Problem Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.3.1 Learning from Edited Videos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.4 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.4.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.4.2 Contrastive Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.4.3 Temporal Augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.5.1 Classification Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.5.2 Distribution of model prediction scores . . . . . . . . . . . . . . . . . . . . . . . . 67
6.5.3 User Study Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
6.5.4 Grad-CAM Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.6 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.7 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Chapter 7: Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
7.1 OminiPlane: Reconstructing Planes from Single View, Sparse View, and Monocular Video . 73
7.1.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
7.1.1.1 Pretrained Geometric Prior Models . . . . . . . . . . . . . . . . . . . . . 74
7.1.1.2 Multi-Frame Aggregation and Point Cloud Representation . . . . . . . . . 74
7.1.1.3 Detection Queries for Plane Instance Segmentation . . . . . . . . . . . . 75
7.1.1.4 Detection-Guided RANSAC . . . . . . . . . . . . . . . . . . . . . . . . 75
7.1.1.5 Output: Accurate and Compact Planes . . . . . . . . . . . . . . . . . . . 75
7.1.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
7.1.2.1 Single View Plane Reconstruction . . . . . . . . . . . . . . . . . . . . . 75
7.1.2.2 Sparse View Plane Reconstruction . . . . . . . . . . . . . . . . . . . . . 77
7.1.2.3 Monocular Video Plane Reconstruction . . . . . . . . . . . . . . . . . . . 78
7.1.2.4 Quantitative Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
7.1.2.5 Qualitative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
7.1.3 Summary of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
7.2 Aligning automated metrics with human experts for evaluation of Structured Reconstruction 79
7.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
7.2.2 What do we actually need from the wireframe comparison metrics? . . . . . . . . . 81
7.2.3 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.2.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
7.2.4.1 Human wireframe ranking and its consistency . . . . . . . . . . . . . . . 85
Chapter 8: Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
List of Tables
2.1 3D geometry metrics on ScanNet. Our method outperforms the compared approaches by a
significant margin in almost all metrics. ↑ indicates bigger values are better, ↓ the opposite.
The best numbers are in bold. We use two different validation sets following Atlas [75] (top
block) and PlaneAE [58] (bottom block). . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 3D plane segmentation metrics on ScanNet. Our method also outperforms the competing
baseline approaches in almost all metrics when evaluating plane segmentation metrics.
↑ indicates bigger values are better, and ↓ indicates the opposite, namely smaller values
are better. The best numbers are in bold. We use two different validation sets following
previous work Atlas [75] (top block) and PlaneAE [58] (bottom block). . . . . . . . . . . . 16
2.3 Compare adding standard deviation in feature volume. ↑ indicates bigger values are
better, ↓ the opposite. The best numbers are in bold. . . . . . . . . . . . . . . . . . . . . . 17
2.4 Compare query vs heuristic tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1 Comparisons for 3D geometry, memory and speed on ScanNet. Red for the best and green
for the second best (same for the following). . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 3D plane instance segmentation comparison on ScanNet. . . . . . . . . . . . . . . . . . . . 25
3.3 Ablation studies for similarity measurement. . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.4 Ablation studies for similarity threshold. . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.5 Ablation studies for EMA coefficient. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.1 Comparison of R-Precision, A-LPIPS, and Zero123grad. Results are calculated on a
64-prompt dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2 Counting of Janus Scenes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.3 Generation Speed Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.4 Quantitative evaluation on image synthesis quality. . . . . . . . . . . . . . . . . . . . . . . 45
5.1 Quantitative results for content editing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.1 Quantitative comparison of accuracy values for different baselines and ablations. In this
table XE stands for cross entropy loss, CT stands for contrastive loss, and TA stands for
temporal augmentation. The last row prefixed by ⋆ is our proposed model. We can see that
adding the higher-level features improves accuracy, and our proposed contrastive learning
training scheme improves generalization to our test set. . . . . . . . . . . . . . . . . . . . . 68
6.2 User study results. Average evaluation scores (in percentage) for different tasks and
datasets. A higher number indicates that human viewers agree more with the model’s
prediction or the ground truth. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
7.1 Quantitative comparison of OminiPlane and baseline methods across different setups. . . . . 78
List of Figures
1.1 Spectrum of 3D Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Comparison of Gaussian Splatting results on interior views versus novel views . . . . . . . . 3
1.3 Illustration of the Janus Problem in 2D and 3D; it’s more severe in 3D than in 2D . . . . . . 4
1.4 Illustration of Current Editing Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1 Comparison between UniPlane and PlanarRecon. Left: predictions from our baseline
PlanarRecon. Middle: reconstructions from UniPlane. Right: ground-truth plane
reconstruction. Each color represents a plane instance. Textured planes are learned with
rendering loss. Our model is able to accurately detect more planes, improving both recall
and precision. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 2D toy example to illustrate view consistency. Each 2D pixel will project visual features
onto voxels accessible by a ray from it. Voxels receiving consistent visual features are
occupied, marked by color on the right. Voxels receiving different visual features are
unoccupied, and marked as white on the right. Voxels behind occupied voxels are occluded,
and marked as gray on the right. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Overall Architecture. From a sequence of posed images, we organize them into fragments.
A voxel feature grid is constructed for each fragment. The per-voxel features are used both
to make per-voxel predictions and as the key and value vectors for a transformer decoder. The
queries to the transformer consist of a sequence of learnable query vectors and the
plane embeddings from the previous fragment to track planes across fragments . . . . . . . . 12
2.4 Qualitative Results for Plane Detection on ScanNet. Our method outperforms
PlanarRecon. Our approach is able to reconstruct a more complete scene and retain more
details. Different colors indicate different surfaces’ segmentation, from which we can
observe that UniPlane achieves a much better result in both precision and recall. The spectrum
color indicates surface normal. Compared to PlanarRecon, UniPlane predicts much lower
normal errors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1 We introduce PlanarNeRF, a framework designed to detect dense 3D planar primitives
from monocular RGB and depth sequences. The method learns plane primitives in an online
fashion while drawing knowledge from both scene appearance and geometry. Displayed are
outcomes from two distinct scenes (Best viewed in color). Each case exhibits two rows: the
top row visualizes the reconstruction progress, while the bottom row showcases rendered
2D segmentation images at different time steps. . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2 Overview of PlanarNeRF. PlanarNeRF processes monocular RGB and depth image sequences,
enabling online pose estimation. It offers two modes: ⃝1 PlanarNeRF-S (supervised) with 2D plane
annotations, and ⃝2 PlanarNeRF-SS (self-supervised) without annotations. The framework includes
an efficient plane fitting module and a global memory bank for consistent plane labeling. . . . . . . 21
3.3 The global memory bank is updated with each new batch of estimated planes. Notations are
explained in Section 3.3.4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.4 Qualitative comparison of PlanarNeRF and baselines on ScanNet. . . . . . . . . . . . . . . . . . 29
3.5 Qualitative comparison between the recent SOTA — PlanarRecon [13] and ours on ScanNet. . . . . 30
3.6 Ablation for the number of samples used in PlanarNeRF. . . . . . . . . . . . . . . . . . . . . . 30
3.7 Results by PlanarNeRF for Replica and synthetic scenes from NeuralRGBD. . . . . . . . . . . . . 31
3.8 Tuning of different hyperparameters (see definitions of each in Section 3.3.4) in our lightweight
plane fitting module. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.9 Error map with (a) allowing gradients backpropagation and (b) blocking gradients backpropagation.
Red color means a high error and blue color means a low error. Note that the dark red region appears
in (a) and (b) because the ground truth fails to capture the window area. . . . . . . . . . . . . . . 31
4.1 3D objects generated by OrientDream pipeline using text input. Our OrientDream
pipeline successfully generates 3D objects with high-quality textures, effectively free from
the view inconsistencies commonly referred to as the Janus problem. We also present
visualizations of geometry and normal maps alongside each result for further detail and
clarity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2 Illustration of the Janus Problem: This figure showcases typical Janus issue manifestations, where the bunny is depicted with three ears, the pig with two noses, the eagle with a
pair of heads, and the frog anomalously having three back legs but only one front leg. . . . . 34
4.3 Overview of camera orientation conditioned Diffusion Model: This figure illustrates the
core components of our innovative model within the OrientDream pipeline. It highlights
how we integrate encoded camera orientations with text inputs, utilizing quaternion forms
and sine encoding for precise view angle differentiation. The model, fine-tuned on the
real-world MVImgDataset, demonstrates our approach to enhancing 3D consistency and
accuracy in NeRF generation, surpassing common limitations found in models trained
solely on synthetic 3D datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.4 Overview of Text-to-3D Generation Methodologies: This figure succinctly illustrates our
dual approach in applying multi-view diffusion models for 3D generation. It highlights
the use of multi-viewpoint images for sparse 3D reconstruction, alongside our focus on
employing an orientation-conditioned diffusion model for Score Distillation Sampling
(SDS). The figure showcases our innovative SDS adaptation, where traditional models are
replaced with our orientation conditioned diffusion model, seamlessly integrating camera
parameters and text prompts for enhanced 3D content generation. . . . . . . . . . . . . . . 38
4.5 Decoupled Sampling: We summarize our approach to enhancing NeRF optimization speed,
highlighting the shift from uniform sampling to targeted reduction in T steps and the use of
the DDIM solver for efficient computation, thereby improving both the diversity and texture
quality in 3D model generation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.6 Qualitative Results of Orientation-Controlled Image Generation: We showcase
the comparative results of our method against the text suffix-based approach and
MVDream [139]. While the text-based method yields images with inconsistent orientations,
both MVDream and our model exhibit precise control over viewing angles. Notably, our
method outperforms MVDream by generating photorealistic images with richer textures and
greater variety, highlighting its effectiveness in producing high-quality, orientation-specific
imagery. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.7 Qualitative experiment of text to 3D generation results. Each row represents a method
and each column represents an azimuth angle used to render the object. . . . . . . . . . . . . 42
5.1 We demonstrate the capability of our model to edit 3D objects at various levels of detail
across four rows. Users can just input the prompt (highlighted in orange text) for generating
and editing the 3D object. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.2 Illustration for our overall method. First, we start from an initial NeRF based on the
user’s prompt. We feed the projection from NeRF across different views into SAM to
get the binary masks specified by the user’s prompt. Concurrently, masks for non-target
areas are derived by deducting the masks of the specified areas from the NeRF’s depth
map. Subsequently, two occupancy networks are trained using both global and local shape
guidance, utilizing the multi-view masks from the two distinct mask sets. In phase 2,
we employ two DMTet networks that are initialized with the occupancy network and are
further refined under the guidance of the masked projections from NeRF. Finally, this setup
empowers users to perform versatile edits. They have the option to refine any of the DMTet
networks using a Text-to-3D approach for content modification or to implement geometric
transformations directly on the output mesh. . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.3 Qualitative comparison with DreamFusion in terms of content editing. . . . . . . . . . . . . 54
5.4 Illustration for showing our flexible geometrical transformation qualitatively. . . . . . . . . 56
5.5 Ablation study on the importance of occupancy modeling for the mask quality. . . . . . . . 56
5.6 Ablation on how segmentation performance affects the distillation. The sub-caption
represents the input and output resolution of SAM. The multi-view images are the distilled
results from the initial NeRF guided by the segmentation maps of SAM. . . . . . . . . . . . 57
5.7 Ablation study on 3D mask initialization for DMTets. . . . . . . . . . . . . . . . . . . . . . 57
6.1 Data sampling for our method used in the start and end prediction tasks. Positive samples
are clips that start (or end) with a cut, and negative samples are drawn randomly from the rest
of the video. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.2 Architecture of proposed network. Please see Section 6.4.1 for details. . . . . . . . . . . . . 63
6.3 Sampling Schemes. In each row we show two edited videos (grey bars), with cuts shown as
vertical lines. By changing where positive and negative samples are drawn from relative to
an anchor placed on a cut in the first video, we can make the task easier, or harder. . . . . . . 65
6.4 Randomized Frame Rate Conversion. We augment our training samples by jittering
sampling locations during rate conversion. . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.5 Difference between the shotcut detector and our model. The shotcut detector fires when the
window includes a cut. Our model sees continuous, unedited clips and the window does not
contain any cuts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.6 Average predicted scores on frames near the clip boundary. The x-axis is the offset of the
input clip as it shifts away from the boundary. The model’s prediction is highly concentrated
on the true positive at clip boundaries. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
6.7 Grad-CAM visualization of an input video clip. We can see that the network pays attention
to regions with motion, e.g., the waving hands. . . . . . . . . . . . . . . . . . . . . . . . . 70
6.8 Grad-CAM visualization of an input video clip. We can see that the network pays particular
attention to salient features such as the horizon . . . . . . . . . . . . . . . . . . . . . . . . 71
6.9 An example of automatic cut point prediction using our model. Given an input video, the
model predicts two curves. The starting cuts (orange) and ending cuts (red) are detected at
peak values of the curves. A candidate clip (blue segment) can be generated by combining
the cut points in either automatic or interactive fashion. . . . . . . . . . . . . . . . . . . . . 71
7.1 Comparison between OminiPlane and baselines in three setups. Left: Single View
Plane Reconstruction setup. The baseline is PlaneRecTR [219]. Predicted plane masks are
overlaid on the input image. Our method detects more fine-grained planes with varying
orientations. Middle: Sparse View setup. The baseline is PlaneFormers [220]. Our method
achieves better coverage and generates meshes without holes. Right: Monocular video
setup. The baseline is [13]. Our model is able to accurately detect more planes, improving
both recall and precision. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
7.2 Overall Architecture. The proposed method supports various inputs, including single view,
sparse view, and monocular video. Pretrained geometric priors, such as DUSt3R, provide
vertex-level properties (e.g., location, normal, semantic label), which are processed by
detection queries to generate binary masks for plane instance segmentation. A detection-guided RANSAC loop refines inliers to estimate plane coefficients and masks, enabling
accurate and compact plane recovery. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
7.3 Single View plane recovery result with comparison to GT and PlaneRecTR on the ScanNet
dataset. Input images are processed to generate ground truth (GT) plane masks, which
are compared with predictions from PlaneRecTR and the proposed OminiPlane method.
The visualizations include 2D plane masks and reconstructed 3D models, highlighting the
superior accuracy and detail achieved by the proposed approach. . . . . . . . . . . . . . . . 76
7.4 2-View Result. Reconstruction comparison on 2 views setup. . . . . . . . . . . . . . . . . 77
7.5 Multiview Test Results. Multiview reconstruction results on room-scale scenes. . . . . . . 77
7.6 Motivational example for this work. While humans tend to sort the wireframes from
best to worst in the presented order, the popular metrics sort them in different ways,
sometimes in the complete opposite order. Top row, left to right: ground truth wireframe, wireframe
with edges split into several, maintaining geometrical and topological accuracy, wireframe
with removed parts of the edges, but same vertices, wireframe with missing vertex and
edges, wireframe with only one correct vertex. Bottom row: distances between the GT and the
respective wireframe; numbers that change the sorting are in red. . . . . . . . . . . . . . . . 79
7.7 Wireframe ranking interface for human labelers. . . . . . . . . . . . . . . . . . . . . . . . 82
7.8 Examples of corrupted ground truth wireframes, used for wireframe ranking. Left to right:
ground truth, deformed edges, vertex duplication and random movement, edge addition,
edge deletion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Abstract
The ability to understand, generate, and modify 3D environments is foundational for applications such
as virtual reality, autonomous driving, and generative AI tools [1, 2, 3, 4]. However, existing methods
often rely on non-semantic point clouds as their representation, which capture only geometric information
without semantic context [5, 6, 7]. This limitation creates a significant gap in both interpretability and
performance when compared to methods that leverage semantic information [8, 9, 10]. Additionally, non-semantic approaches often struggle to scale effectively with increasing complexity, underscoring the need
for semantic structures to enhance scalability and adaptability [11, 12].
This dissertation addresses these limitations by introducing methods that emphasize controllable semantic structures in 3D understanding, generation, and editing. First, to improve 3D scene understanding, we
propose plane-aware techniques, such as planar priors and plane-splatting volume rendering, which provide
explicit geometric and semantic representations [13, 14, 15]. These methods enable more accurate and interpretable reconstructions compared to traditional point-cloud-based approaches [16, 17, 18]. Second, for 3D
content generation, we develop an orientation-conditioned diffusion model, enabling precise control over the
alignment and orientation of generated objects, enhancing flexibility and user interaction [19, 20, 21, 22].
Third, to facilitate intuitive editing of 3D environments, we introduce a method for projecting text-guided
2D segmentation maps onto 3D models, bridging the gap between semantic understanding and user-driven
modification [23, 24, 25].
These contributions collectively address the semantic and performance gaps in 3D reconstruction and
generation, demonstrating that integrating semantic information not only improves interpretability and precision but also enables models to scale more effectively for complex applications [26]. By combining controllable semantic structures with geometric understanding, this dissertation advances the state-of-the-art in
3D vision and generation, paving the way for more scalable, interpretable, and interactive 3D workflows.
Chapter 1
Introduction
Figure 1.1: Spectrum of 3D representations, from explicit, structured models (geometry models, meshes), which are easier to edit, through discrete representations (point clouds, RGBD, voxel grids), to implicit representations (SDF, NeRF, Gaussian Splatting, TriPlane), which capture more detail but are harder to edit.
1.1 3D Representation
Representing 3D shapes and scenes is a fundamental challenge in computer vision and graphics. A wide
variety of 3D representations exist, each with its own strengths and weaknesses. These representations can
be broadly categorized as:
• Structured Representations: These representations explicitly encode the underlying structure of the
3D object or scene, often using primitives like planes, cubes, or other geometric shapes. Examples
include:
– Geometry Models: Manually created CAD models are widely used in design and engineering,
providing high precision and interpretability [27].
– Mesh: A commonly used representation where surfaces are approximated by interconnected
polygons, defined by vertex locations and their connectivity [28]. Meshes balance detail and
editability, making them suitable for real-time rendering and simulation.
Structured representations have seen advancements in automated model generation for engineering
and design applications. For example, procedural mesh modeling [29] combines user-defined rules
with automated generation to create highly detailed models for applications such as architecture and
game design.
• Discrete Representations: These representations sample the 3D space at discrete points or volumes.
Examples include:
– Point Clouds: A collection of 3D points representing surface geometry, typically acquired
through scanning technologies such as LiDAR or RGBD sensors [30].
– RGBD Images: Images augmented with per-pixel depth information, often used in applications
such as robotics and AR/VR [31].
– Voxel Grids: A volumetric representation dividing the 3D space into regular grid cells, enabling
detailed modeling but often requiring substantial memory for high resolutions [32].
The use of voxel grids has evolved with memory-efficient data structures like sparse octrees, enabling
high-resolution modeling of complex environments [33]. Similarly, advancements in point cloud
processing have enabled robust alignment and segmentation for tasks like scene reconstruction and
object recognition [34].
• Implicit Representations: These representations define the 3D shape or scene indirectly through a
continuous function. Examples include:
– Signed Distance Functions (SDF): Represent shapes by their signed distance to a surface,
allowing efficient queries for intersection and collision detection [35].
– Neural Radiance Fields (NeRF): Represent a scene as a continuous volume, encoding appearance and density at each 3D location. The final appearance is computed by integrating density
and color along a ray, enabling photorealistic rendering and view synthesis [36].
– Gaussian Splatting: Represents scenes as a collection of Gaussian splats, capturing fine details
[37].
– Triplanes: Utilize three orthogonal planes to encode a scene by projecting 3D points onto these
planes and aggregating the corresponding features [38].
Implicit representations, particularly neural methods, have revolutionized photorealistic scene synthesis and geometry modeling. For instance, extensions of NeRF have introduced temporal modeling for
dynamic scenes [39] and incorporated multi-scale hierarchies for enhanced rendering efficiency [40].
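To make the NeRF rendering rule above concrete, the following is a minimal NumPy sketch of the standard discrete volume-rendering (alpha-compositing) approximation used by NeRF-style methods; the densities, colors, and sample spacings are stand-ins for what a trained field would actually predict along a camera ray.

```python
import numpy as np

def composite_ray(densities, colors, deltas):
    """Alpha-composite per-sample density and color along a single ray.

    densities: (N,) non-negative volume densities at the ray samples
    colors:    (N, 3) RGB values at the ray samples
    deltas:    (N,) spacing between consecutive samples
    Returns the rendered RGB value for the ray.
    """
    alphas = 1.0 - np.exp(-densities * deltas)       # opacity of each segment
    trans = np.cumprod(1.0 - alphas + 1e-10)         # transmittance after each segment
    trans = np.concatenate([[1.0], trans[:-1]])      # light surviving up to each segment
    weights = alphas * trans                         # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0)

# Toy usage: 64 random samples along one ray.
rng = np.random.default_rng(0)
rgb = composite_ray(rng.uniform(0, 5, 64), rng.uniform(0, 1, (64, 3)), np.full(64, 0.02))
```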
1.2 Bridging the Gap Between Implicit and Explicit 3D Representations
A clear trend emerges as 3D representations transition from explicit to implicit. Explicit models (e.g., CAD
models, meshes) are structured, interpretable, and easy to edit, but often struggle to capture the fine details
afforded by implicit models. Conversely, implicit models (e.g., NeRFs, SDFs) excel at preserving intricate
details and enabling photorealistic rendering, but they can be challenging to edit and interpret.
This dissertation aims to bridge this gap by transforming the outputs of implicit models into explicit
structures. As Rich Sutton argues in "The Bitter Lesson" [12], embracing scalable computation and general-purpose methodologies is crucial for long-term progress in machine learning. By incorporating explicit
structure into implicit representations, we adhere to this principle and achieve several key benefits:
• Performance: Our approach maintains the high performance of implicit representations, even when
dealing with complex scenes and large datasets, ensuring no significant drop in quality or efficiency.
• Interpretability: Structured outputs provide a clear and understandable representation of the 3D
scene, facilitating analysis, manipulation, and reasoning.
• Efficiency: Structured representations enable more efficient use of training data by embedding semantic priors and geometric constraints.
Figure 1.2: Comparison of Gaussian Splatting results on interior views versus novel views. (a) Interior view. (b) Bird's-eye view.
Our approach not only improves performance and usability but also provides a pathway for scalable,
interpretable, and application-ready 3D workflows. Ultimately, this dissertation advances the field of 3D
vision and modeling by combining the strengths of both implicit and explicit representations, leading to
more impactful and versatile solutions.
1.3 Challenges in 3D Representations
1.3.1 Implicit representation excels in visualization but lacks in geometry
While implicit models excel at capturing visual appearance, they often struggle to represent accurate geometry and underlying structure. For example, Gaussian Splatting, while producing impressive results for
novel views similar to training views, can exhibit artifacts such as spikes and struggle with bird’s-eye views,
as shown in Figure 1.2. This suggests limitations in capturing global geometric structure and highlights a
key challenge in utilizing implicit representations for tasks that require precise geometry.
1.3.2 Implicit representation struggles with repetition issues and the Janus problem
Figure 1.3: Illustration of the Janus Problem in 2D and 3D; it’s more severe in 3D than in 2D
Implicit representations, particularly those used in generative AI, often struggle with repetitive patterns
and the "Janus problem." [41] As illustrated in Fig. 1.3, the first row depicts 2D image generation results,
while the second row shows 3D model generation results derived by lifting 2D image guidance into 3D. In
both cases, significant anomalies are evident: the bunny has three ears, the pig two noses, the eagle two
heads, and the frog three back legs but only one front leg.
The "Janus problem" refers to the tendency of models to generate multiple instances of the same object
or feature, often in inconsistent or conflicting ways. This arises from:
• Lack of Global Context: Many implicit models operate locally, lacking the global context necessary
to ensure consistency and avoid repetition.
• 2D to 3D Ambiguity: When generating 3D shapes from 2D image guidance, ambiguities can arise,
leading to the model producing "typical" views even from non-typical viewpoints. This results in
inconsistencies and a lack of 3D coherence [42].
1.3.3 Implicit representation is hard to edit
Figure 1.4: Illustration of Current Editing Approach
Editing implicit representations can be challenging due to their lack of explicit structure. As illustrated
in Fig. 1.4, attempts to edit a scene by modifying the input prompt often lead to unintended consequences,
such as alterations to unrelated regions or disruptive changes to the background.
This lack of granular control stems from the inherent design of implicit representations, which encode
information as a continuous function without explicit delineation of individual components [43]. Consequently, adjustments made to one part of the representation may propagate throughout the entire scene,
making it difficult to isolate and manipulate specific elements. This absence of localized control hinders
intuitive and precise editing, limiting the usability of implicit representations for fine-tuned modifications,
especially in applications requiring targeted adjustments.
This limitation highlights the need for methodologies that integrate structural awareness into implicit
frameworks, enabling more intuitive and effective editing [44, 45].
1.4 Thesis Statement
While implicit representations excel in reconstruction quality, they often lack the interpretability and controllability of explicit structures. This dissertation argues that grounding implicit representations in semantic
geometric structures not only enhances their expressiveness and performance but also enables the generation
of outputs that better adhere to physical rules, improving editability and controllability for a wider range of
applications.
1.5 Outline of the thesis
The rest of this dissertation is organized as follows. Chapter 2 and Chapter 3 address the lack of meaningful
geometry in implicit models by converting implicit outputs into explicit plane structures, improving interpretability and usability. Chapter 4 introduces explicit orientation control for text-to-3D generation, tackling
challenges such as the Janus problem and repetition artifacts in implicit representations. Chapter 5 focuses
on enabling intuitive, natural language-driven editing of implicit representations, overcoming usability and
control limitations. Chapter 6 explores how structured editing insights can inform intelligent video segmentation and trimming, extending the principles of semantic structure to video workflows. Chapter 7 outlines
future directions, including generalizing plane-aware techniques (Section 7.1) and aligning automated evaluation metrics with human expertise (Section 7.2). Finally, Chapter 8 concludes by summarizing these
advancements and their broader implications for scalable, interpretable, and interactive 3D modeling.
Chapter 2
UniPlane: Unified Plane Detection and Reconstruction from Posed
Monocular Videos
Figure 2.1: Comparison between UniPlane and PlanarRecon. Left: predictions from our baseline PlanarRecon. Middle: reconstructions from UniPlane. Right: ground-truth plane reconstruction. Each color
represents a plane instance. Textured planes are learned with rendering loss. Our model is able to accurately
detect more planes, improving both recall and precision.
We present UniPlane, a novel method that unifies plane detection and reconstruction from posed monocular videos. Unlike existing methods that detect planes from local observations and associate them across
the video for the final reconstruction, UniPlane unifies both the detection and the reconstruction tasks in a
single network, which allows us to directly optimize final reconstruction quality and fully leverage temporal information. Specifically, we build a Transformers-based deep neural network that jointly constructs a
3D feature volume for the environment and estimates a set of per-plane embeddings as queries. UniPlane
directly reconstructs the 3D planes by taking dot products between voxel embeddings and the plane embeddings followed by binary thresholding. Extensive experiments on real-world datasets demonstrate that
UniPlane outperforms state-of-the-art methods in both plane detection and reconstruction tasks, achieving
+4.6 in F-score in geometry as well as consistent improvements in other geometry and segmentation metrics.
2.1 Introduction
3D reconstruction is one of the key topics in computer vision, which is fundamental to many high-level 3D
perception and scene understanding tasks [9, 46, 47]. 3D reconstruction is the process of reconstructing
the surrounding environment from sensor input, which may include color images, depth images, and camera
poses. The resulting reconstructed scene can take the form of a point cloud, mesh, or implicit surface. In this
paper, we aim for implicit surface reconstruction since applications, like augmented reality, virtual reality,
and autonomous driving, require a compact and explainable representation.
Considerable effort [3, 48] has been devoted to 3D reconstruction using depth data. However, even with
dedicated depth sensors, depth data remain unreliable and are affected by changing light conditions and
surface materials [49]. Moreover, depth sensors are often less computation-efficient, and are infeasible to
be mounted on many mobile devices, such as standalone VR/AR headsets and many non-flagship phones.
Therefore, visual-only approaches have gained a lot of interest. Currently, the typical pipeline involves first
estimating depth images from color images [50, 51, 52, 53, 54] and then using depth fusion to reconstruct
3D scene geometry. However, these two-stage approaches are unable to jointly optimize estimated depths
whose quality is often the performance bottleneck.
Recently, NeuralRecon [15] proposed an approach that avoids estimating view-dependent 2D depth maps
by projecting 2D image features into a 3D voxel grid, allowing for tracking and fusion of multiple input
frames. PlanarRecon [13] extends the framework to reconstruct explicit planar surfaces and demonstrates
that the dedicated explicit planar surface reconstruction framework achieves much better performance than
adding a separate surface extraction module (e.g., RANSAC plane fitting module) on top of NeuralRecon.
However, PlanarRecon still has several limitations. First, it only supervises individual voxels to predict per-voxel plane attributes, and clusters voxels into plane instances using mean-shift. The lack of instance mask
supervision on the clustering results often leads to inaccurate boundaries and problematic segmentation
(e.g., over-segmentation). The second limitation is that it could only handle a video fragment (i.e., 9 frames)
in the detection stage, and a tracking and fusion module is needed to merge results across video fragments.
Though the tracking module utilizes learned features as one metric, it still heavily relies on other hand-crafted heuristics. Merging failures in the tracking and fusion module lead to duplicate planes even with
good single-fragment results.
To address these issues, our UniPlane deploys a Transformers-based deep network to jointly construct a 3D
embedding volume for the environment and estimate a set of per-plane embeddings as queries following the
idea of [55]. During training, the dot products between voxel embeddings and plane embeddings are supervised using ground truth segmentation. During inference, we threshold dot products to reconstruct plane
masks. We can directly model 3D planes using reconstructed plane masks and predicted plane parameters.
In addition, we exploit the sparse nature of the 3D feature volume by attending to the high-occupancy region.
Furthermore, the reconstructed surfaces are refined using a rendering loss with respect to input images.
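As a rough illustration of the dot-product mechanism described above, the sketch below forms plane instance masks by comparing per-voxel embeddings against per-plane query embeddings and thresholding the result; the embedding dimension, the threshold value, and the tensor names are illustrative assumptions rather than UniPlane's exact configuration.

```python
import torch

def plane_masks(voxel_emb: torch.Tensor, plane_emb: torch.Tensor, tau: float = 0.5):
    """Assign voxels to plane instances from learned embeddings.

    voxel_emb: (V, C) one embedding per (sparse) occupied voxel
    plane_emb: (P, C) one embedding per plane query
    Returns a (P, V) boolean mask: entry (p, v) says voxel v belongs to plane p.
    """
    logits = plane_emb @ voxel_emb.T   # (P, V) dot products
    probs = torch.sigmoid(logits)      # supervised against ground-truth segmentation in training
    return probs > tau                 # binarized masks at inference time

# Toy usage: 20 plane queries, 5000 occupied voxels, 32-dim embeddings.
masks = plane_masks(torch.randn(5000, 32), torch.randn(20, 32))
```

The masked voxels, together with the per-plane parameters predicted from the same queries, then yield the reconstructed 3D planes.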
Similar to PlanarRecon, our method takes a monocular video as input, operating on a 3D feature volume
fused by a 2D image sequence. Different from PlanarRecon, our method queries per-object embedding
vector over the entire scene. To effectively query the scene, we propose to leverage sparse attention.
To further improve the reconstruction quality, we utilize view consistency to eliminate the null space and
refine the generated surface with rendering loss, in which our approach predicts a set of 3D surfaces whose
renderings match the original 2D images.
The surface-based representation is a more effective model for representing real-world environments. It is
a sparse representation, which avoids wasting resources on empty space. Common objects such as floors,
walls, ceilings, beds, and desks can be efficiently modeled as surfaces with height fields. This approach
leverages prior knowledge that 3D objects are sparsely distributed throughout the environment, and that
each object occupies a continuous region in space.
To summarize, our contributions are threefold:
1. With the differentiable instance segmentation network, UniPlane is able to learn more accurate plane
boundaries using segmentation supervision.
2. The learned per-plane embedding vectors enable feature-based plane tracking across the entire video.
3. UniPlane further improves the surface reconstruction quality by filtering null space voxels based on
view consistency and enforcing a rendering loss.
2.2 Related work
2.2.1 Single-view plane reconstruction
For plane detection from single images, most traditional methods [56, 57] rely on depth input. Recently,
learning-based approaches [58, 59, 8, 10, 60] formulate the task as an instance segmentation problem and
train deep networks to jointly estimate the plane instance segmentation and per-instance plane parameters.
These methods are able to achieve high-quality plane instance segmentation, but the predicted 3D geometries are less accurate due to the ill-posed nature of the task. Follow-up works [61, 62] improve the
reconstruction accuracy by predicting and enforcing the relationships between plane instances. Even then,
the predicted geometries are not centimeter-accurate, making it extremely challenging to develop a robust
multi-view tracking system upon these single-view detection networks.
2.2.2 Multi-view plane reconstruction
There has been extensive research on multi-view plane reconstruction with known camera poses. Early
approaches [63, 64] first reconstruct a sparse set of 3D points or line features, which were then grouped
together using certain heuristics. However, these methods heavily relied on hand-crafted features and were
not robust to lighting changes or textureless regions. Other methods [65, 66, 67, 67] approach the problem
as an image segmentation task, where each pixel is assigned to one of the plane hypotheses using MRF
formulation.
Jin et al. [68] extend the single-view learning approach [8] to two views, and add an optimization to maximize
the consistency between two sets of planes reconstructed from each view. This pairwise optimization can be
repeated to handle a few more images, but the process is time-consuming and it is unclear how to generalize
it to more frames (e.g., the entire video).
Recently, PlanarRecon [13] builds a learning-based multi-view plane reconstruction system following the
volumetric reconstruction approach of NeuralRecon [15]. By learning planes directly from multiple views
in an end-to-end manner, PlanarRecon stands out from its counterparts, especially in geometric accuracy.
While PlanarRecon pushes the frontier of plane reconstruction, it has two limitations: 1) its clustering-based
segmentation is often inaccurate due to the lack of instance-level supervision on the segmentation, 2) it
produces plane duplicates when its tracking and fusion module fails to merge planes properly. To address
these limitations, we train transformers to jointly learn per-instance plane features and segmentation masks
(i.e., matching between plane features and voxel features). The learned plane features can also be used to
track planes across video fragments without the need of hand-crafted heuristics.
2.2.3 Learning-based tracking and reconstruction
While deep networks are able to detect object instances from single images or video fragments and extract
corresponding features, the instance association across frames is still challenging since object features are
view-dependent due to lighting changes and partial observation.
Many papers study image correspondence learning by developing feature-matching networks. Here, we
discuss a few notable ones and refer to [69] for a comprehensive survey. SuperGlue designs a feature-matching network to estimate the correspondences between learned image features. The key contribution of
SuperGlue is to use attention mechanisms to focus on the most discriminative parts of the images and
filter out irrelevant features during the matching process. TransMatcher [70] uses Transformers to apply
the attention mechanisms. However, one cannot directly apply these approaches to address the problem
of object instance tracking. Object tracking requires learned object features to be view-independent (e.g.,
similar features for two sides of the same chair) and discriminative (e.g., a chair’s feature should be different
from the others), and learning such features is not trivial. For this reason, even though PlanarRecon [13]
deploys SuperGlue [71] in their tracking and fusion module, it only uses the feature-based matching as one
optional metric among many other hand-crafted matching metrics which are shown to be more important
(e.g., Intersection-over-Union and plane-to-plane distance).
Several attempts have been made to tackle the object tracking problem. TrackFormer [72] deploys a Transformer to detect objects and learn the trajectories of the tracked objects in an end-to-end manner. However,
TrackFormer focuses on the multi-object tracking setting where tracked objects have a smooth 2D trajectory
on the image with a similar viewing angle. This generic image-based tracking approach is not applicable
to the object instance tracking problem where viewing angles change dramatically.
To utilize object priors as a remedy, ODAM performs object detection, association, and mapping jointly using
a novel optimization framework. The key is to predict the object poses which are used as prior in the joint
optimization. Wang et al. [73] develop an end-to-end object reconstruction framework upon a volumetric
feature representation constructed by a 3D volume transformer. However, these approaches focus on object
reconstruction where object-specific priors like object bounding box and object-centric feature volume can
be used. It is unclear how to extend them to reconstruct large environments. In this paper, we track and
reconstruct large planar surfaces in large environments using a Transformer network that operates on a
sparse volume.
Figure 2.2: 2D toy example to illustrate view consistency. Each 2D pixel will project visual features onto voxels accessible by a ray from it. Voxels receiving consistent visual features are occupied, marked by color on the right. Voxels receiving different visual features are unoccupied, and marked as white on the right. Voxels behind occupied voxels are occluded, and marked as gray on the right.
2.3 Methods
This section explains how UniPlane constructs the 3D feature volume (Sec. 2.3.1), detects planes differentiably from 3D feature volumes using Transformers (Sec. 2.3.2), unifies plane reconstruction with detection (Sec. 2.3.3), and refines geometries with a rendering loss (Sec. ??).
2.3.1 Feature volume construction
Following PlanarRecon [13], we use a 2D backbone named MnasNet [74] to extract features from 2D images
and unproject image features to a shared 3D volume space. A voxel takes the average features from multiple
pixels across different views. PlanarRecon [13] deploys sparse 3D convolution to exploit the sparsity of the
feature volume by predicting which voxels are occupied and attending to those voxels.
However, the occupancy prediction of PlanarRecon is not effective at pruning free-space voxels. For a free-space voxel, its feature mixes visual information corresponding to different surface points in different views (i.e., the actual surface point has the same image projection as the voxel but a different depth). The mixed feature is often not discriminative enough for occupancy prediction. We improve the occupancy prediction by adding the standard deviation as an input feature. The intuition is that an occupied voxel on an actual surface should have similar features across different views, and thus its feature standard deviation should be small, providing a direct signal for occupancy prediction. This is similar to the view-consistency idea widely used in the literature.
We further conduct experiments on adding the 2D feature standard deviation and find that it helps both occupancy prediction and the final plane prediction.
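As a concrete illustration, the sketch below aggregates the mean and standard deviation of back-projected 2D features per voxel; the tensor shapes and the names view_feats and view_valid are assumptions, not the exact UniPlane implementation.

```python
# A minimal sketch of per-voxel feature aggregation across views: each voxel
# keeps the mean of the back-projected 2D features and, as an extra occupancy
# cue, their standard deviation (small on true surfaces).
import torch

def aggregate_voxel_features(view_feats, view_valid):
    """view_feats: (V, N, C) features projected from V views onto N voxels;
    view_valid: (V, N) mask of voxels visible in each view."""
    mask = view_valid.unsqueeze(-1).float()             # (V, N, 1)
    count = mask.sum(dim=0).clamp(min=1.0)              # (N, 1)
    mean = (view_feats * mask).sum(dim=0) / count       # (N, C)
    var = ((view_feats - mean) ** 2 * mask).sum(dim=0) / count
    std = var.sqrt()
    return torch.cat([mean, std], dim=-1)               # (N, 2C) voxel feature
```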
Figure 2.3: Overall Architecture. From a sequence of posed images, we organize them into fragments. A voxel feature grid is constructed for each fragment. The per-voxel features are used both to make per-voxel predictions and as the key and value vectors for a transformer decoder. The query vectors to the transformer consist of a sequence of learnable detection queries and the plane embeddings from the previous fragment, which track planes across fragments.
2.3.2 Transformers-based Plane Detection
Following PlanarRecon [13], we get a per-voxel embedding $\xi_{\text{voxel}} \in \mathbb{R}^{C_\xi \times D^3}$. Then, by performing occupancy analysis, we get the occupied voxel set $M$ and per-voxel features $\xi_{\text{voxel}} \in \mathbb{R}^{C_\xi \times M}$, which will be used to regress the plane parameters, including the surface normal $\mathbf{n}_m \in \mathbb{R}^3$, the plane offset $d_m \in \mathbb{R}^1$, and the plane center $\mathbf{c}_m \in \mathbb{R}^3$ associated with the corresponding voxel.
As illustrated in Figure 7.2, each voxel makes a prediction of the location and normal of the primitive it belongs to. Once we have geometric primitives for every single voxel (shifted voxels $x'_m \in \mathbb{R}^3$ and surface normals $\mathbf{n}_m \in \mathbb{R}^3$), we group the voxels to form plane instances in the local fragment volume $F_i$.
Unlike PlanarRecon [13], which uses k-means to perform voxel-to-plane assignment, we train a volume segmentation network that predicts instance segmentation differentiably, following the idea of MaskFormer [55]. The segmentation network avoids hyperparameter tuning and enables direct segmentation supervision.
To construct the supervision, we find a matching $\sigma$ between the set of predictions $z$ and the set of $N^{gt}$ ground-truth segments $z^{gt} = \{(c_i^{gt}, m_i^{gt}) \mid c_i^{gt} \in \{1, \dots, K\},\ m_i^{gt} \in \{0,1\}^{H \times W}\}_{i=1}^{N^{gt}}$. Here $c_i^{gt}$ is the ground-truth class of the $i$-th ground-truth segment. Since the size of the prediction set $|z| = N$ and that of the ground-truth set $|z^{gt}| = N^{gt}$ generally differ, we assume $N \geq N^{gt}$ and pad the set of ground-truth labels with "no object" tokens $\varnothing$ to allow one-to-one matching.
To train model parameters, given a matching, the main mask classification loss $\mathcal{L}_{\text{mask-cls}}$ is composed of a cross-entropy classification loss and a binary mask loss $\mathcal{L}_{\text{mask}}$ for each predicted segment:
$$\mathcal{L}_{\text{mask-cls}}(z, z^{gt}) = \sum_{j=1}^{N} \left[ -\log p_{\sigma(j)}\!\left(c_j^{gt}\right) + \mathbb{1}_{c_j^{gt} \neq \varnothing}\, \mathcal{L}_{\text{mask}}\!\left(m_{\sigma(j)}, m_j^{gt}\right) \right]. \qquad (2.1)$$
The learning-based fully differentiable segmentation network achieves better segmentation quality and enables unified reconstruction as we discuss in the following section.
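For illustration, a minimal sketch of the bipartite matching and the loss in Eq. (2.1) is given below, using SciPy's Hungarian solver and PyTorch; the tensor shapes, cost weighting, and the function name mask_cls_loss are assumptions rather than the exact training code.

```python
# Sketch of Hungarian matching between predicted segments and ground truth,
# followed by the mask-classification loss of Eq. (2.1).
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def mask_cls_loss(pred_logits, pred_masks, gt_classes, gt_masks, no_object_idx):
    """pred_logits: (N, K+1) class logits; pred_masks: (N, V) per-voxel mask logits;
    gt_classes: (Ngt,) int64; gt_masks: (Ngt, V) binary; no_object_idx = K."""
    N = pred_logits.shape[0]
    prob = pred_logits.softmax(-1)                                   # (N, K+1)
    cost_cls = -prob[:, gt_classes]                                  # (N, Ngt)
    cost_mask = torch.cdist(pred_masks.sigmoid(), gt_masks.float(), p=1) / gt_masks.shape[1]
    rows, cols = linear_sum_assignment((cost_cls + cost_mask).detach().cpu().numpy())
    rows, cols = torch.as_tensor(rows), torch.as_tensor(cols)
    # Unmatched predictions receive the "no object" class (the indicator term in Eq. (2.1)).
    target_cls = torch.full((N,), no_object_idx, dtype=torch.long, device=pred_logits.device)
    target_cls[rows] = gt_classes[cols]
    loss_cls = F.cross_entropy(pred_logits, target_cls)
    loss_mask = F.binary_cross_entropy_with_logits(pred_masks[rows], gt_masks[cols].float())
    return loss_cls + loss_mask
```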
2.3.3 Unifying Tracking with Reconstruction
Inspired by TrackFormer [72], we unify plane tracking and reconstruction in the same volume segmentation network by using the per-plane features from the previous fragment as tracking queries in the next fragment (Fig. 7.2).
In fragment $i$, detection query $n$ detects plane instance $P_i^n$. We then use the same query $n$ as a tracking query in fragment $i+1$: if it matches a plane instance $P_{i+1}^n$, then $P_i^n$ is merged with $P_{i+1}^n$.
By integrating reconstruction and detection in the same unified framework, we eliminate the need for a separate heuristic-heavy tracking and fusion module. This end-to-end network outperforms the two-stage PlanarRecon (Table 2.1), has fewer parameters, and is easier to train.
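A minimal sketch of how tracking queries from the previous fragment could be concatenated with learnable detection queries before a transformer decoder is shown below; the module name PlaneQueryDecoder, the dimensions, and the dot-product mask head are illustrative assumptions.

```python
# Sketch: K1 tracking queries (plane embeddings from the previous fragment)
# are prepended to N learnable detection queries, and per-voxel mask logits
# are produced by a dot product between plane and voxel embeddings.
import torch
import torch.nn as nn

class PlaneQueryDecoder(nn.Module):
    def __init__(self, dim=256, num_detection_queries=100, num_layers=4):
        super().__init__()
        self.detection_queries = nn.Parameter(torch.randn(num_detection_queries, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, voxel_feats, track_queries=None):
        """voxel_feats: (1, M, dim) features of occupied voxels;
        track_queries: (1, K1, dim) plane embeddings from the previous fragment."""
        queries = self.detection_queries.unsqueeze(0)                 # (1, N, dim)
        if track_queries is not None:
            queries = torch.cat([track_queries, queries], dim=1)      # (1, K1 + N, dim)
        plane_emb = self.decoder(tgt=queries, memory=voxel_feats)     # per-plane embeddings
        masks = torch.einsum('bqc,bmc->bqm', plane_emb, voxel_feats)  # per-voxel mask logits
        return plane_emb, masks
```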
Table 2.1: 3D geometry metrics on ScanNet. Our method outperforms the compared approaches by a significant margin in almost all metrics. ↑ indicates bigger values are better, ↓ the opposite. The best numbers are in bold. We use two different validation sets following Atlas [75] (top block) and PlaneAE [58] (bottom block).

| Method | Validation set | Comp ↓ | Acc ↓ | Recall ↑ | Prec ↑ | F-score ↑ | Max Mem. (GB) ↓ | Time (ms/keyframe) ↓ |
|---|---|---|---|---|---|---|---|---|
| NeuralRecon [15] + Seq-RANSAC | Atlas [75] | 0.144 | 0.128 | 0.296 | 0.306 | 0.296 | 4.39 | 586 |
| Atlas [75] + Seq-RANSAC | Atlas [75] | 0.102 | 0.190 | 0.316 | 0.348 | 0.331 | 25.91 | 848 |
| ESTDepth [76] + PEAC [77] | Atlas [75] | 0.174 | 0.135 | 0.289 | 0.335 | 0.304 | 5.44 | 101 |
| PlanarRecon [13] | Atlas [75] | 0.154 | 0.105 | 0.355 | 0.398 | 0.372 | 4.43 | 40 |
| Ours | Atlas [75] | 0.094 | 0.133 | 0.429 | 0.409 | 0.418 | 8.23 | 44 |
| PlaneAE [58] | PlaneAE [58] | 0.128 | 0.151 | 0.330 | 0.262 | 0.290 | 6.29 | 32 |
| PlanarRecon [13] | PlaneAE [58] | 0.143 | 0.098 | 0.372 | 0.412 | 0.389 | 4.43 | 40 |
| Ours | PlaneAE [58] | 0.113 | 0.126 | 0.446 | 0.415 | 0.429 | 8.23 | 44 |
2.4 Experiments
This section discusses the implementation details (Sec. 2.4.1), the evaluation setup (Sec. 2.4.2), quantitative
and qualitative evaluations, and ablation studies.
2.4.1 Implementation Details
We use torchsparse [78] to implement the 3D backbone composed of 3D sparse convolutions. The image backbone is a variant of MnasNet [74] and is initialized with weights pre-trained on ImageNet [79]. The entire network is trained end-to-end with randomly initialized weights except for the image backbone.
Figure 2.4: Qualitative Results for Plane Detection on ScanNet. Our method outperforms PlanarRecon: it reconstructs a more complete scene and retains more details. Different colors indicate different surface segmentations, from which we can observe that UniPlane achieves much better results in both precision and recall. The spectrum colors indicate surface normals. Compared to PlanarRecon, UniPlane predicts much lower normal errors.
The occupancy score $o$ is predicted with a sigmoid layer. The voxel size of the last level is 4 cm. The number of detection queries is 100.
2.4.2 Setup
Datasets. We perform the experiments on ScanNetv2 [17]. The ScanNetv2 dataset contains 1613 RGB-D video sequences of indoor scenes captured by a mobile device with an attached depth camera. A camera pose is associated with each frame. As no ground truths are provided for the test set, we follow PlaneRCNN [8] to generate 3D plane labels on the training and validation sets. Our method is evaluated on two different validation sets with different scene splits used in previous works [75, 59].
Evaluation Metrics. We evaluate the performance of our method in terms of the 3D plane detection, which
can be evaluated using instance segmentation and 3D reconstruction metrics following previous works.
For plane instance segmentation, due to the geometry difference between the ground truth and prediction
meshes, we follow the semantic evaluation method proposed in [75]. More specifically, given a vertex
in the ground truth mesh, we first locate its nearest neighbor in the predicted mesh and then transfer its
prediction label. We employed three commonly used single-view plane segmentation metrics [80, 8, 60,
58] for our evaluation: rand index (RI), variation of information (VOI), and segmentation covering (SC).
We also evaluate the geometry difference between predicted planes and ground truth planes. We densely
sample points on the predicted planes and evaluate the 3D reconstruction quality using 3D geometry metrics
presented by Murez et allet@tokeneonedot [75].
Baselines. Since there is no previous work focusing on learning-based multi-view 3D plane detection, we compare our method with the following four types of approaches: (1) single-view plane recovering [58]; (2) multi-view depth estimation [76] + depth-based plane detection [77]; (3) volume-based 3D reconstruction [15, 75] + Sequential RANSAC [81]; and (4) PlanarRecon [13].
Since baselines (1) and (2) predict planes for each view, we add a simple tracking module to merge planes
predicted by the baseline in order to provide a fair comparison. The tracking and merging process we
designed for our baselines is detailed in the supplementary material. We use the same key frames as in
PlanarRecon for baselines (1) and (2). For (3), we first employ [15, 75] to estimate the 3D mesh of the
scene, and perform sequential RANSAC to group the oriented vertices of the mesh into planes. Please refer
to our supplementary material for the details of the sequential RANSAC algorithm. For [15, 75], we run
sequential RANSAC every time a new 3D reconstruction is completed to achieve incremental 3D
plane detection.
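For reference, a generic sequential RANSAC over oriented mesh vertices might look like the following sketch; the thresholds and the function name sequential_ransac are illustrative, and this is not the exact algorithm described in the supplementary material.

```python
# Sketch of sequential RANSAC: repeatedly fit a plane seeded from an oriented
# vertex, collect inliers by point-to-plane distance and normal agreement,
# remove them, and repeat.
import numpy as np

def sequential_ransac(points, normals, dist_thresh=0.05, normal_thresh=0.8,
                      min_inliers=500, iters=1000, max_planes=50, rng=None):
    rng = rng or np.random.default_rng(0)
    remaining = np.arange(len(points))
    planes = []
    while len(remaining) > min_inliers and len(planes) < max_planes:
        best_inliers = None
        for _ in range(iters):
            i = rng.choice(remaining)                      # seed from one oriented vertex
            n, d = normals[i], normals[i] @ points[i]
            dist_ok = np.abs(points[remaining] @ n - d) < dist_thresh
            normal_ok = normals[remaining] @ n > normal_thresh
            inliers = remaining[dist_ok & normal_ok]
            if best_inliers is None or len(inliers) > len(best_inliers):
                best_inliers = inliers
        if len(best_inliers) < min_inliers:
            break
        planes.append(best_inliers)                        # one detected plane segment
        remaining = np.setdiff1d(remaining, best_inliers)
    return planes
```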
2.4.3 3D Geometric Metric
We first use the RANSAC [81] method to extract planes from the dense mesh, and then use the extracted planes as the ground-truth targets/supervision for plane detection. A qualitative result is shown in Figure 2.4. In the plane-only setting, our method correctly reconstructs wall corners without confusion on the wall offsets, because it explicitly reconstructs the height of each surface. With the texture map and height map enabled, our method is able to reconstruct the scene in high fidelity.
Table 2.2: 3D plane segmentation metrics on ScanNet. Our method also outperforms the competing baseline approaches in almost all metrics when evaluating plane segmentation metrics. ↑ indicates bigger values are better, and ↓ indicates the opposite, namely smaller values are better. The best numbers are in bold. We use two different validation sets following previous work Atlas [75] (top block) and PlaneAE [58] (bottom block).

| Method | Validation set | VOI ↓ | RI ↑ | SC ↑ |
|---|---|---|---|---|
| NeuralRecon [15] + Seq-RANSAC | Atlas [75] | 8.087 | 0.828 | 0.066 |
| Atlas [75] + Seq-RANSAC | Atlas [75] | 8.485 | 0.838 | 0.057 |
| ESTDepth [76] + PEAC [77] | Atlas [75] | 4.470 | 0.877 | 0.163 |
| PlanarRecon | Atlas [75] | 3.622 | 0.897 | 0.248 |
| Ours | Atlas [75] | 3.215 | 0.905 | 0.288 |
| PlaneAE [58] | PlaneAE [58] | 4.103 | 0.908 | 0.188 |
| PlanarRecon | PlaneAE [58] | 3.622 | 0.898 | 0.247 |
| Ours | PlaneAE [58] | 3.210 | 0.905 | 0.288 |
2.4.4 3D Segmentation Metric
We also perform a quantitative evaluation for plane detection using 3D metrics; the results are available in Table 2.2. Our method produces a more complete and detailed reconstruction because it extracts the major planes in the scenes and uses a rendering loss to perform pixel-level optimization.
2.4.5 Qualitative Results
We provide qualitative results in Figure 2.4. Visually, we compare with our baseline method PlanarRecon in terms of reconstructed plane surfaces as well as normal errors. We visualize the reconstructed surfaces and indicate them with different colors. From Figure 2.4, we can easily observe that our approach UniPlane achieves much better visual results, i.e., the plane instances are more precisely reconstructed, and more plane instances are reconstructed. The visual results match the quantitative results in terms of precision and recall. We also visualize heat maps of the normal errors: in Figure 2.4, red colors indicate higher errors while blue colors indicate lower errors. It is easy to observe that UniPlane has lower normal errors in the visualizations.
2.4.6 Ablation Study
In this section, we ablate the design choices of our method, including adding the standard deviation to the feature volume, using object queries, and the effectiveness of refining surfaces with differentiable rendering.
2.4.6.1 Adding standard deviation between views in feature volume
We would like to verify whether adding the standard deviation between views is helpful. We compare four models: (1) standard PlanarRecon; (2) PlanarRecon with std between views; (3) the proposed UniPlane without std between views; and (4) the proposed UniPlane full model.
We present the experimental results in Table 2.3. They show that adding the standard deviation between views to the feature volume increases the F1 score of PlanarRecon and UniPlane by 0.011 and
Table 2.3: Compare adding standard deviation in feature volume. ↑ indicates bigger values are better, ↓ the opposite. The best numbers are in bold.

| Method | Recall ↑ | Prec ↑ | F1-score ↑ |
|---|---|---|---|
| PlanarRecon | 0.355 | 0.398 | 0.372 |
| PlanarRecon w/ std | 0.359 | 0.412 | 0.383 |
| UniPlane w/o std | 0.404 | 0.387 | 0.395 |
| UniPlane | 0.429 | 0.409 | 0.418 |
Table 2.4: Compare query vs heuristic tracking.

| Method | Recall ↑ | Prec ↑ | F1-score ↑ |
|---|---|---|---|
| UniPlane w/ heuristic tracking | 0.377 | 0.398 | 0.387 |
| UniPlane | 0.429 | 0.409 | 0.418 |
0.023, respectively. The improvement on UniPlane is more significant; we conclude that the query-based plane detection method utilizes this information more effectively.
2.4.6.2 Query vs heuristic tracking
We would like to verify whether using plane queries from the previous fragment to track planes is more effective than heuristic-based tracking. We compare two models: (1) the proposed UniPlane with heuristic tracking, and (2) the proposed UniPlane full model. PlanarRecon cannot be used with query-based tracking, as it does not learn plane queries.
We present the experimental results in Table 2.4. We observe that both recall and precision are lower with heuristic tracking, with recall decreasing more significantly. We conclude this is because heuristic tracking tends to incorrectly match nearby planes with similar plane normals; we can observe in Figure 2.4 that the heuristic tracking method incorrectly fuses nearby planes.
2.5 Conclusion and Future work
In this paper, we present UniPlane, a novel method that reconstructs 3D scenes from a posed monocular video. We have shown its outstanding performance over current state-of-the-art methods on plane detection and object reconstruction. Our method can be generalized to reconstructing 3D objects using primitive geometries with associated texture and height maps. So far we use only planes as the primitive geometry; in future work we plan to extend this work to boxes, spheres, or Non-Uniform Rational B-Splines (NURBS) [82] surfaces as primitives, which would give the model much better expressiveness while keeping the representation compact and interpretable.
Chapter 3
PlanarNeRF: Online Learning of Planar Primitives with Neural Radiance
Fields
Figure 3.1: We introduce PlanarNeRF, a framework designed to detect dense 3D planar primitives from monocular RGB and depth sequences. The method learns plane primitives in an online fashion while drawing knowledge from both scene appearance and geometry. Displayed are outcomes from two distinct scenes (best viewed in color). Each case exhibits two rows: the top row visualizes the reconstruction progress, while the bottom row showcases rendered 2D segmentation images at different time steps.
Identifying spatially complete planar primitives from visual data is a crucial task in computer vision. Prior
methods are largely restricted to either 2D segment recovery or simplifying 3D structures, even with extensive plane annotations. We present PlanarNeRF, a novel framework capable of detecting dense 3D
planes through online learning. Drawing upon the neural field representation, PlanarNeRF brings three
major contributions. First, it enhances 3D plane detection with concurrent appearance and geometry knowledge. Second, a lightweight plane fitting module is proposed to estimate plane parameters. Third, a novel
global memory bank structure with an update mechanism is introduced, ensuring consistent cross-frame
correspondence. The flexible architecture of PlanarNeRF allows it to function in both 2D-supervised and
self-supervised solutions, in each of which it can effectively learn from sparse training signals, significantly
improving training efficiency. Through extensive experiments, we demonstrate the effectiveness of PlanarNeRF in various scenarios and remarkable improvement over existing works.
3.1 Introduction
Planar primitives stand out as critical elements in structured environments such as indoor rooms and urban buildings. Capturing these planes offers a concise and efficient representation, and holds great impact
across a spectrum of applications, including Virtual Reality, Augmented Reality, and robotic manipulation,
etc. Beyond serving as a fundamental modeling block, planes are widely used in many
data processing tasks, including object detection [83], registration [84, 85], pose estimation [86], and SLAM
[87, 88, 89].
Extensive efforts have been dedicated to exploring different plane detection methodologies. Nevertheless,
notable limitations persist within current approaches. First, many of them produce only isolated per-view 2D
plane segments [8, 90, 14]. Although certain methods [91, 92, 93] establish correspondences across sparse
(typically two) views, they still lack spatial consistency, leading to incomplete scene representations. Recently, an end-to-end deep model [13] was introduced for 3D plane detection; however, its outcomes tend to
oversimplify scene structures. Moreover, the aforementioned models heavily rely on extensive annotations
— pose, 2D planes, and 3D planes — consequently limiting their generalization capabilities. While fitting-based methods like [94, 16] operate without annotations, they are typically restricted to offline detection,
involving heavy iterations and posing computational challenges.
We propose PlanarNeRF, an online 3D plane detection framework (Fig. 3.1) that overcomes the above limitations. Specifically, we extend the neural field representation to regress plane primitives with both appearance
and geometry for more complete and accurate results. The framework’s efficient network design allows for
dual operational modes: PlanarNeRF-S, a supervised mode leveraging sparse 2D plane annotations; and
PlanarNeRF-SS, a self-supervised mode that extracts planes directly from depth images. In PlanarNeRF-SS, we design a lightweight plane fitting module to estimate plane parameters from a highly sparse set of
sampled points in each iteration. Then a global memory bank is maintained to ensure consistent tracking of
plane instances across different views and to generate labels for the sparse points. The inherent multi-view
consistency and smoothness of NeRF facilitate the propagation of sparse labels. Our key contributions are
as follows:
• Introduction of PlanarNeRF, a novel approach for detecting dense 3D planar primitives using online
learning, integrating insights from both the scene’s appearance and its geometric attributes.
• Development of a lightweight plane fitting module, estimating plane parameters for sparsely sampled
points.
• Establishment of a novel global memory bank module with an update mechanism, ensuring consistent
tracking of plane instances and facilitating plane label generation.
• Comprehensive evaluation of PlanarNeRF’s performance through extensive experiments, demonstrating notable superiority over existing methodologies.
3.2 Related Work
Single View Plane Detection. Many studies focus on directly segmenting planes from individual 2D images. PlaneNet [59] was among the first to encapsulate the detection process within an end-to-end framework directly from a single image. Conversely, PlaneRecover [10] introduces an unsupervised approach
for training plane detection networks using RGBD data. Meanwhile, PlaneRCNN [8] capitalizes on Mask R-CNN's [95] generalization capability to identify planar segmentation in input images, simultaneously regressing 3D plane normals from fixed normal anchors. In contrast, PlaneAE [90] assigns each pixel to an
embedding feature space, subsequently grouping similar features through mean-shift algorithms. Additionally, PlaneTR [14] harnesses line segment constraints and employs Transformer decoders [96] to enhance
performance further. Despite these advancements, the detected plane instances lack consistency across different frames.
Multi-view Plane Detection. SparsePlanes [91] detects plane segments in two views and uses a deep
neural network architecture with an energy function for correspondence optimization. PlaneFormers [92],
eschewing handcrafted energy optimization, introduces a Transformer architecture to directly predict plane
correspondences. NOPE-SAC [93] associates two-view camera pose estimation with plane correspondence
in the RANSAC paradigm while enabling end-to-end learning. PlaneMVS [97] unifies plane detection
and plane MVS with known poses and facilitates mutual benefits between these two branches. Although
multi-view inputs enhance segmentation consistency, they still lack a global association, preventing the
construction of complete scenarios. PlanarRecon [13] progressively fuses multi-view features and extracts
3D plane geometries from monocular videos in an end-to-end fashion, bypassing per-view segmentation.
Nonetheless, it necessitates 3D ground truth prerequisites and tends to oversimplify the resulting output.
Dense 3D Fitting. For the 3D plane clustering, depth information typically serves as essential input [56, 57].
Efficient RANSAC [16] progressively estimates plane primitives within reconstructed dense point clouds,
employing the consensus sampling paradigm [98]. [94] grows plane segments within designated seed
regions. GoCoPP [99] explores optimal plane parameters and assigns discrete labels to individual 3D points,
referring to an exploration mechanism. While these methods function effectively without annotations, their
computational demands are substantial. Additionally, approaches like [100, 101, 102] adeptly fit well-shaped point clouds to geometric primitives under supervision but grapple with generalization challenges.
Neural Scene Reconstruction. The groundbreaking NeRF [1] introduced an innovative solution for 3D
environment representation, upon which numerous studies have demonstrated outstanding performance in
scene reconstruction [103, 104, 105, 106, 5, 107, 108, 109, 110, 111]. In particular, Nice-SLAM [112]
builds a series of learnable grid architectures serving as hierarchical feature encoders and conducts pose
optimization and dense mapping. Nicer-SLAM [113] refines this approach by reducing the necessity for
depth images and achieves comparable reconstruction results. Co-SLAM [114] adopts hash maps instead of
grids as the feature container and introduces coordinate and parametric encoding for expedited convergence
and querying.
Figure 3.2: Overview of PlanarNeRF. PlanarNeRF processes monocular RGB and depth image sequences, enabling online pose estimation. It offers two modes: (1) PlanarNeRF-S (supervised) with 2D plane annotations, and (2) PlanarNeRF-SS (self-supervised) without annotations. The framework includes an efficient plane fitting module and a global memory bank for consistent plane labeling.
3.3 Methodology
3.3.1 Preliminaries
NeRF (Neural Radiance Fields) [1] conceptualizes a scene as a continuous function, typically represented by a multi-layer perceptron (MLP). This function, defined as $F(\mathbf{x}, \mathbf{v}) \mapsto (\mathbf{c}, \sigma)$, maps a 3D point $\mathbf{x}$ and a 2D viewing direction $\mathbf{v}$ to the corresponding RGB color $\mathbf{c}$ and volume density $\sigma$. Notably, both color and density predictions share the same network. For a ray $R(t) = \mathbf{o} + t\mathbf{v}$ with origin $\mathbf{o}$, the rendered color $C(R)$ is obtained by integrating points along the ray via volume rendering:
$$C(R) = \int_{t_n}^{t_f} T(t)\,\sigma(R(t))\,\mathbf{c}(R(t), \mathbf{v})\, dt, \qquad (3.1)$$
where $T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(R(s))\, ds\right)$ is the accumulated transmittance from the near bound $t_n$ to $t$, and $t_f$ is the far bound. To enhance the learning performance on high-frequency details, NeRF incorporates positional encoding of 3D coordinates as input. The model is then trained to minimize the discrepancy between observed and predicted color values.
Recent advancements in NeRF have enhanced rendering and reconstruction by modifying the original framework. Key improvements include using Signed Distance Fields (SDF) for predictions [103], employing
separate neural networks for RGB and geometry with augmented inputs [115], recalculating weights in the
rendering equation based on SDF [106], and adopting hash and one-blob encoding for positional data [22].
Additionally, depth rendering is used to improve geometry learning [5]. In PlanarNeRF, we incorporate these recent modifications, resulting in an updated and optimized color rendering equation:
$$C(R) = \frac{1}{\sum_{i=1}^{M} w_i} \sum_{i=1}^{M} w_i\, \mathbf{c}_i(R(t), \mathbf{v}), \qquad (3.2)$$
where $M$ is the number of sampled points along the ray, and $w_i$ is the weight computed based on SDF: $w_i = \sigma\!\left(\frac{s_i}{tr}\right)\sigma\!\left(-\frac{s_i}{tr}\right)$. Here, $s_i$ is the predicted SDF value along the ray; $tr$ is a predefined truncation threshold for SDF; and $\sigma(\cdot)$ is the sigmoid function. Similar to Eq. (3.2), the rendering equation for depth is:
$$D(R) = \frac{1}{\sum_{i=1}^{M} w_i} \sum_{i=1}^{M} w_i\, dp_i(R(t), \mathbf{v}), \qquad (3.3)$$
where $dp_i$ is the depth of sampled points along the ray.
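A minimal sketch of the SDF-based weights and the normalized rendering in Eqs. (3.2) and (3.3) is given below; the tensor shapes and the function name render_color_and_depth are assumptions.

```python
# Sketch: SDF-derived weights peak near the surface (s_i close to 0) and are
# used to render color and depth as normalized weighted sums along each ray.
import torch

def render_color_and_depth(sdf, color, depth, tr=0.1):
    """sdf, depth: (R, M); color: (R, M, 3) for R rays with M samples each."""
    w = torch.sigmoid(sdf / tr) * torch.sigmoid(-sdf / tr)          # (R, M)
    w_sum = w.sum(dim=-1, keepdim=True).clamp(min=1e-8)             # (R, 1)
    rendered_color = (w.unsqueeze(-1) * color).sum(dim=1) / w_sum   # (R, 3)
    rendered_depth = (w * depth).sum(dim=-1) / w_sum.squeeze(-1)    # (R,)
    return rendered_color, rendered_depth
```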
3.3.2 Framework Overview
The overview of PlanarNeRF is depicted in Fig. 3.2. Alongside SDF and color rendering branches, an
additional plane rendering branch (Section 3.3.3) is introduced to map 3D coordinates to 2D plane instances,
utilizing appearance and geometry prior. The plane MLP and color MLP share the same input, which
combines a one-blob encoded 3D coordinate and a learned SDF feature vector. In PlanarNeRF-S, while
consistent 2D plane annotations are requisite, they are often unavailable in real-world scenarios, where
manual labeling for plane instance segmentation is costly. To tackle this challenge, we propose a lightweight
plane fitting module (Section 3.3.4) to estimate plane parameters and a global memory bank (Section 3.3.5)
to track consistent planes and produce plane labels. During the training phase, gradient backpropagation
from the plane branch to the SDF is blocked to prevent potential negative impacts on geometry learning,
with further qualitative analysis provided in Section 3.4.4.
3.3.3 Plane Rendering Learning
Similar to Eq. (3.2) and Eq. (3.3), we propose the rendering equation for planes as:
$$P(R) = \frac{1}{\sum_{i=1}^{M} w_i} \sum_{i=1}^{M} w_i\, \mathbf{p}_i(R(t), \mathbf{v}), \qquad (3.4)$$
where $\mathbf{p}_i$ is the plane classification probability vector of sampled points along the ray.
Conventionally, instance segmentation learning has been approached using either anchor boxes [95, 8] or
a bipartite matching [14, 55, 11, 116]. Anchor boxes-based methods often involve complex pipelines with
heuristic designs. In contrast, bipartite matching-based methods establish an optimized correspondence
between predictions and ground truths before computing the loss. The instance segmentation loss based on
bipartite matching can be expressed as:
$$\mathcal{L}_{\text{ins}} = -\frac{1}{Q} \sum_{q=1}^{Q} \sum_{c=1}^{C} y_c \log \hat{y}_c, \qquad (3.5)$$
where $Q$ is the number of pixels and $C$ is the number of classes. $y_c$ is the $c$-th element in the ground truth label $\mathbf{y}$, and $y_c = \mathbb{1}_{\{c = m(\hat{\mathbf{y}}, \mathbf{y})\}}$, where $m(\cdot)$ is the matching function, and the assignment cost can be given by the intersection over union of each instance between the prediction and the ground truth. $\hat{y}_c$ is the $c$-th element in the prediction probability vector $\hat{\mathbf{y}}$. Using bipartite matching stems from the inherent discrepancies in index values between instance segmentation predictions and the ground truth labels. We only need to match the segmented area and distinguish one instance from another.
In contrast to the instance segmentation methods previously discussed, PlanarNeRF employs a distinct approach for plane instance segmentation. We adopt a fixed matching technique, akin to that used in semantic
segmentation, to compute the segmentation loss. This method is chosen because our primary objective is
to learn consistent 3D plane instances. Consequently, it is imperative that the rendered 2D plane instance
segmentation remains consistent across different frames. To uphold this consistency, we ensure that the
indices in the predictions strictly match the values provided in the ground truth during loss computation.
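Under this fixed-matching view, the plane segmentation loss reduces to a plain cross-entropy against the memory-bank indices, as sketched below with assumed shapes and names.

```python
# Sketch of the fixed-matching plane segmentation loss: prediction index c must
# equal the label index, which keeps plane indices consistent across frames.
import torch
import torch.nn.functional as F

def plane_segmentation_loss(rendered_plane_logits, plane_labels):
    """rendered_plane_logits: (Q, C) per-pixel class logits rendered via Eq. (3.4);
    plane_labels: (Q,) integer plane indices taken from the global memory bank."""
    return F.cross_entropy(rendered_plane_logits, plane_labels)
```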
3.3.4 Lightweight Plane Fitting
Before we produce the plane labels, we need to extract plane parameters based on depth images. To achieve this, we propose a lightweight plane fitting module, which can be treated as a function $f$ that builds a map from input points to planes:
$$\{\{\mathbf{p}_i\}_{i=1}^{N}, \{PO_i\}_{i=1}^{N}\} = f_{\Theta}(PO), \qquad (3.6)$$
where $N$ is the total number of plane instances estimated in one batch; $\mathbf{p}$ represents the plane parameters: $\mathbf{p}_i = [n_i^x, n_i^y, n_i^z, d_i]^T$, which form a plane equation $n_i^x \cdot x + n_i^y \cdot y + n_i^z \cdot z - d_i = 0$; $PO$ represents the input point set from depth images and $PO_i$ represents the set of points belonging to the $i$-th plane instance, $PO_i = \{\mathbf{po}_j\}_{j=0}^{n_j}$, where $\mathbf{po}_j = [x_j, y_j, z_j]^T$; $n_j$ is the number of points in the set; $\Theta$ is the set of hyperparameters based on Efficient RANSAC [16], $\Theta = \{n_{\min}, \epsilon, \epsilon_{\text{cluster}}, \tau_n, \hat{p}\}$, where:
• $n_{\min}$: the minimum number of points belonging to one plane instance.
• $\epsilon$: the absolute maximum tolerated Euclidean distance between a point and a plane.
• $\epsilon_{\text{cluster}}$: controls the connectivity of the points covered by an estimated plane. Large values usually lead to undersegmentation while small values usually lead to oversegmentation.
• $\tau_n$: threshold for the normal deviation between one point's normal and the estimated plane normal.
• $\hat{p}$: probability of missing the largest plane.
To achieve satisfying plane fitting performance for sparse points, hyperparameters Θ have to be carefully
tuned. Details about tuning decisions are shown in Section 3.4.4.
3.3.5 Global Memory Bank
Plane estimations using Eq. (3.6) in different iterations are independent of each other, which lacks consistency as new data constantly comes in. We propose a novel global memory bank to maintain the plane parameters across different frames (Fig. 3.3). The key to maintaining the bank is the similarity measure between two planes. According to Eq. (3.6), we are able to obtain the plane vector for each plane
Figure 3.3: The global memory bank is updated with each new batch of estimated planes. Notations are
explained in Section 3.3.4.
Table 3.1: Comparisons for 3D geometry, memory and speed on ScanNet. Red for the best and green for the second best (same for the following).

| Method | Val. Set | Acc. ↓ | Comp. ↓ | Recall ↑ | Prec. ↑ | F-score ↑ | Mem. (GB) ↓ | Time (ms) ↓ |
|---|---|---|---|---|---|---|---|---|
| NeuralRecon [117] + Seq-RANSAC [98] | [75] | 0.144 | 0.128 | 0.296 | 0.306 | 0.296 | 4.39 | 586 |
| Atlas [75] + Seq-RANSAC [98] | [75] | 0.102 | 0.190 | 0.316 | 0.348 | 0.331 | 25.91 | 848 |
| ESTDepth [118] + PEAC [77] | [75] | 0.174 | 0.135 | 0.289 | 0.335 | 0.304 | 5.44 | 101 |
| PlanarRecon [13] | [75] | 0.154 | 0.105 | 0.355 | 0.398 | 0.372 | 4.43 | 40 |
| PlanarNeRF-SS (Ours) | [75] | 0.059 | 0.073 | 0.661 | 0.651 | 0.654 | 4.09∤ | 328∤ / 131∤⋆ |
| PlaneAE [90] | [90] | 0.128 | 0.151 | 0.330 | 0.262 | 0.290 | 6.29 | 32 |
| PlanarRecon [13] | [90] | 0.143 | 0.098 | 0.372 | 0.412 | 0.389 | 4.43 | 40 |
| PlanarNeRF-SS (Ours) | [90] | 0.063 | 0.078 | 0.674 | 0.657 | 0.665 | 4.09∤ | 328∤ / 131∤⋆ |

∤ Since our method is an online learning method, we report the memory and time used during training. Others are offline-trained, hence inference.
⋆ For PlanarNeRF-S. SS and S share the same geometry learning and GPU memory. The time gap is mainly caused by the self-supervised plane label generation (on CPU) in SS.
instance. An intuitive way to compare two vectors is to compute the Euclidean distance, $\|\mathbf{p}_1 - \mathbf{p}_2\|_2$. However, this way fails for planes because each element in the plane parameter vector has a physical meaning. A reasonable way to compare the distance between two plane parameters is:
$$\text{dist}'(\mathbf{p}_1, \mathbf{p}_2) = 1 - \frac{\langle \mathbf{n}_1, \mathbf{n}_2 \rangle}{\|\mathbf{n}_1\|\,\|\mathbf{n}_2\|} + |d_1 - d_2|, \quad d_1, d_2 \in \mathbb{R}_{\geq 0}, \qquad (3.7)$$
where $\mathbf{p}_1 = [\mathbf{n}_1, d_1]^T$, $\mathbf{p}_2 = [\mathbf{n}_2, d_2]^T$. All offset values must be non-negative because a plane parameter vector and its negative version describe the same plane spatially, ignoring the normal orientations.
Unfortunately, while Eq. (3.7) works as a similarity measure, it is too sensitive to estimation noise: directly comparing two plane vectors lacks robustness. To tackle this issue, we propose a simple yet robust way to compute the similarity measure. We observe in Eq. (3.6) that the results also inform us of the identity of each point, meaning there are two representations for one plane: the plane parameters ($\mathbf{p}_i$) or the points ($PO_i = \{\mathbf{po}_j\}_{j=0}^{n_j}$) belonging to the plane instance. The new similarity
Table 3.2: 3D plane instance segmentation comparison on ScanNet.

| Method | Val. Set | VOI ↓ | RI ↑ | SC ↑ |
|---|---|---|---|---|
| NeuralRecon [117] + Seq-RANSAC [98] | [75] | 8.087 | 0.828 | 0.066 |
| Atlas [75] + Seq-RANSAC [98] | [75] | 8.485 | 0.838 | 0.057 |
| ESTDepth [118] + PEAC [77] | [75] | 4.470 | 0.877 | 0.163 |
| PlanarRecon [13] | [75] | 3.622 | 0.897 | 0.248 |
| PlanarNeRF-SS (Ours) | [75] | 2.940 | 0.922 | 0.237 |
| PlanarNeRF-S (Ours) | [75] | 2.737 | 0.937 | 0.251 |
| PlaneAE [90] | [90] | 4.103 | 0.908 | 0.188 |
| PlanarRecon [13] | [90] | 3.622 | 0.898 | 0.247 |
| PlanarNeRF-SS (Ours) | [90] | 2.952 | 0.928 | 0.235 |
| PlanarNeRF-S (Ours) | [90] | 2.731 | 0.940 | 0.252 |
measure is based on the distance between points and the plane. Assume we use $\mathbf{p}_1$ to represent one plane and $PO_2$ for another; then we have:
$$\text{dist}(\mathbf{p}_1, \mathbf{p}_2) = \frac{1}{n_j} \sum_{j=1}^{n_j} \frac{\left|n_1^x \cdot x_2 + n_1^y \cdot y_2 + n_1^z \cdot z_2 - d_1\right|}{\left((n_1^x)^2 + (n_1^y)^2 + (n_1^z)^2\right)^{\frac{1}{2}}}. \qquad (3.8)$$
If a new plane is found highly similar to one of the plane vectors inside the bank, i.e., $\text{dist}(\mathbf{p}_{\text{new}}, \mathbf{p}_{\text{bank}}) < \tau_{\text{dist}}$, where $\tau_{\text{dist}}$ is the distance threshold for decisions, then we return the index of $\mathbf{p}_{\text{bank}}$ (denoted by $k$, see Fig. 3.3) in the bank as the index annotation for the sampled points belonging to the plane instance $\mathbf{p}_{\text{new}}$. Otherwise, we add $\mathbf{p}_{\text{new}}$ into the bank. The plane label is given by $y_c = \mathbb{1}_{\{c=k\}}$ (see Eq. (3.5)). If PlanarNeRF-S is used, then $y_c$ is assumed to be known.
To further increase the robustness of the global memory bank, we use an Exponential Moving Average (EMA) to update the plane parameters stored in the bank when a highly similar plane is found:
$$\mathbf{p}_{\text{bank}} = \psi\, \mathbf{p}_{\text{new}} + (1 - \psi)\, \mathbf{p}_{\text{bank}}, \qquad (3.9)$$
where $\psi$ is the EMA coefficient. Note that before the update using EMA, the offset values must satisfy the constraint in Eq. (3.7). More details about producing plane labels using the global memory bank can be seen in Algorithm 1 in the supplementary material.
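The following sketch illustrates the bank logic with the point-to-plane distance of Eq. (3.8) and the EMA update of Eq. (3.9); the class name PlaneMemoryBank and the list-based storage are illustrative assumptions, not the algorithm in the supplementary material.

```python
# Sketch of the global memory bank: new planes are matched to stored planes by
# mean point-to-plane distance; matched entries are updated by EMA.
import numpy as np

class PlaneMemoryBank:
    def __init__(self, tau_dist=0.1, ema=0.999):
        self.planes = []          # each entry: np.array([nx, ny, nz, d])
        self.tau_dist = tau_dist
        self.ema = ema

    def _point_to_plane_dist(self, plane, points):
        # Eq. (3.8): mean |n . x - d| / ||n|| over the points of the new plane.
        n, d = plane[:3], plane[3]
        return np.mean(np.abs(points @ n - d)) / np.linalg.norm(n)

    def assign(self, new_plane, new_points):
        """Return the bank index used as the label for new_points."""
        for k, bank_plane in enumerate(self.planes):
            if self._point_to_plane_dist(bank_plane, new_points) < self.tau_dist:
                # Eq. (3.9): EMA update of the stored plane parameters.
                self.planes[k] = self.ema * new_plane + (1 - self.ema) * bank_plane
                return k
        self.planes.append(new_plane)
        return len(self.planes) - 1
```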
3.4 Experiments
3.4.1 Baselines and Evaluation Metrics
PlanarNeRF has two working modes: PlanarNeRF-S where 2D plane annotations are used; and
PlanarNeRF-SS where no annotations are used. We compare our method with four types of approaches: (1)
Single view plane recovering [90]; (2) Multi-view depth estimation [118] with depth based plane detection
[77]; (3) Volume-based 3D reconstruction [117] with Sequential RANSAC [98]; and (4) Learning-based 3D
planar reconstruction [13].
Following the baseline work [13], we evaluate the performance of our method in terms of both geometry as
well as plane instance segmentation. More specifically, for geometry evaluation, we use five metrics [75]:
Table 3.3: Ablation studies for similarity measurement.

| Method | VOI ↓ | RI ↑ | SC ↑ |
|---|---|---|---|
| Raw plane param.⋄ | 3.368 | 0.821 | 0.132 |
| Corrected plane param. (Eq. (3.7)) | 3.017 | 0.829 | 0.200 |
| Points-to-plane dist. (Eq. (3.8)) | 2.833 | 0.857 | 0.319 |

⋄ Directly applying Euclidean distance to raw plane parameters.
Completeness; Accuracy; Recall; Precision; and F-score. For plane instance segmentation, we use three
metrics [8]: Rand Index (RI); Variation of Information (VOI); and Segmentation Covering (SC).
3.4.2 Datasets and Implementations
Our experiments are conducted using the ScanNetv2 dataset [17]. This dataset is comprised of RGB-D video
sequences captured with a mobile device across 1,613 different indoor scenes. Due to the lack of ground
truth data in the test set, following the previous work [13], we adopt the approach used by PlaneRCNN [8],
creating 3D plane labels for both training and validation datasets. To be consistent with the previous work,
we also assess our method’s performance on two distinct validation sets, which are differentiated by the
scene splits previously employed in works [90, 75].
Besides ScanNetv2, we test our method on two additional datasets: Replica [119] and Synthetic scenes
from NeuralRGBD [106]. As baselines lack reported results on these datasets, we present only our model’s
outcomes. Detailed information about these datasets is available in the supplementary material. We employ Co-SLAM [114] as our backbone, with further implementation details provided in the supplementary
material.
3.4.3 Qualitative Results
We show qualitative comparisons between our method and all the baselines in Fig. 3.4, where the results
of two scenes in ScanNet are presented. Different colors represent different plane instances. Note that the colors in the predictions do not necessarily match the ones in the ground truths. PlaneAE is able to reconstruct the single-view planes but fails to organize them consistently in 3D space. ESTDepth + PEAC is better than PlaneAE but still suffers from a lack of consistency. NeuralRecon + Seq-RANSAC can produce good plane estimations, but the geometry is poor, which diminishes the performance of instance segmentation. PlanarRecon can generate consistent and compact 3D planes, but the results are oversimplified and many details of the rooms are missed. We can easily see that the results of our method are significantly superior to the others in terms of both geometry and instance segmentation. PlanarNeRF-S can generate plane instance segmentation highly close to the ground truth when only 2D plane annotations are used. PlanarNeRF-SS also shows high-standard segmentation quality even though no annotations are used. If we consider a comparison in the space of plane parameters, i.e., planes sharing highly similar parameters are classified as one plane instance, our PlanarNeRF-SS gains even more credit.
We also present quantitative comparisons for geometry quality (Table 3.1) and instance segmentation (Table 3.2). From Table 3.1, we can see that our method achieves systematic superiority to others in all geometry metrics with very low GPU memory consumption. PlanarNeRF is not as fast as PlanarRecon and
Table 3.4: Ablation studies for similarity threshold.

| τdist | 0.01 | 0.1 | 0.2 | 0.3 | 0.5 | 0.7 |
|---|---|---|---|---|---|---|
| VOI ↓ | 3.219 | 2.726 | 2.951 | 2.753 | 3.356 | 3.244 |
| RI ↑ | 0.878 | 0.874 | 0.875 | 0.880 | 0.858 | 0.856 |
| SC ↑ | 0.251 | 0.338 | 0.276 | 0.279 | 0.200 | 0.141 |
Table 3.5: Ablation studies for EMA coefficient.

| ψ | 0.6 | 0.7 | 0.8 | 0.9 | 0.99 | 0.999 |
|---|---|---|---|---|---|---|
| VOI ↓ | 3.532 | 3.655 | 3.587 | 3.018 | 3.438 | 2.812 |
| RI ↑ | 0.830 | 0.814 | 0.879 | 0.881 | 0.890 | 0.866 |
| SC ↑ | 0.204 | 0.162 | 0.088 | 0.146 | 0.268 | 0.314 |
ESTDepth+PEAC because our method is an online-learning method: the training of SDF and color rendering takes around 180 ms, while self-supervised plane estimation and plane rendering learning takes around 148 ms. It is acceptable to be slower than the pure inference speed of the offline-trained models. From Table 3.2, we can still see the advantages of our method over the other baselines in terms of the quality of plane instance segmentation.
From the above quantitative results, we can observe that PlanarRecon achieves the best performance among
all the baselines. To further validate the advantages of our method, we show more qualitative comparisons between our PlanarNeRF with PlanarRecon in Fig. 3.5. Both of our methods (PlanarNeRF-SS and
PlanarNeRF-S) maintain high-quality performance across very diverse indoor rooms.
3.4.4 Ablation Studies†
Replica and Synthetic. We show qualitative results of our model on Replica and Synthetic datasets
in Fig. 3.7. Our model can generate excellent plane reconstructions without any annotations (pose/2D
planes/3D planes) in an online manner. Note that there is no ground truth and none of the baselines reported results for those datasets. Therefore, we are only able to show the results from PlanarNeRF-SS.
More results by our model on those datasets are listed in the supplementary material.
How many samples are used? The number of samples used in PlanarNeRF is very important because they
are used for all learning modules (pose/SDF/color/plane) in the framework. It is also closely related to the
computational speed. To achieve the best tradeoff, we have conducted thorough experiments. The detailed
comparisons are presented in Fig. 3.6, where we report the geometry quality with F-score; segmentation
quality with VOI; and the speed with Frames Per Second (FPS). In our work, we use 768 samples.
Plane similarity measure. To validate the usefulness of Eq. (3.8) and show the disadvantage of the Eq. (3.7),
we quantitatively compare different plane similarity measures in Table 3.3, from where we can see that using
Eq. (3.8) achieves the best performance.
Tuning the lightweight plane fitting. As we introduced in Section 3.3.4, our plane fitting module has
several key hyperparameters. Satisfying plane fitting results requires a careful selection of the combination
†
For the purpose of ablation, we randomly select 10 scenes from the validation set. The results of all quantitative experiments
through this section are based on the selected scenes.
of those different hyperparameters. In PlanarNeRF, we use $n_{\min} = 1$; $\epsilon = 0.02$; $\epsilon_{\text{cluster}} = 1$; $\tau_n = 0.7$; $\hat{p} = 0.3$. Detailed quantitative comparisons in terms of plane instance segmentation quality with RI, VOI, and SC can be seen in Fig. 3.8.
Threshold for similarity measure. After the computation of the similarity measure, we need a threshold to determine whether two planes belong to one instance. If the threshold is too small, there will be too much noise; if the threshold is too large, parallel planes might be treated as one instance. We use a threshold of 0.1 (see Table 3.4).
Coefficient for EMA. During the maintenance of the global memory bank, we use an EMA to update the plane parameters in the bank. The selection of the EMA coefficient can also considerably affect the final performance. We set ψ to 0.999. Please see a quantitative comparison in Table 3.5.
Gradient Backpropagation. In PlanarNeRF model architecture (Fig. 3.2), we stop backpropagating the
gradients from the plane branch to the SDF branch during training. This is necessary because the gradients
from plane rendering loss can disturb the training of the SDF MLP, weakening the reconstruction quality.
We show the qualitative comparison using error maps in Fig. 3.9.
3.5 Conclusion
In this paper, we propose a novel plane detection model, PlanarNeRF. This framework introduces a unique
methodology that combines plane segmentation rendering, an efficient plane fitting module, and an innovative memory bank for 3D planar detection and global tracking. These contributions enable PlanarNeRF
to learn effectively from monocular RGB and depth sequences. Demonstrated through extensive testing,
its ability to outperform existing methods marks a significant advancement in plane detection techniques.
PlanarNeRF not only challenges existing paradigms but also sets a new standard in the field, highlighting its
potential for diverse real-world applications.
Figure 3.4: Qualitative comparison of PlanarNeRF and baselines on ScanNet (scenes 0277_00 and 0559_00); rows show PlaneAE, ESTDepth+PEAC, NeuralRecon+Seq-RANSAC, PlanarRecon, PlanarNeRF-SS, PlanarNeRF-S, and the ground truth.
Figure 3.5: Qualitative comparison between the recent SOTA, PlanarRecon [13], and ours (PlanarNeRF-SS and PlanarNeRF-S) on ScanNet scenes 0488_01, 0193_00, 0356_00, 0084_00, and 0382_00, alongside the ground truth.
Figure 3.6: Ablation for the number of samples used in PlanarNeRF (F-score, VOI, and FPS as the number of samples varies from 256 to 4096).
Figure 3.7: Results by PlanarNeRF for Replica and synthetic scenes from NeuralRGBD (office0, office1, office4, morning apartment, breakfast room, grey white room).
Figure 3.8: Tuning of different hyperparameters (see definitions of each in Section 3.3.4) in our lightweight plane
fitting module.
Figure 3.9: Error maps with (a) gradient backpropagation allowed and (b) gradient backpropagation blocked. Red means a high error and blue means a low error. Note that the dark red region appears in both (a) and (b) because the ground truth fails to capture the window area.
Chapter 4
OrientDream: Streamlining Text-to-3D Generation with Explicit
Orientation Control
Figure 4.1: 3D objects generated by the OrientDream pipeline using text input ("a chimpanzee dressed like Henry VIII king of England", "a small tiger dressed with sash", "a majestic giraffe with a long neck", "a kangaroo wearing boxing gloves"). Our OrientDream pipeline successfully generates 3D objects with high-quality textures, effectively free from the view inconsistencies commonly referred to as the Janus problem. We also present visualizations of geometry and normal maps alongside each result for further detail and clarity.
In the evolving landscape of text-to-3D technology, Dreamfusion [21] has showcased its proficiency by
utilizing Score Distillation Sampling (SDS) to optimize implicit representations such as NeRF. This process is achieved through the distillation of pretrained large-scale text-to-image diffusion models. However,
Dreamfusion encounters fidelity and efficiency constraints: it faces the multi-head Janus issue and exhibits
a relatively slow optimization process. To circumvent these challenges, we introduce OrientDream, a camera orientation conditioned framework designed for efficient and multi-view consistent 3D generation from
textual prompts. Our strategy emphasizes the implementation of an explicit camera orientation conditioned
feature in the pre-training of a 2D text-to-image diffusion module. This feature effectively utilizes data from
MVImgNet, an extensive external multi-view dataset, to refine and bolster its functionality. Subsequently,
we utilize the pre-conditioned 2D images as a basis for optimizing a randomly initialized implicit representation (NeRF). This process is significantly expedited by a decoupled back-propagation technique, allowing
for multiple updates of implicit parameters per optimization cycle. Our experiments reveal that our method
not only produces high-quality NeRF models with consistent multi-view properties but also achieves an
optimization speed significantly greater than existing methods, as quantified by comparative metrics.
4.1 Introduction
The demand for 3D digital content creation has surged recently, driven by advancements in user-end platforms ranging from industrial computing to personal, mobile, and virtual metaverse environments, impacting
sectors like retail and education. Despite this growth, 3D content creation remains challenging and often
requires expert skills. Integrating natural language into 3D content creation could democratize this process.
Recent developments in text-to-2D image creation techniques, especially in diffusion models [120], have
shown significant progress, supported by extensive online 2D image data. However, text-to-3D generation faces challenges, notably the lack of comprehensive 3D training data. Although quality 3D datasets
exist [121, 122], they are limited in scale and diversity. Current research, such as in Lin et al. [123], focusing on specific 3D object categories like human faces, lacks the versatility needed for broader artistic
applications.
Recent advancements, particularly in DreamFusion [124], have showcased that pre-trained 2D diffusion
models, which utilize 2D image data, can effectively serve as a 2D prior in the optimization of 3D Neural
Radiance Fields (NeRF) representations through Score Distillation Sampling (SDS). This approach aligns
NeRF-rendered images with photorealistic images derived from text prompts, facilitating a novel method
for text-to-3D generation that bypasses the need for actual 3D content data. However, this technique faces
notable challenges. The lack of relative pose information in the 2D prior supervision leads to the "multi-head Janus issue," as evident in Fig. 4.2, where the 3D objects display multiple heads or duplicated body parts, detracting from the content's quality. Additionally, the process is hindered by the synchronization required between 2D diffusion iteration and NeRF optimization, resulting in a relatively slow optimization process, averaging several hours per scene. While subsequent research, including Magic3D [123] and
Wang et al. [125], has improved aspects like quality and diversity, the Janus issue and the efficiency of the
optimization process remain largely under-explored areas.
In this paper, we introduce a novel text-to-3D method capable of generating multi-view consistent 3D
content within a reasonable time frame. Key to our approach is the fine-tuning of a 2D diffusion model
with explicit orientation control to ensure multi-view aware consistency during Score Distillation Sampling
(SDS) optimization. This is achieved by leveraging MVImgNet [126], a comprehensive multi-view real
object dataset encompassing 6.5 million frames and camera poses. We integrate encoded camera orientation
alongside the text prompt, enabling the orientation conditioned 2D diffusion model to provide multi-view
constrained 2D priors for supervising NeRF reconstruction.
To accelerate the optimization process, we decouple the back-propagation steps between NeRF parameter
updates and the diffusion process. This is implemented using the DDIM solver [127], which directly resolves
Figure 4.2: Illustration of the Janus Problem: This figure showcases typical Janus issue manifestations,
where the bunny is depicted with three ears, the pig with two noses, the eagle with a pair of heads, and the
frog anomalously having three back legs but only one front leg.
the diffusion step to $X_{T-\text{step}}$ instead of the conventional $X_{T-1}$. This approach allows for multiple implicit parameter updates per optimization cycle, enhancing efficiency.
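A minimal sketch of such a multi-step DDIM jump (with eta = 0) is given below; the schedule tensor alphas_cumprod and the function name ddim_jump are assumptions, not the exact solver configuration used in OrientDream.

```python
# Sketch: instead of stepping from x_t to x_{t-1}, solve the deterministic DDIM
# update directly to x_{t-k}, enabling several NeRF updates per diffusion call.
import torch

def ddim_jump(x_t, eps_pred, alphas_cumprod, t, k):
    """x_t: noisy latent at step t; eps_pred: predicted noise; k: jump size."""
    a_t, a_s = alphas_cumprod[t], alphas_cumprod[max(t - k, 0)]
    # Clean-latent estimate implied by the noise prediction.
    x0_pred = (x_t - (1 - a_t).sqrt() * eps_pred) / a_t.sqrt()
    # Deterministic DDIM update (eta = 0) to the earlier timestep t - k.
    return a_s.sqrt() * x0_pred + (1 - a_s).sqrt() * eps_pred
```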
Our experiments demonstrate that our method effectively circumvents the multi-head Janus issue while
maintaining high quality and variety, outperforming other state-of-the-art methods in terms of time efficiency, as shown in Fig. 7.1. In summary, our contributions are threefold:
• In our paper, we introduce a text-to-3D generation method that effectively reduces the multi-head
Janus issue and enhances 3D content generation efficiency while maintaining quality and variety.
• We have developed an advanced 2D diffusion model with explicit orientation control, using
MVImgNet data for integrating camera orientation with text prompts. This ensures consistent multi-view accuracy and provides a precise 2D prior for NeRF reconstruction.
• Our novel optimization strategy significantly speeds up the NeRF reconstruction process. By decoupling NeRF parameter updates from the diffusion process and employing the DDIM solver, we
achieve faster and more efficient multiple parameter updates in each optimization cycle.
4.2 Related Works
4.2.1 NeRFs for Image-to-3D
Neural Radiance Fields (NeRFs), as proposed by Mildenhall et al. [1], have revolutionized the representation
of 3D scenes through radiance fields parameterized by neural networks. In this framework, 3D coordinates
are denoted as p = [x, y, z] ∈ P, and the corresponding radiance values are represented as d = [σ, r, g, b] ∈
D.
NeRFs are trained to replicate the process of rendering frames similar to those produced from multi-view
images, complete with camera information. Traditional NeRFs employ a simple mapping of locations p to
radiances d via a function parameterized by a Multilayer Perceptron (MLP). However, recent advancements
in NeRF technology have seen the introduction of spatial grids that store parameters, which are then queried
based on location. This development, as seen in works by Müller et al. [22], Takikawa et al. [128], and Chan
et al. [129], integrates spatial inductive biases into the model.
We conceptualize this advancement as a point-encoder function $E_\theta : P \to E$, with parameters $\theta$ encoding a location $\mathbf{p}$ prior to processing by the final MLP $F : E \to D$. This relationship is mathematically formulated as:
$$\mathbf{d} = F(E_\theta(\mathbf{p})). \qquad (4.1)$$
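A minimal sketch of Eq. (4.1) is shown below, using a simple frequency encoder as a stand-in for the grid/hash encoders discussed above; all module names and sizes are illustrative.

```python
# Sketch of d = F(E_theta(p)): a point encoder followed by an MLP predicting
# [sigma, r, g, b]. A frequency encoding stands in for learned grid encoders.
import torch
import torch.nn as nn

class FrequencyEncoder(nn.Module):
    def __init__(self, num_freqs=6):
        super().__init__()
        self.register_buffer("freqs", (2.0 ** torch.arange(num_freqs, dtype=torch.float32)) * torch.pi)

    def forward(self, p):                      # p: (B, 3)
        angles = p.unsqueeze(-1) * self.freqs  # (B, 3, num_freqs)
        return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)

class RadianceField(nn.Module):
    def __init__(self, num_freqs=6, hidden=64):
        super().__init__()
        self.encoder = FrequencyEncoder(num_freqs)           # E_theta
        self.mlp = nn.Sequential(                             # F
            nn.Linear(3 * 2 * num_freqs, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),                             # [sigma, r, g, b]
        )

    def forward(self, p):
        return self.mlp(self.encoder(p))
```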
4.2.2 Text-to-Image Generation
The widespread availability of captioned image datasets has facilitated the creation of potent text-to-image
generative models. Our model aligns with the architecture of recent large-scale methods, as seen in Balaji et
al.[130], Rombach et al.[120], and Saharia et al.[131]. We focus on score-matching, a process where input
images are modified by adding noise, as outlined by Ho et al.[19] and Song et al. [132], and then predicted
by the Denoising Diffusion Model (DDM).
A key feature of these models is their ability to be conditioned on text, enabling the generation of corresponding images through classifier-free guidance, a technique described by Ho et al. [133].
In our approach, we employ pre-trained T5-XXL [134] and CLIP [135] encoders to produce text embeddings. These embeddings are then used by the DDM, which conditions them via cross-attention with latent
image features. Importantly, we repurpose the text token embeddings—denoted as c—for modulating our
Neural Radiance Field (NeRF), integrating textual information directly into the 3D generation process.
4.2.3 Text-to-3D Generation
In the realm of Text-to-3D Generation, recent methodologies primarily rely on per-prompt optimization for
creating 3D scenes. Methods such as those proposed in Dreamfusion [21] and Wang et al. [136] utilize
text-to-image generative models to train Neural Radiance Fields (NeRFs). This process involves rendering
a view, adding noise, and then using a Denoising Diffusion Model (DDM), conditioned on a text prompt,
to approximate the noise. The difference between the estimated noise and actual noise is used to iteratively
update the parameters of the NeRF.
However, these 2D lifting methods, while impressive in text-to-3D tasks, inherently suffer from 3D inconsistencies, leading to the well-known Janus Problem, as depicted in Fig. 4.2. Various attempts have been
made to mitigate this issue. PerpNeg [41] endeavors to enhance view-dependent conditioning by manipulating the text embedding space. Other strategies focus on stabilizing the Score Distillation Sampling (SDS)
process, such as adjusting noise strength [137] and gradient clipping [42]. A popular alternative involves
leveraging 3D datasets, either by training an additional 2D guidance model on 3D data [138] or by training
a diffusion model to produce multi-view sub-images in a single 2D image [139, 140, 141]. However, due
to the smaller size and lower texture quality of 3D datasets, outputs from these methods tend to mirror the
average quality found in these datasets [139].
Text Prompt:
Encode Corresponding camera orientation
“Photo of a bag”
Image from MVImgNet
~30 frames per object
Reconstruction
Loss
~200K objects
+
Joint Conditioned
Orientation Conditioned
Diffusion UNet
Figure 4.3: Overview of camera orientation conditioned Diffusion Model: This figure illustrates the
core components of our innovative model within the OrientDream pipeline. It highlights how we integrate
encoded camera orientations with text inputs, utilizing quaternion forms and sine encoding for precise view
angle differentiation. The model, fine-tuned on the real-world MVImgNet dataset, demonstrates our approach
to enhancing 3D consistency and accuracy in NeRF generation, surpassing common limitations found in
models trained solely on synthetic 3D datasets.
4.3 Method
Our proposed method is structured into three critical sections, each essential to our innovative approach in
text-to-3D generation: First, Section 4.3.1 delves into the camera orientation conditioned diffusion model.
Here, we introduce a diffusion model that adeptly integrates camera orientation, significantly enhancing the
precision and consistency of our 3D content generation from various viewpoints. Second, Section 4.3.2
covers the Text-to-3D Generation process. In this part, we detail how our method effectively translates
text descriptions into detailed 3D models. We focus on overcoming specific challenges, such as the multi-head Janus issue, and on enhancing the overall efficiency of the generation process. Lastly, Section 4.3.3
presents our Decoupled Back Propagation approach. This section is dedicated to outlining our novel strategy
for optimizing the NeRF model. We emphasize our method’s capacity to improve both the speed and the
diversity of 3D generation by employing an advanced back-propagation technique.
4.3.1 Camera Orientation Conditioned Diffusion Model
Score distillation-based text-to-3D methods, such as those proposed in Poole et al.[21], Lin et al.[123],
and Wang et al. [136], operate under the assumption that maximizing the likelihood of images rendered
from arbitrary viewpoints of a Neural Radiance Field (NeRF) is tantamount to maximizing the NeRF’s
overall likelihood. While this assumption is rational, the 2D diffusion models these methods employ lack
3D awareness, often resulting in inconsistent and distorted geometry in the generated NeRF. To address
this limitation and ensure the reconstructed NeRF maintains 3D consistency, our approach integrates 3D
awareness into the 2D image diffusion generation model.
Prior works, including those by Poole et al.[21] and Wang et al.[136], have attempted to incorporate a sense
of 3D awareness by using text prompts that vaguely describe the camera viewpoint (for example, ’side
view’). However, this method is inherently flawed due to its ad-hoc nature. The ambiguity inherent in using
text prompts that can represent a wide range of different pose values makes NeRF generation susceptible
to geometric inconsistencies. To overcome this, we propose an approach that explicitly controls our 2D
diffusion image generation model by inputting encoded camera orientations.
In our proposed method, we utilize a standard Latent Diffusion model [142] with camera orientation control,
which can be mathematically represented as follows:
p(x|θ) = Π_{t=1}^{T} p(x|θ, t, c′),   (4.2)

where p(x|θ) denotes the data distribution parametrized by θ, x represents the data (e.g., an image), and t signifies the number of diffusion steps. The conditional distribution of the data at step t is given by p(x|θ, t, c′), where c′ is a feature that combines the CLIP feature c from the text input and the encoded orientation SE(q), both having the same shape. Our method adds the camera encoding p = SE(q) to c as a residual, resulting in c′ = c + p.
To encode camera orientation, we first convert the camera orientation to a quaternion form q due to the
numerical discontinuity of the rotation matrix. To facilitate smooth interpolation between camera poses, we
apply a sine encoding {SEi(q), i = 1, ..., 7}, where i represents the dimension. The sine encoding process
is as follows:
SEi(q) = sin (q · i) (4.3)
This encoding allows an intermediate quaternion between two poses q1 and q2 to be represented via spherical linear interpolation (SLerp), which is straightforward for the network to learn and generalize:

SLerp(q1, q2, t) = (sin((1 − t)ω) / sin(ω)) q1 + (sin(tω) / sin(ω)) q2   (4.4)

where ω is the angle between the two quaternions, determined as ω = cos⁻¹(q1 · q2).
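The following sketch illustrates one plausible reading of Eqs. 4.3 and 4.4, treating i = 1, ..., 7 as frequency indices applied to each quaternion component; the function names and the example poses are illustrative and not the exact implementation used in our pipeline.

```python
import numpy as np

def sine_encode(q, num_freqs=7):
    """One reading of Eq. 4.3: SE_i(q) = sin(q * i), i = 1..num_freqs, for a unit quaternion q."""
    q = np.asarray(q, dtype=np.float64)
    return np.concatenate([np.sin(q * i) for i in range(1, num_freqs + 1)])

def slerp(q1, q2, t):
    """Eq. 4.4: spherical linear interpolation between unit quaternions q1 and q2."""
    q1, q2 = np.asarray(q1, float), np.asarray(q2, float)
    dot = np.clip(np.dot(q1, q2), -1.0, 1.0)
    if dot < 0.0:                       # take the shorter path on the quaternion sphere
        q2, dot = -q2, -dot
    omega = np.arccos(dot)
    if omega < 1e-6:                    # nearly identical poses: linear interpolation suffices
        return (1 - t) * q1 + t * q2
    return (np.sin((1 - t) * omega) * q1 + np.sin(t * omega) * q2) / np.sin(omega)

# An intermediate camera pose is encoded consistently with its two endpoints:
q_front = np.array([1.0, 0.0, 0.0, 0.0])                             # identity rotation
q_side = np.array([np.cos(np.pi / 4), 0.0, np.sin(np.pi / 4), 0.0])  # 90-degree yaw
enc_mid = sine_encode(slerp(q_front, q_side, 0.5))
```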
We evaluated our camera encoding scheme against alternative approaches like relative position encoding [2],
rotary embeddings [143], and the plain rotation matrix, concluding that the quaternion form with sine encoding most effectively distinguishes between view angles.
With the above setup, we fine-tune the diffusion model as shown in Fig. 4.3, using the following loss function:

min_θ E_{z∼E(x), t, ϵ∼N(0,1)} ∥ϵ − ϵ_θ(x_t, t, c, SE(q))∥_2^2.   (4.5)

Here, x_t is the noisy image generated from random noise ϵ, t is the timestep, c is the text condition, p = SE(q) is the camera condition, and ϵ_θ is the view-conditioned diffusion model.
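A schematic training step for Eq. 4.5 is sketched below. The names unet, vae, text_encoder, pose_mlp, scheduler, and sine_encode_batch are placeholders for the corresponding latent-diffusion components and the quaternion encoder described above; the residual conditioning c′ = c + p is applied to every text token before cross-attention.

```python
import torch
import torch.nn.functional as F

def orientation_conditioned_step(unet, vae, text_encoder, pose_mlp, scheduler,
                                 images, captions, quats):
    """One fine-tuning step corresponding to Eq. 4.5 (all components are placeholders)."""
    with torch.no_grad():
        latents = vae.encode(images)                 # z ~ E(x)
        c = text_encoder(captions)                   # (B, L, D) text token embeddings

    # Camera condition: sine-encoded quaternions projected to the token width and
    # added to every text token as a residual, giving c' = c + p.
    p = pose_mlp(sine_encode_batch(quats))           # (B, D); sine_encode_batch is an assumed helper
    c_prime = c + p.unsqueeze(1)

    noise = torch.randn_like(latents)                                  # eps
    t = torch.randint(0, scheduler.num_steps, (latents.shape[0],), device=latents.device)
    noisy_latents = scheduler.add_noise(latents, noise, t)             # x_t

    noise_pred = unet(noisy_latents, t, c_prime)                       # eps_theta(x_t, t, c, SE(q))
    return F.mse_loss(noise_pred, noise)                               # || eps - eps_theta ||_2^2
```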
After training the model ϵθ, the inference model f can generate an image through iterative denoising from a
Gaussian noise image, as described in Rombach et al. [120]. This process is conditioned on the parameters
(t, c, p).
Fine-tuning pre-trained diffusion models in this manner equips them with a generalized mechanism for controlling camera viewpoints, allowing for extrapolation beyond the objects encountered in the fine-tuning
dataset. Unlike methods that rely on 3D datasets, such as those discussed in Liu et al.[144] and Shi et
al. [139], our model benefits from fine-tuning on the real-world captured MVImgNet dataset [126]. This approach avoids the common issues of reduced content richness and texture quality associated with models
trained exclusively on 3D datasets.
Camera poses in MVImgNet [126] are estimated using COLMAP [145]. Despite being captured by different
annotators, the normalization term used in COLMAP ensures that the camera poses are canonical across
different captures.
Figure 4.4: Overview of Text-to-3D Generation Methodologies: This figure succinctly illustrates our dual
approach in applying multi-view diffusion models for 3D generation. It highlights the use of multi-viewpoint
images for sparse 3D reconstruction, alongside our focus on employing an orientation-conditioned diffusion
model for Score Distillation Sampling (SDS). The figure showcases our innovative SDS adaptation, where
traditional models are replaced with our orientation conditioned diffusion model, seamlessly integrating
camera parameters and text prompts for enhanced 3D content generation.
4.3.2 Text-to-3D Generation
We explore two primary methodologies for applying a multi-view diffusion model to 3D generation. The
first method involves using images generated from multiple viewpoints as inputs for sparse 3D reconstruction. This technique, while effective, requires high-quality images and poses, which has inspired our work
described in Section 4.3.3.
The second method leverages the multi-view diffusion model as a prior for Score Distillation Sampling
(SDS). This approach is more robust to imperfections in images and tends to produce superior output quality. In our work, we focus on this latter method, as described in Fig. 4.4, modifying existing SDS pipelines
by replacing the Stable Diffusion model with our pose-conditioned diffusion model. Instead of utilizing
direction-annotated prompts as in Dreamfusion [21], we employ original prompts for text embedding extraction and incorporate camera parameters as inputs.
Figure 4.5: Decoupled Sampling: This figure summarizes our approach to enhancing NeRF optimization speed,
highlighting the shift from uniform sampling to targeted reduction in T steps and the use of the DDIM solver
for efficient computation, thereby improving both the diversity and texture quality in 3D model generation.
Furthermore, we use our orientation-conditioned diffusion model to initialize the variational model in Variational Score Distillation (VSD), since both models use camera pose as a conditioning element. This integration allows for a more nuanced and accurate 3D generation process, capitalizing on the strengths of our
developed diffusion model.
4.3.3 Decoupled Back Propagation
The current methodologies still grapple with slow optimization speeds. As highlighted in Prolificdreamer [125], the mode-seeking nature of Score Distillation Sampling (SDS) can lead to over-saturation
and reduced diversity. Prolificdreamer suggests incorporating variational constraints to address this, but this
addition further decelerates the learning process.
Recent advancements in Ordinary Differential Equation (ODE) samplers for diffusion models, like
DDIM [127] and DPM-Solver [146], offer rapid evaluation of 2D diffusion models by potentially solving
the ODE in a single step.
In our approach to decoupled back-propagation, we implement a series of strategic steps to enhance the
efficiency of the process. Initially, we move away from uniform sampling, opting instead to strategically
decrease the choice of T. Following this, as shown in Fig. 4.5, we sample a random camera pose and
render a corresponding 2D view of the NeRF. Then, using the DDIM solver, we directly compute x_{T−step}, diverging from the traditional method of calculating x_{T−1}. This step size is not static; it is a hyperparameter that we methodically reduce during the course of training. Finally, once x_{T−step} is determined, we proceed
to optimize the NeRF at this specific camera location, employing a Mean Squared Error (MSE) loss for this
purpose.
The DDIM solver formula is expressed as:

x_{t−1} = √(α_{t−1}) · x_0 + √(1 − α_{t−1}) · ϵ_t   (4.6)

where x_0 is estimated as:

x_0 = ( x_t − √(1 − α_t) · ϵ_θ(x_t, t) ) / √(α_t).   (4.7)

Here, x_t is the image at time t, x_0 is the estimated initial image, and α_t is the cumulative noise schedule at timestep t, defined as α_t = Π_{i=1}^{t}(1 − β_i), where β_i is the noise variance added at timestep i during the forward diffusion process. ϵ_t is the noise added during the forward process at timestep t, predicted by the model ϵ_θ(x_t, t).
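The decoupled update of Eqs. 4.6 and 4.7 with a jump of step timesteps can be sketched as follows (a minimal, deterministic DDIM step; alphas_cumprod denotes the cumulative schedule α_t, and all names are illustrative):

```python
import torch

def ddim_jump(x_t, t, step, eps_pred, alphas_cumprod):
    """Estimate x_0 (Eq. 4.7) and jump directly to x_{t-step} (Eq. 4.6, deterministic DDIM)."""
    a_t = alphas_cumprod[t]                              # cumulative alpha at the current timestep
    a_prev = alphas_cumprod[max(t - step, 0)]            # cumulative alpha "step" timesteps earlier
    x0_hat = (x_t - (1 - a_t).sqrt() * eps_pred) / a_t.sqrt()          # Eq. 4.7
    return a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps_pred     # Eq. 4.6

# In decoupled back-propagation the jumped sample becomes a pixel-space target,
# so the NeRF is updated with a plain MSE loss against the decoded x_{t-step}
# instead of the SDS gradient.
```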
This method circumvents the problematic SDS loss and instead employs standard MSE loss to train NeRF,
which results in improved diversity and texture quality. Towards the end of training, as the step size becomes
smaller, the quality aligns closely with that of a 2D diffusion solver.
A significant advantage of this method is the considerable acceleration of the training process. The Stable Diffusion 2.1 model, with 890 million parameters, is substantial in size compared to the much smaller NeRF model, which has only about 2.3 million parameters.
Decoupled back propagation requires fewer evaluations of the diffusion model, thus reducing the frequency
of optimizing NeRF from different viewing angles. This efficiency is achievable only when the 2D diffusion
model can precisely generate views from specified locations, making the most of its advanced capabilities.
4.4 Experiments
In this section, we begin by reviewing the training details (Section 4.4.1). We then assess our orientation
conditioned diffusion model in the context of 2D image generation (Section 4.4.2), followed by an evaluation
of its performance in 3D object generation using a 2D to 3D lifting method (Section 4.4.3). Subsequently,
we provide a quantitative evaluation of the generation quality and speed (Section 4.4.4). Due to space limits, the ablation study of Decoupled Back Propagation is left to the Appendix.
4.4.1 Training Details
Our model, an orientation-controlled diffusion variant, builds upon the Stable Diffusion 2.1 framework. It
is fine-tuned using the MVImgNet dataset [126], which comprises 6.5 million frames across 219,188 videos
featuring objects from 238 classes. The fine-tuning process is carried out over 2 epochs. For the 2D to 3D
lifting aspect, we adopt methodologies from Dreamfusion [21, 147]. We have implemented our approach
within the Threestudio [148] framework, which is also employed for replicating baseline models.
[Figure 4.6 panels: comparisons of SD + Text Suffix, MVDream, and Ours for the prompts "An astronaut riding a horse", "A bald eagle carved out of wood", "A bull dog wearing a black pirate hat", and "A DSLR photo of a ghost eating a hamburger".]
Figure 4.6: Qualitative Results of Orientation-Controlled Image Generation: This figure showcases the comparative results of our method against the text suffix-based approach and MVDream [139]. While the text-based method yields images with inconsistent orientations, both MVDream and our model exhibit precise control over viewing angles. Notably, our method outperforms MVDream by generating photorealistic images with richer textures and greater variety, highlighting its effectiveness in producing high-quality, orientation-specific imagery.
4.4.2 Orientation-Controlled Image Generation
Our orientation-controlled diffusion model is specifically trained on an orientation-annotated dataset. To
verify its effectiveness, we initially focus on generating open-domain objects at specified camera orientations.
We utilized the same prompts as those in MVDream [139]. For each prompt, four images are generated,
each with the same elevation but differing in azimuth by 90 degrees. Both MVDream [139] and our method
employ orientation quaternions for this purpose.
As a comparative measure, we also incorporate a text-based view conditioning method. This approach
involves adding text suffixes like front view, side view, back view to the text prompt, corresponding to
the camera orientation. This method is a common feature in text-to-3D pipelines, as seen in works like
Threestudio and Stable-Dreamfusion [148, 147].
In Figure 4.6, we display the results of these generative approaches. The text suffix-based method tends to
produce images in various orientations, rather than consistent ones, underscoring its limitations. Conversely,
both MVDream and our model demonstrate reliable control over viewing orientation. However, while MVDream tends to produce 2D images with simplified textures and limited variety, our method achieves photorealistic imagery. This distinction is crucial as it directly impacts the quality of text-to-3D generation.
MVDream generates four views simultaneously, while our method generates four views independently, so evaluating 4-view consistency would not be an apples-to-apples comparison. Considering that text-to-3D training requires around 10,000 iterations, consistency within a 4-view set is not critical. On the other hand, our method has a clear advantage in texture quality and content richness.
[Figure 4.7 panels: rows for Dreamfusion, ProlificDreamer, MVDream, Ours, and Ours-Normal; columns at azimuth angles 0°, 45°, 90°, and 135°; prompts "A colorful toucan with a large beak" and "a bald eagle carved out of wood".]
Figure 4.7: Qualitative text-to-3D generation results. Each row represents a method and each column represents an azimuth angle used to render the object.
4.4.3 Orientation-Controlled Diffusion for Text to 3D
Following the approach in Dreamfusion [21, 147], we employ the NeRF architecture with Instant-NGP representation [22] for lifting 2D features to 3D. Our comparison includes prominent baselines like Dreamfusion [21], Prolificdreamer [125], and MVDream [139]. In Figure 4.7, we display our text-to-3D generation
results. Dreamfusion [21] tends to suffer from the Janus problem, as seen in the duplication of heads in both
the toucan and eagle models. Prolificdreamer [125] achieves impressive texture quality in individual views,
but viewed as a coherent 3D object, it exhibits an even more pronounced Janus issue. MVDream [139]
maintains good 3D consistency, yet it falls behind in terms of content richness and texture quality compared
to other methods. Our method stands out by being free from the Janus problem and providing exceptional
detail and texture quality.
4.4.4 Quantitative Evaluation
Evaluating the quality of text-to-3D generation is challenging due to the lack of ground truth 3D scenes
corresponding to text prompts. CLIP R-Precision [21] is one metric, assessing retrieval accuracy through
joint embedding of 2D images and text. However, high image-text matching scores don’t necessarily imply
image consistency.
A-LPIPS [42], calculating average LPIPS [149] between adjacent images, is another metric. But frame-wise
similarity doesn’t guarantee view consistency, and the Janus problem, characterized by repetitive patterns,
could misleadingly improve this metric. An object appearing identical from every angle would score highly
on A-LPIPS, despite lacking view diversity.
To address these limitations, we developed a 3D topology-aware metric, leveraging models like
Zero123[144], MVDream[139], and our OrientDream. We sample k uniformly spaced camera poses from
an upper hemisphere around each scene. For adjacent image pairs, we compute the norm of the gradient predicted by a pre-trained Zero123 [144] model. The average gradient across all frames and scenes is defined
as Zero123grad.
For systematic evaluation, we used the first 64 prompts from Dreamfusion’s [21] online gallery of 397
prompts. The results, shown in Table 4.1, indicate that our model achieves comparable Zero123grad to
MVDream, while outperforming in R-Precision and A-LPIPS. This suggests our model maintains similar
3D awareness but offers superior texture quality.
Method R-Precision ↑ A-LPIPSAlex ↓ Zero123grad ↓
Dreamfusion[21] 58.8% 0.0250 0.0873
Prolificdreamer[125] 63.4% 0.0688 0.1412
MVDream[139] 48.4% 0.0311 0.0092
Ours 55.5% 0.0415 0.0107
Table 4.1: Comparison of R-Precision, A-LPIPS, and Zero123grad. Results are calculated on the 64-prompt dataset.
We additionally quantified the occurrence of the Janus problem in our generated scenes. Instances with
anomalies such as extra or missing parts — like an additional leg, a missing leg, or multiple heads — were
identified as manifestations of the Janus problem. The results of this counting are detailed in Table 4.2.
Method # of Janus Scenes↓ % of Janus Scenes ↓
Dreamfusion[21] 42 65.625%
Prolificdreamer[125] 46 71.875%
MVDream[139] 3 4.688%
Ours 5 7.813%
Table 4.2: Counting of Janus Scenes
Speed Analysis: Generation speed is a critical factor affecting usability. As illustrated in Table 4.3, we
present the average time required to generate outputs for 64 prompts. Our method demonstrates a relatively
faster generation speed compared to other methods, owing to our efficient decoupled optimization process.
4.4.5 Evaluation of Orientation-Conditioned Image Generation
2D lifting-based text-to-3D methods are heavily dependent on the gradients generated by their 2D priors.
Consequently, in order to understand the disparities among various text-to-3D methods, our initial focus is
on the quantitative comparison and understanding of 2D diffusion models.
We employ three metrics: Fréchet inception distance (FID) [150], Inception Score (IS) [151], and CLIP
score [152]. FID and IS are used to measure image quality and CLIP score is used to measure text-image
similarity. Results are presented in Table 4.4. Scores are calculated over 397 text prompts from Dreamfusion's [21] online gallery, with 10 images generated per prompt. The original Stable Diffusion [142] achieves the best score across all three metrics. MVDream [139] gets a much worse FID after fine-tuning on Objaverse [153], which partly explains why it experiences a reduction in texture quality and content diversity in generated 3D content. For comparison, our method achieves an FID of 16.93, much closer to the level of the original Stable Diffusion. This finding partially reveals why our model can produce photorealistic 3D models and shows the benefits of fine-tuning on MVImgNet [126].
Method Time Per Scene ↓
Dreamfusion [21] 90 mins
Stable-Dreamfusion [147] w/ Instant-NGP [22] 11 mins
Prolificdreamer [125] 210 mins
MVDream [139] 90 mins
Our-SDS 14 mins
Our-Decoupled 5 mins
Table 4.3: Generation Speed Analysis
Model FID↓ IS↑ CLIP↑
SD[142] 12.63 15.75 ± 2.87 35.70 ± 1.90
MVDream[139] 32.06 13.68 ± 0.41 31.31 ± 3.12
OrientDream 16.93 13.28 ± 0.88 34.91 ± 3.91
Table 4.4: Quantitative evaluation on image synthesis quality.
4.5 Conclusion
In summary, our proposed approach makes a modest yet meaningful contribution to the field of text-to-3D
generation. The camera orientation conditioned diffusion model, a key aspect of our research, marks an
advancement in generating 3D content with greater accuracy and consistency. We’ve circumvented significant challenges, such as the multi-head Janus issue, and refined the process of transforming text descriptions
into 3D models. The introduction of the decoupled back propagation method represents our commitment to
improving the efficiency and diversity of 3D model generation. We are hopeful that our work will provide a
valuable foundation for future research in various applications of 3D technology.
Chapter 5
EA3D: Edit Anything for 3D Objects
Figure 5.1: We demonstrate the capability of our model to edit 3D objects at various levels of detail across
four rows. Users can just input the prompt (highlighted in orange text) for generating and editing the 3D
object.
A recent breakthrough in 3D modeling is Text-to-3D generation, which excels at crafting high-quality 3D
objects from text descriptions by optimizing neural radiance field representation, guided by 2D diffusion
models. However, how to modify the generated 3D object remains unexplored. Our approach delves into
distilling the 3D consistency of the zero-shot 2D segmentation foundational model, the Segment Anything
Model (SAM), to achieve versatile editing capabilities at various levels of granularity. With flexible input prompts, including text descriptions, point data, and 3D bounding boxes, our method enables precise
localization by modeling the 3D masks for editing purposes. Through 3D-mask-guided distillation from
the original model into two local Deep Marching Tetrahedra models, we gain the ability to edit in regions
of interest, encompassing both geometric and content-related modifications while keeping irrelevant parts
intact. This novel approach opens up exciting possibilities for refining and customizing 3D objects.
5.1 Introduction
Recent advances in Text-to-3D generation [24, 154, 155, 156, 157, 21, 155, 136, 125, 158] have been driven
by the success of 2D text-to-image diffusion models [19, 159, 120, 160, 131, 127]. These models are capable
of creating 3D objects from any given text description, a breakthrough significant for fields like industrial
design, animation, gaming, and augmented/virtual reality.
These models effectively extract information in the pixel space from pre-trained 2D models and lift it into the 3D space. This is achieved by optimizing a Neural Radiance Field (NeRF) [1] with supervision from the Score Distillation Sampling (SDS) loss [21]. Subsequent research primarily
emphasizes enhancing geometric consistency, generation quality, training stability, and efficiency. Yet, in
practical scenarios, 3D asset artists often face challenges in accurately creating desired 3D objects in a single
attempt, which underscores the need for a system that allows for flexible and iterative editing of generated
3D objects, without manual intervention.
Tackling this issue is hindered by the implicit representation and the limited availability of pertinent 3D
training data. Numerous studies [154, 161, 162], have focused on style editing, which commonly involves
modifying images at the 2D model stage before transferring the changes to the 3D realm. These methods
present two significant challenges: (1) In terms of appearance editing, it is highly challenging to ensure
that the models modify only the designated 3D areas while preserving the remaining regions; (2) For geometric
editing, the models lack a comprehensive understanding of the 3D spatial positioning of the targeted areas
and struggle to distinguish the intended parts from the original object.
One may draw inspiration from the human visual system [163]. When we humans edit or move a specific part
in a 3D space, we first localize the targeted 3D object within the scene. Then, we subconsciously isolate it
from the rendering pipeline, ensuring that the editing does not impact the appearance of the remaining parts.
In light of this, we propose a two-phase approach: 1) the 3D localization of the area of interest as defined by
user input; and 2) the decomposition of the 3D representation, which facilitates independent editing suitable
for real-world applications.
For the localization phase, our goal is to segment any part across various degrees of granularity as specified
by prompts in a zero-shot setting, a task that is notably challenging in the 3D domain due to data limitations.
Segment Anything Model (SAM) [23], which has demonstrated robust zero-shot localization capabilities
across diverse 2D tasks [164], inspires us to explore its potential consistency in localizing 3D objects under
varied conditions. This inquiry focuses on whether SAM can maintain consistent predictions for a single
object, despite changes in views, amidst challenges posed by varying lighting, diverse poses, and more
intricate structures. So our first step is to distill the knowledge from SAM by optimizing occupancy fields
[43, 165] to construct 3D masks of prompt-defined parts from the 3D object. Empirically, this approach
effectively resolves the inconsistencies typically caused by multiple 2D segmentation maps. It successfully
achieves 3D localization based on user-provided prompts, whether they are text, points, or 3D bounding
boxes.
For the 3D object decomposition phase, enabling independent editing of local parts, we utilize two individual Deep Marching Tetrahedra (DMTet) [166] models to distill the original representation of the generated
object. One DMTet is dedicated to representing the object of interest, as defined by the user’s prompt, while
the other DMTet represents the remaining objects. This method guarantees the preservation of parts that do
not need editing, as each DMTet is supervised independently. The guiding principle for this separation is
the pixel values of the original scene, which are masked by the 3D masks defined during the localization
phase. By being guided with pixel supervision masked by the 3D occupancy field’s projection, we ensure
that each DMTet accurately represents its own assigned part. This separation facilitates more flexible and
targeted editing capabilities in subsequent steps.
Our model allows users to iteratively modify generated objects within a designated 3D area until they satisfy
their specifications. This region can be defined using text, a point, or a 3D bounding box. This capability
encompasses both creative content editing and geometric transformations. Based on these, we name our
method EA3D (Edit Anything for 3D Objects), and aspire for our work to further advance the creation of
3D assets. In summary, our key contributions are:
• We model 3D consistency masks by optimizing occupancy fields both locally and globally, under the
guidance of the 2D Segment Anything Model.
• Under the guidance of the 3D masks, we utilize two separate local DMTet models to extract information from the original scene, facilitating flexible editing.
• Our approach enables content and geometric editing within 3D regions defined by prompts. In addition, we demonstrate strong performance in these areas, both quantitatively and qualitatively.
5.2 Related Work
5.2.1 Text-to-3D Generation
The field of Text-to-3D Generation [24, 154, 155, 156, 157, 21, 138, 125, 167, 158] is focused on creating
3D assets from text descriptions. Recent advancements in data-driven 2D diffusion models [168, 20, 127]
have shown impressive results in generating images from text descriptions. These models can optimize a
NeRF from scratch, utilizing the supervision from multiple viewpoints provided by pre-trained 2D diffusion models, thereby enabling the use of
open-world text prompts while ensuring robust 3D consistency. Subsequent research has further improved
aspects such as geometric consistency, generation fidelity, training stability, and efficiency [169, 170, 171,
172]. However, there is a lack of research focused on the editability of the generated 3D objects due to
the difficulties in acquiring corresponding training data and the complexity of implicit 3D representations.
Some existing methods [26, 154, 161, 162] facilitate content editing based on modified text prompts, but
they struggle to maintain unedited areas and lack support for geometric transformations. Our approach
uniquely allows for editing specific 3D components and geometric alterations by utilizing either text, 3D
point, or 3D bounding box prompts.
5.2.2 3D Editing in NeRFs
Effective manipulation and composition are essential for the practical use of 3D representations. Traditional
3D representations like meshes and voxels inherently support editing and composition. However, for neural
network-based implicit representations, such as NeRF, this task is more complex. Certain methods [173,
174] involve extracting meshes from a pre-trained NeRF, utilizing deformation techniques to manage the
rendering. Yet, these approaches are limited to specific scenes and fail to generalize across categories. An
alternate direction employs distinct embeddings to capture shape and texture variations in 3D models, as
discussed in [158, 175]. This strategy enables editing operations including color modifications and removal
of specific shape parts. Nonetheless, it cannot take text descriptions as inputs, requiring specific latent codes
for targeted editing.
Recent diffusion-based methods utilize edited 2D images to update pre-trained NeRF weights in 3D models,
facilitating flexible editing by arbitrary text descriptions. However, they fall short in achieving partial appearance changes, preserving only text-related aspects, and lack support for 3D geometric transformations.
PartNeRF can independently edit each part's appearance and geometry, but it necessitates unique training
data for each scene, and the parts do not carry semantic meanings. Our approach, in contrast, allows users
to specifically edit parts defined through prompts like text, points, or 3D bounding boxes, affecting both
appearance and geometry.
5.2.3 Segment Anything Model
The Segment Anything model is a zero-shot segmentation model, renowned for its high-quality and robust
performance in open-world scenarios. Besides segmentation, it also has significantly enhanced various 2D
domain tasks [176, 25, 177, 178, 179] and has demonstrated consistent results across videos [180] and 3D
objects [181, 182, 183]. These two features inspire us to extend SAM towards the open-world
3D editing area. Based on this, our work employs two 3D occupancy fields to decompose the original 3D
scene based on the user’s prompt, distilling the potential 3D information from SAM. This approach significantly reduces the occurrence of inconsistent predictions typically found in segmentation maps derived from
multiple 2D views, which is highly beneficial for our pipeline.
5.3 Method
We now turn to describe our EA3D model. Initially, we discuss modeling consistent 3D masks within the occupancy field under the guidance of SAM. Following this, we illustrate our method of using dual 3D-mask-guided DMTet models to distill the scene from the originally generated objects. Lastly, we demonstrate how
our method facilitates easy and flexible editing.
5.3.1 3D Masks Modeling Guided by SAM
In our approach, the first phase involves localizing the targeted 3D area based on the user’s prompt. This is
achieved by distilling information from 2D SAM and optimizing occupancy models which mitigate inconsistencies observed across various viewpoints.
Using the N view-projected images {I_i}_{i=1}^{N} from the generated 3D scene or objects, SAM takes the user's prompt p as input and outputs the corresponding binary segmentation maps {S_i}_{i=1}^{N}. While SAM demonstrates strong performance across different viewpoints, occasional inconsistencies in predictions for the same object from various views, and some artifacts in certain views, are unavoidable.
To address this challenge, we construct consistent 3D masks by iteratively optimizing occupancy fields,
guided by the supervision of segmentation maps {Si}. Occupancy networks determine the likelihood of
a point being within an object, making them ideal for modeling non-color 3D masks through continuous
representation. We employ a pair of occupancy networks to decompose the object: one occupancy network
Om(·) represents the 3D masks delineated by the user’s prompt, while the other network Or(·) captures the
remaining portions of the object. Denote a ray r(t) = o + td that is projected to sample points along its
path in the 3D space, where o is the origin of the ray, d is the direction of the ray and t is the distance from
the origin o to the sample points along the ray. We apply the raymarching algorithm [singh2009realtime]
that selects the maximum occupancy value along each ray to render the 3D mask into 2D:
B_m(r) = max_{t ∈ [t_near, t_far]} O_m(r(t))   (5.1)

B_r(r) = max_{t ∈ [t_near, t_far]} O_r(r(t))   (5.2)

where t_near and t_far define the range of the sampled points, B_m represents the predicted occupancy value for the target object, and B_r denotes the occupancy value for the rest of the scene. The object is represented as a Neural Radiance Field (NeRF) produced by common Text-to-3D pipelines, where the density of the field along a ray r can be represented as σ(r). We approximate the object's depth h along this ray based on the density value σ. By applying a threshold λ to h, we determine the binary mask value for this ray:
h = Σ_i T_i (1 − exp(−σ_i δ_i)) t_i / Σ_i T_i (1 − exp(−σ_i δ_i))   (5.3)

M(r) = 1 if h > λ, and 0 otherwise   (5.4)
where t_i is the depth of the i-th sample along the ray, σ_i is the density at the i-th sample, δ_i is the distance between the i-th and (i − 1)-th samples, and T_i is the accumulated transmittance up to the i-th sample, defined as T_i = exp(−Σ_{j=0}^{i−1} σ_j δ_j). The term (1 − exp(−σ_i δ_i)) represents the probability of a ray being absorbed
Figure 5.2: Illustration for our overall method. First, we start from an initial NeRF based on the user’s
prompt. We feed the projection from NeRF across different views into SAM to get the binary masks specified
by the user’s prompt. Concurrently, masks for non-target areas are derived by deducting the masks of the
specified areas from the NeRF’s depth map. Subsequently, two occupancy networks are trained using both
global and local shape guidance, utilizing the multi-view masks from the two distinct mask sets. In phase
2, we employ two DMTet networks that are initialized with the occupancy networks and are further refined
under the guidance of the masked projections from NeRF. Finally, this setup empowers users to perform
versatile edits. They have the option to refine any of the DMTet networks using a Text-to-3D approach for
content modification or to implement geometric transformations directly on the output mesh.
at the i-th sample. In this way, we take into account both the absorption probability at each sample and the
accumulated transmittance along the ray to compute a depth value that reflects the density distribution in
the NeRF model. M is a binary mask derived from applying a threshold λ to the depth value h associated
with a given ray r. Here M is an H × W matrix, and h is a scalar. The value of M at the point where ray r
intersects is denoted as M(r). Since we get the binary segmentation map for the targeted part of the object,
we can readily acquire the binary mask M_r for the remaining part of the object by:

M_r = 1 if (M = 1) ⊙ (S = 0), and 0 otherwise.   (5.5)
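A compact sketch of Eqs. 5.3 to 5.5, assuming per-ray tensors of densities, sample depths, and inter-sample distances (shapes and names are illustrative):

```python
import torch

def density_to_mask(sigma, t_vals, deltas, lam=0.5):
    """Eqs. 5.3-5.4: density-weighted depth h per ray, thresholded to the binary mask M."""
    # sigma, t_vals, deltas: (R, S) tensors for R rays with S samples each.
    alpha = 1.0 - torch.exp(-sigma * deltas)                       # absorption at each sample
    T = torch.cumprod(torch.exp(-sigma * deltas), dim=-1)
    T = torch.cat([torch.ones_like(T[:, :1]), T[:, :-1]], dim=-1)  # transmittance up to sample i
    w = T * alpha
    h = (w * t_vals).sum(-1) / (w.sum(-1) + 1e-8)                  # Eq. 5.3
    return (h > lam).float()                                       # Eq. 5.4, M(r)

def remaining_mask(M, S):
    """Eq. 5.5: rays on the object (M = 1) that fall outside the SAM mask (S = 0)."""
    return ((M == 1) & (S == 0)).float()
```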
We use the segmentation map and the mask value of the remaining part as supervision for two occupancy
networks respectively. In addition to local supervision, we globally enforce that the combined occupancy
value is geometrically similar to the NeRF’s density. The overall loss function is presented below:
L_local = Σ_{r∈R} ( ∥S(r) − B_m(r)∥² + ∥M_r(r) − B_r(r)∥² )   (5.6)

L_global = Σ_{r∈R} ∥(O_m(r) + O_r(r)) − O_nerf(r)∥²   (5.7)

where O_nerf = min(max(1 − e^{−∆σ(r)}, 0), 1)   (5.8)

L_total = L_local + L_global   (5.9)
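The sketch below illustrates Eqs. 5.6 to 5.9 with max-pooled occupancy rendering along each ray (Eqs. 5.1 and 5.2); occ_m, occ_r, and nerf_density are placeholder callables that return one value per sample point, and the global term is applied per sample point in this simplified version.

```python
import torch

def occupancy_losses(ray_points, sam_mask, rest_mask, deltas, occ_m, occ_r, nerf_density):
    """Eqs. 5.6-5.9: ray_points is (R, S, 3); sam_mask, rest_mask are (R,); deltas is (R, S)."""
    O_m = occ_m(ray_points)                            # (R, S) occupancy of the target part
    O_r = occ_r(ray_points)                            # (R, S) occupancy of the remainder

    # Eqs. 5.1-5.2: render each field by max-pooling its occupancy along the ray.
    B_m = O_m.max(dim=-1).values
    B_r = O_r.max(dim=-1).values

    # Eq. 5.6: per-field supervision from the SAM mask and the remaining-part mask.
    L_local = ((sam_mask - B_m) ** 2 + (rest_mask - B_r) ** 2).sum()

    # Eqs. 5.7-5.8: the two fields together should match the clamped NeRF occupancy.
    O_nerf = torch.clamp(1.0 - torch.exp(-deltas * nerf_density(ray_points)), 0.0, 1.0)
    L_global = ((O_m + O_r - O_nerf) ** 2).sum()

    return L_local + L_global                          # Eq. 5.9
```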
Here ∆ represents the differential distance along a ray between sampled points. We use a min-max operation to clamp the NeRF's occupancy value into [0, 1], which is aligned with O_m and O_r. The optimized
occupancy field brings robust 3D consistency and efficiency as it obviates the need for reapplying SAM to
obtain the 2D mask, thereby facilitating the subsequent step of dual DMTet distillation.
5.3.2 3D Mask-Guided Dual DMTet Distillation
Driven by the motivation to enable independent object-level editing, we aim to utilize implicit surface representations for modeling individual objects. Previous methods, like those involving marching cubes [184,
185], often yield meshes of sub-optimal quality. However, the recent advancement [169, 186, 187, 188] in
the form of DMTet, with its deformable tetrahedral grid and differentiable marching layer, facilitates the
creation of high-quality surfaces. Therefore,
we leverage Dual DMTet networks to distill the 3D information from the original NeRF, setting the stage for
enhanced object-specific editing capabilities. In short, DMTet Ψ processes vertices as inputs, and optimizes
deformation terms and signed distance values, as formulated by:
Y = Ψ(v_i; (S, D))   (5.10)

where v_i is a vertex in the deformable grid, D predicts the position offset for each vertex v_i, S predicts the Signed Distance Function (SDF) value, and Y is the resulting mesh.
We respectively build DMTet Ψm and DMTet Ψr for modeling the target part and the remaining parts. Given
the 3D masks for both the target part and remaining part of the object represented as occupancy networks
Om and Or, we can employ these as initial vertices for the DMTet models, thereby hastening convergence:
Ym = Ψm (Sm, Vm + Dm) (5.11)
where Vm = {x ∈ X | Om(x) > λ} (5.12)
Yr = Ψr (Sr, Vr + Dr) (5.13)
where Vr = {x ∈ X | Or(x) > λ} (5.14)
Notably, subscripts m and r represent the target part and the remaining part, respectively. X is the space of the deformable grid and λ is the threshold that determines whether a vertex is occupied.
The subsequent phase involves transferring the object information from the original NeRF to the dual DMTet
models. Suppose we have K views' images {I_i} (i = 1, . . . , K) from the original NeRF with camera poses p_i. We can use ray marching to render the two 3D masks into the i-th view's 2D masks R_rm(O_m, p_i) and R_rm(O_r, p_i), and we can use rasterization rendering to get the predicted images from the DMTets, represented as R_rast(Ψ_m, p_i) and R_rast(Ψ_r, p_i). So the distillation loss function can be expressed as:
L_d = Σ_{i=1}^{K} ∥I_i ∗ R_rm(O_m, p_i) − R_rast(Ψ_m, p_i)∥²   (5.15)
        + ∥I_i ∗ R_rm(O_r, p_i) − R_rast(Ψ_r, p_i)∥²   (5.16)
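A minimal sketch of the distillation objective of Eqs. 5.15 and 5.16, where each DMTet only receives supervision inside its own rendered 3D mask; render_mask_rm and render_dmtet_rast are placeholder renderers standing in for the ray-marching and rasterization passes:

```python
import torch.nn.functional as F

def dual_dmtet_distillation_loss(views, poses, occ_m, occ_r, dmtet_m, dmtet_r,
                                 render_mask_rm, render_dmtet_rast):
    """Eqs. 5.15-5.16: views are the K reference images I_i from the NeRF, poses the cameras p_i."""
    loss = 0.0
    for I, p in zip(views, poses):
        M_m = render_mask_rm(occ_m, p)             # ray-marched 2D mask of the target part
        M_r = render_mask_rm(occ_r, p)             # ray-marched 2D mask of the remainder
        pred_m = render_dmtet_rast(dmtet_m, p)     # rasterized prediction of the target DMTet
        pred_r = render_dmtet_rast(dmtet_r, p)     # rasterized prediction of the remainder DMTet
        loss = loss + F.mse_loss(I * M_m, pred_m) + F.mse_loss(I * M_r, pred_r)
    return loss
```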
Essentially, this loss function compels each DMTet to exclusively learn the appearance of their respective
objects. Additionally, we explore shape loss and geometric regularization, derived from 3D masks, to direct
the geometry modeling. However, these methods hinder the creation of high-quality, detailed models, which
we will discuss more in supplementary materials.
Despite occasional pixel misalignments between the actual view and the 2D projected masks, the DMTet
model exhibits robust 3D consistency and superior geometry quality, facilitating the editing component of
our work.
5.3.3 Versatile Editing Application
SAM, with its adaptable support for various prompts such as text, points, and 3D bounding boxes, enables
users to precisely localize the 3D region Om they wish to edit through prompt input. Subsequently, we
demonstrate three distinct editing capabilities of our method.
Single part geometrical transformation. When rendering multiple meshes, the z-buffer rendering method Z
is commonly employed, along with the use of an MVP (Model, View, Projection) matrix to manipulate the
individual components. Assuming a default MVP ∈ R^{4×4} matrix is in place and a transformation matrix G ∈ R^{4×4} is applied to the MVP of the target part, the rendered result is:

I_new = Z( (Y_m, MVP · G), (Y_r, MVP) )   (5.17)

where I_new is the edited result.
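A sketch of Eq. 5.17: the target mesh is rendered with a transformed MVP matrix while the remainder keeps the default one, and the two passes are composited by a per-pixel depth test. The helper rasterize_with_depth is a placeholder for any z-buffer renderer.

```python
import numpy as np

def render_edited(mesh_target, mesh_rest, mvp, G, rasterize_with_depth):
    """Eq. 5.17: composite the transformed target mesh and the untouched remainder by depth."""
    color_m, depth_m = rasterize_with_depth(mesh_target, mvp @ G)   # target part under MVP . G
    color_r, depth_r = rasterize_with_depth(mesh_rest, mvp)         # remainder under the default MVP
    target_in_front = depth_m < depth_r                             # per-pixel z-buffer test
    return np.where(target_in_front[..., None], color_m, color_r)

# Example transform G: translate the target part by 0.2 units along x in model space.
G = np.eye(4)
G[0, 3] = 0.2
```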
Single part content editing. One potential application is appearance or content editing at the target object’s
position. This can be done by training from scratch or fine-tuning a specific DMTet network given a new text prompt, optimized with the Score Distillation Sampling (SDS) loss [21].
Another application involves replacing the target object with a differently trained DMTet model to alter the
content. Here, we can preserve a similar location by calculating the center and scale of the original target
occupancy network.
Multiple parts editing. We facilitate the editing of scenes involving multiple objects by initializing additional
(≥ 3) occupancy networks for decomposition. This is coupled with the use of more DMTets for distillation.
Additionally, we support sequential and persistent editing processes, which repeatedly utilize occupancy
network decomposition alongside DMTet distillation.
Figure 5.3: Qualitative comparison with DreamFusion in terms of content editing.
5.4 Experiments
5.4.1 Implementation Details
Our approach utilizes the stable diffusion 2.1 model for 2D supervision. DMTet is used to fine-tune the initial
NeRF model as our input. The quality of results may be further enhanced by incorporating newer text-to-2D
models. For point and box prompts, we employ the original SAM model. Due to the unavailability of text
encoder weights in SAM, our method is aligned with prior work that integrates Grounding DINO with the
SAM pipeline. This technique initially predicts a box from the text prompt, followed by the application
of the box-prompt-based SAM model. All of our experiments are conducted on a single A6000 GPU.
Typically, the training and inference take an hour and less than one minute, respectively. More details of the
experimental setup are included in the supplementary materials.
Method CLIP ↑ UP-CLIP ↑
Dreamfusion w/o NP 0.2241 0.7185
Dreamfusion w/ NP 0.2457 0.8230
Ours 0.3304 0.9285
Table 5.1: Quantitative results for content editing.
5.4.2 Editing Performance
Content editing. Fig. 5.3 shows qualitative results. In our comparative analysis, both our method and the
baseline DreamFusion [21] utilize identical NeRF weights for the initial scene editing. For DreamFusion’s
fine-tuning phase, we incorporate a negative prompt for the object being replaced and introduce a new positive prompt that includes the target object, which strengthens DreamFusion's editing capability. According to Fig. 5.3, we find that the baseline Text-to-3D model [21] struggles with editing scenes accurately when
the text prompt describes scenarios seldom encountered in real life. For instance, while DreamFusion in the
first row effectively preserves the original appearance of the ladybug, it fails to simultaneously generate the
mouse. Furthermore, in the next three rows, DreamFusion significantly deviates from retaining the original scene information. In contrast, the proposed EA3D's ability to robustly localize and decompose 3D objects ensures that the newly generated object meets two key criteria: 1) it remains unaffected by the distribution of the previous prompt, and 2) its generation process does not impact the other objects in the scene. Fig. 5.3
clearly demonstrates this capability.
Additionally, we conduct a quantitative analysis of our results via the CLIP [189] score, which calculates
the cosine similarity between the edited image and the new text prompt. We also introduce the Unmasked
Preservation CLIP Score (UP-CLIP), a new metric designed to assess the similarity of images after masking
the region where editing occurs, as shown below:
UP-CLIP(I_org, I_edit, M) = CosSim( CLIP(I_org ⊙ (1 − M)), CLIP(I_edit ⊙ (1 − M)) )

where I_org and I_edit represent the original image and edited image, respectively, M represents the binary mask where the editing occurs, and the function CLIP(·) computes the image embedding.
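A minimal sketch of the UP-CLIP computation, assuming a clip_image_encoder callable that returns one embedding per image (names and tensor conventions are illustrative):

```python
import torch
import torch.nn.functional as F

def up_clip(I_org, I_edit, M, clip_image_encoder):
    """UP-CLIP: CLIP cosine similarity of the two images outside the edited region M."""
    # I_org, I_edit: (3, H, W) images in [0, 1]; M: (H, W) binary mask of the edited region.
    keep = (1.0 - M).unsqueeze(0)                        # 1 where the scene should be preserved
    e_org = clip_image_encoder((I_org * keep).unsqueeze(0))
    e_edit = clip_image_encoder((I_edit * keep).unsqueeze(0))
    return F.cosine_similarity(e_org, e_edit, dim=-1).item()
```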
We randomly select 80 prompts for generating our initial NeRF weights and subsequently perform content
editing by combining different nouns. The two scores are derived by averaging the similarity across 8 views,
using both the CLIP image encoder and text encoder. These two metrics quantitatively assess the model’s
proficiency in achieving the intended edits while preserving the integrity of the original data. The results, as
elaborated in Table 5.1, demonstrate our model's capability to produce semantically coherent outcomes and
maintain the original information.
Geometric editing. We qualitatively present our geometric transformation results in Fig. 5.4. In the first
column, we showcase our precisely learned 3D mask that effectively decomposes the scene, distinguishing
even the fine details like the pear and its stem. Because we decompose the appearance information from the
Figure 5.4: Qualitative illustration of our flexible geometric transformations.
Figure 5.5: Ablation study on the importance of occupancy modeling for the mask quality.
original object and render them independently via two local DMTet networks, our geometrical editing approach can be applied freely without impacting the appearance of other objects. This is a distinct advantage
over the NeRF representations since they do not offer such a level of independent object manipulation.
5.4.3 Ablation Study
Occupancy modeling impact. A straightforward approach to incorporating SAM information is to bypass the
Figure 5.6: Ablation on how segmentation performance affects the distillation. The sub-caption represents
the input and output resolution of SAM. The multi-view images are the distilled results from the initial
NeRF guided by the segmentation maps of SAM.
Figure 5.7: Ablation study on 3D mask initialization for DMTets.
occupancy modeling step and apply SAM directly during the distillation training process. However, this
method may face several challenges. Firstly, there's a significant efficiency gap: the OccupancyNet is notably lightweight (0.0012 GFLOPs), in contrast to the considerably heavier SAM model (14 GFLOPs), resulting in slower training. Another issue pertains to 3D consistency. As discussed in Sec. 5.3.1, SAM struggles to accurately predict consistent masks across different views, often resulting in artifacts. In contrast, occupancy modeling can mitigate some of these artifacts due to its inherent consistency. As shown in Fig. 5.5,
an example of this can be observed when attempting to model the occupancy for a bunny’s ears. The SAM
model fails to detect the second ear, whereas our occupancy projection approach yields more reasonable
masks.
3D mask initialization. We find that using a learned 3D mask as the initial step is crucial for convergence during the distillation stage. As demonstrated in Fig. 5.7, DMTet already exhibits a well-formed geometric
shape in the early phase of the process, simplifying the optimization goal to focus on refining geometric
details and distilling appearance. In contrast, when 3D masks are not used in the initial step, the model
tends to fit colors in a way that reaches a local minimum and shows no improvement in geometric modeling.
Thus, our observation suggests that having a solid geometric prior is vital for successful distillation. While
the use of
3D masks might not be the only solution, it is certainly an effective approach to address this challenge.
SAM’s performance impact Our 3D localization capability fundamentally stems from the 2D SAM model,
making it worthwhile to investigate how performance variations of SAM may impact our editing outcomes.
We modify the segmentation model’s performance by downsampling the input and then upsampling the
output back to the standard size at various levels. As depicted in Fig. 5.6, we observe that when the resolution
of the segmentation map is set at 32 × 32, it fails to model the object effectively. However, when the
resolution is 64 or higher, the geometric modeling’s results become reasonable, which indicates that minor
reductions in segmentation quality do not significantly impact the overall modeling process. Additionally,
we observe that higher resolutions contribute to learning finer appearance details. This can be clearly seen in the color of the bunny's ear region in Fig. 5.6, which underscores the necessity of having a high-quality
segmentation map for more accurate and detailed appearance modeling.
5.5 Conclusion
Our work introduces a novel pipeline for future 3D editing that eliminates the need for any 3D data, allowing
users to make edits using purely text-based prompts. We distill robust 2D knowledge from the foundational
model SAM and elevate it into the 3D realm. Additionally, we represent 3D objects using DMTet, which
facilitates flexible geometric transformations. We aspire for our work to inspire further exploration into the
transference of 2D knowledge into the 3D domain, particularly in areas where data is scarce.
There exist some limitations of our method. First, our performance is essentially capped by both the SAM
model and the diffusion models; moreover, modeling highly complex 3D shapes presents a significant challenge. Lastly, our model does not support real-time editing, a critical feature in real-world applications.
Chapter 6
Learning Where to Cut from Edited Videos
In this work we propose a new approach for accelerating the video editing process by identifying good moments in time to cut unedited videos. We first validate that there is indeed a consensus among human viewers
about good and bad cut moments with a user study, and then formulate this problem as a classification task.
In order to train for such a task, we propose a self-supervised scheme that only requires pre-existing edited
videos for training, of which there is large and diverse data readily available. We then propose a contrastive
learning framework to train a 3D ResNet model to predict good regions to cut. We validate our method
with a second user study, which indicates that clips generated by our model are preferred over a number of
baselines.
6.1 Introduction
Video editing is a time-consuming and challenging task traditionally performed by highly trained experts.
In the most basic sense, video editing is time selection—selecting a series of clips that tell a story from raw,
unedited footage, and then trimming each video down to its relevant part. As such, editors are performing
two main tasks: the high-level task of deciding which content to show, and the low-level task of precisely
placing cut points in a way that is not distracting to viewers. In this work we address the second task—the
fine-scale placement of cuts (which can be equivalently thought of as clip trimming), assuming that the highlevel direction of which clips to choose has been decided by the editor. We believe this component is better
suited to automation as it is less dependent on high-level context or artistic choices, while being difficult
and tedious to execute in practice, as it requires frame-level precision. Due to the increase in popularity of
social video sharing websites, more and more novice users are creating and sharing edited video content,
oftentimes produced (shot, edited, and distributed) entirely on mobile devices. These users lack the time,
expertise, and equipment to perform frame-level tasks such as cut placement.
In this work, we introduce the concept of “cut suitability”, an instantaneous score for how good a cut would
be if placed at that time. We ignore audio and focus purely on good visual times to cut. In our experience,
audio and language determine clearly bad times to cut (e.g., during human voices, in the middle of sentences,
and during loud noises), but otherwise provide a fairly uniform probability and are easy to determine using
existing methods that can be combined with visual cuts in a late-fusion stage.
Before we begin to approach this problem of visual cutting, one might ask whether there is even any agreement among viewers as to what makes a “good” time to cut. We first validate this question by conducting a
user study, which indicated that there is indeed a consensus about good and bad cut points. Generally good
cut points occur at visually non-distracting times, in-between actions, or static moments right before or after
camera motion, etc.
As cut suitability is a complex and hard to define function, we choose a data-driven approach that learns
to associate visual features from real cut points. One challenge is that large scale datasets consisting of
unedited and edited footage are hard to come by. In this work, we instead propose to use a weakly-supervised
approach where we train a model entirely on edited video in the wild. This allows us to collect large-scale
and diverse edited video (video with cuts), and then learn a one-sided function for the start and end placement
of cuts (note that in this data, information across the cut for a single unedited clip is unavailable). We gather
a dataset of 61,486 edited videos from YouTube and Vimeo.
Using this dataset, we propose a multimodal 3D Resnet architecture and train two separate models for
starting and ending cut point prediction via contrastive learning. We design a progressive learning strategy
to enforce the model to differentiate positive samples from negative samples with similar visual appearances.
We use a randomized frame rate conversion method to augment the input videos, which effectively improves
the model’s robustness against video compression.
We then evaluate our models on both our collected dataset and a new set of unedited livestreaming videos.
The experimental results show that our model is able to predict good cut positions close to the ground truth in
the test set. Furthermore, we conduct a user study to evaluate the subjective preferences of human viewers.
Contributions.
We propose a model that predicts dense, continuous scores for cut suitability. Our key contributions include
the following.
1. We introduce a novel task for computational video editing to automate the time-consuming manual
process.
2. We propose a self-supervised contrastive learning framework that is completely data-driven and utilizes abundantly available edited videos without any manual annotation.
3. We define a number of baselines, and evaluate our approach with respect to human preference with a
user study.
6.2 Related Work
Computational Cinematography. Previous papers on computational cinematography have looked towards automating difficult parts of the production process. Some work is on the camera side, such as
stabilizing and centering the video on content as a post process [190]. In addition, prior work has investigated context-specific editing tasks, for example providing a transcript-centered interface for editing
interviews [191], or a tool that leverages additional audio annotations created during filming [192]. Leake et
al. [193] introduce a system to automatically generate edits from dialog scenes, where cuts are determined
based on a set of high-level constraints, such as dialog and shot type. Arev et al. [194] propose a system to
produce cuts using multiple social cameras. In this work, we propose a complementary video editing task:
fine-scale placement of cuts for general-purpose footage, based on low-level content and motion cuts.
Edited Video in Computer Vision. Most video content online has at least a few basic edits, including
cuts. This holds in genres ranging from short social clips to elaborate narrative videos. The computer
vision community has leveraged such edited videos to train and benchmark diverse vision tasks including
action recognition [195], speaker recognition[196], event localization[197], scene detection[198], among
others. While edited video has served as a rich source to develop general-purpose video understanding
systems, only a few approaches have leveraged the rich structure encoded within the video edits to learn
video representations [199]. In this paper, we learn a cut suitability function that contrasts the visual features
of moments just before and after a cut with features of all other moments in a video.
Video Shortening. The computer vision community has studied several video shortening problems, such
as temporal action localization [200, 201] and event segmentation [202, 203]. While action boundaries could be places of high cut suitability, existing approaches for action localization have been designed to detect coarse moments in time, such as sport actions [204], high-level activities[205], and
events in movies [197]. These are quite different from the moments that trigger cuts. Recently, Shou et al. introduced the task of generic event boundary detection [203]. Even though their new
study relaxes the shortening task from only actions to taxonomy-free event boundaries, it is unclear whether
these general event boundaries correlate with good places to cut. Another widely-studied task addressing
video shortening is video summarization [206, 207]. These methods process one video stream and cut that
stream into multiple shots, making the video much shorter. Despite the similarities with our task, video
summarization focuses on reducing the video length while maintaining the semantic meaning of the video
unchanged. In short, all previous methods for video shortening offer only a rough time range prediction to
delineate the start and end times of moments of interest in an untrimmed video. This observation makes
them not viable as baselines for predicting frame-level cut suitability.
6.3 Problem Setting
As mentioned earlier, we are interested in the fine-scale localization of cuts. We formulate this task as two
separate classification problems: whether a clip is a good starting clip, and whether a clip is a good ending
clip. In each of these two problems, the task is then to predict a binary classification from the visual features
contained in each clip, evaluated using a sliding window.
First, we verify whether there is in fact human agreement on “cut suitability” by conducting a user study. We
developed a web interface that shows two clips and asks the users which one is the better starting or ending
clip. The user study shows that human preference for which clip has the better cut agrees with the ground
truth 76% of the time for starting points and 90% for ending points (elaborated in Table 6.2), indicating that there is
agreement on what makes a good or bad cut. In this work, we use 2-second clips, which we found to contain
enough semantic motion to establish context.
Figure 6.1: Data sampling for our method used in the start and end prediction tasks. Positive samples are
clips that start (or end) with a cut, and negative samples are drawn randomly from the rest of the video.
6.3.1 Learning from Edited Videos
Learning cut points could be done from paired data consisting of videos of raw footage, the trimmed clips
and the timestamps of the edit points. However, acquiring diverse and large scale annotated data in this form
is challenging. Alternatively, we look for edited videos (without original footage) that are widely available
at scale on public video sharing platforms such as YouTube and Vimeo. We run a cut detection algorithm
to identify cut points in edited footage [208] and break up the videos into clips. In this way we can collect
cut points and the video content from one side of the raw footage (Fig. 6.1). Although the trimmed video
content on the other side of each cut is missing, this setting has the advantage of being able to leverage
virtually unlimited public videos for free.
We collect our dataset from public video-sharing websites by downloading random subsets from the travel,
narrative, and food categories. In addition, we add the vimeo-90k dataset [209] to our collection. A sample
of the collected video along with extracted clips is available online. Our dataset contains 718 YouTube
videos and 60,768 Vimeo videos. The total number of non-overlapping clips is 2.85 million. The average
video duration is 205.86 seconds, and each video contains 46.51 clips on average.
6.4 Method
6.4.1 Architecture
3D CNN architectures have been proven to be useful for video-based tasks, such as action recognition [210].
We hypothesize that motion and object information are important factors in the determination of cut locations, and so we add additional inputs in the form of Mask R-CNN labels and optical flow. As shown in Fig.
6.2, each N-second video clip is therefore represented in 3 ways:
RGB pixel values. Frames are downsampled to 112 × 112, with 3-channel RGB format and 16 frames
(8 fps). The tensor shape is 3 × 16 × 112 × 112.
[Figure 6.2 contents: the source video clip (3×T×W×H) is converted via scaling and FPS conversion into RGB values (3×16×112×112), Mask R-CNN labels (81×16×W×H, scaled with nearest-neighbor interpolation and projected by a 1×1 convolution to 3×16×112×112), and optical flow vectors (2×16×112×112). These are concatenated into an 8×16×112×112 input and passed through 3D convolution stages with feature sizes 64×16×56×56, 64×8×28×28, 128×4×14×14, 256×2×7×7, and 512×1×4×4, followed by a 512-dimensional feature and a 2-way output.]
Figure 6.2: Architecture of proposed network. Please see Section 6.4.1 for details.
Mask R-CNN labels. The pre-trained Mask R-CNN has 81 classes. We use a 1 × 1 convolution layer to
project these features to 3-channels. The tensor shape is 3 × 16 × 112 × 112.
Optical Flow. We use a pretrained optical flow model [211] that computes a 2D motion vector for each
pixel. The tensor shape is 2 × 16 × 112 × 112.
These inputs are concatenated and passed through a series of 3D convolution layers. The model is trained
to perform binary classification. As described in Section 6.3, we train separate instances of this binary classifier
for the two tasks: predicting the start of a cut and predicting the end of a cut. We also trained a unified
model that performs 3-way classification, but it does not perform as well as the two separate models. We
elaborate on this in Section 6.5.1.
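To make the fusion concrete, the following is a minimal PyTorch-style sketch of the three-input classifier described above; the channel progression follows Fig. 6.2, but kernel sizes, strides, normalization, and pooling are assumptions rather than the exact architecture used.

```python
import torch
import torch.nn as nn

class CutClassifier(nn.Module):
    """Sketch: fuse RGB, Mask R-CNN labels, and optical flow, then apply 3D convolutions."""
    def __init__(self, num_classes=2):
        super().__init__()
        # Project the 81-channel Mask R-CNN label volume to 3 channels (1x1x1 convolution).
        self.label_proj = nn.Conv3d(81, 3, kernel_size=1)

        def block(cin, cout, stride):
            return nn.Sequential(
                nn.Conv3d(cin, cout, kernel_size=3, stride=stride, padding=1),
                nn.BatchNorm3d(cout), nn.ReLU(inplace=True))

        # Channel progression follows Fig. 6.2 (8 -> 64 -> 64 -> 128 -> 256 -> 512);
        # the strides and final pooling are illustrative assumptions.
        self.backbone = nn.Sequential(
            block(8, 64, (1, 2, 2)),
            block(64, 64, (2, 2, 2)),
            block(64, 128, (2, 2, 2)),
            block(128, 256, (2, 2, 2)),
            block(256, 512, (2, 2, 2)),
            nn.AdaptiveAvgPool3d(1))
        self.head = nn.Linear(512, num_classes)

    def forward(self, rgb, labels, flow):
        # rgb: (B,3,16,112,112), labels: (B,81,16,112,112), flow: (B,2,16,112,112)
        x = torch.cat([rgb, self.label_proj(labels), flow], dim=1)  # (B,8,16,112,112)
        feat = self.backbone(x).flatten(1)                          # (B,512)
        return self.head(feat)                                      # per-clip logits
```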
We use the known cut locations as positive labels, and assign negative labels to randomly sampled N-second
clips that are at least N seconds away from cut locations. Note that a random time is not necessarily a bad
cut, so our negative samples are noisy. Considering that a good cut point could last a few frames, we add a small
degree of label smoothing here, which we found improved accuracy. For start (and similarly end) tasks,
we assign soft label values of 0.5, 0.25, and 0.125 to the 3 frames after (before) the cut point. It is possible to train a
model by minimizing the cross entropy between the model prediction and these smoothed labels; however,
we found that the trained model did not generalize well to test videos (Section 6.5.1).
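A small sketch of the soft-label assignment for the start task follows (the end task mirrors it with offsets before the cut); the offset bookkeeping relative to the nearest cut is assumed to be handled by the data pipeline.

```python
def soft_start_label(frame_offset):
    """Soft target for a clip whose first frame lies `frame_offset` frames after
    the nearest cut: 1.0 at the cut, 0.5 / 0.25 / 0.125 for the next three frames."""
    table = {0: 1.0, 1: 0.5, 2: 0.25, 3: 0.125}
    return table.get(frame_offset, 0.0)

# e.g. soft_start_label(0) -> 1.0, soft_start_label(2) -> 0.25, soft_start_label(7) -> 0.0
```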
6.4.2 Contrastive Training
We observed two issues in network performance with the naive cross-entropy loss and simple frame rate
conversion methods for the input video.
Model overfitting. We saw large gaps between the training and test accuracy in models with the cross-entropy loss, which might be caused by insufficient positive samples compared to the large network capacity (see Baseline in Table 6.1).
Video compression. Most edited videos in our dataset have been compressed for streaming. Some compression algorithms use key-frames and motion estimation, which creates spatial and temporal dependencies
in the processed frames. In particular, the frames near the edit points may be compressed differently, and
the model may be trained to learn from the small structures or artifacts.
In order to address the first issue, we instead train our network using a supervised contrastive loss [212].
Given a mini-batch of $2N$ samples, $N_{\tilde{y}_i}$ is the total number of samples in the mini-batch that have the same
label $\tilde{y}_i$ as the anchor $i$, and $z$ is the embedding of a sample. The supervised contrastive loss for this batch,
$\mathcal{L}^{sup}$, can be written as:

$$\mathcal{L}^{sup} = \sum_{i=1}^{2N} \mathcal{L}^{sup}_i \tag{6.1}$$

$$\mathcal{L}^{sup}_i = \frac{-1}{2N_{\tilde{y}_i} - 1} \sum_{j=1}^{2N} \mathbb{1}_{i \neq j} \cdot \mathbb{1}_{\tilde{y}_i = \tilde{y}_j} \cdot \log \frac{\exp(z_i \cdot z_j / \tau)}{\sum_{k=1}^{2N} \mathbb{1}_{i \neq k} \exp(z_i \cdot z_k / \tau)}$$
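A compact sketch of this loss for a mini-batch of clip embeddings follows; it averages over anchors rather than summing (a constant factor) and omits the numerical-stability details of the reference implementation [212].

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(z, labels, tau=0.07):
    """Sketch of Eq. 6.1. z: (2N, d) embeddings, labels: (2N,) integer class labels."""
    z = F.normalize(z, dim=1)                        # cosine similarity via dot product
    sim = z @ z.t() / tau                            # (2N, 2N) scaled similarity matrix
    n = z.shape[0]
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    sim = sim.masked_fill(self_mask, float('-inf'))  # exclude k = i from the denominator
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_counts = pos_mask.sum(1).clamp(min=1)        # 2*N_y_i - 1 positives per anchor
    loss_per_anchor = -log_prob.masked_fill(~pos_mask, 0.0).sum(1) / pos_counts
    return loss_per_anchor.mean()                    # mean over anchors (paper sums)
```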
Several properties of the supervised contrastive loss ideally fit our task.
Contrastive power increases with more negatives [212]. Videos are usually recorded at frame rates of at
least 24 fps. In a video clip, only a few frames are good cut points (positive), while most frames are not
good places to cut (negative). So our task has inherently imbalanced labels (each clip contains 1 positive
and 34 negatives on average), and the contrastive loss can take advantage of this.
Non-symmetric loss for positive and negative samples. As seen in Eq. 6.1, $z_j$ and $z_k$ are a positive and
a negative sample, respectively. $\mathcal{L}^{sup}$ is directly correlated with $z_i \cdot z_j$ while correlated with the reciprocal of
$z_i \cdot z_k$. The loss term for a negative sample quickly decays as $z_i \cdot z_k$ increases, but does not decay for positive
samples.
Figure 6.3: Sampling Schemes. In each row we show two edited videos (grey bars), with cuts shown as
vertical lines. By changing where positive and negative samples are drawn from relative to an anchor placed
on a cut in the first video, we can make the task easier, or harder.
This is a desired property for our task. This loss formulation encourages positive samples to obtain embeddings
similar to those of other positive samples, even if they have very different visual appearances. For negative
samples, it only enforces that they stay far away from positive samples. The intuition behind this is
that only a small subset of clips are indeed good cut points with shared consensus, while the
negative clips are much more diverse.
In addition, we employ a curriculum learning strategy, where we progressively generate samples with increasing levels of difficulty for the network to learn. Fig. 6.3 shows an example of the schemes for the
starting clip task. A positive sample from a clip is used as the anchor. In the easy scheme, we sample the
positive from another clip in the same video, and the negative from a different video. In the medium scheme,
the negative comes from a different clip in the same video, which is more similar to the anchor. We then
make the scheme harder by moving the negative sample in the same clip as the anchor, and moving the
positive sample to a different video. In this way, the network needs to learn shared features between two
visually distinct positive samples, and learn to distinguish samples of different classes from the same scene.
It becomes even more challenging by moving the negative sample very close to the anchor and forcing the
network to differentiate two samples that have very similar visual appearances, as shown in the very-hard
scheme.
During training, we start from the easy sampling scheme for better convergence, and gradually shift towards
more difficult sampling schemes to improve classification accuracy on challenging videos.
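A rough sketch of how positives and negatives could be drawn per scheme is given below; the `dataset` helpers (random_cut_clip, random_noncut_clip, other_video, shifted_clip) are hypothetical stand-ins for the actual data pipeline, not parts of the method.

```python
import random

def sample_pair(anchor, dataset, scheme):
    """Return (positive, negative) clips for a cut anchor under one curriculum scheme.
    The `dataset` object and its helpers are hypothetical."""
    if scheme == "easy":
        pos = dataset.random_cut_clip(video=anchor.video, exclude=anchor)
        neg = dataset.random_noncut_clip(video=dataset.other_video(anchor.video))
    elif scheme == "medium":
        pos = dataset.random_cut_clip(video=anchor.video, exclude=anchor)
        neg = dataset.random_noncut_clip(video=anchor.video)
    elif scheme == "hard":
        pos = dataset.random_cut_clip(video=dataset.other_video(anchor.video))
        neg = dataset.random_noncut_clip(video=anchor.video, near=anchor)
    else:  # "very_hard": negative only a few frames away from the anchor cut
        pos = dataset.random_cut_clip(video=dataset.other_video(anchor.video))
        neg = dataset.shifted_clip(anchor, offset_frames=random.randint(4, 16))
    return pos, neg
```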
The contrastive loss and the sampling schemes encourage the network to cluster intra-class samples and
distinguish inter-class samples regardless of their visual similarity. As a result, we can train a much more
robust model with high accuracy (Section 6.5.1).
6.4.3 Temporal Augmentation
To address the second issue, we propose a new temporal augmentation method.
Figure 6.4: Randomized Frame Rate Conversion. We augment our training samples by jittering sampling
locations during rate conversion.
We convert the input videos to a fixed FPS of 8 such that every input video sample contains 16 frames.
FFmpeg [213] supports several frame conversion methods, including frame rounding, frame blending, and
optical-flow-based blending, but these methods are designed for good viewing experiences rather than as effective data sampling strategies. Frame rounding cannot utilize the available data effectively, as it rounds to the
frames nearest the desired timestamps. Frame blending and optical flow methods both produce frames different
from the source, so the model might learn from data in a different domain.
Inspired by [214, 215], we propose a randomized frame sampling method that selects the first and last frame
of the source video, and then randomly selects the remaining 14 frames in ascending order (Fig. 6.4).
This randomized frame sampling method augments the training data with more variants. It prevents the
network from learning any temporal patterns, structures or artifacts from video compression, and as a result
we found that models trained with this augmentation generalized better to unseen videos.
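A short sketch of the randomized index selection (first and last frames kept, 14 interior frames drawn at random in ascending order):

```python
import random

def random_frame_indices(num_source_frames, num_output_frames=16):
    """Keep the first and last source frames; draw the remaining interior frames
    as a random sorted subset of the remaining indices."""
    assert num_source_frames >= num_output_frames
    interior = random.sample(range(1, num_source_frames - 1), num_output_frames - 2)
    return [0] + sorted(interior) + [num_source_frames - 1]

# e.g. random_frame_indices(60) -> [0, 3, 7, ..., 55, 59] (a different draw each call)
```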
Our model was trained with a learning rate of 0.1 using an SGD optimizer with momentum, with an early stopping
strategy that stops when the validation loss has not decreased for 3 epochs. We then restore the weights with the lowest
validation loss. The model was trained for 78 epochs. Training took about 42.7 hours on a single
Nvidia V100.
6.5 Experiments
Baselines Recent works related to video cut point prediction include highlight prediction, video segmentation and action recognition. To the best of our knowledge, there are no models specifically designed to
predict dense, continuous scores of cut feasibility at frame level. Therefore, we introduce a number of
naive baselines by ablating the various components of our method, and also experiment with using an action
recognition model [216] as our baseline, with the hypothesis that cut suitability is often a function of when
actions have been completed. We use this baseline by fixing the pretrained action recognition weights, and
fine-tuning the last layer on our collected dataset using a standard cross-entropy loss.
Figure 6.5: Difference between the shotcut detector and our model. The shotcut detector fires when the
window includes a cut. Our model sees continuous, unedited clips and the window does not contain any
cuts.
6.5.1 Classification Evaluation
As ground truth labels in unedited footage are not widely available, we use our collected YouTube and Vimeo
videos for self supervision and quantitative evaluation. Please note that although we use edited videos for
evaluation, our model only sees continuous (unedited) video clips, whereas the shotcut detection model sees
edits. (Fig. 6.5). Our model needs to learn from the video content from a single clip and infer whether it is
a good start or end.
We evaluate on an 80/20 split of training and testing, and report training and test accuracy on both the start and
end tasks in Table 6.1. From these results, we conclude the following:
1. The model generalizes well to a very large set of diverse, unseen videos (87.51% test accuracy for
start task and 79.84% for end task). This suggests that our model is able to learn the common features
at the start or end of human-edited clips.
2. Cut point prediction is a high-level task that benefits from high-level features (semantic labels, motion
vectors) derived from training on related tasks. The ablation study results show that our proposed model
improves the classification accuracy of the baseline model by 28.54% for the start task and 21.99% for the
end task.
3. The start task and end task have different behavior and presumably learn different features. The
3-way classification model has lower accuracy than the separate models on both the start and end tasks. Based on
this result, we use two models instead of a unified model.
4. The contrastive loss and temporal augmentation significantly increase the model’s prediction accuracy.
In particular, the gaps between training and test accuracy are decreased, which suggests better robustness and
generalization.
6.5.2 Distribution of model prediction scores
As shown in Fig. 6.6, we compute the average scores of all test videos on 16 frames after a true starting
point. These 16 frames are all the temporally downsampled frames within the 2-second clip. We can see
Input | Model | Start Task (Train / Test) | End Task (Train / Test)
RGB | XE (Baseline) | 91.71 / 58.97 | 95.69 / 58.85
RGB | 3-way XE | 54.79 / 55.05 | 58.24 / 55.37
RGB | CT w/ TA | 77.12 / 68.33 | 70.07 / 61.14
RGB | CT w/o TA | 85.23 / 58.44 | 82.45 / 57.14
Optical Flow [211] | XE | 75.23 / 62.89 | 69.33 / 57.14
Mask RCNN [217] | XE | 62.15 / 56.85 | 61.24 / 53.73
RGB+Optical Flow | XE | 89.28 / 75.91 | 88.93 / 68.29
RGB+Mask RCNN | XE | 87.88 / 78.00 | 86.78 / 78.42
RGB+Optical Flow+Mask RCNN | XE | 97.45 / 68.29 | 94.53 / 65.32
⋆ RGB+Optical Flow+Mask RCNN | CT w/ TA | 93.42 / 87.51 | 89.76 / 79.84
Table 6.1: Quantitative comparison of accuracy values for different baselines and ablations. In this table
XE stands for cross entropy loss, CT stands for contrastive loss, and TA stands for temporal augmentation.
The last row prefixed by ⋆ is our proposed model. We can see that adding the higher-level features improves
accuracy, and our proposed contrastive learning training scheme improves generalization to our test set.
[Figure 6.6 data: average predicted scores and standard deviations per frame offset; panels “Starting Task” (offsets 0 to 15) and “Ending Task” (offsets −15 to 0); y-axis: average predicted score; x-axis: offset in frames within the 2-second clip.]
Figure 6.6: Average predicted scores on frames near the clip boundary. The x-axis is the offset of the input
clip as it shifts away from the boundary. The model’s prediction is highly concentrated on the true positive
at clip boundaries.
that the predicted score is the highest on the first frame and quickly drops to small values on frames away
from the start. This distribution confirms that the model’s prediction approximates the ground truth. We
observe a similar distribution for the ending task model where the high values are concentrated towards the
last frame.
6.5.3 User Study Evaluation
We conduct a user study to analyze the users’ subjective evaluations on the predicted scores of cut feasibility.
The participants are 15 individuals hired from the video production community on upwork.com who worked
on video projects ranging from consumer to professional levels. We present to the participants a web page
that shows a pair of short clips, and ask them to review and click on the one with better starting or ending,
respectively; left-right order is randomized. We generate a total of 3,000 pairs of clips that are sampled
Dataset / Task | Start | End | Clip
GT | 90.00 | 76.66 | 81.66
Test Set | 80.45 | 69.44 | 71.66
Livestream | 69.16 | 63.33 | 66.66
Table 6.2: User study results. Average evaluation scores (in percentage) for different tasks and datasets. A
higher number indicates that human viewers agree more with the model’s prediction or the ground truth.
from the test set and a new, out of domain set of unedited livestreaming videos (all natural videos) from
twitch.com. To add redundancy, we design the user study so that each web page is viewed and clicked by
five different users, which allows us to analyze the consensus of each selection among different individuals.
There are 3 types of tasks when users select the clip:
1. Start. Choose the clip with better starting from two N-second-clips.
2. End. Choose the clip with better ending from two N-second-clips.
3. Clip. Choose the clip with the better starting and ending. The clips are generally longer (between 2
and 15 seconds) than those in the other two tasks, and they are a closer approximation to real-world
video editing tasks.
For each video pair, we generate the positive one by sampling local peak values from the continuous curve
predicted by the model. A peak is defined as a local maximum inside the clip with predicted value above 0.9
threshold. The peak value indicates that the video sample is likely to have a good starting/ending cut point.
Similarly, we generate the negative one by sampling local valley values below 0.1. If the users are able
to select the positive clip, it suggests that the model is making a correct classification that matches human
viewers’ subjective decision for what makes a “good cut location”.
To find an upper bound of human agreement, and to validate our ground truth, we also ask the users to perform the same set of tasks on clips generated with ground truth labels, by sampling the true clip boundaries
as positive, and random samples elsewhere as negative.
We compute the subjective score of each selection by majority voting, which means we say that one clip
“wins” if there are more than half of the users making that decision. The average accuracy for the three tasks
is reported in Table 6.2. The results show that an expected upper bound for human agreement should be
around 77%-90%, as demonstrated on the GT dataset where cuts have been hand chosen by editors. This
indicates that there is a statistically significant consensus among users that the positive and negative samples
can be separated by humans. Further, we see that on our held out test set, humans agree with our model
69%-80% of the time. We also evaluate zero-shot dataset transfer by testing on the aforementioned out-of-domain livestream dataset (unedited livestreaming video), and see that accuracy scores are lower, 63%-69%,
although still significantly above chance (50%). This is likely due to the domain shift relative to the training
data.
Figure 6.7: Grad-CAM visualization of an input video clip. We can see that the network pays attention to
regions with motion, e.g., the waving hands.
6.5.4 Grad-CAM Visualization
To understand what our model is looking at, we utilize Grad-CAM [218] to visualize the activation heat
map. A few examples are shown in Fig. 6.7, 6.8. Our model detects semantically meaningful regions in
the scenes and pays attention to moving objects, landmarks and human faces, etc. This analysis gives us
insights into the model’s decision on evaluating a good cut position.
6.6 Applications
We use the models to predict the cut suitability on every frame. The local maxima (or values above
a threshold) can be used as candidate cut points for the starting or ending tasks. A starting cut point can be
combined with an ending cut point to generate a clip. We envision possible applications to assist users with
video editing tasks such as cutting, trimming and selection. See Fig. 6.9.
Cut point suggestion. The model recommends the most likely cut positions that can be combined to
produce clips.
Refine clip boundaries of a rough selection. The user selects a clip and then the clip boundaries will
snap to the closest candidate cut points predicted by the model.
Avoid bad cuts. Cutting is disabled in regions with very low prediction scores.
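A minimal sketch of turning the per-frame score curves into candidate cut points follows; the 0.9 threshold matches the user-study setup, while the local-maximum neighborhood size is an assumption.

```python
import numpy as np

def candidate_cuts(scores, threshold=0.9, neighborhood=8):
    """Return frame indices that are local maxima of the predicted score curve
    and exceed the threshold."""
    scores = np.asarray(scores)
    peaks = []
    for i in range(len(scores)):
        lo, hi = max(0, i - neighborhood), min(len(scores), i + neighborhood + 1)
        if scores[i] >= threshold and scores[i] == scores[lo:hi].max():
            peaks.append(i)
    return peaks

# A candidate clip pairs a start peak with a later end peak, e.g.:
# clips = [(s, e) for s in candidate_cuts(start_scores)
#                 for e in candidate_cuts(end_scores) if e > s]
```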
Figure 6.8: Grad-CAM visualization of an input video clip. We can see that the network pays particular
attention to salient features such as the horizon.
Figure 6.9: An example of automatic cut point prediction using our model. Given an input video, the model
predicts two curves. The starting cuts (orange) and ending cuts (red) are detected at peak values of the
curves. A candidate clip (blue segment) can be generated by combining the cut points in either automatic or
interactive fashion.
6.7 Conclusion and Future Work
In conclusion, we take a first step towards evaluating and learning from cut points, introduce a
new problem of cut suitability prediction from video data using weakly-supervised edited videos collected
in-the-wild, and demonstrate potential applications of the proposed method. Our paper provides practical
examples of how AI could be used for understanding and accelerating video editing.
Using the proposed method, we are able to simplify the video editing process by “snapping” cuts to frames
that are preferred by viewers, as validated by a user study. Our method uses contrastive learning with a
sampling curriculum designed for this task that improves results over a number of baselines. In addition, we
have shown that adding features derived from other high-level tasks, such as motion estimation and object
segmentation, improves the overall accuracy of our approach.
We hope that as edited videos become a more common form of communication and self-expression, this
and other similar computational cinematography technologies will enable new users to participate in this
medium. In the future, we would like to investigate the decision-making process of humans when editing
videos, which will help us understand the gap between human preference and the model’s prediction. We
plan to add the audio cut points in a late-fusion stage for regions of videos containing speech or music.
While we report the average accuracy on a very large dataset (2.85 million clips), we can augment the study
by classifying the videos into different categories using metadata or scene classifiers, and report the numbers
for each video type such as sports, touring, speech, etc. Training specialized models for each video domain
will help increase the accuracy.
Chapter 7
Future Work
7.1 OminiPlane: Reconstructing Planes from Single View, Sparse View, and
Monocular Video
Figure 7.1: Comparison between OminiPlane and baselines in three setups. Left: Single View Plane
Reconstruction setup. The baseline is PlaneRecTR[219]. Predicted plane masks are overlaid on the input
image. Our method detects more fine-grained planes with varying orientations. Middle: Sparse View setup.
The baseline is PlaneFormers [220]. Our method achieves better coverage and generates meshes without
holes. Right: Monocular video setup. The baseline is [13]. Our model is able to accurately detect more
planes, improving both recall and precision.
We present OminiPlane, a unified method capable of tackling plane detection and reconstruction for various
types of inputs: Single View, Sparse View, and Monocular Video. Our method leverages recent large-scale
pretrained geometric prior models such as DUSt3R [221] and Metric3D [222] to perform monocular depth
estimation and camera pose estimation. Using these, we aggregate multi-frame information into a point
cloud representation of the scene. From the point cloud, we propose a detection-guided RANSAC method
to extract planes. Specifically, we learn a set of per-plane detection queries that perform plane instance
Figure 7.2: Overall Architecture. The proposed method supports various inputs, including single view, sparse
view, and monocular video. Pretrained geometric priors, such as DUSt3R, provide vertex-level properties
(e.g., location, normal, semantic label), which are processed by detection queries to generate binary masks
for plane instance segmentation. A detection-guided RANSAC loop refines inliers to estimate plane coefficients and masks, enabling accurate and compact plane recovery.
segmentation on the point cloud. The detected instances are then used as inliers for RANSAC. Compared to
traditional point cloud representations, our learned surfaces are more accurate and compact.
7.1.1 Methods
7.1.1.1 Pretrained Geometric Prior Models
OminiPlane leverages large-scale pretrained geometric prior models, including DUSt3R [221] and Metric3D [222], to extract critical geometric insights. These models provide vertex-level features:
$$f_v = \{p_v, n_v, s_v\}, \tag{7.1}$$

where $p_v \in \mathbb{R}^3$ is the 3D spatial location, $n_v \in \mathbb{R}^3$ is the surface normal, and $s_v$ is the semantic label for each vertex $v$. These features enable monocular depth estimation and camera pose refinement, forming the basis for multi-frame processing.
7.1.1.2 Multi-Frame Aggregation and Point Cloud Representation
For sparse-view and monocular video inputs, OminiPlane aggregates multi-frame information into a unified point cloud $P = \{p_i\}_{i=1}^{N}$, where $p_i \in \mathbb{R}^3$ represents the $i$-th point. The aggregation process incorporates camera pose transformations $T_j \in SE(3)$ across multiple frames:

$$p_i = T_j \cdot p'_i, \tag{7.2}$$

where $p'_i$ is the corresponding point in the $j$-th frame. This step integrates geometric and semantic information, producing a rich representation of the scene.
7.1.1.3 Detection Queries for Plane Instance Segmentation
To detect planes, OminiPlane uses $N$ learnable detection queries $Q = \{q_k\}_{k=1}^{N}$, where each query predicts a binary mask $m_k \in \{0, 1\}^{|P|}$ over the point cloud. These masks represent individual plane instances:

$$m_k = \sigma(W_k \cdot F), \tag{7.3}$$

where $F \in \mathbb{R}^{|P| \times d}$ is the feature matrix of the point cloud, $W_k \in \mathbb{R}^{d}$ are the learnable weights for query $q_k$, and $\sigma(\cdot)$ is the sigmoid function.
7.1.1.4 Detection-Guided RANSAC
For each binary mask $m_k$, a RANSAC loop is applied to estimate plane coefficients $a_k = (a, b, c, d)$, representing the plane equation:

$$ax + by + cz + d = 0. \tag{7.4}$$

The RANSAC algorithm selects a subset of inliers $I_k$ from $m_k$ that minimizes the distance of inliers to the plane:

$$I_k = \{i \in P : |ax_i + by_i + cz_i + d| < \epsilon\}, \tag{7.5}$$

where $\epsilon$ is the inlier threshold. The plane coefficients $a_k$ are then optimized by solving:

$$a_k = \arg\min_{a} \sum_{i \in I_k} (ax_i + by_i + cz_i + d)^2. \tag{7.6}$$
This process iterates until a termination criterion is met, such as the maximum number of iterations or
convergence of inlier support.
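A simplified sketch of the per-mask loop of Eqs. 7.4–7.6: candidate planes are fit to random point triplets drawn from a detected instance, the largest inlier set is kept, and the plane is refined with a least-squares (SVD) fit. Thresholds and iteration counts are illustrative, not the values used in the method.

```python
import numpy as np

def fit_plane_ransac(points, eps=0.01, iters=200, rng=None):
    """points: (M, 3) 3D points from one detected plane mask.
    Returns plane coefficients (a, b, c, d) with unit normal, and the inlier mask."""
    rng = np.random.default_rng(0) if rng is None else rng
    best_inliers = None
    for _ in range(iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        if np.linalg.norm(n) < 1e-8:
            continue  # degenerate (collinear) sample
        n = n / np.linalg.norm(n)
        d = -n @ p0
        dist = np.abs(points @ n + d)              # point-to-plane distances (Eq. 7.5)
        inliers = dist < eps
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Least-squares refinement on the inliers (Eq. 7.6) via SVD of centered points.
    P = points[best_inliers]
    centroid = P.mean(axis=0)
    n = np.linalg.svd(P - centroid)[2][-1]         # normal = smallest right singular vector
    d = -n @ centroid
    return (n[0], n[1], n[2], d), best_inliers
```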
7.1.1.5 Output: Accurate and Compact Planes
The final output consists of refined plane coefficients $\{a_k\}_{k=1}^{N}$ and their associated binary masks $\{m_k\}_{k=1}^{N}$. These results represent a compact and accurate representation of planar surfaces, suitable for various downstream applications.
7.1.2 Experiments
7.1.2.1 Single View Plane Reconstruction
In the single view setup, OminiPlane demonstrates superior performance compared to PlaneRecTR [219].
As shown in Figure 7.3, the proposed method successfully identifies more fine-grained planes with varying
orientations, even in challenging scenarios with occlusions and complex geometries. The generated masks
closely align with the ground truth, highlighting the accuracy of OminiPlane’s plane detection mechanism.
Additionally, the reconstructed 3D models illustrate the method’s ability to preserve geometric consistency
and surface details, setting a new benchmark for single-view plane recovery tasks.
Figure 7.3: Single view plane recovery results compared with GT and PlaneRecTR on the ScanNet dataset.
Input images are processed to generate ground truth (GT) plane masks, which are compared with predictions
from PlaneRecTR and the proposed OminiPlane method. The visualizations include 2D plane masks and
reconstructed 3D models, highlighting the superior accuracy and detail achieved by the proposed approach.
Figure 7.4: 2-View Result. Reconstruction comparison on 2 views setup.
Figure 7.5: Multiview Test Results. Multiview reconstruction results on room-scale scenes.
7.1.2.2 Sparse View Plane Reconstruction
For the sparse view setup, OminiPlane achieves improved coverage and structural integrity compared to
PlaneFormers [220]. As depicted in Figure 7.4, the method generates seamless meshes without significant holes, effectively capturing planar surfaces even with limited viewpoints. This improvement can be
attributed to the multi-frame aggregation step, which leverages geometric priors and camera pose estimations to integrate sparse observations into a coherent point cloud representation. These results demonstrate
OminiPlane’s robustness and adaptability in sparse-view scenarios.
Table 7.1: Quantitative comparison of OminiPlane and baseline methods across different setups.
Setup | Metric | PlaneRecTR [219] | PlaneFormers [220] | Proposed (OminiPlane)
Single View | IoU | 0.65 | N/A | 0.82
Sparse View | Precision | N/A | 0.78 | 0.88
Sparse View | Recall | N/A | 0.76 | 0.84
Monocular Video | Chamfer Distance | N/A | 0.34 | 0.21
7.1.2.3 Monocular Video Plane Reconstruction
In the monocular video setup, OminiPlane outperforms the baseline method [13] in terms of both recall
and precision. Figure 7.5 showcases the method’s ability to reconstruct large, room-scale scenes with high
fidelity. The detection-guided RANSAC approach effectively extracts planes from the aggregated point
cloud, minimizing artifacts and inconsistencies. Furthermore, the temporal coherence introduced by the
multi-frame aggregation process ensures that planes are consistently reconstructed across frames, making
OminiPlane particularly suitable for video-based applications.
7.1.2.4 Quantitative Evaluation
To provide a comprehensive evaluation, we measure the performance of OminiPlane and baseline methods
using standard metrics, including Intersection over Union (IoU), Precision, Recall, and Chamfer Distance.
As summarized in Table 7.1, OminiPlane consistently outperforms the baselines across all setups. The
improved IoU scores indicate better alignment with ground truth, while higher precision and recall values
highlight the robustness of the plane detection mechanism. Lower Chamfer Distance further validates the
accuracy of the reconstructed 3D geometry.
7.1.2.5 Qualitative Analysis
The visual results in Figures 7.3, 7.4, and 7.5 highlight the qualitative improvements achieved by OminiPlane. The proposed method demonstrates consistent accuracy across various setups, with better delineation
of planar surfaces and fewer artifacts. These visualizations corroborate the quantitative findings, underscoring the effectiveness of OminiPlane’s unified approach.
7.1.3 Summary of Results
In summary, the experiments validate OminiPlane as a versatile and robust solution for plane detection and
reconstruction across diverse input types. Its ability to leverage pretrained geometric priors, perform multi-frame aggregation, and incorporate detection-guided RANSAC establishes it as a state-of-the-art method in
3D plane recovery tasks.
7.2 Aligning automated metrics with human experts for evaluation of
Structured Reconstruction
Metric / wireframe | GT | WF1 | WF2 | WF3 | WF4
Vertex F1 ↑ | 1.00 | 0.69 | 0.60 | 0.91 | 0.18
Edge F1 ↑ | 1.00 | 0.31 | 0.56 | 0.71 | 0.00
WED ↓ | 0.00 | 2.77 | 0.84 | 0.52 | 1.46
Edge chamfer distance ↓ | 0.00 | 13.99 | 4.22 | 25.89 | 176.48
Graph Spectral distance L2 ↓ | 0.00 | 0.33 | 0.25 | 0.57 | 1.57
Sampled edge point IoU ↑ | 1.00 | 1.00 | 0.81 | 0.66 | 0.06
Figure 7.6: Motivational example for this work. While humans tend to sort the wireframes from best
to worst in the presented order, the popular metrics sort them in different ways, sometimes in the completely
opposite order. Top row, left to right: ground truth wireframe; wireframe with edges split into several, maintaining geometric and topological accuracy; wireframe with removed parts of the edges, but the same vertices;
wireframe with a missing vertex and edges; wireframe with only one correct vertex. Bottom row: distances
between the GT and each wireframe; numbers that change the sorting are in red.
Evaluating structured 3D reconstruction is a critical yet challenging task, as existing automated metrics often
fail to align with expert human judgments. This misalignment is especially problematic in professional
applications, where subtle geometric and topological details significantly influence the perceived quality of
reconstructions. Motivational examples illustrate how metrics frequently produce rankings that contradict
human intuition, such as favoring reconstructions with fragmented edges over those with minor topological
errors, or overlooking missing key vertices.
In this paper, we systematically analyze the limitations of popular structured reconstruction metrics through
controlled comparisons with expert evaluations. Using a dataset of diverse reconstruction scenarios, we
reveal key inconsistencies in how these metrics assess structural fidelity and alignment with ground truth
models. Our results demonstrate the importance of considering both geometric accuracy and topological
coherence in evaluation metrics.
To bridge this gap, we propose practical recommendations for selecting or combining metrics tailored to
specific use cases, ensuring a closer alignment with human evaluations. By addressing these discrepancies, this work provides valuable insights for improving the reliability and interpretability of structured 3D
reconstruction assessments, advancing both academic research and professional practice.
7.2.1 Introduction
Benchmarks have been key drivers of progress in computer vision; the canonical example is certainly ImageNet [223], but prominent examples abound beyond image classification: object tracking [224, 225], image
retrieval [226, 227], image matching [228], 6D pose estimation [229], optical flow [4], etc.
Benchmarks have two main components – the dataset and the metrics. The data is the single most important
component, but progress is near impossible without being able to answer the question, "progress on what?"
Good metrics are the quantitative answer to this question.
We consider an area of structured and semi-structured reconstruction and recognition, which gained some
popularity recently. Given a set or sequence of sensory data, such as ground [230] or satellite images [231],
or aerial LiDAR [232], the goal is to produce a wireframe or a CAD model of the building or other structures.
Naturally, the modeling output is presented as a graph with vertices (such as house apex point) and edges
(roof line). Several datasets have been recently proposed [232, 230] for the task. The issue, however, is that
hardly any pair of papers in the area uses the same metric to evaluate quantitative results. Some of the papers use
recognition metrics such as precision or recall on vertices and edges [233] or a more enhanced version such
as Structured Average Precision [234, 235]. Others go for the graph-based metrics, such as the Wireframe
Edit Distance (WED) [236, 237, 233]. Finally, some methods treat the problem similarly to the point clouds
registration problem and report Chamfer distance [238, 239, 240].
This shows the difficulty of comparing wireframe or CAD reconstructions to each other. Other related fields,
like structure from motion, use downstream metrics, such as image generation quality [241] or camera pose
accuracy [228]. The difficulty here is related to the fact that structured reconstructions often have different
goals. For instance, one purpose of the reconstructed wireframe is to represent a building floorplan and answer
questions such as "what is the area of the bedroom?" Within this formulation, a black-box visual-and-language model, which takes an image as input and outputs a correct estimate, would be a perfect
match. On the other hand, a floorplan or a blueprint has value on its own, which cannot be replaced with
a black-box model.
Finally, many existing metrics, while being useful, often fail to deliver value in practice when comparing two
imperfect estimations. It is also hard to design a metric that compares "very good" solutions and "very
bad" solutions well at the same time. Moreover, there is evidence [230] that such metrics can be "hacked"
or exploited in such a way that obviously bad solutions have better scores than flawed but ultimately quite
reasonable solutions. Such examples include a number of corner cases, which can cause existing metrics
to become useless in practical scenarios. For instance, imagine solutions where the graph connectivity is
broken, but the solution still visually resembles the ground truth. In such cases, most commonly used metrics,
such as WED, fail completely, whereas our proposed rasterization-based metric assigns a high score to this
solution. This paper presents the following contributions:
(1) We show the advantages and drawbacks of existing metrics for structured reconstruction and provide
recommendations on when and how one should use them.
(2) We measure the perceived quality of reconstructions by human domain experts, which gives us a ranking
of all structured reconstructions. This ranking is then compared to the ranking given by all metrics.
(3) We list different metric properties that might be required or desired depending on the application.
(4) We propose a rasterization-based metric, which is fast to compute, has only a single hyper-parameter to
tune, and which allows for reasonable comparison both of low and high-quality predictions, and also is
aligned with human perception of usability of the wireframe or CAD models.
We show the benefit of the proposed metric on 3D wireframes (S²3DR, Building3D), as well as on low-poly
mesh reconstructions.
7.2.2 What do we actually need from the wireframe comparison metrics?
While producing a "one metric to rule them all" remains elusive (largely due to application dependence), we
consider semi-automated modeling as a target task that has enough generality to drive progress, can smoothly
scale to fully automated modeling, and is still tractable and reliable for commercial applications today.
We consider the semi-automatic 3D modeling task as our use-case:
For this task, the wireframe representation is estimated from some "raw" input like images or a LiDAR
point cloud, and then transferred to the human experts to either approve it, correct it and then approve, or
reject it and create the model from scratch manually. We argue that such a task formulation creates an implicit
usefulness ranking over reconstructions (and thereby over reconstruction methods). Otherwise, without
considering human involvement, everything becomes binary: either the model is perfect and can be used
for 3D printing, measurement extraction, etc., or it is not.
For this reason, we consider the following experimental setup to benchmark the wireframe comparison
metrics. A pool of professional 3D modelers, whose everyday job is to create CAD-like models from raw
data such as images is asked to rank triplets of wireframes from the best to worst, and also to indicate if
the top result in the triplet is useful for 3D modeling or not. An example of the ranking setup is shown in
Fig. 7.7. All wireframes are superimposed with ground truth models, and the labelers can translate, scale,
and rotate the wireframes in 3D.
The wireframes are drawn from several pools, as described below. We then evaluate how the existing metrics
agree with the judgment of professional human 3D modeling experts.
Pool 1 – S²3DR. We acquire a representative set of S²3DR challenge entries [230] and the PC2WF [236]
baseline. These wireframes were algorithmically reconstructed from multiview inputs with the goal of
minimizing a variant of WED. We include entries from the top 10. The ground truth models were
created by human experts and have undergone significant validation. The input data were captured by users
on mobile phones in North America.
Pool 2 – Building3D. Similarly, we acquire a representative set of Building3D [232] challenge entries and
the PC2WF baseline. These wireframes were algorithmically reconstructed from LiDAR scans of roofs
captured from an aerial platform in Tallinn, Estonia. We include entries from the top 10.
Figure 7.7: Wireframe ranking interface for human labelers.
Pool 3 — Corrupted ground truth. We apply one of the following operations to the ground truth wireframes from Pools 1 and 2 (examples are shown in Fig. 7.8):
• Randomly delete one of the vertices and all the edges connected to it.
• Split the edge into several edges and perturb the vertex positions, without breaking the topology.
• Split the edge into several edges and randomly delete some of the smaller edges.
• Randomly remove a fraction of correct edges and vertices.
• Randomly add wrong edges to the model.
Specific parameters of the algorithmic corruptions are listed in the supplementary.
Unit-tests. In addition to being aligned with human judgement, we also propose a set of "unit-tests" for the
metrics, which we believe are reasonable and check if the metrics satisfy these requirements. For example,
if wrong edges E1 and E2 are added to the GT model, the resulting wireframe should be scored lower by the
metric than if either E1 or E2 is added separately.
Desired properties of dissimilarity scores. The following are the formal properties of mathematical metrics, as well as additional properties relevant to evaluating dissimilarity in structured reconstruction tasks.
• Identity of Indiscernibles: This property ensures that identical inputs receive a dissimilarity score
of zero, indicating perfect similarity. For any reconstruction x, a metric d satisfies this property if
Figure 7.8: Examples of corrupted ground truth wireframes, used for wireframe ranking. Left to right:
ground truth, deformed edges, vertex duplication and random movement, edge addition, edge deletion.
d(x, x) = 0. The test_identity_of_indiscernibles function is used to check for this
behavior.
• Symmetry: A symmetric metric produces the same dissimilarity score regardless of the order of
the inputs. For reconstructions x and y, the metric satisfies symmetry if d(x, y) = d(y, x). This is
evaluated through the test_symmetry and test_near_symmetry functions. While perfect
symmetry is ideal, near-symmetry may occur in the presence of noise, transformations, or numerical
differences.
• Triangle Inequality: The triangle inequality ensures that for any three reconstructions x, y, and
z, the dissimilarity between x and z is less than or equal to the sum of dissimilarities between x
and y, and y and z. This relationship is expressed as d(x, z) ≤ d(x, y) + d(y, z). The test_triangle_inequality_randother, test_triangle_inequality_add_noise, and
test_triangle_inequality_del1_del2 functions are used to verify this property.
• Monotonicity: This property describes how the dissimilarity score behaves when components (such
as vertices or edges) are removed from a reconstruction. A metric satisfies monotonicity if the dissimilarity score either increases or remains constant as vertices or edges are deleted. Similarly, we check
if the score either decreases or remains constant as wrong vertices or wrong edges are added. The
test_monotonic_when_deleting_verts and test_monotonic_when_deleting_edges functions are employed to assess this property.
• Smoothness: This property holds when the metric changes smoothly over time (either increases or
decreases). This is evaluated by moving random vertices with small increments and checking the
variance of the differences in the score.
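A minimal pytest-style sketch of how such property checks could be exercised against a dissimilarity function d(wf_a, wf_b) is given below; the wireframe fixtures, corruption helpers, and exact signatures are hypothetical, not the actual test suite.

```python
import random

def test_identity_of_indiscernibles(d, wireframes):
    # d(x, x) must be exactly zero for every reconstruction.
    for wf in wireframes:
        assert d(wf, wf) == 0

def test_symmetry(d, wireframes, tol=1e-6):
    # d(x, y) and d(y, x) should agree up to a small numerical tolerance.
    for wf_a, wf_b in zip(wireframes, reversed(wireframes)):
        assert abs(d(wf_a, wf_b) - d(wf_b, wf_a)) <= tol

def test_triangle_inequality(d, wireframes, tol=1e-6):
    # d(x, z) <= d(x, y) + d(y, z) for a random triple of reconstructions.
    x, y, z = random.sample(wireframes, 3)
    assert d(x, z) <= d(x, y) + d(y, z) + tol

def test_monotonic_when_deleting_edges(d, wf, delete_random_edge):
    # Deleting more edges should never make the prediction look better.
    corrupted = delete_random_edge(wf)
    more_corrupted = delete_random_edge(corrupted)
    assert d(wf, corrupted) <= d(wf, more_corrupted)
```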
7.2.3 Metrics
The following metrics are considered:
WED – Wireframe Edit Distance was first proposed by Liu [236] as an extension of the Graph Edit Distance (GED) [242]. GED quantifies the distance between two graphs as the minimum number of elementary
operations (inserting and deleting edges and vertices) required to transform one graph into another. WED
extends this to wireframes (graphs with node positions and edge lengths) and proposes a cheap approximation to the NP-Hard problem of computing the optimal sequence of edits. Concretely, an assignment is first
computed between the predicted and ground truth vertices, and a cost is paid proportional to the distance
between matched vertices. Next, unmatched vertices are deleted and missing vertices inserted (paying a
cost proportional to the number of inserted/deleted vertices). Finally, given the vertex assignments, missing
edges are inserted and extra edges deleted paying a cost proportional to their length. In order to use WED,
one needs to decide on the cost of insertion and deletion of the vertices, as well as the order of operations.
ECD – Edge Chamfer Distance. We consider a family of chamfer-like metrics between two sets A and B.
The general form is:

$$d(A, B) := \inf_{\pi_{AB}: A \to B} \mathbb{E}_{a \in A}\left[ f(a, \pi_{AB}(a)) \right], \tag{7.7}$$

where $\pi_{AB}$ represents an assignment from elements in A to elements in B, and $f$ is typically an $\ell_p$ norm.
Different constraints on πAB yield different metrics:
• The classical chamfer distance corresponds to choosing $\pi_{AB}(a) = \arg\min_{b \in B} f(a, b)$,
i.e., nearest neighbor matching.
• A stricter variant requires mutual nearest neighbors, where (a, b) are matched only if a is the nearest
neighbor of b and vice versa.
• The most constrained version requires πAB to be a bijective matching, which can be computed via the
Hungarian algorithm and is equivalent to the Earth Mover’s Distance when f is the ℓp norm.
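For reference, a small sketch of the classical (nearest-neighbor) variant, computed symmetrically on points sampled along the edges of the two wireframes; the mutual-nearest-neighbor and bijective variants differ only in the assignment step.

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(A, B):
    """Symmetric nearest-neighbor chamfer distance between point sets A and B,
    each of shape (N, 3), e.g. points sampled along wireframe edges."""
    d_ab = cKDTree(B).query(A)[0]   # distance from each point in A to its nearest point in B
    d_ba = cKDTree(A).query(B)[0]   # distance from each point in B to its nearest point in A
    return d_ab.mean() + d_ba.mean()
```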
Length-Weighted Spectral Graph Distance – SD incorporates both topological and geometric information by framing graph (wireframe) distance in terms of distances between the spectra of weighted graph Laplacians. We measure the spectral distance using the 2-Wasserstein metric between the eigenvalue distributions:

$$SD(G_1, G_2) := W_2(\lambda(L_1), \lambda(L_2)) \tag{7.8}$$

where $\lambda(L)$ denotes the spectrum of the Laplacian $L$. For a graph $G = (V, E)$, the weighted graph Laplacian is defined as:

$$L := D - A \tag{7.9}$$

where $D$ is the weighted degree matrix (a $|V| \times |V|$ diagonal matrix with each diagonal entry containing the sum of the lengths of edges incident to that vertex), and $A$ is the weighted adjacency matrix ($|V| \times |V|$ with $A_{ij} = \|V_i - V_j\|_2$ iff $(i, j) \in E$ and 0 otherwise).
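A sketch of this distance is given below; padding the shorter spectrum with zeros for graphs of different sizes is an assumption rather than part of the definition.

```python
import numpy as np

def weighted_laplacian(vertices, edges):
    """vertices: (V, 3) positions; edges: list of (i, j) index pairs."""
    V = len(vertices)
    A = np.zeros((V, V))
    for i, j in edges:
        A[i, j] = A[j, i] = np.linalg.norm(vertices[i] - vertices[j])  # edge length weight
    return np.diag(A.sum(axis=1)) - A

def spectral_distance(wf1, wf2):
    """2-Wasserstein distance between sorted Laplacian spectra of two wireframes,
    each given as a (vertices, edges) pair."""
    s1 = np.sort(np.linalg.eigvalsh(weighted_laplacian(*wf1)))
    s2 = np.sort(np.linalg.eigvalsh(weighted_laplacian(*wf2)))
    n = max(len(s1), len(s2))                       # pad to equal length (assumption)
    s1 = np.pad(s1, (0, n - len(s1)))
    s2 = np.pad(s2, (0, n - len(s2)))
    return np.sqrt(np.mean((s1 - s2) ** 2))         # W2 between sorted empirical spectra
```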
Corner and Edge Metrics. We also compute precision, recall, and F1 scores for both corners and edges.
For corners, we consider a prediction correct if it lies within a distance threshold of a ground truth corner. For
edges, we use the Hausdorff distance between line segments to determine matches. These metrics provide
an intuitive measure of the topological accuracy of the predicted wireframes.
Hausdorff Distance measures the maximum of minimal distances between two sets of points. For wireframes, we sample points along the edges and compute the Hausdorff distance between these point sets,
providing a measure of geometric similarity that considers both corner positions and edge geometry.
Structured Average Precision, proposed in [234], is the area under the precision-recall curve of line
segment detection, similarly to object detection. A line segment is considered correct when both edge
vertices are within a user-defined distance from the ground truth. In [234], thresholds of 10 and 15 pixels are
used for an image size of 128 pixels.
Intersection over Union is a popular metric in a wide range of fields, including segmentation, tracking, and object detection. However, it has not been used to assess the quality of wireframe reconstructions. First, we extend the
definition of a wireframe to a set of cylinders with a fixed radius (the only hyperparameter of this metric). Then, we define this metric as the intersection over union between the two sets of cylinders given by the two
wireframe reconstructions that need to be compared. Ideally, a closed-form solution would be derived;
however, whether such a solution is even feasible remains an open research question. In this paper,
we propose several approximations of this metric.
Mesh approximation. The most precise approximation is based on a mesh approximation of every cylinder with a fixed
number of edges. Then, the mesh intersection is computed to derive the final score.
Voxel approximation. Another approximation is to discretize the 3D space into a voxel grid and voxelize the
cylinders. Then, intersection over union can be easily computed in a similar way as in the 2D domain. This
solution, however, has high storage requirements and is impractical for longer edges and larger scenes.
Point sampling approximation. We sample random points from both sets of cylinders and compute the fraction of points that fall inside both sets. This approximation is faster to compute,
and its runtime can be adjusted via the number of sampled points.
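A sketch of the point-sampling approximation is given below; it samples uniformly in a shared bounding box (a Monte Carlo variant of the scheme described above) and treats each edge as a capsule of the given radius, which are simplifying assumptions.

```python
import numpy as np

def point_in_cylinders(points, segments, radius):
    """Boolean mask: is each point within `radius` of any edge segment?
    segments: list of (p0, p1) pairs, each a (3,) array."""
    inside = np.zeros(len(points), dtype=bool)
    for p0, p1 in segments:
        d = p1 - p0
        t = np.clip((points - p0) @ d / (d @ d), 0.0, 1.0)   # projection onto the segment
        closest = p0 + t[:, None] * d
        inside |= np.linalg.norm(points - closest, axis=1) <= radius
    return inside

def sampled_iou(wf_a, wf_b, radius=0.05, n=100_000, rng=None):
    """Monte Carlo IoU of two cylinder sets, sampled in a shared bounding box."""
    rng = np.random.default_rng(0) if rng is None else rng
    all_pts = np.array([p for seg in (wf_a + wf_b) for p in seg])
    lo, hi = all_pts.min(0) - radius, all_pts.max(0) + radius
    samples = rng.uniform(lo, hi, size=(n, 3))
    in_a = point_in_cylinders(samples, wf_a, radius)
    in_b = point_in_cylinders(samples, wf_b, radius)
    union = (in_a | in_b).sum()
    return (in_a & in_b).sum() / union if union else 1.0
```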
7.2.4 Experiments
7.2.4.1 Human wireframe ranking and its consistency
To determine which of the metrics under consideration are most appropriate, we employ a group of human
3D modeling experts to provide pseudo-ground-truth rankings of the solutions. Given a pool of samples
generated from various automated reconstruction approaches, the experts were shown
triples of solutions sampled from this pool. Then, they were asked to rank them from worst
to best. We additionally generated synthetically ranked triples by adding increasing levels of corruption
to the ground truth (see Fig. 7.8). We explore several approaches to map these triples to a single "quality"
factor. Our primary approach is to model the quality of each solution using a Bradley-Terry (BT) preference
model on the expressed preferences of the rankers (using the BT-Abilities as the scores). We also explore
a sparse matrix factorization approach based on the Singular Value Decomposition (SVD). We further use
both methods to estimate the agreement between raters and the structure of their preferences.
The Bradley-Terry model defines

$$P(i > j) = \frac{\theta_i}{\theta_i + \theta_j} \tag{7.10}$$

as the probability of solution $i$ beating solution $j$, given the win-rate matrix $W$, where $W_{ij}$ is the number of times $i$ has beaten $j$ if they played, and 0 otherwise. We iteratively update $\theta$ until convergence using regularized gradient ascent. The regularization helps ensure numerical stability and prevents degenerate solutions. Specifically, at each iteration we randomly sample a subset of solutions with replacement and, for each solution $i$, we compute:

$$\frac{\partial \mathcal{L}}{\partial \theta_i} = \sum_{j \neq i}\left[ w_{ij} - (w_{ij} + w_{ji}) \frac{\theta_i}{\theta_i + \theta_j} \right] - 2\lambda\theta_i, \tag{7.11}$$

where $w_{ij}$ represents the number of times solution $i$ was preferred over solution $j$, and $\lambda$ is the regularization
parameter.
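A compact sketch of fitting the abilities from the win-count matrix with the gradient of Eq. 7.11 follows; it uses full-batch rather than sub-sampled updates, and the learning rate, regularization, and iteration count are illustrative.

```python
import numpy as np

def fit_bradley_terry(W, lr=0.01, reg=0.01, iters=2000):
    """W[i, j] = number of times solution i was preferred over solution j.
    Returns Bradley-Terry ability scores (higher = more preferred)."""
    n = W.shape[0]
    theta = np.ones(n)
    pair_totals = W + W.T                        # comparisons played between i and j
    for _ in range(iters):
        expected_wins = pair_totals * (theta[:, None] / (theta[:, None] + theta[None, :]))
        grad = (W - expected_wins).sum(axis=1) - 2 * reg * theta   # gradient of Eq. 7.11
        theta = np.maximum(theta + lr * grad, 1e-6)                # keep abilities positive
    return theta / theta.sum()

# The consensus ranking is the ordering of solutions by their fitted abilities.
```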
To quantify inter-rater agreement, we fit a separate Bradley-Terry model per rater and compute both the pairwise rank correlations between raters and the rank correlation between each of the raters and the consensus.
Between most raters we see low but positive pairwise rank correlation, with much higher correlations between
some raters and the consensus.
We also use SVD to analyze the ratings. We form the ratings matrix $R$ with one row per solution and one
column per rater, and normalize it between $-1$ and $1$. We then apply SVD to factorize it as

$$R = U\Sigma V^{T} \tag{7.12}$$

where $R \in [-1, 1]^{n \times m}$, $U \in \mathbb{R}^{n \times n}$, $\Sigma \in \mathbb{R}^{n \times m}$, and $V \in \mathbb{R}^{m \times m}$, with $n$ being the number of solutions and
$m$ being the number of raters. If (and only if) the raters are in perfect agreement, the rank of $R$ will be 1, i.e.,
$R$ will have exactly one non-zero singular value ($\sigma_1$). In practice, raters do not display perfect agreement;
however, if we see approximate rank-1 structure we can conclude there is a single dominant linear factor
which describes the variance in the data well. Concretely, using the formula

$$v_i = \frac{\sigma_i^2}{\sum_j \sigma_j^2} \tag{7.13}$$

we conclude that approximately 90% of the variance is explained by the largest factor, while only $\sigma_2^2 / \sum_j \sigma_j^2$ is explained
by the second largest. We can then read the ratings directly off the first left singular vector of $R$. Additionally, we observe a ratio $Z = \sigma_1^2 / \sigma_2^2$ between the first and second singular values of $R$, indicating a
single dominant factor. We find a Spearman coefficient $\Omega$ between the rankings implied by SVD and those
implied by BT, lending additional evidence to the hypothesis that there is a true "quality" factor driving the
raters' views.
Chapter 8
Conclusion
This dissertation has explored the critical role of semantic structures in advancing the field of 3D vision
and generation. Moving beyond traditional non-semantic representations like point clouds, we have demonstrated the transformative advantages of incorporating semantic information into the processes of understanding, generating, and modifying 3D environments. By bridging the gap between implicit and explicit
representations, this work offers scalable, interpretable, and application-ready solutions that address key
challenges in 3D modeling.
Through novel methods such as UniPlane (Ch.2) and PlanarNeRF (Ch.3), we introduced plane-aware techniques for scene reconstruction, enabling implicit representations to produce explicit plane structures. This
approach enhances structural interpretability while retaining the high fidelity and scalability of implicit models. In OrientDream (Ch.4), we demonstrated how orientation-conditioned diffusion models can provide
explicit control during text-to-3D generation, addressing issues like the Janus problem and enhancing user
interaction. Similarly, EA3D (Ch.5) introduced text-guided semantic editing for implicit representations,
enabling precise and intuitive modifications of 3D models, overcoming longstanding usability challenges.
These contributions collectively highlight the value of integrating semantic understanding into 3D workflows. The results consistently show that leveraging semantic structures leads to more accurate, interpretable, and scalable systems, as evidenced by improved performance in complex scenes and enhanced
user control in generative tasks.
While this dissertation represents a step forward, the journey toward richer and more interactive 3D experiences is ongoing. Future research directions include:
1. Expanding Semantic Representations: As discussed in Sec. 7.1, where we explored reconstructing
planes from diverse inputs like single views and sparse views, future work involves moving beyond
planar structures to incorporate a broader range of semantic primitives, such as curves, objects, and
their relationships.
2. Defining the Right Metrics: As discussed in Sec. 7.2, establishing metrics that align with human
perception and domain-specific requirements is crucial for evaluating and guiding the development of
structured 3D models. Automated metrics must go beyond pixel-level accuracy to reflect the semantic
quality and usability of outputs, bridging the gap between human intuition and machine evaluation.
3. Multimodal Integration: Fusing semantic information from diverse sources, including text, images,
and audio, to create immersive and comprehensive 3D environments.
4. Real-time Interaction: Developing methods for real-time semantic understanding and manipulation
of 3D scenes, unlocking applications in augmented reality and human-robot collaboration.
5. Generalization and Robustness: Enhancing the generalization capabilities of 3D models across
diverse datasets and real-world scenarios.
By continuing to push the boundaries of semantic understanding and focusing on meaningful evaluation,
this work lays the foundation for future innovations that will bridge the gap between the digital and physical
worlds. Ultimately, these advancements will unlock new possibilities for creative expression, scientific
discovery, and technological progress, bringing us closer to the vision of fully interpretable, scalable, and
interactive 3D workflows.
Abstract
The ability to understand, generate, and modify 3D environments is foundational for applications such as virtual reality, autonomous driving, and generative AI tools. However, existing methods often rely on non-semantic point clouds as their representation, which capture only geometric information without semantic context. This limitation creates a significant gap in both interpretability and performance when compared to methods that leverage semantic information. Additionally, non-semantic approaches often struggle to scale effectively with increasing complexity, underscoring the need for semantic structures to enhance scalability and adaptability.
This dissertation addresses these limitations by introducing methods that place controllable semantic structure at the center of 3D understanding, generation, and editing. First, to improve 3D scene understanding, we propose plane-aware techniques, including planar priors and plane-splatting volume rendering, that provide explicit geometric and semantic representations and yield more accurate, more interpretable reconstructions than traditional point-cloud-based approaches. Second, for 3D content generation, we develop an orientation-conditioned diffusion model that gives users precise, flexible control over the alignment and orientation of generated objects. Third, to support intuitive editing of 3D environments, we introduce a method that projects text-guided 2D segmentation maps onto 3D models, bridging the gap between semantic understanding and user-driven modification (a minimal illustrative sketch of this projection step follows the abstract).
These contributions collectively address the semantic and performance gaps in 3D reconstruction and generation, demonstrating that integrating semantic information not only improves interpretability and precision but also enables models to scale more effectively for complex applications. By combining controllable semantic structures with geometric understanding, this dissertation advances the state-of-the-art in 3D vision and generation, paving the way for more scalable, interpretable, and interactive 3D workflows.
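The following is a minimal, illustrative sketch of the kind of 2D-to-3D label lifting referred to in the third contribution above; it is not the dissertation's implementation. It assumes a per-pixel depth map and pinhole camera intrinsics are available for the view in which the text-guided 2D segmentation was produced, and the function name lift_labels_to_3d, the array shapes, and the synthetic inputs are placeholders chosen purely for illustration.

# Hypothetical sketch (not the dissertation's method): lift a 2D segmentation
# mask onto 3D points using a per-pixel depth map and pinhole intrinsics.
import numpy as np

def lift_labels_to_3d(seg_mask, depth, K):
    """Back-project labeled pixels to 3D camera-space points.

    seg_mask: (H, W) integer label map (e.g., from a text-guided 2D segmenter).
    depth:    (H, W) per-pixel depth in meters.
    K:        (3, 3) pinhole intrinsics matrix.
    Returns (N, 3) points and (N,) labels for pixels with valid depth.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))   # pixel coordinate grid
    valid = depth > 0                                # keep pixels with valid depth
    z = depth[valid]
    x = (u[valid] - K[0, 2]) * z / K[0, 0]           # X = (u - cx) * Z / fx
    y = (v[valid] - K[1, 2]) * z / K[1, 1]           # Y = (v - cy) * Z / fy
    points = np.stack([x, y, z], axis=-1)
    labels = seg_mask[valid]
    return points, labels

# Toy usage with synthetic data (placeholder values).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
depth = np.full((480, 640), 2.0)                     # constant 2 m depth
seg = np.zeros((480, 640), dtype=np.int64)
seg[100:200, 100:200] = 1                            # one labeled region
pts, lbl = lift_labels_to_3d(seg, depth, K)
print(pts.shape, lbl.shape)

In practice the lifted, labeled points would then be associated with the 3D model's surface (for example by nearest-neighbor or mesh-projection queries), but that association step is beyond the scope of this sketch.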
Conceptually similar
• 3D deep learning for perception and modeling
• 3D inference and registration with application to retinal and facial image analysis
• Green learning for 3D point cloud data processing
• Point-based representations for 3D perception and reconstruction
• 3D face surface and texture synthesis from 2D landmarks of a single face sketch
• Object detection and recognition from 3D point clouds
• Reconstructing 3D reconstruction: a graphical taxonomy of current techniques
• Data-driven 3D hair digitization
• Feature-preserving simplification and sketch-based creation of 3D models
• Accurate 3D model acquisition from imagery data
• Understanding the 3D genome organization in topological domain level
• Green knowledge graph completion and scalable generative content delivery
• 3D modeling of eukaryotic genomes
• Face recognition and 3D face modeling from images in the wild
• Green image generation and label transfer techniques
• Unsupervised learning of holistic 3D scene understanding
• Tractable information decompositions
• Complete human digitization for sparse inputs
• 3D object detection in industrial site point clouds
• Machine learning methods for 2D/3D shape retrieval and classification
Asset Metadata
Creator: Huang, Yuzhong (author)
Core Title: Semantic structure in understanding and generation of the 3D world
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Degree Conferral Date: 2024-12
Publication Date: 12/06/2024
Defense Date: 12/03/2024
Publisher: Los Angeles, California (original), University of Southern California (original), University of Southern California. Libraries (digital)
Tag: 3D reconstruction, diffusion model, generative model, Janus problem, NeRF, OAI-PMH Harvest, planar reconstruction, text to 3d
Format: theses (aat)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Morstatter, Fred (committee chair), Nakano, Aiichiro (committee member), Ortega, Antonio (committee member), Wang, Yue (committee member)
Creator Email: yuzhongh@usc.edu, yuzhonghuangcs@gmail.com
Unique identifier: UC11399EHFC
Identifier: etd-HuangYuzho-13673.pdf (filename)
Legacy Identifier: etd-HuangYuzho-13673
Document Type: Dissertation
Rights: Huang, Yuzhong
Internet Media Type: application/pdf
Type: texts
Source: 20241210-usctheses-batch-1227 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email: cisadmin@lib.usc.edu