3D DEEP LEARNING FOR PERCEPTION AND MODELING
by
Weiyue Wang
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
August 2019
Copyright 2019 Weiyue Wang
Acknowledgments
First and foremost, I am deeply grateful to my Ph.D. advisor, Prof. Ulrich Neumann, for
his care and support throughout the years. His insightful discussions, helpful guidance
and general openness have led and encouraged me in this area of study. It was a great
honor to be his Ph.D. advisee, and it was a great journey to study in the CGIT lab.
I would like to thank my co-authors, Duygu Ceylan, Qiangui Huang, Radomir Mech,
Naiyan Wang, Xiaomin Wu, Qiangeng Xu, Chao Yang, Suya You, and Ronald Yu,
whom I appreciate and enjoyed working with. They also contributed to the work found
in this thesis. I would also like to thank my colleagues and labmates at USC, partic-
ularly, Jing Huang, Qiangui Huang, Yijing Li, Guan Pang, Rongqi Qiu, Bohan Wang,
Cho-Ying Wu, Qiangeng Xu, Chao Yang, Danyong Zhao, Mianlun Zheng, Yiqi Zhong,
and Yi Zhou, for a wonderful working experience, their friendship and help.
I also want to thank Tusimple Inc. and Adobe Inc. for providing me with internship
opportunities. My research has been greatly inspired by these industry experiences.
I would like to thank Prof. Weiyao Lin, who introduced me to the realm of
computer vision during my undergraduate years. I feel extremely fortunate to be part of
this community and to get to solve interesting and challenging problems.
I also sincerely thank my family for their encouragement and love. I greatly
appreciate my parents, who always give me support, understanding and open space.
Last but not least, I would like to thank Prof. Ram Nevatia, Prof. C.-C. Jay Kuo,
Prof. Hao Li, and Prof. Joseph Lim for serving on my qualifying and dissertation
committees. I appreciate their insightful suggestions and helpful feedback.
Contents
Acknowledgments ii
List of Tables vii
List of Figures ix
Abstract xiv
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 From 2D to 3D . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 Advances and Challenges in 3D Deep Learning . . . . . . . . . 2
1.2 Overview of Contributions . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 3D Perception . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.2 3D Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.3 Organization of the Dissertation . . . . . . . . . . . . . . . . . 7
2 2.5D Perception: RGB-D Segmentation 8
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.1 RGB-D Semantic Segmentation . . . . . . . . . . . . . . . . . 11
2.2.2 Spatial Transformations in CNN . . . . . . . . . . . . . . . . . 12
2.3 Depth-aware CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.1 Depth-aware Convolution . . . . . . . . . . . . . . . . . . . . 13
2.3.2 Depth-aware Average Pooling . . . . . . . . . . . . . . . . . . 14
2.3.3 Understanding Depth-aware CNN . . . . . . . . . . . . . . . . 15
2.3.4 Depth-aware CNN for RGB-D Semantic Segmentation . . . . . 17
2.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4.1 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.2 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4.3 Model Complexity and Runtime Analysis . . . . . . . . . . . . 26
2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3 3D Perception: Point Cloud Instance Segmentation 28
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.1 Object Detection and Instance Segmentation . . . . . . . . . . 31
3.2.2 3D Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.3 Similarity Metric Learning . . . . . . . . . . . . . . . . . . . . 32
3.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3.1 Similarity Group Proposal Network . . . . . . . . . . . . . . . 33
3.3.2 Group Proposal Merging . . . . . . . . . . . . . . . . . . . . . 38
3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.4.1 Implementation Details . . . . . . . . . . . . . . . . . . . . . . 39
3.4.2 S3DIS Instance Segmentation and 3D Object Detection . . . . . 41
3.4.3 NYUV2 Object Detection and Instance Segmentation Evaluation 45
3.4.4 ShapeNet Part Instance Segmentation . . . . . . . . . . . . . . 50
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4 3D Modeling: Shape Inpainting 52
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2.1 Generative models . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2.2 3D Completion . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2.3 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . 56
4.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3.1 3D Encoder-Decoder Generative Adversarial Network (3D-ED-
GAN) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3.2 Long-term Recurrent Convolutional Network (LRCN) Model . 60
4.3.3 Training the hybrid network . . . . . . . . . . . . . . . . . . . 63
4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.4.1 3D Objects Inpainting . . . . . . . . . . . . . . . . . . . . . . 65
4.4.2 Feature Learning . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.5 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . 73
5 3D Modeling: 3D Deformation Network 75
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.2.1 3D Mesh Deformation . . . . . . . . . . . . . . . . . . . . . . 77
5.2.2 Single View 3D Reconstruction . . . . . . . . . . . . . . . . . 78
5.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.3.1 Shape Deformation Network (3DN) . . . . . . . . . . . . . . . 80
5.3.2 Learning Shape Deformations . . . . . . . . . . . . . . . . . . 81
5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.4.1 Implementation Details . . . . . . . . . . . . . . . . . . . . . . 89
5.4.2 Shape Reconstruction from Point Cloud . . . . . . . . . . . . . 90
5.4.3 Single-view Reconstruction . . . . . . . . . . . . . . . . . . . 92
5.4.4 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.4.5 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6 3D Modeling: 3D Implicit Surface Generation 99
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.3.1 DISN: Deep Implicit Surface Network . . . . . . . . . . . . . . 104
6.3.2 Surface Reconstruction . . . . . . . . . . . . . . . . . . . . . . 108
6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.4.1 Single-view Reconstruction Comparison With State-of-the-art
Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.4.2 Camera Pose Estimation . . . . . . . . . . . . . . . . . . . . . 114
6.4.3 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.4.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
7 Conclusion and Future Work 121
7.1 Summary of Research . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Bibliography 124
List of Tables
2.1 Mean depth variance of different categories on NYUv2 dataset. “All”
denotes the mean variance of all categories. For every image, pixel-wise
variance of depth for each category is calculated. Averaged variance is
then computed over all images. For “All”, all pixels in an image are con-
sidered to calculate the depth variance. Mean variance over all images
is further computed. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Network architecture. DeepLab is our baseline with a modified ver-
sion of VGG-16 as the encoder. The convolution layer parameters are
denoted as “C[kernel size]-[number of channels]-[dilation]”. “DC” and
“Davgpool” represent depth-aware convolution and depth-aware aver-
age pooling respectively. . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 Comparison with baseline CNNs on NYUv2 test set. Networks are
trained from scratch. . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Comparison with the state-of-the-arts on NYUv2 test set. Networks are
trained from pre-trained models. . . . . . . . . . . . . . . . . . . . . . 20
2.5 Comparison with baseline CNNs on SUN-RGBD test set. Networks are
trained from scratch. . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.6 Comparison with the state-of-the-arts on SUN-RGBD test set. Networks
are trained from pre-trained models. . . . . . . . . . . . . . . . . . . . 22
2.7 Comparison with baseline CNNs on SID Area 5. Networks are trained
from scratch. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.8 Results of using depth-aware operations in different layers. Experiments
are conducted on NYUv2 test set. Networks are trained from scratch. . 24
2.9 Results of using depth-aware operations in ResNet-50. “VGG-D-CNN”
denotes the same network and result as in Table 2.4. Networks are
trained from pre-trained models. . . . . . . . . . . . . . . . . . . . . . 25
2.10 Results of using different α and F_D. Experiments are conducted on
NYUv2 test set. Networks are trained from scratch. . . . . . . 26
2.11 Model complexity and runtime comparison. Runtime is tested on Nvidia
1080Ti, with input image size 425×560×3. . . . . . 27
3.1 Results on instance segmentation in S3DIS scenes. The metric is AP(%)
with IoU threshold 0.5. To the best of our knowledge, there are no
existing instance segmentation methods on point clouds for arbitrary
object categories. . . . . . . 43
3.2 Comparison results on instance segmentation with different IoU thresh-
olds in S3DIS scenes. Metric is mean AP(%) over 13 categories. . . . . 43
3.3 Comparison results on 3D detection in S3DIS scenes. SGPN uses Point-
Net as baseline. The metric is AP with IoU threshold 0.5. . . . . . . . . 43
3.4 Results on semantic segmentation in S3DIS scenes. SGPN uses Point-
Net as baseline. Metric is mean IoU(%) over 13 classes (including clutter). 44
3.5 Results on instance segmentation in NYUV2. The metric is AP with
IoU 0.5. . . . . . . 45
3.6 Comparison results on 3D detection (AP with IoU 0.5) in NYUV2.
Please note we use point groups as inference while [105, 23] use large
bounding box with invisible regions as ground truth. Our prediction is
the tight bounding box on points which makes the IoU much smaller
than [105, 23]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.7 Semantic segmentation results on ShapeNet part dataset. Metric is mean
IoU(%) on points. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.1 Quantitative shape completion results on ShapeNet with simulated 3D
scanner noise. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.2 Classification Results on ModelNet Dataset. . . . . . . . . . . . . . . . 72
5.1 Point cloud reconstruction results on ShapeNet core dataset. Metrics are
mean Chamfer distance (0.001, CD) on points, Earth Mover’s distance
(100, EMD) on points and Intersection over Union (%, IoU) on solid
voxelized grids. For both CD and EMD, the lower the better. For IoU,
the higher the better. . . . . . . 88
5.2 Quantitative comparison on ShapeNet rendered images. Metrics are CD
(0.001), EMD (100) and IoU (%). . . . . . . 91
5.3 Quantitative comparison on ShapeNet rendered images. ’-x’ denotes
without x loss. Metrics are CD (1000), EMD (0.01) and IoU (%). . . . 96
6.1 Average inference time (seconds) for a single mesh. . . . . . . . . . . . 112
6.2 Quantitative results on ShapeNet Core for various methods. Metrics are
CD (0.001, the smaller the better), EMD (100, the smaller the better)
and IoU (%, the larger the better). CD and EMD are computed on 2048
points. . . . . . . 115
6.3 Camera pose estimation comparison. The unit of $d_{2D}$ is pixel. . . . . . 115
6.4 Quantitative results on the category “chair”. CD (0.001), EMD (100)
and IoU (%). . . . . . . 117
List of Figures
2.1 Illustration of Depth-aware CNN. A and C are labeled as table and B is
labeled as chair. They all have similar visual features in the RGB image,
while they are separable in depth. Depth-aware CNN incorporate the
geometric relations of pixels in both convolution and pooling. When A
is the center of the receptive field, C then has more contribution to the
output unit than B. Figures in the rightmost column shows the RGB-D
semantic segmentation result of Depth-aware CNN. . . . . . . . . . . . 9
2.2 Illustration of information propagation in Depth-aware CNN. Without
loss of generality, we only show one filter window with kernel size
3×3. In the depth similarity shown in the figure, darker color indicates higher
similarity while lighter color represents that two pixels are less similar
in depth. In (a), the output activation of depth-aware convolution is the
multiplication of depth similarity window and the convolved window on
input feature map. Similarly in (b), the output of depth-aware average
pooling is the average value of the input window weighted by the depth
similarity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Illustration of effective receptive field of Depth-aware CNN. (a) is the
input RGB images. (b), (c) and (d) are depth images. For (b), (c) and
(d), we show the sampling locations (red dots) in three levels of 3×3
depth-aware convolutions for the activation unit (green dot). . . . . . . 17
2.4 Segmentation results on NYUv2 test dataset. “GT” denotes ground
truth. The white regions in “GT” are the ignoring category. Networks
are trained from pre-trained models. . . . . . . . . . . . . . . . . . . . 21
2.5 Segmentation results on SUN-RGBD test dataset. “GT” denotes ground
truth. The white regions in “GT” are the ignoring category. Networks
are trained from pre-trained models. . . . . . . . . . . . . . . . . . . . 23
2.6 Performance Analysis. (a) Per-class IoU improvement of D-CNN over
baseline on NYUv2 test dataset. (b) Evolution of training loss on NYUv2
train dataset. Networks are trained from scratch. . . . . . . . . . . . . . 26
3.1 Instance segmentation for point clouds using SGPN. Different colors
represent different instances. (a) Instance segmentation on complete
real scenes. (b) Single object part instance segmentation. (c) Instance
segmentation on point clouds obtained from partial scans. . . . . . . . . 29
3.2 Pipeline of our system for point cloud instance segmentation. . . . . . . 33
3.3 (a) Similarity (euclidean distance in feature space) between a given point
(indicated by red arrow) and the rest of points. A darker color represents
lower distance in feature space thus higher similarity. (b) Confidence
map. A darker color represents higher confidence. . . . . . . . . . . . . 34
3.4 Comparison results on S3DIS. (a) Ground Truth for instance segmenta-
tion. Different colors represents different instances. (b) SGPN instance
segmentation results. (c) Seg-Cluster instance segmentation results. (d)
Ground Truth for semantic segmentation. (e) Semantic Segmentation
and 3D detection results of SGPN. The color of the detected bounding
box for each object category is the same as the semantic labels. . . . . . 45
3.5 SGPN instance segmentation results on S3DIS. The first row is the pre-
diction results. The second row is ground truths. Different colors rep-
resent different instances. The third row is the predicted semantic seg-
mentation results. The fourth row is the ground truths for semantic seg-
mentation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.6 Incorporating CNN features in SGPN. . . . . . . . . . . . . . . . . . . 47
3.7 SGPN instance segmentation results on NYUV2. (a) Input point clouds.
(b) Ground truths for instance segmentation. (c) Instance segmentation
results with SGPN. (d) Instance segmentation results with SGPN-CNN. 47
3.8 Qualitative results on ShapeNet Part Dataset. (a) Generated ground truth
for instance segmentation. (b) SGPN instance segmentation results. (c)
Semantic segmentation results of PointNet++. (d) Semantic segmenta-
tion results of SGPN. . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.1 Our method completes a corrupted 3D scan using a convolutional Encoder-
Decoder generative adversarial network in low resolution. The outputs
are then sliced into a sequence of 2D images and a recurrent convolu-
tional network is further introduced to produce high-resolution comple-
tion prediction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2 Network architecture of our 3D-ED-GAN. . . . . . . . . . . . . . . . . 58
4.3 Framework for LRCN. The 3D input volumes are aligned by PCA and
sliced along the first principal component into 2D images. LRCN pro-
cesses c (c = 5) consecutive images with a 3D CNN, whose outputs are
fed into LSTM. The outputs of LSTM further go through a 2D CNN and
produce a sequence of high-resolution 2D images. The concatenation
of these 2D images is the high-resolution 3D completion result. . . . 62
4.4 3D completion results on real-world scans. Inputs are the voxelized
scans. 3D-ED-GAN represents the low-resolution completion result
without going through LRCN. Hybrid represents the high-resolution
completion result of the combination of 3D-ED-GAN and LRCN. . . . 66
4.5 3D inpainting results with 50% injected noise on ShapeNet test dataset.
For this noise type, detailed information is missing while the global
structure is preserved. . . . . . . 68
4.6 We vary the amount of random noise injected to test data and quantita-
tively compare the reconstruction error. . . . . . . . . . . . . . . . . . 69
4.7 Shape completion examples on ShapeNet testing points with simulated
3D scanner noise. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.8 Shape interpolation results. . . . . . . . . . . . . . . . . . . . . . . . . 73
5.1 3DN deforms a given source mesh to a new mesh based on a reference
target. The target can be a 2D image or a 3D point cloud. . . . . . . . . 76
5.2 3DN extracts global features from both the source and target. ‘MLP’
denotes the ‘1×1’ conv as in PointNet [84]. These features are then
input to an offset decoder which predicts per-vertex offsets to deform
the source. We utilize loss functions to preserve geometric details in the
source ($L_{Lap}$, $L_{LPI}$, $L_{Sym}$) and to ensure the deformation output is similar
to the target ($L_{CD}$, $L_{EMD}$). . . . . . . 79
5.3 Differentiable mesh sampling operator (best viewed in color). Given a
face $e = (v_1, v_2, v_3)$, $p$ is sampled on $e$ in the network forward pass
using barycentric coordinates $w_1, w_2, w_3$. Sampled points are used dur-
ing loss computation. When performing back propagation, the gradient of
$p$ is passed back to $(v_1, v_2, v_3)$ with the stored weights $w_1, w_2, w_3$. This
process is differentiable. . . . . . . 82
5.4 Self intersection. The red arrow is the deformation handle. (a) Original
Mesh. (b) Deformation with self-intersection. (c) Plausible deformation. 85
5.5 Given a source (a) and a target (b) model from the ShapeNet dataset, we
show the deformed meshes obtained by our method (g). We also show
Poisson surface reconstruction (d) from a set of points sampled on the
target (c). We also show comparisons to previous methods of Jack et al.
(e) and AtlasNet (f). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.6 Offset decoder architecture. Each 1×1 Conv is followed by a ReLU
layer except for the last one. . . . . . . 89
5.7 PointNet autoencoder architecture. Each 1×1 Conv is followed by a
ReLU layer except for the last one. “Embedding” is the feature vector
used to query the template source mesh. . . . . . . 90
5.8 Offset decoder architecture for the mid-layer fusion experiment. Each
1×1 Conv is followed by a ReLU layer except for the last one. . . . . . 90
5.9 Given a target image and a source, we show deformation results of FFD,
AtlasNet, Pixel2Mesh (P2M), and 3DN. We also show the ground truth
target model (GT). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.10 For a given target image and source model, we show ground truth model
and results of Pixel2Mesh (P2M), AtlasNet, and our method (3DN) ren-
dered in wire-frame mode to better judge the quality of the meshes.
Please zoom into the PDF for details. . . . . . . . . . . . . . . . . . . . 94
5.11 Qualitative results on online product images. The first row shows the
images scraped online. The second and third rows are results of AtlasNet
and Pixel2Mesh respectively. The last row shows our results. . . . . . 95
5.12 Deformation with different source-target pairs. ‘S’ and ‘T’ denote source
meshes and target meshes respectively. . . . . . . . . . . . . . . . . . . 96
5.13 Shape interpolation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.14 Shape inpainting with real point cloud scan as input. Src means source
mesh and ’out’ is the corresponding deformed mesh. . . . . . . . . . . 97
6.1 Single-view reconstruction results using OccNet [74], a state-of-the-art
method, and DISN on synthetic and real images. . . . . . . . . . . . . . 100
6.2 Illustration of SDF. (a) Rendered 3D surface with $s = 0$. (b) Cross-
section of the SDF. A point is outside the surface if $s > 0$, inside if
$s < 0$, and on the surface if $s = 0$. . . . . . . 103
6.3 Given an image and a pointp, we estimate the camera pose and project
p onto the image plane. DISN uses the local features at the projected
location, the global features, and the point features to predict the SDF
ofp. ‘MLPs’ denotes multi-layer perceptrons. . . . . . . . . . . . . . . 104
6.4 Local feature extraction. Given a 3D point p, we use the estimated
camera parameters to projectp onto the image plane. Then we identify
the projected location on each feature map layer of the encoder. We
concatenate features at each layer to get the local features of pointp. . . 104
6.5 Camera Pose Estimation Network. ‘PC’ denotes point cloud. ‘GT Cam’
and ‘Pred Cam’ denote the ground truth and predicted cameras. . . . . . 105
6.6 Shape reconstruction results (a) without and (b) with local feature extrac-
tion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.7 Single-view reconstruction results of various methods. ‘GT’ denotes
ground truth shapes. Best viewed on screen with zooming in. . . . . . . 109
6.8 Camera Pose Prediction Network. . . . . . . . . . . . . . . . . . . . . 112
6.9 DISN Two-stream Network. . . . . . . . . . . . . . . . . . . . . . . . 112
6.10 DISN One-stream Network. . . . . . . . . . . . . . . . . . . . . . . . 113
6.11 Qualitative results of our method using different settings. ‘GT’ denotes
ground truth shapes, and ‘cam’ denotes models with estimated camera
parameters. . . . . . . 116
6.12 Shape interpolation result. . . . . . . . . . . . . . . . . . . . . . . . . 118
6.13 Test our model on online product images. . . . . . . . . . . . . . . . . 118
6.14 Multi-view reconstruction results. (a) Single-view input. (b) Recon-
struction result from (a). (c)&(d) Two other views. (e) Multi-view
reconstruction result from (a), (c) and (d). . . . . . . . . . . . . . . . . 119
Abstract
Recently, the availability of 3D sensors and large-scale 3D data has opened doors to
3D deep learning for both perception and modeling. This thesis develops a series of 3D
deep learning techniques. These include: 1) semantic scene understanding on various
3D representations, i.e., 2.5D images and 3D point clouds, and 2) 3D object modeling,
including shape inpainting, deformation and single-view reconstruction.
Widely available sensors such as the Kinect produce depth images with rich 3D
geometry as well as RGB imagery. RGB-D segmentation is our first step to explore
semantic scene perception in the domain of 3D deep learning. State-of-the-art meth-
ods process depth as additional images using Convolutional Neural Networks (CNNs).
However, CNNs on 2D images are limited in their ability to model 3D geometric vari-
ance due to the fixed grid computation structure. To address this issue, we present
Depth-aware CNN, which modifies the CNN kernel with geometry-aware operations for RGB-D
segmentation. These operations help our network achieve performance similar to
state-of-the-art methods while reducing the number of parameters by half.
In addition to depth data, a point cloud is an intuitive, memory-efficient 3D represen-
tation well-suited for representing detailed, large scenes. PointNet/PointNet++ recently
introduced deep neural networks on 3D point clouds, producing successful results for
tasks such as object classification and part and semantic scene segmentation. Based on
these networks, we introduce the Similarity Group Proposal Network (SGPN), a simple
and intuitive deep learning framework for 3D point cloud instance segmentation, a fun-
damental yet challenging perception task. To the best of our knowledge, ours is the first
work using 3D deep learning for point cloud instance segmentation.
Immersive applications in virtual reality (VR) and augmented reality (AR) create
a demand for rapid creation and easy access to large sets of 3D models. The sec-
ond research topic in this thesis is 3D modeling, specifically, 3D shape inpainting, 3D
shape deformation and Signed Distance Function generation for reconstruction. Due
to GPU memory limitations, traditional shape inpainting methods can only produce
low-resolution outputs. To inpaint 3D models with semantic plausibility and contex-
tual details, we introduce a hybrid framework that combines a 3D Encoder-Decoder
Generative Adversarial Network and a Long-term Recurrent Convolutional Network. By
handling the 3D model as a sequence of 2D slices, our method is able to generate a more
complete and higher resolution volume with limited GPU memory.
In computer graphics, 3D objects are usually represented as polygonal meshes. Due
to their irregular format, current deep learning methods cannot be directly applied to
meshes. In this thesis, we introduce a novel differentiable operation, the mesh sampling
operator, which can be seamlessly integrated into a network to handle meshes with
different topologies. We also present 3DN, an end-to-end network that deforms a source
3D mesh to resemble a target (a 2D image or another 3D model). Our method generates
shapes with higher quality compared to state-of-the-art learning-based methods for
3D shape generation.
To further extend our work to generating topology-varying 3D shapes, we introduce the
Deep Implicit Surface Network (DISN), which generates Signed Distance Functions given a
3D point and a target image. By taking advantage of both global and local image features,
DISN can generate high-quality 3D surfaces. To the best of our knowledge, this is the
first deep-learning-based method capable of producing fine-grained details in 3D shapes.
Chapter 1
Introduction
1.1 Motivation
1.1.1 From 2D to 3D
Two basic tasks in computer vision are perception and generation. Given data captured
by sensors, machine perception aims to interpret the data in a manner similar to how humans
explain the world; its tasks include classification, semantic segmentation, object
detection, instance segmentation, etc. Perception methods are widely used in robotic
applications such as autonomous driving. While perception focuses on understanding a
given input, the goal of generation is to produce new content, which can be a new image
or a 3D model. Machine generation techniques are crucial for computer graphics applica-
tions like computer games and AR/VR. These techniques include image generation, 3D
reconstruction, image/shape inpainting, etc.
With the help of large-scale image datasets [22] and high-performance GPU comput-
ing, deep learning methods, especially Convolutional Neural Networks (CNNs), achieve
promising results on both tasks. For example, over the last few years, 2D perception net-
works for tasks such as semantic segmentation [97, 10], object detection [32, 92],
and instance segmentation [42, 75] have become widely used in robotics. Recently, generative
adversarial models and their variants [34, 89, 80] have been highly successful at generating
high-quality images.
Despite this encouraging progress in the 2D domain, the 3D counterparts lag
behind. The development of 3D sensors (Kinect, LiDAR, radar, etc.) has made 3D
data (depth images, point clouds, triangle meshes, etc.) widely available. Moreover,
the amount of accessible 3D models has increased greatly over the past decade. This
growth sheds light on the direction of visual computing with big 3D data, i.e., 3D deep
learning.
1.1.2 Advances and Challenges in 3D Deep Learning
Traditional 3D methods usually require manually designed shape features or heavy
parameter-tuning efforts; for example, modeling a predefined shape from a point
cloud [88] involves performing part segmentation with hand-crafted features. Deep learning,
in contrast, can extract more effective features automatically and efficiently from raw input data.
There are several representations of 3D data: depth images, volumetric grids, point
clouds and polygonal meshes. Similar to 2D images, depth images are usually treated as
an additional channel to RGB images. Even though adding new depth branches [37, 87]
improves the performance of RGBD perception, the geometric relations between pixels
have not been fully utilized by CNNs.
Volumetric grids are the simplest way to represent geometry in 3D space and
can be directly processed by volumetric CNNs. Over the past few years, methods on
volumetric data have been introduced for 3D recognition [73], detection [105] and genera-
tion [127]. Since the amount of data in 3D volumetric grids is greatly increased com-
pared to 2D images, GPU memory limitations prevent these methods from processing high-
resolution shapes.
Unlike structured grid representations, other 3D formats, such as point clouds and
meshes, are nontrivial to process with CNNs. A point cloud stores 3D information in a
sparse and efficient way. Compared to volumetric grids, point clouds cost less space
and processing time; however, CNNs are not well suited to this irregular data for-
mat. Recently, deep learning frameworks such as PointNet/PointNet++ [84, 86] have opened
up more efficient and flexible ways to handle 3D point clouds. PointNet/PointNet++
have shown good performance on point cloud recognition and segmentation; however,
instance segmentation on point clouds is still a challenging yet important problem since
the number of output instances cannot be fixed.
Polygonal meshes, especially triangular meshes, are widely used in computer graph-
ics due to their efficiency in representing surfaces. A polygonal mesh is a collection
of vertices and faces: the vertices are an irregular point set, while the faces define the
connectivity among the vertices. This connectivity is discrete and hard to use in
CNNs. The availability of large-scale 3D mesh datasets [9] motivates us to delve into 3D
deep learning for meshes.
1.2 Overview of Contributions
This thesis focuses on solving the challenges mentioned in Section 1.1.2 and introduces
a series of 3D deep learning techniques on perception and modeling.
1.2.1 3D Perception
2.5D Semantic Segmentation
Depth images are raw data captured by depth sensors and contain rich 3D geometric
information. CNNs can be directly applied to depth images due to their grid structure. We start
from handling depth images for 3D perception. In the literature, processing depth images
using 2D methods is called 2.5D perception. Our task is 2.5D segmentation (RGB-D
segmentation), which is to assign a semantic label to each pixel given an RGB image
along with its depth image.
CNNs are limited by their lack of capability to handle 3D geometry due to the fixed
grid kernel structure. State-of-the-art RGB-D segmentation methods either use depth as
additional images or process spatial information on 3D volumes or point clouds. These
methods suffer from high computation and memory cost. To address these issues, we
present Depth-aware CNN by introducing two intuitive, flexible and effective opera-
tions: depth-aware convolution and depth-aware pooling. By leveraging depth similar-
ity between pixels, geometric relations between two pixels are seamlessly incorporated
into the CNN. Without introducing any additional parameters, both operators can be easily
integrated into existing CNNs. Depth-aware CNNs produce performance superior to
baseline methods on challenging RGB-D semantic segmentation benchmarks while
maintaining high computation speed.
Point Cloud Instance Segmentation
Instance segmentation is the problem of distinguishing and detecting each individual object of
interest in a scene. Unlike classification and semantic segmentation, the number of
distinct objects varies across scenes; consequently, the output cannot be
represented as a fixed-size probability vector. This problem has been
addressed by using proposal networks in 2D object detection approaches [33, 32, 92];
however, instance segmentation on 3D point clouds using deep learning had not yet been studied.
We observe that the sparsity of point clouds provides an intuitive formulation of object
proposals. Based on this observation, we introduce the Similarity Group Proposal Net-
work (SGPN), a simple and effective deep learning framework for 3D object instance
segmentation on point clouds.
SGPN uses a single network to predict point grouping proposals and a corresponding
semantic class for each proposal, from which we can directly extract instance segmen-
tation results. Important to the effectiveness of SGPN is its novel representation of 3D
instance segmentation results in the form of a similarity matrix that indicates the similar-
ity between each pair of points in embedded feature space, thus producing an accurate
grouping proposal for each point. To the best of our knowledge, SGPN is the first frame-
work to learn 3D instance-aware semantic segmentation on point clouds. Experimental
results on various 3D scenes show the effectiveness of our method on 3D instance seg-
mentation, and we also evaluate the capability of SGPN to improve 3D object detection
and semantic segmentation results.
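The core of this similarity-matrix formulation can be sketched in a few lines of PyTorch. The backbone that produces the per-point features, the feature size, and the grouping threshold below are illustrative placeholders rather than the exact SGPN design:

```python
import torch

def similarity_matrix(features: torch.Tensor) -> torch.Tensor:
    """Pairwise squared Euclidean distances between per-point features.

    features: (N, C) tensor of embedded point features.
    Returns an (N, N) matrix; small values mean two points are likely
    to belong to the same instance.
    """
    # ||f_i - f_j||^2 = ||f_i||^2 + ||f_j||^2 - 2 f_i . f_j
    sq = (features ** 2).sum(dim=1, keepdim=True)          # (N, 1)
    dist = sq + sq.t() - 2.0 * features @ features.t()     # (N, N)
    return dist.clamp(min=0.0)

# Toy usage: group points whose feature distance to a seed point is small.
feats = torch.randn(1024, 64)                              # hypothetical per-point embeddings
dist = similarity_matrix(feats)
group_of_point_0 = (dist[0] < 1.0).nonzero().squeeze(1)    # threshold is illustrative only
```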
1.2.2 3D Modeling
3D Shape Inpainting
Data captured by 3D sensors usually contains artifacts and missing parts due to occlu-
sion and noise. Given a 3D model with artifacts, the goal of shape inpainting is to
complete the missing parts and reduce noise. Recent advances in CNNs have shown
promising results in 3D shape completion. But due to GPU memory limitations, these
methods can only produce low-resolution outputs.
To inpaint 3D models with semantic plausibility and contextual details, we introduce
a hybrid framework that combines a 3D Encoder-Decoder Generative Adversarial Net-
work (3D-ED-GAN) and a Long-term Recurrent Convolutional Network (LRCN). The
3D-ED-GAN is a 3D convolutional neural network trained with a generative adversarial
paradigm to fill missing 3D data in low-resolution. LRCN adopts a recurrent neural
network architecture to minimize GPU memory usage and incorporates an Encoder-
Decoder pair into a Long Short-term Memory Network. By handling the 3D model as
a sequence of 2D slices, LRCN transforms a coarse 3D shape into a more complete
and higher resolution volume. While 3D-ED-GAN captures global contextual struc-
ture of the 3D shape, LRCN localizes the fine-grained details. Experimental results on
both real-world and synthetic data show that reconstructions from corrupted models result
in complete and high-resolution 3D objects.
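Below is a tiny sketch of the "volume as a sequence of 2D slices" idea, with placeholder sizes and a plain LSTM standing in for the full LRCN; the PCA alignment and the convolutional encoder/decoder of the actual method are omitted:

```python
import torch

# Minimal sketch: cut a voxel grid along one axis and process the slices
# sequentially with a recurrent network, so only one slice's worth of
# activations needs to be held at a time.
volume = torch.rand(1, 64, 64, 64)                      # (batch, D, H, W) occupancy grid
slices = volume.permute(1, 0, 2, 3)                     # sequence of D slices, each (1, H, W)
encoder = torch.nn.Linear(64 * 64, 256)                 # per-slice feature extractor (placeholder)
lstm = torch.nn.LSTM(input_size=256, hidden_size=256)   # aggregates context across slices
feats = encoder(slices.reshape(64, 1, -1))              # (seq_len=D, batch=1, 256)
out, _ = lstm(feats)                                    # per-slice features with long-range context
```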
3D Deformation
As mentioned in Section 1.1.2, handling polygonal meshes with deep neural networks
is a challenging problem with many potential applications. For example, in AR/VR,
there is a demand for rapid creation of and easy access to large sets of 3D models. An
effective way to address this demand is to edit or deform existing 3D models based on a
reference, e.g., a 2D image, which is very easy to acquire.
Given such a source 3D model and a target, which can be a 2D image, 3D model,
or a point cloud acquired as a depth scan, we introduce 3DN, an end-to-end network
that deforms the source model to resemble the target. Our method infers per-vertex
offset displacements while keeping the mesh connectivity of the source model fixed. We
present a training strategy which uses a novel differentiable operation, the mesh sampling
operator, to generalize our method across source and target models with varying mesh
densities. The mesh sampling operator can be seamlessly integrated into the network to
handle meshes with different topologies. Qualitative and quantitative results show that
our method generates higher quality results compared to state-of-the-art learning-
based methods for 3D shape generation.
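The following is a short PyTorch sketch of the barycentric idea behind such a mesh sampling operator; it samples faces uniformly (ignoring face area) and is only an illustration of how gradients on sampled points can flow back to the vertices, not the exact 3DN operator:

```python
import torch

def sample_points_on_faces(vertices, faces, n_samples):
    """Barycentric point sampling on a triangle mesh.

    vertices: (V, 3) float tensor (requires_grad=True so offsets can be learned)
    faces:    (F, 3) long tensor of vertex indices
    Gradients on the sampled points flow back to the vertices through the
    stored barycentric weights, so the operation is differentiable.
    """
    num_faces = faces.shape[0]
    face_idx = torch.randint(0, num_faces, (n_samples,))    # uniform over faces (ignores area)
    u = torch.rand(n_samples, 2)                             # random barycentric weights
    su = torch.sqrt(u[:, 0:1])
    w = torch.cat([1.0 - su, su * (1.0 - u[:, 1:2]), su * u[:, 1:2]], dim=1)  # rows sum to 1
    tri = vertices[faces[face_idx]]                          # (N, 3 vertices, 3 coords)
    return (w.unsqueeze(2) * tri).sum(dim=1)                 # (N, 3) sampled points

verts = torch.rand(100, 3, requires_grad=True)               # toy mesh
tris = torch.randint(0, 100, (200, 3))
points = sample_points_on_faces(verts, tris, 2048)
points.sum().backward()   # gradients reach verts, so a point-set loss can drive vertex offsets
```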
3D Implicit Surface Generation
Reconstructing a 3D shape by deformation has the limitation that the deformed mesh and
the source mesh must have the same topology and connectivity. To address this limita-
tion, we further study an implicit 3D surface representation, Signed Distance Functions
(SDF), and introduce DISN, a Deep Implicit Surface Network that generates a high-
quality 3D shape given an input image by predicting the underlying SDF. In addition to
utilizing global image features, DISN also predicts the local image patch each 3D point
sample projects onto and extracts local features from such patches. Combining global
and local features significantly improves the accuracy of the predicted signed distance
field. To the best of our knowledge, DISN is the first method that consistently captures
details such as holes and thin structures present in 3D shapes from single images. We
demonstrate that DISN achieves state-of-the-art single-view reconstruction performance on
a variety of shape categories reconstructed from both synthetic and real images.
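To make the SDF representation concrete, here is a tiny analytic example (a sphere) that follows the sign convention used throughout this work; it only illustrates the function DISN learns to predict, not the network itself:

```python
import torch

def sphere_sdf(points: torch.Tensor, radius: float = 1.0) -> torch.Tensor:
    """Signed distance to a sphere centered at the origin.
    Sign convention as in this thesis: s > 0 outside, s < 0 inside, s = 0 on the surface."""
    return points.norm(dim=-1) - radius

queries = torch.tensor([[0.0, 0.0, 0.0],    # inside      -> s = -1
                        [2.0, 0.0, 0.0],    # outside     -> s = +1
                        [1.0, 0.0, 0.0]])   # on surface  -> s =  0
print(sphere_sdf(queries))
# DISN regresses s for arbitrary query points conditioned on the input image;
# the surface is then recovered as the zero level set (e.g. via Marching Cubes).
```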
1.2.3 Organization of the Dissertation
We discuss 3D perception and modeling with deep learning separately in the rest of the
thesis.
We first introduce our efforts on 3D perception. Chapter 2 describes the depth-
aware CNN and its application to RGB-D segmentation. In Chapter 3, we develop a
novel and effective method for point cloud instance segmentation.
Then we discuss our work on 3D modeling. Chapter 4 presents our shape inpainting
framework. In Chapter 5, we describe our 3D deformation network for handling polygonal
meshes. Chapter 6 introduces our work on 3D implicit surface generation for single-view
reconstruction.
Finally, we summarize the thesis and discuss future research directions in Chapter 7.
Chapter 2
2.5D Perception: RGB-D Segmentation
2.1 Introduction
Recent advances [97, 138, 10] in CNNs have achieved significant success in scene under-
standing. With the help of range sensors (such as Kinect, LiDAR, etc.), depth images
are available alongside RGB images. Taking advantage of these two complemen-
tary modalities with CNNs can improve the performance of scene understanding.
However, CNNs are limited in modeling geometric variance due to their fixed grid computa-
tion structure. Incorporating the geometric information from depth images into CNNs is
important yet challenging.
Extensive studies [87, 13, 58, 98, 93, 16, 123] have been carried out on this task.
FCN [97] and its successors treat depth as another input image and construct two CNNs
to process RGB and depth separately. This doubles the number of network parameters
and the computation cost. In addition, the two-stream network architecture still suffers from
the fixed geometric structure of CNNs: even if the geometric relation of two pixels is
given, this relation cannot be used in the information propagation of a CNN. An alternative
is to leverage 3D networks [87, 106, 122] to handle geometry. Nevertheless, both volu-
metric CNNs [106] and 3D point cloud graph networks [87] are computationally more
expensive than 2D CNNs. Despite the encouraging results of this progress, we need
to seek a more flexible and efficient way to exploit 3D geometric information in 2D CNNs.

This chapter is based on the paper Depth-aware CNN for RGB-D Segmentation by Weiyue Wang and
Ulrich Neumann, ECCV, 2018. Code is available at github.com/laughtervv/DepthAwareCNN.

Figure 2.1: Illustration of Depth-aware CNN. A and C are labeled as table and B is
labeled as chair. They all have similar visual features in the RGB image, while they are
separable in depth. Depth-aware CNN incorporates the geometric relations of pixels in
both convolution and pooling: when A is the center of the receptive field, C then
contributes more to the output unit than B. The rightmost column shows the RGB-D
semantic segmentation result of Depth-aware CNN.
To address the aforementioned problems, in this chapter, we present an end-to-end
network, Depth-aware CNN (D-CNN), for RGB-D segmentation. Two new operators
are introduced: depth-aware convolution and depth-aware average pooling. Depth-
aware convolution augments the standard convolution with a depth similarity term. We
force pixels with depths similar to that of the kernel center to contribute more
to the output than others. This simple depth similarity term efficiently incorporates
geometry into a convolution kernel and helps build a depth-aware receptive field, where
the convolution is not constrained to the fixed grid geometric structure.
The second introduced operator is depth-aware average pooling. Similarly, when a
filter is applied on a local region of the feature map, the pairwise depth relations
between neighboring pixels are considered when computing the mean of the local region.
Visual features are thus able to propagate along the geometric structure given in the depth
images. Such geometry-aware operations enable the localization of object boundaries
using depth images.
Both operators are based on the intuition that pixels with the same semantic label
and similar depths should have more impact on each other. We observe that two pixels
with the same semantic label tend to have similar depths. As illustrated in Figure 2.1, pixel A
and pixel C should be more correlated with each other than pixel A and pixel B. This
correlation difference is obvious in the depth image, while it is not captured in the RGB image.
By encoding the depth correlation in the CNN, pixel C contributes more to the output
unit than pixel B in the process of information propagation.
The main advantages of depth-aware CNN are summarized as follows:

- By exploiting the way a CNN kernel handles spatial information, geometry in the depth image can be integrated into the CNN seamlessly.

- Depth-aware CNN does not add any parameters or computation complexity to the conventional CNN.

- Both depth-aware convolution and depth-aware average pooling can replace their standard counterparts in conventional CNNs with minimal cost.

- Depth-aware CNN is a general framework that bridges 2D CNNs and 3D geometry.

Comparisons with state-of-the-art methods and extensive ablation studies on RGB-D
semantic segmentation illustrate the flexibility, efficiency and effectiveness of our
approach.
2.2 Related Works
2.2.1 RGB-D Semantic Segmentation
With the help of CNNs, semantic segmentation on 2D images has achieved promis-
ing results [97, 138, 10, 48]. These advances in 2D CNNs and the availability of
depth sensors enable progress in RGB-D segmentation. Compared to the RGB set-
ting, RGB-D segmentation is able to integrate geometry into scene understanding. In
[26, 72, 41, 117], depth is simply treated as additional channels and directly fed into
the CNN. Several works [97, 41, 37, 66, 78] encode depth into an HHA image, which has three
channels: horizontal disparity, height above ground, and normal angle. The RGB image and
the HHA image are fed into two separate networks, and the two predictions are summed
in the last layer. The two-stream network doubles the number of parameters and the forward
time compared to a conventional 2D network. Moreover, CNNs per se are limited in
their ability to model geometric transformations due to their fixed grid computation.
Cheng et al. [13] propose a locality-sensitive deconvolution network with gated fusion.
They build a feature affinity matrix to perform weighted average pooling and unpooling.
Lin et al. [67] discretize depth and build different branches for different discrete depth
values. He et al. [45] use spatio-temporal correspondences across frames to aggregate
information over space and time, which requires heavy pre- and post-processing such as
optical flow and superpixel computation.
Alternatively, many works [106, 105] attempt to solve the problem with 3D CNNs.
However, the volumetric representation prevents scaling up due to high memory and
computation cost. Recently, deep learning frameworks [87, 84, 86, 124, 47] on point
clouds have been introduced to address the limitations of 3D volumes. Qi et al. [87] built a 3D
k-nearest neighbor (kNN) graph neural network on a point cloud with features extracted
from a CNN and achieved the state of the art on RGB-D segmentation. Although their
method is more efficient than 3D CNNs, the kNN operation suffers from high computa-
tion complexity and a lack of flexibility. Instead of using 3D representations, we use the
raw depth input and integrate 3D geometry into a 2D CNN in a more efficient and flexible
fashion.
2.2.2 Spatial Transformations in CNN
Standard CNNs are limited in modeling geometric transformations due to the fixed struc-
ture of convolution kernels. Recently, many works have focused on dealing with this
issue. Dilated convolutions [138, 10] increase the receptive field size while keeping the
number of parameters the same; this operator achieves better performance on vision
tasks such as semantic segmentation. Spatial transformer networks [52] warp feature
maps with a learned global spatial transformation. Deformable CNN [21] learns kernel
offsets to augment the spatial sampling locations. These methods have shown that geometric
transformations enable performance boosts on different vision tasks.
With the advances in 3D sensors, depth is available at low cost. The geometric
information that resides in depth is highly correlated with spatial transformations in
CNNs. Our method integrates the geometric relations of pixels into the basic operations of
CNN, i.e., convolution and pooling, where we use a weighted kernel and force every
neuron to have a different contribution to the output. This weighted kernel is defined
by depth and is able to incorporate geometric relationships without introducing any
parameters.
2.3 Depth-aware CNN
In this section, we introduce two depth-aware operations: depth-aware convolution and
depth-aware average pooling. They are both simple and intuitive. Both operations
require two inputs: the input feature map $x \in \mathbb{R}^{c_i \times h \times w}$ and the depth image $D \in \mathbb{R}^{h \times w}$,
where $c_i$ is the number of input feature channels, $h$ is the height and $w$ is the width. The
output feature map is denoted as $y \in \mathbb{R}^{c_o \times h \times w}$, where $c_o$ is the number of output feature
channels. Although $x$ and $y$ are both 3D tensors, the operations are explained in the 2D
spatial domain for notational clarity; they remain the same across different channels.

Figure 2.2: Illustration of information propagation in Depth-aware CNN. Without loss
of generality, we only show one filter window with kernel size 3×3. In the depth similarity
shown in the figure, darker color indicates higher similarity, while lighter color indicates
that two pixels are less similar in depth. In (a), the output activation of depth-aware
convolution is the multiplication of the depth similarity window and the convolved window
on the input feature map. Similarly, in (b), the output of depth-aware average pooling is the
average value of the input window weighted by the depth similarity.
2.3.1 Depth-aware Convolution
A standard 2D convolution operation is the weighted sum over a local grid. For each pixel
location $p_0$ on $y$, the output of standard 2D convolution is
$$y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot x(p_0 + p_n), \qquad (2.1)$$
where $\mathcal{R}$ is the local grid around $p_0$ in $x$ and $w$ is the convolution kernel. $\mathcal{R}$ can be a
regular grid defined by kernel size and dilation [138], and it can also be a non-regular
grid [21].
As is shown in Figure 2.1, pixel A and pixel B have different semantic labels and
different depths while they are not separable in RGB space. On the other hand, pixel
A and pixel C have the same labels and similar depths. To exploit the depth correlation
between pixels, depth-aware convolution simply adds a depth similarity term, resulting
in two sets of weights in the convolution: 1) the learnable convolution kernel $w$; 2) the depth
similarity $F_D$ between two pixels. Consequently, Equ. 2.1 becomes
$$y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot F_D(p_0, p_0 + p_n) \cdot x(p_0 + p_n). \qquad (2.2)$$
And $F_D(p_i, p_j)$ is defined as
$$F_D(p_i, p_j) = \exp(-\alpha\,|D(p_i) - D(p_j)|), \qquad (2.3)$$
where $\alpha$ is a constant. The selection of $F_D$ is based on the intuition that pixels with
similar depths should have more impact on each other. We study the effect of
different $\alpha$ and different $F_D$ in Section 2.4.2.
The gradients for $x$ and $w$ are simply multiplied by $F_D$. Note that the $F_D$ term does
not require gradients during back-propagation; therefore, Equ. 2.2 does not add any
parameters through the depth similarity term.
Figure 2.2(a) illustrates this process. Pixels whose depths are similar to that of the
convolving center have more impact on the output during convolution.
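As a concrete illustration of Equ. 2.2 and 2.3, below is a minimal PyTorch sketch of depth-aware convolution for a 3×3 kernel with stride 1. It uses `unfold` for clarity rather than the CUDA kernel used in our experiments, and the tensor sizes at the end are placeholders:

```python
import torch
import torch.nn.functional as F

def depth_aware_conv2d(x, depth, weight, alpha=8.3, padding=1):
    """Sketch of depth-aware convolution (Equ. 2.2/2.3), 3x3 kernel, stride 1, dilation 1.

    x:      (B, C_in, H, W) input feature map
    depth:  (B, 1, H, W) depth image, already resized to match x
    weight: (C_out, C_in, 3, 3) learnable convolution kernel
    """
    B, C, H, W = x.shape
    k = weight.shape[-1]
    # Depth similarity F_D between the window center and every neighbor.
    d_win = F.unfold(depth, k, padding=padding).view(B, 1, k * k, H * W)
    d_center = d_win[:, :, k * k // 2 : k * k // 2 + 1, :]
    f_d = torch.exp(-alpha * (d_win - d_center).abs()).detach()  # no gradient through F_D
    # Weight each unfolded window of x by F_D, then apply the learnable kernel.
    x_win = F.unfold(x, k, padding=padding).view(B, C, k * k, H * W) * f_d
    out = torch.einsum('ock,bckp->bop', weight.view(-1, C, k * k), x_win)
    return out.view(B, -1, H, W)

# Toy usage with placeholder sizes.
y = depth_aware_conv2d(torch.randn(1, 64, 32, 40), torch.rand(1, 1, 32, 40),
                       torch.randn(128, 64, 3, 3))               # -> (1, 128, 32, 40)
```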
2.3.2 Depth-aware Average Pooling
The conventional average pooling computes the mean of a grid $\mathcal{R}$ over $x$. It is defined
as
$$y(p_0) = \frac{1}{|\mathcal{R}|} \sum_{p_n \in \mathcal{R}} x(p_0 + p_n). \qquad (2.4)$$
It treats every pixel equally and blurs object boundaries. Geometric
information is useful for addressing this issue.
As in depth-aware convolution, we take advantage of the depth similarity $F_D$
to force pixels with more consistent geometry to contribute more to
the corresponding output. For each pixel location $p_0$, the depth-aware average pooling
operation then becomes
$$y(p_0) = \frac{1}{\sum_{p_n \in \mathcal{R}} F_D(p_0, p_0 + p_n)} \sum_{p_n \in \mathcal{R}} F_D(p_0, p_0 + p_n)\, x(p_0 + p_n). \qquad (2.5)$$
During back-propagation, the gradient should be multiplied by
$\frac{F_D}{\sum_{p_n \in \mathcal{R}} F_D(p_0, p_0 + p_n)}$.
As illustrated in Figure 2.2(b), this operation avoids the fixed geometric
structure of standard pooling.
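The same idea can be written for depth-aware average pooling (Equ. 2.5); again this is an illustrative `unfold`-based sketch with stride 1 and "same" padding, not the CUDA implementation used in our experiments:

```python
import torch
import torch.nn.functional as F

def depth_aware_avg_pool2d(x, depth, kernel_size=3, alpha=8.3):
    """Sketch of depth-aware average pooling (Equ. 2.5), stride 1, 'same' padding.

    x:     (B, C, H, W) input feature map
    depth: (B, 1, H, W) depth image
    """
    B, C, H, W = x.shape
    k, pad = kernel_size, kernel_size // 2
    d_win = F.unfold(depth, k, padding=pad).view(B, 1, k * k, H * W)
    d_center = d_win[:, :, k * k // 2 : k * k // 2 + 1, :]
    f_d = torch.exp(-alpha * (d_win - d_center).abs())       # (B, 1, k*k, H*W)
    x_win = F.unfold(x, k, padding=pad).view(B, C, k * k, H * W)
    # Weighted average of each window, normalized by the sum of F_D.
    out = (x_win * f_d).sum(dim=2) / f_d.sum(dim=2).clamp(min=1e-8)
    return out.view(B, C, H, W)

y = depth_aware_avg_pool2d(torch.randn(1, 64, 32, 40), torch.rand(1, 1, 32, 40))
```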
2.3.3 Understanding Depth-aware CNN
A major advantage of CNNs is their ability to use GPUs to perform parallel computing
and accelerate computation. This acceleration mainly stems from unrolling the convo-
lution operation over the grid computation structure. However, this limits the ability
of CNNs to model geometric variations. Researchers in 3D deep learning have focused
on modeling geometry in deep neural networks in the last few years. As the volumetric
representation [106, 105] has high memory and computation cost, point clouds are con-
sidered a more suitable representation. However, deep learning frameworks [86, 87]
on point clouds are based on building a kNN graph. This not only suffers from high computation
complexity, but also breaks the pixel-wise correspondence between RGB and depth,
which makes the framework unable to leverage the efficiency of a CNN's grid com-
putation structure. Instead of operating on 3D data, we exploit the raw depth input.
By augmenting the convolution kernel with a depth similarity term, depth-aware CNN
captures geometry with a transformable receptive field.

         Wall   Floor  Bed   Chair  Table  All
Variance 0.57   0.65   0.12  0.23   0.34   1.20

Table 2.1: Mean depth variance of different categories on the NYUv2 dataset. “All” denotes
the mean variance over all categories. For every image, the pixel-wise variance of depth for
each category is calculated; the averaged variance is then computed over all images. For
“All”, all pixels in an image are considered when calculating the depth variance, and the
mean variance over all images is then computed.
Many works have studied spatially transformable receptive fields in CNNs. Dilated con-
volution [10, 138] has demonstrated that increasing the receptive field boosts the perfor-
mance of networks. In deformable CNN [21], Dai et al. demonstrate that learning the
receptive field adaptively can help CNNs achieve better results. They also show that pix-
els within the same object in a receptive field contribute more to the output unit than
pixels with different labels. We observe that semantic labels and depths are highly cor-
related. Table 2.1 reports statistics of pixel depth variance within the same class
and across different classes on the NYUv2 [98] dataset. Even the pixel depth variances of
large objects such as wall and floor are much smaller than the variance of a whole scene.
This indicates that pixels with the same semantic label tend to have similar depths. This
pattern is integrated into Equ. 2.2 and Equ. 2.5 through $F_D$. Without introducing any param-
eters, depth-aware convolution and depth-aware average pooling are able to enhance the
localization ability of CNN. We evaluate the impact of different depth
similarity functions $F_D$ on performance in Section 2.4.2.
Figure 2.3: Illustration of the effective receptive field of Depth-aware CNN. (a) shows the input
RGB images. (b), (c) and (d) are depth images. For (b), (c) and (d), we show the
sampling locations (red dots) in three levels of 3×3 depth-aware convolutions for the
activation unit (green dot).
To get a better understanding of how depth-aware CNN captures geometry with
depth, Figure 2.3 shows the effective receptive field of a given neuron. In a con-
ventional CNN, the receptive fields and sampling locations are fixed across the feature map.
With the depth-aware term incorporated, they adjust to the geometric variance.
For example, in the second row of Figure 2.3(d), the green point is labeled as chair, and
its effective receptive field consists essentially of chair points. This indicates
that the effective receptive field mostly shares the same semantic label as the center, a
pattern that improves CNN performance on semantic segmentation.
2.3.4 Depth-aware CNN for RGB-D Semantic Segmentation
In this chapter, we focus on RGB-D semantic segmentation with depth-aware CNN. Given
an RGB image along with its depth image, our goal is to produce a semantic mask indicating the
label of each pixel. Both depth-aware convolution and depth-aware average pooling easily replace
their counterparts in a standard CNN.
DeepLab [10] is a state-of-the-art method for semantic segmentation. We adopt
DeepLab as our baseline for semantic segmentation, and a modified VGG-16 network is
used as the encoder. We replace layers in this network with depth-aware operations. The
network configurations of the baseline and depth-aware CNN are outlined in Table 2.2.
Suppose conv7 has C channels. Following [87], global pooling is used to compute a C-dim
vector from conv7. This vector is then appended to all spatial positions, resulting
in a 2C-channel feature map. This feature map is followed by a 1×1 conv layer that
produces the segmentation probability map.

Table 2.2: Network architecture. DeepLab is our baseline with a modified version of VGG-16 as
the encoder. Convolution layer parameters are denoted as "C[kernel size]-[number of channels]-[dilation]".
"DC" and "Davgpool" represent depth-aware convolution and depth-aware average pooling respectively.

Baseline DeepLab:
conv1_x: C3-64-1, C3-64-1, maxpool
conv2_x: C3-128-1, C3-128-1, maxpool
conv3_x: C3-256-1, C3-256-1, C3-256-1, maxpool
conv4_x: C3-512-1, C3-512-1, C3-512-1, maxpool
conv5_x: C3-512-2, C3-512-2, C3-512-2, avgpool
conv6 & conv7: C3-1024-12, C1-1024-0, globalpool+concat

D-CNN:
conv1_x: DC3-64-1, C3-64-1, maxpool
conv2_x: DC3-128-1, C3-128-1, maxpool
conv3_x: DC3-256-1, C3-256-1, C3-256-1, maxpool
conv4_x: DC3-512-1, C3-512-1, C3-512-1, maxpool
conv5_x: DC3-512-2, C3-512-2, C3-512-2, Davgpool
conv6 & conv7: DC3-1024-12, C1-1024-0, globalpool+concat
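A minimal PyTorch sketch of the global-context segmentation head described before Table 2.2 is given below; the module name and shapes are illustrative.

import torch
import torch.nn as nn

class GlobalContextHead(nn.Module):
    """Sketch of the head above: a global feature of conv7 is appended to every
    spatial position, then a 1x1 conv predicts per-pixel class scores."""

    def __init__(self, channels, num_classes):
        super().__init__()
        self.classifier = nn.Conv2d(2 * channels, num_classes, kernel_size=1)

    def forward(self, conv7):                       # conv7: (B, C, H, W)
        B, C, H, W = conv7.shape
        g = conv7.mean(dim=(2, 3), keepdim=True)    # global pooling -> (B, C, 1, 1)
        g = g.expand(-1, -1, H, W)                  # append to all spatial positions
        x = torch.cat([conv7, g], dim=1)            # (B, 2C, H, W)
        return self.classifier(x)                   # per-pixel class logits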
2.4 Experiments
Evaluation is performed on three popular RGB-D datasets:
NYUv2 [98]: NYUv2 contains 1,449 RGB-D images with pixel-wise labels.
We follow the 40-class setting and the standard split with 795 training images
and 654 testing images.
SUN-RGBD [103, 53]: This dataset has 37 categories of objects and consists of
10,335 RGB-D images, with 5,285 for training and 5,050 for testing.
Stanford Indoor Dataset (SID) [2]: SID contains 70,496 RGB-D images with 13
object categories. We use Areas 1, 2, 3, 4 and 6 for training, and Area 5 for testing.
Four common metrics are used for evaluation: pixel accuracy (Acc), mean pixel
accuracy of different categories (mAcc), mean Intersection-over-Union of different categories
(mIoU), and frequency-weighted IoU (fwIoU). Suppose $n_{ij}$ is the number of pixels with
ground truth class $i$ predicted as class $j$, $n_C$ is the number of classes, and $s_i$ is the
number of pixels with ground truth class $i$; the total number of pixels is $s = \sum_i s_i$.
The four metrics are defined as follows:
$\text{Acc} = \frac{\sum_i n_{ii}}{s}$, $\text{mAcc} = \frac{1}{n_C}\sum_i \frac{n_{ii}}{s_i}$, $\text{mIoU} = \frac{1}{n_C}\sum_i \frac{n_{ii}}{s_i + \sum_j n_{ji} - n_{ii}}$, $\text{fwIoU} = \frac{1}{s}\sum_i \frac{s_i\, n_{ii}}{s_i + \sum_j n_{ji} - n_{ii}}$.
Implementation Details For most experiments, DeepLab with a modified VGG-16
encoder (c.f. Table 2.2) is the baseline. The depth-aware CNN based on DeepLab outlined
in Table 2.2 is evaluated to validate the effectiveness of our approach; it is
referred to as "D-CNN" in this paper. We also conduct experiments combining HHA
encoding [37]. Following [97, 87, 26], two baseline networks consume RGB and HHA
images separately, and the predictions of both networks are summed in the last layer.
This two-stream network is dubbed "HHA". To make a fair comparison, we also build a
depth-aware CNN in this two-stream fashion and denote it as "D-CNN+HHA". In the
ablation study, we further replace VGG-16 with ResNet-50 [44] as the encoder to better
understand the functionality of the depth-aware operations.
We use the SGD optimizer with initial learning rate 0.001, momentum 0.9 and batch size 1.
The learning rate is multiplied by $(1 - \frac{iter}{max\_iter})^{0.9}$ every 10 iterations. α is set to
8.3. (The impact of α is studied in Section 2.4.2.) The dataset is augmented by random
scaling, cropping, and color jittering. We use the PyTorch deep learning framework. Both
depth-aware convolution and depth-aware average pooling operators are implemented
with CUDA acceleration. Code will be released.
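Read as the standard "poly" policy, the learning-rate schedule above can be sketched as follows; the function name and step size handling are illustrative.

def poly_lr(base_lr, it, max_iter, power=0.9, step=10):
    """Sketch: lr = base_lr * (1 - iter/max_iter)^power, updated every `step` iterations."""
    it = (it // step) * step
    return base_lr * (1.0 - it / float(max_iter)) ** power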
Baseline HHA D-CNN D-CNN+HHA
Acc (%) 50.1 59.1 60.3 61.4
mAcc (%) 23.9 30.8 39.3 35.6
mIoU (%) 15.9 21.9 27.8 26.2
fwIoU (%) 34.2 43.0 44.9 45.7
Table 2.3: Comparison with baseline CNNs on NYUv2 test set. Networks are trained
from scratch.
[97] [26] [45] [87] HHA D-CNN D-CNN+HHA
mAcc (%) 46.1 45.1 53.8 55.2 51.1 53.6 56.3
mIoU (%) 34.0 34.1 40.1 42.0 40.4 41.0 43.9
Table 2.4: Comparison with the state-of-the-arts on NYUv2 test set. Networks are
trained from pre-trained models.
2.4.1 Main Results
Depth-aware CNN is compared with both its baseline and the state-of-the-art methods
on the NYUv2 and SUN-RGBD datasets. It is also compared with the baseline on the SID
dataset.
NYUv2 Table 2.3 shows quantitative comparison results between D-CNNs and the baseline
models. Since D-CNN and its baseline lie in different function spaces, all networks
are trained from scratch for a fair comparison in this experiment. Without introducing
any parameters, D-CNN outperforms the baseline by incorporating geometric information
into the convolution operation. Moreover, the performance of D-CNN also exceeds
the "HHA" network while using only half of its parameters. This effectively validates the
superior capability of D-CNN over "HHA" in handling geometry.
We also compare our results with the state-of-the-art methods. Table 2.4 illustrates
the good performance of D-CNN. In this experiment, the networks are initialized with
the pre-trained parameters from [10]. Long et al. [97] and Eigen et al. [26] both use a
two-stream network with HHA/depth encoding. Yang et al. [45] compute optical flows
and superpixels to augment the performance with spatial-temporal information. D-CNN
with only one VGG network is superior to their methods. Qi et al. [87] build a 3D graph
on top of the VGG encoder and use an RNN to update the graph, which introduces more
network parameters and higher computational complexity. As shown in Table 2.4, D-CNN
is already comparable with these state-of-the-art methods. By incorporating HHA
encoding, our method achieves the state of the art on this dataset. Figure 2.4 visualizes
qualitative comparison results on the NYUv2 test set.

Figure 2.4: Segmentation results on the NYUv2 test set (columns: RGB, Depth, GT, Baseline,
HHA, D-CNN, D-CNN+HHA). "GT" denotes ground truth. The white regions in "GT" are the
ignored category. Networks are trained from pre-trained models.
Baseline HHA D-CNN D-CNN+HHA
Acc (%) 66.6 72.6 72.4 72.9
mAcc (%) 31.5 37.9 38.6 41.2
mIoU (%) 22.8 28.8 29.7 31.3
fwIoU (%) 51.4 58.5 58.2 59.3
Table 2.5: Comparison with baseline CNNs on SUN-RGBD test set. Networks are
trained from scratch.
[66] [87] HHA D-CNN D-CNN+HHA
mAcc (%) 48.1 55.2 50.5 51.2 53.5
mIoU (%) - 42.0 40.2 41.5 42.0
Table 2.6: Comparison with the state-of-the-arts on SUN-RGBD test set. Networks are
trained from pre-trained models.
SUN-RGBD The comparison results between D-CNN and its baseline are listed in
Table 2.5. The networks in this table are trained from scratch. D-CNN outperforms the
baseline by a large margin. Substituting the baseline with the two-stream "HHA" network
further improves the performance.
By comparing with the state-of-the-art methods in Table 2.6, we can further see
the effectiveness of D-CNN. As for NYUv2, the networks in this experiment are initialized
with pre-trained models. Figure 2.5 illustrates qualitative comparison results on the
SUN-RGBD test set. Our network achieves comparable performance with
the state-of-the-art method [87], while their method is more time-consuming. We will
further compare the runtime and number of model parameters in Section 2.4.3.
SID The comparison results on SID between D-CNN and its baseline are reported
in Table 2.7. Networks are trained from scratch. Using depth images, D-CNN achieves
a 4% IoU improvement over the baseline CNN while preserving the same number of parameters
and computational complexity.
Figure 2.5: Segmentation results on the SUN-RGBD test set (columns: RGB, Depth, GT,
Baseline, HHA, D-CNN, D-CNN+HHA). "GT" denotes ground truth. The white regions in
"GT" are the ignored category. Networks are trained from pre-trained models.
2.4.2 Ablation Study
In this section, we conduct ablation studies on the NYUv2 dataset to validate the efficiency
and efficacy of our approach. Testing results on the NYUv2 test set are reported.
Depth-aware CNN To verify the functionality of both depth-aware convolution and
depth-aware average pooling, the following experiments are conducted.
Baseline D-CNN
Acc (%) 64.3 65.4
mAcc (%) 46.7 55.5
mIoU (%) 35.5 39.5
fwIoU (%) 48.5 49.9
Table 2.7: Comparison with baseline CNNs on SID Area5. Networks are trained from
scratch.
Baseline HHA VGG-1 VGG-2 VGG-3
Acc (%) 50.1 59.1 60.3 56.0 59.3
mAcc (%) 23.9 30.8 39.3 32.2 39.2
mIoU (%) 15.9 21.9 27.8 22.4 27.4
fwIoU (%) 34.2 43.0 44.9 40.2 44.0
Table 2.8: Results of using depth-aware operations in different layers. Experiments are
conducted on NYUv2 test set. Networks are trained from scratch.
VGG-1: Conv1_1, Conv2_1, Conv3_1, Conv4_1, Conv5_1 and Conv6 in
VGG-16 are replaced with depth-aware convolution. This is the same as in
Table 2.2.
VGG-2: Conv4_1, Conv5_1 and Conv6 in VGG-16 are replaced with depth-aware
convolution. Other layers remain the same as in Table 2.2.
VGG-3: The depth-aware average pooling layer listed in Table 2.2 is replaced
with regular pooling. Other layers remain the same as in Table 2.2.
Results are shown in Table 2.8. Compared to VGG-2, VGG-1 adds depth-aware convolution
in the bottom layers. This helps the network propagate more fine-grained features
with geometric relationships and increases segmentation performance by 6% in IoU. The
depth-aware average pooling operation is able to further improve the accuracy.
We also replace VGG-16 with ResNet-50 as the encoder. To build a depth-aware
ResNet, Conv3_1, Conv4_1, and Conv5_1 in ResNet-50 are replaced with
depth-aware convolution. The networks are initialized with parameters pre-trained on
ADE20K [141]. The detailed architecture and training details for ResNet can be found in
the Supplementary Materials. Results are listed in Table 2.9.

ResNet ResNet-D-CNN VGG-D-CNN
Acc (%) 68.9 69.6 69.4
mAcc (%) 50.2 53.3 53.6
mIoU (%) 38.8 41.5 41.0
fwIoU (%) 54.4 54.4 54.5
Table 2.9: Results of using depth-aware operations in ResNet-50. "VGG-D-CNN"
denotes the same network and result as in Table 2.4. Networks are trained from pre-trained
models.
Depth Similarity Function We modify α and F_D to further validate the effect of
different choices of depth similarity function on performance. We conduct the following
experiments:
α=8.3: α is set to 8.3. The network architecture is the same as in Table 2.2.
α=20: α is set to 20. The network architecture is the same as in Table 2.2.
α=2.5: α is set to 2.5. The network architecture is the same as in Table 2.2.
clip F_D: The network architecture is the same as in Table 2.2. F_D is defined as
$F_D(p_i, p_j) = \begin{cases} 0, & |D(p_i) - D(p_j)| \ge 1 \\ 1, & \text{otherwise} \end{cases}$    (2.6)
Table 2.10 reports the test performance with different depth similarity functions.
Though the performance varies with different α, all variants are superior to the baseline and
even to "HHA". The result of clip F_D is also comparable with "HHA". This validates
the effectiveness of using a depth-sensitive term to weight the contributions of neurons.
Baseline HHA α=8.3 α=20 α=2.5 clip F_D
Acc (%) 50.1 59.1 60.3 58.5 58.5 53.0
mAcc (%) 23.9 30.8 39.3 35.2 35.9 29.8
mIoU (%) 15.9 21.9 27.8 24.9 25.3 20.1
fwIoU (%) 34.2 43.0 44.9 42.6 42.9 37.5
Table 2.10: Results of using different α and F_D. Experiments are conducted on the NYUv2
test set. Networks are trained from scratch.
Figure 2.6: Performance analysis. (a) Per-class IoU improvement of D-CNN over the baseline
on the NYUv2 test set. (b) Evolution of the training loss on the NYUv2 training set.
Networks are trained from scratch.
Performance Analysis To better understand how depth-aware CNN outperforms
the baseline, we visualize the improvement in IoU for each semantic class
in Figure 2.6(a). The statistics show that D-CNN outperforms the baseline on most object
categories, especially large objects such as ceiling and curtain. Moreover, we
observe that depth-aware CNN converges faster than the baseline, especially when trained
from scratch. Figure 2.6(b) shows the evolution of the training loss with respect to training
steps. Our network reaches lower loss values than the baseline.
2.4.3 Model Complexity and Runtime Analysis
Table 2.11 reports the model complexity and runtime of D-CNN and the state-of-the-art
method [87]. In their method, kNN takes at least O(kN) runtime, where N is the
number of pixels. We leverage the grid structure of the raw depth input. Without increasing
any model parameters, D-CNN is able to incorporate geometric information into CNN
efficiently.

Baseline HHA [87] D-CNN
net. forward (ms) 32.5 64.2 214 39.3
# of params 47.0M 92.0M 47.25M 47.0M
Table 2.11: Model complexity and runtime comparison. Runtime is tested on an Nvidia
1080Ti, with input image size 425×560×3.
2.5 Conclusion
We present a novel depth-aware CNN by introducing two operations: depth-aware convolution
and depth-aware average pooling. Depth-aware CNN augments conventional
CNN with a depth similarity term and encodes geometric variance into the basic convolution
and pooling operations. By adapting the effective receptive field, these depth-aware
operations are able to incorporate geometry into CNN while preserving CNN's efficiency.
Without introducing any additional parameters or computational complexity, this method
improves performance on RGB-D segmentation over the baseline by a large
margin. Moreover, depth-aware CNN is flexible and easily replaces its plain counterpart
in standard CNNs. Comparisons with the state-of-the-art methods and extensive ablation
studies on RGB-D semantic segmentation demonstrate the effectiveness and efficiency
of depth-aware CNN.
Depth-aware CNN provides a general framework for vision tasks with RGB-D input.
Moreover, depth-aware CNN takes the raw depth image as input and bridges the gap
between 2D CNNs and 3D geometry. In future work, we will apply depth-aware CNN
to various tasks such as 3D detection and instance segmentation, and we will evaluate
depth-aware CNN on more challenging datasets. Apart from depth input, we will exploit
other forms of 3D data such as LiDAR point clouds.
Chapter 3
3D Perception: Point Cloud Instance
Segmentation
3.1 Introduction
Instance segmentation on 2D images has achieved promising results recently [42, 20,
81, 65]. With the rise of autonomous driving and robotics applications, the demand for
3D scene understanding and the availability of 3D scene data have rapidly increased.
Unfortunately, the literature on 3D instance segmentation and object detection
lags far behind its 2D counterpart; scene understanding with Convolutional Neural
Networks (CNNs) [104, 105, 23] on 3D volumetric data is limited by high memory and
computation cost. Recently, the deep learning frameworks PointNet/PointNet++ [84, 86] on
point clouds have opened up more efficient and flexible ways to handle 3D data.
Following the pioneering works in 2D scene understanding, our goal is to develop
a novel deep learning framework trained end-to-end for 3D instance-aware semantic
segmentation on point clouds that, like established baseline systems for 2D scene under-
standing tasks, is intuitive, simple, flexible, and effective.
This chapter is based on the paper SGPN: Similarity Group Proposal Network for 3D Point Cloud
Instance Segmentation by Weiyue Wang, Ronald Yu, Qiangui Huang and Ulrich Neumann, CVPR, 2018.
Code is available at github.com/laughtervv/SGPN.
Figure 3.1: Instance segmentation for point clouds using SGPN. Different colors repre-
sent different instances. (a) Instance segmentation on complete real scenes. (b) Single
object part instance segmentation. (c) Instance segmentation on point clouds obtained
from partial scans.
An important consideration for instance segmentation on a point cloud is how to
represent output results. Inspired by the trend of predicting proposals for tasks with a
variable number of outputs, we introduce a Similarity Group Proposal Network (SGPN),
which formulates group proposals of object instances by learning a novel 3D instance
segmentation representation in the form of a similarity matrix.
Our pipeline first uses PointNet/PointNet++ to extract a descriptive feature vector
for each point in the point cloud. As a form of similarity metric learning, we enforce the
idea that points belonging to the same object instance should have very similar features;
hence we measure the distance between the features of each pair of points in order to
form a similarity matrix that indicates whether any given pair of points belong to the
same object instance.
The rows in our similarity matrix can be viewed as instance candidates, which we
combine with learned confidence scores in order to generate plausible group proposals.
We also learn a semantic segmentation map in order to classify each object instance
obtained from our group proposals. We are also able to directly derive tight 3D bounding
boxes for object detection.
By simply measuring the distance between overdetermined feature representations
of each pair of points, our similarity matrix simplifies our pipeline in that we remain
in the natural point cloud representation of defining our objects by the relationships
between points.
In summary, SGPN has three output branches for instance segmentation on point
clouds: a similarity matrix yielding point-wise group proposals, a confidence map for
pruning these proposals, and a semantic segmentation map to give the class label for
each group. SGPN is to the best of our knowledge the first deep learning framework to
learn 3D instance segmentation on point clouds.
We evaluate our framework on both 3D shapes (ShapeNet [9]) and real 3D scenes
(Stanford Indoor Semantic Dataset [3] and NYUV2 [98]) and demonstrate that SGPN
achieves state-of-the-art results on 3D instance segmentation. We also conduct compre-
hensive experiments to show the capability of SGPN on achieving high performance on
3D semantic segmentation and 3D object detection on point clouds. Although a min-
imalistic framework with no bells and whistles already gives visually pleasing results
(Figure 3.1), we also demonstrate the flexibility of SGPN as we boost performance even
more by seamlessly integrating CNN features from RGBD images.
3.2 Related Works
3.2.1 Object Detection and Instance Segmentation
Recent advances in object detection [92, 32, 68, 90, 91, 70, 29, 69] and instance segmen-
tation [65, 20, 19, 82, 81] on 2D images have achieved promising results. R-CNN [33]
for 2D object detection established a baseline system by introducing region proposals
as candidate object regions. Faster R-CNN [92] leveraged a CNN learning scheme and
proposed Region Proposal Networks (RPN). YOLO [90] divided the image into grids and
each grid cell produced an object proposal. Many 2D instance segmentation approaches
are based on segment proposals. DeepMask [81] learns to generate segment propos-
als each with a corresponding object score. Dai et al. [20] predict segment candidates
from bounding box proposals. Mask R-CNN [42] extended Faster R-CNN by adding a
branch on top of RPN to produce object masks for instance segmentation.
Following these pioneering 2D works, 3D bounding box detection frameworks have
emerged [94, 104, 105, 23, 11]. Song and Xiao [105] use a volumetric CNN to create
3D RPN on a voxelized 3D scene and then use both the color and depth data of the image
in a joint 3D and 2D object recognition network on each proposal. Deng and Latecki
[23] regress class-wise 3D bounding box models based on RGBD image appearance
features only. Armeni et al. [3] use a sliding shape method with CRF to perform 3D
object detection on point clouds. To the best of our knowledge, no previous work exists
that learns 3D instance segmentation.
3.2.2 3D Deep Learning
Convolutional neural networks generalize well to 3D by performing convolution on vox-
els for certain tasks such as object classification [85, 119, 73, 129, 95], shape recon-
struction [122, 39, 18] of simple objects, and 3D object detection as mentioned in
Section 3.2.1. However, volumetric representations carry a high memory and computational
cost and have strong limitations in dealing with 3D scenes [17, 3, 107]. Octree-based
CNNs [95, 113, 119] have been introduced recently, but they are less flexible than volumetric
CNNs and still suffer from memory efficiency problems.
A point cloud is an intuitive, memory-efficient 3D representation well-suited for representing
detailed, large scenes for 3D instance segmentation using deep learning. PointNet/PointNet++
[84, 86] recently introduced deep neural networks on 3D point clouds and
achieved successful results for tasks such as object classification and part and semantic
scene segmentation. We base our network architecture on PointNet/PointNet++,
yielding a novel method that learns 3D instance segmentation on point clouds.
3.2.3 Similarity Metric Learning
Our work is also closely related to similarity metric learning, which has been widely
used in deep learning on various tasks such as person re-identification [136], matching
[38], image retrieval [28, 125] and face recognition [14]. Siamese CNNs [14, 99, 6]
are used on tasks such as tracking [64] and one-shot learning [61] by measuring the
similarity of two input images. Newell et al. [75] introduced an associative embedding
method to group similar pixels for multi-person pose estimation and 2D instance
segmentation, by enforcing that pixels in the same group should have similar values in
their embedding space without actually enforcing what those exact values should be.
Our method exploits metric learning in a different way in that we regress the likelihood
of two points belonging to the same group and formulate the similarity matrix as group
proposals to handle a variable number of instances.
Figure 3.2: Pipeline of our system for point cloud instance segmentation.
3.3 Method
The goal of this paper is to take a 3D point cloud as input and produce an object instance
label for each point and a class label for each instance. Utilizing recent developments
in deep learning on point clouds [84, 86], we introduce a Similarity Group Proposal
Network (SGPN), which consumes a 3D point cloud and outputs a set of instance pro-
posals that each contain the group of points inside the instance as well as its class label.
Section 3.3.1 introduces the design and properties of SGPN. Section 3.3.2 proposes an
algorithm to merge similar groups and give each point an instance label. Section 3.4.1
gives implementation details. Figure 3.2 depicts the overview of our system.
3.3.1 Similarity Group Proposal Network
SGPN is a very simple and intuitive framework. As shown in Figure 3.2, it first passes
a point cloud P of size N_p through a feed-forward feature extraction network inspired
by PointNets [84, 86], learning both global and local features in the point cloud. This
feature extraction network produces a matrix F. SGPN then diverges into three branches
that each pass F through a single PointNet layer to obtain N_p × N_f feature matrices
F_SIM, F_CF and F_SEM, which we respectively use to obtain a similarity matrix, a confidence
map and a semantic segmentation map. The i-th row of an N_p × N_f feature matrix is an
N_f-dimensional vector that represents point P_i in an embedded feature space. Our loss
L is given by the sum of the losses from each of these three branches: L = L_SIM +
L_CF + L_SEM. Our network architecture can be found in the supplemental material.
Figure 3.3: (a) Similarity (Euclidean distance in feature space) between a given point
(indicated by the red arrow) and the rest of the points. A darker color represents a lower distance
in feature space and thus higher similarity. (b) Confidence map. A darker color represents
higher confidence.
Similarity Matrix We propose a novel similarity matrix S from which we can formulate
group proposals to directly recover accurate instance segmentation results. S is of
dimensions N_p × N_p, and element S_ij classifies whether or not points P_i and P_j belong
to the same object instance. Each row of S can be viewed as a proposed grouping of
points that form a candidate object instance.
We leverage the fact that points belonging to the same object instance should have similar
features and lie very close together in feature space. We obtain S by, for each pair of
points {P_i, P_j}, simply subtracting their corresponding feature vectors {F_SIM_i, F_SIM_j}
and taking the L2 norm, such that $S_{ij} = \lVert F_{SIM_i} - F_{SIM_j} \rVert_2$. This reduces the problem of
instance segmentation to learning an embedding space where points in the same instance
are close together and those in different object instances are far apart.
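A minimal PyTorch sketch of this construction is shown below; tensor names are illustrative.

import torch

def similarity_matrix(f_sim):
    """Sketch of the similarity matrix S described above.

    f_sim: (Np, Nf) per-point features from the F_SIM branch.
    Returns S with S[i, j] = ||f_sim[i] - f_sim[j]||_2, so each row can be
    read as a group proposal (points with small distance to point i)."""
    s = torch.cdist(f_sim, f_sim, p=2)      # (Np, Np) pairwise L2 distances
    # At test time, the proposal for point i is {j : s[i, j] < Th_S}.
    return s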
For a better understanding of how SGPN captures correlation between points, in
Figure 3.3(a) we visualize the similarity (Euclidean distance in feature space) between a
given point and the rest of the points in the point cloud. Points in different instances have
greater Euclidean distances in feature space, and thus smaller similarities, even though
they have the same semantic labels. For example, in the bottom-right image of Figure 3.3(a),
although the given table-leg point has greater similarity with the other table-leg
points than with the table top, it is still distinguishable from the other table legs.
We believe that a similarity matrix is a more natural and simple representation for
3D instance segmentation on a point cloud compared to traditional 2D instance seg-
mentation representations. Most state-of-the-art 2D deep learning methods for instance
segmentation first localize the image into patches, which are then passed through a neu-
ral network and segment a binary mask of the object.
While learning a binary mask in a bounding box is a more natural representation for
space-centric structures such as images or volumetric grids where features are largely
defined by which positions in a grid have strong signals, point clouds can be viewed as
shape-centric structures where information is encoded by the relationship between the
points in the cloud, so we would prefer to also define instance segmentation output by
the relationship between points without working too much in grid space.
Hence we expect that a deep neural network could better learn our similarity matrix,
which compared to traditional representations is a more natural and straightforward
representation for instance segmentation in a point cloud.
Double-Hinge Loss for Similarity Matrix As is the case in [75], in our similarity
matrix we do not need to precisely regress the exact values of our features; we only
optimize the simpler objective that similar points should be close together in feature
space. We define three potential similarity classes for each pair of points {P_i, P_j}:
1. P_i and P_j belong to the same object instance.
2. P_i and P_j share the same semantic class but do not belong to the same object instance.
3. P_i and P_j do not share the same semantic class.
Pairs of points should lie progressively further away from each other in feature space as
their similarity class increases. We define our loss as:
$L_{SIM} = \sum_i^{N_p} \sum_j^{N_p} l(i,j)$
$l(i,j) = \begin{cases} \lVert F_{SIM_i} - F_{SIM_j} \rVert_2, & C_{ij} = 1 \\ \alpha \max(0, K_1 - \lVert F_{SIM_i} - F_{SIM_j} \rVert_2), & C_{ij} = 2 \\ \max(0, K_2 - \lVert F_{SIM_i} - F_{SIM_j} \rVert_2), & C_{ij} = 3 \end{cases}$
where C_ij indicates which of the similarity classes defined above the pair of points
{P_i, P_j} belongs to, and α, K_1, K_2 are constants such that α > 1 and K_2 > K_1.
Although the second and third similarity classes are treated equivalently for the purposes
of instance segmentation, distinguishing between them in L_SIM using our double-hinge
loss allows our similarity matrix output branch and our semantic segmentation
output branch to mutually assist each other for increased accuracy and convergence
speed. Since the semantic segmentation network is actually wrongly trying to bring
pairs of points in our second similarity class closer together in feature space, we also
add the α > 1 term to increase the weight of our loss to dominate the gradient from the
semantic segmentation output branch.
At test time, if S_ij < Th_S, where Th_S < K_1, then the point pair P_i and P_j is in the
same instance group.
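The following is a minimal PyTorch sketch of the double-hinge loss, assuming per-point instance and semantic labels are available to derive C_ij; the default constants mirror the values reported later in Section 3.4.1, and the function name is illustrative.

import torch

def double_hinge_loss(f_sim, inst_label, sem_label, alpha=2.0, k1=1.0, k2=2.0):
    """Sketch of L_SIM for one point cloud.

    f_sim:      (Np, Nf) features from the F_SIM branch
    inst_label: (Np,) ground-truth instance id per point
    sem_label:  (Np,) ground-truth semantic class per point"""
    dist = torch.cdist(f_sim, f_sim, p=2)                      # ||F_SIM_i - F_SIM_j||_2
    same_inst = inst_label[:, None] == inst_label[None, :]     # C_ij = 1
    same_sem = sem_label[:, None] == sem_label[None, :]
    c2 = same_sem & ~same_inst                                 # C_ij = 2
    c3 = ~same_sem                                             # C_ij = 3

    loss = (dist * same_inst.float()
            + alpha * torch.clamp(k1 - dist, min=0) * c2.float()
            + torch.clamp(k2 - dist, min=0) * c3.float())
    return loss.sum()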
Similarity Confidence Network SGPN also feeds F_CF through an additional PointNet
layer to predict an N_p × 1 confidence map CM reflecting how confidently the model
believes that each grouping candidate is indeed a correct object instance. Figure 3.3(b)
provides a visualization of the confidence map; points located in the boundary areas
between parts have lower confidence.
We regress confidence scores based on ground truth groups G represented as an N_p × N_p
matrix identical in form to our similarity matrix. If P_i is a background point that
does not belong to any object in the ground truth, then the row G_i will be all zeros. For
each row S_i, we expect the ground-truth value in the confidence map CM_i to be the
intersection over union (IoU) between the set of points in the predicted group S_i and the
ground truth group G_i. Our loss L_CF is the L2 loss between the inferred and expected
CM.
Although the loss L_CF depends on the similarity matrix output branch during training,
at test time we run the branches in parallel, and only groups with confidence greater
than a threshold Th_C are considered valid group proposals.
Semantic Segmentation Map The semantic segmentation map acts as a point-wise
classifier. SGPN passes F_SEM through an additional PointNet layer whose architecture
depends on the number of possible semantic classes, yielding the final output M_SEM,
which is an N_p × N_C sized matrix, where N_C is the number of possible object categories.
M_SEM_ij corresponds to the probability that point P_i belongs to class C_j.
The loss L_SEM is a weighted sum of the cross-entropy softmax loss for each row
in the matrix. We use median frequency balancing [4], where the weight assigned to a
category is a_c = median_freq / freq(c); freq(c) is the total number of points
of class c divided by the total number of points in samples where c is present, and
median_freq is the median of these freq(c).
At test time, the class label for a group instance is assigned by calculating the mode
of the semantic labels of the points in that group.
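A minimal sketch of computing the median frequency balancing weights a_c, assuming per-sample label arrays are available; names are illustrative.

import numpy as np

def median_frequency_weights(sem_labels_per_sample, num_classes):
    """Sketch of the weights a_c = median_freq / freq(c) described above."""
    class_count = np.zeros(num_classes)      # total points of class c
    present_count = np.zeros(num_classes)    # total points in samples where c is present
    for labels in sem_labels_per_sample:
        for c in np.unique(labels):
            class_count[c] += np.sum(labels == c)
            present_count[c] += labels.size
    freq = class_count / np.maximum(present_count, 1)
    median_freq = np.median(freq[freq > 0])
    return np.where(freq > 0, median_freq / np.maximum(freq, 1e-12), 0.0)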
3.3.2 Group Proposal Merging
The similarity matrix S produces N_p group proposals, many of which are noisy or represent
the same object. We first discard proposals with predicted confidence less than
Th_C or cardinality less than Th_M2. We further prune our proposals into clean, non-overlapping
object instances by applying Non-Maximum Suppression; groups with IoU
greater than Th_M1 are merged together by selecting the group with the maximum cardinality.
Each point is then assigned to the group proposal that contains it. In the rare case
(2%) that after the merging stage a point belongs to more than one final group proposal,
this usually means that the point is at the boundary between two object instances, which
means that the effectiveness of our network would be roughly the same regardless of
which group proposal the point is assigned to. Hence, with minimal loss in accuracy, we
randomly assign the point to any one of the group proposals that contains it. We refer to
this process as GroupMerging throughout the rest of the paper.
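A rough NumPy sketch of this GroupMerging procedure is given below; it follows the thresholds Th_C, Th_M1 and Th_M2 described above but simplifies merging to keeping the larger of two overlapping proposals, so it is illustrative rather than the exact implementation.

import numpy as np

def group_merging(proposals, confidence, th_c=0.1, th_m1=0.6, th_m2=200):
    """proposals:  (Np, Np) boolean array; row i is the group proposed for point i
    confidence: (Np,) predicted confidence per proposal."""
    sizes = proposals.sum(axis=1)
    keep = np.where((confidence > th_c) & (sizes > th_m2))[0]
    # Visit proposals from largest to smallest so merging keeps the max cardinality.
    keep = keep[np.argsort(-sizes[keep])]

    groups = []
    for i in keep:
        merged = False
        for g in groups:
            inter = np.logical_and(proposals[i], g).sum()
            union = np.logical_or(proposals[i], g).sum()
            if union > 0 and inter / union > th_m1:
                merged = True               # absorbed by a larger existing group
                break
        if not merged:
            groups.append(proposals[i].copy())
    return groups                           # list of boolean masks, one per instance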
3.4 Experiments
We evaluate SGPN on 3D instance segmentation on the following datasets:
Stanford 3D Indoor Semantics Dataset (S3DIS) [3]: This dataset contains 3D
scans in 6 areas including 271 rooms. The input is a complete point cloud generated
from scans fused together from multiple views. Each point has semantic
labels and instance annotations.
NYUV2 [98]: Partial point clouds are generated from single-view RGBD images.
The dataset is annotated with 3D bounding boxes and 2D semantic segmentation
masks. We use the improved annotation in [23]. Since both 3D bounding box
and 2D segmentation mask annotations are given, ground truth 3D instance segmentation
labels for point clouds can be easily generated. We follow the standard
split with 795 training images and 654 testing images.
ShapeNet [9, 137] Part Segmentation: ShapeNet contains 16,881 shapes annotated
with 50 types of parts in total. Most object categories are labeled with two
to five parts. We use the official split of 795 training samples and 654 testing
samples in our experiments.
We also show the capability of SGPN to improve semantic segmentation and 3D object
detection. To validate the flexibility of SGPN, we also seamlessly incorporate 2D CNN
features into our network to boost performance on the NYUV2 dataset.
3.4.1 Implementation Details
We use the ADAM [59] optimizer with initial learning rate 0.0005, momentum 0.9 and
batch size 4. The learning rate is divided by 2 every 20 epochs. The network is trained
with only the L_SIM loss for the first 5 epochs. In our experiments, α is set to 2 initially
and is increased by 2 every 5 epochs. This design makes the network more focused on
separating the features of points that belong to different object instances but have the same
semantic labels. K_1 and K_2 are set to 1 and 2, respectively. We use per-category histogram
thresholding to get the threshold Th_S for each testing sample. Th_M1 is set to 0.6
and Th_M2 is set to 200. Th_C is set to 0.1. Our network is implemented with TensorFlow
on a single Nvidia GTX 1080Ti GPU. It takes 16-17 hours to converge. At test time,
SGPN takes 40ms on an input point cloud of size 4096×9 with PointNet++ as our
baseline architecture. Further runtime analysis can be found in Section 3.4.3.
Network Architecture In our experiments, we use both PointNet and PointNet++
as our baseline architectures. For the S3DIS dataset, we use PointNet as our baseline
for fair comparison with the 3D object detection system described in the PointNet
paper [84]. The network architecture is the same as the semantic segmentation network
in PointNet except for the last two layers. Our F is the last 1×1 conv layer
with BatchNorm and ReLU in PointNet with 256 output channels. F_SIM, F_CF and F_SEM
are 1×1 conv layers with output channels (128, 128, 128), respectively.
For the NYUV2 dataset, we use PointNet++ as our baseline. We use the same notation
as PointNet++ to describe our architecture: SA(K, r, [l_1, ..., l_d]) is a set abstraction (SA)
level with K local regions of ball radius r, using a PointNet architecture of d 1×1 conv
layers with output channels l_i (i = 1, ..., d). FP(l_1, ..., l_d) is a feature propagation (FP)
level with d 1×1 conv layers. Our network architecture is:
SA(1024, 0.1, [32, 32, 64]),
SA(256, 0.2, [64, 64, 128]),
SA(128, 0.4, [128, 128, 256]),
SA(64, 0.8, [256, 256, 256]),
SA(16, 1.2, [256, 256, 512]),
FP(512, 256),
FP(256, 256),
FP(256, 256),
FP(256, 128),
FP(128, 128, 128, 128).
F_SIM, F_CF and F_SEM are 1×1 conv layers with output channels (128, 128, 128), respectively.
For our experiments on the ShapeNet part dataset, PointNet++ is used as our
baseline. We use the same network architecture as in the PointNet++ paper [86].
F_SIM, F_CF and F_SEM are 1×1 conv layers with output channels (64, 64, 64), respectively.
Block Merging We divide each scene into 1m×1m blocks with overlapping sliding
windows in a snake pattern with stride 0.5m. The entire scene is also divided into a
400×400×400 grid V. V_k is used to indicate the instance label of cell k, where
k ∈ [0, 400×400×400). Given V and point instance labels for each block PL, where
PL_ij represents the instance label of the j-th point in block i, a BlockMerging algorithm
(refer to Algorithm 1) is derived to merge object instances from different blocks.
3.4.2 S3DIS Instance Segmentation and 3D Object Detection
We perform experiments on the Stanford 3D Indoor Semantics Dataset to evaluate our
performance on large real-scene scans. Following the experimental settings in PointNet [84],
points are uniformly sampled into blocks of area 1m×1m. Each point is labeled as
one of 13 categories (chair, table, floor, wall, clutter, etc.) and represented by a 9D vector
(XYZ, RGB, and normalized location within the room). At train time we uniformly
sample 4096 points in each block, and at test time we use all points in a block as input.
SGPN uses PointNet as its baseline architecture for this experiment. (PointNet [84]
proposed a 3D detection system while PointNet++ [86] does not. To make a fair comparison,
we use PointNet as our baseline architecture for this experiment while using PointNet++ in
Sections 3.4.3 and 3.4.4.)
Algorithm 1: BlockMerging
Input: V, PL
Output: Point instance labels for the whole scene, L
Initialize V with all elements set to -1;
GroupCount ← 0;
for every block i do
    if i is the 1st block then
        for every point P_j in block i do
            Define k such that P_j is located in the k-th cell of V;
            V_k ← PL_1j;
        end
    else
        for every instance I_j in block i do
            Define V_Ij as the cells of V in which the points of I_j are located;
            V_t ← the cells in V_Ij that do not have value -1;
            if the frequency of the mode of V_t < 30 then
                V_Ij ← GroupCount;
                GroupCount ← GroupCount + 1;
            else
                V_Ij ← the mode of V_t;
            end
        end
    end
end
for every point P_j in the whole scene do
    Define k such that P_j is located in the k-th cell of V;
    L_j ← V_k;
end
Figure 3.5 shows instance segmentation results on S3DIS with SGPN. Different colors
represent different instances. Point colors of the same group are not necessarily the same as
their counterparts in the ground truth since object instances are unordered. To visualize
instance classes, we also add semantic segmentation results. SGPN achieves good
performance on various room types.
We also compare instance segmentation performance with the following method
(which we call Seg-Cluster): perform semantic segmentation using our network and
then select all points as seeds. Starting from a seed point, BFS is used to search neighboring
points with the same label. If a cluster with more than 200 points has been found,
it is viewed as a valid group. Our GroupMerging algorithm is then used to merge these
valid groups.

Mean ceiling floor wall beam column window door table chair sofa bookcase board
Seg-Cluster 17.40 70.01 80.12 10.64 15.30 0.00 28.97 32.32 22.16 27.76 0.00 0.06 21.52
SGPN 36.30 58.42 83.67 42.24 25.64 7.15 42.73 45.23 38.25 47.05 0.00 13.57 31.68
Table 3.1: Results on instance segmentation in S3DIS scenes. The metric is AP (%)
with IoU threshold 0.5. To the best of our knowledge, there are no existing instance
segmentation methods on point clouds for arbitrary object categories.

AP_0.25 AP_0.5 AP_0.75
Seg-Cluster 34.8 17.4 11.2
SGPN 52.6 36.3 18.8
Table 3.2: Comparison results on instance segmentation with different IoU thresholds
in S3DIS scenes. Metric is mean AP (%) over 13 categories.
We calculate the IoU on points between each predicted and ground truth group.
A detected instance is considered a true positive if the IoU score is greater than a
threshold. The average precision (AP) is further calculated for instance segmentation
performance evaluation. Table 3.1 shows the AP for every category with IoU threshold
0.5. To the best of our knowledge, there are no existing instance segmentation methods
on point clouds for arbitrary object categories, so we further demonstrate the capability
of SGPN to handle various objects by adding the 3D detection results of Armeni et al.
[3] on S3DIS to Table 3.1. The difference in evaluation metrics between our method
and [3] is that the IoU threshold of [3] is 0.5 on a 3D bounding box, whereas the IoU
calculation of our method is on points. Despite this difference in metrics, we can still
see our superior performance on both large and small objects.
We see that a naive method like Seg-Cluster tends to properly separate regions that are far
apart for large objects like the ceiling and floor. However, for small objects, Seg-Cluster
fails to segment instances with the same label if they are close to each other. Mean APs
with different IoU thresholds (0.25, 0.5, 0.75) are also evaluated in Table 3.2. Figure 3.4
shows qualitative comparison results.

Mean table chair sofa board
PointNet [84] 24.24 46.67 33.80 4.76 11.72
Seg-Cluster 18.72 33.44 22.8 5.38 13.07
SGPN 30.20 49.90 40.87 6.96 13.28
Table 3.3: Comparison results on 3D detection in S3DIS scenes. SGPN uses PointNet
as baseline. The metric is AP with IoU threshold 0.5.

Mean IoU Accuracy
PointNet [84] 49.76 79.66
SGPN 50.37 80.78
Table 3.4: Results on semantic segmentation in S3DIS scenes. SGPN uses PointNet as
baseline. Metric is mean IoU (%) over 13 classes (including clutter).
Once we have instance segmentation results, we can compute the bounding box
for every instance and thus produce 3D object detection predictions. In Table 3.3, we
compare our method with the 3D object detection system introduced in PointNet [84],
which to the best of our knowledge is the state-of-the-art method for 3D detection on
S3DIS. Detection performance is evaluated over 4 categories with AP at IoU threshold 0.5.
The method introduced in PointNet clusters points given semantic segmentation
results and uses a binary classification network for each category to separate close
objects of the same category. Our method outperforms it by a large margin, and unlike
PointNet does not require an additional network, which unnecessarily introduces additional
complexity during both train and test time and local minima during training.
SGPN can effectively separate the difficult cases of objects of the same semantic class
but different instances (c.f. Figure 3.4), since points in different instances are far apart
in feature space even though they have the same semantic label.
Figure 3.4: Comparison results on S3DIS. (a) Ground truth for instance segmentation.
Different colors represent different instances. (b) SGPN instance segmentation results.
(c) Seg-Cluster instance segmentation results. (d) Ground truth for semantic segmentation.
(e) Semantic segmentation and 3D detection results of SGPN. The color of the
detected bounding box for each object category is the same as the semantic labels.
Method | Mean | per-category AP (19 classes)
Seg-Cluster | 23.2 | 31.0 70.1 27.1 1.3 25.8 20.3 13.9 11.1 24.3 4.4 16.3 3.6 32.0 25.5 36.3 50.9 12.9 2.2 23.5
SGPN | 26.5 | 55.9 53.3 27.8 0.0 27.4 59.6 28.9 6.1 33.9 2.0 19.7 2.0 29.4 30.7 39.1 43.6 17.6 1.2 25.9
SGPN-CNN | 30.5 | 56.4 55.4 35.2 0.0 42.6 50.6 23.1 21.1 31.8 7.5 22.7 6.4 39.9 33.5 42.4 54.8 21.3 3.8 32.1
Table 3.5: Results on instance segmentation in NYUV2. The metric is AP with IoU 0.5; the first column is the mean over the 19 object categories.
We further compare our semantic segmentation results with PointNet in Table 3.4. SGPN
outperforms its baseline with the help of its similarity matrix.
Figure 3.5: SGPN instance segmentation results on S3DIS. The first row shows the prediction
results. The second row shows the ground truths. Different colors represent different
instances. The third row shows the predicted semantic segmentation results. The fourth row
shows the ground truths for semantic segmentation.

3.4.3 NYUV2 Object Detection and Instance Segmentation Evaluation
We evaluate the effectiveness of our approach on partial 3D scans on the NYUV2
dataset. In this dataset, 3D point clouds are lifted from a single RGBD image. An
image of size H×W can produce H×W points. We subsample this point cloud by
resizing the image to H/4 × W/4 and get the corresponding points using a nearest neighbor
search. Both our training and testing experiments are conducted on such a point cloud.
PointNet++ is used as our baseline.
In [87], 2D CNN features are combined with the 3D point cloud for RGBD semantic segmentation.
By leveraging the flexibility of SGPN, we also seamlessly integrate 2D CNN
features from RGB images to boost performance. A 2D CNN consumes an RGBD map
and extracts feature maps F_2 of size H/4 × W/4 × N_F2. Since there are H/4 × W/4
subsampled points for every image, a feature vector of size N_F2 can be extracted from F_2 at
each pixel location. Every feature vector is concatenated to F (the N_p × N_F feature matrix
produced by PointNet/PointNet++ as mentioned in Section 3.3.1) for each corresponding
point, yielding a feature map of size N_p × (N_F + N_F2), which we then feed to our
output branches. Figure 3.6 illustrates this procedure; we call this pipeline SGPN-CNN.
In our experiments, we use a pre-trained AlexNet model [62] (with the first layer stride
set to 1) and extract F_2 from the conv5 layer. We use H×W = 316×415 and N_p = 8137.
The 2D CNN and SGPN are trained jointly.

Figure 3.6: Incorporating CNN features in SGPN.

Figure 3.7: SGPN instance segmentation results on NYUV2. (a) Input point clouds. (b)
Ground truths for instance segmentation. (c) Instance segmentation results with SGPN.
(d) Instance segmentation results with SGPN-CNN.
Method | Mean | per-category AP (4 classes)
Deep Sliding Shapes [105] | 37.55 | 58.2 36.1 27.2 28.7
Deng and Latecki [23] | 35.55 | 46.4 33.1 33.3 29.4
SGPN | 36.25 | 44.4 30.4 46.1 24.4
SGPN-CNN | 41.30 | 50.8 34.8 49.4 30.2
Table 3.6: Comparison results on 3D detection (AP with IoU 0.5) in NYUV2. Please
note that we use point groups for inference, while [105, 23] use large bounding boxes with
invisible regions as ground truth. Our prediction is the tight bounding box on points,
which makes the IoU much smaller than in [105, 23].
Mean airplane bag cap car chair headphone guitar knife lamp laptop motor mug pistol rocket skateboard table
[86] 84.6 80.4 80.9 60.0 76.8 88.1 83.7 90.2 82.6 76.9 94.7 68.0 91.2 82.1 59.9 78.2 87.5
SGPN 85.8 80.4 78.6 78.8 71.5 88.6 78.0 90.9 83.0 78.8 95.8 77.8 93.8 87.4 60.1 92.3 89.4
Table 3.7: Semantic segmentation results on the ShapeNet part dataset. Metric is mean
IoU (%) on points.
Evaluation is performed on 19 object categories. Figure 3.7 shows qualitative instance
segmentation results of SGPN. Table 3.5 shows comparisons between Seg-Cluster
and our SGPN and SGPN-CNN frameworks on instance segmentation. The evaluation
metric is average precision (AP) with IoU threshold 0.5.
The margin of improvement for SGPN compared to Seg-Cluster is not as high as it
is on S3DIS, because in this dataset objects with the same semantic label are usually far
apart in Euclidean space. Additionally, naive methods like Seg-Cluster benefit since the
points of a single instance are not connected due to occlusion in partial scanning, making
it easy to separate an instance into parts. Table 3.5 also illustrates that SGPN can effectively utilize
CNN features. Instead of concatenating the fully-connected layers of the 2D and 3D networks as
in [105], we combine 2D and 3D features by considering their geometric relationships.
We further calculate bounding boxes from the instance segmentation results. Table 3.6
compares our work with the state-of-the-art works [105, 23] on NYUV2 3D object
Figure 3.8: Qualitative results on the ShapeNet part dataset. (a) Generated ground truth for
instance segmentation. (b) SGPN instance segmentation results. (c) Semantic segmentation
results of PointNet++. (d) Semantic segmentation results of SGPN.
detection. Following the evaluation metric in [104], AP is calculated with IoU threshold
0.25 on 3D bounding boxes. The NYUV2 dataset provides ground truth 3D bounding
boxes that encapsulate the whole object including the part that is invisible in the depth
image. Both [105] and [23] use these large ground truth bounding boxes for inference.
In our method, we infer point groupings, which lack information of the invisible part of
the object. Our output is the derived tight bounding box around the grouped points in
the partial scan, which makes our IoUs much smaller than [105, 23]. However, we can
still see the effectiveness of SGPN on the task of 3D object detection on partial scans as
our method achieves better performance on small objects.
Computation Speed To benchmark the testing time against [105, 23] and make a fair
comparison, we run our framework on an Nvidia K40 GPU. SGPN takes 170ms and
around 400MB of GPU memory per sample. SGPN-CNN takes 300ms and 1.4GB of GPU memory
per sample. GroupMerging takes 180ms on an Intel i7 CPU. In contrast, the detection
network in [23] takes 739ms on an Nvidia Titan X GPU, and in [105], RPN takes 5.62s and
ORN takes 13.93s per image on an Nvidia K40 GPU. Our model improves the efficiency
and reduces GPU memory usage by a large margin.
3.4.4 ShapeNet Part Instance Segmentation
Following the settings in [86], point clouds are generated by uniformly sampling shapes
from ShapeNet [9]. In our experiments we sample each shape into 2048 points. The XYZ
coordinates of the points are fed into the network as input with size 2048×3. To generate ground truth
labels for part instance segmentation from the semantic segmentation results, we perform
DBSCAN clustering on each part category of an object to group points into instances.
This experiment is conducted as a toy example to demonstrate the effectiveness of our
approach on instance segmentation for point clouds.
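A minimal sketch of this instance ground-truth generation using scikit-learn's DBSCAN; the eps and min_samples values are illustrative, not the settings used in our experiments.

import numpy as np
from sklearn.cluster import DBSCAN

def make_instance_labels(points, part_labels, eps=0.1, min_samples=10):
    """Cluster the points of each part category with DBSCAN so that every
    cluster becomes one part instance (sketch)."""
    instance_labels = -np.ones(len(points), dtype=np.int64)
    next_id = 0
    for part in np.unique(part_labels):
        idx = np.where(part_labels == part)[0]
        clusters = DBSCAN(eps=eps, min_samples=min_samples).fit(points[idx]).labels_
        for c in np.unique(clusters):
            if c == -1:                      # DBSCAN noise points keep label -1
                continue
            instance_labels[idx[clusters == c]] = next_id
            next_id += 1
    return instance_labels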
We use PointNet++ as our baseline. Figure 3.8(b) illustrates the instance segmentation
results. For instance results, we again use different colors to represent different
instances, and point colors of the same group are not necessarily the same as in the ground
truth. Since the generated ground truths are not "real" ground truths, only qualitative
results are provided. SGPN achieves good results even under challenging conditions.
As we can see from Figure 3.8, SGPN is able to group the chair legs into four
instances even though DBSCAN cannot separate the chair legs in the generated ground
truth.
The similarity matrix can also help the training of the semantic segmentation branch. We
compare SGPN to PointNet++ (i.e., our framework with solely a semantic segmentation
branch) on semantic segmentation in Table 3.7. The inputs of both networks are point
clouds of size 2048. The evaluation metric is mIoU on the points of each shape category. Our
model performs better than PointNet++ due to the similarity matrix. Qualitative results
are shown in Figure 3.8. Some false segmentation predictions are refined with the help of
SGPN.
3.5 Conclusion
We present SGPN, an intuitive, simple, and flexible framework for 3D instance seg-
mentation on point clouds. With the introduction of the similarity matrix as our output
representation, group proposals with class predictions can be easily generated from a
single network. Experiments show that our algorithm can achieve good performance
on instance segmentation for various 3D scenes and facilitate the tasks of 3D object
detection and semantic segmentation.
Future Work While a similarity matrix provides an intuitive representation and an
easily defined loss function, one limitation of SGPN is that the size of the similarity
matrix scales quadratically as N_p increases. Thus, although much more memory-efficient
than volumetric methods, SGPN cannot process extremely large scenes on the
order of 10^5 or more points. Future research directions can consider generating groups
using seeds that are selected based on SGPN to reduce the size of the similarity matrix.
Currently trained in a fully supervised manner using a hinge loss, SGPN can also be
extended in future work to learn in a more unsupervised setting or to learn different
kinds of data representations beyond instance segmentation.
Chapter 4
3D Modeling: Shape Inpainting
4.1 Introduction
Data collected by 3D sensors (e.g. LiDAR, Kinect) are often impacted by occlusion,
sensor noise, and illumination, leading to incomplete and noisy 3D models. For exam-
ple, a building scan occluded by a tree leads to a hole or gap in the 3D building model.
However, a human can comprehend and describe the geometry of the complete building
based on the corrupted 3D model. Our 3D inpainting method attempts to mimic this
ability to reconstruct complete 3D models from incomplete data.
Convolutional Neural Network (CNN) based methods [34, 89, 80, 133] yield impres-
sive results for 2D image generation and image inpainting. Generating and inpainting
3D models is a new and more challenging problem due to its higher dimensionality. The
availability of large 3D CAD datasets [9, 129] and CNNs for voxel (spatial occupancy)
models [127, 96, 18] enabled progress in learning 3D representation, shape generation
and completion. Despite their encouraging results, artifacts still persist in their generated
shapes. Moreover, their methods are all based on 3D CNNs, which impedes their
ability to handle higher resolution data due to limited GPU memory.
This chapter is based on the paper Shape Inpainting using 3D Generative Adversarial Network and
Recurrent Convolutional Networks by Weiyue Wang, Qiangui Huang, Suya You, Chao Yang and Ulrich
Neumann, ICCV , 2017.
Figure 4.1: Our method completes a corrupted 3D scan using a convolutional Encoder-Decoder
generative adversarial network at low resolution. The outputs are then sliced
into a sequence of 2D images, and a recurrent convolutional network is further introduced
to produce a high-resolution completion prediction.
In this paper, a new system for 3D object inpainting is introduced to overcome the
aforementioned limitations. Given a 3D object with holes, we aim to (1) fill the missing
or damaged portions and reconstruct a complete 3D structure, and (2) further predict
high-resolution shapes with fine-grained details. We propose a hybrid network structure
based on 3D CNN that leverages the generalization power of a Generative Adversarial
model and the memory efficiency of Recurrent Neural Network (RNN) to handle 3D
data sequentially. The framework is illustrated in Figure 4.1.
More specifically, a 3D Encoder-Decoder Generative Adversarial Network (3D-ED-GAN)
is first proposed to generalize geometric structures and map corrupted scans
to complete shapes at low resolution. Like a variational autoencoder (VAE) [60, 96],
3D-ED-GAN utilizes an encoder to map voxelized 3D objects into a probabilistic latent
space, and a Generative Adversarial Network (GAN) to help the decoder predict the
complete volumetric objects from the latent feature representation. We train this net-
work by minimizing both contextual loss and an adversarial loss. Using GAN, we can
not only preserve contextual consistency of the input data, but also inherit information
from data distribution.
Secondly, a Long-term Recurrent Convolutional Network (LRCN) is further intro-
duced to obtain local geometric details and produce much higher resolution volumes.
3D CNN requires much more GPU memory than 2D CNN, which impedes volumetric
network analysis of high-resolution 3D data. To overcome this limitation, we model
the 3D objects as sequences of 2D slices. By utilizing the long-range learning capabil-
ity from a series of conditional distributions of RNN, our LRCN is a Long Short-term
Memory Network (LSTM) where each cell has a CNN encoder and a fully-convolutional
decoder. The outputs of 3D-ED-GAN are sliced into 2D images, which are then fed into
the LRCN, which gives us a sequence of high-resolution images.
Our hybrid network is an end-to-end trainable network which takes corrupted low
resolution 3D structures and outputs complete and high-resolution volumes. We evaluate
the proposed method qualitatively and quantitatively on both synthesized and real 3D
scans in challenging scenarios. To further evaluate the ability of our model to capture
shape features during 3D inpainting, we test our network for 3D object classification
tasks and further explore the encoded latent vector to demonstrate that this embedded
representation contains abundant semantic shape information.
The main contributions of this paper are:
1. a 3D Encoder-Decoder Generative Adversarial Convolutional Neural Network
that inpaints holes in 3D models, which can further help 3D shape feature learning
and help object recognition.
2. a Long-term Recurrent Convolutional Network that produces high resolution 3D
volumes with fine-grained details by modeling volumetric data as sequences of
2D images to overcome GPU memory limitation.
3. an end-to-end network that combines the above two ideas and completes corrupted
3D models, while also producing high resolution volumes.
4.2 Related Work
4.2.1 Generative models
Generative Adversarial Network (GAN) [34] generates images by jointly training a gen-
erator and a discriminator. Following this pioneering work, a series of GAN mod-
els [89, 24] were developed for image generation tasks. Pathak et al. [80] developed
a context encoder in an unsupervised learning algorithm for image inpainting. Genera-
tive adversarial loss in their autoencoder-like network architecture achieves impressive
performance for image inpainting.
With the introduction of 3D CAD model datasets [129, 9], recent developments in
3D generative models use data-driven methods to synthesize new objects. CNN is used
to learn embedded object representations. Bansal et al. [5] introduced a skip-network
model to retrieve 3D models for objects depicted in 2D images of CAD data. Choy et
al. [15] used a recurrent network with multi-view images for 3D model reconstruction.
Girdhar [31] proposed a TL-embedding network to learn an embedding space that can be generative in 3D and predictable from 2D rendered images. Wu et al. [127] showed that the latent vector learned by 3D GAN can generate high-quality 3D objects and improve object recognition accuracy as a shape descriptor. They also added an image encoder to 3D GAN to generate 3D models from 2D images. Yan et al. [132] formulated
an encoder-decoder network with a loss by perspective transformation for predicting 3D
models from a single-view 2D image.
4.2.2 3D Completion
Recent advances in deep learning have shown promising results in 3D completion. Wu
et al. [129] built a generative model with Convolutional Deep Belief Network by learn-
ing a probabilistic distribution from 3D volumes for shape completion from 2.5D depth
maps. Sharma [96] introduced a fully convolutional autoencoder that learns volumet-
ric representation from noisy data by estimating voxel occupancy grids. This is the
state of the art for 3D volumetric occupancy grid inpainting to the best of our knowl-
edge. An important benefit of our 3D-ED-GAN over theirs is that we introduce GAN to
inherit information from the data distribution. Dai et al. [18] introduced a 3D-Encoder-
Predictor Network to predict and fill missing data for 3D distance field and proposed
a 3D synthesis procedure to obtain high-resolution objects. This is the state-of-the-art
method for high-resolution object completion. However, instead of an end-to-end net-
work, their shape synthesis procedure requires iterating over every sample in the dataset.
Since we are using occupancy grids to represent 3D shapes, we do not compare with
them in our experiment. Song et al. [107] synthesized a 3D scenes dataset and proposed
a semantic scene completion network to produce complete 3D volumes and semantic
labels for a scene from single-view depth map. Despite the encouraging results of the
works mentioned above, these methods are mostly based on 3D CNN, which requires
much more GPU memory than 2D convolution and impedes handling high-resolution
data.
4.2.3 Recurrent Neural Networks
RNNs have been shown to excel at hard sequence problems ranging from natural lan-
guage translation [54], to video analysis [25]. By implicitly conditioning on all previous
variables and preserving long-range contextual dependencies, RNNs are also suitable
for dense prediction tasks such as semantic segmentation [116, 8], and image comple-
tion [115]. Donahue et al. [25] applied 2D CNN and LSTM on 3D data (video) and
developed a recurrent convolutional architecture for video recognition. Oord et al. [115]
presented a deep network that sequentially predicts the pixels in an image along two
spatial dimensions. Choy et al. [15] used a recurrent network and a CNN to reconstruct
3D models from a sequence of multi-view images. Following these pioneering works, we apply RNN to 3D object data and predict dense volumes as sequences of 2D pixels.
4.3 Methods
The goal of this paper is to take a corrupted 3D object in low resolution as input and
produce a complete high-resolution model as output. The 3D model is represented as
volumetric occupancy grids. To fill the missing data requires an approach that can make
conceivable predictions from data distributions as well as preserve structural context of
the imperfect input.
We introduce a 3D Encoder-Decoder CNN by extending a 3D Generative Adversarial Network [127], namely the 3D Encoder-Decoder Generative Adversarial Network (3D-ED-GAN), to accomplish the 3D inpainting task. Since 3D CNN is memory consuming and applying 3D-ED-GAN to a high-resolution volume is impractical, we only use 3D-ED-GAN to operate on low-resolution voxels (say $32^3$). Then we treat the 3D volume output
of 3D-ED-GAN as a sequence of 2D images and reconstruct the object slice by slice. A
Long-term Recurrent Convolutional Network (LRCN) based on LSTM is proposed to
recover fine-grained details and produce high-resolution results. LRCN functions as an
upsampling network while completing details by learning from the dataset.
We now describe our network structure of 3D-ED-GAN and LRCN respectively and
the details of the training procedure.
4.3.1 3D Encoder-Decoder Generative Adversarial Network (3D-
ED-GAN)
The Generative Adversarial Network (GAN) consists of a generatorG that maps a noise
distribution Z to the data space X, and a discriminator D that classifies whether the
Figure 4.2: Network architecture of our 3D-ED-GAN. The 3D encoder, 3D decoder and discriminator each use three convolutional (or deconvolutional) layers with 64, 128 and 256 filters, kernel size 5 and stride 2.
generated sample is real or fake. G and D are both deep networks that are learned
jointly. D distinguishes real samples from synthetic data. G tries to generate "real"
samples to confuse D. Concretely, the objective of GAN is to achieve the following
optimization:
\min_G \max_D \big( \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \big),   (4.1)
where $p_{data}$ is the data distribution and $p_z$ is the noise distribution.
Network structure 3D-ED-GAN extends the general GAN framework by modeling
the generatorG as a fully-convolutional Encoder-Decoder network, where the encoder
maps input data into a latent vector z. Then the decoder maps z to a cube. The 3D-
ED-GAN consists of three components: an encoder, a decoder and a discriminator.
Figure 4.2 depicts the algorithmic architecture of 3D-ED-GAN.
The encoder takes a corrupted 3D volume $x'$ of size $d_l^3$ (say $d_l = 32$) as input. It
consists of three 3D convolutional layers with kernel size 5 and stride 2, connected via
batch normalization (BN) [50] and ReLU [43] layers. The last convolutional layer is
reshaped into a vector $z$, which is the latent feature representation. There are no fully-connected (fc) layers. The noise vector in GAN is replaced with $z$. Therefore, the
3D-ED-GAN network conditions z using the 3D encoder. We show that this latent
vector carries informative features for supervised tasks in Section 4.4.2.
The decoder has the same architecture as $G$ in GAN, which maps the latent vector $z$ to a 3D voxel of size $d_l^3$. It has three volumetric full-convolution (also known as deconvolution) layers of kernel size 5 and stride 2, with BN and ReLU layers
added in between. A tanh activation layer is added after the last layer. The Encoder-
Decoder network is a fully-convolutional neural network without linear or pooling lay-
ers.
The discriminator has the same architecture as the encoder with an fc layer and a
sigmoid layer at the end.
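To make the layer configuration above concrete, the following is a minimal Keras-style sketch of the three components. The filter counts (64, 128, 256) and kernel/stride settings follow the description and Figure 4.2; all function names here are illustrative, not taken from the released code.

import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_block(x, filters, transpose=False, bn_relu=True):
    # 3D (de)convolution with kernel size 5 and stride 2, optionally followed by BN + ReLU.
    Conv = layers.Conv3DTranspose if transpose else layers.Conv3D
    x = Conv(filters, kernel_size=5, strides=2, padding="same")(x)
    if bn_relu:
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    return x

def build_3d_ed_gan(d_l=32):
    # Encoder: three 3D convolutions, flattened into the latent vector z (no fc layer).
    vol_in = layers.Input(shape=(d_l, d_l, d_l, 1))
    x = vol_in
    for f in (64, 128, 256):
        x = conv_block(x, f)
    z = layers.Flatten()(x)                      # z has dimension 256 * (d_l/8)^3

    # Decoder: three 3D deconvolutions mirroring the encoder, tanh at the end.
    y = layers.Reshape((d_l // 8, d_l // 8, d_l // 8, 256))(z)
    y = conv_block(y, 128, transpose=True)
    y = conv_block(y, 64, transpose=True)
    y = conv_block(y, 1, transpose=True, bn_relu=False)
    out = layers.Activation("tanh")(y)
    generator = Model(vol_in, out)

    # Discriminator: same trunk as the encoder, plus an fc layer and a sigmoid.
    d_in = layers.Input(shape=(d_l, d_l, d_l, 1))
    h = d_in
    for f in (64, 128, 256):
        h = conv_block(h, f)
    h = layers.Flatten()(h)
    d_out = layers.Dense(1, activation="sigmoid")(h)
    discriminator = Model(d_in, d_out)
    return generator, discriminator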
Loss function The generator $G$ in 3D-ED-GAN is modeled by the Encoder-Decoder network. This can be viewed as a conditional GAN, in which the latent distribution is conditioned on given context data. Therefore, the loss function can be derived by reformulating the objective function in Equation 4.1:
L_{GAN} = \mathbb{E}_{x \sim p_{data}(x)}\big[\log D(x) + \log(1 - D(F_{ed}(x')))\big],   (4.2)
where $F_{ed}(\cdot): X \rightarrow X$ is the Encoder-Decoder network, and $x'$ is the corrupted model of the complete volume $x$.
Similar to [80], we add an object reconstruction Cross-Entropy loss, $L_{recon}$, defined by
L_{recon} = -\frac{1}{N}\sum_{i=1}^{N}\big[x_i \log F_{ed}(x')_i + (1 - x_i)\log(1 - F_{ed}(x')_i)\big],   (4.3)
where $N = d_l^3$, $x_i$ represents the $i$th voxel of the complete volume $x$ and $F_{ed}(x')_i$ is the $i$th voxel of the generated volume. In this way, the output of the Encoder-Decoder network $F_{ed}(x')$ is the probability of a voxel being filled.
The overall loss function for 3D-ED-GAN is
L_{3D\text{-}ED\text{-}GAN} = \alpha_1 L_{GAN} + \alpha_2 L_{recon},   (4.4)
where $\alpha_1$ and $\alpha_2$ are weight parameters.
The loss function can effectively infer the structures of missing regions to produce
conceivable reconstructions from the data distribution. Inpainting requires maintaining
coherence of given context and producing plausible information according to the data
distribution. 3D-ED-GAN has the capability of capturing the correlation between a
latent space and the data distribution, thus producing plausible hypotheses.
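As a concrete reading of Equations 4.2-4.4, here is a minimal NumPy sketch of the loss computation. The weight symbols $\alpha_1, \alpha_2$ follow Equation 4.4 (with the values used later in training); the clipping epsilon and function names are implementation assumptions, not part of the thesis.

import numpy as np

def gan_loss(d_real, d_fake, eps=1e-8):
    # Equation 4.2: log D(x) + log(1 - D(F_ed(x'))), averaged over the batch.
    return np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps))

def recon_loss(x, x_hat, eps=1e-8):
    # Equation 4.3: voxel-wise cross-entropy between the complete volume x
    # and the generated occupancy probabilities x_hat = F_ed(x').
    return -np.mean(x * np.log(x_hat + eps) + (1 - x) * np.log(1 - x_hat + eps))

def ed_gan_loss(x, x_hat, d_real, d_fake, a1=0.001, a2=0.999):
    # Equation 4.4: weighted sum of the adversarial and reconstruction terms.
    return a1 * gan_loss(d_real, d_fake) + a2 * recon_loss(x, x_hat)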
4.3.2 Long-term Recurrent Convolutional Network (LRCN) Model
3D CNN consumes much more GPU memory than 2D CNN. Extending 3D-ED-GAN
by adding 3D convolution layers to produce high-resolution output is impractical due to memory limitations. We take advantage of the capability of RNN to handle long-term
sequential dependencies and treat the 3D object volume as slices of 2D images. The net-
work is required to map a volume of dimension $d_l^3$ to a volume of dimension $d_h^3$ (we have $d_l = 32$, $d_h = 128$). For a sequence-to-sequence problem with different input
and output dimensions, we integrate an encoder-decoder pair to the LSTM cell inspired
by the video processing work [25]. Our LRCN model combines an LSTM, a 3D CNN,
and 2D deep fully-convolutional network. It works by passing each 2D slice with its
neighboring slices through a 3D CNN to produce a fixed-length vector representation as
input to LSTM. The output vector of LSTM is passed through a 2D fully-convolutional
decoder network and mapped to a high-resolution image. A sequence of high-resolution
2D images formulate the output 3D object volume. Figure 4.3 depicts our LRCN archi-
tecture.
Formulation of Sequential Input In order to obtain the maximal amount of con-
textual data from each 3D object volume, we would like to maximize the number of
nonempty slices for the volume. So given a 3D object volume of dimension $d_l^3$, we firstly use principal component analysis (PCA) to align the 3D object and denote the aligned volume as $I$ and its first principal component as direction $\vec{l}$¹. Then $I$ is treated as a sequence of $d_l \times d_l$ 2D images along $\vec{l}$, denoted by $\{I_1, I_2, \ldots, I_{d_l}\}$. Since the output of LRCN is a sequence of length $d_h$, the input sequence length should also be $d_h$. As illustrated in Figure 4.3, for each step, a slice with its 4 neighboring slices (so 5 slices total) is formed into a thin volume and fed into the network, say for step $t$. Slices with negative indices, or indices beyond $d_l$, are 0-padded. The input of the 3D CNN is then $v'_t = \{I_{t/(d_h/d_l)-2},\; I_{t/(d_h/d_l)-1},\; I_{t/(d_h/d_l)},\; I_{t/(d_h/d_l)+1},\; I_{t/(d_h/d_l)+2}\}$.
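A minimal NumPy sketch of this slicing step follows. The ceiling-based index mapping from step $t$ to a slice index is one plausible reading of the notation above, and the function name is illustrative.

import numpy as np

def slice_volume(vol, d_h=128, c=5):
    # vol: aligned d_l x d_l x d_l occupancy grid, sliced along its first axis
    # (assumed here to be the first principal component after PCA alignment).
    d_l = vol.shape[0]
    inputs = []
    for t in range(1, d_h + 1):
        center = int(np.ceil(t / (d_h / d_l)))       # map step t in [1, d_h] to a slice index in [1, d_l]
        thin = []
        for k in range(center - c // 2, center + c // 2 + 1):
            if 1 <= k <= d_l:
                thin.append(vol[k - 1])
            else:
                thin.append(np.zeros_like(vol[0]))    # 0-pad out-of-range indices
        inputs.append(np.stack(thin, axis=-1))        # d_l x d_l x c thin volume v'_t
    return inputs                                     # length-d_h sequence fed to the 3D CNN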
Network structure As illustrated in Figure 4.3, the 3D CNN encoder takes a $d_l \times d_l \times c$ volume as input, where $c$ represents the number of slices (we have $c = 5$). At step $t$, the 3D CNN transforms the $c$ slices of 2D images $v'_t$ into a 200-D vector $v_t$. The 3D CNN encoder
¹ In our experimental implementation, we use PCA to align the corrupted objects instead of the output of 3D-ED-GAN.
Figure 4.3: Framework for LRCN. The 3D input volumes are aligned by PCA and sliced along the first principal component into 2D images. LRCN processes $c$ ($c = 5$) consecutive images with a 3D CNN, whose outputs are fed into LSTM. The outputs of LSTM further go through a 2D CNN and produce a sequence of high-resolution 2D images. The concatenation of these 2D images is the high-resolution 3D completion result.
has the same structure as the 3D encoder in 3D-ED-GAN, with an fc layer at the end.
After the 3D CNN, the recurrent model LSTM takes over. We use the LSTM cell as
described in [140]: Given input $v_t$, the LSTM updates at timestep $t$ are:
i_t = \sigma(W_{vi} v_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)
f_t = \sigma(W_{vf} v_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)
c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{vc} v_t + W_{hc} h_{t-1} + b_c)   (4.5)
o_t = \sigma(W_{vo} v_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)
h_t = o_t \odot \tanh(c_t)
where $\sigma$ is the logistic sigmoid function, $i, f, o, c$ are respectively the input gate, forget gate, output gate and cell gate, and $W_{vi,vf,vc,vo,hi,hf,hc,ho,ci,co,cf}$ and $b_{i,f,c,o}$ are parameters.
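The updates in Equation 4.5 can be transcribed directly into a small NumPy step, as sketched below; the weight shapes, the dict-based parameter layout and the element-wise products are assumptions made for illustration only.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(v_t, h_prev, c_prev, W, b):
    # One LSTM update as in Equation 4.5; W and b are dicts of weight
    # matrices/biases keyed by the subscripts used in the text.
    i = sigmoid(W["vi"] @ v_t + W["hi"] @ h_prev + W["ci"] @ c_prev + b["i"])
    f = sigmoid(W["vf"] @ v_t + W["hf"] @ h_prev + W["cf"] @ c_prev + b["f"])
    c = f * c_prev + i * np.tanh(W["vc"] @ v_t + W["hc"] @ h_prev + b["c"])
    o = sigmoid(W["vo"] @ v_t + W["ho"] @ h_prev + W["co"] @ c + b["o"])
    h = o * np.tanh(c)
    return h, c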
The output vector of the LSTM, $o_t$, further goes through a 2D fully-convolutional neural network to generate a $d_h \times d_h$ image. It has two fully-convolutional layers of kernel size 5 and stride 2, with BN and ReLU in between, followed by a tanh layer at the end.
Loss function We experimented with both $\ell_1$ and $\ell_2$ losses and found that the $\ell_1$ loss obtains higher-quality results. Thus, the $\ell_1$ loss is adopted to train our LRCN, denoted by $L_{LRCN}$.
The overall loss to jointly train the hybrid network (the combination of 3D-ED-GAN and LRCN) is
L = \alpha_3 L_{3D\text{-}ED\text{-}GAN} + \alpha_4 L_{LRCN},   (4.6)
where $\alpha_3$ and $\alpha_4$ are weight parameters.
Although the LRCN contains a 3D CNN encoder, the thin input slices make the network sufficiently small compared to a regular volumetric CNN. By taking advantage of RNN's ability to handle sequential data and long-range dependencies, our memory-efficient network is able to produce high-resolution completion results.
4.3.3 Training the hybrid network
Training our 3D-ED-GAN and LRCN both jointly and from scratch is a challenging
task. Therefore, we propose a three-phase training procedure.
In the first stage, 3D-ED-GAN is trained independently with corrupted 3D input and
complete output references in low resolution. Since the discriminator learns much faster
than the generator, we first train the Encoder-Decoder network independently without
discriminator (with only the reconstruction loss). The learning rate is fixed to $10^{-5}$, and 20 epochs are trained. Then we jointly train the discriminator and the Encoder-Decoder as in [89] for 100 epochs. We set the learning rate of the Encoder-Decoder to $10^{-4}$, and that of $D$ to $10^{-6}$. Then $\alpha_1$ and $\alpha_2$ are set to 0.001 and 0.999 respectively. For each batch, we only update the discriminator if its accuracy on the last batch is not higher than 80%, as in [127]. ADAM [59] optimization is employed with $\beta = 0.5$ and a batch size of 4.
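The 80% gating rule of this joint stage can be summarized with the following training-loop sketch; the callables passed in (generator update, discriminator update, accuracy evaluation) are placeholders supplied by the training code, not names from the thesis.

def joint_stage(batches, update_generator, update_discriminator, disc_accuracy):
    # update_generator / update_discriminator / disc_accuracy are callables
    # provided by the surrounding training code (illustrative placeholders).
    last_acc = 0.0
    for batch in batches:
        update_generator(batch)          # the Encoder-Decoder is updated every batch
        if last_acc <= 0.80:             # gate the discriminator update at 80% accuracy
            update_discriminator(batch)
        last_acc = disc_accuracy(batch)  # accuracy on this batch gates the next one
    return last_acc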
In the second stage, LRCN is trained independently with perfect 3D input in low resolution and high-resolution output references for 100 epochs. We use a learning rate of $10^{-4}$ and a batch size of 4. In this stage, LRCN works as an upsampling network capable of predicting fine-grained details from trained data distributions.
In the final training phase, we jointly finetune the hybrid network on the pre-trained
networks in the first and second stages with loss defined as in Equation 4.6. The learning
rate of the discriminator is $10^{-7}$ and the learning rate of the remaining network is set to $10^{-6}$ with batch size 1. Then $\alpha_3$ and $\alpha_4$ are both set to 0.5. We observe that most of
the parameter updates happen in LRCN. The input of LRCN in this stage is imperfect
and the output reference is still complete high-resolution model, which indicates that
LRCN works as a denoising network while maintaining its power of upsampling and
preserving details.
For convenience, we use the aforementioned PCA method to align all models before
training instead of aligning the predictions of 3D-ED-GAN.
4.4 Experiments
Our network architecture is implemented using the deep learning library Tensorflow [1].
We extensively test and evaluate our method using various datasets canonical to 3D
inpainting and feature learning.
We split each category in the ShapeNet dataset [9] into mutually exclusive 80% training points and 20% testing points. Our network is trained on the training points as stated in Section 4.3.3. We train separate networks for seven major categories (chairs, sofas, tables, boats, airplanes, lamps, dressers, and cars) without fine-tuning on any existing models. 3D meshes are voxelized into $32^3$ grids for low-resolution input and $128^3$ grids
for high-resolution output reference. The input 3D volumes are synthetically corrupted
to simulate the imperfections of a real-world 3D scanner.
The following experiments are conducted with the trained model: We first evaluate the inpainting performance of 3D-ED-GAN on both real-world 3D range scan data and the held-out 20% ShapeNet testing set with various injected noise. Ablation experiments
are conducted to assess the capability of producing high-resolution completion results
from the combination of 3D-ED-GAN and LRCN. We also compare with the state-of-
the-art method. Then, we evaluate the capability of 3D-ED-GAN as a feature learning
framework.
4.4.1 3D Objects Inpainting
Our hybrid network has 26.3M parameters and requires 7.43 GB of GPU memory. If we add two more full-convolution layers in the decoder and two more convolution layers in the discriminator of 3D-ED-GAN to produce high-resolution output, the network has 116.4M parameters and does not fit into the GPU memory. For comparison between low-resolution
and high-resolution results, we simply upsample the prediction of 3D-ED-GAN and do
numerical comparisons.
Real-World Scans
We test 3D-ED-GAN and LRCN on both real-world and synthetic data. The real-world
scans are from the work of [85]. They reconstructed 3D meshes from RGB-D data and we voxelized these meshes into $32^3$ grids for testing. Our network is trained on the ShapeNet
dataset as in Section 4.3.3. Before testing, all shapes are aligned using PCA as stated
in Section 4.3.2. Figure 4.4 shows shape completion examples on real-world scans for
Figure 4.4: 3D completion results on real-world scans. Inputs are the voxelized scans.
3D-ED-GAN represents the low-resolution completion result without going through
LRCN. Hybrid represents the high-resolution completion result of the combination of
3D-ED-GAN and LRCN.
both low-resolution and high-resolution outputs. We use 3D-ED-GAN to represent the
low-resolution output of 3D-ED-GAN and Hybrid to denote the high-resolution output
of the combination of 3D-ED-GAN and LRCN. As we can see, our network is able to
produce plausible completion results even with large missing areas. The 3D-ED-GAN itself produces conceivable outputs, while the LRCN further improves fine-grained details.
Random Noise
We then assess our model with the split testing data of ShapeNet. This is applicable in cases where capturing the geometry of objects with 3D scanners results in holes and incomplete shapes. Since it is hard to obtain ground truth for real-world objects, we
rely on the ShapeNet dataset where complete object geometry of diversified categories
is available, and we test on data with simulated noises.
Because it is hard to predict the exact noise from 3D scanning, we test different
noise characteristics and show the robustness of our trained model. We do the following
ablation experiments:
1. 3D-ED-GAN: 3D-ED-GAN is trained as in the first training stage of Section 4.3.3.
2. LRCN: After the LRCN is pre-trained as in the second training stage of Section 4.3.3, we directly feed the partially scanned 3D volume into the LRCN as input and train LRCN independently for 100 epochs with a learning rate of $10^{-5}$ and a batch size of 4. We test the shape completion ability of this single network.
3. Hybrid: Our overall network, i.e. the combination of 3D-ED-GAN and LRCN, is
trained with the aforementioned procedure.
To have a better understanding of the effectiveness of our generative adversarial
model, we also compare qualitatively and quantitatively with VConv-DAE [96]. They
adopted a full convolutional volumetric autoencoder network architecture to estimate
voxel occupancy grids from noisy data. The major difference between 3D-ED-GAN
and VConv-DAE is the introduction of GAN. In our implementation of VConv-DAE, we
simply remove the discriminator from 3D-ED-GAN and compare the two networks with
the same parameters.
We first evaluate our model on test data with random noise. As stated above, we
adopted simulated scanning noise in our training procedure. With random noise, vol-
umes have to be recovered from limited information, and the testing set and the training set have different noise patterns. Figure 4.5 shows the results of different methods for shape completion with 50% noise injected.
Figure 4.5: 3D inpainting results with 50% injected noise on the ShapeNet test dataset (columns: input, ground truth, VConv-DAE, 3D-ED-GAN, LRCN, hybrid). For this noise type, detailed information is missing while the global structure is preserved.
We also vary the amount of noise injected to the data. For numerical comparison,
the number $n$ of generated voxels (at $128^3$ resolution) which differ from the ground truth (the object volume before corruption) is counted for each sample. The reconstruction error is $n$ divided by the total number of grid cells, $128^3$. For 3D-ED-GAN and VConv-DAE, their
predictions are computed by upsampling the low resolution output. We use mean error
for different object categories as our evaluation metric. The results are reported in Fig-
ure 4.6. It can be seen from Figure 4.5 that different methods produce similar results.
Even though 50% noise is injected, the corrupted input still maintains the global seman-
tic structure of the original 3D shape. In this way, this experiment measures the denois-
ing ability of these models. As illustrated in Figure 4.6, these models introduce noise when 0% noise is injected. LRCN performs better than the other three when the noise
percentage is low. When the input gets more corrupted, 3D-ED-GAN tends to perform
better than others.
Simulated 3D scanner
We then evaluate our network on completing shapes for simulated scanned objects. 3D
scanners such as Kinect can only capture object geometry from a single view at one
Figure 4.6: We vary the amount of random noise injected to test data and quantitatively
compare the reconstruction error.
time. In this experiment, we simulate these 3D scanners by scanning objects in the
ShapeNet dataset from a single view and evaluate the reconstruction performance of
our method from these scanned incomplete data. This is a challenging task since the
recovered region must contain semantically correct content. Completion results can be
found in Figure 4.7. Quantitative comparison results are shown in Table 4.1.
As illustrated in Figure 4.7 and Table 4.1, our model performs better than 3D-ED-
GAN, VConv-DAE and LRCN. For VConv-DAE, small or thin components of objects,
such as the pole of a lamp tend to be filtered out even though these parts exist in the input
volume. With the help of the generative adversarial model, our model is able to produce
reasonable predictions for the large missing areas that are consistent with the data dis-
tribution. The superior performance of 3D-ED-GAN over VConv-DAE demonstrates
our model benefits from the generative adversarial structure. Moreover, by comparing
the results of 3D-ED-GAN and the hybrid network, we can see the capability of LRCN
to recover local geometry. LRCN alone has difficulty capturing global context structure
Methods Reconstruction Error
VConv-DAE [96] 7.48%
3D-ED-GAN 6.55%
LRCN 7.08%
Hybrid 4.74%
Table 4.1: Quantitative shape completion results on ShapeNet with simulated 3D scan-
ner noise.
of 3D shapes. By combining 3D-ED-GAN and LRCN, our hybrid network is able to
predict global structure as well as local fine-grained details.
Overall, our hybrid network performs best by leveraging 3D-ED-GAN’s ability to
produce plausible predictions and LRCN’s power to recover local geometry.
4.4.2 Feature Learning
3D object classification
We now evaluate the transferability of unsupervised learned features obtained from
inpainting to object classification. We use the popular benchmark ModelNet10 and
ModelNet40, which are both subsets of the ModelNet dataset [129]. Both ModelNet10
and ModelNet40 are split into mutually exclusive training and testing sets. We conduct
three experiments.
1. Ours-FT: We train 3D-ED-GAN as in the first training stage stated in Section 4.3.3 on all samples of the ShapeNet dataset as pre-training and treat the encoder component (with a softmax layer added on top of $z$ as a loss layer) as our classifier. We fine-tune this CNN classifier on ModelNet10 and ModelNet40.
2. RandomInit: We directly train the classifier mentioned in Ours-FT with random initialization on ModelNet10 and ModelNet40.
Figure 4.7: Shape completion examples on ShapeNet testing points with simulated 3D scanner noise (columns: input, ground truth, VConv-DAE, 3D-ED-GAN, LRCN, hybrid).
Methods ModelNet40 ModelNet10
RandomInit 86.1% 90.5%
Ours-FT 87.3% 92.6%
3DGAN [127] 83.3% 91.0%
TL-network [31] 74.4% -
VConv-DAE-US [96] 75.5% 80.5%
Ours-SVM 84.3% 89.2%
3DShapeNet [9] 77.0% 83.5%
VConv-DAE [96] 79.8% 84.1%
VRN [7] 91.3% 93.6%
MVCNN [109] 90.1% -
MVCNN-Multi [85] 91.4% -
Table 4.2: Classification Results on ModelNet Dataset.
3. Ours-SVM: We generate $z$ (of dimension 16384) with the 3D-ED-GAN trained as in Section 4.3.3 for samples in ModelNet10 and ModelNet40 and train a linear SVM classifier with $z$ as the feature vector (a minimal sketch of this setup follows the list).
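The sketch below illustrates the Ours-SVM setup using scikit-learn's LinearSVC; the encoder callable stands in for running the frozen 3D-ED-GAN encoder and is a placeholder, not a name from the thesis.

import numpy as np
from sklearn.svm import LinearSVC

def train_feature_svm(encoder, train_volumes, train_labels):
    # encoder(v) returns the 16384-D latent vector z produced by the frozen 3D-ED-GAN encoder.
    features = np.stack([encoder(v) for v in train_volumes])
    clf = LinearSVC()
    clf.fit(features, train_labels)
    return clf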
We also compare our algorithm with the state-of-the-art methods [127, 96, 31, 7, 109,
85, 9]. VRN [7], MVCNN [109], MVCNN-Multi [85] are designed for object clas-
sification. 3DGAN [127], TL-network [31], and VConv-DAE-US [96] learned a fea-
ture representation for 3D objects, and trained a linear SVM as classifier for this task.
VConv-DAE [96] and VRN [7] adopted a VAE architecture with pre-training. We report
the testing accuracy in Table 4.2.
Although our framework is not designed for object recognition, our results with 3D-ED-GAN pre-training are competitive with existing methods, including models designed for recognition [7, 109]. By comparing RandomInit and Ours-FT, we can see that unsupervised 3D-ED-GAN pre-training is able to guide the CNN classifier to capture the rough geometric structure of 3D objects. The superior performance of Ours-SVM over other vector representation methods [31, 127, 96] demonstrates the effectiveness of our method as a feature learning architecture.
Figure 4.8: Shape interpolation results.
Shape Arithmetic
Previous works on embedding representation learning [127, 31] have shown the capability of shape transformation by performing arithmetic on the latent vectors. Our 3D-ED-GAN also learns a latent vector $z$. To explore this, we randomly chose two different instances, fed them into the encoder to produce two encoded vectors $z'$ and $z''$, and fed the interpolated vector $z''' = \beta z' + (1 - \beta) z''$ ($0 < \beta < 1$) to the decoder to produce volumes. The results for the interpolation are shown in Figure 4.8. We observe smooth transitions in the generated object domain with gradually increasing $\beta$.
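A minimal sketch of this interpolation is given below; the encode/decode callables stand for the trained 3D-ED-GAN encoder and decoder, and the step count is arbitrary.

import numpy as np

def interpolate_shapes(encode, decode, x_a, x_b, steps=8):
    # Sweep beta from 0 to 1 and decode the mixed latent vector at each step.
    z_a, z_b = encode(x_a), encode(x_b)
    volumes = []
    for beta in np.linspace(0.0, 1.0, steps):
        z_mix = beta * z_a + (1.0 - beta) * z_b   # z''' = beta * z' + (1 - beta) * z''
        volumes.append(decode(z_mix))
    return volumes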
4.5 Conclusion and Future Work
In this paper, we present a convolutional encoder-decoder generative adversarial net-
work to inpaint corrupted 3D objects. A long-term recurrent convolutional network is
further introduced, where the 3D volume is treated as a sequence of 2D images, to save
GPU memory and complete high-resolution 3D volumetric data. Experimental results
on both real-world and synthetic scans show the effectiveness of our method.
Since our model fits into GPU memory easily compared with other 3D CNN methods [96, 107], a potential direction is to complete more complex 3D structures,
future avenue is to utilize our model on other 3D representations like 3D mesh, distance
field etc.
Chapter 5
3D Modeling: 3D Deformation
Network
5.1 Introduction
Applications in virtual and augmented reality and robotics require rapid creation and
access to a large number of 3D models. Even with the increasing availability of large
3D model databases [9], the size and growth of such databases pale when compared
to the vast size of 2D image databases. As a result, the idea of editing or deforming
existing 3D models based on a reference image or another source of input such as an
RGBD scan is pursued by the research community.
Traditional approaches for editing 3D models to match a reference target rely on
optimization-based pipelines which either require user interaction [131] or rely on the
existence of a database of segmented 3D model components [46]. The development of
3D deep learning methods [84, 15, 129, 124, 47] inspire more efficient alternative ways
to handle 3D data. In fact, a multitude of approaches have been presented over the past
few years for 3D shape generation using deep learning. Many of these, however, utilize
voxel [132, 31, 143, 126, 114, 128, 134, 122] or point based representations [27] since
This chapter is based on the paper 3DN: 3D Deformation Network by Weiyue Wang, Duygu Ceylan,
Radomir Mech and Ulrich Neumann, CVPR, 2019. Code is available at github.com/laughtervv/3DN.
Figure 5.1: 3DN deforms a given source mesh to a new mesh based on a reference target. The target can be a 2D image or a 3D point cloud.
the representation of meshes and mesh connectivity in a neural network is still an open
problem. The few recent methods which do use mesh representations make assumptions
about fixed topology [36, 118] which limits the flexibility of their approach.
This paper describes 3DN, a 3D deformation network that deforms a source 3D mesh
based on a target 2D image, 3D mesh, or a 3D point cloud (e.g., acquired with a depth
sensor). Unlike previous work which assume a fixed topology mesh for all examples,
we utilize the mesh structure of the source model. This means we can use any existing
high-quality mesh model to generate new models. Specifically, given any source mesh
and a target, our network estimates vertex displacement vectors (3D offsets) to deform
the source model while maintaining its mesh connectivity. In addition, the global geo-
metric constraints exhibited by many man-made objects are explicitly preserved during
deformation to enhance the plausibility of the output model.
Our network first extracts global features from both the source and target inputs.
These are input to an offset decoder to estimate per-vertex offsets. Since acquiring
ground truth correspondences between the source and target is very challenging, we use
unsupervised loss functions (e.g., Chamfer and Earth Mover’s distances) to compute the
similarity of the deformed source model and the target. A difficulty in measuring sim-
ilarity between meshes is the varying mesh densities across different models. Imagine
a planar surface represented by just 4 vertices and 2 triangles as opposed to a dense set
of planar triangles. Even though these meshes represent the same shape, vertex-based
similarity computation may yield large errors. To overcome this problem, we adopt a
point cloud intermediate representation. Specifically, we sample a set of points on both
the deformed source mesh and the target model and measure the loss between the result-
ing point sets. This measure introduces a differentiable mesh sampling operator which
propagates features, e.g., offsets, from vertices to points in a differentiable manner.
We evaluate our approach for various targets including 3D shape datasets as well as
real images and partial points scans. Qualitative and quantitative comparisons demon-
strate that our network learns to perform higher quality mesh deformation compared
to previous learning based methods. We also show several applications, such as shape
interpolation. In conclusion, our contributions are as follows:
• We propose an end-to-end network to predict 3D deformation. By keeping the mesh topology of the source fixed and preserving properties such as symmetries, we are able to generate plausible deformed meshes.
• We propose a differentiable mesh sampling operator in order to make our network architecture resilient to varying mesh densities in the source and target models.
5.2 Related Work
5.2.1 3D Mesh Deformation
3D mesh editing and deformation has received a lot of attention from the graphics com-
munity where a multitude of interactive editing systems based on preserving local Lapla-
cian properties [108] or more global features [30] have been presented. With easy access
to growing 2D image repositories and RGBD scans, editing approaches that utilize a
reference target have been introduced. Given source and target pairs, such methods
use interactive [131] or heavy processing pipelines [46] to establish correspondences
to drive the deformation. The recent success of deep learning has inspired alternative
methods for handling 3D data. Yumer and Mitra [139] propose a volumetric CNN that
generates a deformation field based on a high level editing intent. This method relies
on the existence of model editing results based on semantic controllers. Kurenkov et
al. present DeformNet [63] which employs a free-form deformation (FFD) module as
a differentiable layer in their network. This network, however, outputs a set of points
rather than a deformed mesh. Furthermore, the deformation space lacks smoothness and
points move randomly. Groueix et al. [35] present an approach to compute correspon-
dences across deformable models such as humans. However, they use an intermediate
common template representation which is hard to acquire for man-made objects. Pontes
et al. [83] and Jack et al. [51] introduce methods to learn FFD. Yang et al. propose
Foldingnet [135] which deforms a 2D grid into a 3D point cloud while preserving local-
ity information. Compared to these existing methods, our approach is able to generate
higher quality deformed meshes by handling source meshes with different topology and
preserving details in the original mesh.
5.2.2 Single View 3D Reconstruction
Our work is also related to single-view 3D reconstruction methods which have received
a lot of attention from the deep learning community recently. These approaches have
used various 3D representations including voxels [132, 15, 31, 143, 126, 114, 128, 134],
point clouds [27], octrees [113, 40, 120], and primitives [144, 76]. Sun et al. [110]
present a dataset for 3D modeling from single-images. However, pose ambiguity and
artifacts widely occur in this dataset. More recently, Sinha et al. [102] propose a method
to generate the surface of an object using a representation based on geometry images.
Figure 5.2: 3DN extracts global features from both the source and target. 'MLP' denotes the 1$\times$1 conv as in PointNet [84]. These features are then input to an offset decoder which predicts per-vertex offsets to deform the source. We utilize loss functions to preserve geometric details in the source ($L_{Lap}$, $L_{LPI}$, $L_{Sym}$) and to ensure the deformation output is similar to the target ($L_{CD}$, $L_{EMD}$).
In a similar approach, Groueix et al. [36] present a method to generate surfaces of 3D
shapes using a set of parametric surface elements. The more recent methods of Hiroharu et al. [57] and Kanazawa et al. [55] also use a differentiable renderer and per-vertex displacements as a deformation method to generate meshes from image sets. Wang et
al. [118] introduce a graph-based network to reconstruct 3D manifold shapes from input
images. These recent methods, however, are limited to generating manifolds and require
3D output to be topology invariant for all examples.
5.3 Method
Given a source 3D mesh and a target model (represented as a 2D image or a 3D model),
our goal is to deform the source mesh such that it resembles the target model as close as
possible. Our deformation model keeps the triangle topology of the source mesh fixed
and only updates the vertex positions. We introduce an end-to-end 3D deformation
network (3DN) to predict such per-vertex displacements of the source mesh.
We represent the source mesh as $S = (V, E)$, where $V \in \mathbb{R}^{N_V \times 3}$ is the $(x, y, z)$ positions of the vertices and $E \in \mathbb{Z}^{N_E \times 3}$ is the set of triangles, encoding each triangle with the indices of its vertices. $N_V$ and $N_E$ denote the number of vertices and triangles respectively. The target model $T$ is either a $H \times W \times 3$ image or a 3D model. In case $T$ is a 3D model, we represent it as a set of 3D points $T \in \mathbb{R}^{N_T \times 3}$, where $N_T$ denotes the number of points in $T$.
As shown in Figure 5.2, 3DN takes S and T as input and outputs per-vertex dis-
placements, i.e., offsets, $O \in \mathbb{R}^{N_V \times 3}$. The final deformed mesh is $S' = (V', E)$, where $V' = V + O$. Moreover, 3DN can be extended to produce per-point displacements when
we replace the input source vertices with a sampled point cloud on the source. 3DN is
composed of a target and a source encoder which extract global features from the source
and target models respectively, and an offset decoder which utilizes such features to
estimate the shape deformation. We next describe each of these components in detail.
5.3.1 Shape Deformation Network (3DN)
Source and Target Encoders. Given the source model $S$, we first uniformly sample
a set of points on S and use the PointNet [84] architecture to encode S into a source
global feature vector. Similar to the source encoder, the target encoder extracts a target
global feature vector from the target model. In case the target model is a 2D image,
we use VGG [100] to extract features. If the target is a 3D model, we sample points on
T and use PointNet. We concatenate the source and target global feature vectors into a
single global shape feature vector and feed into the offset decoder.
Offset Decoder. Given the global shape feature vector extracted by the source and target encoders, the offset decoder learns a function $F(\cdot)$ which predicts per-vertex displacements for $S$. In other words, given a vertex $v = (x_v, y_v, z_v)$ in $S$, the offset decoder predicts $F(v) = o_v = (x_{o_v}, y_{o_v}, z_{o_v})$, updating the deformed vertex in $S'$ to be $v' = v + o_v$.
The offset decoder is easily extended to perform point cloud deformations. When we replace the input vertex locations with point locations, say a point $p = (x_p, y_p, z_p)$ in the point cloud sampled from $S$, the offset decoder predicts a displacement $F(p) = o_p$, and similarly, the deformed point is $p' = p + o_p$.
The offset decoder has an architecture similar to the PointNet segmentation net-
work [84]. However, unlike the original PointNet architecture which concatenates the
global shape feature vector with per-point features, we concatenate the original point
positions to the global shape feature. We find that this enables the network to better capture the distribution of vertex and point locations in the source and leads to more effective deformation results. We study the importance of this architecture in Section 5.4.4. Finally, we note that our network is flexible enough to handle source and target models with varying numbers of vertices.
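A minimal NumPy-style sketch of this decoder design is shown below; the mlp callable stands for the stack of 1x1 convolutions in Figure 5.6 and is a placeholder, as is the function name.

import numpy as np

def offset_decoder(points, global_feature, mlp):
    # points: N x 3 vertex (or sampled point) locations of the source.
    # global_feature: concatenated source + target global feature vector.
    n = points.shape[0]
    tiled = np.tile(global_feature[None, :], (n, 1))       # repeat the global feature per point
    per_point_input = np.concatenate([points, tiled], 1)   # concatenate raw xyz, not per-point features
    offsets = mlp(per_point_input)                         # N x 3 predicted displacements O
    return points + offsets                                # deformed vertices V' = V + O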
5.3.2 Learning Shape Deformations
Given a deformed mesh $S'$ produced by 3DN and the 3D mesh corresponding to the target model $T = (V_T, E_T)$, where $V_T \in \mathbb{R}^{N_{V_T} \times 3}$ ($N_{V_T} \neq N_V$) and $E_T \neq E$, the remaining task is to design a loss function that measures the similarity between $S'$ and $T$. Since it is not trivial to establish ground truth correspondences between $S'$ and $T$, our method instead utilizes the Chamfer and Earth Mover's losses introduced by Fan et al. [27]. In order to make these losses robust to different meshing densities across source and target models, we operate on sets of points uniformly sampled on $S'$ and $T$ by introducing the differentiable mesh sampling operator (DMSO). DMSO is seamlessly integrated in 3DN and bridges the gap between handling meshes and loss computation with point sets.
Figure 5.3: Differentiable mesh sampling operator (best viewed in color). Given a face $e = (v_1, v_2, v_3)$, $p$ is sampled on $e$ in the network forward pass using barycentric coordinates $w_1, w_2, w_3$. Sampled points are used during loss computation. When performing back propagation, the gradient of $p$ is passed back to $(v_1, v_2, v_3)$ with the stored weights $w_1, w_2, w_3$. This process is differentiable.
Differentiable Mesh Sampling Operator. As illustrated in Figure 5.3, DMSO is used to sample a uniform set of points from a 3D mesh. Suppose a point $p$ is sampled on the face $e = (v_1, v_2, v_3)$ enclosed by the vertices $v_1, v_2, v_3$. The position of $p$ is then
p = w_1 v_1 + w_2 v_2 + w_3 v_3,
where $w_1 + w_2 + w_3 = 1$ are the barycentric coordinates of $p$. Given any per-vertex feature of the original vertices, the per-vertex offsets in our case, $o_{v_1}, o_{v_2}, o_{v_3}$, the offset of $p$ is
o_p = w_1 o_{v_1} + w_2 o_{v_2} + w_3 o_{v_3}.
To perform back-propagation, the gradient for each original per-vertex offset $o_{v_i}$ is calculated simply by $g_{o_{v_i}} = w_i\, g_{o_p}$, where $g$ denotes the gradient.
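A compact sketch of the forward pass of such an operator is given below (NumPy only; names and the gather-based layout are assumptions, but the same gather-and-weight structure is what makes the backward rule stated above well defined).

import numpy as np

def sample_mesh(vertices, faces, per_vertex_feature, face_ids, bary):
    # vertices: N_V x 3, faces: N_E x 3 vertex indices,
    # face_ids: M sampled face indices, bary: M x 3 barycentric weights (rows sum to 1).
    tri = vertices[faces[face_ids]]                    # M x 3 x 3 triangle corner positions
    feat = per_vertex_feature[faces[face_ids]]         # M x 3 x C corner features (e.g. offsets)
    points = np.einsum('mk,mkd->md', bary, tri)        # p   = w1*v1 + w2*v2 + w3*v3
    point_feat = np.einsum('mk,mkd->md', bary, feat)   # o_p = w1*o_v1 + w2*o_v2 + w3*o_v3
    return points, point_feat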
We train 3DN using a combination of different losses as we discuss next.
Shape Loss. Given a target model $T$, inspired by [27], we use Chamfer and Earth Mover's distances to measure the similarity between the deformed source and the target. Specifically, given the point cloud $PC$ sampled on the deformed output and $PC_T$ sampled on the target model, the Chamfer loss is defined as
L^{Mesh}_{CD}(PC, PC_T) = \sum_{p_1 \in PC} \min_{p_2 \in PC_T} \|p_1 - p_2\|_2^2 + \sum_{p_2 \in PC_T} \min_{p_1 \in PC} \|p_1 - p_2\|_2^2,   (5.1)
and the Earth Mover's loss is defined as
L^{Mesh}_{EMD}(PC, PC_T) = \min_{\phi: PC \rightarrow PC_T} \sum_{p \in PC} \|p - \phi(p)\|_2,   (5.2)
where $\phi: PC \rightarrow PC_T$ is a bijection.
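A small NumPy sketch of the symmetric, squared Chamfer term in Equation 5.1 follows; an exact EMD requires solving an assignment problem and is omitted here. The function name is illustrative.

import numpy as np

def chamfer_distance(pc_a, pc_b):
    # pc_a: N x 3 points sampled on the deformed mesh, pc_b: M x 3 target points.
    d2 = np.sum((pc_a[:, None, :] - pc_b[None, :, :]) ** 2, axis=-1)  # N x M squared distances
    return d2.min(axis=1).sum() + d2.min(axis=0).sum()                # both directions of Eq. 5.1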
We compute these distances between point sets sampled on both the source (using the DMSO) and target models; computing the losses on sampled point sets further helps with robustness to different mesh densities. In practice, for each $(S, T)$ source-target model pair, we also pass a point cloud sampled on $S$ together with $T$ through the offset decoder in a second pass to help the network cope with sparse meshes. Specifically, given a point set sampled on $S$, we predict per-point offsets and compute the above Chamfer and Earth Mover's losses between the resulting deformed point cloud and $T$. We denote these two losses as $L^{Points}_{CD}$ and $L^{Points}_{EMD}$. During testing, this second pass is not necessary and we only predict per-vertex offsets for $S$.
We note that we train our model with synthetic data where we always have access to
3D models. Thus, even if the target is a 2D image, we use the corresponding 3D model
to compute the point cloud shape loss. During testing, however, we do not need access
to any 3D target models, since the global shape features required for offset prediction
are extracted from the 2D image only.
Symmetry Loss. Many man-made models exhibit global reflection symmetry and our
goal is to preserve this during deformation. However, the mesh topology itself is not guaranteed to be symmetric, i.e., a symmetric chair does not always have symmetric vertices. Therefore, we propose to preserve shape symmetry by sampling a point cloud, $M(PC)$, on the mirrored deformed output and measuring the point cloud shape loss with this mirrored point cloud as
L_{sym}(PC, PC_T) = L_{CD}(M(PC), PC_T) + L_{EMD}(M(PC), PC_T).   (5.3)
We note that we assume the reflection symmetry plane of a source model to be known. In our experiments, we use 3D models from ShapeNet [9], which are already aligned such that the reflection plane coincides with the $xz$ plane.
Mesh Laplacian Loss. To preserve the local geometric details in the source mesh and
enforce smooth deformation across the mesh surface, we desire the Laplacian coordi-
nates of the deformed mesh to be the same as the original source mesh. We define this
loss as
L_{lap} = \sum_i \|Lap(S)_i - Lap(S')_i\|_2,   (5.4)
where $Lap$ is the mesh Laplacian operator, and $S$ and $S'$ are the original and deformed meshes respectively.
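A sketch of this term with a uniform (graph) Laplacian is given below; the thesis does not specify the discretization, so the uniform 1-ring weighting here is an assumption, and the names are illustrative.

import numpy as np

def uniform_laplacian(vertices, neighbors):
    # neighbors: list of index arrays, neighbors[i] = 1-ring vertex indices of vertex i.
    lap = np.zeros_like(vertices)
    for i, nbr in enumerate(neighbors):
        lap[i] = vertices[i] - vertices[nbr].mean(axis=0)
    return lap

def laplacian_loss(src_vertices, def_vertices, neighbors):
    # Penalize changes of the Laplacian coordinates between S and S' (Eq. 5.4).
    diff = uniform_laplacian(src_vertices, neighbors) - uniform_laplacian(def_vertices, neighbors)
    return np.sum(np.linalg.norm(diff, axis=1))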
Figure 5.4: Self-intersection. The red arrow is the deformation handle. (a) Original mesh. (b) Deformation with self-intersection. (c) Plausible deformation.
Local Permutation Invariant Loss. Most traditional deformation methods (such as FFD) are prone to self-intersections that can occur during deformation (see Figure 5.4). To prevent such self-intersections, we present a novel local permutation invariant loss. Specifically, given a point $p$ and a neighboring point at a distance $\delta$ to $p$, we would like to preserve the distance between these two neighboring points after deformation as well. Thus, we define
L_{LPI} = -\min(F(V + \delta) - F(V),\, 0),   (5.5)
where $\delta$ is a vector with a small magnitude and $0 = (0, 0, 0)$. In our experiments we define $\delta \in \{(\epsilon, 0, 0), (0, \epsilon, 0), (0, 0, \epsilon)\}$ where $\epsilon = 0.05$. The intuition behind this is to preserve the local ordering of points in the source. We observe that the local permutation invariant loss helps to achieve smooth deformation across 3D space. Given all the losses defined above, we train 3DN with a combined loss of
Figure 5.5: Given a source (a) and a target (b) model from the ShapeNet dataset, we show the deformed meshes obtained by our method (g). We also show Poisson surface reconstruction (d) from a set of points sampled on the target (c), as well as comparisons to the previous methods of Jack et al. (e) and AtlasNet (f).
L = \omega_1 L^{Mesh}_{CD} + \omega_2 L^{Mesh}_{EMD} + \omega_3 L^{Points}_{CD} + \omega_4 L^{Points}_{EMD} + \omega_5 L_{sym} + \omega_6 L_{lap} + \omega_7 L_{LPI},   (5.6)
where $\omega_1, \ldots, \omega_7$ denote the relative weighting of the losses.
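A sketch of assembling Equation 5.6 is shown below, with the weight values taken from the implementation details in Section 5.4.1; the individual loss terms are assumed to be computed elsewhere (for instance as in the Chamfer and Laplacian sketches above), and the dict keys are illustrative.

def total_loss(losses, w=(1000, 1, 10000, 1, 1000, 0.01, 1000)):
    # losses: dict with the seven terms of Eq. 5.6; w: (w1, ..., w7) from Sec. 5.4.1.
    keys = ("cd_mesh", "emd_mesh", "cd_points", "emd_points", "sym", "lap", "lpi")
    return sum(wi * losses[k] for wi, k in zip(w, keys))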
5.4 Experiments
In this section, we perform qualitative and quantitative comparisons on shape recon-
struction from 3D target models (Section 5.4.2) as well as single-view reconstruc-
tion (Section 5.4.3). We also conduct ablation studies of our method to demonstrate
the effectiveness of the offset decoder architecture and the different loss functions
employed. Finally, we provide several applications to demonstrate the flexibility of
our method. More qualitative results and implementation details can be found in sup-
plementary material.
Dataset. In our experiments, we use the ShapeNet Core dataset [9] which includes 13
shape categories and an official training/testing split. We use the same template set of
models as in [51] for potential source meshes. There are 30 shapes for each category in
this template set. When training the 2D image-based target model, we use the rendered
views provided by Choy et al. [15]. We note that we train a single network across all
categories.
Template Selection. In order to sample source and target model pairs for 3DN, we
train a PointNet based auto-encoder to learn an embedding of the 3D shapes. Specifi-
cally, we represent each 3D shape as a uniformly sampled set of points. The encoder
encodes the points as a feature vector and the decoder predicts the point positions from
this feature vector (please refer to the supplementary material for details). Given the
embedding composed of the features extracted by the encoder, for each target model can-
didate, we choose the nearest neighbor in this embedding as the source model. Source
models are chosen from the aforementioned template set. No class label information is
required during this procedure, however, the nearest neighbors are queried within the
same category. When given a target 2D image for testing, if no desired source model is
given, we use the point set generation network, PSGN [27], to generate an initial point
cloud, and use its nearest neighbor in our embedding as the source model.
Evaluation Metrics. Given a source and target model pair $(S, T)$, we utilize three metrics in our quantitative evaluations to compare the deformation output $S'$ and the target $T$: 1) Chamfer Distance (CD) between the point clouds sampled on $S'$ and $T$, 2) Earth Mover's Distance (EMD) between the point clouds sampled on $S'$ and $T$, and 3) Intersection over Union (IoU) between the solid voxelizations of $S'$ and $T$.
plane bench box car chair display lamp speaker rifle sofa table phone boat Mean
EMD:
AtlasNet 3.46 3.18 4.20 2.84 3.47 3.97 3.79 3.83 2.44 3.19 3.76 3.87 2.99 3.46
FFD 1.88 2.02 2.50 2.11 2.13 2.69 2.42 3.06 1.55 2.44 2.44 1.88 2.00 2.24
Ours 0.79 1.98 3.57 1.24 1.12 3.08 3.44 3.40 1.79 2.06 1.34 3.27 2.27 2.26
CD:
AtlasNet 2.16 2.91 6.62 3.97 3.65 3.65 4.48 6.29 0.98 4.34 6.01 2.44 2.73 3.86
FFD 3.22 4.53 6.94 4.45 4.99 5.98 8.72 11.97 1.97 6.29 6.89 3.61 4.41 5.69
Ours 0.38 2.40 5.26 0.90 0.82 5.59 8.74 9.27 1.52 2.55 0.97 2.66 2.77 3.37
IoU:
AtlasNet 56.9 53.3 31.3 44.0 47.9 48.0 41.6 33.2 63.4 44.7 43.8 58.7 50.9 46.7
FFD 29.0 42.3 28.4 21.1 42.2 27.9 38.9 52.5 31.9 34.7 43.3 22.9 47.7 35.6
Ours 71.0 40.7 43.6 75.8 66.3 40.4 25.1 49.2 40.0 60.6 57.9 50.1 42.6 51.1
Table 5.1: Point cloud reconstruction results on the ShapeNet core dataset. Metrics are mean Chamfer distance ($\times$0.001, CD) on points, Earth Mover's distance ($\times$100, EMD) on points, and Intersection over Union (%, IoU) on solid voxelized grids. For both CD and EMD, lower is better. For IoU, higher is better.
We normalize
the outputs of our method and previous work into a unit cube before computing these
metrics. We also evaluate the visual plausibility of our results by providing a large set
of qualitative examples.
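A minimal sketch of the IoU metric on solid voxel grids is shown below; the voxelization step itself is assumed to be done beforehand, and the function name is illustrative.

import numpy as np

def voxel_iou(vox_a, vox_b):
    # vox_a, vox_b: boolean occupancy grids of the same resolution (e.g. solid voxelizations of S' and T).
    inter = np.logical_and(vox_a, vox_b).sum()
    union = np.logical_or(vox_a, vox_b).sum()
    return inter / union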
Comparison We compare our approach with state-of-the-art reconstruction methods.
Specifically, we compare to three categories of methods: 1) learning-based surface gen-
eration, 2) learning-based deformation prediction, and 3) traditional surface reconstruc-
tion methods. We would like to note that we are solving a fundamentally different
problem than surface generation methods. Even though having a source mesh to start with might seem advantageous, our problem at hand is not easier, since our goal is not only to generate a mesh similar to the target but also to preserve certain properties of the
source. Furthermore, our source meshes are obtained from a fixed set of templates which
contain only 30 models per category.
Figure 5.6: Offset decoder architecture. Each 1$\times$1 Conv is followed by a ReLU layer except for the last one.
5.4.1 Implementation Details
We use an ADAM optimizer with initial learning rate 0.0005, momentum 0.9 and batch size 4. Our network is implemented with Tensorflow and trained on an Nvidia GTX 1080 Ti GPU. When the input is a point cloud, we use a point cloud of size 2048$\times$3. No Batch Normalization layer is used. The weights for the losses are $\omega_1 = 1000$, $\omega_2 = 1$, $\omega_3 = 10000$, $\omega_4 = 1$, $\omega_5 = 1000$, $\omega_6 = 0.01$, $\omega_7 = 1000$.
The mesh sampling operator is implemented with CUDA acceleration and Tensorflow. Since the number of vertices varies across meshes and we need to train the network with batch size $> 1$, we set a maximum number of vertices and triangles to make the operation trainable with batch size $> 1$, and we also input the number of vertices and triangles for each sample. In the CUDA implementation, we sample the same number of points for meshes with different numbers of vertices. The inputs to the mesh sampling operator are $V \in \mathbb{R}^{B \times N_{Vmax} \times 3}$, $T \in \mathbb{Z}^{B \times N_{Tmax} \times 3}$, $N_V \in \mathbb{Z}^{B \times 1}$, $N_T \in \mathbb{Z}^{B \times 1}$, $w_{V1} \in (0,1)^{B \times N_{Tmax} \times 3}$, $w_{V2} \in (0,1)^{B \times N_{Tmax} \times 3}$, $w_{V3} \in (0,1)^{B \times N_{Tmax} \times 3}$, where $B$ is the batch size, $N_{Vmax} = 10000$, $N_{Tmax} = 10000$, $N_V$ and $N_T$ indicate the number of vertices and triangles for each sample in the mini batch, and $w_{V1}$, $w_{V2}$ and $w_{V3}$ are the random barycentric coordinate weights.
Network Architecture Figure 5.6 illustrates the network architecture of our offset decoder. Figure 5.7 shows the network architecture of the PointNet autoencoder for template selection. "Embedding" denotes the feature vector we use to query the template source mesh. Figure 5.8 illustrates the network architecture used in the mid-layer fusion experiment.
Figure 5.7: PointNet autoencoder architecture. Each 1$\times$1 Conv is followed by a ReLU layer except for the last one. "Embedding" is the feature vector used to query the template source mesh.
Figure 5.8: Offset decoder architecture for the mid-layer fusion experiment. Each 1$\times$1 Conv is followed by a ReLU layer except for the last one.
5.4.2 Shape Reconstruction from Point Cloud
For this experiment, we define each 3D model in the testing split as target and iden-
tify a source model in the testing split based on the autoencoder embedding described
above. 3DN computes per-vertex displacements to deform the source and keeps the
source mesh topology fixed. We evaluate the quality of this mesh with alternative mesh-
ing techniques. Specifically, given a set of points sampled on the desired target model,
we reconstruct a 3D mesh using Poisson surface reconstruction. As shown in Figure 5.5,
this comparison demonstrates that even with a ground truth set of points, generating a
mesh that preserves sharp features is not trivial. Instead, our method utilizes the source
mesh connectivity to output a plausible mesh. Furthermore, we apply the learning-based
surface generation technique of AtlasNet [36] on the uniformly sampled points on the target model.
plane bench box car chair display lamp speaker rifle sofa table phone boat Mean
EMD:
AtlasNet 3.39 3.22 3.36 3.72 3.86 3.12 5.29 3.75 3.35 3.14 3.98 3.19 4.39 3.67
Pixel2Mesh 2.98 2.58 3.44 3.43 3.52 2.92 5.15 3.56 3.04 2.70 3.52 2.66 3.94 3.34
FFD 2.63 3.96 4.87 2.98 3.38 4.88 7.19 5.04 3.58 3.70 3.56 4.11 3.86 4.13
Ours 3.30 2.98 3.21 3.28 4.45 3.91 3.99 4.47 2.78 3.31 3.94 2.70 3.92 3.56
CD:
AtlasNet 5.98 6.98 13.76 17.04 13.21 7.18 38.21 15.96 4.59 8.29 18.08 6.35 15.85 13.19
Pixel2Mesh 6.10 6.20 12.11 13.45 11.13 6.39 31.41 14.52 4.51 6.54 15.61 6.04 12.66 11.28
FFD 3.41 13.73 29.23 5.35 7.75 24.03 45.86 27.57 6.45 11.89 13.74 16.93 11.31 16.71
Ours 6.75 7.96 8.34 7.09 17.53 8.35 12.79 17.28 3.26 8.27 14.05 5.18 10.20 9.77
IoU:
AtlasNet 39.2 34.2 20.7 22.0 25.7 36.4 21.3 23.2 45.3 27.9 23.3 42.5 28.1 30.0
Pixel2Mesh 51.5 40.7 43.4 50.1 40.2 55.9 29.1 52.3 50.9 60.0 31.2 69.4 40.1 47.3
FFD 30.3 44.8 30.1 22.1 38.7 31.6 35.0 52.5 29.9 34.7 45.3 22.0 50.8 36.7
Ours 54.34 39.82 49.44 59.45 34.46 47.20 35.48 45.34 57.62 60.74 31.33 71.49 46.47 48.7
Table 5.2: Quantitative comparison on ShapeNet rendered images. Metrics are CD ($\times$0.001), EMD ($\times$100) and IoU (%).
Thus, we expect AtlasNet only to perform surface generation without any
deformation. We also compare to the method of Jack et al. [51] (FFD) which introduces
a learning based method to apply free form deformation to a given template model to
match an input image. This network consists of a module which predicts FFD parame-
ters based on the features extracted from the input image. We retrain this module such
that it uses the features extracted from the points sampled on the 3D target model. As
shown in Figure 5.5, the deformed meshes generated by our method are higher quality
than the previous methods. We also report quantitative numbers in Table 5.1. While
AtlasNet achieves lower error based on Chamfer Distance, we observe certain artifacts
such as holes and disconnected surfaces in their results. We also observe that our defor-
mation results are smoother than FFD.
5.4.3 Single-view Reconstruction
We also compare our method to recent state-of-the-art single view image based recon-
struction methods including Pixel2Mesh [118], AtlasNet [36] and FFD [51]. Specifi-
cally, we choose a target rendered image from the testing split and input it to the previous
methods. For our method, in addition to this target image, we also provide a source
model selected from the template set. We note that the scope of our work is not single-
view reconstruction, thus the comparison with Pixel2Mesh and AtlasNet is not entirely
fair. However, both quantitative (see Table 5.2) and qualitative (Figure 5.9) results still
provide useful insights. Though the rendered outputs of AtlasNet and Pixel2Mesh in Figure 5.9 are visually plausible, self-intersections and disconnected surfaces often exist in
their results. Figure 5.10 illustrates this by rendering the output meshes in wireframe
mode. Furthermore, as shown in Figure 5.10, while surface generation methods strug-
gle to capture shape details such as chair handles and car wheels, our method preserves
these details that reside in the source mesh.
Evaluation on real images. We further evaluate our method on real product images
that can be found online. For each input image, we select a source model as described
before and provide the deformation result. Even though our method has been trained
only on synthetic images, we observe that it generalizes to real images as seen in Fig-
ure 5.11. AtlasNet and Pixel2Mesh fail in most cases, while our method is able to
generate plausible results by taking advantage of the source meshes.
5.4.4 Ablation Study
We study the importance of different losses and the offset decoder architecture on
ShapeNet chair category. We compare our final model to variants including 1) 3DN
without the symmetry loss, 2) 3DN without the mesh Laplacian loss, 3) 3DN without
Figure 5.9: Given a target image and a source, we show deformation results of FFD, AtlasNet,
Pixel2Mesh (P2M), and 3DN. We also show the ground truth target model (GT).
Figure 5.10: For a given target image and source model, we show ground truth model and results of
Pixel2Mesh (P2M), AtlasNet, and our method (3DN) rendered in wire-frame mode to better judge the
quality of the meshes. Please zoom into the PDF for details.
the local permutation invariance loss, and 4) fusing global features with midlayer fea-
tures instead of the original point positions (see the supplemental material for details).
We provide quantitative results in Table 5.3. Symmetry loss helps the deformation
to produce plausible symmetric shapes. Local permutation and Laplacian losses help to
obtain smoothness in the deformation field across 3D space and along the mesh surface.
However, midlayer fusion makes it hard for the network to converge to a valid deformation
space.
Figure 5.11: Qualitative results on online product images. The first row shows images scraped online. The second and third rows show the results of AtlasNet and Pixel2Mesh, respectively. The last row shows our results.
5.4.5 Applications
Random Pair Deformation. In Figure 5.12 we show deformation results for ran-
domly selected source and target model pairs. The first column of each row shows the source mesh, and the first row of each column shows the target. Each remaining grid cell shows the deformation result for the corresponding source-target pair.
Shape Interpolation. Figure 5.13 shows shape interpolation results. Each row shows
interpolated shapes generated from the two targets and the source mesh. Each interme-
diate shape is generated using a weighted sum of the global feature representations of
CD EMD IoU
3DN 4.50 2.06 41.0
-Symmetry 4.78 2.73 36.7
-Mesh Laplacian 4.55 2.08 39.8
-Local Permutation 5.31 2.96 35.4
Midlayer Fusion 6.63 3.03 30.9
Table 5.3: Quantitative comparison on ShapeNet rendered images. '-x' denotes the model without loss x. Metrics are CD (×1000), EMD (×0.01) and IoU (%).
Figure 5.12: Deformation with different source-target pairs. ‘S’ and ‘T’ denote source
meshes and target meshes respectively.
the target shapes. Notice how the interpolated shapes gradually deform from the first to
the second target.
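To make the interpolation procedure concrete, the following Python sketch blends the global feature representations of the two targets with a weighted sum before decoding; encode_target and decode_offsets are hypothetical placeholders for the trained target encoder and offset decoder, not the released 3DN code.

    import numpy as np

    def interpolate_shapes(encode_target, decode_offsets, source_mesh, target_a, target_b, steps=5):
        # Global features of the two target shapes (placeholder encoder).
        feat_a = encode_target(target_a)
        feat_b = encode_target(target_b)
        shapes = []
        for w in np.linspace(0.0, 1.0, steps):
            # Weighted sum of the two global feature vectors.
            feat = (1.0 - w) * feat_a + w * feat_b
            # Deform the same source mesh with the blended feature (placeholder decoder).
            shapes.append(decode_offsets(source_mesh, feat))
        return shapes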
Shape Inpainting. We test our model trained in Section 5.4.2 on targets in the form
of partial scans produced by RGBD data [111]. We provide results in Figure 5.14 with
different selections of source models. We note that AtlasNet fails on such partial scan
input.
Figure 5.13: Shape interpolation. From left to right: source mesh, first target, interpolated shapes, and second target.
Figure 5.14: Shape inpainting with a real point cloud scan as input. 'Src' denotes a source mesh and 'Out' the corresponding deformed mesh; the last column shows the AtlasNet result.
5.5 Conclusion
We have presented 3DN, an end-to-end network architecture for mesh deformation.
Given a source mesh and a target which can be in the form of a 2D image, 3D mesh, or
3D point clouds, 3DN deforms the source by inferring per-vertex displacements while
keeping the source mesh connectivity fixed. We compare our method with recent learn-
ing based surface generation and deformation networks and show superior results. Our
method is not without limitations, however. Certain deformations require changing the source mesh topology, e.g., when deforming a chair without handles into a
chair with handles. If large holes exist either in the source or target models, Chamfer
and Earth Mover’s distances are challenging to compute since it is possible to generate
many wrong point correspondences.
In addition to addressing the above limitations, our future work includes extending our method to predict mesh texture by taking advantage of a differentiable renderer [57].
Chapter 6
3D Modeling: 3D Implicit Surface
Generation
6.1 Introduction
Over the recent years, a multitude of single-view 3D reconstruction methods have
been proposed where deep learning based methods have specifically achieved promis-
ing results. To represent 3D shapes, many of these methods utilize either vox-
els [132, 31, 143, 126, 114, 128, 134, 122] or point clouds [27] due to ease of encoding
them in a neural network. However, such representations are often limited in terms of
resolution. A few recent methods have explored utilizing explicit surface representa-
tions, i.e., polygonal meshes, in a neural network but make the assumption of a fixed
topology [36, 118, 121] limiting the flexibility of the approaches. Moreover, point- and
mesh-based methods use Chamfer Distance (CD) and Earth Mover's Distance (EMD) as training losses. However, these distances provide only approximate metrics for measuring shape similarity.
This chapter is based on the paper DISN: Deep Implicit Surface Network for High-quality Single-
view 3D Reconstruction by Weiyue Wang*, Qiangeng Xu*, Duygu Ceylan, Radomir Mech and Ulrich
Neumann, and was submitted to NeurIPS, 2018. (* indicates equal contributions.) Code is available at
github.com/laughtervv/DISN.
Figure 6.1: Single-view reconstruction results using OccNet [74], a state-of-the-art
method, and DISN on synthetic and real images.
To address the aforementioned limitations in voxels, point clouds and meshes, in
this paper, we study an alternative implicit 3D surface representation, Signed Distance
Functions (SDF). We present an efficient, flexible, and effective Deep Implicit Surface
Network (DISN) for predicting SDFs from single-view images (Figure 6.1). SDF simply
encodes the signed distance of each point sample in 3D from the boundary of the under-
lying shape. Thus, given a set of signed distance values, the shape can be extracted by
identifying the iso-surface using methods such as Marching Cubes [71]. As illustrated
in Figure 6.3, given a convolutional neural network (CNN) that encodes the input image
into a feature vector, DISN predicts the SDF value of a given 3D point using this feature
vector. By sampling different 3D point locations, DISN is able to generate an implicit
field of the underlying surface with infinite resolution. Moreover, without the need of a fixed-topology assumption, the regression target for DISN is the exact ground truth instead of an approximate metric.
While many single-view 3D reconstruction methods [132, 27, 12, 74] that learn a
shape embedding from a 2D image are able to capture the global shape properties, they
have a tendency to ignore details such as holes or thin structures, resulting in visually
unsatisfactory reconstruction. To address this problem, we further introduce a local fea-
ture extraction module. Specifically, we estimate the viewpoint parameters of the input
image. We utilize this information to project each query point onto the input image to
identify a corresponding local patch. We extract local features from such patches and use
them in conjunction with global image features to predict the SDF value of the points.
This module enables the network to learn the relations between projected pixels and 3D
space, and significantly improves the reconstruction quality of fine-grained details in
the resulting 3D shape. As shown in Figure 6.1, DISN is able to generate shape details,
such as the patterns on the bench back and holes on the rifle handle, which previous
state-of-the-art methods fail to produce. To the best of our knowledge, DISN is the first
deep learning model that is able to capture such high-quality details from single-view
images.
We evaluate our approach on various shape categories using both synthetic data gen-
erated from 3D shape datasets as well as online product images. Qualitative and quan-
titative comparisons demonstrate that our network outperforms state-of-the-art methods
and generates plausible shapes with high-quality details. Furthermore, we also extend
DISN to multi-view reconstruction and other applications such as shape interpolation.
6.2 Related Work
Over the last few years, there have been extensive studies on learning based single-
view 3D reconstruction using various 3D representations including voxels [132, 31,
143, 126, 114, 128, 134], octrees [40, 113, 120], points [27], and primitives [144, 76].
More recently, Sinha et al. [102] propose a method to generate the surface of an object
using geometry images. Tang et al. [112] use shape skeletons for surface reconstruction; however, their method requires an additional dataset of shape primitives. Groueix et al. [36]
present AtlasNet to generate surfaces of 3D shapes using a set of parametric surface
elements. Wang et al. [118] introduce a graph-based network Pix2Mesh to reconstruct
3D manifold shapes from input images whereas Wang et al. [121] present 3DN to recon-
struct a 3D shape by deforming a given source mesh.
Most of the aforementioned methods use explicit 3D representations and often suf-
fer from problems such as limited resolution and fixed mesh topology. Implicit repre-
sentations provide an alternative representation to overcome these limitations. In our
work, we adopt the Signed Distance Functions (SDF) which are among the most pop-
ular implicit surface representations. Several deep learning approaches have utilized
SDFs recently. Dai et al. [18] use a voxel-based SDF representation for shape inpaint-
ing. Nevertheless, 3D CNNs are known to suffer from high memory usage and
computation cost. Park et al. [77] introduce DeepSDF for shape completion using an
auto-decoder structure. However, their network is not feed-forward and requires opti-
mizing the embedding vector during test time which limits the efficiency and capability
of the approach. In concurrent work, Chen and Zhang [12] use SDFs in deep networks
for the task of shape generation. While their method achieves promising results for
the generation task, it fails to recover fine-grained details of 3D objects for single-view
reconstruction. Finally, Mescheder et al. [74] learn an implicit representation by pre-
dicting the probability of each cell in a volumetric grid being occupied or not, i.e., being
inside or outside of a 3D model. By iteratively subdividing each active cell (i.e., cells
surrounded by occupied and empty cells) into sub-cells and repeating the prediction for
each sub-cell, they alleviate the problem of limited resolution of volumetric grids. In
contrast, our method not only predicts the sign (i.e., being inside or outside) of sam-
pled points but also the distance which is continuous. Therefore, an iterative prediction
procedure is not necessary. We compare our method with these recent approaches in
Section 6.4.1 and demonstrate state-of-the-art results.
6.3 Method
Given an image of an object, our goal is to reconstruct a 3D shape that captures both
the overall structure and fine-grained details of the object. We consider modeling a
3D shape as a signed distance function (SDF). As illustrated in Figure 6.2, an SDF is a continuous function that maps a given spatial point p = (x, y, z) ∈ R^3 to a real value s ∈ R: s = SDF(p). The absolute value of s indicates the distance of the point to the surface, while the sign of s represents whether the point is inside or outside the surface. The iso-surface S_0 = {p | SDF(p) = 0} implicitly represents the underlying 3D shape.
Figure 6.2: Illustration of SDF. (a) Rendered 3D surface with s = 0. (b) Cross-section of the SDF. A point is outside the surface if s > 0, inside if s < 0, and on the surface if s = 0.
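As a minimal, self-contained illustration of this sign convention (not part of DISN), the analytic SDF of a sphere can be written as follows.

    import numpy as np

    def sphere_sdf(points, center=(0.0, 0.0, 0.0), radius=0.5):
        # Signed distance to a sphere: negative inside, zero on the surface, positive outside.
        points = np.atleast_2d(points).astype(float)
        return np.linalg.norm(points - np.asarray(center), axis=-1) - radius

    print(sphere_sdf([0.0, 0.0, 0.0]))  # [-0.5] -> inside the surface
    print(sphere_sdf([1.0, 0.0, 0.0]))  # [ 0.5] -> outside the surface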
In this paper, we use a feed-forward deep
neural network, Deep Implicit Surface Network
(DISN), to predict the SDF from an input image.
DISN takes a single image as input and predicts
the SDF value for any given point. Unlike the 3D
CNN methods [18] which generate a volumetric
grid with fixed resolution, DISN produces a con-
tinuous field with arbitrary resolution. Moreover,
we introduce a local feature extraction method to
improve recovery of shape details.
6.3.1 DISN: Deep Implicit Surface Network
The overview of our method is illustrated in Figure 6.3. Given an image, DISN consists
of two parts: camera pose estimation and SDF prediction. DISN first estimates the
camera parameters that map an object in world coordinates to the image plane. Given
the predicted camera parameters, we project each 3D query point onto the image plane
and collect multi-scale CNN features for the corresponding image patch. DISN then
decodes the given spatial point to an SDF value using both the multi-scale local image
features and the global image features.
Figure 6.3: Given an image and a point p, we estimate the camera pose and project p onto the image plane. DISN uses the local features at the projected location, the global features, and the point features to predict the SDF of p. 'MLPs' denotes multi-layer perceptrons.
Figure 6.4: Local feature extraction. Given a 3D point p, we use the estimated camera parameters to project p onto the image plane. Then we identify the projected location on each feature map layer of the encoder. We concatenate the features at each layer to get the local features of point p.
Camera Pose Estimation
Given an input image, our first goal is to estimate the corresponding viewpoint. We train
our network on the ShapeNet Core dataset [9] where all the models are aligned. There-
fore, we use this aligned model space as the world space with respect to which our camera parameters are defined, and we assume a fixed set of intrinsic parameters. Regressing cam-
era parameters from an input image directly using a CNN often fails to produce accurate
Figure 6.5: Camera Pose Estimation Network. 'PC' denotes point cloud. 'GT Cam' and 'Pred Cam' denote the ground truth and predicted cameras.
poses as discussed in [49]. To overcome this issue, Insafutdinov and Dosovitskiy [49]
introduce a distilled ensemble approach to regress camera pose by combining several
pose candidates. However, this method requires a large number of network parameters
and a complex training procedure. We present a more efficient and effective network
illustrated in Figure 6.5. In a recent work, Zhou et al. [142] show that a 6D rotation representation is continuous and easier for a neural network to regress compared to more commonly used representations such as quaternions and Euler angles. Thus, we employ the 6D rotation representation b = (b_x, b_y), where b ∈ R^6, b_x ∈ R^3, b_y ∈ R^3. Given b, the rotation matrix R = (R_x, R_y, R_z)^T ∈ R^{3×3} is obtained by

\[ R_x = N(b_x), \qquad R_z = N(R_x \times b_y), \qquad R_y = R_z \times R_x, \tag{6.1} \]

where R_x, R_y, R_z ∈ R^3, N(·) is the normalization function, and '×' denotes the cross product. The translation t ∈ R^3 from world space to camera space is directly predicted by the network.
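A small numpy sketch of Equation 6.1 is given below; it recovers an orthonormal rotation matrix from a predicted 6D vector. This is an illustrative re-implementation of the formula, not the thesis code.

    import numpy as np

    def rotation_from_6d(b):
        # Eq. (6.1): R_x = N(b_x), R_z = N(R_x x b_y), R_y = R_z x R_x.
        b_x, b_y = b[:3], b[3:6]
        r_x = b_x / np.linalg.norm(b_x)            # N(b_x)
        r_z = np.cross(r_x, b_y)
        r_z = r_z / np.linalg.norm(r_z)            # N(R_x x b_y)
        r_y = np.cross(r_z, r_x)                   # unit length by construction
        return np.stack([r_x, r_y, r_z], axis=0)   # rows are R_x, R_y, R_z

    R = rotation_from_6d(np.array([1.0, 0.2, 0.0, 0.0, 1.0, 0.3]))
    assert np.allclose(R @ R.T, np.eye(3))         # orthonormal rotation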
Instead of calculating losses on camera parameters directly as in [49], we use the predicted camera pose to transform a given point cloud from the world space to the camera coordinate space. We compute the loss L_cam by calculating the mean squared error between the transformed point cloud and the ground truth point cloud in the camera space:

\[ L_{cam} = \frac{\sum_{p_w \in PC_w} \left\| p_G - (R p_w + t) \right\|_2^2}{\sum_{p_w \in PC_w} 1}, \tag{6.2} \]

where PC_w ∈ R^{N×3} is the point cloud in the world space and N is the number of points in PC_w. For each p_w ∈ PC_w, p_G represents the corresponding ground truth point location in the camera space, and ||·||_2^2 is the squared L_2 distance.
SDF Prediction with Deep Neural Network
Given an image I, we denote the ground truth SDF by SDF_I(·), and the goal of our network f(·) is to estimate SDF_I(·). Unlike the commonly used CD and EMD losses in previous reconstruction methods [27, 36], our guidance is the true ground truth instead of an approximated metric.
Park et al. [77] recently propose DeepSDF, a direct approach to regress SDF with a
neural network. DeepSDF concatenates the location of a query 3D point and the shape
embedding extracted from a depth image or a point cloud and uses an auto-decoder to
obtain the corresponding SDF value. The auto-decoder structure requires optimizing
the shape embedding for each object. In our initial experiments, when we applied a
similar network architecture in a feed-forward manner, we observed convergence issues.
Alternatively, Chen and Zhang [12] propose to concatenate the global features of an
input image and the location of a query point to every layer of a decoder. While this
approach works better in practice, it also results in a significant increase in the number
of network parameters. Our solution is to use a multi-layer perceptron to map the given
Figure 6.6: Shape reconstruction results (a) without and (b) with local feature extraction.
point location to a higher-dimensional feature space. This high dimensional feature is
then concatenated with global and local image features respectively and used to regress
the SDF value.
Local Feature Extraction As shown in Figure 6.6(a), our initial experiments showed
that it is hard to capture shape details such as holes and thin structures when only global
image features are used. Thus, we introduce a local feature extraction method to focus
on reconstructing fine-grained details, such as the back poles of a chair (Figure 6.6). As
illustrated in Figure 6.4, a 3D point p ∈ R^3 is projected to a 2D location q ∈ R^2 on
the image plane with the estimated camera parameters. We retrieve features on each
feature map corresponding to location q and concatenate them to get the local image
features. Since the feature maps in the later layers are smaller in dimension than the
original image, we resize them to the original size with bilinear interpolation and extract the resized features at location q.
Two decoders then take the global and local image features respectively as input
with the point features and make an SDF prediction. The final SDF is the sum of these
two predictions. Figure 6.6 compares the results of our approach with and without local
feature extraction. With only global features, the network is able to predict the overall
shape but fails to produce details. Local feature extraction helps to recover these missing
details by predicting the residual SDF.
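The sketch below summarizes the local feature lookup under a simple pinhole model; the intrinsics K, the list of per-layer feature maps, and the nearest-pixel lookup (instead of the bilinear resizing described above) are illustrative simplifications rather than the exact DISN implementation.

    import numpy as np

    def local_features(point_xyz, K, R, t, feature_maps, image_size=224):
        # Project a 3D point with the estimated camera and gather one feature
        # vector per encoder layer at the projected pixel location.
        p_cam = R @ point_xyz + t                 # world -> camera space
        u, v, w = K @ p_cam                       # perspective projection
        x, y = u / w, v / w                       # pixel location q on the image plane
        feats = []
        for fmap in feature_maps:                 # each fmap: (H_l, W_l, C_l) array
            h_l, w_l = fmap.shape[:2]
            ix = int(np.clip(x * w_l / image_size, 0, w_l - 1))
            iy = int(np.clip(y * h_l / image_size, 0, h_l - 1))
            feats.append(fmap[iy, ix])            # feature at q, rescaled to this layer
        return np.concatenate(feats)              # multi-scale local feature of the point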
Loss Functions We regress continuous SDF values instead of formulating a binary
classification problem (e.g., inside or outside of a shape) as in [12]. This strategy enables
us to extract surfaces that correspond to different iso-values. To ensure that the network
concentrates on recovering the details near and inside the iso-surfaceS
0
, we propose a
weighted loss function. Our loss is defined by
L
SDF
=
X
p
mjf(I;p)SDF
I
(p)j;
m =
8
>
>
<
>
>
:
m
1
; ifSDF
I
(p)<;
m
2
; otherwise;
(6.3)
wherejj is theL
1
-norm. m
1
, m
2
are different weights, and for points whose signed
distance is below a certain threshold, we use a higher weight ofm
1
.
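A compact numpy version of this loss (for illustration only; the default values below match the training parameters reported in Section 6.4) is:

    import numpy as np

    def sdf_loss(pred_sdf, gt_sdf, m1=4.0, m2=1.0, delta=0.01):
        # Weighted L1 loss of Eq. (6.3): samples whose ground-truth signed distance
        # is below delta (near or inside the surface) receive the larger weight m1.
        weights = np.where(gt_sdf < delta, m1, m2)
        return np.sum(weights * np.abs(pred_sdf - gt_sdf))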
6.3.2 Surface Reconstruction
To generate a mesh surface, we first define a dense 3D grid and predict the SDF value at each grid point. Once we compute the SDF values for every point in the dense grid, we use Marching Cubes [71] to obtain the 3D mesh that corresponds to the iso-surface S_0.
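The procedure can be sketched with scikit-image's Marching Cubes routine; predict_sdf stands in for the trained network, and the grid bounds and resolution are illustrative choices rather than the thesis settings.

    import numpy as np
    from skimage import measure  # assumes scikit-image is installed

    def extract_mesh(predict_sdf, resolution=64, bound=1.0):
        # Evaluate the SDF on a dense grid and extract the zero iso-surface S_0.
        ticks = np.linspace(-bound, bound, resolution)
        grid = np.stack(np.meshgrid(ticks, ticks, ticks, indexing="ij"), axis=-1)
        sdf = predict_sdf(grid.reshape(-1, 3)).reshape(resolution, resolution, resolution)
        verts, faces, _, _ = measure.marching_cubes(sdf, level=0.0)
        # Map voxel indices back to world coordinates.
        verts = verts / (resolution - 1) * (2 * bound) - bound
        return verts, faces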
6.4 Experiments
We perform quantitative and qualitative comparisons on single-view 3D reconstruction
with state-of-the-art methods [36, 118, 121, 12, 74] in Section 6.4.1. We also compare
the performance of our method on camera pose estimation with [49] in Section 6.4.2.
We further conduct ablation studies in Section 6.4.3 and showcase several applications
in Section 6.4.4.
Figure 6.7: Single-view reconstruction results of various methods (columns, left to right: input, 3DN, AtlasNet, Pix2Mesh, 3DCNN, IMNET, OccNet, Ours_cam, Ours, GT). 'GT' denotes ground truth shapes. Best viewed on screen with zooming in.
Dataset For both camera prediction and SDF prediction, we follow the settings of
[36, 118, 121, 74], and use the ShapeNet Core dataset [9], which includes 13 object
categories, and an official training/testing split to train and test our method. For 2D
images, we use the rendered views provided by Choy et al. [15]. We train a single
network on all categories and report the test results generated by this network.
Data Preparation and Implementation Details For each 3D mesh in ShapeNet Core,
we first generate an SDF grid with resolution 256^3 using [130, 101]. Models in ShapeNet
Core are aligned and we choose this aligned model space as our world space where each
render view in [15] represents a transformation to a different camera space.
We train our camera pose estimation network and SDF prediction network sepa-
rately. For both networks, we use VGG-16 [100] as the image encoder. When train-
ing the SDF prediction network, we extract the local features using the ground truth
camera parameters. As mentioned in Section 6.3.1, DISN is able to generate a signed
distance field with arbitrary resolution by continuously sampling points and regressing
their SDF values. However, in practice, we are interested in points near the iso-surface
S_0. Therefore, we use Monte Carlo sampling to choose 2048 grid points under a Gaussian distribution N(0, 0.1) during training. We choose m_1 = 4, m_2 = 1, and δ = 0.01 as the parameters of Equation 6.3.
Our network is implemented with TensorFlow. We use the Adam optimizer with a
learning rate of 1×10^-4 and a batch size of 16.
For testing, we first use the camera pose prediction network to estimate the camera
parameters for the input image and feed the estimated parameters as input to SDF pre-
diction. We follow the aforementioned surface reconstruction procedure (Section 6.3.2)
to generate the output mesh.
Evaluation Metrics For quantitative evaluations, we apply three commonly used met-
rics to compute the difference between a reconstructed mesh object and its ground truth
mesh: (1) Chamfer Distance (CD) and (2) Earth Mover’s Distance (EMD) between
uniformly sampled point clouds, and (3) Intersection over Union (IoU) on voxelized
meshes. Suppose PC and PC_T are the point clouds sampled from the predicted mesh and the ground truth mesh respectively. CD is defined by

\[ CD(PC, PC_T) = \sum_{p_1 \in PC} \min_{p_2 \in PC_T} \| p_1 - p_2 \|_2^2 + \sum_{p_2 \in PC_T} \min_{p_1 \in PC} \| p_1 - p_2 \|_2^2, \tag{6.4} \]

and EMD is defined by

\[ EMD(PC, PC_T) = \min_{\phi : PC \to PC_T} \sum_{p \in PC} \| p - \phi(p) \|_2, \tag{6.5} \]

where φ : PC → PC_T is a bijection (one-to-one matching).
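For reference, the CD of Equation 6.4 can be computed directly on two small point clouds as in the brute-force numpy sketch below; EMD requires solving an assignment problem and is omitted here.

    import numpy as np

    def chamfer_distance(pc_a, pc_b):
        # Eq. (6.4): sum of squared distances to the nearest neighbor in the other set.
        d2 = np.sum((pc_a[:, None, :] - pc_b[None, :, :]) ** 2, axis=-1)  # (N, M) pairwise
        return d2.min(axis=1).sum() + d2.min(axis=0).sum()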
Network Structure The detailed architectures of the camera pose prediction network, the two-stream network, and the one-stream network are illustrated in Figures 6.8, 6.9, and 6.10, respectively.
Computation Efficiency It is possible to predict SDF values only for a small set of
uniformly sampled points and apply trilinear interpolation to recover a dense grid of
signed distance values. However, we observe that with the computation power of GPUs,
predicting SDF at densely sampled points does not incur additional computation cost.
Suppose the target model is represented by a voxelized grid of size N×N×N; the complexity of our algorithm is O(N^3) since we predict the SDF value for each grid cell. We test our model on an Intel i7 machine with a GeForce 1080Ti GPU. The network converges in 50 epochs. The average inference time (SDF generation and Marching Cubes) for a single mesh of DISN Two-stream and OccNet is reported in Table 6.1; DISN is much faster than OccNet in terms of computing speed.
Figure 6.8: Camera Pose Prediction Network.
Figure 6.9: DISN Two-stream Network.
Resolution OccNet DISN
64^3    6.00    0.27
256^3   148     27
Table 6.1: Average inference time (seconds) for a single mesh.
Figure 6.10: DISN One-stream Network.
6.4.1 Single-view Reconstruction Comparison with State-of-the-art Methods
In this section, we compare our approach on single-view reconstruction with state-
of-the-art methods: AtlasNet [36], Pixel2Mesh [118], 3DN [121], OccNet [74] and
IMNET [12]. AtlasNet [36] and Pixel2Mesh [118] generate a fixed-topology mesh from
a 2D image. 3DN [121] deforms a given source mesh to reconstruct the target model.
When comparing to this method, we choose a source mesh from a given set of tem-
plates by querying a template embedding as proposed in the original work. IMNET [12]
and OccNet [74] both predict the sign of SDF to reconstruct 3D shapes. Since IMNET
trains an individual model for each category, we implement their model following the
original paper, and train a single model on all 13 categories. Due to mismatch between
the scales of shapes reconstructed by our method and OccNet, we only report their IoU,
which is scale-invariant. In addition, we train a 3D CNN model, denoted by ‘3DCNN’,
where the encoder is the same as DISN and a decoder is a volumetric 3D CNN structure
with an output dimension of 64^3. The ground truth for 3DCNN is the SDF values on all 64^3
grid locations. For both IMNET and 3DCNN, we use the same surface recon-
struction method as ours to output reconstructed meshes. We also report the results of
DISN using estimated camera poses and ground truth poses, denoted by 'Ours_cam' and
‘Ours’ respectively. AtlasNet, Pixel2Mesh and 3DN use explicit surface generation,
while 3DCNN, IMNET, OccNet and our methods reconstruct implicit surfaces.
As shown in Table 6.2, DISN outperforms all other models in EMD and IoU. Only
3DN performs better than our model on CD; however, 3DN requires more information than ours in the form of a source mesh as input. Figure 6.7 shows qualitative results.
As illustrated in both quantitative and qualitative results, implicit surface representation
provides a flexible method of generating topology-variant 3D meshes. Comparisons to
3D CNN show that predicting SDF values for given points produces smoother surfaces
than generating a fixed 3D volume using an image embedding. We speculate that this is
due to SDF being a continuous function with respect to point locations. It is harder for
a deep network to approximate an overall SDF volume with global image features only.
Moreover, our method outperforms IMNET and OccNet in terms of recovering shape
details. For example, in Figure 6.7, local feature extraction enables our method to gener-
ate different patterns of the chair backs in the first three rows, while other methods fail to
capture such details. We further validate the effectiveness of our local feature extraction
module in Section 6.4.3. Although using ground truth camera poses (i.e., 'Ours') outperforms using predicted camera poses (i.e., 'Ours_cam') in quantitative results, the corresponding qualitative results demonstrate no significant difference.
6.4.2 Camera Pose Estimation
We compare our camera pose estimation with [49]. Given a point cloud PC_w in world coordinates for an input image, we transform PC_w using the predicted camera pose and compute the mean distance d_3D between the transformed point cloud and the ground
plane bench box car chair display lamp speaker rifle sofa table phone boat Mean
EMD
AtlasNet 3.39 3.22 3.36 3.72 3.86 3.12 5.29 3.75 3.35 3.14 3.98 3.19 4.39 3.67
Pixel2mesh 2.98 2.58 3.44 3.43 3.52 2.92 5.15 3.56 3.04 2.70 3.52 2.66 3.94 3.34
3DN 3.30 2.98 3.21 3.28 4.45 3.91 3.99 4.47 2.78 3.31 3.94 2.70 3.92 3.56
IMNET 2.90 2.80 3.14 2.73 3.01 2.81 5.85 3.80 2.65 2.71 3.39 2.14 2.75 3.13
3D CNN 3.36 2.90 3.06 2.52 3.01 2.85 4.73 3.35 2.71 2.60 3.09 2.10 2.67 3.00
Ours_cam 2.67 2.48 3.04 2.67 2.67 2.73 4.38 3.47 2.30 2.62 3.11 2.06 2.77 2.84
Ours 2.45 2.41 2.99 2.52 2.62 2.63 4.11 3.37 1.93 2.55 3.07 2.00 2.55 2.71
CD
AtlasNet 5.98 6.98 13.76 17.04 13.21 7.18 38.21 15.96 4.59 8.29 18.08 6.35 15.85 13.19
Pixel2mesh 6.10 6.20 12.11 13.45 11.13 6.39 31.41 14.52 4.51 6.54 15.61 6.04 12.66 11.28
3DN 6.75 7.96 8.34 7.09 17.53 8.35 12.79 17.28 3.26 8.27 14.05 5.18 10.20 9.77
IMNET 12.65 15.10 11.39 8.86 11.27 13.77 63.84 21.83 8.73 10.30 17.82 7.06 13.25 16.61
3D CNN 10.47 10.94 10.40 5.26 11.15 11.78 35.97 17.97 6.80 9.76 13.35 6.30 9.80 12.30
Ours_cam 9.96 8.98 10.19 5.39 7.71 10.23 25.76 17.90 5.58 9.16 13.59 6.40 11.91 10.98
Ours 9.01 8.32 9.98 4.92 7.54 9.58 22.73 16.70 4.36 8.71 13.29 6.21 10.87 10.17
IoU
AtlasNet 39.2 34.2 20.7 22.0 25.7 36.4 21.3 23.2 45.3 27.9 23.3 42.5 28.1 30.0
Pixel2mesh 51.5 40.7 43.4 50.1 40.2 55.9 29.1 52.3 50.9 60.0 31.2 69.4 40.1 47.3
3DN 54.3 39.8 49.4 59.4 34.4 47.2 35.4 45.3 57.6 60.7 31.3 71.4 46.4 48.7
IMNET 55.4 49.5 51.5 74.5 52.2 56.2 29.6 52.6 52.3 64.1 45.0 70.9 56.6 54.6
3D CNN 50.6 44.3 52.3 76.9 52.6 51.5 36.2 58.0 50.5 67.2 50.3 70.9 57.4 55.3
OccNet 54.7 45.2 73.2 73.1 50.2 47.9 37.0 65.3 45.8 67.1 50.6 70.9 52.1 56.4
Ours_cam 57.5 52.9 52.3 74.3 54.3 56.4 34.7 54.9 59.2 65.9 47.9 72.9 55.9 56.9
Ours 61.7 54.2 53.1 77.0 54.9 57.7 39.7 55.9 68.0 67.1 48.9 73.6 60.2 59.4
Table 6.2: Quantitative results on ShapeNet Core for various methods. Metrics are CD (×0.001, the smaller the better), EMD (×100, the smaller the better) and IoU (%, the larger the better). CD and EMD are computed on 2048 points.
[49] Ours
d_3D   0.073   0.047
d_2D   4.86    2.95
Table 6.3: Camera pose estimation comparison. The unit of d_2D is pixels.
truth point cloud in camera space. We also compute the 2D reprojection error d_2D of the transformed point cloud after we project it onto the input image. Table 6.3 reports d_3D and d_2D of [49] and our method. With the help of the 6D rotation representation, our method outperforms [49] by 2 pixels in terms of 2D reprojection error.
Figure 6.11: Qualitative results of our method using different settings (columns, left to right: input, Binary_cam, Binary, Global, One-stream_cam, One-stream, Two-stream_cam, Two-stream, GT). 'GT' denotes ground truth shapes, and the subscript 'cam' denotes models with estimated camera parameters.
6.4.3 Ablation Studies
To show the impact of the camera pose estimation, local feature extraction, and different
network architectures, we conduct ablation studies on the ShapeNet “chair” category,
since it has the greatest variety. Table 6.4 reports the quantitative results and Figure 6.11
shows the qualitative results.
Camera Pose Estimation As is shown in Section 6.4.2, camera pose estimation
potentially introduces uncertainty to the local feature extraction process with an average
reprojection error of 2.95 pixels. Although the quantitative reconstruction results with ground truth camera parameters are consistently superior to the results with estimated
parameters in Table 6.4, Figure 6.11 demonstrates that a small difference in the image
projection does not affect the reconstruction quality significantly.
Binary Classification Previous studies [74, 12] formulate SDF prediction as a binary
classification problem by predicting the probability of a point being inside or outside the
surface S_0. Even though Section 6.4.1 illustrates our superior performance over [74, 12],
we further validate the effectiveness of our regression supervision by comparing with
classification supervision using our own network structure. Instead of producing an SDF
value, we train our network with classification supervision and output the probability of
a point being inside the mesh surface. We use a softmax cross entropy loss to optimize
this network. We report the result of this classification network as ‘Binary’.
Local Feature Extraction Local image features of each point provide access to the
corresponding local information that captures shape details. To validate the effectiveness of this information, we remove the 'local feature extraction' module from DISN and
denote this setting by ‘Global’. This model predicts the SDF value solely based on
the global image features. By comparing ‘Global’ with other methods in Table 6.4 and
Figure 6.11, we conclude that local feature extraction helps the model capture shape
details and improve the reconstruction quality by a large margin.
Network Structures To further assess the impact of different network architectures, in
addition to our original architecture with two decoders (which we call ’Two-stream’), we
also introduce a ‘One-stream’ architecture where the global features, the local features
and the point features are concatenated and fed into a single decoder which predicts the
SDF value. As illustrated in Table 6.4 and Figure 6.11, the original Two-stream setting
is slightly superior to One-stream, which shows that DISN is robust to different network
architectures.
Binary Global One-stream Two-stream
Camera Pose   ground truth | estimated   n/a   ground truth | estimated   ground truth | estimated
EMD   2.88 | 2.99   2.75 | n/a   2.74 | 2.71   2.62 | 2.65
CD    8.27 | 8.80   7.64 | n/a   8.30 | 7.86   7.55 | 7.63
IoU   54.9 | 53.5   54.8 | n/a   53.5 | 53.6   55.3 | 53.9
Table 6.4: Quantitative results on the category "chair". CD (×0.001), EMD (×100) and IoU (%).
Figure 6.12: Shape interpolation result.
Figure 6.13: Testing our model on online product images.
6.4.4 Applications
Shape interpolation Figure 6.12 shows shape interpolation results where we interpo-
late both global and local image features going from the leftmost sample to the right-
most. We see that the generated shape is gradually transformed.
Test with online product images Figure 6.13 illustrates 3D reconstruction results by
DISN on online product images. Note that our model is trained on rendered images; this experiment validates the domain transferability of DISN.
Multi-view reconstruction Our model can also take multiple 2D views of the same
object as input. After extracting the global and the local image features for each view,
Figure 6.14: Multi-view reconstruction results. (a) Single-view input. (b) Reconstruc-
tion result from (a). (c)&(d) Two other views. (e) Multi-view reconstruction result from
(a), (c) and (d).
we apply max pooling and use the resulting features as input to each decoder. We
have retrained our network for 3 input views and visualize some results in Figure 6.14.
Combining multi-view features helps DISN to further capture shape details.
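A minimal sketch of this fusion step (assuming the per-view features are stacked along the first axis; the feature dimension below is illustrative) is:

    import numpy as np

    def fuse_views(per_view_features):
        # Max-pool image features across views before feeding them to the decoders.
        return np.max(np.stack(per_view_features, axis=0), axis=0)

    fused = fuse_views([np.random.rand(1024) for _ in range(3)])  # e.g., 3 input views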
6.5 Conclusion
In this paper, we present DISN, a deep implicit surface network for single-view recon-
struction. Given a 3D point and an input image, DISN predicts the SDF value for the
point. We introduce a local feature extraction module by projecting the 3D point onto
the image plane with an estimated camera pose. With the help of such local features,
DISN is able to capture fine-grained details and generate high-quality 3D models. Qual-
itative and quantitative experiments validate the superior performance of DISN over
state-of-the-art methods and the flexibility of our model.
Though we achieve state-of-the-art performance in single-view reconstruction, our
method is only able to handle objects with a clean background since it is trained with ren-
dered images. To address this limitation, our future work includes extending SDF gen-
eration with texture prediction using a differentiable renderer [57].
Chapter 7
Conclusion and Future Work
7.1 Summary of Research
This thesis addresses two fundamental problems in 3D vision: perception and modeling.
We introduce our algorithms on different 3D data representations, i.e., depth images,
point clouds, volumetric grids, and meshes.
First, a depth-aware CNN is introduced to increase the capability of CNNs to handle geometric data. The convolution and pooling kernels are augmented with information from depth images. These lightweight operations can be easily integrated into conventional CNNs. Experimental results on RGB-D segmentation validate the effectiveness and efficiency of these novel operations. Then, an instance segmentation method on point clouds is introduced. Exploiting the sparsity of point clouds, a similarity group proposal network is formed by computing pairwise similarities in feature space. These perception methods on both depth images and point clouds achieve state-of-the-art performance and show great potential for 3D understanding.
For 3D modeling, we focus on solving shape inpainting and reconstruction prob-
lems. 3D CNN methods on volumetric grids are hard to scale up due to the GPU mem-
ory limitation. To address this limitation, we use a recurrent neural network to process a 3D volume as a sequence of 2D images, which reduces GPU memory usage substantially. Moreover, we introduce a 3D generative adversarial loss during training to
increase the quality of inpainting results. For 3D reconstruction, we present 3DN, a 3D
deformation method which deforms an existing source mesh to reconstruct a target. We
also introduce a mesh sampling operation, which is flexible and effective in handling 3D meshes with varying numbers of vertices and different connectivities. Experiments on a large-scale 3D CAD dataset have shown the superior performance of our 3D modeling
approaches. Even though 3DN achieves state-of-the-art results in 3D reconstruction,
this method cannot change the topology or connectivity of the source mesh and consequently fails to generate fine-grained shape details such as holes and small poles. To address this
issue, we use an alternative surface representation, Signed Distance Functions, and fur-
ther introduce a deep implicit surface network (DISN). With the help of a local feature
extraction module, DISN is able to capture both overall shape and fine-grained details.
To the best of our knowledge, DISN is the first learning-based method that is able to
generate high-quality shape details. Experiments validate the effectiveness of DISN on
generating shapes with complex patterns.
7.2 Future Work
There are several potential research directions in the field of 3D vision. We name a few
for both perception and modeling. Future works for 3D perception include
1. Large Scale 3D Scene Understanding. Compared to 2D, the size of 3D data
increases exponentially. Even with sparse representations, deep networks such as PointNet/PointNet++ cannot process large-scale point clouds (more than 200,000 points) due to GPU memory limitations. However, a Velodyne VLP-16 LiDAR is able to capture 300,000 points per second. Processing such large-scale point clouds in real time is a potential future direction in 3D deep learning.
2. Sensor fusion. Robotic systems such as autonomous driving usually rely on mul-
tiple sensors. Even though 3D point cloud contains more geometric information
than 2D images, cameras are still essential in such applications. For example,
LiDAR relies on light reflection to sense object distance, so it produces noise in rainy weather since raindrops affect light reflection. Therefore, we need to seek more reliable data under such conditions. Fusing different types of data
will make the system more reliable and robust.
For 3D modeling, the potential directions are
1. High quality reconstruction. GANs [56, 79] have achieved good performance
on 2D image generation. The resolution of their generated images can be as high
as 1024×1024. However, 3D shape generation methods still operate at low resolution. By taking advantage of these 2D generation advances, generating fine-grained details in 3D models is a promising direction.
2. Mesh generation. Previous deep learning methods on 3D reconstruction use a
unified 3D mesh (an ellipsoid or several atlases) to reproduce every target shape.
These methods fail to deal with shapes of different topologies, for instance, a
chair with holes. Although our 3D deformation network is able to handle meshes
with different topologies, it still requires a source mesh and the connectivity is
not changed during deformation. Moreover, in real life, we cannot always find a
mesh for every object we want to reconstruct. Generating meshes with varying topology is
important yet challenging.
Bibliography
[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado,
A. Davis, J. Dean, M. Devin, et al. Tensorflow: Large-scale machine learning on
heterogeneous systems, 2015. Software available from tensorflow. org, 1, 2015.
[2] I. Armeni, A. Sax, A. R. Zamir, and S. Savarese. Joint 2D-3D-Semantic Data for
Indoor Scene Understanding. ArXiv e-prints, 2017.
[3] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and
S. Savarese. 3d semantic parsing of large-scale indoor spaces. In CVPR, 2016.
[4] V . Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional
encoder-decoder architecture for image segmentation. TPAMI, 2017.
[5] A. Bansal, B. Russell, and A. Gupta. Marr Revisited: 2D-3D model alignment
via surface normal prediction. In CVPR, 2016.
[6] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. S. Torr. Fully-
convolutional siamese networks for object tracking. In ECCV Workshops, 2016.
[7] A. Brock, T. Lim, J. Ritchie, and N. Weston. Generative and discriminative voxel
modeling with convolutional neural networks. arXiv preprint arXiv:1608.04236,
2016.
[8] W. Byeon, T. M. Breuel, F. Raue, and M. Liwicki. Scene labeling with lstm
recurrent neural networks. In CVPR, June 2015.
[9] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li,
S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. Shapenet:
An information-rich 3d model repository. arxiv, 2015.
[10] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic
image segmentation with deep convolutional nets and fully connected crfs. In
ICLR, 2015.
[11] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia. Multi-view 3d object detection
network for autonomous driving. In CVPR, 2017.
[12] Z. Chen and H. Zhang. Learning implicit fields for generative shape modeling.
arXiv preprint arXiv:1812.02822, 2018.
[13] Y . Cheng, R. Cai, Z. Li, X. Zhao, and K. Huang. Locality-sensitive deconvolution
networks with gated fusion for rgb-d indoor semantic segmentation. In CVPR,
2017.
[14] S. Chopra, R. Hadsell, and Y . LeCun. Learning a similarity metric discrimina-
tively, with application to face verification. In CVPR, 2005.
[15] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3d-r2n2: A unified
approach for single and multi-view 3d object reconstruction. In ECCV, 2016.
[16] C. Couprie, C. Farabet, L. Najman, and Y . Lecun. Indoor semantic segmentation
using depth information. In ICLR, 2013.
[17] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner.
Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, 2017.
[18] A. Dai, C. R. Qi, and M. Nießner. Shape completion using 3d-encoder-predictor
cnns and shape synthesis. In CVPR, 2017.
[19] J. Dai, K. He, Y . Li, S. Ren, and J. Sun. Instance-sensitive fully convolutional
networks. In ECCV, 2016.
[20] J. Dai, K. He, and J. Sun. Instance-aware semantic segmentation via multi-task
network cascades. In CVPR, 2016.
[21] J. Dai, H. Qi, Y . Xiong, Y . Li, G. Zhang, H. Hu, and Y . Wei. Deformable convo-
lutional networks. In ICCV, 2017.
[22] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-
Scale Hierarchical Image Database. In CVPR, 2009.
[23] Z. Deng and L. J. Latecki. Amodal detection of 3d objects: Inferring 3d bounding
boxes from 2d ones in rgb-depth images. In CVPR, 2017.
[24] E. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using
a laplacian pyramid of adversarial networks. In NIPS, 2015.
[25] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan,
K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual
recognition and description. In CVPR, 2015.
[26] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels
with a common multi-scale convolutional architecture. In ICCV, 2015.
[27] H. Fan, H. Su, and L. J. Guibas. A point set generation network for 3d object
reconstruction from a single image. In CVPR, 2017.
[28] A. Frome, Y . Singer, F. Sha, and J. Malik. Learning globally-consistent local
distance functions for shape-based image retrieval and classification. In ICCV,
2007.
[29] C.-Y . Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg. Dssd: Deconvolutional
single shot detector. arXiv preprint arXiv:1701.06659, 2017.
[30] R. Gal, O. Sorkine, N. J. Mitra, and D. Cohen-Or. iwires: An analyze-and-edit
approach to shape manipulation. ACM Trans. on Graph., 28(3), 2009.
[31] R. Girdhar, D. Fouhey, M. Rodriguez, and A. Gupta. Learning a predictable and
generative vector representation for objects. In ECCV, 2016.
[32] R. Girshick. Fast r-cnn. In ICCV, 2015.
[33] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for
accurate object detection and semantic segmentation. In CVPR, 2014.
[34] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair,
A. Courville, and Y . Bengio. Generative adversarial nets. In NIPS, 2014.
[35] T. Groueix, M. Fisher, V. G. Kim, B. Russell, and M. Aubry. 3d-coded: 3d correspondences by deep deformation. In ECCV, 2018.
[36] T. Groueix, M. Fisher, V. G. Kim, B. Russell, and M. Aubry. AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation. In CVPR, 2018.
[37] S. Gupta, R. Girshick, P. Arbelaez, and J. Malik. Learning rich features from
RGB-D images for object detection and segmentation. In ECCV, 2014.
[38] X. Han, T. Leung, Y . Jia, R. Sukthankar, and A. C. Berg. Matchnet: Unifying
feature and metric learning for patch-based matching. In CVPR, 2015.
[39] X. Han, Z. Li, H. Huang, E. Kalogerakis, and Y . Yu. High-resolution shape
completion using deep neural networks for global structure and local geometry
inference. In ICCV, 2017.
[40] C. Häne, S. Tulsiani, and J. Malik. Hierarchical surface prediction for 3d object
reconstruction. In 3DV, 2017.
[41] C. Hazirbas, L. Ma, C. Domokos, and D. Cremers. Fusenet: incorporating depth
into semantic segmentation via fusion-based cnn architecture. In ACCV, 2016.
[42] K. He, G. Gkioxari, P. Dollar, and R. Girshick. Mask r-cnn. In ICCV, 2017.
[43] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing
human-level performance on imagenet classification. In ICCV, 2015.
[44] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recogni-
tion. In CVPR, 2016.
[45] Y . He, W.-C. Chiu, M. Keuper, and M. Fritz. Std2p: Rgbd semantic segmentation
using spatio-temporal data-driven pooling. In CVPR, 2017.
[46] Q. Huang, H. Wang, and V . Koltun. Single-view reconstruction via joint analysis
of image and shape collections. ACM Trans. Graph., 2015.
[47] Q. Huang, W. Wang, and U. Neumann. Recurrent slice networks for 3d segmen-
tation on point clouds. CVPR, 2018.
[48] Q. Huang, W. Wang, K. Zhou, S. You, and U. Neumann. Scene labeling
using gated recurrent units with explicit long range conditioning. arXiv preprint
arXiv:1611.07485, 2016.
[49] E. Insafutdinov and A. Dosovitskiy. Unsupervised learning of shape and pose
with differentiable point clouds. In NeurIPS, 2018.
[50] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training
by reducing internal covariate shift. In ICML, 2015.
[51] D. Jack, J. K. Pontes, S. Sridharan, C. Fookes, S. Shirazi, F. Maire, and A. Eriks-
son. Learning free-form deformations for 3d object reconstruction. In ACCV,
2018.
[52] M. Jaderberg, K. Simonyan, A. Zisserman, and k. kavukcuoglu. Spatial trans-
former networks. In NIPS, 2015.
[53] A. Janoch, S. Karayev, Y . Jia, J. T. Barron, M. Fritz, K. Saenko, and T. Darrell. A
category-level 3-d object dataset: Putting the kinect to work. In ICCV workshop,
2011.
[54] N. Kalchbrenner and P. Blunsom. Recurrent continuous translation models. In
EMNLP, 2013.
[55] A. Kanazawa, S. Kovalsky, R. Basri, and D. W. Jacobs. Learning 3d deformation
of animals from 2d images. In Eurographics, 2016.
[56] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for genera-
tive adversarial networks. arXiv preprint arXiv:1812.04948, 2018.
[57] H. Kato, Y . Ushiku, and T. Harada. Neural 3d mesh renderer. In CVPR, 2018.
[58] S. H. Khan, M. Bennamoun, F. Sohel, and R. Togneri. Geometry driven semantic
labeling of indoor scenes. In D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars,
editors, ECCV, 2014.
[59] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv
preprint arXiv:1412.6980, 2014.
[60] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In ICLR, 2014.
[61] G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neural networks for one-shot
image recognition. In ICML Deep Learning Workshop, 2015.
[62] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep
convolutional neural networks. In NIPS, 2012.
[63] A. Kurenkov, J. Ji, A. Garg, V . Mehta, J. Gwak, C. Choy, and S. Savarese.
Deformnet: Free-form deformation network for 3d shape reconstruction from
a single image. arXiv preprint arXiv:1708.04672, 2017.
[64] L. Leal-Taixé, C. Canton-Ferrer, and K. Schindler. Learning by tracking: siamese
cnn for robust target association. CVPR DeepVision Workshops, 2016.
[65] Y . Li, H. Qi, J. Dai, X. Ji, and Y . Wei. Fully convolutional instance-aware seman-
tic segmentation. In CVPR, 2017.
[66] Z. Li, Y . Gan, X. Liang, Y . Yu, H. Cheng, and L. Lin. Lstm-cf: Unifying context
modeling and fusion with lstms for rgb-d scene labeling. In ECCV, 2016.
[67] D. Lin, G. Chen, D. Cohen-Or, P.-A. Heng, and H. Huang. Cascaded feature
network for semantic segmentation of rgb-d images. In ICCV, 2017.
[68] T.-Y . Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature
pyramid networks for object detection. In CVPR, 2017.
[69] T.-Y . Lin, P. Goyal, R. Girshick, K. He, and P. Dollar. Focal loss for dense object
detection. In CVPR, 2017.
[70] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y . Fu, and A. C. Berg.
SSD: Single shot multibox detector. In ECCV, 2016.
[71] W. E. Lorensen and H. E. Cline. Marching cubes: A high resolution 3d surface
construction algorithm. In ACM siggraph computer graphics, 1987.
[72] L. Ma, J. Stueckler, C. Kerl, and D. Cremers. Multi-view deep learning for con-
sistent semantic mapping with rgb-d cameras. In IROS, 2017.
[73] D. Maturana and S. Scherer. Voxnet: A 3d convolutional neural network for
real-time object recognition. In IROS, 2015.
[74] L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger. Occupancy
networks: Learning 3d reconstruction in function space. In CVPR, 2019.
[75] A. Newell and J. Deng. Associative embedding: End-to-end learning for joint
detection and grouping. In NIPS, 2016.
[76] C. Niu, J. Li, and K. Xu. Im2struct: Recovering 3d shape structure from a single
rgb image. In CVPR, 2018.
[77] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove. Deepsdf:
Learning continuous signed distance functions for shape representation. arXiv
preprint arXiv:1901.05103, 2019.
[78] S.-J. Park, K.-S. Hong, and S. Lee. Rdfnet: Rgb-d multi-level residual feature
fusion for indoor semantic segmentation. In ICCV, 2017.
[79] T. Park, M.-Y . Liu, T.-C. Wang, and J.-Y . Zhu. Semantic image synthesis with
spatially-adaptive normalization. In CVPR, 2019.
[80] D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. Efros. Context encoders:
Feature learning by inpainting. In CVPR, 2016.
[81] P. O. Pinheiro, R. Collobert, and P. Dollár. Learning to segment object candidates.
In NIPS, 2015.
[82] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár. Learning to refine object
segments. In ECCV, 2016.
[83] J. K. Pontes, C. Kong, S. Sridharan, S. Lucey, A. Eriksson, and C. Fookes.
Image2mesh: A learning framework for single image 3d reconstruction. In
ACCV, 2017.
[84] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets
for 3d classification and segmentation. CVPR, 2017.
[85] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas. Volumetric and
multi-view cnns for object classification on 3d data. In CVPR, 2016.
[86] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature
learning on point sets in a metric space. In NIPS, 2017.
[87] X. Qi, R. Liao, J. Jia, S. Fidler, and R. Urtasun. 3d graph neural networks for
rgbd semantic segmentation. In ICCV, 2017.
[88] R. Qiu, Q.-Y . Zhou, and U. Neumann. Pipe-run extraction and reconstruction
from point clouds. In ECCV, 2014.
[89] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with
deep convolutional generative adversarial networks. In ICLR, 2016.
[90] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified,
real-time object detection. In CVPR, 2016.
[91] J. Redmon and A. Farhadi. Yolo9000: Better, faster, stronger. In CVPR, 2017.
[92] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object
detection with region proposal networks. In NIPS, 2015.
[93] X. Ren, L. Bo, and D. Fox. Rgb-(d) scene labeling: Features and algorithms. In
CVPR, 2012.
[94] Z. Ren and E. B. Sudderth. Three-dimensional object detection and layout pre-
diction using clouds of oriented gradients. In CVPR, 2016.
[95] G. Riegler, A. O. Ulusoy, and A. Geiger. Octnet: Learning deep 3d representa-
tions at high resolutions. In CVPR, 2017.
[96] A. Sharma, O. Grau, and M. Fritz. Vconv-dae: Deep volumetric shape learning
without object labels. arXiv preprint arXiv:1604.03755, 2016.
[97] E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic
segmentation. PAMI, 2016.
[98] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and sup-
port inference from rgbd images. In ECCV, 2012.
[99] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer.
Discriminative learning of deep convolutional feature point descriptors. In ICCV,
2015.
[100] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556, 2014.
[101] F. S. Sin, D. Schroeder, and J. Barbič. Vega: non-linear fem deformable object
simulator. In Computer Graphics Forum. Wiley Online Library, 2013.
[102] A. Sinha, A. Unmesh, Q. Huang, and K. Ramani. Surfnet: Generating 3d shape
surfaces using deep residual networks. In CVPR, 2018.
[103] S. Song, S. P. Lichtenberg, and J. Xiao. Sun rgb-d: A rgb-d scene understanding
benchmark suite. In CVPR, 2015.
[104] S. Song and J. Xiao. Sliding shapes for 3d object detection in depth images. In
ECCV, 2014.
[105] S. Song and J. Xiao. Deep Sliding Shapes for amodal 3D object detection in
RGB-D images. In CVPR, 2016.
[106] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser. Semantic
scene completion from a single depth image. In CVPR, 2017.
[107] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser. Semantic
scene completion from a single depth image. In CVPR, 2017.
[108] O. Sorkine, D. Cohen-Or, Y. Lipman, M. Alexa, C. Rössl, and H.-P. Seidel. Lapla-
cian surface editing. In Eurographics, 2004.
[109] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller. Multi-view convolutional
neural networks for 3d shape recognition. In ICCV, 2015.
[110] X. Sun, J. Wu, X. Zhang, Z. Zhang, C. Zhang, T. Xue, J. B. Tenenbaum, and W. T.
Freeman. Pix3d: Dataset and methods for single-image 3d shape modeling. In
CVPR, 2018.
[111] M. Sung, V . G. Kim, R. Angst, and L. Guibas. Data-driven structural priors for
shape completion. ACM Transactions on Graphics (Proc. of SIGGRAPH Asia),
2015.
[112] J. Tang, X. Han, J. Pan, K. Jia, and X. Tong. A skeleton-bridged deep learning
approach for generating meshes of complex topologies from single rgb images.
In CVPR, 2019.
[113] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Octree generating networks: Effi-
cient convolutional architectures for high-resolution 3d outputs. In ICCV, 2017.
[114] S. Tulsiani, T. Zhou, A. A. Efros, and J. Malik. Multi-view supervision for single-
view reconstruction via differentiable ray consistency. In CVPR, 2017.
[115] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural
networks. In ICML, 2016.
[116] F. Visin, M. Ciccone, A. Romero, K. Kastner, K. Cho, Y . Bengio, M. Matteucci,
and A. Courville. Reseg: A recurrent neural network-based model for semantic
segmentation. In CVPR Workshops, June 2016.
131
[117] J. Wang, Z. Wang, D. Tao, S. See, and G. Wang. Learning common and specific
features for RGB-D semantic segmentation with deconvolutional networks. In
ECCV, 2016.
[118] N. Wang, Y . Zhang, Z. Li, Y . Fu, W. Liu, and Y .-G. Jiang. Pixel2mesh: Generating
3d mesh models from single rgb images. arXiv preprint arXiv:1804.01654, 2018.
[119] P.-S. Wang, Y . Liu, Y .-X. Guo, C.-Y . Sun, and X. Tong. O-cnn: Octree-based con-
volutional neural networks for 3d shape analysis. ACM Transactions on Graphics
(TOG), 2017.
[120] P.-S. Wang, C.-Y . Sun, Y . Liu, and X. Tong. Adaptive o-cnn: A patch-based deep
representation of 3d shapes. arXiv preprint arXiv:1809.07917, 2018.
[121] W. Wang, D. Ceylan, R. Mech, and U. Neumann. 3dn: 3d deformation network.
In CVPR, 2019.
[122] W. Wang, Q. Huang, S. You, C. Yang, and U. Neumann. Shape inpainting using
3d generative adversarial network and recurrent convolutional networks. In ICCV,
2017.
[123] W. Wang, N. Wang, X. Wu, S. You, C. Yang, and U. Neumann. Self-paced cross-
modality transfer learning for efficient road segmentation. In ICRA, 2017.
[124] W. Wang, R. Yu, Q. Huang, and U. Neumann. Sgpn: Similarity group proposal
network for 3d point cloud instance segmentation. CVPR, 2018.
[125] K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin
nearest neighbor classification. Journal of Machine Learning Research, 2009.
[126] J. Wu, Y . Wang, T. Xue, X. Sun, W. T. Freeman, and J. B. Tenenbaum. MarrNet:
3D Shape Reconstruction via 2.5D Sketches. In NIPS, 2017.
[127] J. Wu, C. Zhang, T. Xue, W. T. Freeman, and J. B. Tenenbaum. Learning a
probabilistic latent space of object shapes via 3d generative-adversarial modeling.
In NIPS, 2016.
[128] J. Wu, C. Zhang, X. Zhang, Z. Zhang, W. T. Freeman, and J. B. Tenenbaum.
Learning shape priors for single-view 3d completion and reconstruction. In NIPS,
2018.
[129] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets:
A deep representation for volumetric shapes. In CVPR, 2015.
[130] H. Xu and J. Barbiˇ c. Signed distance fields for polygon soup meshes. In Proceed-
ings of Graphics Interface 2014, pages 35–41. Canadian Information Processing
Society, 2014.
132
[131] K. Xu, H. Zheng, H. Zhang, D. Cohen-Or, L. Liu, and Y . Xiong. Photo-inspired
model-driven 3d object modeling. ACM Trans. Graph., 30(4):80:1–80:10, 2011.
[132] X. Yan, J. Yang, E. Yumer, Y . Guo, and H. Lee. Perspective transformer nets:
Learning single-view 3d object reconstruction without 3d supervision. In NIPS,
2016.
[133] C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li. High-resolution
image inpainting using multi-scale neural patch synthesis. In CVPR, 2017.
[134] G. Yang, Y . Cui, S. Belongie, and B. Hariharan. Learning single-view 3d recon-
struction with limited pose supervision. In ECCV, 2018.
[135] Y . Yang, C. Feng, Y . Shen, and D. Tian. Foldingnet: Point cloud auto-encoder
via deep grid deformation. In CVPR, 2018.
[136] D. Yi, Z. Lei, S. Liao, and S. Z. Li. Deep metric learning for person re-
identification. In ICPR, 2014.
[137] L. Yi, V . G. Kim, D. Ceylan, I. Shen, M. Yan, H. Su, A. Lu, Q. Huang, A. Sheffer,
L. Guibas, et al. A scalable active framework for region annotation in 3d shape
collections. ACM Transactions on Graphics (TOG), 2016.
[138] F. Yu and V . Koltun. Multi-scale context aggregation by dilated convolutions. In
ICLR, 2016.
[139] M. E. Yumer and N. J. Mitra. Learning semantic deformation flows with 3d
convolutional networks. In ECCV, 2016.
[140] W. Zaremba, I. Sutskever, and O. Vinyals. Recurrent neural network regulariza-
tion. arXiv preprint arXiv:1409.2329, 2014.
[141] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing
through ade20k dataset. In CVPR, 2017.
[142] Y . Zhou, C. Barnes, J. Lu, J. Yang, and H. Li. On the continuity of rotation
representations in neural networks. arXiv preprint arXiv:1812.07035, 2018.
[143] R. Zhu, H. Kiani Galoogahi, C. Wang, and S. Lucey. Rethinking reprojection:
Closing the loop for pose-aware shape reconstruction from a single image. In
ICCV, 2017.
[144] C. Zou, E. Yumer, J. Yang, D. Ceylan, and D. Hoiem. 3d-prnn: Generating shape
primitives with recurrent neural networks. In ICCV, 2017.
133