Point-based Representations for 3D Perception and Reconstruction

by Qiangeng Xu

A Dissertation Presented to the Faculty of the USC Graduate School, University of Southern California, in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy (Computer Science), December 2022. Copyright 2023 Qiangeng Xu.

Acknowledgements

First and foremost, I am deeply grateful to my Ph.D. advisor, Prof. Ulrich Neumann, for his care and support throughout the years. His insightful discussions, helpful guidance, and general openness led me to this area of study. It was a great honor to be his Ph.D. advisee and a great journey to study in the CGIT lab.

I would like to thank my co-authors, Yiqi Zhong, Cho-Ying Wu, Zexiang Xu, Julien Philip, Sai Bi, Zhixin Shu, Kalyan Sunkavalli, Duygu Ceylan, Radomir Mech, Xudong Sun, Yin Zhou, Weiyue Wang, Charles Ruizhongtai Qi, Dragomir Anguelov, and Hanwang Zhang, whom I appreciate and enjoyed working with. They also contributed to the work found in this thesis. I would also like to thank my colleagues and labmates at USC and Columbia University, particularly Cho-Ying Wu, Yiqi Zhong, Weiyue Wang, Quankai Gao, Yuming Gu, Bohan Wang, Tianye Li, Shichen Liu, Xuefeng Hu, Brian Chen, Yulei Niu, Long Chen, Jingxi Xu, Simon Zhai, and Hanwang Zhang, for wonderful working experiences, their friendship, and their help. I also want to thank Adobe Inc., Waymo LLC, and TuSimple Inc. for providing me with internship experiences; my research has been greatly inspired by these industry experiences.

I would like to thank Prof. Zengchang Qin from my undergraduate university and Prof. Peter Belhumeur, Prof. Shih-Fu Chang, and Prof. John Kender from Columbia University, who introduced me to the realm of Artificial Intelligence during my pre-Ph.D. years. I feel extremely fortunate to be part of this community and to get to solve interesting and challenging problems. I also sincerely thank my family for their encouragement and love. I greatly appreciate my parents, who always give me support, understanding, and open space. Last but not least, I would like to thank Prof. Jernej Barbic, Prof. C.-C. Jay Kuo, Prof. Ram Nevatia, Prof. Laurent Itti, and Prof. Bistra Dilkina for serving on my qualification exam or dissertation committees, and Prof. Justin Haldar, Prof. Jernej Barbic, and Prof. Neumann for serving on my dissertation defense committee. I appreciate their insightful suggestions and helpful feedback.

Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Motivation
    1.1.1 3D Representations
    1.1.2 Quality, Efficiency, and Scalability
  1.2 Overview of Contributions
    1.2.1 Point-based 3D Representation Learning
    1.2.2 Point-based 3D Object Perception
    1.2.3 Point-based 3D Reconstruction
  1.3 Organization of the Dissertation
Chapter 2: Learning Point-based Representation
  2.1 Introduction
  2.2 Related Work
    2.2.1 Voxel-based Methods for 3D Learning
    2.2.2 Point-based Methods for Point Cloud Learning
    2.2.3 Data Structuring Strategies for Point-based Representation
    2.2.4 GCN for Point Cloud Learning
  2.3 Grid-GCN
    2.3.1 Coverage-Aware Grid Query (CAGQ)
    2.3.2 Algorithms of CAGQ
    2.3.3 Comparison between CAGQ and Naive Grid Query
    2.3.4 Grid Context Aggregation
  2.4 Complexity Analysis
    2.4.1 Space Coverage
    2.4.2 Time Complexity
  2.5 Training Details
  2.6 Experiments
    2.6.1 3D Object Classification
    2.6.2 3D Scene Segmentation
    2.6.3 Ablation Studies
    2.6.4 Scalability Analysis
  2.7 Conclusion
Chapter 3: 3D Object Perception: Point-based Object Detection
  3.1 Domain Gap and Shape Miss
    3.1.1 Domain Gap in Point Cloud
    3.1.2 Shape Miss in Point Cloud
  3.2 Related Work
    3.2.1 Point-based 3D Object Detectors
    3.2.2 Unsupervised Domain Adaptation for 3D Visual Tasks
    3.2.3 Point Cloud Transformation
    3.2.4 Learning Shapes for 3D Object Detection
  3.3 Semantic Point Generation Detector
    3.3.1 Training Targets
    3.3.2 Model Structure
    3.3.3 Recover the Foreground Regions
      3.3.3.1 Hide and Predict
      3.3.3.2 Semantic Area Expansion
    3.3.4 Objectives
    3.3.5 Experiments
    3.3.6 Evaluation on the Waymo Open Dataset
    3.3.7 Evaluation on the KITTI Dataset
    3.3.8 Model Efficiency
    3.3.9 Ablation Studies
  3.4 Behind the Curtain Detector
    3.4.1 Learning Shapes in Occlusion
    3.4.2 Shape Occupancy Probability Integration
    3.4.3 Occlusion-Aware Proposal Refinement
    3.4.4 Loss
    3.4.5 Experiments
      3.4.5.1 Evaluation on the KITTI Dataset
      3.4.5.2 Evaluation on the Waymo Open Dataset
      3.4.5.3 Ablation Studies
  3.5 Conclusion
Chapter 4: 3D Reconstruction: Point-based Implicit Fields for Scene Reconstruction
  4.1 3D Representation for Surface Reconstruction
  4.2 DISN: Deep Implicit Surface Network
    4.2.1 Pixel-aligned Feature Extraction
    4.2.2 Camera Pose Estimation
    4.2.3 Signed Distance Function Prediction
    4.2.4 Surface Reconstruction
    4.2.5 Evaluation
    4.2.6 Applications
  4.3 Neural Radiance Fields for Scene Reconstruction
  4.4 Point-NeRF: Point-based Neural Radiance Fields
    4.4.1 Related Work
    4.4.2 Point-NeRF Representation
    4.4.3 Point-NeRF Reconstruction
      4.4.3.1 Initialization
      4.4.3.2 Optimization
    4.4.4 Implementation
    4.4.5 Evaluation
      4.4.5.1 The DTU Dataset
      4.4.5.2 The NeRF Synthetic Dataset
      4.4.5.3 The Tanks and Temples Dataset
      4.4.5.4 Large-scale 3D Scenes (ScanNet)
    4.4.6 Additional Experiments
Chapter 5: Conclusion and Future Work
  5.1 Summary of Research
  5.2 Future Work

List of Tables

2.1 The overall accuracy and latency of three Grid-GCN models on ScanNet [38]. Our full model uses CAGQ with 1×K node points in each group. A compact model with 0.5×K is also reported. Another model uses a naive Grid Query with 1×K node points.
2.2 Time complexity: We sample M centers from N points and query K neighbors per center. We limit the maximum number of points in each voxel to n_v.
In practice, K ≪ N, and n_v is usually of the same magnitude as K. The approximate FPS algorithm can be O(N log N) [47]. * indicates our methods. See the supplementary for derivation details.
2.3 Performance comparisons of data structuring methods, run on ModelNet40 [191]. Center sampling methods include RPS, FPS, CAGQ's RVS, and CAS. Neighbor querying methods include Ball Query, Cube Query, and k-Nearest Neighbors. Condition variables include N points, M groups, and K neighbors per group. Occupied space coverage = (number of occupied voxels of the queried points) / (number of occupied voxels of the original N points).
2.4 Results on ModelNet10 and ModelNet40 [191]. Our full model achieves state-of-the-art accuracy. With the model reduction, our compact models Grid-GCN 1-3 also outspeed other models. We discuss their details in the ablation studies.
2.5 Results on ScanNet [38]. Grid-GCN achieves a 10× speed-up on average over other models. Under batch sizes of 4 and 1, we test our model with 1×K neighbor nodes. A compact model with 0.5×K is also reported.
2.6 Results on S3DIS [5] Area 5. Grid-GCN is on average 8× faster than other models. We halve the output channels of GridConv for Grid-GCN (0.5×Ch).
2.7 Segmentation results on S3DIS [5] Area 5. We report overall accuracy (OA, %), mean class IoU (mIoU, %), and per-class IoU (%). Grid-GCN achieves the highest overall accuracy and mIoU among the four models.
2.8 Ablation studies on ModelNet40 [191]. Our models have 3 layers of GridConv. K is the number of node points in the first GridConv. We also change the number of output feature channels of these 3 layers. Grid context pooling (shortened to pooling here) is also removed for Grid-GCN 0-2. Grid-GCN 0 also removes the coverage weight in the edge relation.
2.9 Inference time (ms) on ScanNet [38] under different scales. We compare Grid-GCN with PointNet++ [138] on different numbers of input points per scene. The batch size is 1. M is the number of point groups in the first network layer.
3.1 The statistics of OD and Kirk. Each frame contains at most 163.8K points. Kirk Dry is formed by frames with dry weather in the Kirk training set.
3.2 Results on Waymo Open Dataset 1.0 and the Kirkland Dataset. Results for PointPillars are based on our own implementation following [88]. We use the PV-RCNN source code and obtain training settings for the Waymo Open Dataset [161] via direct communication with the authors.
3.3 Comparisons of different strategies targeting the deteriorating point cloud quality. The models are trained on OD and evaluated on Kirk. The metric is LEVEL_1 Vehicle 3D AP. We use PointPillars [88] as the baseline.
3.4 Car detection results on the KITTI test set. See the full list of comparisons in the supplemental.
3.5 Comparisons on the KITTI validation set. Average Precision (AP) is computed over 40 recall positions. The baseline results [150, 166] are obtained from publicly released models. See more results (including Cyclist) in the supplemental.
3.6 Latency and model parameters. "M" stands for million. The last column shows the results of a standalone SPG. The evaluation is based on a 1080Ti GPU with a batch size of 1. The latency is averaged over the KITTI val split.
3.7 Ablation studies of SPG. The models are trained on OD and evaluated on Kirk. The metric is LEVEL_1 Vehicle 3D AP. We use PointPillars [88] as our baseline.
3.8 Ablation studies on the probability threshold P_thresh (we only keep a semantic point if its predicted foreground probability P̃_f > P_thresh). Our best SPG model uses P_thresh = 0.5. The metric is LEVEL_1 Vehicle 3D AP on the Kirk validation set.
3.9 Comparison on the KITTI val set, evaluated by the 3D Average Precision (AP) under 40 recall thresholds (R40). The 3D APs under 11 recall thresholds are also reported for moderate car objects.
3.10 Comparison on the KITTI test set, evaluated by the 3D Average Precision (AP) under 40 recall thresholds (R40) on the KITTI server. BtcDet surpasses all the leaderboard front runners associated with publications released before our submission. The mAPs are averaged over the APs of easy, moderate, and hard objects. Please find more results in Appendix F.
3.11 Comparison for vehicle detection on the Waymo Open Dataset validation set.
3.12 Ablation studies on the learned features (Sec. 3.4.1) and the features fused into the backbone and f_geo (Sec. 3.4.2). BtcDet 2 directly uses a binary map that labels R_OC ∪ R_SM. The two coordinate markers indicate the spherical and the Cartesian coordinates, respectively. The "1" operator converts float values to binary codes with a threshold of 0.5. All variants share the same architecture.
3.13 Ablation studies on which layers of the backbone are fused with P(O_S) (Eq. 3.11) and whether to fuse P(O_S) into f_geo. We evaluate KITTI's moderate car objects and show the 3D AP_R11 of the proposal and the final bounding box.
4.1 Quantitative results on ShapeNet Core for various methods. Metrics are CD (×0.001, the smaller the better), EMD (×100, the smaller the better), and IoU (%, the larger the better). CD and EMD are computed on 2048 points.
4.2 F-Score for varying thresholds (% of reconstruction volume side length, same as [164]) on all categories.
4.3 Camera pose estimation comparison. The unit of d_2D is pixels.
4.4 Quantitative results on the category "chair". CD (×0.001), EMD (×100), and IoU (%).
4.5 Comparisons of our Point-NeRF with radiance-based models [112, 179, 104] and a point-based rendering model [4] on the DTU dataset [74] with the novel view synthesis setting introduced in [23]. The subscripts indicate the number of iterations during optimization.
4.6 Comparisons of Point-NeRF with radiance-based models [112, 179, 104] and a point-based rendering model [4] on the Synthetic-NeRF dataset [112]. The subscripts indicate the number of iterations. Our model not only surpasses other methods when converged after 200K steps (Point-NeRF_200K), but also surpasses IBRNet [179] and is on par with NeRF [120] when optimized for only 20K steps (Point-NeRF_20K).
Our method can also initialize radiance fields based on point clouds reconstructed by methods such as COLMAP (Point-NeRF^col_200K).
4.7 Quantitative comparison on five sample scenes in the DTU testing set with the view synthesis setting introduced in [23]. The subscripts indicate the number of iterations during optimization.
4.8 Detailed breakdown of quantitative metrics on individual scenes of the NeRF Synthetic dataset [120] for our method and baselines. All scores are averaged over the testing images. The subscripts are the number of iterations of the models, and Point-NeRF^col_200K indicates our method initialized from COLMAP points and optimized for 200 thousand iterations.
4.9 Quantitative comparison on five scenes in the Tanks and Temples dataset [80] selected in NSVF [104]. Our method, Point-NeRF, outperforms all state-of-the-art models in all metrics by substantial margins.
4.10 Quantitative comparison on two scenes in the ScanNet dataset [38] selected in NSVF [104]. RMSE is the Root Mean Square Error. Our method, Point-NeRF, outperforms all state-of-the-art methods in all metrics by substantial margins.
4.11 The quantitative results (PSNR / SSIM / LPIPS_Vgg) of the Ship and Hotdog scenes with or without point pruning and growing (P&G). The improvements are significant when using either our generated points or the point cloud generated by COLMAP [145].
4.12 Comparisons between using the extracted image features to initialize the point features (our full model) and using randomly initialized features.

List of Figures

2.1 Overview of the Grid-GCN model. (a) Illustration of the network architecture for point cloud segmentation. Our model consists of several GridConv layers, and each can be used in either a downsampling or an upsampling process. A GridConv layer includes two stages: (b) for the data structuring stage, a Coverage-Aware Grid Query (CAGQ) module achieves efficient data structuring and provides point groups for efficient computation; (c) for the convolution stage, a Grid Context Aggregation (GCA) module conducts graph convolution on the point groups by aggregating local context.
2.2 Illustration of Coverage-Aware Grid Query (CAGQ). Assume we want to sample M = 2 point groups and query K = 5 node points for each group. (a) The input is N points (grey). The voxel id and the number of points are listed for each occupied voxel. (b) We build the voxel-point index and store up to n_v = 3 points (yellow) in each voxel. (c) Comparison of different sampling methods: FPS and RPS prefer the two centers inside the marked voxels. Our RVS could randomly pick any two occupied voxels (e.g., (2,0) and (0,0)) as center voxels. If our CAS is used, voxel (0,2) will replace (0,0). (d) Context points of center voxel (2,1) are the yellow points in its neighborhood (we use 3×3 as an example). CAGQ queries 5 points (yellow points with blue rings) from these context points, then calculates the locations of the group centers.
2.3 Different strategies to compute the contribution f̃_{c,i} from a node n_i to its center c.
f_i and χ_i are the features and the location of n_i. e_i is the edge feature between n_i and c, calculated from the edge attention function. (a) PointNet++ [138] ignores e_i. (b) computes e_i based on the low-dimensional geometric relation between n_i and c. (c) also considers the semantic relation between the center and the node point, but c has to be sampled on one of the points from the previous layer. (d) Grid-GCN's geo-relation also includes the coverage weight; it pools a context feature f_cxt from all stored neighbors to provide a semantic reference for computing e_i.
2.4 The red point is the group center. Yellow points are its node points. Black points are node points of the yellow points in the previous layer. The coverage weight is an important feature as it encodes the number of black points that have been aggregated to each yellow point.
2.5 Visualization of the sampled group centers and the queried node points by RPS, FPS, and CAS. The blue and green balls indicate Ball Query. The red squares indicate Cube Query. The ball and cube have the same volume. (a) RPS covers 45.6% of the occupied space, while FPS covers 65% and CAS covers 75.2%.
2.6 Semantic segmentation results on S3DIS [5] Area 5.
2.7 More visual results on S3DIS [5] Area 5.
3.1 Examples of RGB and range images (intensity channel) in the OD validation set and the Kirk validation set. The dark regions in the range images indicate missed LiDAR returns. The regions of "missing points" are irregular in shape.
3.2 figure
3.3 The impact of the three types of shape miss. (b) shows PV-RCNN's [150] car 3D detection APs with different occlusion levels on the KITTI [54] val split. NR means no shape miss recovery. EO, SM, and SO indicate adding car points in the regions of external occlusion, signal miss, and self-occlusion, respectively, as visualized in (a).
3.4 Our Semantic Point Generation (SPG) recovers the foreground regions by generating semantic points (red). Combined with the original cloud, these semantic points can be directly used by modern LiDAR-based detectors and help improve the detection results (green boxes).
3.5 Illustration of the SPG-aided 3D detection pipeline. SPG voxelizes the entire point cloud and generates a prediction for each voxel (both occupied and empty) within the generation areas. After applying probability thresholding, we take the top voxels with the highest foreground probability and add a semantic point (red) at the predicted location in each of these voxels. These points are merged with the original point cloud and fed into the selected 3D point cloud detector.
3.6 Training target construction and SPG model architecture. Three steps create the semantic point training targets: 1. Voxelization; 2. Foreground point searching; 3. Label assignment and ground-truth point feature calculation. SPG includes three modules: 1. Voxel Feature Encoding module (VFE); 2. Information Propagation module; 3. Point Generation module.
3.7 Visualization of "Semantic Area Expansion". (a) and (c) show the occupied voxels and the generation area.
(b) and (d) show the supervision strategies.
3.8 Comparisons between generated semantic points (red) with and without "Semantic Area Expansion".
3.9 Learning occluded shapes. (a) The regions of occlusion or signal miss, R_OC ∪ R_SM, can be identified after the spherical voxelization of the point cloud. (b) To label the occupancy O_S (1 or 0), we place the approximated complete object shapes S (red points) in the corresponding boxes. (c) A shape occupancy network predicts the shape occupancy probability P(O_S) for voxels in R_OC ∪ R_SM, supervised by O_S. (d) Voxels are colored orange if they have a prediction P(O_S) > 0.3.
3.10 Assembling the approximated complete shape S for an object (blue) by using points from the top matched objects.
3.11 The detection pipeline. BtcDet first identifies the regions of occlusion and signal miss, R_OC ∪ R_SM. In these regions, BtcDet estimates the shape occupancy probability P(O_S) (the orange voxels have P(O_S) > 0.3). When the backbone network extracts detection features from the point cloud, P(O_S) is concatenated with the backbone's intermediate feature maps. Then, an RPN takes the output and generates 3D proposals. For each proposal (e.g., the green box), BtcDet pools the local geometric features f_geo to the nearby grids and finally generates the final bounding box prediction (the red box) and the confidence score.
4.1 Different 3D representations for mesh reconstruction.
4.2 Single-view reconstruction results using OccNet [115] and DISN on synthetic and real images.
4.3 Illustration of SDF. (a) Rendered 3D surface with s = 0. (b) Cross-section of the SDF. A point is outside the surface if s > 0, inside if s < 0, and on the surface if s = 0.
4.4 Local feature extraction. Given a 3D point p, we use the estimated camera parameters to project p onto the image plane. We then identify the projected location on each feature map layer of the encoder and concatenate the features at each layer to get the local features of point p.
4.5 Given an image and a point p, we estimate the camera pose and project p onto the image plane. DISN uses the local features at the projected location, the global features, and the point features to predict the SDF of p. 'MLPs' denotes multi-layer perceptrons.
4.6 Camera pose estimation network. 'PC' denotes point cloud. 'GT Cam' and 'Pred Cam' denote the ground truth and predicted cameras.
4.7 Shape reconstruction results (a) without and (b) with local feature extraction.
4.8 Single-view reconstruction results of various methods. 'GT' denotes ground truth shapes. Best viewed on screen with zooming in.
4.9 Each view of each object has four corresponding representations.
4.10 Qualitative results of our method under different settings. 'GT' denotes ground truth shapes, and 'cam' denotes models with estimated camera parameters.
4.11 Shape interpolation results.
4.12 Testing our model on online product images.
4.13 Multi-view reconstruction results. (a) Single-view input. (b) Reconstruction results from (a). (c) & (d) Two other views. (e) Multi-view reconstruction result from (a), (c), and (d).
4.14 Full set of 3D scene elements reconstructed from RGB images.
4.15 Direct volume rendering.
4.16 Discretized direct volume rendering.
4.17 NeRF: representing scenes as neural radiance fields for view synthesis.
4.18 Point-NeRF uses neural 3D points to efficiently represent and render a continuous radiance volume. The point-based radiance field can be predicted via network forward inference from multi-view images. It can then be optimized per scene to achieve reconstruction quality that surpasses NeRF [120] in tens of minutes. Point-NeRF can also leverage off-the-shelf reconstruction methods like COLMAP [145] and is able to perform point pruning and growing that automatically fix the holes and outliers common in these approaches.
4.19 The dashed lines indicate gradient updates for radiance field initialization and per-scene optimization.
4.20 The network pipeline of radiance field computation at a shading location x from K neural point neighbors. "PosEN" indicates positional encoding [120]. "d3" indicates the 3-channel vector of view directions at x. The final outputs are the radiance color r and the density σ. Please also refer to equations (3-7) in the main paper.
4.21 Qualitative comparisons of per-scene optimization on the DTU dataset [74]. Our Point-NeRF can recover texture details and geometric structures more accurately than other methods. Point-NeRF also demonstrates superior efficiency: within two minutes, our model trained for 1K steps is already on par with state-of-the-art methods such as MVSNeRF [23] and IBRNet [179].
4.22 Qualitative comparisons on the NeRF Synthetic dataset [120]. The subscripts indicate the number of iterations. Our Point-NeRF can capture fine details and thin structures (see the rope in row 2). Point-NeRF also demonstrates superior efficiency: our model trained for 20K steps is already on par with NeRF, with 30× faster training time.
4.23 Our neural point clouds and rendered novel views with or without point pruning and growing (P&G). P&G improves both the geometry and the rendering results when using the point cloud reconstructed from our model or from COLMAP [145].
4.24 The qualitative results of our Point-NeRF on the Tanks and Temples dataset.
4.25 The qualitative results of our Point-NeRF on the ScanNet dataset [38]. The first row shows five generated test frames of scene 101 and the second row shows five generated test frames of scene 241.
4.26 Starting from 1000 randomly sampled COLMAP points of the Chair scene, our point growing mechanism can help complete the geometry and generate high-quality novel views when only supervised by RGB images.
Abstract

Three-dimensional scene perception and scene reconstruction are two key vision tasks for machines to operate in the real world. Autonomous systems such as self-driving vehicles, domestic robots, and augmented reality all incorporate them as major building blocks. The brain of the autonomous agent, an artificial intelligence system, needs to learn from data and obtain understanding through 3D representations. Common 3D representations include 2D multi-view images, stereo images, LiDAR range images, depth maps, voxel grids, implicit functions, and point sets. In this thesis, we discuss the data form, the properties, and the advantages and disadvantages of each of these 3D representations. Then, after analyzing quality, efficiency, and scalability, we study point-based representations such as the LiDAR point cloud for 3D scene perception, and we propose to combine point-based and implicit representations for 3D scene reconstruction.

In a common 3D scene, the point set is sparse and vast in number. It is challenging to process the point set efficiently and hard to maintain high fidelity to the original scene through different stages of computation. We first introduce a Grid-GCN model targeting both the efficiency and the accuracy of point-based representation learning. After that, we dive into 3D scene perception and reconstruction separately.

For 3D scene perception, to address domain shift in point clouds, we present Semantic Point Generation (SPG), a general approach to enhance the reliability of LiDAR detectors against domain shifts. Another challenge for point-based machine perception is occlusion: after hitting the first object, a laser beam returns and leaves the shapes behind the occluder missing from the point cloud. We present a novel LiDAR-based 3D object detection model, dubbed Behind the Curtain Detector (BtcDet), which learns object shape priors and estimates the complete object shapes that are partially occluded (curtained) in point clouds. BtcDet predicts the probability of occupancy, which indicates whether a region contains object shapes, and generates high-quality 3D proposals for detection.

To reconstruct 3D scenes with good quality and high efficiency, we first present DISN, a Deep Implicit Surface Network that can generate a high-quality, detail-rich 3D mesh from a 2D image by predicting the underlying signed distance field. To reconstruct all graphics components together by inverse rendering, we introduce Point-NeRF, which combines the advantages of volumetric neural rendering and deep multi-view stereo by using neural 3D point clouds, with associated neural features, to model a radiance field. Point-NeRF can be rendered efficiently by aggregating neural point features near scene surfaces in a ray-marching-based rendering pipeline.

Chapter 1: Introduction

1.1 Motivation

The revolutionary advances in machine learning techniques have completely transformed the field of computer vision. With a large number of visual labels, a machine can reach or even surpass human capability on 2D vision tasks. However, real-world scenes reside in a three-dimensional space. If we want to further explore visual reasoning or to interact with the real world, machines need to understand and operate with information in 3D space.

3D scene perception and scene reconstruction are two key 3D vision tasks for machines to operate in the real world, and they serve as major building blocks of many popular autonomous systems.
For example, a self-driving car needs to perceive its 3D surroundings to make decisions. The available information includes 2D views, stereo images, disparity maps, LiDAR range images, and depth images, depending on whether RGB cameras, stereo cameras, LiDAR sensors, or depth cameras are deployed. To train the self-driving system with more data, the ability to reconstruct more 3D scenes from sensed 2D data is desirable. To build the Metaverse and AR interactions between virtual 3D content and the real world, an AR kit needs to perceive the surrounding regions, reconstruct the 3D scene, and then deploy virtual content in the modeled environment.

Human beings constantly perform these two tasks in daily life. We can sense depth with a single eye, recognize a car from different angles, and estimate the occluded parts of tables and chairs with reasonable imagination. Our brains can reliably perceive the 3D world with limited visual information. When processing visual information, human brains subconsciously rely on 3D prior knowledge to build visual cognition. This prior knowledge could also drive machines to decode the 3D world, establish artificial 3D visual understanding, and gain the capability of 3D scene recreation.

1.1.1 3D Representations

To conduct 3D perception and reconstruction, we need to let machines learn from 3D data. Several 3D representations have been developed along with the progression of computer vision and computer graphics, each with particular advantages and use cases. Let us look at the most commonly used 3D representations.

Depth or RGBD image. A depth image is a 2D image containing a depth value at each pixel; it measures the distance between the camera sensor coordinate system and the first encountered scene surface. Depth images can be obtained directly by stereo cameras [169, 122] or Kinect sensors [228]. By using a calibrated binocular system or triangulating an active illumination system, depth images become one of the cheapest and easiest-to-obtain forms of raw 3D data. Some sensors also capture RGB data at the same time, which is naturally aligned with the depth map and forms RGBD images. RGBD data has provided supervision and helped boost performance for applications such as indoor scene understanding [154] and autonomous driving [15].
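To make the depth-to-3D relationship concrete, the sketch below unprojects a depth image into a point set using a pinhole camera model. It is an illustrative example rather than code from this thesis; the intrinsics (fx, fy, cx, cy) and the depth array are hypothetical placeholders.

import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Unproject a depth image (H x W, meters) into an N x 3 point set
    in the camera coordinate frame, using pinhole intrinsics."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth
    x = (u - cx) * z / fx                            # back-project along x
    y = (v - cy) * z / fy                            # back-project along y
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]                  # drop pixels with no depth

# Hypothetical example: a 480 x 640 depth map and made-up intrinsics.
depth = np.random.uniform(0.5, 5.0, size=(480, 640)).astype(np.float32)
pts = depth_to_points(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
print(pts.shape)  # (307200, 3)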
Triangular mesh. The triangular mesh is the most commonly used representation in computer graphics because of its convenience and compactness in representing surfaces. A set of vertices is connected into triangles that, together, cover the object surfaces in the scene. It is the preferred representation for modeling and rendering; along with UV mapping and texture mapping [18], a scene can be modeled, rendered, and edited on modern computing devices.

Voxel grid. The voxel, or "volume element", is the 3D counterpart of the 2D pixel. To represent scene geometry, one can set a 3D bounding box and tessellate the 3D shapes into the voxels inside any object. Extended concepts such as the occupancy grid use 0 or 1 (occupancy) to fill the voxel grid, and color, opacity, or other properties can also be stored on the grid.

Implicit function. An implicit function describes the relationship between a spatial location and the 3D scene. The most common implicit functions include Signed Distance Functions [20], Occupancy Functions [116], and Radiance Fields [91, 44]. Explicit surfaces, volumes, or point sets can be extracted from these functions by identifying the iso-surface or solving Poisson equations [132, 113].

Point set. The point set might be the simplest representation of a 3D scene. It is also the easiest to obtain, since classic structure-from-motion and multi-view stereo methods [168, 50] or LiDAR sensors [57] can directly generate a point set of object surfaces. Various other representations can be directly transformed into points as well: taking the centers of occupied voxels, sampling on mesh triangles, unprojecting depth maps into 3D space, or placing points on the zero iso-surface of implicit functions all result in point sets.

1.1.2 Quality, Efficiency, and Scalability

For 3D scene perception and reconstruction, representing the 3D geometry faithfully and with high quality is vital during both model training and inference. Accurate ground-truth supervision can reduce model overfitting and provide a reasonable guide for a model. On the other hand, accurate inference of a 3D representation offers richer and more reliable information for downstream perception tasks and delivers reconstructions of better quality.

To support perception tasks such as segmentation or detection, the depth map provides most of the available data due to its affordability, and unprojecting the depth image into 3D space as a point set has been shown to help boost scene understanding. The LiDAR point cloud, on the other hand, provides the most reliable 3D samples on object surfaces due to the nature of laser measurement. Currently, meshes, voxels, and implicit functions cannot be directly sensed in the real world. Therefore, the point set is the most available representation for perception tasks, and the accurately measured LiDAR point cloud describes the scene with the highest quality. Without the need for auxiliary data structures, the point set is also more scalable than other representations such as meshes, voxel grids, or sparse voxels.

For 3D reconstruction tasks, it is desirable to obtain the meshes, textures, material properties, and lighting conditions of the scene from partial visual information such as 2D images or partial 3D representations. With all the graphics components, we can achieve downstream tasks such as novel view synthesis, relighting, and scene editing. Classic reconstruction methods, including SfM [168] and plane-sweep stereo [224], can robustly generate a point set of object surfaces, but transforming points into meshes introduces additional errors. In recent years, implicit functions such as SDFs and occupancy fields can produce iso-surfaces and use Marching Cubes [111] to extract mesh surfaces without being limited by resolution. Neural Radiance Fields [120] enable reconstruction of all the aforementioned graphics components together as radiance fields. However, using a holistic representation for implicit functions limits the reconstruction ability, since too many 3D local details are encoded together. Combined with a local representation such as a point set, implicit function methods can achieve higher quality, better efficiency, and greater scalability.

1.2 Overview of Contributions

In this study, we present point-based methods for 3D scene understanding and point-based implicit methods for 3D scene reconstruction. Before drilling down into these two directions, we first address point-based representation methods, since point set processing itself faces many challenges.
1.2.1 Point-based 3D Representation Learning

Point sets are usually sparse in space (80% of the space in natural scenes is empty [198]) and vast in number (160K points on average in an outdoor scene [161]). Point-based methods spend a significant amount of time on data structuring (e.g., Farthest Point Sampling (FPS) and neighbor point querying), which limits their speed and scalability. In Chapter 2, we present a method named Grid-GCN for fast and scalable point cloud learning. Grid-GCN uses a novel data structuring strategy, Coverage-Aware Grid Query (CAGQ). By leveraging the efficiency of the grid space, CAGQ improves spatial coverage while reducing the theoretical time complexity. Compared with popular sampling methods such as Farthest Point Sampling (FPS) and Ball Query, CAGQ achieves up to 50× speed-up. With a Grid Context Aggregation (GCA) module, Grid-GCN achieves state-of-the-art performance on major point cloud classification and segmentation benchmarks with significantly faster runtime than previous studies.

1.2.2 Point-based 3D Object Perception

A robust machine perception system requires its detector to reliably handle different environmental conditions. While most existing 3D detection methods are developed and evaluated within a single domain, it remains an open question how to generalize a 3D detector to different domains. On the Waymo Domain Adaptation dataset [161], we identify domain shift as the root cause of the performance drop. To address this issue, we present Semantic Point Generation (SPG), a general approach to enhance the reliability of LiDAR detectors against domain shifts. Specifically, SPG generates semantic points in the predicted foreground regions and faithfully recovers missing parts of the foreground objects caused by phenomena such as occlusion, low reflectance, or weather interference. By merging the semantic points with the original points, we obtain an augmented point cloud that can be directly consumed by modern LiDAR-based detectors (a minimal sketch of this merge step appears at the end of this subsection). SPG significantly improves detectors in cross-domain scenarios as well as in the original domain.

Another challenge for point-based machine perception is occlusion. Despite being widely used in 3D applications, LiDAR frames are technically 2.5D: after hitting the first object, a laser beam returns, leaving the shapes behind the occluder missing from the point cloud. To tackle this challenge, we present a novel LiDAR-based 3D object detection model, dubbed Behind the Curtain Detector (BtcDet), which learns object shape priors and estimates the complete object shapes that are partially occluded (curtained) in point clouds. BtcDet first identifies the regions affected by occlusion and signal miss. In these regions, our model predicts the probability of occupancy, which indicates whether a region contains object shapes. Integrated with this probability map, BtcDet can generate high-quality 3D proposals. Extensive experiments on the KITTI Dataset [54] and the Waymo Open Dataset demonstrate the effectiveness of BtcDet.
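The sketch below illustrates the augmentation step described above: keeping only generated points whose predicted foreground probability exceeds a threshold and merging them with the raw LiDAR cloud. It is a simplified, hypothetical illustration; the array names, the 0.5 threshold, and the per-point probability format are assumptions, not the SPG implementation.

import numpy as np

def augment_point_cloud(raw_points, gen_points, gen_probs, p_thresh=0.5):
    """Merge generated semantic points into the raw LiDAR point cloud.

    raw_points : (N, 3) raw LiDAR points
    gen_points : (M, 3) generated semantic point locations
    gen_probs  : (M,)   predicted foreground probabilities of the generated points
    Only generated points with probability above p_thresh are kept."""
    keep = gen_probs > p_thresh
    augmented = np.concatenate([raw_points, gen_points[keep]], axis=0)
    # A flag channel could mark which points are generated, so that a downstream
    # detector can treat them differently if desired.
    is_generated = np.concatenate(
        [np.zeros(len(raw_points)), np.ones(int(keep.sum()))])
    return augmented, is_generated

# Hypothetical usage with random stand-in data.
raw = np.random.randn(1000, 3)
gen = np.random.randn(200, 3)
probs = np.random.rand(200)
aug, flag = augment_point_cloud(raw, gen, probs)
print(aug.shape, int(flag.sum()))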
1.2.3 Point-based 3D Reconstruction

To reconstruct 3D scenes with good quality and high efficiency, we first present DISN, a Deep Implicit Surface Network that can generate a high-quality, detail-rich 3D mesh from a 2D image by predicting the underlying signed distance field. In addition to utilizing global image features, DISN predicts the projected location of each 3D point on the 2D image and extracts local features from the image feature maps (a minimal sketch of this pixel-aligned feature lookup appears at the end of this subsection). Combining global and local features significantly improves the accuracy of the signed distance field prediction, especially in detail-rich areas. To the best of our knowledge, DISN is the first method that consistently captures details such as holes and thin structures present in 3D shapes from single-view images. DISN achieves state-of-the-art single-view reconstruction performance on a variety of shape categories reconstructed from both synthetic and real images.

To reconstruct all graphics components together by inverse rendering, volumetric neural rendering methods can generate high-quality view synthesis results but are optimized per scene, which leads to prohibitive reconstruction time. On the other hand, deep multi-view stereo methods can quickly reconstruct scene geometry via direct network inference. We present Point-NeRF, which combines the advantages of these two approaches by using neural 3D point clouds, with associated neural features, to model a radiance field. Point-NeRF can be rendered efficiently by aggregating neural point features near scene surfaces in a ray-marching-based rendering pipeline. Moreover, Point-NeRF can be initialized via direct inference of a pre-trained deep network to produce a neural point cloud; this point cloud can then be fine-tuned to surpass the visual quality of NeRF with 30× faster training time. Point-NeRF can also be combined with other 3D reconstruction methods and handles the errors and outliers of such methods via a novel pruning and growing mechanism.
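As a concrete illustration of the pixel-aligned feature idea behind DISN, the sketch below projects 3D query points into an image with a camera matrix and bilinearly samples a feature map at the projected locations. It is a generic sketch under assumed shapes and a hypothetical 3x4 camera, not the DISN network code.

import numpy as np

def project_points(points, K, Rt):
    """Project (N, 3) world-space points with intrinsics K (3x3) and extrinsics Rt (3x4)."""
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)  # (N, 4)
    cam = (K @ Rt @ homo.T).T                                           # (N, 3)
    return cam[:, :2] / cam[:, 2:3]                                     # pixel (u, v)

def bilinear_sample(feat, uv):
    """Bilinearly sample a (H, W, C) feature map at float pixel locations uv (N, 2)."""
    h, w, _ = feat.shape
    u = np.clip(uv[:, 0], 0, w - 1.001)
    v = np.clip(uv[:, 1], 0, h - 1.001)
    u0, v0 = np.floor(u).astype(int), np.floor(v).astype(int)
    du, dv = (u - u0)[:, None], (v - v0)[:, None]
    top = feat[v0, u0] * (1 - du) + feat[v0, u0 + 1] * du
    bot = feat[v0 + 1, u0] * (1 - du) + feat[v0 + 1, u0 + 1] * du
    return top * (1 - dv) + bot * dv    # (N, C) local features per query point

# Hypothetical usage: sample local features for 3D query points, then feed the
# concatenation of [local features, global features, point coords] to an SDF MLP.
feat_map = np.random.randn(137, 137, 64)          # one encoder feature layer
K = np.array([[150., 0., 68.5], [0., 150., 68.5], [0., 0., 1.]])
Rt = np.hstack([np.eye(3), np.array([[0.], [0.], [2.]])])
queries = np.random.uniform(-0.5, 0.5, (2048, 3))
local_feats = bilinear_sample(feat_map, project_points(queries, K, Rt))
print(local_feats.shape)  # (2048, 64)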
1.3 Organization of the Dissertation

We first discuss the challenges of point-based 3D representation learning in Chapter 2 and present a lightning-fast model, Grid-GCN. In Chapter 3, we discuss the challenges of 3D scene perception and introduce SPG to address the domain gap and BtcDet to address occlusion in 3D data. After that, in Chapter 4, to solve 3D reconstruction tasks, we present DISN, an implicit function network that generates detail-rich surfaces of 3D scenes; to reconstruct the full set of components of 3D scenes, we present Point-NeRF, a model that uses a point-based representation to compute the implicit function of a scene. Finally, in Chapter 5, we summarize the thesis and discuss future research directions.

Chapter 2: Learning Point-based Representation

2.1 Introduction

Figure 2.1: Overview of the Grid-GCN model. (a) Illustration of the network architecture for point cloud segmentation. Our model consists of several GridConv layers, and each can be used in either a downsampling or an upsampling process. A GridConv layer includes two stages: (b) for the data structuring stage, a Coverage-Aware Grid Query (CAGQ) module achieves efficient data structuring and provides point groups for efficient computation; (c) for the convolution stage, a Grid Context Aggregation (GCA) module conducts graph convolution on the point groups by aggregating local context.

Point cloud data is popular in applications such as autonomous driving, robotics, and unmanned aerial vehicles. Currently, LiDAR sensors can generate millions of points per second, providing dense real-time representations of the world. Many approaches are used for point cloud data processing. Volumetric models are a family of models that transfer the point cloud to spatially quantized voxel grids and use volumetric convolution to perform computation in the grid space [114, 196]. Using grids as the data structuring method, volumetric approaches associate points to locations in grids, and 3D convolutional kernels gather information from neighboring voxels. Although grid data structures are efficient, high voxel resolution is required to preserve the granularity of the data locations. Since computation and memory usage grow cubically with the voxel resolution, it is costly to process large point clouds. In addition, since approximately 90% of the voxels are empty for most point clouds [236], significant computation power may be consumed processing no information.

Another family of models for point cloud data processing is point-based models. In contrast to volumetric models, point-based models enable efficient computation but suffer from inefficient data structuring. For example, PointNet [136] consumes the point cloud directly without quantization and aggregates the information at the last stage of the network, so the accurate data locations are kept intact but the computation cost grows linearly with the number of points. Later studies [138, 203, 184, 173, 199] apply a downsampling strategy at each layer to aggregate information into point group centers, thereby extracting fewer representative points layer by layer (Figure 2.1(a)). More recently, graph convolutional networks (GCNs) [155, 176, 93, 226] were proposed to build a local graph for each point group in a network layer, which can be seen as an extension of the PointNet++ architecture [138]. However, this architecture incurs high data structuring costs (e.g., FPS and k-NN). Liu et al. [109] show that the data structuring cost in three popular point-based models [97, 203, 184] is up to 88% of the overall computational cost. In this chapter, we also examine this issue by showing the trends of data structuring overhead in terms of scalability.

This chapter introduces Grid-GCN, which blends the advantages of volumetric models and point-based models to achieve efficient data structuring and efficient computation at the same time. As illustrated in Figure 2.1, our model consists of several GridConv layers to process the point data. Each layer includes two stages: a data structuring stage that samples the representative centers and queries neighboring points, and a convolution stage that builds a local graph on each point group and aggregates the information to the center.

To achieve efficient data structuring, we design a Coverage-Aware Grid Query (CAGQ) module, which 1) accelerates the center sampling and neighbor querying, and 2) provides more complete coverage of the point cloud to the learning process. The data structuring efficiency is achieved through voxelization, and the computational efficiency is obtained by performing computation only on occupied areas. We demonstrate CAGQ's outstanding speed and space coverage in Section 2.4. To exploit the point relationships, we also describe a novel graph convolution module, named Grid Context Aggregation (GCA). The module performs grid context pooling to extract context features of the grid neighborhood, which benefits the edge relation computation without adding extra overhead.

We demonstrate the Grid-GCN model on two tasks: point cloud classification and segmentation. Specifically, we perform the classification task on ModelNet40 and ModelNet10 [191] and achieve a state-of-the-art overall accuracy of 93.1% (no voting) while being on average 5× faster than other models.
We also perform the segmentation tasks on the ScanNet [38] and S3DIS [5] datasets and achieve a 10× speed-up on average over other models. Notably, our model demonstrates its capability for real-time, large-scale point-based learning by processing 81,920 points in a scene within 20 ms (see Section 2.6.4).

2.2 Related Work

2.2.1 Voxel-based Methods for 3D Learning

To extend the success of convolutional neural network models [63, 68] on 2D images, VoxNet and its variants [114, 191, 174, 12, 36] transfer point clouds or depth maps to occupancy grids and apply volumetric convolution. To address the problem of cubically increasing memory usage, OctNet [142] constructs tree structures over the occupied voxels to avoid computation in empty space. Although efficient in data structuring, the drawbacks of the volumetric approach are its low computational efficiency and the loss of data granularity.

2.2.2 Point-based Methods for Point Cloud Learning

Point-based models were first proposed by [136, 138], which pursue permutation invariance by using pooling to aggregate point features. Approaches such as kernel correlation [6, 190] and extended convolutions [167] have been proposed to better capture local features. To resolve the ordering ambiguity, PointCNN [97] predicts the local point order, and RSNet [70] sequentially consumes points from different directions. The computation cost of point-based methods grows linearly with the number of input points, but the cost of data structuring has become the performance bottleneck on large-scale point clouds.

2.2.3 Data Structuring Strategies for Point-based Representation

Most point-based methods [138, 97, 173, 107] use FPS [48] to sample evenly spread group centers. FPS picks the point that maximizes the distance to the already selected points; if the number of centers is not very small, the method takes O(N²) computation (a sketch of this baseline is given at the end of this subsection), while an approximate algorithm [47] can be O(N log N). Random Point Sampling (RPS) has the smallest possible overhead, but it is sensitive to density imbalance. Our CAGQ module has the same complexity as RPS, but it performs the sampling and neighbor querying in one shot, which is even faster than RPS with Ball Query or k-NN (see Table 2.3). KPConv [167] uses grid sub-sampling to pick points in occupied voxels; unlike our CAGQ, this strategy cannot query points in the voxel neighbors. CAGQ also has a Coverage-Aware Sampling (CAS) algorithm that optimizes the center selection and can achieve better coverage than FPS.

Alternatively, SO-Net [94] builds a self-organizing map, KD-Net [79] uses a KD-tree to partition the space, PATs [210] use Gumbel Subset Sampling to replace FPS, and SPG [87] uses a clustering method to group points as super points. All of these methods are either slow or require structure preprocessing. The lattice projection in SPLATNet [160, 60] preserves more point details than voxel space, but it is slower. Studies such as VoxelNet [236, 90] combine the point-based and volumetric approaches by using PointNet [136] inside each voxel and applying voxel convolution. A concurrent high-speed model, PVCNN [109], uses a similar approach but does not reduce the number of points in each layer progressively. Grid-GCN, in contrast, can down-sample a large number of points through CAGQ and aggregate information by considering node relationships in local graphs.
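For reference, here is a minimal NumPy sketch of the naive O(N·M) farthest point sampling discussed above. It is the standard textbook formulation, included only to illustrate the baseline data structuring cost that CAGQ avoids, not code from this thesis.

import numpy as np

def farthest_point_sampling(points, m):
    """Pick m center indices from an (N, 3) point set by greedily choosing the
    point farthest from all previously selected centers (naive O(N*M))."""
    n = len(points)
    selected = np.zeros(m, dtype=int)
    min_dist = np.full(n, np.inf)            # distance to the nearest selected center
    selected[0] = np.random.randint(n)       # arbitrary first center
    for i in range(1, m):
        diff = points - points[selected[i - 1]]
        dist = np.einsum("ij,ij->i", diff, diff)   # squared distances
        min_dist = np.minimum(min_dist, dist)
        selected[i] = int(np.argmax(min_dist))     # farthest from all chosen centers
    return selected

# Example: sample 512 centers from 8192 random points.
pts = np.random.randn(8192, 3).astype(np.float32)
centers = farthest_point_sampling(pts, 512)
print(centers.shape)  # (512,)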
2.2.4 GCN for Point Cloud Learning

Graph convolutional networks have been widely applied to point cloud learning [184, 86, 85]. A local graph is usually built for each point group, and the GCN aggregates point data according to the relations between points. SpecConv [173] blends point features using a graph Fourier transform. Other studies model the edge features between centers and nodes: among them, [203, 107, 85, 184, 226] use geometric relations, while [36, 176] explore semantic relations between the nodes. Beyond those features, our proposed Grid Context Aggregation module considers coverage and extracts context features to compute the semantic relation.

2.3 Grid-GCN

As shown in Figure 2.1, Grid-GCN is built on a set of GridConv layers. Each GridConv layer processes the information of N points and maps them to M points. The downsampling GridConv (N > M) is repeated several times until a final feature representation is learned. This representation can be used directly for tasks such as classification, or further up-sampled by upsampling GridConv layers (N < M) in segmentation tasks. GridConv consists of two modules:

1. A Coverage-Aware Grid Query (CAGQ) module that samples M point groups from the N points. Each group includes K node points and a group center. In the upsampling process, CAGQ takes centers directly through long-range connections and only queries node points for these centers.

2. A Grid Context Aggregation (GCA) module that builds a local graph for each point group and aggregates the information to the group center. The M group centers are passed as data points to the next layer.

We list all the notations in the supplementary for clarity.

Figure 2.2: Illustration of Coverage-Aware Grid Query (CAGQ). Assume we want to sample M = 2 point groups and query K = 5 node points for each group. (a) The input is N points (grey). The voxel id and the number of points are listed for each occupied voxel. (b) We build the voxel-point index and store up to n_v = 3 points (yellow) in each voxel. (c) Comparison of different sampling methods: FPS and RPS prefer the two centers inside the marked voxels. Our RVS could randomly pick any two occupied voxels (e.g., (2,0) and (0,0)) as center voxels. If our CAS is used, voxel (0,2) will replace (0,0). (d) Context points of center voxel (2,1) are the yellow points in its neighborhood (we use 3×3 as an example). CAGQ queries 5 points (yellow points with blue rings) from these context points, then calculates the locations of the group centers.

2.3.1 Coverage-Aware Grid Query (CAGQ)

In this subsection, we discuss the details of the CAGQ module. Given a point cloud, CAGQ aims to structure the point cloud effectively and ease the process of center sampling and neighbor point querying. To perform CAGQ, we first voxelize the input space by setting a voxel size (v_x, v_y, v_z). We then map each point to a voxel index Vid(u, v, w) = floor(x / v_x, y / v_y, z / v_z). Here we only store up to n_v points in each voxel.

Let O_v denote the set of all non-empty voxels. We then sample M center voxels O_c ⊆ O_v. For each center voxel v_i, we define its voxel neighbors π(v_i) as the voxels within the neighborhood of the center voxel. In Figure 2.2(d), π(v_(2,1)) is the 3×3 block of voxels inside the red box. We call the stored points inside π(v_i) the context points. Since we build the point-voxel index in the previous step, CAGQ can quickly retrieve the context points of each v_i. After that, CAGQ picks K node points from the context points of each v_i and calculates the barycenter of the node points in each group as the location of the group center. This entire process is shown in Figure 2.2.
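The sketch below mirrors the data structuring steps just described (and steps 1 and 3 of Algorithm 1 later in this section): build a voxel-point index that keeps at most n_v points per occupied voxel, gather context points from a center voxel's neighborhood, and compute the group center from the sampled node points. It is a simplified Python/NumPy illustration with randomly chosen center voxels (RVS), not the optimized implementation used by Grid-GCN.

import numpy as np
from collections import defaultdict

def build_voxel_index(points, voxel_size, n_v):
    """Map each point to a voxel id and keep at most n_v point indices per voxel."""
    vids = np.floor(points / voxel_size).astype(int)
    index = defaultdict(list)
    for i, vid in enumerate(map(tuple, vids)):
        if len(index[vid]) < n_v:
            index[vid].append(i)
    return index  # the occupied voxels O_v are the keys

def query_groups(points, index, m, k, neighborhood=1, seed=0):
    """RVS center-voxel sampling + context-point gathering + group centers."""
    rng = np.random.default_rng(seed)
    occupied = list(index.keys())
    center_voxels = rng.choice(len(occupied), size=m, replace=False)
    groups = []
    for ci in center_voxels:
        u, v, w = occupied[ci]
        # Context points: stored points in the (2*neighborhood+1)^3 voxel neighborhood.
        context = [i for du in range(-neighborhood, neighborhood + 1)
                     for dv in range(-neighborhood, neighborhood + 1)
                     for dw in range(-neighborhood, neighborhood + 1)
                     for i in index.get((u + du, v + dv, w + dw), [])]
        node_idx = rng.choice(context, size=min(k, len(context)), replace=False)
        # Grid-GCN uses a coverage-weighted mean; a plain barycenter is used here
        # for brevity.
        center = points[node_idx].mean(axis=0)
        groups.append((center, node_idx))
    return groups

# Hypothetical usage.
pts = np.random.rand(8192, 3)
idx = build_voxel_index(pts, voxel_size=0.05, n_v=8)
groups = query_groups(pts, idx, m=256, k=16)
print(len(groups), groups[0][0].shape)  # 256 (3,)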
Two problems remain to be solved here: (1) how do we sample the center voxels O_c ⊆ O_v, and (2) how do we pick K nodes from the context points in π(v_i)?

To solve the first problem, we propose our center voxel sampling framework, which includes two methods:
1. Random Voxel Sampling (RVS): Each occupied voxel has the same probability of being picked. The group centers calculated inside these center voxels are more evenly distributed than centers picked on input points by RPS. We discuss the details in Section 2.4.
2. Coverage-Aware Sampling (CAS): Each selected center voxel can cover up to λ occupied voxel neighbors. The goal of CAS is to select a set of center voxels O_c such that they cover the most occupied space. Seeking the optimal solution to this problem requires iterating over all combinations of selections. Therefore, we employ a greedy algorithm to approach the optimal solution: we first randomly pick M voxels from O_v as incumbents; from all of the unpicked voxels, we iteratively select one to challenge a random incumbent each time. If adding this challenger (and in the meantime removing the incumbent) gives us better coverage, we replace the incumbent with the challenger. For a challenger v_C and an incumbent v_I, the heuristics are calculated as:

\delta(x) = \begin{cases} 1, & \text{if } x = 0 \\ 0, & \text{otherwise} \end{cases}   (2.1)

H_{add} = \sum_{V \in \pi(V_C)} \delta(C_V) - \beta \frac{C_V}{\lambda}   (2.2)

H_{rmv} = \sum_{V \in \pi(V_I)} \delta(C_V - 1)   (2.3)

where λ is the number of neighbors of a voxel and C_V is the number of incumbents covering voxel V. H_add represents the coverage gain of adding V_C (penalized by a term of over-coverage), and H_rmv represents the coverage loss after removing V_I. If H_add > H_rmv, we replace the incumbent with the challenger voxel. If we set β to 0, each replacement is guaranteed to improve the spatial coverage. Comparisons of these methods are further discussed in Section 2.4.

Node point querying. CAGQ also provides two strategies to pick K node points from the context points in π(v_i):
1. Cube Query: We randomly select K points from the context points. Compared to the Ball Query used in PointNet++ [138], Cube Query can cover more space when point density is imbalanced. In the scenario of Figure 2.2, Ball Query samples K points from all raw points (grey) and may never sample any node point from voxel (2,1), which only has 3 raw points.
2. K-Nearest Neighbors: Unlike traditional k-NN, where the search space is all points, k-NN in CAGQ only needs to search among the context points, making the query substantially faster (we also provide an optimized method in the supplementary materials). We will compare these methods in the next section.

2.3.2 Algorithms of CAGQ
The general procedure of CAGQ is listed as Algorithm 1. The CAGQ k-NN algorithm is listed as Algorithm 2. To use RVS or CAS, we can embed the chosen algorithm into step 2 of Algorithm 1; Cube Query and k-NN can be embedded into step 3. The k-NN method in Algorithm 2 is efficient in three aspects:
• Instead of all points, the candidates of k-NN are only the points in the neighborhood.
• We collect the K nearest neighbor points first from the inner voxel layers, then the outer voxel layers. We can stop at a layer if we have already got K points, since all the points in the outer voxel layers are farther away than the points collected so far.
• In each layer of voxel neighbors, if the number of points collected so far plus the points in this layer is less than K, we do not need to sort them but can directly include all points in this layer.
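Before the full listings in Algorithms 1 and 2, the following is a minimal sketch of the CAS challenge-and-replace loop built directly on the heuristics in Eqs. (2.1)-(2.3). The data layout (a precomputed neighbor list per occupied voxel), the single pass over challengers, and the way λ is obtained are simplifications of ours, not the exact implementation.

import random

def coverage_aware_sampling(occupied, neighbors, M, beta=0.0, seed=0):
    """Greedy CAS over occupied voxels.

    occupied:  list of occupied voxel ids.
    neighbors: dict voxel id -> list of occupied voxel ids in its neighborhood pi(v).
    Returns M center voxel ids that approximately maximize occupied-space coverage.
    """
    rng = random.Random(seed)
    incumbents = rng.sample(occupied, M)
    covered = {}                                   # C_V: how many incumbents cover voxel V
    for v in incumbents:
        for nb in neighbors[v]:
            covered[nb] = covered.get(nb, 0) + 1

    lam = max(len(nbs) for nbs in neighbors.values())       # lambda: neighborhood size
    challengers = [v for v in occupied if v not in set(incumbents)]
    for v_c in challengers:                        # each unpicked voxel challenges once
        v_i = rng.choice(incumbents)
        h_add = 0.0                                # Eq. (2.2): gain of adding v_c
        for nb in neighbors[v_c]:
            c_v = covered.get(nb, 0)
            h_add += (1.0 if c_v == 0 else 0.0) - beta * c_v / lam
        # Eq. (2.3): voxels that would lose their only cover if v_i were removed
        h_rmv = sum(1.0 for nb in neighbors[v_i] if covered.get(nb, 0) == 1)
        if h_add > h_rmv:                          # replace incumbent with challenger
            for nb in neighbors[v_i]:
                covered[nb] -= 1
            for nb in neighbors[v_c]:
                covered[nb] = covered.get(nb, 0) + 1
            incumbents[incumbents.index(v_i)] = v_c
    return incumbents

With beta = 0, the replacement test reduces to comparing newly covered voxels against voxels that would lose their only cover, so every accepted replacement can only improve coverage, matching the guarantee stated above.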
Algorithm 1: CAGQ general procedure
Input: N points p_i(χ_i, w_i), i ∈ (1, ..., N)
Parameters: N, M, K, λ, n_v
1. Build the voxel-point index Vid and collect O_v:
   For each p_i:
     (u, v, w) ← quantize p_i(x, y, z) into a voxel index
     If voxel (u, v, w) is visited for the first time: add (u, v, w) into O_v
     If Vid(u, v, w) stores fewer than n_v points: push the point index into Vid(u, v, w)
2. Center voxel sampling:
   O_c ← select M voxels from O_v, by RVS or CAS
3. Query node points and calculate group centers:
   For each center voxel v_j in O_c:
     Retrieve the points in π(v_j) by using the indices
     Pick node points {p_j1, p_j2, ..., p_jK} in the neighborhood by using Cube Query or k-NN
     w_cj ← Σ_{k=1}^{K} w_jk
     χ_cj ← weighted mean of {χ_j1, ..., χ_jK}
Return: M point groups group_j: (c_j(χ_cj, w_cj), {p_j1, p_j2, ..., p_jK}), j ∈ (1, ..., M)

Algorithm 2: CAGQ k-NN for one point group in step 3 of Algorithm 1
Input: a center voxel v_j, the voxel-point index Vid, O_v
Parameters: K
Counter ← 0; node_points ← {}
For each level_i of π(v_j) (level 0 is the center voxel v_j itself, level 1 is the surrounding 26 voxels, etc.):
  kl ← 0; LayerPoints ← {}
  For each voxel v_l in level_i:
    If v_l ∈ O_v:
      stored_points ← retrieve points from Vid(v_l)
      LayerPoints ← LayerPoints ∪ stored_points
      kl += |stored_points|
  topkl ← min(K − Counter, kl)
  If topkl = kl: node_points ← node_points ∪ LayerPoints
  Else: node_points ← node_points ∪ topK(LayerPoints, topkl)
  Counter += topkl
  If Counter ≥ K: break
Return: node_points

2.3.3 Comparison between CAGQ and Naive Grid Query
In 2D image learning, a convolution kernel usually traverses with a stride size that is smaller than the kernel size, leading to overlapping receptive fields. Since a naive Grid Query first voxelizes the space, randomly picks M voxels, and then samples K points only within each voxel, the queried point groups have no spatial overlap. In contrast, CAGQ queries points inside the voxel neighbors and utilizes Coverage-Aware Sampling to make the center voxels more evenly distributed. To show the advantage of CAGQ over the naive Grid Query, we compare three models on ScanNet [38] and report the results in Table 2.1. The two models using CAGQ are also the models reported in Section 2.6.2. We also train a model using the naive Grid Query. As a result of the non-overlapping coverage of its point groups, the overall accuracy of the model with the naive Grid Query can hardly reach 80%.

                                OA       Latency (ms)
Grid-GCN (Grid Query + 1xK)     79.9%    15.9
Grid-GCN (CAGQ + 0.5xK)         83.9%    16.6
Grid-GCN (CAGQ + 1xK)           85.4%    20.8
Table 2.1: The overall accuracy and latency of three Grid-GCN models on ScanNet [38]. Our full model uses CAGQ with 1xK node points in each group. A compact model with 0.5xK is also reported. The third model uses a naive Grid Query with 1xK node points.

2.3.4 Grid Context Aggregation
For each point group provided by CAGQ, we use a Grid Context Aggregation (GCA) module to aggregate features from the node points to the group center. We first construct a local graph G(V, E), where V consists of the group center and the K node points provided by CAGQ. We then connect each node point to the group center. GCA projects a node point's features f_i to f̃_i.

Figure 2.3: Different strategies to compute the contribution f̃_{c,i} from a node n_i to its center c. f_i and χ_i are the feature map and the location of n_i; e_i is the edge feature between n_i and c calculated by the edge attention function. (a) PointNet++ [138] ignores e_i. (b) computes e_i based on the low-dimensional geometric relation between n_i and c.
(c) also considers the semantic relation between the center and the node point, but c has to be sampled on one of the points from the previous layer. (d) Grid-GCN's geometric relation also includes the coverage weight, and it pools a context feature f_cxt from all stored neighbors to provide a semantic reference for computing e_i.

Based on the edge relation between the node and the center, GCA calculates the contribution of f̃_i and aggregates all these features as the feature of the center, f̃_c. Formally, the GCA module can be described as

\tilde{f}_{c,i} = e(\chi_i, f_i) \cdot \mathcal{M}(f_i)   (2.4)

\tilde{f}_c = \mathcal{A}(\{\tilde{f}_{c,i}\},\ i \in 1, \ldots, K)   (2.5)

where f̃_{c,i} is the contribution from a node and χ_i is the XYZ location of the node. M is a multi-layer perceptron (MLP), e is the edge attention function, and A is the aggregation function. The edge attention function e has been explored by many previous studies [203, 36, 176]. In this work, we design a new edge attention function with the following improvements to better fit our network architecture (Figure 2.3):

Grid Context Pooling. Semantic relation is another important aspect when calculating edge attention. In previous works [36, 176], the semantic relation is encoded by using the group center's features f_c and a node point's features f_i, which requires the group center to be selected from the node points.

Figure 2.4: The red point is the group center. Yellow points are its node points. Black points are node points of the yellow points in the previous layer. The coverage weight is an important feature as it encodes the number of black points that have been aggregated to each yellow point.

In CAGQ, since a group center is calculated as the barycenter of the node points, we propose Grid context pooling, which extracts a context feature f_cxt by pooling over all context points; these points sufficiently cover the entire grid space of the local graph. Grid context pooling brings the following benefits:
• f_cxt models the features of a virtual group center, which allows us to calculate the semantic relation between the center and its node points.
• Even when the group center is picked on a physical point, f_cxt is still a useful feature representation as it covers more points in the neighborhood, instead of only the points in the graph.
• Since we have already associated the context points with their center voxel in CAGQ, there is no extra point query overhead. f_cxt is shared across all edge computations in a local graph, and the pooling is a lightweight operation requiring no learnable weights, which introduces little computational overhead.
The GCA module is summarized in Figure 2.3d, and the edge attention function can be modeled as

e = mlp\big(mlp_{geo}(\chi_c, \chi_i, w_i),\ mlp_{sem}(f_{cxt}, f_i)\big)   (2.6)

Coverage Weight. Previous studies [203, 107, 85, 184, 226] use χ_c of the center and χ_i of a node to model edge attention as a function of the geometric relation (Figure 2.3b). However, this formulation ignores the underlying contribution of each node point from previous layers. Intuitively, node points carrying more information from previous layers should be given more attention. We illustrate this scenario in Figure 2.4. With that in mind, we introduce the concept of coverage weight, defined as the number of points that have been aggregated to a node in previous layers. This value can be easily computed in CAGQ, and we argue that coverage weight is an important feature in calculating edge attention (see our ablation studies in Table 2.8).

In a point group, we calculate w_c as the sum of its node points' coverage weights, and χ_c as the barycenter of these nodes weighted by their coverage weights:

w_c = \sum_{j=1}^{K} w_j   (2.7)

\chi_c(x, y, z) = \frac{\sum_{j=1}^{K} w_j \, \chi_j(x, y, z)}{\sum_{j=1}^{K} w_j}   (2.8)
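To tie Eqs. (2.4)-(2.8) together, the following is a compact PyTorch-style sketch of one GCA step. The layer sizes, the use of max-pooling both for Grid context pooling and for the aggregation function A, and the tensor layout are our own illustrative assumptions rather than the exact architecture.

import torch
import torch.nn as nn

def mlp(dims):
    layers = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        layers += [nn.Linear(d_in, d_out), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class GridContextAggregation(nn.Module):
    """One GCA step for a batch of point groups (Eqs. 2.4-2.8).

    Shapes: node_xyz (B, M, K, 3), node_feat (B, M, K, C), node_w (B, M, K, 1),
            ctx_feat (B, M, P, C) holds the features of all context points of a group.
    """
    def __init__(self, c_in, c_out):
        super().__init__()
        self.mlp_geo = mlp([3 + 3 + 1, c_out])      # (chi_c, chi_i, w_i) -> geometric relation
        self.mlp_sem = mlp([2 * c_in, c_out])       # (f_cxt, f_i)        -> semantic relation
        self.mlp_edge = mlp([2 * c_out, c_out])     # e in Eq. (2.6)
        self.mlp_feat = mlp([c_in, c_out])          # M(f_i) in Eq. (2.4)

    def forward(self, node_xyz, node_feat, node_w, ctx_feat):
        # Eqs. (2.7)-(2.8): coverage weight and weighted barycenter of the group
        w_c = node_w.sum(dim=2, keepdim=True)                          # (B, M, 1, 1)
        chi_c = (node_w * node_xyz).sum(dim=2, keepdim=True) / w_c     # (B, M, 1, 3)
        # Grid context pooling: parameter-free pooling over all context points
        f_cxt = ctx_feat.max(dim=2, keepdim=True).values               # (B, M, 1, C)

        K = node_xyz.shape[2]
        geo = self.mlp_geo(torch.cat(
            [chi_c.expand(-1, -1, K, -1), node_xyz, node_w], dim=-1))
        sem = self.mlp_sem(torch.cat(
            [f_cxt.expand(-1, -1, K, -1), node_feat], dim=-1))
        e = self.mlp_edge(torch.cat([geo, sem], dim=-1))               # Eq. (2.6)

        contrib = e * self.mlp_feat(node_feat)                         # Eq. (2.4)
        # Eq. (2.5) with A chosen as channel-wise max over the K nodes
        return contrib.max(dim=2).values, chi_c.squeeze(2), w_c.squeeze(2)

The returned centers χ_c and coverage weights w_c become the point locations and weights consumed by the next GridConv layer, which is how the coverage information illustrated in Figure 2.4 propagates through the network.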
(a) Random Point Sampling, (b) Farthest Point Sampling, (c) Coverage-Aware Sampling.
Figure 2.5: Visualization of the sampled group centers and the queried node points by RPS, FPS, and CAS. The blue and green balls indicate Ball Query; the red squares indicate Cube Query. The ball and cube have the same volume. RPS covers 45.6% of the occupied space, while FPS covers 65% and CAS covers 75.2%.

2.4 Complexity Analysis
To analyze the benefit of CAGQ, we test the occupied-space coverage and the latency of different sampling/querying methods under different conditions on ModelNet40 [191]. Center sampling methods include Random Point Sampling (RPS), Farthest Point Sampling (FPS), our Random Voxel Sampling (RVS), and our Coverage-Aware Sampling (CAS). Neighbor querying methods include Ball Query, Cube Query, and K-Nearest Neighbors. The conditions include different numbers of input points, node numbers in a point group, and numbers of point groups, denoted by N, K, and M. We summarize the qualitative and quantitative evaluation results in Table 2.3 and Figure 2.5. The reported occupied-space coverage is calculated as the ratio between the number of voxels occupied by node points of all groups and the number of voxels occupied by the original N points. Results under more conditions are presented in the supplementary.

2.4.1 Space Coverage
In Figure 2.5a, the centers sampled by RPS are concentrated in the areas with higher point density, leaving most space uncovered. In Figure 2.5b, FPS picks points that are far away from each other, mostly on the edges of the 3D shape, which causes gaps between the centers. In Figure 2.5c, our CAS optimizes the voxel selection and covers 75.2% of the occupied space. Table 2.3 lists the percentage of space coverage by RPS, FPS, RVS, and CAS. CAS leads the space coverage in all cases (30% more than RPS). FPS has no advantage over RVS when K is small. The factors that benefit CAGQ in space coverage can be summarized as follows:
• Instead of sampling centers from N points, RVS samples center voxels from the occupied space; therefore it is more resilient to point density imbalance (Figure 2.5).
• CAS further optimizes the result of RVS by conducting a greedy candidate replacement. Each replacement is guaranteed to result in better coverage.
• CAGQ stores the same number of points in each occupied voxel. The context points are more evenly distributed, and so are the K node points picked from the context points. Consequently, the strategy reduces the coverage loss caused by density imbalance in a local area.

2.4.2 Time Complexity
We treat the number of voxel neighbors λ as a constant. In addition, the center sampling methods are only used during downsampling GridConv.
1. The time complexity of center sampling methods:
• RPS: Random sampling methods such as [3] give RPS a complexity of O(min(N, M log M)). In practice, M log M has the same or smaller magnitude than N, so we report O(N) in the time complexity table.
• FPS: FPS on a finite point set is O(N^2) when M is not extremely small. However, [47] uses a Voronoi diagram to find the area where the point should exist, then finds the nearest point in the calculated area. As an approximate algorithm, it is O(N log N).
• RVS: To sample point groups, CAGQ first scans over all points and builds indices, which takes O(N). RVS then randomly samples M center voxels from at most N occupied voxels (the number of occupied voxels is at most the number of raw points), which takes O(min(N, M log M)). Under the same assumption as for RPS, we report O(N) in the time complexity table.
• CAS: If choosing CAS, we need to check all the unpicked occupied voxels and challenge the incumbents. To calculate a pair of H_add and H_rmv, CAGQ checks the voxel neighbors of a challenger and an incumbent, resulting in O(λN) = O(N) for all extra operations. Therefore CAS still has a complexity of O(N).
2. The time complexity of node querying methods:
• Ball Query: For each center, Ball Query needs to run over N points to collect in-range points, then sample K points. Therefore it needs O(MN).
• k-NN: For each center, vanilla k-NN picks the K nearest points from N points. The partition-based top-K method takes O(N) computation, so each center costs O(N). k-NN can also first check whether a point is within a range, then query the top-K candidate points. These two methods have the same worst-case complexity. The overall complexity is O(MN).
• Cube Query: CAGQ's Cube Query randomly picks K points from the λn_v context points. Since the order of points in each neighborhood is already randomized during the GPU's multithreaded collection, the overall complexity is O(MK).
• CAGQ's k-NN: CAGQ picks the K nearest points among the points in the neighborhood. The partition-based top-K algorithm provides an O(λn_v) solution for each point group. If λ is treated as a constant, the overall complexity is O(Mn_v).
We summarize the time complexity of the different methods in Table 2.2. Table 2.3 shows the empirical latency results. We see that our CAS is much faster than FPS and achieves a 50x speed-up. CAS + Cube Query can even outperform RPS + Ball Query when the input point cloud is large, due to the higher neighborhood query speed. Because of its better time complexity, RVS + k-NN leads the performance under all conditions and achieves a 6x speed-up over FPS + k-NN.

Sample centers:   RPS: O(N)        FPS [48]: O(N log N)     RVS*: O(N)       CAS*: O(N)
Query nodes:      Ball Query: O(MN)    Cube Query*: O(MK)    k-NN [37]: O(MN)    CAGQ k-NN*: O(Mn_v)
Table 2.2: Time complexity. We sample M centers from N points and query K neighbors per center. We limit the maximum number of points in each voxel to n_v. In practice, K < N, and n_v is usually of the same magnitude as K. The approximate FPS algorithm can be O(N log N) [47]. * indicates our methods. See the supplementary for derivation details.

2.5 Training Details
For all experiments on ModelNet10, ModelNet40 [191], ScanNet [38], and S3DIS [5], we use the Adam [78] optimizer with beta1 = 0.9 and beta2 = 0.999. All models use batch normalization [73] with no momentum decay and are trained on a single RTX 2080 GPU. For ModelNet10 and ModelNet40 [191], we start with a learning rate of 0.001, reduce the learning rate by a factor of 0.7 every 60 epochs, and stop at 330 epochs. We don't apply weight
We don’t apply weight 27 Center sampling RPS FPS RVS* CAS* RPS FPS RVS* CAS* RPS FPS RVS* CAS* Neighbor querying Ball Ball Cube Cube Ball Ball Cube Cube k-NN k-NN k-NN k-NN N K M Occupied space coverage(%) Latency (ms) with batch size = 1 1024 8 8 12.3 12.9 13.1 14.9 0.29 0.50 0.51 0.64 0.84 0.85 0.51 0.65 32 8 22.9 21.4 22.4 31.7 0.34 0.51 0.51 0.69 2.12 1.96 0.63 0.71 128 8 22.3 22.6 23.5 34.2 0.34 0.51 0.94 1.04 8.26 6.70 1.41 1.63 8 32 34.4 43.7 40.0 45.6 0.31 0.53 0.51 0.65 1.31 1.36 0.57 0.69 32 32 58.2 69.48 60.1 73.0 0.36 0.55 0.53 0.57 4.68 4.72 0.93 0.68 128 32 60.0 70.1 61.3 74.7 0.37 0.53 0.96 1.08 22.23 21.08 2.24 2.58 8 128 64.0 72.5 82.3 85.6 0.32 0.78 0.44 0.58 1.47 1.74 0.52 0.61 32 128 92.7 98.9 95.3 99.6 0.38 0.81 0.50 0.62 5.34 5.66 1.18 1.39 128 128 93.6 99.5 95.8 99.7 0.38 0.69 0.97 0.97 32.48 32.54 6.85 6.94 8192 8 64 19.2 22.9 22.1 25.1 0.64 1.16 0.66 0.82 1.58 1.80 0.65 0.76 32 64 42.7 42.7 35.8 46.3 1.47 1.47 1.15 1.39 2.73 2.73 1.72 2.16 128 64 40.6 47.1 38.6 51.3 1.14 1.55 1.18 1.38 13.70 12.72 9.42 11.71 8 256 60.1 64.1 73.3 94.3 0.40 1.51 0.53 0.61 4.53 5.54 0.54 0.68 32 256 75.4 88.4 77.6 90.7 1.11 2.19 1.04 1.29 5.13 5.94 3.06 3.52 128 256 79.9 90.7 80.0 93.5 1.19 1.19 1.17 1.31 21.5 21.5 15.19 17.38 8 1024 82.9 96.8 92.4 94.4 0.81 4.90 0.54 0.77 1.53 5.36 0.93 0.97 32 1024 96.3 97.8 99.3 99.9 1.15 5.09 1.10 1.54 5.18 8.99 4.92 6.32 128 1024 98.8 99.9 99.5 100.0 1.22 5.25 1.40 1.76 111.42 111.74 24.18 26.45 81920 8 256 21.7 25.7 26.2 31.2 3.46 10.54 2.20 2.87 9.77 15.97 1.85 2.15 32 256 34.2 40.1 36.0 48.5 7.59 14.51 3.15 4.34 20.43 26.14 5.95 6.17 128 256 36.6 42.6 37.4 51.1 9.41 15.91 3.52 4.19 77.68 78.34 34.04 40.04 8 1024 50.7 63.8 67.4 76.0 4.73 30.79 2.13 2.34 10.01 35.18 1.84 2.02 32 1024 70.6 86.3 78.3 91.6 8.30 33.52 3.34 3.88 19.49 43.69 8.76 10.05 128 1024 72.7 88.2 79.1 92.6 9.68 34.72 4.32 4.71 71.99 93.02 50.70 51.94 8 10240 98.8 99.2 100.0 100.0 8.82 255.9 4.11 8.23 19.96 268.22 9.54 14.88 32 10240 98.8 99.2 100.0 100.0 8.93 260.48 4.22 9.35 20.38 272.48 9.65 17.44 128 10240 99.7 100.0 100.0 100.0 10.73 258.49 5.83 11.72 234.19 442.87 69.02 83.32 Table 2.3: Performance comparisons of data structuring methods, run on ModelNet40[191]. Cen- ter sampling methods include RPS, FPS, CAGQ’s RVS, and CAS. Neighbor querying methods include Ball Query, Cube query, and K-Nearest Neighbors. Condition variables include N points, M groups, and K neighbors per group. Occupied space coverage = num. of occupied voxels of queried points / num. of occupied voxels of the original N points. 28 decay. The network has 2 downsampling GridConv layers each has 1024 and 128 point groups and a final global GridConv layer to group all points as one graph. For ScanNet[38], we start with a learning rate of 0.001 and reduce the learning rate by a factor of 0.7 every 150 epochs, and stop at 1500 epochs. Please note during each epoch, we only sample one block on the fly in each region. Therefore the training of one epoch is very quick. The network has 3 downsampling GridConv layers each having 1024, 256, and 24 point groups, and 3 upsampling GridConv layers. We set our weight decay as 10 5 during training. For S3DIS[5], we start with a learning rate of 0.001 and reduce the learning rate by a factor of 0.8 every 40 epochs and stop at 200 epochs. The network has 3 downsampling GridConv layers each having 512, 256, and 24 point groups, and 3 upsampling GridConv layers. We set our weight decay as 10 8 during training. 
2.6 Experiments We evaluate Grid-GCN on multiple datasets: ModelNet10 and ModelNet40[191] for object clas- sification, ScanNet[38] and S3DIS[5] for semantic segmentation. Following the convention of PVCNN [109], we report latency and performance in each level of accuracy. We collect the result of other models either from published papers or the authors. All the latency results are reported under the corresponding batch size and number of input points. All experiments are conducted on a single RTX 2080 GPU. Training details are listed in the supplementary. 29 ModelNet40 ModelNet10 latency (ms) Input (XYZ as default) OA mAcc OA mAcc OA6 91:5 PointNet[136] 161024 89.2 86.2 - - 15.0 SCNet[195] 161024 90.0 87.6 - - SpiderCNN[203] 8 1024 90.5 - - - 85.0 O-CNN[178] octree 90.6 - - - 90.0 SO-net[94] 8 2048 90.8 87.3 94.1 93.9 - Grid-GCN 1 161024 91.5 88.6 93.4 92.1 15.9 OA6 92:0 3DmFVNet[7] 161024 91.6 - 95.2 - 39.0 PAT[210] 8 1024 91.7 - - 88.6 Kd-net[79] kd-tree 91.8 88.5 94.0 93.5 - PointNet++[138] 161024 91.9 90.7 - 26.8 Grid-GCN 2 161024 92.0 89.7 95.8 95.3 21.8 OA> 92:0 DGCNN[184] 161024 92.2 90.2 - 89.7 PCNN[6] 161024 92.3 - 94.9 - 226.0 Point2Seq[105] 161024 92.6 - - A-CNN[81] 161024 92.6 90.3 95.5 95.3 68.0 KPConv[167] 166500 92.7 - - - 125.0 Grid-GCN 3 161024 92.7 90.6 96.5 95.7 26.2 Grid-GCN full 161024 93.1 91.3 97.5 97.4 42.2 Table 2.4: Results on ModelNet10 and ModelNet40[191]. Our full model achieves state-of-the- art accuracy. With the model reduction, our compact models Grid-GCN 13 also outspeed other models. We discuss their details in the ablation studies. 2.6.1 3D Object Classification Datasets and settings We conduct the classification tasks on the ModelNet10 and ModelNet40 dataset[191]. ModelNet10 is composed of 10 object classes with 3991 training and 908 testing objects. ModelNet40 includes 40 different classes with 9843 training objects and 2468 testing objects. We prepare our data following the convention of PointNet[136], which uses 1024 points with 3 channels of spatial location as input. Several studies use normal [138, 81], octree [178], or KD-tree for input, and [107, 106] use voting for evaluation. 30 Evaluation To compare with different models with different levels of accuracy and speed, we train Grid-GCN with 4 different settings to balance performance and speed (Details are shown in section 2.6.3). The variants are in the number of feature channels and the number of node points in a group in the first layer (see Table 2.8). The results are shown in Table 2.4. We report our results without voting. For all of the four settings, our Grid-GCN model not only achieves state-of-the-art performance on both ModelNet10 and ModelNet40 datasets but has the best speed-accuracy trade- off. Although Grid-GCN uses the CAGQ module for data structuring, it has similar latency as PointNet which has no data structuring step while its accuracy is significantly higher than PointNet. 2.6.2 3D Scene Segmentation Dataset and Settings We evaluate our Grid-GCN on two large-scale point cloud segmentation datasets: ScanNet[38] and Stanford 3D Large-Scale Indoor Spaces (S3DIS) [5]. ScanNet consists of 1513 scanned indoor scenes, and each voxel is annotated in 21 categories. We follow the exper- iment setting in [38] and use 1201 scenes for training, and 312 scenes for testing. Following the routine and evaluation protocol in PointNet++[138], we sample 8192 points during training and 3 spatial channels for each point. S3DIS contains 6 large-scale indoor areas with 271 rooms. 
Each point is labeled with one of 13 categories. Since area 5 is the only area that doesn’t have overlaps with other areas, we follow [165, 97, 109] to train on areas 1-4 and 6, and test on area 5. In each divided section, 4096 points are sampled for training, and we adopt the evaluation method from [97]. 31 Evaluation We report the overall voxel labeling accuracy (OA) and the runtime latency for ScanNet[38]. We trained two versions of the Grid-GCN model, with a full model using 1K node points and a compact model using 0:5K node points. Results are reported in Table 2.5. Since the segmentation tasks generally use more input points than the classification model, our advantage of data structuring becomes outstanding. With the same amount of input points (32768) in a batch, Grid-GCN out-speed PointNet++ 4:5 while maintaining the same level of accuracy. Compared with more sophisticated models such as PointCNN [97] and A-CNN [81], Grid-GCN is 25 and 12 faster, respectively, while achieving the state-of-the-art accuracy. Remarkably, Grid-GCN can run as fast as 50 to 133 FPS with state-of-the-art performance, which is desired in real-time applications. A popular model MinkowskiNet[34] doesn’t report the overall accuracy, therefore we don’t put it in the table. But its GitHub example shows a latency of 103ms on ScanNet. We show the quantitative results on S3DIS in Table 2.6 and visual results in Figure 2.6. Our compact version of Grid-GCN is generally 4 to 14 faster than other models with data struc- turing. Notably, even compared with PointNet which has no data structuring at all, we are still 1:6 faster while achieving 12% performance gain in mIOU. For our full model, we are still the fastest and achieve 2 speed-up over PVCNN++[109], a state-of-the-art study focusing on speed improvement. We report the IOU of each object class in Table 2.7 and visualize more results of S3DIS[5] in Figure 2.7 The segmentation results are generated by our full model. In the visual results, Grid- GCN can predict objects such as chairs and tables very accurately, but sometimes mislabels the points on the border of two planar objects such as a board and a wall. 32 Input (XYZ as default) OA latency (ms) OA< 84:0 PointNet[136] 8 4096 73.9 20.3 OctNet[142] volume 76.6 - PointNet++[138] 8 4096 83.7 72.3 Grid-GCN (0:5K) 4 8192 83.9 16.6 OA> 84:0 SpecGCN[173] - 84.8 - PointCNN[97] 122048 85.1 250.0 Shellnet[229] - 85.2 - Grid-GCN (1K) 4 8192 85.4 20.8 A-CNN[81] 1 8192 85.4 92.0 Grid-GCN (1K) 1 8192 85.4 7.48 Table 2.5: Results on ScanNet[38]. Grid-GCN achieves 10 speed-up on average over other models. Under batch size of 4 and 1, we test our model with 1K neighbor nodes. A compact model with 0:5K is also reported. Input (xyzrgb as default) mIOU OA latency(ms) mIOU< 54:0 PointNet[136] 8 4096 41.09 - 20.9 DGCNN[184] 8 4096 47.94 83.64 178.1 SegCloud[165] - 48.92 - - RSNet[70] 8 4096 51.93 - 111.5 PointNet++[138] 8 4096 52.28 - DeepGCNs[93] 1 4096 52.49 - 45.63 TanConv[163] 8 4096 52.8 85.5 - Grid-GCN (0:5Ch) 8 4096 53.21 85.61 12.9 mIOU> 54:0 3D-UNet[36] 8 96 3 volume 54.93 86.12 574.7 PointCNN[97] - 57.26 85.91 - PVCNN++[109] 8 4096 57.63 86.87 41.1 Grid-GCN (1Ch) 8 4096 57.75 86.94 25.9 Table 2.6: Results on S3DIS[5] area 5. Grid-GCN is on average 8 faster than other models. We halve the output channels of GridConv for Grid-GCN (0:5Ch) . 2.6.3 Ablation Studies In the experiment on ModelNet10 and ModelNet40[191], our full model has 3 GridConv layers. 
As shown in Table 2.8, we conduct reductions on the number of the output feature channels from 33 (a) Ground Truth (b) Ours Figure 2.6: Semantic segmentation results on S3DIS [5] area 5. Method OA mIoU ceiling floor wall beam column window door table chair sofa case board clutter PointNet[136] - 41.09 88.80 97.33 69.80 0.05 3.92 46.26 10.76 58.93 52.61 5.85 40.28 26.38 33.22 SegCloud[165] 48.92 90.06 96.05 69.86 0.00 18.37 38.35 23.12 70.40 75.89 40.88 58.42 12.96 41.60 PointCNN[97] 85.91 57.26 92.31 98.24 79.41 0.00 17.60 22.77 62.09 74.39 80.59 31.67 66.67 62.05 56.74 Grid-GCN 86.94 57.75 94.12 97.28 77.66 0.00 16.61 32.91 58.53 72.15 81.32 36.46 68.74 64.54 50.46 Table 2.7: Segmentation result on S3DIS[5] area 5. We report overall accuracy (OA, %), mean class IoU (mIoU, %) and per-class IoU (%). Grid-GCN achieves the highest overall accuracy and mIoU among the 4 models. 34 K Channels PoolingWeight OA latency Grid-GCN 0 32 (32,64,256) No No 91.1 15.4ms Grid-GCN 1 32 (32,64,256) No Yes 91.5 15.9ms Grid-GCN 2 32 (64,128,256) No Yes 92.0 21.8ms Grid-GCN 3 64 (64,128,256) Yes Yes 92.7 26.2ms Grid-GCN full 64 (128,256,512) Yes Yes 93.1 42.2ms Table 2.8: Ablation studies on ModelNet40[191]. Our models have 3 layers of GridConv. K is the number of node points in the first GridConv. We also change the number of output feature channels from these 3 layers. Grid context pooling (shorted as pooling here) is also removed for Grid-GCN 02 . Grid-GCN 0 also removes coverage weight in edge relation. (a) Input XYZ+RGB (b) Ground truth (c) Ours Figure 2.7: More visual result of S3DIS[5] area 5 35 GridConv layers, the number of nodes K in the first GridConv layer, and whether to use Grid context pooling and coverage weight. On one hand, reducing the number of channels from Grid- GCN full gives Grid-GCN 3 37% speed-up. On the other hand, reducing K and removing Grid context pooling from Grid-GCN 3 doesn’t give Grid-GCN 2 much speed benefit but incurs a loss in accuracy. This demonstrates the efficiency and effectiveness of CAGQ and Grid context pooling. Coverage weight is useful as well because it introduces little overhead in latency but increases the overall accuracy. 2.6.4 Scalability Analysis Num. of points (N) 2048 4096 16384 40960 81920 Num. of clusters (M) 512 1024 2048 4096 8192 PointNet++ 4.7 8.6 19.9 64.6 218.9 Grid-GCN 4.3 4.7 8.1 12.3 19.8 Table 2.9: Inference time (ms) on ScanNet[38] under different scales. We compare Grid-GCN with PoinNet++[138] on different numbers of input points per scene. The batch size is 1.M is the number of point groups on the first network layer. We also test our model’s scalability by gradually increasing the number of input points on ScanNet [38]. We compare our model with PointNet++ [138], one of the most efficient point- based methods. We report the results in Table 2.9. Under the setting of 2048 points, the latency of the two models is similar. However, when increasing the input point from 4096 to 81920, Grid- GCN achieves up to 11 speed-up over PointNet++, which shows the dominating capability of our model in processing large-scale point clouds. 36 2.7 Conclusion In this paper, we propose Grid-GCN for fast and scalable point cloud learning. Grid-GCN achieves efficient data structuring and computation by introducing Coverage-Aware Grid Query (CAGQ). CAGQ drastically reduces data structuring costs through voxelization and provides point groups with complete coverage of the whole point cloud. 
A graph convolution module Grid Context Aggregation (GCA) is also proposed to incorporate the context features and coverage information in the computation. With both modules, Grid-GCN achieves state-of-the-art accuracy and speed on various benchmarks. Grid-GCN, with its superior performance and unparalleled efficiency, can be used in large-scale real-time point cloud processing applications. 37 Chapter 3 3D Object Perception: Point-based Object Detection 3.1 Domain Gap and Shape Miss With high fidelity, the point clouds acquired by LiDAR sensors significantly improved autonomous agents’ ability to understand 3D scenes. Point-based machine perception models achieved state- of-the-art performance on 3D object classification [198], visual odometry [126], and 3D object detection [150]. However, a robust machine perception system requires its detector to reliably handle different environmental conditions, e.g., geographic locations, and weather conditions. While most existing 3D detection methods [236, 25, 28, 29, 46, 82, 88, 92, 95, 100, 101, 119, 133, 135, 150, 151, 152, 153, 204, 207, 208, 212, 213, 235] focused on the performance in a single domain, where training and test data are captured in similar conditions. It is still an open question how to generalize a 3D detector to different domains, where the environment varies significantly. Another challenge for point-based machine perception is occlusion. Despite being widely used in these 3D applications, LiDAR frames are technically 2.5D. After hitting the first object, a laser beam will return and leave the shapes behind the occluder missing from the point cloud. 38 Dataset Rainy frames Avg. number of missing points per frame Avg. number of points per vehicle 3D L1 AP OD Val 0.5 % 23.0K 306.2 56.54 Kirk Dry 0.0 % 25.1K 303.6 55.98 Kirk Val 100.0% 42.8K 222.3 34.74 Table 3.1: The statistics of OD and Kirk. Each frame contains at most 163.8K points. Kirk Dry is formed by frames with dry weather in the Kirk training set. 3.1.1 Domain Gap in Point Cloud In this chapter, we address the domain gap caused by the deteriorating point cloud quality and aim to improve 3D detectors in a practical setting as unsupervised domain adaptation (UDA). We use Waymo Domain Adaptation dataset[161] to analyze the domain gap and introduce semantic point generation (SPG), a general approach to enhance the reliability of LiDAR detectors against domain shift. SPG can improve detection quality in both the target domain and the source domain and can be naturally combined with modern LiDAR-based detectors. Understanding the Domain Gap Waymo Open Dataset (OD) is mainly collected in California and Arizona, and Waymo Kirkland Dataset (Kirk) [161] is collected in Kirkland, Washington. We consider OD as the source domain and Kirk as the target domain. To understand the possible domain gap, we take a PointPillars [88] model trained on the OD training set and compare its 3D vehicle detection performance on OD validation set and those on the Kirk validation set. We observe a drastic performance drop of 21:8 points in 3D average precision (AP) (see Table 3.1). We first confirm that there is no significant difference in object size between the two domains. Then by investigating the metadata in the datasets, we find that only 0:5% of LiDAR frames in OD are collected under rainy weather, but almost all frames in Kirk share the rainy weather attribute. 
To rule out other factors, we extract all dry weather frames in the Kirk training set and form a "Kirk 39 (a) OD RGB Image (b) Kirk RGB Image (c) OD Range Image (d) Kirk Range Image Figure 3.1: Examples of RGB and range image (intensity channel) in OD validation set and Kirk validation set. The dark regions in the range images indicate missed LiDAR returns. The regions of “missing points" are irregular in shape. Dry" dataset. Because the raindrop changes the surface property of objects, there are twice the amount of missing LiDAR points per frame in the Kirk validation set than in OD or Kirk Dry (see Table 3.1). As a result, vehicles in Kirk receive around 27% fewer LiDAR point observations than those in OD (see statistics and more details in the supplemental). In Figure 3.1, we visualize two range images from OD and Kirk, respectively. We can observe that in rainy weather, a significant number of points are missing and the distribution of missing points is more irregular compared to dry weather. To conclude, the major domain gap between OD and Kirk is the deteriorating point cloud qual- ity which is caused by the rainy weather condition. In the target domain, we name this phenomenon the “missing point" problem. Previous Methods to Address the Domain Gap Multiple studies propose to align the features across domains. Most of them focus on 2D tasks [121, 56, 52, 171, 43] or object-level 3D tasks [233, 139]. Applying feature alignment [30, 64] requires model-specific re-design of a detector. 40 Here, our goal is to seek a general solution to benefit recently reported LiDAR-based detectors[88, 150, 236, 151, 61]. Another direction is to apply transformations to the data from one domain to match the data from another domain. A naive approach is to randomly down-sample the point cloud but this not only fails to satisfactorily simulate the pattern of missing points (Figure 3.1d) but also hurts the performance on the source domain. Another approach is to up-sample the point cloud [222, 218, 96] in the target domain, which can increase point density around observed regions. However, those methods have a limited capability in recovering the 3D shape of very partially observed objects. Moreover, up-sampling the entire point cloud will lead to significantly higher latency. A third approach is to leverage style transfer techniques: [237, 131, 33, 64, 148, 66, 143] render point clouds as 2D pseudo images and enforce the renderings from different domains to be resemblant in style. However, these methods introduce an information bottleneck during rasterization [236] and they do not apply to modern point-based 3D detectors [150]. SPG for Closing the Domain Gap The “missing point” problem deteriorates the point cloud quality and reduces the number of point observations, thus undermining the detection performance. To address this issue, we propose Semantic Point Generation (SPG). Our approach aims to learn the semantic information of the point cloud and perform foreground region prediction to identify voxels that are inside foreground objects. Based on the predicted foreground voxels, SPG generates points to recover the foreground regions. Since these points are discriminatively generated at foreground objects, we denote them by semantic points. These semantic points are merged with the original points into an augmented point cloud, which is then fed to a 3D detector. 
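As a preview of how the generated points are consumed, a minimal NumPy sketch of the augmentation step is given below. The 0.5 probability threshold, the cap on the number of kept semantic points, and the extra foreground-confidence channel follow the description given later in Section 3.3; the function and argument names are our own, and the cap of 8,000 points corresponds to the Waymo setting.

import numpy as np

def augment_point_cloud(raw_points, sem_xyz, sem_feat, sem_prob,
                        p_thresh=0.5, max_points=8000):
    """Merge SPG's semantic points with the raw LiDAR points.

    raw_points: (N, 3 + F) original cloud.
    sem_xyz:    (S, 3) generated point locations.
    sem_feat:   (S, F) generated point properties (e.g., intensity).
    sem_prob:   (S,)  predicted foreground probability for each generated point.
    Returns an augmented cloud with a foreground-confidence channel appended.
    """
    keep = sem_prob >= p_thresh                       # probability thresholding
    order = np.argsort(-sem_prob[keep])[:max_points]  # keep the most confident points
    sem = np.concatenate([sem_xyz[keep][order],
                          sem_feat[keep][order],
                          sem_prob[keep][order, None]], axis=1)
    raw = np.concatenate([raw_points,
                          np.ones((raw_points.shape[0], 1))], axis=1)  # confidence 1.0 for raw points
    return np.concatenate([raw, sem], axis=0)

The output of this function is exactly what the unchanged, off-the-shelf detector consumes, which is why SPG can be paired with detectors such as PointPillars or PV-RCNN without modifying them.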
41 3.1.2 Shape Miss in Point Cloud Figure 3.2: figure In a LiDAR scan (a) and (b), locating an object is difficult when its shape is largely missing. We discover three causes of shape miss: external occlusion (red regions in (c)), signal miss (blue regions in (c)), and self-occlusion (green regions in (d)). BtcDet learns the occupancy probability of complete object shapes (e) and achieves state-of-the-art detection performance. (a) The points to recover different shape miss re- gions. (b) The 3D Average Precisions with shape miss recovery. Figure 3.3: The impact of the three types of shape miss. (b) shows PV-RCNN’s [150] car 3D detection APs with different occlusion levels on the KITTI [54] val split. NR means no shape miss recovery. EO, SM, and SO indicate adding car points in the regions of external occlusion, signal miss, and self-occlusion, respectively, as visualized in (a). To locate a severely occluded object (e.g., the car in Figure 3.2(b)), a detector has to recognize the underlying object shapes even when most of its parts are missing. Since shape miss inevitably affects object perception, it is important to answer two questions: 42 • What are the causes of shape miss in point clouds? • What is the impact of shape miss on 3D object detection? Causes of Shape Miss To answer the first question, we study the objects in KITTI [54] and discover three causes of shape miss. External-occlusion. As visualized in Figure 3.2(c), occluders block the laser beams from reaching the red frustums behind them. In this situation, the external occlusion is formed, which causes the shape miss located at the red voxels. Signal miss. As Figure 3.2(c) illustrates, certain materials and reflection angles prevent laser beams from returning to the sensor after hitting some regions of the car (blue voxels). After projected to range view, the affected blue frustums in Figure 3.2(c) appear as the empty pixels in Figure 3.2(a). Self-occlusion. LiDAR data is 2.5D by nature. As shown in Figure 3.2(d), for the same object, its parts on the far side (the green voxels) are occluded by the parts on the near side. The shape miss resulting from self-occlusion inevitably happens to every object in LiDAR scans. Impact of Shape Miss To analyze the impact of shape miss on 3D object detection, we evaluate the car detection results of the scenarios where we recover certain types of shape miss on each object by borrowing points from similar objects (see the details of finding similar objects and filling points in Sec. 3.1). In each scenario, after resolving certain shape miss in both the train and val split of KITTI [54], we train and evaluate a popular detector PV-RCNN [150]. The four scenarios are: • NR: Using the original data without shape miss recovery. 43 • EO: Recovering the shape miss caused by external occlusion (adding the red points in Figure 3.3(a)). • EO+SM: Recovering the shape miss caused by external occlusion and signal miss (adding the red and blue points in Figure 3.3(a)). • EO+SM+SO: Recovering all the shape miss (adding the red, blue, and green points in Figure 3.3(a)). We report detection results on cars with three occlusion levels (level labels are provided by the dataset). As shown in Figure 3.3(b), without recovery (NR), it is more difficult to detect objects with higher occlusion levels. Recovering shapes miss will reduce the performance gaps between objects with different levels of occlusion. 
If all shape miss are resolved (EO+SM+SO), the perfor- mance gaps are eliminated and almost all objects can be effectively detected (APs> 99%). 3.2 Related Work 3.2.1 Point-based 3D Object Detectors V oxel-based methods divide point clouds by voxel grids to extract features [236]. Some of them also use sparse convolution to improve model efficiency, e.g., SECOND[207]. Point-based meth- ods such as PointRCNN [151] generate proposals directly from points. STD [213] applies sparse to dense refinement and V oteNet [134] votes the proposal centers from point clusters. These mod- els are supervised on the ground truth bounding boxes without explicit consideration for the object shapes. 44 3.2.2 Unsupervised Domain Adaptation for 3D Visual Tasks Most of the Unsupervised Domain Adaptation (UDA) methods focus on 2D tasks, only a few studies explore the UDA in 3D. [233, 139] align the global and local features for object-level tasks. To reduce the sparsity, [187] projects the point cloud to a 2D view, while [143] projects the point cloud to a birds-eye view (BEV). [45] creates a car model set and adapts its features to the detection object features. However, this study targets regular car 3D detection on a single point cloud domain. [182] is the first published study targeting UDA for 3D LiDAR detection. They identify the vehicle size as the domain gap between KITTI[54] and other datasets. So they resize the vehicles in data. In contrast, we identify the point cloud quality as the major domain gap between Waymo’s two datasets[161]. We use a learning-based approach to close the domain gap. 3.2.3 Point Cloud Transformation One way to improve point cloud quality is to conduct point cloud transformation. Studies of point cloud up-sampling [222, 218, 96] can transfer the point cloud of low density to the point cloud of higher density. However, they need high-density point cloud ground truth during training. These networks can densify the point cloud in the observed regions. But in our case, we also need to recover regions with no point observation, caused by "missing points". Point cloud completion networks [223, 27, 211, 194] aim to complete the point cloud. Special- ized in object-level completion, these models assume a single object has been manually located and the input only consists of the points on this object. Therefore, these models do not fit our purpose of object detection. Point cloud style transfer models [17, 16] can transfer the color theme and the object-level geometric style for the point cloud. However, these models do not focus on 45 preserving local details with high fidelity. Therefore, their transformation cannot directly help 3D detection. 3.2.4 Learning Shapes for 3D Object Detection Bounding box prediction requires models to understand object shapes. Some detectors learn shape- related statistics as an auxiliary task. PartA 2 [149] learns object part locations. SA-SSD and AssociateDet [61, 45] use auxiliary networks to preserve structural information. Studies [99, 206, 123, 202] such as SPG conduct point cloud completion to improve object detection. These models are shape-aware but overlook the impact of occlusion on object shapes. 3.3 Semantic Point Generation Detector In the input point cloudPC raw =fp 1 ;p 2 ;:::;p N g2R 3+F , each point has three channels of XYZ and F properties (e.g., intensity, elongation). Figure 3.5 illustrates the SPG-aided 3D detection pipeline. 
SPG takes raw point cloudPC raw as input and outputs the semantic points in the pre- dicted foreground regions. Then, these semantic points are combined with the original point cloud into an augmented point cloud PC aug , which is fed into a point cloud detector to obtain object detection results. As shown in Figure 3.6, SPG voxelizesPC raw into an evenly spaced 3D voxel grid, and learns the point cloud semantics for these voxels. For each voxel, the network predicts the probability confidence ~ P f of it being a foreground voxel (contained in a foreground object bounding box). In each foreground voxel, the network generates a semantic point ~ sp with point features ~ = [~ ; ~ f]. ~ 2R 3 is the XYZ coordinate of ~ sp and ~ f2R F is the point properties. 46 Figure 3.4: Our Semantic Point Generation (SPG) recovers the foreground regions by generating semantic points (red). Combined with the original cloud, these semantic points can be directly used by modern LiDAR-based detectors and help improve the detection results (green boxes). Since it is unreliable to generate points far away from observed points, we define a generation area. Only voxels occupied or neighbored by the observed points are considered within the gen- eration area. We also filter out semantic points with ~ P f less than P thresh , then take K semantic pointsf ~ sp 1 ; ~ sp 2 ;:::; ~ sp K g with the highest ~ P f and merge them with the original point cloudPC raw to getPC aug . In practice, we useP thresh = 0:5. To enable SPG to be directly used by modern LiDAR-based detectors, we encode the aug- mented point cloud PC aug asf^ p 1 ; ^ p 2 ;:::; ^ p N ; ~ sp 1 ; ~ sp 2 ;:::; ~ sp K g2 R 3+F +1 . Here we add another property channel to each point, indicating the confidence in foreground prediction: ~ P f is used for the semantic points, and 1.0 for the original raw points. 3.3.1 Training Targets To train SPG, we need to create two supervisions: 1)y f , the class label if a voxel (either occupied or empty) is a foreground voxel, which supervises ~ P f ; 2) 2 R 3+F , the regression target for semantic point features ~ . 47 3D Detector Foreground Generation Probability Thresholding Original Point Cloud Augmented Point Cloud Detection Result Figure 3.5: Illustration of SPG-aided 3D detection pipeline. SPG voxelizes the entire point cloud and generates a prediction for each voxel (both occupied and empty) within the generation areas. After applying probability thresholding, we take the top voxels with the highest foreground proba- bility and add a semantic point (red) at the predicted location in each of these voxels. These points are merged with the original point cloud and fed into the selected 3D point cloud detector. As visualized in Figure 3.6, we mark a point as a foreground point if it is inside an object bounding box. V oxels contained in a foreground bounding box are marked as foreground voxels V f . For voxel v i , we assign y f i = 1 if v i 2 V f and y f i = 0 otherwise. If v i is an occupied foreground voxel, we set i = [ i ; f i ] as the regression target, where i 2 R 3 is the centroid (XYZ) of all foreground points in v i while f i 2 R F is the mean of their point properties (e.g. intensity, elongation). 3.3.2 Model Structure 3 Classification Loss Regression Loss Info Propagation 1 2 foreground point semantic point foreground voxel Semantic Points Training Targets Construction Point Generation VFE Original Point Cloud raw point Figure 3.6: Training targets construction and SPG model architecture. 
Three steps to create the semantic point training targets: 1. V oxelization; 2. Foreground points searching 3. Label assign- ment and ground-truth point feature calculation. SPG includes three modules: 1. V oxel Feature Encoding module (VFE). 2. Information Propagation module. 3. Point Generation module. 48 The lower part of Figure 3.6 illustrates the network architecture. SPG uses a light-weight encoder-decoder network [236, 88] which is composed of three modules: 1) V oxel Feature Encoding[236] module aggregates points inside each voxel by using several MLPs. Similar to [88, 150], these voxel features are later stacked into pillars and projected onto a birds-eye view feature space; 2) Information Propagation module applies 2D convolutions on the pillar features. As shown in Figure 3.6, the semantic information in the occupied pillars (dark green) is populated into the neighboring empty pillars (light green), which enables SPG to recover the foreground regions in the empty space. 3. Point Generation module maps the pillar features to the corresponding voxels. For voxelv i in the generation area, it predicts foreground probability ~ P f i and point features ~ i = [~ i ; ~ f i ]. Finally, this module outputs semantic point ~ sp i at XYZ coordinate ~ i , while encoding it as [~ i ; ~ f i ; ~ P f i ]. 3.3.3 Recover the Foreground Regions The above pipeline supervises SPG to generate semantic points in the occupied voxels. However, it is also crucial to recover the empty voxels caused by the “missing points” problem. To generate semantic points in the empty areas, SPG employs two strategies: • "Hide and Predict", which produces the “missing points” on the source domain during training and guides SPG to recover the foreground object shape in the empty space. • "Semantic Area Expansion", which leverages the foreground/background voxel label derived from the bounding boxes and encourages SPG to recover more uncaptured foreground regions in each bounding box. 49 Foreground point Occupied voxel Generation Area Bounding box Negative supervision Positive supervision Weighted-positive supervision Generation Area (a) With Expansion Without Expansion (b) (c) (d) Supervision Background point Figure 3.7: Visualization of “Semantic Area Expansion”. (a) and (c) shows the occupied voxels and the generation area. (b) and (d) shows the supervision strategies. 3.3.3.1 Hide and Predict SPG voxelizes PC raw 2 R 3+F into voxel set V = fv 1 ;v 2 ;:::;v M g. Before passing V to the network, we randomly select % occupied voxels V hide V and hide all their points. During training, SPG is required to predict the foreground/background labely f for all voxels inV , even though it only observes points injVV hide j. The predicted point features ~ inV f hide should match the corresponding ground-truth calculated by these hidden points. This strategy brings two benefits: 1. Hiding points regions by regions mimics the missing point pattern in the target domain; 2. The strategy naturally creates the training targets for semantic points in the empty space. Section 3.3.9 shows the effectiveness of this strategy. Here we set = 25. 50 3.3.3.2 Semantic Area Expansion In section 3.1.1, we find the poor point cloud quality leads to insufficient points on each object and substantially undermines the detection performance. To remedy this problem, we allow SPG to expand the generation area to empty space. Figure 3.7 a and c shows the examples of the generation area with and without the expansion. 
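A minimal sketch of the "Hide and Predict" masking step is shown below; the voxel hashing and per-frame resampling are our own simplifications, and the default hide ratio of 25% corresponds to the 25% of occupied voxels stated above.

import numpy as np

def hide_voxels(points, voxel_size, hide_ratio=0.25, rng=None):
    """Drop all points in a random subset of occupied voxels.

    points: (N, 3 + F) training-frame point cloud.
    Returns (visible_points, hidden_mask), where hidden_mask marks the dropped points
    so their voxels can still be used as prediction targets.
    """
    rng = rng or np.random.default_rng()
    vid = np.floor(points[:, :3] / np.asarray(voxel_size)).astype(np.int64)
    _, voxel_of_point = np.unique(vid, axis=0, return_inverse=True)
    n_voxels = voxel_of_point.max() + 1
    hidden_voxels = rng.random(n_voxels) < hide_ratio   # pick ~hide_ratio of occupied voxels
    hidden_mask = hidden_voxels[voxel_of_point]
    return points[~hidden_mask], hidden_mask

During training, the voxels flagged by hidden_mask join V_hide, so that their foreground labels and point-feature targets can still be supervised, as the objectives in the next subsection make explicit.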
Without the expansion, we can use the ground-truth knowledge of foreground points to supervise SPG only on the occupied voxels (Figure 3.7b). However, with the expansion, there is no foreground point inside these empty voxels. Therefore, as shown in Figure 3.7d, we design a supervision scheme as follows:
1. For both the occupied and empty background voxels V_o^b and V_e^b, we impose negative supervision and set the label y^f = 0.
2. For the occupied foreground voxels V_o^f, we set y^f = 1.
3. For the empty voxels inside a bounding box, V_e^f, we set their foreground label y^f = 1 and assign a weighting factor λ, where λ < 1.
4. We only impose point-feature supervision at the occupied foreground voxels V_o^f.
To investigate the effectiveness of the expansion, we train a model on the OD training set and evaluate it on the Kirk validation set. The expansion results in 5%-10% more semantic points on foreground objects, which mitigates the "missing points" problem caused by environmental interference and occlusions. Figure 3.8 shows the generation results with and without the expansion. The supervision scheme encourages SPG to learn the extended shape of vehicle parts and enables SPG to fill in more foreground space with semantic points. We also conduct ablation studies (Section 3.3.9) to show the effectiveness of the proposed strategy.

(a) Without expansion, (b) With expansion.
Figure 3.8: Comparison of generated semantic points (red) with and without "Semantic Area Expansion".

3.3.4 Objectives
We use two loss functions, i.e., the foreground area classification loss L_cls and the feature regression loss L_reg. To supervise P̃^f with the label y^f, we use the Focal loss [102] to mitigate the background-foreground class imbalance. L_cls can be decomposed into focal losses on four categories of voxels: the occupied voxels V_o, the empty background voxels V_e^b, the empty foreground voxels V_e^f, and the hidden voxels V_hide. The labeling strategy for these categories is described in Section 3.3.3.2.

L_{cls} = \frac{1}{|V_o \cup V_e^b|} \sum_{V_o \cup V_e^b} L_{focal} + \frac{\lambda}{|V_e^f|} \sum_{V_e^f} L_{focal} + \frac{\beta}{|V_{hide}|} \sum_{V_{hide}} L_{focal}   (3.1)

We use the Smooth-L1 loss [64] for the point feature ψ̃ regression, and supervise the semantic points in the occupied foreground voxels V_o^f and the hidden foreground voxels V_hide^f:

L_{reg} = \frac{1}{|V_o^f|} \sum_{V_o^f} L_{smooth\text{-}L1}(\tilde{\psi}, \psi) + \frac{\beta}{|V_{hide}^f|} \sum_{V_{hide}^f} L_{smooth\text{-}L1}(\tilde{\psi}, \psi)   (3.2)

Please note that we are only interested in L_cls and L_reg on voxels inside the generation area. We find that λ = 0.5 and β = 2.0 achieve the best results.

3.3.5 Experiments
In this section, we first evaluate the effectiveness of SPG as a general UDA approach for 3D detection, based on the Waymo Domain Adaptation Dataset [161]. In addition, we show that SPG can also improve results for top-performing 3D detectors on the source domain [161, 54]. To demonstrate the wide applicability of SPG, we choose two representative detectors: 1) PointPillars [88], popular among industrial-grade autonomous driving systems, and 2) PV-RCNN [150], a high-performance LiDAR-based 3D detector [54, 161]. We perform two groups of model comparisons under the settings of unsupervised domain adaptation (UDA) and general 3D object detection: group 1, PointPillars vs. SPG + PointPillars; group 2, PV-RCNN vs. SPG + PV-RCNN. SPG can also be combined with range image-based detectors [119, 235, 8] by applying ray casting to the generated points; however, we leave this as future work.
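Before turning to the experimental setup, the category weighting in Eq. (3.1) can be summarized in a few lines. The sketch below is a hedged illustration: the per-voxel focal loss is assumed to be computed elsewhere, the boolean masks are assumed to partition the voxels inside the generation area as defined in Section 3.3.3.2, and the argument names are ours.

import torch

def spg_classification_loss(focal, masks, lam=0.5, beta=2.0):
    """Eq. (3.1): category-wise averaging of the per-voxel focal loss.

    focal: (V,) per-voxel focal loss on the predicted foreground probability.
    masks: dict of boolean (V,) masks for 'occupied', 'empty_bg', 'empty_fg', 'hidden'.
    lam weights the expanded (empty foreground) voxels, beta the hidden voxels.
    """
    def mean_over(mask):
        return focal[mask].sum() / mask.sum().clamp(min=1)

    base = mean_over(masks["occupied"] | masks["empty_bg"])
    return base + lam * mean_over(masks["empty_fg"]) + beta * mean_over(masks["hidden"])

The regression term in Eq. (3.2) follows the same pattern, with the smooth-L1 loss averaged over the occupied and hidden foreground voxels and the hidden term again weighted by β.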
Datasets Waymo Domain Adaptation dataset 1.0 [161] consists of two sub-datasets, Waymo Open Dataset (OD) and Waymo Kirkland Dataset (Kirk). OD provides 798 training segments of 158,361 LiDAR frames and 202 validation segments of 40,077 frames. Captured across Cal- ifornia and Arizona, 99:40% of its frames have dry weather. Kirk is a smaller dataset including 80 training segments of 15,797 frames and 20 validation segments of 3,933 frames. Captured in Kirkland, 97:99% its LiDAR frames have rainy weather. To examine a detector’s reliability when entering a new environment, we conduct UDA experiments without using labels from Kirk during training. 53 KITTI [54] contains 7481 training samples and 7518 testing samples. Following [26], we divide the training data into a train split and a val split containing 3721 and 3769 LiDAR frames, respectively. Implementation and Training Details We use a single lightweight network architecture in all experiments. As shown in Figure 3.6, V oxel Feature Encoding[236] module includes a single layer point-wise MLP and a voxel-wise max-pooling [136, 236]. The information Propagation module includes two levels of CNN layers. The first level includes three layers of CNN with stride 1. The second level includes one CNN layer with stride 2 and four subsequent CNN layers with stride 1, then up-sampled back to the original resolution. Each layer has an output dimension of 128. From the BEV feature map, the Point Generation module uses one FC layer to produce ~ P f and another FC layer to generate the features ~ for the voxels in each pillar. SPG and each detector are trained separately. We implement PointPillars following [88] and we use the PV-RCNN source code provided by [150] (the training settings on OD 1.0 is obtained via direct communication with the author). On Waymo Domain Adaptation Dataset [161], we set SPG voxel dimension to (0.32m, 0.32m, 0.4m) for PointPillars and (0.2m, 0.2m, 0.3m) for PV-RCNN. On KITTI, we set the voxel dimension to (0.16m, 0.16m, 0.2m) and (0.2m, 0.2m, 0.3m) for PointPillars and PV-RCNN, respectively. By default, the generation area includes voxels within 6 steps of any occupied voxel. After probability thresholding, we preserve up to 8000 semantic points for Waymo Domain Adaptation Dataset and 6000 for KITTI. 54 Target Domain - Kirk Source Domain - OD Vehicle Pedestrian Vehicle Pedestrian Difficulty Method 3D AP BEV AP 3D AP BEV AP 3D AP BEV AP 3D AP BEV AP LEVEL_1 PointPillars 34.65 51.88 20.65 22.33 57.27 72.26 55.20 63.82 SPG + PointPillars 41.56 60.44 23.72 24.83 62.44 77.63 56.06 64.66 Improvement +6.91 +8.56 +3.07 +2.50 +5.17 +5.37 +0.86 +0.84 LEVEL_2 PointPillars 31.67 47.93 17.66 18.40 52.96 69.09 51.33 60.13 SPG + PointPillars 38.15 56.94 19.57 20.67 58.54 74.90 52.33 60.93 Improvement +6.48 +9.01 +1.91 +2.27 +5.58 +5.81 +1.00 +0.80 LEVEL_1 PV-RCNN 55.16 70.38 24.47 25.39 74.01 85.13 65.34 70.35 SPG + PV-RCNN 58.31 72.56 30.82 31.92 75.27 87.38 66.93 70.37 Improvement +3.15 +2.18 +6.35 +6.53 +1.26 +2.25 +1.59 +0.02 LEVEL_2 PV-RCNN 45.81 60.13 17.16 17.88 64.69 76.84 56.03 60.81 SPG + PV-RCNN 48.70 62.03 22.05 22.65 65.98 78.05 57.68 60.88 Improvement +2.89 +1.90 +4.89 +4.77 +1.29 +1.21 +1.65 +0.07 Table 3.2: Results on Waymo Open Dataset 1.0 and Kirkland Dataset. Results for PointPillars are based on our own implementation following [88]. We use the PV-RCNN source code and obtain training settings for Waymo Open Dataset [161] via direct communication with the author. 
Method Baseline RandDrop 3-frames 5-frames SPG 3D AP 34.65 35.45 38.00 38.51 41.56 Table 3.3: Comparisons of different strategies targeting the deteriorating point cloud quality. The models are trained on OD and evaluated on Kirk. The metric is LEVEL_1 Vehicle 3D AP. We use PointPillars[88] as the baseline. 3.3.6 Evaluation on the Waymo Open Dataset We perform two groups of model comparison by training them on the OD training set and evaluat- ing them on both the OD validation set and the Kirk validation set. Evaluation Metrics The Kirk 1.0 validation set only provides the evaluation labels of vehicles and pedestrians. We use the official evaluation tool released by [161]. The IoU thresholds for vehicles and pedestrians are 0.7 and 0.5. In Table 3.2 we report both 3D and BEV AP on two difficulty levels. More results with distance breakdown are shown in the supplemental. Target Domain On Kirk, we observe that SPG brings remarkable improvements over both de- tectors across all object types. Averaged over two difficulty levels, SPG improves PointPillars on 55 Kirk vehicle 3D AP by 6:7% and BEV AP by 8:8%. For PV-RCNN, SPG improves Kirk pedestrian 3D AP by 5:6% and BEV AP by 5:7%. Source Domain Different from most UDA methods [30, 66, 148] that only optimize the per- formance on the target domain, SPG also consistently improves the results on the source domain. Averaged across both difficulty levels, SPG improves OD vehicle 3D AP for PointPillars by 5:4% and improves OD pedestrian 3D AP for PV-RCNN by 1:6% Compare with alternative strategies We compare SPG with alternative strategies that also tar- get the deteriorating point cloud quality. Two strategies are implemented: 1. RandDrop, where we randomly drop 17% of points on the source domain during training. This dropout ratio is chosen to produce the same amount of missing points as in the target domain (see Table 3.1). 2. K-frames, where we useK consecutive historical frames on the source domain and the target domain. The points in the firstK 1 are transformed into the last frame according to the ground- truth ego-motion so that the last frame hasK times the number of points in the scene. We employ PointPillars as the baseline and choose LEVEL_1 vehicle 3D AP as the main metric on the Kirk validation set, during UDA. As shown in Table 3.3, although "RandDrop" enforces the number of missing points in the source domain to match with that in the target domain, the pattern of missing points still differs from the reality (see Figure 3.1), which limits the improvement to only 0:80% in 3D AP. To rem- edy the “missing points” problem, “3-frames” contains real points of 3 frames, and “5-frames” contains points of 5 frames. With around 800K points per scene, “5-frames” significantly improve the single-frame baseline. However, aggregating multiple frames inevitably costs higher memory 56 usage and longer processing time. Remarkably, SPG can outperform “5-frames”, by adding only 8000 semantic points, which is less than 6% of the points in a single frame. 3.3.7 Evaluation on the KITTI Dataset In this section, we show besides the usefulness in UDA (Sec. 3.3.6) the proposed SPG can also boost performance in another popular 3D detection benchmark (i.e. KITTI [54]). We adopt the same evaluation protocols as [88, 150]. KITTI Test Set As shown in Table 3.4, SPG significantly improves PV-RCNN on Car 3D de- tection. As of Mar. 3rd, 2021, our method ranks the 1st on KITTI car 3D detection among all published methods (4th among all submitted approaches). 
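For clarity, the "K-frames" strategy compared above can be sketched as follows: points from the first K-1 frames are mapped into the newest frame using the ground-truth ego-motion. The pose convention (ego-to-world 4x4 matrices) is an assumption of this sketch.

```python
import numpy as np

def aggregate_k_frames(point_clouds, poses):
    """Merge K consecutive LiDAR frames into the coordinate frame of the last one.

    point_clouds: list of (N_i, 3) arrays, ordered oldest to newest.
    poses: list of (4, 4) ego-to-world transforms, one per frame
           (assumed available from ground-truth ego-motion).
    """
    world_to_last = np.linalg.inv(poses[-1])
    merged = [point_clouds[-1]]
    for pts, pose in zip(point_clouds[:-1], poses[:-1]):
        homo = np.concatenate([pts, np.ones((len(pts), 1))], axis=1)  # (N, 4)
        in_last = (world_to_last @ pose @ homo.T).T[:, :3]            # points in the last frame
        merged.append(in_last)
    return np.concatenate(merged, axis=0)
```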
Especially, SPG benefits the detection of hard objects (object truncation up to 50%). Among all submitted methods, SPG surpasses all models on hard object detection 3D AP by a big margin and achieves the highest overall 3D AP of 83:83% (averaged over Easy, Mod. and Hard). Car - 3D AP Method Reference Easy Mod. Hard Avg. SA-SSD[61] CVPR 2020 88.75 79.79 74.16 80.90 3D-CVF[219] ECCV 2020 89.20 80.05 73.11 80.79 CIA-SSD[192] AAAI 2021 89.59 80.28 72.87 80.91 Asso-3Ddet[45] CVPR 2020 85.99 77.40 70.53 77.97 V oxel R-CNN[41] AAAI 2021 90.90 81.62 77.06 83.19 PV-RCNN[150] CVPR 2020 90.25 81.43 76.82 82.83 SPG+PV-RCNN - 90.49 82.13 78.88 83.83 Table 3.4: Car detection Results on the KITTI test set. See the full list of comparisons in the supplemental. KITTI Validation Set We summarize the results in Table 3.5. We train each group of models using the recommended settings of baseline detectors [88, 150]. 57 Car - 3D AP Car - BEV AP Pedestrian - 3D AP Pedestrian - BEV AP Method Easy Mod. Hard Easy Mod. Hard Easy Mod. Hard Easy Mod. Hard PointPillars 87.75 78.39 75.18 92.03 88.05 86.66 57.30 51.41 46.87 61.59 56.01 52.04 SPG + PointPillars 89.77 81.36 78.85 94.38 89.92 87.97 59.65 53.55 49.24 65.38 59.48 55.32 Improvement +2.02 +2.97 +3.67 +2.35 +1.87 +1.31 +2.35 +2.14 +2.47 +3.79 +3.47 +3.28 PV-RCNN 92.10 84.36 82.48 93.02 90.33 88.53 64.26 56.67 51.91 67.97 60.52 55.80 SPG + PV-RCNN 92.53 85.31 82.82 94.99 91.11 88.86 69.66 61.80 56.39 71.79 64.50 59.51 Improvement +0.43 +0.95 +0.34 +1.97 +0.78 +0.33 +5.40 +5.13 +4.48 +3.82 +3.98 +3.71 Table 3.5: Comparisons on the KITTI validation set. Average Precision (AP) is computed over 40 recall positions. The baseline results[150, 166] are obtained based on publically released models. See more results (including Cyclist) in the supplemental. SPG remarkably improves both PointPillars and PV-RCNN on all object types and difficulty levels. Specifically, for PointPillars, SPG improves the 3D AP of car detection by 2:02%, 2:97%, 3:67% on easy, moderate, and hard levels, respectively. For PV-RCNN, SPG improves the 3D AP of pedestrian detection by 5:40%, 5:13%, 4:48% on easy, moderate, and hard levels, respectively. 3.3.8 Model Efficiency We evaluate the efficiency of SPG on the KITTI val split (Table 3.6). SPG only contains 0:39 mil- lion parameters while adding no more than 17 milliseconds latency to the detectors. This indicates that SPG is highly efficient for industrial-grade deployment with a stringent computation budget. Detectors PointPillars PV-RCNN - With SPG No Yes No Yes Yes Latency (ms) 23.56 36.67 139.96 156.85 16.82 Parameters 4.83M 5.22M 13.12M 13.51M 0.39M Table 3.6: Latency and model parameters. “M” stands for million. The last column shows the results of a standalone SPG. The evaluation is based on a 1080Ti GPU with a batch size of 1. The latency is averaged over the KITTI val split. 58 3.3.9 Ablation Studies We conduct ablation studies on “Semantic Area Expansion”, “Hide and Predict” and whether to add foreground confidence ( ~ P f ) as a point property and show all of them can benefit detection quality (see Table 3.7). We also change the classification weighting factor on the empty fore- ground voxelsV f e . A larger encourages more point generation in the empty foreground space. However, in reality, an object typically will not occupy the entire space within a bounding box. Therefore, over-aggressively generating extra points does not help improve the performance (see = 1:0). 
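As a rough guide to how latency and parameter counts such as those in Table 3.6 can be measured, a minimal profiling probe is sketched below. It assumes a PyTorch model on a CUDA device with batch size 1 and will not reproduce the reported 1080 Ti figures exactly.

```python
import time
import torch

def profile(model, example_input, warmup=10, iters=100):
    """Return (parameters in millions, average forward latency in ms)."""
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):
            model(example_input)          # warm up kernels and caches
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            model(example_input)
        torch.cuda.synchronize()
        latency_ms = (time.time() - start) / iters * 1000.0
    return params_m, latency_ms
```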
Probability thresholding In Table 3.8, we show the effect of choosing different thresholds dur- ing probability thresholding. While a higher P thresh only keeps semantic points with high fore- ground probability ~ P f , a lowerP thresh admits more points but introduces more uncertainties. We find the threshold of 0:5 achieves the best results. Hide & Foreground 3D Model Expansion Predict Confidence AP Improve Baseline 34.65 SPG X 35.89 +1.24 SPG 25% X 38.09 +3.44 SPG X(=0.0) 25% X 38.96 +4.31 SPG X(=1.0) 25% X 38.42 +3.77 SPG X(=0.5) X 39.22 +4.57 SPG X(=0.5) 25% 37.96 +3.31 SPG(ours) X(=0.5) 25% X 41.56 +6.91 Table 3.7: Ablation studies of SPG. The models are trained on OD and evaluated on Kirk. The metric is LEVEL_1 Vehicle 3D AP. We use PointPillars[88] as our baseline. 59 P thresh 0:3 0:4 0:5 0:6 0:7 3D AP 39.39 40.09 41.56 41.18 40.89 Table 3.8: Ablation studies on the probability threshold P thresh (only keep the semantic point if ~ P f >P thresh ). Our best SPG model usesP thresh = 0:5. The metric is LEVEL_1 Vehicle 3D AP on the Kirk validation set. 3.4 Behind the Curtain Detector Let denote the parameters of a detector,fp 1 ;p 2 ;:::;p N g denote the LiDAR point cloud,X;D;S ob ;S oc denote the estimated box center, the box dimension, the observed objects shapes, and the occluded object shapes, respectively. Most LiDAR-based 3D object detectors [217, 24, 153] only supervise the bounding box prediction. These models have MLE = argmax P (X;Djfp 1 ;p 2 ;:::;p N g; ); (3.3) while structure-aware models [149, 61, 45] also superviseS ob ’s statistics so that MLE = argmax P (X;D;S ob jfp 1 ;p 2 ;:::;p N g; ): (3.4) None of the above studies explicitly model the complete object shapesS =S ob [S oc , while the experiments in Sec. 3.1.2 show the improvements ifS is obtained. BtcDet estimatesS by predict- ing the shape occupancyO S for regions of interest. After that, BtcDet conducts object detection conditioned on the estimated probability of occupancyP(O S ). The optimization objectives can be described as follows: 60 argmax P (O S jfp 1 ;p 2 ;:::;p N g;R SM ;R OC ; ); (3.5) argmax P (X;Djfp 1 ;p 2 ;:::;p N g;P(O S ); ): (3.6) Model overview. As illustrated in Figure 3.11, BtcDet first identifies the regions of occlusionR OC and signal missR SM , and then, let a shape occupancy network estimate the probability of object shape occupancyP(O S ). The training process is described in Sec. 3.4.1. Next, BtcDet extracts the point cloud 3D features by a backbone network . The features are sent to a Region Proposal Network (RPN) to generate 3D proposals. To leverage the occupancy estimation, the sparse tensorP(O S ) is concatenated with the feature maps of . (See Sec. 3.4.2.) Finally, BtcDet applies the proposal refinement. The local geometric features f geo are com- posed ofP(O S ) and the multi-scale features from . For each region proposal, we construct local grids covering the proposal box. BtcDet pools the local geometric featuresf geo onto the local grids, aggregates the grid features and generates the final bounding box predictions. (See Sec. 3.4.3.) 3.4.1 Learning Shapes in Occlusion Approximate the complete object shapes for ground truth labels. Occlusion and signal miss precludes the knowledge of the complete object shapesS. However, we can assemble the approx- imated complete shapesS, based on two assumptions: • Most foreground objects resemble a limited number of shape prototypes, e.g., pedestrians share a few body types. 61 Figure 3.9: Learning Occluded Shapes. 
(a) The regions of occlusion or signal miss $R_{OC} \cup R_{SM}$ can be identified after the spherical voxelization of the point cloud. (b) To label the occupancy $O_S$ (1 or 0), we place the approximated complete object shapes $S$ (red points) in the corresponding boxes. (c) A shape occupancy network predicts the shape occupancy probability $P(O_S)$ for voxels in $R_{OC} \cup R_{SM}$, supervised by $O_S$. (d) Voxels are colored orange if they have a prediction $P(O_S) > 0.3$.

• Foreground objects, especially vehicles and cyclists, are roughly symmetric.

We use the labeled bounding boxes to query points belonging to the objects. For cars and cyclists, we mirror the object points against the middle section plane of the bounding box. A heuristic $H(A, B)$ is created to evaluate whether a source object $B$ covers most parts of a target object $A$ and provides points that can fill $A$'s shape miss. To approximate $A$'s complete shape, we select the top 3 source objects $B_1, B_2, B_3$ with the best scores. The final approximation $S$ consists of $A$'s original points and the points of $B_1, B_2, B_3$ that fill $A$'s shape miss. The target objects are the occluded objects in the current training frame, while the source objects are other objects of the same class in the detection training set. Both can be extracted by the ground truth bounding boxes. Please find details of $H(A, B)$ in Appendix B and more visualizations of assembling $S$ in Appendix G.

Figure 3.10: Assemble the approximated complete shape $S$ for an object (blue) by using points from the top matching objects.

Identify $R_{OC} \cup R_{SM}$ in the spherical coordinate system. According to our analysis in Sec. 3.1.2, "shape miss" only exists in the occluded regions $R_{OC}$ and the regions with signal miss $R_{SM}$ (see Figure 3.2(c) and (d)). Therefore, we need to identify $R_{OC} \cup R_{SM}$ before learning to estimate shapes. In real-world scenarios, there exists at most one point in the tetrahedron frustum of a range image pixel. When the laser is stopped at a point, the entire frustum behind the point is occluded. We propose to voxelize the point cloud using an evenly spaced spherical grid so that the occluded regions can be accurately formed by the spherical voxels behind any LiDAR point. As shown in Figure 3.9(a), each point $(x, y, z)$ is transformed to the spherical coordinate system as $(r, \phi, \theta)$:

$r = \sqrt{x^2 + y^2 + z^2}, \quad \phi = \arctan2(y, x), \quad \theta = \arctan2(z, \sqrt{x^2 + y^2}). \quad (3.7)$

$R_{OC}$ includes the nonempty spherical voxels and the empty voxels behind them. In Figure 3.2(a), the dashed lines mark the potential areas of signal miss. In the range view, we can find pixels on the borders between the areas having LiDAR signals and the areas with no signal. $R_{SM}$ is formed by the spherical voxels that project to these pixels.

Create training targets. In $R_{OC} \cup R_{SM}$, we predict the probability $P(O_S)$ that a voxel contains points of $S$. As illustrated in Figure 3.9(b), the approximated shapes $S$ are placed at the locations of the corresponding objects. We set $O_S = 1$ for the spherical voxels that contain $S$, and $O_S = 0$ for the others. These voxel labels serve as the ground truth for approximating the occupancy of the complete object shape. Estimating occupancy has two advantages over generating points:

• $S$ is assembled from multiple objects. The shape details approximated by the borrowed points are inaccurate and the point density of different objects is inconsistent. The occupancy $O_S$ avoids these issues after rasterization.

• The plausibility issue of point generation is avoided.

Estimate the shape occupancy.
In $R_{OC} \cup R_{SM}$, we encode each nonempty spherical voxel with the average properties (x, y, z, feats) of the points inside and send these encodings to a shape occupancy network. The network consists of two down-sampling sparse-conv layers and two up-sampling inverse-convolution layers. Each layer also includes several sub-manifold sparse-convs [58] (see Appendix D). The spherical sparse 3D convolutions are similar to the ones in the Cartesian coordinate system, except that the voxels are indexed along $(r, \phi, \theta)$. The output $P(O_S)$ is supervised by the sigmoid cross-entropy Focal Loss [102]:

$L_{focal}(p_v) = -(1 - p_v)^{\gamma} \log(p_v), \quad (3.8)$

where $p_v = P(O_S)$ if $O_S = 1$ at voxel $v$, and $p_v = 1 - P(O_S)$ otherwise. The overall shape loss is

$L_{shape} = \frac{\sum_{v \in R_{OC} \cup R_{SM}} w_v \, L_{focal}(p_v)}{|R_{OC} \cup R_{SM}|}, \quad (3.9)$

where $w_v = \lambda$ if $v$ lies in a region of shape miss, and $w_v = 1$ otherwise. Since $S$ borrows points from other objects in the shape miss regions, we assign them a weighting factor $\lambda$, where $\lambda < 1$.

Figure 3.11: The detection pipeline. BtcDet first identifies the regions of occlusion and signal miss $R_{OC} \cup R_{SM}$. In these regions, BtcDet estimates the shape occupancy probability $P(O_S)$ (the orange voxels have $P(O_S) > 0.3$). When the backbone network extracts detection features from the point cloud, $P(O_S)$ is concatenated with the backbone's intermediate feature maps. Then, an RPN takes the output and generates 3D proposals. For each proposal (e.g., the green box), BtcDet pools the local geometric features $f_{geo}$ onto the nearby grids and finally generates the final bounding box prediction (the red box) and the confidence score.

3.4.2 Shape Occupancy Probability Integration

Trained with this customized supervision, the shape occupancy network learns the shape priors of partially observed objects and generates $P(O_S)$. To benefit detection, $P(O_S)$ is transformed from the spherical coordinate system to the Cartesian coordinate system and fused with the backbone, a sparse 3D convolutional network that extracts detection features in Cartesian coordinates.

For example, a spherical voxel has a center $(r, \phi, \theta)$ which is transformed as $x = r\cos\theta\cos\phi$, $y = r\cos\theta\sin\phi$, $z = r\sin\theta$. Assume $(x, y, z)$ is inside a Cartesian voxel $v_{i,j,k}$. Since several spherical voxels can be mapped to $v_{i,j,k}$, $v_{i,j,k}$ takes the max value of these voxels $SV(v_{i,j,k})$:

$P(O_S)_{v_{ijk}} = \max\left(\{P(O_S)_{sv} : sv \in SV(v_{i,j,k})\}\right). \quad (3.10)$

The occupancy probability of these Cartesian voxels forms a sparse tensor map $P(O_S)^\star = \{P(O_S)_v\}$, which is then down-sampled by max-pooling into multiple scales and concatenated with the backbone's intermediate feature maps:

$f_i^{in} = \left[ f_{i-1}^{out},\ \mathrm{maxpool}_2^{\,i-1}(P(O_S)^\star) \right], \quad (3.11)$

where $f_i^{in}$, $f_{i-1}^{out}$ and $\mathrm{maxpool}_2^{\,i-1}(\cdot)$ denote the input features of the backbone's $i$-th layer, the output features of its $(i-1)$-th layer, and applying stride-2 max-pooling $i-1$ times, respectively.

The Region Proposal Network (RPN) takes the output features of the backbone and generates 3D proposals. Each proposal includes $(x_p, y_p, z_p)$, $(l_p, w_p, h_p)$, $\theta_p$, $p_p$, namely, the center location, proposal box size, heading, and proposal confidence.

3.4.3 Occlusion-Aware Proposal Refinement

Local geometry features. BtcDet's refinement module further exploits the benefit of shape occupancy. To obtain accurate final bounding boxes, BtcDet needs to look at the local geometries around the proposals. Therefore, we construct a local feature map $f_{geo}$ by fusing multiple levels of the backbone's features. In addition, we also fuse $P(O_S)^\star$ into $f_{geo}$ to bring awareness of the shape miss in the local regions. $P(O_S)^\star$ provides two benefits for proposal refinement:

• $P(O_S)^\star$
only has values inR OC [R SM so that it can help the box regression avoid the regions outsideR OC [R SM , e.g., the regions with cross marks in Figure 3.11. • The estimated occupancy indicates the existence of unobserved object shapes, especially for empty regions with highP(O S ) , e.g., some orange regions in Figure 3.11. f geo is a sparse 3D tensor map with spatial resolution of 400 352 5. The process for producing f geo is described in Appendix D. RoI pooling. On each proposal, we construct local grids which have the same heading as the proposal. To expand the receptive field, we set a size factor so that: w grid =w p ; l grid =l p ; h grid =h p : (3.12) The grid has a dimension of 12 4 2. We pool the nearby featuresf geo onto the nearby grids through trilinear interpolation (see Figure 3.11) and aggregate them by sparse 3D convolutions. After that, the refinement module predicts an IoU-related class confidence score and the residues between the 3D proposal boxes and the ground truth bounding boxes, following [207, 150]. 67 3.4.4 Loss The RPN lossL rpn and the proposal refinement lossL pr follow the most popular design among detectors [150, 207]. The total loss is: L total = 0:3L shape +L rpn +L pr : (3.13) More details of the losses and the network architectures can be found in Appendix C and D. 3.4.5 Experiments In this section, we describe the implementation details of BtcDet and compare BtcDet with state- of-the-art detectors on two datasets: the KITTI Dataset [54] and the Waymo Open Dataset [161]. We also conduct ablation studies to demonstrate the effectiveness of shape occupancy and feature integration strategies. More detection results can be found in Appendix F. The quantitative and qualitative evaluations of the occupancy estimation can be found in Appendix E and H. Datasets. The KITTI Dataset includes 7481 LiDAR frames for training and 7518 LiDAR frames for testing. We follow [26] to divide the training data into a train split of 3712 frames and a val split of 3769 frames. The Waymo Open Dataset (WOD) consists of 798 segments of 158361 LiDAR frames for training and 202 segments of 40077 LiDAR frames for validation. The KITTI Dataset only provides LiDAR point clouds in 3D, while the WOD also provides LiDAR range images. Implementation and training details. BtcDet transforms the point locations (x;y;z) to (r;;) for the KITTI Dataset, while directly extracting (r;;) from the range images for the WOD. On the KITTI Dataset, we use a spherical voxel size of (0:32m; 0:52 ; 0:42 ) within the range 68 Method Car 3DAP R40 Ped. 3DAP R40 Cyc. 3DAP R40 3DAP R11 Easy Mod. Hard Easy Mod. Hard Easy Mod. Hard Car Mod. PointPillars [88] 87.75 78.39 75.18 57.30 51.41 46.87 81.57 62.94 58.98 77.28 SECOND [207] 90.97 79.94 77.09 58.01 51.88 47.05 78.50 56.74 52.83 76.48 SA-SSD [61] 92.23 84.30 81.36 - - - - - - 79.91 PV-RCNN [150] 92.57 84.83 82.69 64.26 56.67 51.91 88.88 71.95 66.78 83.90 V oxel R-CNN [41] 92.38 85.29 82.86 - - - - - - 84.52 BtcDet (Ours) 93.15 86.28 83.86 69.39 61.19 55.86 91.45 74.70 70.08 86.57 Table 3.9: Comparison on the KITTI val set, evaluated by the 3D Average Precision (AP) under 40 recall thresholds (R40). The 3D APs on under 11 recall thresholds are also reported for moderate car objects. [2:24m; 70:72m] forr, [40:69 ; 40:69 ] for and [16:60 ; 4:00 ] for. On the WOD, we use a spherical voxel size of (0:32m; 0:81 ; 0:31 ) within the range [2:94m; 74:00m] forr, [180 ; 180 ] for and [33:80 ; 6:00 ] for. 
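As a concrete companion to Eq. 3.7 and the spherical voxel sizes quoted above, the sketch below converts LiDAR points to $(r, \phi, \theta)$ and bins them into spherical voxel indices. The KITTI ranges and sizes are taken from the text; the axis naming, sign conventions, and boundary handling are assumptions of this sketch.

```python
import numpy as np

def spherical_voxel_indices(points, r_range=(2.24, 70.72),
                            phi_range=(-40.69, 40.69),
                            theta_range=(-16.60, 4.00),
                            voxel=(0.32, 0.52, 0.42)):
    """Map Cartesian LiDAR points to spherical voxel indices (Eq. 3.7).

    points: (N, 3) array of (x, y, z). Angular quantities are in degrees,
    matching the voxel sizes of 0.52 deg (azimuth) and 0.42 deg (inclination).
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.sqrt(x**2 + y**2 + z**2)
    phi = np.degrees(np.arctan2(y, x))                         # azimuth
    theta = np.degrees(np.arctan2(z, np.sqrt(x**2 + y**2)))    # inclination
    sph = np.stack([r, phi, theta], axis=1)

    lower = np.array([r_range[0], phi_range[0], theta_range[0]])
    upper = np.array([r_range[1], phi_range[1], theta_range[1]])
    inside = ((sph >= lower) & (sph <= upper)).all(axis=1)
    idx = np.floor((sph - lower) / np.array(voxel)).astype(np.int64)
    return idx[inside], inside
```

Once every point has a spherical index, the occluded region can be collected by marking, for each occupied (azimuth, inclination) cell, all radial bins behind the closest point.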
Determined by grid search, we set = 2 in Eq.3.8, = 0:2 in Eq.3.9 and = 1:05 in Eq.3.12. In all of our experiments, we train our models with a batch size of 8 on 4 GTX 1080 Ti GPUs. On the KITTI Dataset, we train BtcDet for 40 epochs, while on the WOD, we train BtcDet for 30 epochs. The BtcDet is end-to-end optimized by the ADAM optimizer [78] from scratch. We apply the widely adopted data augmentations [150, 41, 88, 212, 216], which includes flipping, scaling, rotation, and ground-truth augmentation. 3.4.5.1 Evaluation on the KITTI Dataset We evaluate BtcDet on the KITTI val split after training it on the train split. To evaluate the model on the KITTI test set, we train BtcDet on 80% of all train+val data and hold out the remaining 20% data for validation. Following the protocol in [54], results are evaluated by the Average Precision (AP) with an IoU threshold of 0.7 for cars and 0.5 for pedestrians and cyclists. 69 Method Reference Modality Car 3DAP R40 Cyc. 3DAP R40 Easy Mod. Hard mAP Easy Mod. Hard mAP EPNet [71] ECCV 2020 LiDAR+RGB 89.81 79.28 74.59 81.23 - - - - 3D-CVF [219] ECCV 2020 LiDAR+RGB 89.20 80.05 73.11 80.79 - - - - PointPillars [88] CVPR 2019 LiDAR 82.58 74.31 68.99 75.29 77.10 58.65 51.92 62.56 STD [213] ICCV 2019 LiDAR 87.95 79.71 75.09 80.92 78.69 61.59 55.30 65.19 HotSpotNet [24] ECCV 2020 LiDAR 87.60 78.31 73.34 79.75 82.59 65.95 59.00 69.18 PartA 2 [149] TPAMI 2020 LiDAR 87.81 78.49 73.51 79.94 79.17 63.52 56.93 66.54 3DSSD [212] CVPR 2020 LiDAR 88.36 79.57 74.55 80.83 82.48 64.10 56.90 67.83 SA-SSD [61] CVPR 2020 LiDAR 88.75 79.79 74.16 80.90 - - - - Asso-3Ddet [45] CVPR 2020 LiDAR 85.99 77.40 70.53 77.97 - - - - PV-RCNN [150] CVPR 2020 LiDAR 90.25 81.43 76.82 82.83 78.60 63.71 57.65 66.65 V oxel R-CNN [41] AAAI 2021 LiDAR 90.90 81.62 77.06 83.19 - - - - CIA-SSD [192] AAAI 2021 LiDAR 89.59 80.28 72.87 80.91 - - - - TANet [108] AAAI 2021 LiDAR 83.81 75.38 67.66 75.62 73.84 59.86 53.46 62.39 BtcDet (Ours) AAAI 2022 LiDAR 90.64 82.86 78.09 83.86 82.81 68.68 61.81 71.10 Improvement - - -0.26 +1.24 +0.94 +0.67 +0.33 +2.73 +2.81 +1.92 Table 3.10: Comparison on the KITTI test set, evaluated by the 3D Average Precision (AP) of 40 recall thresholds (R40) on the KITTI server. BtcDet surpasses all the leaderboard front runners that are associated with publications released before our submission. The mAPs are averaged over the APs of easy, moderate, and hard objects. Please find more results in Appendix F. KITTI validation set. As summarized in Table 3.9, we compare BtcDet with the state-of-the-art LiDAR-based 3D object detectors on cars, pedestrians, and cyclists using the AP under 40 recall thresholds (R40). We reference the R40 APs of SA-SSD, PV-RCNN and V oxel R-CNN to their papers, the R40 APs of SECOND to [127], and the R40 APs of PointRCNN and PointPillars to the results of the officially released code. We also report the published 3D APs under 11 recall thresh- olds (R11) for moderate car objects. On all object classes and difficulty levels, BtcDet outperforms models that only supervise bounding boxes (Eq.3.3) as well as structure-aware models (Eq.3.4). Specifically, BtcDet outperforms other models by 2:05% 3D R11 AP on the moderate car objects, which makes it the first detector that reaches above 86% on this primary metric. KITTI test set. As shown in Table 3.10, we compare BtcDet with the front runners on the KITTI test leaderboard. Besides the official metrics, we also report the mAPs that average over the APs of 70 easy, moderate, and hard objects. As of May. 
4th, 2021, compared with all the models associated with publications, BtcDet surpasses them on car and cyclist detection by big margins. Those methods include the models that take inputs of both LiDAR and RGB images and the ones taking LiDAR input only. We also list more comparisons and the results in Appendix F. 3.4.5.2 Evaluation on the Waymo Open Dataset LEVEL_1 3D mAP mAPH LEVEL_2 3D mAP mAPH Method Overall 0-30m 30-50m 50m-Inf Overall Overall 0-30m 30-50m 50m-Inf Overall PointPillar [88] 56.62 81.01 51.75 27.94 - - - - - - MVF [235] 62.93 86.30 60.02 36.02 - - - - - - SECOND [207] 72.27 - - - 71.69 63.85 - - - 63.33 Pillar-OD [183] 69.80 88.53 66.50 42.93 - - - - - - AFDet [53] 63.69 87.38 62.19 29.27 - - - - - - PV-RCNN [150] 70.30 91.92 69.21 42.17 69.69 65.36 91.58 65.13 36.46 64.79 V oxel R-CNN [41] 75.59 92.49 74.09 53.15 - 66.59 91.74 67.89 40.80 - BtcDet (ours) 78.58 96.11 77.64 54.45 78.06 70.10 95.99 70.56 43.87 69.61 Table 3.11: Comparison for vehicle detection on the Waymo Open Dataset validation set. We also compare BtcDet with other models on the Waymo Open Dataset (WOD). We report both 3D mean Average Precision (mAP) and 3D mAP weighted by Heading (mAPH) for vehicle detection. The official metrics also include separate mAPs for objects belonging to different dis- tance ranges. Two difficulty levels are also introduced, where the LEVEL_1 mAP calculates for objects that have more than 5 points and the LEVEL_2 mAP calculates for objects that have more than 1 point. As shown in Table 3.11, BtcDet outperforms these state-of-the-art detectors on all distance ranges and all difficulty levels by big margins. BtcDet outperforms other detectors on the LEVEL_- 1 3D mAP by 2:99% and the LEVEL_2 3D mAP by 3:51%. In general, BtcDet brings more improvement on the LEVEL_2 objects, since objects with fewer points usually suffer more from 71 occlusion and signal miss. These strong results on WOD, one of the largest published LiDAR datasets, manifest BtcDet’s ability to generalize. 3.4.5.3 Ablation Studies We conduct ablation studies to demonstrate the effectiveness of shape occupancy and feature inte- gration strategies. All model variants are trained on the KITTI train split and evaluated on the val split. Model Variant Learned Features Integrated Features 3DAP R11 Car Mod. BtcDet 1 (base) 83.71 BtcDet 2 R OC [R SM 84.01 BtcDet 3 P(O S ) ? P(O S ) ? 86.03 BtcDet 4 P(O S ) } 1(P(O S ) ? 0:5) 85.59 BtcDet (main) P(O S ) } P(O S ) ? 86.57 Table 3.12: Ablation studies on the learned features (Sec. 3.4.1) and the features fused into and f geo (Sec. 3.4.2). BtcDet 2 directly uses a binary map that labelsR OC [R SM . } and? indicate the spherical and the Cartesian coordinates. The “1” operator converts float values to binary codes with a threshold of 0.5. All variants share the same architecture. Shape Features. As shown in Table 3.12, we conduct ablation studies by controlling the shape features learned by and the features used in the integration. All the model variants share the same architecture and integration strategies. Similarly to [67], BtcDet 2 directly fuses the binary map of. R OC [R SM into the detection pipeline. Although the binary map provides information on occlusion, the improvement is limited since the regions with code 1 are mostly background regions and less informative. BtcDet 3 learnsP(O S ) ? directly. The network predicts the probability for Cartesian voxels. 
One Cartesian voxel will cover multiple spherical voxels when being close to the sensor and will 72 cover a small portion of a spherical voxel when being located at a remote distance. Therefore, the occlusion regions are misrepresented in the Cartesian coordinate. BtcDet 4 convert the probability to hard occupancy, which cannot inform the downstream branch if a region is less likely or more likely to contain object shapes. These experiments demonstrate the effectiveness of our choices for shape features, which help the main model improve 2:86 AP over the baseline BtcDet 1 . Model Variant Integrate Layers of Integrate f geo Proposal bbox 3DAP R11 Final bbox 3DAP R11 BtcDet 1 (base) 77.75 83.71 BtcDet 5 X 77.73 84.50 BtcDet 6 1,2 78.97 85.72 BtcDet 7 1 X 78.54 85.73 BtcDet 8 1,2,3 X 78.76 86.11 BtcDet (main) 1,2 X 78.93 86.57 Table 3.13: Ablation studies on which layers of are fused withP(O S ) ? (Eq. 3.11) and whether to fuseP(O S ) ? intof geo . We evaluate the KITTI’s moderate car objects and show the 3DAP R11 of the proposal and final bounding box. Integration strategies. We conduct ablation studies by choosing different layers of to concate- nate withP(O S ) ? and whether to useP(O S ) ? to formf geo . The former mostly affects proposal generation, while the latter affects proposal refinement. In Table 3.13, the experiment on BtcDet 5 shows that we can improve the final prediction AP by 0:8 if we only integrateP(O S ) ? for proposal refinement. On the other hand, the experiment on BtcDet 6 shows the integration with alone can improve the AP by 1:2 for the proposal box and final bounding box prediction AP by 2:0 over the baseline. The comparisons of BtcDet 7 , BtcDet 8 and BtcDet (main) demonstrate integratingP(O S ) ? with ’s first two layers is the best choice. SinceP(O S ) is a low-level feature while the third layer 73 of would contain high-level features, we observe a regression when BtcDet 8 also concatenates P(O S ) ? with ’s third layer. These experiments demonstrate both the integration with and the integration to formf geo can bring improvement independently. When working together, two integrations finally help BtcDet surpass all the state-of-the-art models. 3.5 Conclusion In this chapter, we investigate unsupervised domain adaptation for LiDAR-based 3D detectors across different geographic locations and weather conditions. We identify that rainy weather can severely deteriorate the point cloud quality and cause drastic performance drops for modern 3D detectors, based on the Waymo Domain Adaptation dataset. To address this issue, we present Semantic Point Generation (SPG), a general approach that improves point cloud detection during unsupervised domain adaptation. Utilizing two strategies “Hide and Predict” and “Semantic Area Generation”, SPG generates semantic points to recover the shape of foreground objects. Besides, SPG introduces a negligible overhead (only adding 6% extra points) and can be conveniently in- tegrated with modern LiDAR-based detectors. SPG achieves significant performance gains on the challenging target domain and consistently benefits detection quality on the source domain. We also analyze the impact of shape miss on 3D object detection, which is attributed to occlu- sion and signal miss in point cloud data. To solve this problem, we propose Behind the Curtain Detector (BtcDet), the first 3D object detector that targets this fundamental challenge. A train- ing method is designed to learn the underlying shape priors. 
BtcDet can faithfully estimate the complete object shape occupancy for regions affected by occlusion and signal miss. After the 74 integration with the probability estimation, both the proposal generation and refinement are sig- nificantly improved. BtcDet surpasses all the published state-of-the-art methods by remarkable margins. 75 Chapter 4 3D Reconstruction: Point-based Implicit Fields for Scene Reconstruction 4.1 3D Representation for Surface Reconstruction The rendering process can be described as follows. Given the 3D geometry and texture map, a camera, and lighting condition, we can obtain a 2D image by following the selected reflectance models. However, people are also interested in the inverse problem, given a single 2D image, can we reconstruct its 3D surfaces? Since the information is lost during 3D to 2D projection, the 2D image doesn’t contain all the information of the original 3D surfaces. Therefore, 3D surface reconstruction is in general, an under-constrained problem. Inspired by human brains, the model should use prior knowledge, for example, the common 3D shape of a sports car. To use a deep learning model to generate 3D shapes and to make the model end-to-end differentiable, we need to find a proper 3D shape representation to parameterize the 3D geometries. Among all differentiable 3D representations, the most commonly used are 2d manifold, volume, point set, deformation code of mesh templates, and signed-distance functions. 76 Figure 4.1: Different 3D Representations for Mesh Reconstruction However, among the aforementioned representations, the 2d manifolds [59] are constrained to a fixed amount of 2D atlas. V olume [205] and point set [49] are also constrained to spatial resolution. The fixed template with deformation codes approaches [177] are limited to mesh tem- plates and fixed topologies. In recent years, the emerged deep implicit functions [128, 115, 31] hold better performance compared to other representations, due to its great expressiveness and simplicity. Therefore, we would like to focus on modeling the implicit function of 3D shapes. 4.2 DISN: Deep Implicit Surface Network Over the recent years, a multitude of single-view 3D reconstruction methods has been proposed where deep learning-based methods have specifically achieved promising results. To represent 3D shapes, many of these methods utilize either voxels [205, 55, 238, 188, 170, 189, 209, 181] or point clouds [49] due to ease of encoding them in a neural network. However, such representations are often limited in terms of resolution. A few recent methods [59, 177, 180] have explored utilizing 77 Rendered Image OccNet DISN Real Image OccNet DISN Figure 4.2: Single-view reconstruction results using OccNet [115] and DISN on synthetic and real images. explicit surface representations in a neural network but make the assumption of fixed topology, limiting the flexibility of the approaches. Moreover, point- and mesh-based methods use Chamfer Distance (CD) and Earth-mover Distance (EMD) as training losses. However, these distances only provide approximated metrics for measuring shape similarity. To address the aforementioned limitations in voxels, point clouds, and meshes, in this paper, we study an alternative implicit 3D surface representation, Signed Distance Functions (SDF). SDFs have recently attracted attention from researchers and a few other works [39, 128, 31, 115] also choose to reconstruct 3D shapes by generating an implicit field. 
However, such methods either generate a binary occupancy grid or consider only the global information. Therefore, while they succeed in recovering overall shape, they fail to recover fine-grained details. After exploring dif- ferent forms of the implicit field and the information that preserves local details, we present an 78 efficient, flexible, and effective Deep Implicit Surface Network (DISN) for predicting SDFs from single-view images (Figure 4.2). An SDF simply encodes the signed distance of each point sample in 3D from the boundary of the underlying shape. Thus, given a set of signed distance values, the shape can be extracted by identifying the iso-surface using methods such as Marching Cubes [111]. As illustrated in Figure 4.5, given a convolutional neural network (CNN) that encodes the input image into a feature vector, DISN predicts the SDF value of a given 3D point using this feature vector. By sampling different 3D point locations, DISN can generate an implicit field of the underlying surface with infinite resolution. Moreover, without the need for a fixed topology assumption, the regressing target for DISN is an accurate ground truth instead of an approximated metric. While many single-view 3D reconstruction methods [205, 49, 31, 115] that learn a shape em- bedding from a 2D image are able to capture the global shape properties, they have a tendency to ignore details such as holes or thin structures. Such fine-grained details only occupy a small portion of 3D space and thus sacrificing them does not incur a high loss compared to ground truth shape. However, such results can be visually unsatisfactory. To address this problem, we introduce a local feature extraction module. Given an image of an object, our goal is to reconstruct a 3D shape that captures both the overall structure and fine-grained details of the object. We consider modeling a 3D shape as a signed distance function (SDF). As illustrated in Figure 4.3, SDF is a continuous function that maps a given spatial point p = (x;y;z)2R 3 to a real values2R: s = SDF (p): Instead of more common 3D representations such as depth [230], the absolute value of s indicates the distance of the point to the surface, while the sign of s represents if the point is inside or outside the surface. An iso-surfaceS 0 = fpjSDF (p) = 0g implicitly represents the underlying 3D shape. 79 S<0 S>0 (a) (b) Figure 4.3: Illustration of SDF. (a) Rendered 3D surface withs = 0. (b) Cross-section of the SDF. A point is outside the surface ifs> 0, inside ifs< 0, and on the surface ifs = 0. In this paper, we use a feed-forward deep neural network, Deep Implicit Surface Network (DISN), to predict the SDF from an input image. DISN takes a single image as input and predicts the SDF value for any given point. Unlike the 3D CNN methods [39] which generate a volumetric grid with a fixed resolution, DISN produces a continuous field with arbitrary resolution. Moreover, we introduce a local feature extraction method to improve the recovery of shape details. 4.2.1 Pixel-aligned Feature Extraction The overview of our method is illustrated in Figure 4.5. Given an image, DISN consists of two parts: camera pose estimation and SDF prediction. DISN first estimates the camera parameters that map an object in world coordinates to the image plane. Given the predicted camera parameters, we project each 3D query point onto the image plane and collect multi-scale CNN features for the corresponding image patch. 
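Before describing the prediction network, the sign convention of Figure 4.3 can be illustrated with a toy analytic SDF; the sphere and query points below are purely illustrative.

```python
import numpy as np

def sdf_sphere(p, center=np.zeros(3), radius=0.5):
    """Analytic SDF of a sphere: negative inside, zero on the surface,
    positive outside, matching the convention in Figure 4.3."""
    return np.linalg.norm(p - center, axis=-1) - radius

pts = np.array([[0.0, 0.0, 0.0],   # inside the surface  -> -0.5
                [0.5, 0.0, 0.0],   # on the surface      ->  0.0
                [1.0, 0.0, 0.0]])  # outside the surface -> +0.5
print(sdf_sphere(pts))
```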
DISN then decodes the given spatial point to an SDF value using both the multi-scale local image features and the global image features.

4.2.2 Camera Pose Estimation

Given an input image, our first goal is to estimate the corresponding viewpoint. We train our network on the ShapeNet Core dataset [21], where all the models are aligned. Therefore, we use this aligned model space as the world space that our camera parameters are defined with respect to, and we assume a fixed set of intrinsic parameters.

Figure 4.4: Local feature extraction. Given a 3D point p, we use the estimated camera parameters to project p onto the image plane. Then we identify the projected location on each feature map layer of the encoder. We concatenate the features at each layer to get the local features of point p.

Regressing camera parameters from an input image directly using a CNN often fails to produce accurate poses, as discussed in [72]. To overcome this issue, Insafutdinov and Dosovitskiy [72] introduce a distilled ensemble approach that regresses the camera pose by combining several pose candidates. However, this method requires a large number of network parameters and a complex training procedure. We present a more efficient and effective network, illustrated in Figure 4.6. In recent work, Zhou et al. [234] show that a 6D rotation representation is continuous and easier for a neural network to regress compared to more commonly used representations such as quaternions and Euler angles. Thus, we employ the 6D rotation representation $b = (b_x, b_y)$, where $b \in \mathbb{R}^6$, $b_x \in \mathbb{R}^3$, $b_y \in \mathbb{R}^3$. Given $b$, the rotation matrix $R = (R_x, R_y, R_z)^T \in \mathbb{R}^{3 \times 3}$ is obtained by

$R_x = N(b_x), \quad R_z = N(R_x \times b_y), \quad R_y = R_z \times R_x, \quad (4.1)$

where $R_x, R_y, R_z \in \mathbb{R}^3$, $N(\cdot)$ is the normalization function, and $\times$ indicates the cross product. The translation $t \in \mathbb{R}^3$ from world space to camera space is directly predicted by the network.

Figure 4.5: Given an image and a point p, we estimate the camera pose and project p onto the image plane. DISN uses the local features at the projected location, the global features, and the point features to predict the SDF of p. ‘MLPs’ denotes multi-layer perceptrons.

Instead of calculating losses on camera parameters directly as in [72], we use the predicted camera pose to transform a given point cloud from the world space to the camera coordinate space. We compute the loss $L_{cam}$ as the mean squared error between the transformed point cloud and the ground truth point cloud in camera space:

$L_{cam} = \frac{\sum_{p_w \in PC_w} \left\| p_G - (R p_w + t) \right\|_2^2}{\sum_{p_w \in PC_w} 1}, \quad (4.2)$

where $PC_w \in \mathbb{R}^{N \times 3}$ is the point cloud in the world space and $N$ is the number of points in $PC_w$. For each $p_w \in PC_w$, $p_G$ represents the corresponding ground truth point location in the camera space, and $\|\cdot\|_2^2$ is the squared $L_2$ distance.

Figure 4.6: Camera Pose Estimation Network. ‘PC’ denotes point cloud. ‘GT Cam’ and ‘Pred Cam’ denote the ground truth and predicted cameras.

4.2.3 Signed Distance Function Prediction

Given an image $I$, we denote the ground truth SDF by $SDF_I(\cdot)$, and the goal of our network $f(\cdot)$ is to estimate $SDF_I(\cdot)$. Unlike the commonly used CD and EMD losses in previous reconstruction methods [49, 59], our guidance is a true ground truth instead of approximated metrics.
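Returning briefly to the camera-pose branch of Section 4.2.2, the sketch below implements the 6D-rotation recovery of Eq. 4.1 and the point-cloud alignment loss of Eq. 4.2 in PyTorch; batching and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def rotation_from_6d(b):
    """Recover a rotation matrix from the 6D representation b = (b_x, b_y) (Eq. 4.1)."""
    b_x, b_y = b[..., :3], b[..., 3:]
    r_x = F.normalize(b_x, dim=-1)
    r_z = F.normalize(torch.cross(r_x, b_y, dim=-1), dim=-1)
    r_y = torch.cross(r_z, r_x, dim=-1)
    return torch.stack([r_x, r_y, r_z], dim=-2)   # rows R_x, R_y, R_z

def camera_loss(pc_world, pc_cam_gt, b, t):
    """Mean squared error between the transformed and ground-truth point clouds (Eq. 4.2)."""
    R = rotation_from_6d(b)
    pc_pred = pc_world @ R.transpose(-1, -2) + t   # apply R p_w + t to row-vector points
    return ((pc_pred - pc_cam_gt) ** 2).sum(-1).mean()
```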
Park et al [128] recently propose DeepSDF, a direct approach to regress SDF with a neural net- work. DeepSDF concatenates the location of a query 3D point and the shape embedding extracted from a depth image or a point cloud and uses an auto-decoder to obtain the corresponding SDF value. The auto-decoder structure requires optimizing the shape embedding for each object. In our initial experiments, when we applied a similar network architecture in a feed-forward manner, we observed convergence issues. Alternatively, Chen and Zhang [31] propose to concatenate the global features of an input image and the location of a query point to every layer of a decoder. While this approach works better in practice, it also results in a significant increase in the number 83 of network parameters. Our solution is to use a multi-layer perceptron to map the given point lo- cation to a higher-dimensional feature space. This high dimensional feature is then concatenated with global and local image features respectively and used to regress the SDF value. Local Feature Extraction As shown in Figure 4.7(a), our initial experiments showed that it is hard to capture shape details such as holes and thin structures when only global image features are used. Thus, we introduce a local feature extraction method to focus on reconstructing fine- grained details, such as the back poles of a chair (Figure 4.7). As illustrated in Figure 4.4, a 3D pointp2R 3 is projected to a 2D locationq2R 2 on the image plane with the estimated camera parameters. We retrieve features on each feature map corresponding to locationq and concatenate them to get the local image features. Since the feature maps in the later layers are smaller in dimension than the original image, we resize them to the original size with bilinear interpolation and extract the resized features at locationq. Two decoders then take the global and local image features respectively as input with the point features and make an SDF prediction. The final SDF is the sum of these two predictions. Figure 4.7 compares the results of our approach with and without local feature extraction. With only global features, the network is able to predict the overall shape but fails to produce details. Local feature extraction helps to recover these missing details by predicting the residual SDF. Loss Functions We regress continuous SDF values instead of formulating a binary classification problem (e.g., inside or outside of a shape) as in [31]. This strategy enables us to extract surfaces that correspond to different iso-values. To ensure that the network concentrates on recovering the 84 Input (a) (b) Figure 4.7: Shape reconstruction results (a) without and (b) with local feature extraction. details near and inside the iso-surfaceS 0 , we propose a weighted loss function. Our loss is defined by L SDF = X p mjf(I;p)SDF I (p)j; m = 8 > > > < > > > : m 1 ; ifSDF I (p)<; m 2 ; otherwise; (4.3) wherejj is theL 1 -norm.m 1 , andm 2 are different weights, and for points whose signed distance is below a certain threshold, we use a higher weight ofm 1 . 85 4.2.4 Surface Reconstruction To generate a mesh surface, we first define a dense 3D grid and predict SDF values for each grid point. Once we compute the SDF values for each point in the dense grid, we use Marching Cubes [111] to obtain the 3D mesh that corresponds to the iso-surfaceS 0 . Input 3DN AtlasNet Pix2Mesh 3DCNN IMNET OccNet Ours cam Ours GT Figure 4.8: Single-view reconstruction results of various methods. 
‘GT’ denotes ground truth shapes. Best viewed on screen with zooming in. 86 4.2.5 Evaluation We perform quantitative and qualitative comparisons on single-view 3D reconstruction with state- of-the-art methods [59, 177, 180, 31, 115] in Section 4.2.5. We also compare the performance of our method on camera pose estimation with [72] in Section 4.2.5. We further conduct ablation studies in Section 4.2.5 and showcase several applications in Section 4.2.6. Dataset For both camera prediction and SDF prediction, we follow the settings of [59, 177, 180, 115], and use the ShapeNet Core dataset [21], which includes 13 object categories, and an official training/testing split to train and test our method. We train a single network on all categories and report the test results generated by this network. Choy et al. [35] provide a dataset of renderings of ShapeNet Core models where each model is rendered from 24 views with limited variation in terms of camera orientation. In order to make our method more general, we provide a new 2D dataset * composed of renderings of the models in ShapeNet Core. Specifically, for each mesh model, our dataset provides 36 renderings with smaller variations (similar to [35]’s) and 36 views with a larger variation(bigger yaw angle range and larger distance variation). Unlike Choy et al., we allow the object to move away from the origin, therefore, providing more degrees of freedom in terms of camera parameters. We ignore the "Roll" angle of the camera since it is very rare in a real-world scenario. We also render higher- resolution images (224 by 224 instead of the original 137 by 137). Finally, to facilitate future studies, we also pair each rendered RGBA image with a depth image, a normal map, and an albedo image as shown in Figure 4.9. * https://github.com/Xharlie/ShapenetRender_more_variation 87 RGBA Albedo Depth Normal Figure 4.9: Each view of each object has four representations correspondingly Data Preparation and Implementation Details For each 3D mesh in ShapeNet Core, we first generate an SDF grid with resolution 256 3 using [197, 157]. Models in ShapeNet Core are aligned and we choose this aligned model space as our world space where each renders view in [35] represents a transformation to a different camera space. We train our camera pose estimation network and SDF prediction network separately. For both networks, we use VGG-16 [156] as the image encoder. When training the SDF prediction net- work, we extract the local features using the ground truth camera parameters. As mentioned in Section ??, DISN is able to generate a signed distance field with an arbitrary resolution by contin- uously sampling points and regressing their SDF values. However, in practice, we are interested in points near the iso-surfaceS 0 . Therefore, we use Monte Carlo sampling to choose 2048 grid points under Gaussian distributionN (0; 0:1) during training. We choosem 1 = 4, m 2 = 1, and = 0:01 as the parameters of Equation 4.3. Our network is implemented with TensorFlow. We use the Adam optimizer with a learning rate of 1 10 4 and a batch size of 16. For testing, we first use the camera pose prediction network to estimate the camera parameters for the input image and feed the estimated parameters as input to SDF prediction. We follow the aforementioned surface reconstruction procedure (Section 4.2.4) to generate the output mesh. 
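Concretely, the surface extraction procedure of Section 4.2.4 can be sketched as follows: evaluate the predicted SDF on a dense grid and extract the zero iso-surface with Marching Cubes. The scikit-image call, the grid resolution, and the bounding box are assumptions of this sketch rather than the exact pipeline used in the experiments.

```python
import numpy as np
from skimage import measure

def extract_mesh(sdf_fn, resolution=64, bound=1.0):
    """Evaluate an SDF on a dense grid and run Marching Cubes on the zero level set.

    sdf_fn: callable mapping (N, 3) points to (N,) SDF values, e.g. DISN
    conditioned on an input image.
    """
    axis = np.linspace(-bound, bound, resolution)
    grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
    sdf = sdf_fn(grid.reshape(-1, 3)).reshape(resolution, resolution, resolution)
    verts, faces, normals, _ = measure.marching_cubes(sdf, level=0.0)
    # Map voxel indices back to world coordinates.
    verts = verts / (resolution - 1) * 2.0 * bound - bound
    return verts, faces, normals
```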
= 88 Evaluation Metrics For quantitative evaluations, we apply four commonly used metrics to com- pute the difference between a reconstructed mesh object and its ground truth mesh: (1) Chamfer Distance (CD), (2) Earth Mover’s Distance (EMD) between uniformly sampled point clouds, (3) Intersection over Union (IoU) on voxelized meshes, and (4) F-Score [164]. Single-view Reconstruction Comparison With State-of-the-art Methods In this section, we compare our approach to single-view reconstruction with state-of-the-art methods: AtlasNet [59], Pixel2Mesh [177], 3DN [180], OccNet [115] and IMNET [31]. AtlasNet [59] and Pixel2Mesh [177] generate a fixed-topology mesh from a 2D image. 3DN [180] deforms a given source mesh to re- construct the target model. When comparing to this method, we choose a source mesh from a given set of templates by querying a template embedding as proposed in the original work. IMNET [31] and OccNet [115] both predict the sign of SDF to reconstruct 3D shapes. Since IMNET trains an individual model for each category, we implement their model following the original paper and train a single model on all 13 categories. Due to the mismatch between the scales of shapes recon- structed by our method and OccNet, we only report their IoU, which is scale-invariant. In addition, we train a 3D CNN model, denoted by ‘3DCNN’, where the encoder is the same as DISN and the decoder is a volumetric 3D CNN structure with an output dimension of 64 3 . The ground truth for 3DCNN is the SDF values on all 64 3 grid locations. For both IMNET and 3DCNN, we use the same surface reconstruction method as ours to output reconstructed meshes. We also report the results of DISN using estimated camera poses and ground truth poses, denoted by ‘Ours cam ’ and ‘Ours’ respectively. AtlasNet, Pixel2Mesh, and 3DN use explicit surface generation, while 3DCNN, IMNET, OccNet, and our methods reconstruct implicit surfaces. 89 As shown in Table 4.1, DISN outperforms all other models in EMD and IoU. Only 3DN per- forms better than our model on CD, however, 3DN requires more information than ours in the form of a source mesh as input. Figure 4.8 shows qualitative results. As illustrated in both quantitative and qualitative results, implicit surface representation provides a flexible method of generating topology-variant 3D meshes. Comparisons to 3D CNN show that predicting SDF values for given points produces smoother surfaces than generating a fixed 3D volume using an image embedding. We speculate that this is due to SDF being a continuous function with respect to point locations. It is harder for a deep network to approximate an overall SDF volume with global image features only. Moreover, our method outperforms IMNET and OccNet in terms of recovering shape de- tails. For example, in Figure 4.8, local feature extraction enables our method to generate different patterns of the chair backs in the first three rows, while other methods fail to capture such details. We further validate the effectiveness of our local feature extraction module in Section 4.2.5. Al- though using ground truth camera poses (i.e., ’Ours’) outperforms using predicted camera poses (i.e., ’Ours cam ’) in quantitative results, respective qualitative results demonstrate no significant dif- ference. Among the methods using SDF, training on regression of the SDF value rather than the classification of its sign can bring advantages in error toleration and smoothness to the surface pre- diction. Therefore, even the 3DCNN model can outperform IMNET. 
Inference with each location independently brings our model the flexibility to choose more locations close to the surface. On the other hand, 3DCNN has to treat less important points (points that are far away from the surface) and points near the surface equally. Figure 4.8 shows the qualitative results. We also compute the F-score (see Table 4.2) which measures the percentage of surface area that is reconstructed correctly and thus provides a reliable metric [164]. In our evaluations, we use F 1 = 2 (Precision Recall)=(Precision + Recall). We uniformly sample points from both ground 90 plane bench box car chair display lamp speaker rifle sofa table phone boat Mean EMD AtlasNet 3.39 3.22 3.36 3.72 3.86 3.12 5.29 3.75 3.35 3.14 3.98 3.19 4.39 3.67 Pxl2mesh 2.98 2.58 3.44 3.43 3.52 2.92 5.15 3.56 3.04 2.70 3.52 2.66 3.94 3.34 3DN 3.30 2.98 3.21 3.28 4.45 3.91 3.99 4.47 2.78 3.31 3.94 2.70 3.92 3.56 IMNET 2.90 2.80 3.14 2.73 3.01 2.81 5.85 3.80 2.65 2.71 3.39 2.14 2.75 3.13 3D CNN 3.36 2.90 3.06 2.52 3.01 2.85 4.73 3.35 2.71 2.60 3.09 2.10 2.67 3.00 Ours cam 2.67 2.48 3.04 2.67 2.67 2.73 4.38 3.47 2.30 2.62 3.11 2.06 2.77 2.84 Ours 2.45 2.41 2.99 2.52 2.62 2.63 4.11 3.37 1.93 2.55 3.07 2.00 2.55 2.71 CD AtlasNet 5.98 6.98 13.76 17.04 13.21 7.18 38.21 15.96 4.59 8.29 18.08 6.35 15.85 13.19 Pxl2mesh 6.10 6.20 12.11 13.45 11.13 6.39 31.41 14.52 4.51 6.54 15.61 6.04 12.66 11.28 3DN 6.75 7.96 8.34 7.09 17.53 8.35 12.79 17.28 3.26 8.27 14.05 5.18 10.20 9.77 IMNET 12.65 15.10 11.39 8.86 11.27 13.77 63.84 21.83 8.73 10.30 17.82 7.06 13.25 16.61 3D CNN 10.47 10.94 10.40 5.26 11.15 11.78 35.97 17.97 6.80 9.76 13.35 6.30 9.80 12.30 Ours cam 9.96 8.98 10.19 5.39 7.71 10.23 25.76 17.90 5.58 9.16 13.59 6.40 11.91 10.98 Ours 9.01 8.32 9.98 4.92 7.54 9.58 22.73 16.70 4.36 8.71 13.29 6.21 10.87 10.17 IoU AtlasNet 39.2 34.2 20.7 22.0 25.7 36.4 21.3 23.2 45.3 27.9 23.3 42.5 28.1 30.0 Pxl2mesh 51.5 40.7 43.4 50.1 40.2 55.9 29.1 52.3 50.9 60.0 31.2 69.4 40.1 47.3 3DN 54.3 39.8 49.4 59.4 34.4 47.2 35.4 45.3 57.6 60.7 31.3 71.4 46.4 48.7 IMNET 55.4 49.5 51.5 74.5 52.2 56.2 29.6 52.6 52.3 64.1 45.0 70.9 56.6 54.6 3D CNN 50.6 44.3 52.3 76.9 52.6 51.5 36.2 58.0 50.5 67.2 50.3 70.9 57.4 55.3 OccNet 54.7 45.2 73.2 73.1 50.2 47.9 37.0 65.3 45.8 67.1 50.6 70.9 52.1 56.4 DISN cam 57.5 52.9 52.3 74.3 54.3 56.4 34.7 54.9 59.2 65.9 47.9 72.9 55.9 57.0 DISN 61.7 54.2 53.1 77.0 54.9 57.7 39.7 55.9 68.0 67.1 48.9 73.6 60.2 59.4 Table 4.1: Quantitative results on ShapeNet Core for various methods. Metrics are CD (0:001, the smaller the better), EMD (100, the smaller the better) and IoU (%, the larger the better). CD and EMD are computed on 2048 points. truth and generated meshes. We define precision as the percentage of the generated points whose distance to the closest ground truth point is less than a threshold. Similarly, we define recall as the percentage of ground truth points whose distance to the closest generated point is less than a threshold. Camera Pose Estimation We compare our camera pose estimation with [72]. Given a point cloudPC w in world coordinates for an input image, we transformPC w using the predicted camera pose and compute the mean distanced 3D between the transformed point cloud and the ground truth point cloud in camera space. 
We also compute the 2D reprojection errord 2D of the transformed 91 Threshold(%) 0.5% 1% 2% 5% 10% 20% 3DCNN 0.064 0.295 0.691 0.935 0.984 0.997 IMNet 0.063 0.286 0.673 0.922 0.977 0.995 DISN 0.079 0.327 0.718 0.943 0.984 0.996 DISN cam 0.070 0.307 0.700 0.940 0.986 0.998 Table 4.2: F-Score for varying thresholds (% of reconstruc- tion volume side length, same as [164]) on all categories. [72] Ours Ours new d 3D 0.073 0.047 0.059 d 2D 4.86 2.95 4.38/2.67 Table 4.3: Camera pose estimation comparison. The unit ofd 2D is pixels. point cloud after we project it onto the input image. Table 4.3 reportsd 3D andd 2D of [72] and our method. With the help of the 6D rotation representation, our method outperforms [72] by 2 pixels in terms of 2D reprojection error. We also train and test the pose estimation on the new 2D dataset. Even though these images possess more view variation, because of the better rendering quality, we can achieve an average 2D distance of 4.38 pixels on 224 by 224 images (2.67 pixels if normalized to the original resolution of 137 by 137). Ablation Studies To show the impact of the camera pose estimation, local feature extraction, and different network architectures, we conduct ablation studies on the ShapeNet “chair” category, since it has the greatest variety. Table 4.4 reports the quantitative results and Figure 4.10 shows the qualitative results. Camera Pose Estimation As is shown in Section 4.2.5, camera pose estimation potentially in- troduces uncertainty to the local feature extraction process with an average reprojection error of 2:95 pixels. Although the quantitative reconstruction results with ground truth camera parameters are constantly superior to the results with estimated parameters in Table 4.4, Figure 4.10 demon- strates that a small difference in the image projection does not affect the reconstruction quality significantly. 92 Input Binary cam Binary Global One- stream cam One- stream Two- stream cam Two- stream GT Figure 4.10: Qualitative results of our method using different settings. ‘GT’ denotes ground truth shapes, and ‘ cam ’ denotes models with estimated camera parameters. Binary Classification Previous studies [115, 31] formulate SDF prediction as a binary classifi- cation problem by predicting the probability of a point is inside or outside the surfaceS 0 . Even though Section 4.2.5 illustrates our superior performance over [115, 31], we further validate the effectiveness of our regression supervision by comparing it with classification supervision using our own network structure. Instead of producing an SDF value, we train our network with classi- fication supervision and output the probability of a point being inside the mesh surface. We use a softmax cross entropy loss to optimize this network. We report the result of this classification network as ‘Binary’. Local Feature Extraction Local image features of each point provide access to the correspond- ing local information that captures shape details. To validate the effectiveness of this information, we remove the ‘local features extraction’ module from DISN and denote this setting by ‘Global’. This model predicts the SDF value solely based on the global image features. By comparing 93 Figure 4.11: Shape interpolation result. ‘Global’ with other methods in Table 4.4 and Figure 4.10, we conclude that local feature extrac- tion helps the model capture shape details and improve the reconstruction quality by a large margin. 
Network Structures To further assess the impact of different network architectures, in addition to our original architecture with two decoders (which we call 'Two-stream'), we also introduce a 'One-stream' architecture where the global features, the local features, and the point features are concatenated and fed into a single decoder that predicts the SDF value. As illustrated in Table 4.4 and Figure 4.10, the original Two-stream setting is slightly superior to One-stream, which shows that DISN is robust to different network architectures.

Setting       EMD (GT / est.)   CD (GT / est.)   IoU (GT / est.)
Binary        2.88 / 2.99       8.27 / 8.80      54.9 / 53.5
Global        2.75 / n/a        7.64 / n/a       54.8 / n/a
One-stream    2.71 / 2.74       7.86 / 8.30      53.6 / 53.5
Two-stream    2.62 / 2.65       7.55 / 7.63      55.3 / 53.9

Table 4.4: Quantitative results on the category "chair" with ground-truth (GT) or estimated (est.) camera poses. CD (×0.001), EMD (×100), and IoU (%).

Figure 4.12: Test our model on online product images.

4.2.6 Applications

Shape interpolation Figure 4.11 shows shape interpolation results where we interpolate both global and local image features going from the leftmost sample to the rightmost. We see that the generated shape is gradually transformed.

Test with online product images Figure 4.12 illustrates 3D reconstruction results by DISN on online product images. Note that our model is trained on rendered images; this experiment validates the domain transferability of DISN.

Multi-view reconstruction Our model can also take multiple 2D views of the same object as input. After extracting the global and the local image features for each view, we apply max pooling and use the resulting features as input to each decoder. We retrained our network for 3 input views and visualize some results in Figure 4.13. Combining multi-view features helps DISN to further capture shape details.

Figure 4.13: Multi-view reconstruction results. (a) Single-view input. (b) Reconstruction results from (a). (c)&(d) Two other views. (e) Multi-view reconstruction result from (a), (c), and (d).

4.3 Neural Radiance Fields for Scene Reconstruction

So far, we have only discussed 3D geometry reconstruction from 2D images, but rendering a 3D scene requires many more components than geometry alone, for instance, the lighting conditions, the textures, and the materials' reflectance properties.

Figure 4.14: Full Set of 3D Scene Elements Reconstructed From RGB Images

The inverse problem is even more complicated: not only do we have to estimate the geometry, but also all the other components, so that we can reconstruct the original 3D scene. Several studies focus on individual components, but reconstructing them all together remains very challenging.

Figure 4.15: Direct Volume Rendering

There is, however, an alternative rendering model, direct volume rendering, which models every component mentioned above jointly as a radiance field. Suppose the 3D space is filled with particles that emit view-dependent light and, at the same time, block the light emitted by other particles. When we shoot a ray from a camera, we can aggregate the radiance by marching along the ray and integrating it to obtain the RGB color of the corresponding pixel.
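Since the next paragraphs formalize this ray integration in its discretized form (Eq. 4.4), a small NumPy sketch of the accumulation may help fix ideas; it is an illustrative implementation of standard volume rendering, not code from this thesis.

    import numpy as np

    def composite_ray(sigmas, radiances, deltas):
        """Accumulate radiance along one ray (discrete volume rendering).
        sigmas: (M,) volume densities; radiances: (M, 3) RGB; deltas: (M,) sample spacings."""
        alphas = 1.0 - np.exp(-sigmas * deltas)            # opacity of each ray segment
        # transmittance: fraction of light surviving up to (but not including) sample j
        trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
        weights = trans * alphas                           # per-sample contribution
        return (weights[:, None] * radiances).sum(axis=0)  # final RGB color of the pixel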
After discretizing the integration, we can use a network to compute the color and density along the ray and eventually model the entire radiance field.

Figure 4.16: Discretized Direct Volume Rendering

Specifically, a pixel's radiance can be computed by marching a ray through the pixel, sampling M shading points at \{x_j \mid j = 1, \ldots, M\} along the ray, and accumulating radiance using volume density, as:

c = \sum_{j=1}^{M} \tau_j \left(1 - \exp(-\sigma_j \Delta_j)\right) r_j, \qquad \tau_j = \exp\Big(-\sum_{t=1}^{j-1} \sigma_t \Delta_t\Big).   (4.4)

Here, \tau_j represents volume transmittance; \sigma_j and r_j are the volume density and radiance for each shading point j at x_j, and \Delta_t is the distance between adjacent shading samples. A radiance field represents the volume density \sigma and view-dependent radiance r at any 3D location. NeRF [120] proposes to use a multi-layer perceptron (MLP) to regress such radiance fields; it is one of the most popular deep learning methods for modeling radiance fields and achieves outstanding novel view synthesis results.

Figure 4.17: NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis

The original NeRF model optimizes a scene from hundreds of multi-view images. Along each camera ray, NeRF samples hundreds of shading points, even though most of them lie in empty regions. During shading, NeRF encodes the entire radiance field as an MLP, which takes the shading location and ray direction and outputs the RGB color and volume density. Since NeRF is unaware of scene geometry, its shading sampling is inefficient, and using a single network to encode the whole scene also makes it hard to scale and to converge. To summarize its weaknesses: because NeRF has no awareness of shape priors, it has to probe and encode the empty space as well, which makes it inefficient to render and very slow to converge; in addition, a single network limits the model's scalability and makes it hard to generalize. In the following sections, we propose a new method aiming to solve these problems.

4.4 Point-NeRF: Point-based Neural Radiance Fields

Modeling real scenes from image data and rendering photo-realistic novel views is a central problem in computer vision and graphics. NeRF [120] and its extensions [104, 112, 225] have shown great success in this by modeling neural radiance fields. These methods [120, 225, 129] often reconstruct radiance fields using global MLPs for the entire space through ray marching. This leads to long reconstruction times due to the slow per-scene network fitting and the unnecessary sampling of vast empty space.

Figure 4.18: Point-NeRF uses neural 3D points to efficiently represent and render a continuous radiance volume. The point-based radiance field can be predicted via network forward inference from multi-view images. It can then be optimized per scene to achieve reconstruction quality that surpasses NeRF [120] in tens of minutes. Point-NeRF can also leverage off-the-shelf reconstruction methods like COLMAP [145] and is able to perform point pruning and growing that automatically fix the holes and outliers that are common in these approaches.

We address this issue using Point-NeRF, a novel point-based radiance field representation that uses 3D neural points to model a continuous volumetric radiance field.
Unlike NeRF which purely depends on per-scene fitting, Point-NeRF can be effectively initialized via the inference of a deep neural network, pre-trained across scenes, leading to efficient radiance field reconstruc- tion. Moreover, Point-NeRF avoids ray sampling in the empty scene space by leveraging classical point clouds that approximate the actual scene geometry. This advantage of Point-NeRF leads to more efficient reconstruction and more accurate rendering than other neural radiance field mod- els [120, 23, 179, 221]. Our Point-NeRF representation consists of a point cloud with per-point neural features: each neural point encodes the local 3D scene geometry and appearance around it. Prior point-based neural rendering techniques [4] use similar neural point clouds but perform rendering with rasteri- zation and 2D CNNs operating in image space. We instead treat these neural points as local neural basis functions in 3D to model a continuous volumetric radiance field which enables high-quality neural rendering using differentiable ray marching. In particular, for any 3D location, we propose to use an MLP network to aggregate the neural points in its neighborhood to regress the volume 100 density and view-dependent radiance at that location. This expresses a continuous radiance field, enabling high-quality neural rendering via volume rendering We present a learning-based framework to efficiently generate and optimize the point-based radiance fields. To generate an initial field, we leverage deep multi-view stereo (MVS) techniques [214], i.e., applying a cost-volume-based network to predict depth which is then unprojected to 3D space. In addition, a deep CNN is trained to extract 2D feature maps from input images, naturally providing the per-point features for the unprojected points. These neural points from multiple views are combined as a neural point cloud, which forms a point-based radiance field of the scene. We train this point generation module with the point-based volume rendering networks from end to end, to render novel view images and supervise them with the ground truth. This leads to a generalizable model that can directly predict a point-based radiance field at inference time. Once predicted, the initial point-based field is further optimized per scene in a short period to achieve photo-realistic rendering. As shown in Fig. 4.18 (left), 40 minutes of optimization with Point-NeRF outperforms a NeRF model trained for two days. Besides using the in-built point cloud reconstruction, our approach is generic and can also generate a radiance field based on a point cloud of other reconstruction techniques. However, the reconstructed point cloud produced by techniques like COLMAP [145], in practice, contains holes and outliers that adversely affect the final rendering. To address this issue, we introduce point growing and pruning as part of our optimization process. We leverage the geometric reason- ing during volume rendering [44] and grow points near the point cloud boundary in high-density regions and prune points in low-density regions. The mechanism effectively improves our final reconstruction and rendering quality. We show an example in Fig. 4.18 (right) where we convert 101 COLMAP points to a radiance field and successfully fill large holes and produce photo-realistic renderings. We train our model on the DTU dataset [74] and evaluate it on DTU testing scenes and NeRF synthetic scenes. 
The results demonstrate that our approach can achieve state-of-the-art novel view synthesis, outperforming many prior arts including point-based methods [4], NeRF, NSVF [104], and many other generalizable neural methods [221, 179, 23] (see (Tab. 4.5 and 4.6)). 4.4.1 Related Work Scene representations. Traditional and neural methods have studied many 3D scene represen- tations, including volumes [147, 84, 75, 191, 137], point clouds [136, 2, 175], meshes [76, 177], depth maps [103, 69], and implicit functions [31, 117, 125, 215], in diverse vision and graphics ap- plications. Recently, various neural scene representations have been presented [232, 158, 110, 10], advancing the state of the art in novel view synthesis and realistic rendering, with volumetric neu- ral radiance fields (NeRFs) [120] producing high fidelity results. NeRFs are often reconstructed as global MLPs [120, 225, 129] that encode the entire scene space; this can be inefficient and ex- pensive when reconstructing complex and large-scale scenes. Instead, Point-NeRF is a localized neural representation, combining volumetric radiance fields with point clouds that are classically used to approximate scene geometry. We distribute fine-grained neural points to model complex local scene geometry and appearance, leading to better rendering quality than NeRF. V oxel grids with per-voxel neural features [104, 23, 65] are also a local neural radiance rep- resentation. However, our point-based representation adapts better to actual surfaces, leading to 102 better quality. Also, we directly predict good initial neural point features, bypassing the per-scene optimization that is required by most voxel-based methods [104, 65]. Multi-view reconstruction and rendering. Multi-view 3D reconstruction has been extensively studied and addressed with a number of structure-from-motion [144, 172, 162] and multi-view stereo techniques [51, 84, 145, 214, 32]. Point clouds are often the direct output from MVS or depth sensor, though they are usually converted to meshes [111, 77] for rendering and visualization. Meshing can introduce errors and may require image-based rendering [40, 14, 231] for high-quality rendering. We instead directly use point clouds from deep MVS to achieve realistic rendering. Point clouds have been widely used in rendering, often via rasterization-based point splatting, and even differentiable rasterization modules [186, 89]. However, reconstructed point clouds often have holes and outliers that lead to artifacts in rendering. Point-based neural rendering methods address this by splatting neural features and using 2D CNNs to render them [4, 83, 118]. In contrast, our point-based approach utilizes 3D volume rendering, leading to significantly better results than previous point-based methods. Neural radiance fields. NeRFs [120] have demonstrated remarkably high-quality results for novel view synthesis. They have been extended to achieve dynamic scene capture [98, 130], re- lighting [9, 11], appearance editing [193], fast rendering [65, 220], and generative models [19, 146, 124]. However, most methods [98, 130, 193, 9] still follow the original NeRF framework and train per-scene MLPs to represent radiance fields. We make use of neural points with spatially varying neural features in a scene to encode its radiance field. This localized representation can model more complex scene content than pure MLPs that have limited network capacity. 
More importantly, we show that our point-based neural field can be efficiently initialized via a pre-trained deep neural network that generalizes across scenes and leads to highly efficient radiance field reconstruction.

Prior works also present generalizable radiance field-based methods. PixelNeRF [221] and IBRNet [179] aggregate multi-view 2D image features at every sampled ray point to regress volume rendering properties for radiance field rendering. In contrast, we leverage features in 3D neural points around the scene surface to model radiance fields. This avoids sampling points in the vast empty space and leads to higher rendering quality and faster radiance field reconstruction than PixelNeRF and IBRNet. MVSNeRF [23] can achieve very fast voxel-based radiance field reconstruction. However, its prediction network requires a fixed number of three small-baseline images as input and thus can only efficiently reconstruct local radiance fields. Our approach can fuse neural points from an arbitrary number of views and achieve the fast reconstruction of complete 360° radiance fields, which MVSNeRF cannot support.

4.4.2 Point-NeRF Representation

We denote a neural point cloud by P = \{(p_i, f_i, \gamma_i) \mid i = 1, \ldots, N\}, where each point i is located at p_i and associated with a neural feature vector f_i that encodes the local scene content. We also assign each point a scale confidence value \gamma_i \in [0, 1] that represents how likely that point is to be located near an actual scene surface. We regress the radiance field from this point cloud.

Given any 3D location x, we query K neighboring neural points around it within a certain radius R. Our point-based radiance field can be abstracted as a neural module that regresses volume density \sigma and view-dependent radiance r (along any viewing direction d) at any shading location x from its neighboring neural points as:

(\sigma, r) = \text{Point-NeRF}(x, d, p_1, f_1, \gamma_1, \ldots, p_K, f_K, \gamma_K).   (4.5)

We use a PointNet-like [136] neural network, with multiple sub-MLPs, to do this regression. Overall, we first conduct neural processing for each neural point and then aggregate the multi-point information to obtain the final estimates.

Per-point processing. We use an MLP F to process each neighboring neural point and predict a new feature vector for the shading location x by:

f_{i,x} = F(f_i, x - p_i).   (4.6)

Essentially, the original feature f_i encodes the local 3D scene content around p_i. This MLP network expresses a local 3D function that outputs the specific neural scene description f_{i,x} at x, modeled by the neural point in its local frame. The usage of the relative position x - p_i makes the network invariant to point translation for better generalization.

View-dependent radiance regression. We use standard inverse distance weighting to aggregate the neural features f_{i,x} regressed from these K neighboring points to obtain a single feature f_x that describes scene appearance at x:

f_x = \sum_i \gamma_i \frac{w_i}{\sum_j w_j} f_{i,x}, \qquad w_i = \frac{1}{\lVert p_i - x \rVert}.   (4.7)

Then an MLP, R, regresses the view-dependent radiance from this feature given a viewing direction d:

r = R(f_x, d).   (4.8)

The inverse-distance weight w_i is widely used in scattered data interpolation; we leverage it to aggregate neural features, making closer neural points contribute more to the shading computation. In addition, we use the per-point confidence \gamma_i in this process; this is optimized in the final reconstruction with a sparsity loss, giving the network the flexibility of rejecting unnecessary points.
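To make the aggregation in Eqs. (4.6)–(4.8) concrete, here is a minimal PyTorch-style sketch for a single shading location; the module signatures and concatenation choices are illustrative assumptions rather than the exact Point-NeRF implementation. The density branch described next reuses the same inverse-distance weights.

    import torch

    def aggregate_radiance(x, d, points, feats, gammas, F, R, eps=1e-8):
        """Regress view-dependent radiance at shading location x from K neural points.
        x: (3,); d: (3,); points: (K, 3); feats: (K, Cf); gammas: (K,);
        F and R are small MLPs (torch.nn.Module) standing in for Eqs. (4.6) and (4.8)."""
        rel = x[None, :] - points                    # relative positions x - p_i
        f_ix = F(torch.cat([feats, rel], dim=-1))    # Eq. (4.6): per-point features at x
        w = 1.0 / (rel.norm(dim=-1) + eps)           # inverse-distance weights
        w = gammas * w / w.sum()                     # confidence-modulated, normalized weights
        f_x = (w[:, None] * f_ix).sum(dim=0)         # Eq. (4.7): aggregated feature at x
        return R(torch.cat([f_x, d], dim=0))         # Eq. (4.8): radiance along direction d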
Density regression. To compute the volume density \sigma at x, we follow a similar multi-point aggregation. However, we first regress a density \sigma_i per point using an MLP T and then perform inverse distance-based weighting, given by:

\sigma_i = T(f_{i,x}),   (4.9)

\sigma = \sum_i \sigma_i \gamma_i \frac{w_i}{\sum_j w_j}, \qquad w_i = \frac{1}{\lVert p_i - x \rVert}.   (4.10)

Thus, each neural point directly contributes to the volume density, and the point confidence \gamma_i is explicitly associated with this contribution. We leverage this in our point removal process (see Sec. 4.4.3.2).

Discussion. Unlike previous neural point-based methods [4, 118] that rasterize point features and then render them with 2D CNNs, our representation and rendering are entirely in 3D. By using a point cloud that approximates the scene geometry, our representation naturally and efficiently adapts to scene surfaces and avoids sampling shading locations in empty scene space. For shading points along each ray, we implement an efficient algorithm to query neighboring neural points.

4.4.3 Point-NeRF Reconstruction

We now introduce our pipeline for efficiently reconstructing point-based radiance fields. We first leverage a deep neural network, trained across scenes, to generate an initial point-based field via direct network inference (Sec. 4.4.3.1). This initial field is further optimized per scene with our point growing and pruning techniques, leading to our final high-quality radiance field reconstruction (Sec. 4.4.3.2). Figure 4.19 shows this workflow with the corresponding gradient updates for the initial prediction and per-scene optimization.

Figure 4.19: The dashed lines indicate gradient updates for radiance field initialization and per-scene optimization.

4.4.3.1 Initialization

Given a set of known images I_1, ..., I_Q and a point cloud, our Point-NeRF representation can be reconstructed by optimizing the randomly initialized per-point neural features and the MLPs with a rendering loss (similar to NeRF). However, this pure per-scene optimization depends on an existing point cloud and can be prohibitively slow. Therefore, we propose a neural generation module to predict all neural point properties, including the point locations p_i, neural features f_i, and point confidences \gamma_i, via a feed-forward neural network for efficient reconstruction. The direct inference of the network outputs a good initial point-based radiance field. The initial field can then be fine-tuned to achieve high-quality rendering. In a very short period, the rendering quality is better than or on par with NeRF, which takes a substantially longer time to optimize (see Tab. 4.5 and 4.6).

Point location and confidence. We leverage deep MVS methods to generate 3D point locations using cost volume-based 3D CNNs [214, 32]. Such networks produce high-quality dense geometry and generalize well across domains. For each input image I_q with camera parameters \Phi_q at viewpoint q, we follow MVSNet [214] to first build a plane-swept cost volume by warping 2D image features from neighboring viewpoints and then regress a depth probability volume using deep 3D CNNs. A depth map is computed by linearly combining per-plane depth values weighted by the probabilities. We unproject the depth map to 3D space to get a point cloud \{p_1, \ldots, p_{N_q}\} per view q. Since the depth probabilities describe the likelihood of a point lying on the surface, we tri-linearly sample the depth probability volume to obtain the point confidence \gamma_i at each point p_i. The above process can be expressed by

\{p_i, \gamma_i\} = G_{p,\gamma}(I_q, \Phi_q, I_{q_1}, \Phi_{q_1}, I_{q_2}, \Phi_{q_2}, \ldots),   (4.11)

where G_{p,\gamma} is the MVSNet-based network. I_{q_1}, \Phi_{q_1}, \ldots are additional neighboring views used in the MVS reconstruction; we use two additional views in most cases.
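A schematic sketch of this depth-to-points step (Eq. 4.11): the depth map and its probability at the predicted depth are assumed to be given by the MVSNet-style network, and the unprojection assumes a pinhole intrinsic K and a camera-to-world pose; this illustrates the described pipeline rather than the released code.

    import numpy as np

    def depth_to_neural_points(depth, prob_at_depth, K, R_c2w, t_c2w):
        """Unproject a per-view depth map into world-space points with confidences.
        depth, prob_at_depth: (H, W); K: (3, 3) intrinsics; R_c2w: (3, 3); t_c2w: (3,)."""
        H, W = depth.shape
        v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
        pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)  # homogeneous pixels
        rays = pix @ np.linalg.inv(K).T                                  # camera-space rays (z = 1)
        cam_pts = rays * depth.reshape(-1, 1)                            # back-project by depth
        world_pts = cam_pts @ R_c2w.T + t_c2w                            # to world coordinates
        gammas = prob_at_depth.reshape(-1)    # confidence: depth probability at each point
        return world_pts, gammas              # {p_i, gamma_i} for this view (Eq. 4.11)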
Point features. We use a 2D CNN G_f to extract neural 2D image feature maps from each image I_q. These feature maps are aligned with the point (depth) predictions from G_{p,\gamma} and are used to directly predict the per-point features f_i as:

\{f_i\} = G_f(I_q).   (4.12)

In particular, we use a VGG network architecture for G_f that has three downsampling layers. We combine intermediate features at different resolutions as f_i, providing a meaningful point description that models multi-scale scene appearance (see Fig. 4.5(a)).

End-to-end reconstruction. We combine point clouds from multiple viewpoints to obtain our final neural point cloud. We train the point generation networks along with the representation networks, from end to end, with a rendering loss (see Fig. 4.19). This allows our generation modules to produce reasonable initial radiance fields. It also initializes the MLPs in our Point-NeRF representation with reasonable weights, significantly reducing the per-scene fitting time. Moreover, apart from using the full generation module, our pipeline also supports using a point cloud reconstructed by other approaches like COLMAP [145], where our model (excluding the MVS network) can still provide meaningful initial neural features for each point.

4.4.3.2 Optimization

The above pipeline can output a reasonable initial point-based radiance field for a novel scene. Through differentiable ray marching, we can further improve the radiance field by optimizing the neural point cloud (the point features f_i and point confidences \gamma_i) and the MLPs in our representation for that specific scene (see Fig. 4.19).

The initial point cloud, especially one from an external reconstruction method (e.g., Metashape or COLMAP in Fig. 4.18), can often contain holes and outliers that degrade the rendering quality. During per-scene optimization, we find that directly optimizing the locations of the existing points makes the training unstable and cannot fill the large holes (see Fig. 4.18). Instead, we apply novel point pruning and growing techniques that gradually improve both geometry modeling and rendering quality.

Point pruning. As introduced in Sec. 4.4.2, we designed point confidence values \gamma_i that describe whether a neural point is near a scene surface. We utilize these confidence values to prune unnecessary outlier points. Note that the point confidence is directly related to the per-point contribution in volume density regression (Eqn. 4.10); as a result, low confidence reflects low volume density in a point's local region, indicating that it is empty. Therefore, we prune points that have \gamma_i < 0.1 every 10K iterations. We also impose a sparsity loss on the point confidence [110]:

L_{\text{sparse}} = \frac{1}{|\gamma|} \sum_{\gamma_i} \left[\log(\gamma_i) + \log(1 - \gamma_i)\right],   (4.13)

which forces the confidence value to be close to either zero or one. As shown in Fig. 4.23, this pruning technique can remove outlier points and reduce the corresponding artifacts.
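A minimal PyTorch-style sketch of the pruning step and the sparsity loss of Eq. (4.13); the 0.1 threshold follows the text, while the function names and the clamping constant are illustrative assumptions.

    import torch

    def sparsity_loss(gammas, eps=1e-6):
        """Eq. (4.13): minimizing log(g) + log(1 - g) pushes each confidence toward
        0 or 1 (the clamp keeps the loss finite at the extremes)."""
        g = gammas.clamp(eps, 1.0 - eps)
        return (torch.log(g) + torch.log(1.0 - g)).mean()

    def prune_points(points, feats, gammas, threshold=0.1):
        """Drop neural points whose low confidence indicates empty space
        (run periodically, e.g. every 10K iterations)."""
        keep = gammas >= threshold
        return points[keep], feats[keep], gammas[keep]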
Point growing. We also propose a novel technique to grow new points to cover missing scene geometry in the original point cloud. Unlike point pruning, which directly utilizes information from existing points, growing points requires recovering information in empty regions where no point exists. We achieve this by progressively growing points near the point cloud boundary based on the local scene geometry modeled by our Point-NeRF representation.

In particular, we leverage the per-ray shading locations (x_j in Eqn. 4.4) sampled during ray marching to identify new point candidates. Specifically, we identify the shading location x_{j_g} with the highest opacity along the ray:

\alpha_j = 1 - \exp(-\sigma_j \Delta_j), \qquad j_g = \operatorname*{argmax}_j \alpha_j.   (4.14)

We compute \epsilon_{j_g} as x_{j_g}'s distance to its closest neural point. For a marching ray, we grow a neural point at x_{j_g} if \alpha_{j_g} > T_{\text{opacity}} and \epsilon_{j_g} > T_{\text{dist}}. This implies that the location lies near the surface but is far from other neural points. By repeating this growing strategy, our radiance field can be expanded to cover missing regions in the initial point cloud. Point growing especially benefits point clouds reconstructed by methods like COLMAP that are not dense (see Fig. 4.23). We show that even in an extreme case with only 1,000 initial points, our technique is able to progressively grow new points and reasonably cover the object surface (see Fig. 4.26).
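The growing rule of Eq. (4.14) can be sketched per ray as follows; the threshold names follow the text, and the nearest-neighbor distances are assumed to be precomputed.

    import numpy as np

    def candidate_new_point(xs, sigmas, deltas, dist_to_nearest_point,
                            T_opacity, T_dist):
        """For one marching ray, return a location at which to grow a new neural point,
        or None. xs: (M, 3) shading locations; sigmas, deltas: (M,);
        dist_to_nearest_point: (M,) distance from each x_j to its closest neural point."""
        alphas = 1.0 - np.exp(-sigmas * deltas)      # per-sample opacity (Eq. 4.14)
        jg = int(np.argmax(alphas))                  # most opaque sample along the ray
        if alphas[jg] > T_opacity and dist_to_nearest_point[jg] > T_dist:
            return xs[jg]                            # near the surface, far from existing points
        return None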
4.4.4 Implementation

Neural point querying. To efficiently query neural point neighbors for ray marching, inspired by the CAGQ point query introduced in [198], we implement a grid query method. We build grid-point indices that register each neural point to evenly spaced 3D grids. Since these grids are cubic in the perspective coordinate frame, in the world coordinate frame they have the shapes of spherical voxels. With the grid-point indices, we can discover the grids that contain neural points as well as their grid neighbors. These grid neighbors are the regions of interest, since neural points must exist within the query radius there. If a ray crosses these regions, we can place shading points inside. Finally, we query neural points by directly retrieving the stored neural points according to the grid-point indices. In all of our experiments, we query the 8 nearest neural point neighbors for each shading location.

Along each ray, we only search for neural point neighbors and compute radiance for shading locations in grids that are occupied or adjacent to occupied grids. Therefore, our shading is much more efficient because it skips the empty space, unlike other radiance field representations; this is one key advantage that enables fast convergence. Even NSVF [104], a high-performance local radiance representation, has to probe the empty space in the beginning and gradually prune the voxels along its training process. The benefit of our strategy is two-fold: first, we only place shading points in areas where neural points exist, so we avoid radiance computation in empty space; second, nearby points can be efficiently retrieved according to the indices, which substantially accelerates the point query.

Network details. We apply frequency positional encoding on the relative position and the per-point features for the per-point processing network F, and on the viewing direction for the network R. We extract multi-scale image features from three layers at different resolutions in the network G_f, leading to a vector with 56 (8+16+32) channels. We additionally append the corresponding viewing directions from each input viewpoint to handle view-dependent effects. Therefore, our final per-point neural feature is a 59-channel vector.

Figure 4.20: The network pipeline of radiance field computation at a shading location x from its K neural point neighbors. "PosEN" indicates positional encoding [120]. "d3" indicates the 3-channel vector of the viewing direction at x. The final outputs are the radiance color r and the density \sigma. Please also refer to equations (4.5)–(4.10).

Cost volume-based CNN G_{p,\gamma}. Our cost volume-based CNN adopts the popular architecture of [214], which is simple and efficient. It includes a three-layer depth feature extraction CNN, where the latter two layers downsample the spatial dimensions by 4 and output a feature map with 32 channels. These features from each view are then warped according to the camera poses and their variance is computed. The variance features go through a narrow U-Net [185], which outputs a 1-channel feature used to calculate the depth probability.

Image feature extraction 2D CNN G_f. The image feature extraction network takes an RGB image as input and has three downsampling layers, whose outputs have 8, 16, and 32 channels, respectively. We extract the point features by unprojecting a 3D point onto each layer and taking the multi-scale features.

Point-based radiance fields MLP. We visualize the details of the point feature aggregation and radiance computation in Figure 4.20. In all of our experiments, we set c_1 = 56 and c_2 = 128. The MLPs F, R, and T have 2, 3, and 2 layers, respectively. The intermediate feature channels are 256 for F and T, and 128 for R.

Training and optimization details. We train our full pipeline on the DTU dataset, using the same training and testing split as PixelNeRF and MVSNeRF. We first pretrain the MVSNet-based depth generation network using the ground truth depth, similar to the original MVSNet paper [214]. We then train our full pipeline from end to end purely with an L2 rendering loss L_render, supervising our rendered pixels from ray marching (via Eqn. 4.4) with the ground truth, to obtain our Point-NeRF reconstruction network. We train our full pipeline using the Adam [78] optimizer with an initial learning rate of 5e-4. Our feed-forward network takes 0.2s to generate a point cloud from three input views. In the per-scene optimization stage, we adopt a loss function that combines the rendering loss and the sparsity loss:

L_{\text{opt}} = L_{\text{render}} + a L_{\text{sparse}},   (4.15)

where we use a = 2e-3 for all our experiments. We perform point growing and pruning every 10K iterations to achieve our final high-quality reconstruction.

No per-scene optimization:
                 PixelNeRF   MVSNeRF   IBRNet   Ours
PSNR ↑           19.31       26.63     26.04    23.89
SSIM ↑           0.789       0.931     0.917    0.874
LPIPS (VGG) ↓    0.382       0.168     0.190    0.203

With per-scene optimization:
                 Ours 1K   Ours 10K   MVSNeRF 10K   IBRNet 10K   NeRF 200K
PSNR ↑           28.43     30.12      28.50         31.35        27.01
SSIM ↑           0.929     0.957      0.933         0.956        0.902
LPIPS (VGG) ↓    0.183     0.117      0.179         0.131        0.263
Time ↓           2 min     20 min     24 min        1 h          10 h

Table 4.5: Comparisons of our Point-NeRF with radiance-based models [112, 179, 104] and a point-based rendering model [4] on the DTU dataset [74] with the novel view synthesis setting introduced in [23]. The subscripts indicate the number of iterations during optimization.

                 NPBG [4]   NeRF   IBRNet   NSVF [104]   Point-NeRF col 200K   Point-NeRF 20K   Point-NeRF 200K
PSNR ↑           24.56      31.01  28.14    31.75        31.77                 30.71            33.31
SSIM ↑           0.923      0.947  0.942    0.964        0.973                 0.967            0.978
LPIPS (VGG) ↓    0.109      0.081  0.072    -            0.062                 0.081            0.049
LPIPS (Alex) ↓   0.095      -      -        0.047        0.040                 0.050            0.027

Table 4.6: Comparisons of Point-NeRF with radiance-based models [112, 179, 104] and a point-based rendering model [4] on the Synthetic-NeRF dataset [112]. The subscripts indicate the number of iterations.
Our model not only surpasses other methods when converged after 200K steps (Point-NeRF 200K ), but surpasses IBRNet [179] and is on par with NeRF [120] when optimized by only 20K steps (Point-NeRF 20K ). Our methods can also initialize radiance fields based on point clouds reconstructed by methods such as COLMAP (Point-NeRF col 200K ). 4.4.5 Evaluation 4.4.5.1 The DTU Dataset We first evaluate our model on the DTU testing set. We produce novel view synthesis results from both direct network inference and per-scene fine-tuning optimization with different iterations and compare them with the previous state-of-the-art methods including PixelNeRF[221], IBRNet[179], MVSNeRF[23], and NeRF[120]. IBRNet and MVSNeRF utilize similar per-scene fine-tuning; we fine-tune all methods with 10k iterations for the comparison. Additionally, we show our results with only 1k iterations to demonstrate the efficiency of our optimization. Tab. 4.5 shows the quantitative results of all methods with PSNR, SSIM, and LPIPS; qualita- tive rendering results are shown in Fig. 4.21. We can see that our final fine-tuning results after 10k 115 iterations achieve the best SSIM and LPIPS[227], two out of the three metrics. These are signif- icantly better than the final MVSNeRF and NeRF results. While IBRNet produces slightly better PSNR results, our final renderings in fact recover more accurate texture details and highlights as shown Fig. 4.21. On the other hand, IBRNet is also more expensive to fine-tune, taking 1 hour—5x longer than our fine-tuning for the same iteration number. This is because IBRNet depends on a large global CNN, whereas our model leverages local point features with small MLPs that are eas- ier to optimize. More importantly, our point-based representation lies near actual scene surfaces and thus avoids sampling ray points in the empty scene space, leading to highly efficient per-scene optimization. Apart from the optimization results, our initial radiance field estimated from our network is significantly better than PixelNeRF. In this case, our direct inference is worse than IBRNet and MVSNet, mainly because these two methods are using more complex variance-based feature ex- traction. Our point features are extracted from a simple VGG network. The same design is used in PixelNeRF; we achieve significantly better results than PixelNeRF due to our novel surface- adaptive point-based representation. While a more complex feature extractor as in IBRNet might improve quality, it will add a burden to memory usage and training efficiency. More importantly, our generation network has al- ready provided a high-quality initial radiance field to support efficient optimization. We show that even 2 min / 1K iterations of fine-tuning for our method lead to a very high visual quality compa- rable to MVSNeRF’s final 10k-iteration results. This clearly demonstrates the high reconstruction efficiency of our approach. We show the per-scene detailed quantitative results of the comparisons on the DTU[74] dataset in Table 4.7 and additional qualitative comparisons in our video. Since our method also faithfully 116 Figure 4.21: Qualitative comparisons of per-scene optimization on the DTU dataset [74]. Our Point-NeRF can recover texture details and geometrical structures more accurately than other methods. Point-NeRF also demonstrates superior efficiency. 
Within two mins, our model trained for 1K steps is already on par with the state-of-the-art methods such as MVSNeRF [23] and IBRNet[179] reconstructs the scene geometry, our method has the best SSIM scores in most cases. Our model also has the best LPIPS for most of the scenes and therefore, is more visually authentic, as shown in Figure 6 of the main paper and the video. IBRNet combines the colors from the source views to compute the radiance colors during shading. This image-based approach results in better PSNR. However, as shown in our video, our method is more temporal consistent because the local radiance and geometries are consistently stored at each neural point location. 4.4.5.2 The NeRF Synthetic Dataset. While our model is purely trained on the DTU dataset, our network generalizes well to novel datasets that have completely different camera distributions. We demonstrate such results on the NeRF synthetic dataset and compare them with other state-of-the-art methods with qualitative results in Fig. 4.22 and quantitative results in Tab. 4.6. In particular, we compare with a point- based rendering model (NPBG) [4], a generalizable radiance field method (IBRNet) [179], and per-scene radiance field reconstruction techniques (NeRF and NSVF)[120, 104]. 117 Scan #1 #8 #21 #103 #114 SSIM" Ours 1K 0.935 0.906 0.913 0.944 0.948 Ours 10K 0.962 0.949 0.954 0.961 0.960 MVSNeRF 10K [23] 0.934 0.900 0.922 0.964 0.945 IBRNET 10K [179] 0.955 0.945 0.947 0.968 0.964 NeRF 200K [120] 0.902 0.876 0.874 0.944 0.913 LPIPS Vgg # Ours 1K 0.151 0.207 0.201 0.208 0.148 Ours 10K 0.095 0.130 0.134 0.145 0.096 MVSNeRF 10K 0.171 0.261 0.142 0.170 0.153 IBRNET 10K 0.129 0.170 0.104 0.156 0.099 NeRF 200K 0.265 0.321 0.246 0.256 0.225 PSNR" Ours 1K 28.79 28.39 24.78 30.36 29.82 Ours 10K 30.85 30.72 26.22 32.08 30.75 MVSNeRF 10K 28.05 28.88 24.87 32.23 28.47 IBRNET 10K 31.00 32.46 27.88 34.40 31.00 NeRF 200K 26.62 28.33 23.24 30.40 26.47 Table 4.7: Quantity comparison on five sample scenes in the DTU testing set with the view synthe- sis setting introduced in [23]. The subscripts indicate the number of iterations during optimization. Figure 4.22: Qualitative comparisons on the NeRF Synthetic dataset [120]. The subscripts indicate the number of iterations. Our Point-NeRF can capture fine details and thin structures (see the rope on row 2). Point-NeRF also demonstrates superior efficiency. Our model trained for 20K steps is already on par with NeRF with 30 faster training time. 118 Comparisons with generalizing methods. We compare with IBRNet, to the best of our knowl- edge, is the previous best NeRF-based generalizable model that can handle free-viewpoint ren- dering with any arbitrary numbers. Note that, this dataset has a 360 camera distribution, which is much wider than the DTU dataset. In this case, a local reconstruction method like MVSNeRF cannot be applied, since it recovers a local perspective frustum volume from a fixed number of three input images, which cannot cover the entire 360 viewing range. We, therefore, compare with IBRNet and focus on the final results after per-scene optimization in this experiment. We use their released model to produce the results. We show results from both methods optimized for 20K iterations. Our results (Point-NeRF 20K ) are significantly better than the IBRNet results with better PSNR, SSIM, and LIPIPS; we also achieve rendering quality with better geometry and texture details as shown in Fig. 4.22. Comparisons with pure per-scene methods. 
Our results after 20K iterations are quantitatively very close to NeRF's results trained with 200K iterations. Visually, our model at 20K iterations already produces better renderings in some cases, e.g., the Ficus scene (4th row) in Fig. 4.22. Point-NeRF 20K is optimized for only 40 minutes, which is at least 30× faster than the 20+ hours of optimization taken by NeRF. NSVF's [104] results also come from very long per-scene optimization and yet are only slightly better than our 40-minute results. Optimizing our model for 200K iterations until convergence leads to significantly better results than NeRF, NSVF, and all other comparison methods. As shown in Fig. 4.22, our 200K results contain the most geometry and texture details. Moreover, our method is the only one that can fully recover details like the thin rope structure in the Ship scene (2nd row), thanks to our point-growing technique.

Figure 4.23: Our neural point clouds and rendered novel views with or without point pruning and growing (P&G). P&G improves both the geometries and the rendering results when using the point cloud reconstructed from our model or from COLMAP [145].

Comparisons with point-based rendering. Our results are significantly better than those of the previous state-of-the-art point-based rendering methods. For a fair comparison, we run NPBG [4] using the same point cloud generated by our MVSNet-based network. However, NPBG can only produce blurry rendering results with its rasterization and 2D CNN framework. In contrast, we leverage volumetric rendering with neural radiance fields, leading to photo-realistic results and high PSNRs. We show the detailed per-scene quantitative results of the comparisons on the NeRF Synthetic [120] dataset in Table 4.8 and additional qualitative comparisons in our video. Point-NeRF achieves the best PSNR, SSIM, and LPIPS on most of the scenes and outperforms state-of-the-art methods [4, 120, 104, 179] by a large margin. On the other hand, our method initialized with COLMAP points is on par with NeRF.
Even starting from the unideal initial points, we still manage to im- prove the geometry reconstruction and generate a high-quality radiance field with point pruning 120 NeRF Synthetic Chair Drums Lego Mic Materials Ship Hotdog Ficus PSNR" NPBG[4] 26.47 21.53 24.84 26.62 21.58 21.83 29.01 24.60 NeRF[120] 33.00 25.01 32.54 32.91 29.62 28.65 36.18 30.13 NSVF[104] 33.19 25.18 32.54 34.27 32.68 27.93 37.14 31.23 Point-NeRF col 200K 35.09 25.01 32.65 35.54 26.97 30.18 35.49 33.24 Point-NeRF 20K 32.50 25.03 32.40 32.31 28.11 28.13 34.53 32.67 Point-NeRF 200K 35.40 26.06 35.04 35.95 29.61 30.97 37.30 36.13 SSIM" NPBG 0.939 0.904 0.923 0.959 0.887 0.866 0.964 0.940 NeRF 0.967 0.925 0.961 0.980 0.949 0.856 0.974 0.964 NSVF 0.968 0.931 0.960 0.987 0.973 0.854 0.980 0.973 Point-NeRF col 200K 0.990 0.944 0.983 0.993 0.955 0.941 0.986 0.989 Point-NeRF 20K 0.981 0.944 0.980 0.986 0.959 0.916 0.983 0.986 Point-NeRF 200K 0.991 0.954 0.988 0.994 0.971 0.942 0.991 0.993 LPIPS Vgg # NPBG 0.085 0.112 0.119 0.060 0.134 0.210 0.075 0.078 NeRF 0.046 0.091 0.050 0.028 0.063 0.206 0.121 0.044 Point-NeRF col 200K 0.026 0.099 0.031 0.019 0.100 0.134 0.061 0.028 Point-NeRF 20K 0.051 0.103 0.054 0.039 0.102 0.181 0.074 0.043 Point-NeRF 200K 0.023 0.078 0.024 0.014 0.072 0.124 0.037 0.022 LPIPS Alex # NSVF 0.043 0.069 0.029 0.010 0.021 0.162 0.025 0.017 Point-NeRF col 200K 0.013 0.073 0.016 0.011 0.076 0.087 0.032 0.012 Point-NeRF 20K 0.027 0.057 0.022 0.024 0.076 0.127 0.044 0.022 Point-NeRF 200K 0.010 0.055 0.011 0.007 0.041 0.070 0.016 0.009 Table 4.8: Detailed breakdown of quantitative metrics of individual scenes for the NeRF Synthetic [120] for our method and baselines. All scores are averaged over the testing images. The subscripts are the number of iterations of the models and Point-NeRF col 200K indicates our method initiates from COLMAP points and is optimized for 200 thousand iterations. 121 and growing. The fact that our model at 20K iterations matches the results of NeRF at 500K iterations clearly demonstrates our ability for fast convergence. 4.4.5.3 The Tanks and Temples Dataset Tanks & Tamples Ignatius Truck Barn Caterpillar Family Mean PSNR" NV [110] 26.54 21.71 20.82 20.71 28.72 23.70 NeRF [120] 25.43 25.36 24.05 23.75 30.29 25.78 NSVF [104] 27.91 26.92 27.16 26.44 33.58 28.40 Point-NeRF (Ours) 28.43 28.22 29.15 27.00 35.27 29.61 SSIM" NV [110] 0.992 0.793 0.721 0.819 0.916 0.848 NeRF [120] 0.920 0.860 0.750 0.860 0.932 0.864 NSVF [104] 0.930 0.895 0.823 0.900 0.954 0.900 Point-NeRF (Ours) 0.961 0.950 0.937 0.934 0.986 0.954 LPIPS Alex # NV [110] 0.117 0.312 0.479 0.280 0.111 0.260 NeRF [120] 0.111 0.192 0.395 0.196 0.098 0.198 NSVF [104] 0.106 0.148 0.307 0.141 0.063 0.153 Point-NeRF (Ours) 0.069 0.077 0.120 0.111 0.024 0.080 LPIPS Vgg # Point-NeRF (Ours) 0.079 0.117 0.180 0.156 0.046 0.115 Table 4.9: Quantity comparison on five scenes in the Tanks and Temples dataset [80] selected in NSVF [104]. Our method Point-NeRF outperforms all state-of-the-art models in all metrics by substantial margins. We also experiment with Point-NeRF on the Tanks and Temples dataset [80]. we reconstruct the radiance field of five scenes selected in NSVF [104] and compare our model with three models NV [110], NeRF [120] and NSVF [104]. We show the quantitative comparison in Tab. 4.9 and visualize quality results in Figure 4.24. Please find more visual results in our video. 122 Figure 4.24: The qualitative results of our Point-NeRF on the Tanks and Temples dataset. 
Average over two scenes Scene 101 Scene 241 SRN [159] NeRF [112] NSVF [104] Point-NeRF (Ours) Point-NeRF (Ours) PSNR" 18.25 22.99 25.48 30.32 30.13 30.51 SSIM" 0.592 0.620 0.688 0.909 0.912 0.906 RMSE# 14.764 0.681 0.079 0.031 0.032 0.030 LPIPS Alex # 0.586 0.369 0.301 0.220 0.203 0.238 LPIPS Vgg # - - - 0.292 0.286 0.299 Table 4.10: Quantity comparison on two scenes in the ScanNet dataset [38] selected in NSVF [104]. RMSE is the Root Mean Square Error. Our method Point-NeRF outperforms all state-of- the-art methods in all metrics by substantial margins. Figure 4.25: The qualitative results of our Point-NeRF on the ScanNet dataset [80]. The first row shows five generated test frames of scene 101 and the second row shows five generated test frames of scene 241. 123 4.4.5.4 Large-scale 3D Scenes (ScanNet) While our model is purely trained on a dataset of objects (the DTU dataset), our network general- izes well to large-scale 3D scene datasets. Following [104], we use two 3D scenes, scene 0101_04 and scene 0241_01, from ScanNet [38]. We extract both RGB and depth images from the original videos and from which we sample one out of five frames as a training set and use the rest for test- ing. The RGB images are scaled to 640 × 480. We finetune each scene for 300K steps with point pruning and growing. We compare with 3 other state-of-the-art methods with quantitative results in Tab. 4.6. In particular, we compare with a scene representation model (SRN) [159], NeRF [120] and a sparse voxel-based neural radiance field, NSVF [104]. The qualitative comparison is shown in Tab. 4.10 and visual results are shown in Figure 4.25. Our Point-NeRF outperforms all these previous studies in all metrics by substantial margins. Please find more visual results in our video. 4.4.6 Additional Experiments Converting COLMAP point clouds to Point-NeRF As mentioned, apart from using our full pipeline, our method can also be used to convert standard point clouds reconstructed by other techniques to point-based radiance fields. We run experiments for this on the full NeRF synthetic dataset, using the point cloud reconstructed by COLMAP [145]. The quantitative results are shown as Point-NeRF col in Tab. 4.6. Point-NeRF can use the points of any external reconstruction method. For instance, the output of COLMAP[145] is a point cloudf(p i )ji = 1;:::;Ng. We set i as 0:3 in the beginning. The confidence score of valid points will be pushed to 1 during the optimization process. To acquire 124 point featuresf i for a point, We first rule out all the views where the point is occluded by other points, then we find the view of which the camera is the closest to the point. Then from that view, we can unproject the point onto the feature maps extracted by G f (see Figure 2(a) in the main paper) from the selected view and obtain thef i . Since COLMAP point clouds may contain a lot of holes (as shown in Fig. 4.18) and noises, we optimize the model for 200K after the initialization to address the point cloud issues with our point growing and pruning techniques. Note that, even from this low-quality point cloud, our final results are still of very high quality with very high SSIM and LPIPS numbers compared to all other methods. This demonstrates that our technique can be potentially combined with any existing point cloud reconstruction techniques, to achieve realistic rendering while improving the point cloud geometry. 
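A schematic of this conversion step, with the occlusion test and the feature unprojection abstracted into placeholder callables; the 0.3 initial confidence follows the text, while everything else (names, the camera .center attribute) is an illustrative assumption.

    import numpy as np

    def init_points_from_colmap(colmap_points, cameras, feature_maps,
                                is_visible, sample_feature):
        """Assign an initial feature f_i and confidence gamma_i to each COLMAP point.
        colmap_points: (N, 3); cameras: list of per-view camera objects; feature_maps:
        per-view outputs of the image CNN G_f; is_visible(p, cam) and
        sample_feature(p, cam, fmap) are placeholders for the occlusion test and the
        unprojection / feature lookup described above."""
        kept_points, feats, gammas = [], [], []
        for p in colmap_points:
            visible = [i for i, cam in enumerate(cameras) if is_visible(p, cam)]
            if not visible:
                continue                              # point not seen by any view
            # choose the visible view whose camera center is closest to the point
            best = min(visible, key=lambda i: np.linalg.norm(cameras[i].center - p))
            kept_points.append(p)
            feats.append(sample_feature(p, cameras[best], feature_maps[best]))
            gammas.append(0.3)                        # initial confidence, refined during optimization
        return np.stack(kept_points), np.stack(feats), np.asarray(gammas)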
Figure 4.26: Starting from 1000 randomly sampled COLMAP points of the Chair scene, our point growing mechanism can help complete the geometry and generate high-quality novel views when only being supervised by RGB images. Point growing and pruning. To further demonstrate the effectiveness of our point growing and pruning modules, we show ablation study results with and without the point growing and pruning in the per-scene optimization. We conduct this experiment on the Hotdog and Ship scenes, using both our full model and our model with COLMAP point clouds. The quantitative results are shown 125 in Tab. 4.11; our point growing and pruning techniques are very effective, significantly improving the reconstruction results in both cases. We also show the visual results of the Hotdog scene in Fig. 4.23. We can clearly see that our model is able to prune the point outliers on the left and successfully fill the severe holes on the right in the original COLMAP point cloud. We also manually create an extreme example to show our point-growing technique in Fig. 4.26, where we start from a very sparse point cloud with only 1000 points sampled from our original point reconstruction. We demonstrate that our approach can progressively grow new points from the point cloud boundary until filling the entire scene surface through iterations. This example further demonstrates the effectiveness of our model, which has high potential in using image data to recover accurate scene geometry and appearance from low-quality point clouds. Method P&G Ship Hotdog Ours No 25.50 / 0.878 / 0.182 34.91 / 0.983 /0.067 Ours Yes 30.82 / 0.943 / 0.126 36.93 / 0.990 / 0.041 COLMAP No 19.35 / 0.905 / 0.167 29.91 / 0.978 / 0.061 COLMAP Yes 29.83 / 0.940 / 0.136 34.76 / 0.985 / 0.062 Table 4.11: The quantitative results (PSNR / SSIM / LPIPS Vgg ) of the Ship and Hotdog scene with or without point pruning and growing (P&G). The improvements are significant when using either our generated points or the point cloud generated by COLMAP[145]. Ablation studies on point features initialization. We conduct experiments to demonstrate the importance of our feature initialization. We compare our full model and our model initialized without using the extracted image features on the NeRF Synthetic dataset [120]. Without using the features from images, we randomly initialize the point features by using the popular Kaiming Extract 20k Rand 20k Extract 200k Rand 200k PSNR" 30.09 25.44 33.00 32.01 SSIM" 0.963 0.932 0.978 0.972 Table 4.12: Comparisons between using the extracted image features to initialize the point features (our full model) or using the random initialized features. 126 Initialization [62]. As shown in Table 4.12, the neural points with image feature not only achieve better performance after convergence at 200K iterations but also converge much faster in the be- ginning. The randomly initialized neural points even cannot perform as well as our full model, still outperforms state-of-the-art methods such as NeRF and NSVF[104]. 127 Chapter 5 Conclusion and Future Work 5.1 Summary of Research This thesis addresses three aspects of point-based 3D representation, namely, point cloud repre- sentation learning, point-based 3D scene understanding, and point-based implicit methods for 3D scene reconstruction. Due to the fact that point sets are usually sparse in space and vast in number, we present a method, Grid-GCN [198] for fast and scalable point cloud learning. 
The model uses a Coverage- Aware Grid Query strategy which improves spatial coverage with lightning-fast speed. With a Grid Context Aggregation (GCA) module, Grid-GCN achieves state-of-the-art performance on major point cloud classification and segmentation benchmarks with significantly faster runtime than previous studies. After that, the thesis discusses two critical challenges of point-based 3D perception, the domain gap between point clouds with different quality and occlusion. To address the issue of domain shift, a point-based domain adaptation model, Semantic Point Generation (SPG) [202], enhances the re- liability of LiDAR detectors against domain shifts. Specifically, SPG generates semantic points 128 at the predicted foreground regions and faithfully recovers the missing parts of the foreground ob- jects. Any modern LiDAR-based detector can directly consume the points generated by SPG along with the raw point cloud and generalize well on both the source and target domain. Aiming to rem- edy the occlusion, especially in a LiDAR frame, a novel LiDAR-based 3D object detection model, Behind the Curtain Detector (BtcDet) [201] learns the object shape priors and estimates the com- plete object shapes that are partially occluded in point clouds. BtcDet can generate high-quality 3D proposals and demonstrate superior performance on several 3D object detection benchmarks. With the explanation of how to learn the point-based 3D representation and to understand the 3D scene, the thesis demonstrates how to hybrid the implicit functions with the point-based explicit representation to reconstruct the 3D scene. First, DISN, a Deep Implicit Surface Network [199] is presented. The model can generate a high-quality detail-rich 3D mesh from a 2D image by predicting the underlying signed distance fields. Then, the thesis moves on to discuss how to reconstruct all graphics components together by inverse rendering. A point-based radiance fields model, Point-NeRF [200], is introduced, which combines the advantages of these two approaches by using neural 3D point clouds, with associated neural features, to model a radiance field. Point- NeRF can be rendered efficiently by aggregating neural point features near scene surfaces, in a ray marching-based rendering pipeline. The model can generate a 3D scene with the equal visual quality of NeRF with 30× faster training time and achieves the state-of-the-art reconstruction quality that outperforms all previous and concurrent models by a considerable margin. Moreover, Point-NeRF handles the errors and outliers in such methods via a novel pruning and growing mechanism. 129 5.2 Future Work The thesis has presented several advances in the direction of 3D representation learning, 3D scene understanding, and reconstruction of 3D scenes. Based on these studies, future directions can be derived, which have the potential to lead to full automation and immersive augmented reality eventually. Tensor and Point For 3D scene reconstruction, a promising future direction is to exploit spar- sity for novel 3D representations. Our current work Point-NeRF [200] contains many points and features that need hundreds of megabytes to store, which could potentially create problems during the deployment on mobile devices. A recent method, TensoRF [22], decomposes the 3D scene as tensors. Its features are stored in one-D or two-D tensors and their outer product would result in a feature volume covering the entire scene. 
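To illustrate the kind of factorization meant here, consider a toy CP-style example in which three per-axis factor vectors expand, via an outer product, into a dense feature volume; this is a simplification for intuition only, not TensoRF's exact vector–matrix decomposition.

    import numpy as np

    # Toy CP-style factorization: a rank-R set of per-axis vectors represents
    # a dense X*Y*Z feature grid without ever storing the grid explicitly.
    X, Y, Z, R = 128, 128, 128, 16
    vx, vy, vz = np.random.randn(R, X), np.random.randn(R, Y), np.random.randn(R, Z)

    def grid_feature(i, j, k):
        """Query one grid location by multiplying the per-axis factors and summing over rank."""
        return np.sum(vx[:, i] * vy[:, j] * vz[:, k])

    dense_equivalent = np.einsum('ri,rj,rk->ijk', vx, vy, vz)   # what the factors encode
    assert np.isclose(dense_equivalent[3, 5, 7], grid_feature(3, 5, 7))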
Conditioned on corresponding 3d voxel features, the density and RGB color for each shading location can be generated. The method will result in very compact models, only 10 megabytes, and very fast to render since the outer product is faster than querying point neighbors. However, the local details are sacrificed due to the information loss brought up by the global tensor decomposition. One way to combine sparsity and Point-Nerf local- ization is to distribute or grow tensors following the geometry distribution. For example, each of the boxes is a local tensor radiance field. During ray marching, instead of querying point features, we query the local tensors. This new direction has the potential to achieve high-quality rendering, and faster render speed, and at the same time, the final model will take less space. Foundation Model for 3D Content Foundation models use self-supervised or weakly-supervised learning to obtain outstanding generalization ability for the different tasks; BERT [42] for language 130 understanding, GPT-3 for [13] few-shot language reasoning, CLIP [140] for associating text and images. There is one extremely important puzzle missing in the current progress, an general em- bedding space for 3D shapes. Compared with the correlation to text, 3D shapes are more related to 2D RGB images. The computer vision community in general lacks scene data with 3D and 2D correspondence. However, inspired by structure from motion [168], we can leverage online videos and automatically find the same object in a video segment, and reconstruct sparse but highly con- fident 3D surface points. With this 3D supervision signal, we can let the model predict the object surface from the single frame and be supervised on these reconstructed 3D locations. By following this procedure, a joint embedding space between 2D images and 3D shapes can be obtained and can be further merged with the embedding space with the one in CLIP. Further generative models can be created, for example, a “DALL-E 3D” model which can generate realistic 3D shapes or even 3D scenes that are similar to the 2D images creation by DALLE2 [141]. Here I would like to quote a prospective description for the DALLE-3D: DALL-E 3D is generative 3D model created by the artificial intelligence research company OpenAI. It uses machine learning algorithms to generate 3D objects based on a given description or concept. For example, if you provide DALL-E 3D with the concept of “a chair with wings,” it will generate a 3D model of a chair with wings. DALL-E 3D is a type of generative 3D model, which means that it can create new, unique 3D objects based on the data it has been trained on. It is a powerful tool for creating 3D objects and has a wide range of potential applications. (ChatGPT, 2022) The above paragraph is generated by ChatGPT [1], a foundation AI model for conversation gener- ation. This description of DALL-E 3D is predictive, but it is likely to be fullfilled in the near future. 131 I believe AI model understands another AI model better. With large-scale dataset and computation resource, the future of generalizable 3D generative models is unfolding in front of our eyes. 132 Bibliography [1] Chatgpt. https://chat.openai.com/chat. Accessed: 12-08-2022. [2] Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas Guibas. Learning rep- resentations and generative models for 3D point clouds. In ICML, pages 40–49, 2018. [3] Abejide Olu Ade-Ibijola. 
Bibliography

[1] ChatGPT. https://chat.openai.com/chat. Accessed: 12-08-2022. [2] Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas Guibas. Learning representations and generative models for 3D point clouds. In ICML, pages 40–49, 2018. [3] Abejide Olu Ade-Ibijola. A simulated enhancement of Fisher-Yates algorithm for shuffling in virtual card games using domain-specific data structures. International Journal of Computer Applications, 54(11), 2012. [4] Kara-Ali Aliev, Artem Sevastopolsky, Maria Kolos, Dmitry Ulyanov, and Victor Lempitsky. Neural point-based graphics. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXII 16, pages 696–712. Springer, 2020. [5] I. Armeni, A. Sax, A. R. Zamir, and S. Savarese. Joint 2D-3D-Semantic Data for Indoor Scene Understanding. ArXiv e-prints, February 2017. [6] Matan Atzmon, Haggai Maron, and Yaron Lipman. Point convolutional neural networks by extension operators. arXiv preprint arXiv:1803.10091, 2018. [7] Yizhak Ben-Shabat, Michael Lindenbaum, and Anath Fischer. 3dmfv: Three-dimensional point cloud classification in real-time using convolutional neural networks. IEEE Robotics and Automation Letters, 3(4):3145–3152, 2018. [8] Alex Bewley, Pei Sun, Thomas Mensink, Drago Anguelov, and Cristian Sminchisescu. Range conditioned dilated convolutions for scale invariant 3d object detection. In Conference on Robot Learning, 2020. [9] Sai Bi, Zexiang Xu, Pratul Srinivasan, Ben Mildenhall, Kalyan Sunkavalli, Miloš Hašan, Yannick Hold-Geoffroy, David Kriegman, and Ravi Ramamoorthi. Neural reflectance fields for appearance acquisition. arXiv preprint arXiv:2008.03824, 2020. [10] Sai Bi, Zexiang Xu, Kalyan Sunkavalli, Miloš Hašan, Yannick Hold-Geoffroy, David Kriegman, and Ravi Ramamoorthi. Deep reflectance volumes: Relightable reconstructions from multi-view photometric images. In Proc. ECCV, 2020. [11] Mark Boss, Raphael Braun, Varun Jampani, Jonathan T Barron, Ce Liu, and Hendrik Lensch. Nerd: Neural reflectance decomposition from image collections. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12684–12694, 2021. [12] Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. Generative and discriminative voxel modeling with convolutional neural networks. arXiv preprint arXiv:1608.04236, 2016. [13] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020. [14] Chris Buehler, Michael Bosse, Leonard McMillan, Steven Gortler, and Michael Cohen. Unstructured lumigraph rendering. In Proc. SIGGRAPH, pages 425–432, 2001. [15] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631, 2020. [16] Xu Cao, Weimin Wang, and Katashi Nagao. Neural style transfer for point clouds. arXiv preprint arXiv:1903.05807, 2019. [17] Xu Cao, Weimin Wang, Katashi Nagao, and Ryosuke Nakamura. Psnet: A style transfer network for point cloud stylization on geometry and color. In The IEEE Winter Conference on Applications of Computer Vision, pages 3337–3345, 2020. [18] Edwin Earl Catmull. A subdivision algorithm for computer display of curved surfaces. The University of Utah, 1974. [19] Eric R Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. pi-gan: Periodic implicit generative adversarial networks for 3d-aware image synthesis.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5799–5809, 2021. [20] Tony Chan and Wei Zhu. Level set based shape prior segmentation. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 2, pages 1164–1170. IEEE, 2005. [21] Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. Shapenet: An information-rich 3d model repository. arxiv, 2015. [22] Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. Tensorf: Tensorial radiance fields. arXiv preprint arXiv:2203.09517, 2022. [23] Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo. arXiv preprint arXiv:2103.15595, 2021. [24] Qi Chen, Lin Sun, Zhixin Wang, Kui Jia, and Alan Yuille. Object as hotspots: An anchor- free 3d object detection approach via firing of hotspots. In European Conference on Com- puter Vision, pages 68–84. Springer, 2020. 134 [25] Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1907–1915, 2017. [26] Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. Multi-view 3d object detection network for autonomous driving. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6526–6534, 2017. [27] Xuelin Chen, Baoquan Chen, and Niloy J Mitra. Unpaired point cloud completion on real scans using adversarial training. In ICLR, 2020. [28] Yilun Chen, Shu Liu, Xiaoyong Shen, and Jiaya Jia. Fast point r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 9775–9784, 2019. [29] Yilun Chen, Shu Liu, Xiaoyong Shen, and Jiaya Jia. Dsgn: Deep stereo geometry network for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12536–12545, 2020. [30] Yuhua Chen, Wen Li, Christos Sakaridis, Dengxin Dai, and Luc Van Gool. Domain adaptive faster r-cnn for object detection in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3339–3348, 2018. [31] Zhiqin Chen and Hao Zhang. Learning implicit fields for generative shape modeling. arXiv preprint arXiv:1812.02822, 2018. [32] Shuo Cheng, Zexiang Xu, Shilin Zhu, Zhuwen Li, Li Erran Li, Ravi Ramamoorthi, and Hao Su. Deep stereo using adaptive thin volume representation with uncertainty awareness. In Proceedings of the CVPR, pages 2524–2534, 2020. [33] Jaehoon Choi, Taekyung Kim, and Changick Kim. Self-ensembling with gan-based data augmentation for domain adaptation in semantic segmentation. In Proceedings of the IEEE international conference on computer vision, pages 6830–6840, 2019. [34] Christopher Choy, JunYoung Gwak, and Silvio Savarese. 4d spatio-temporal convnets: Minkowski convolutional neural networks. In Proceedings of the IEEE Conference on Com- puter Vision and Pattern Recognition, pages 3075–3084, 2019. [35] Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3d- r2n2: A unified approach for single and multi-view 3d object reconstruction. In ECCV, 2016. [36] Özgün Çiçek, Ahmed Abdulkadir, Soeren S Lienkamp, Thomas Brox, and Olaf Ron- neberger. 3d u-net: learning dense volumetric segmentation from sparse annotation. 
In International conference on medical image computing and computer-assisted intervention, pages 424–432. Springer, 2016. [37] Thomas Cover and Peter Hart. Nearest neighbor pattern classification. IEEE transactions on information theory, 13(1):21–27, 1967. 135 [38] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017. [39] Angela Dai, Charles Ruizhongtai Qi, and Matthias Nießner. Shape completion using 3d- encoder-predictor cnns and shape synthesis. 2017. [40] Paul Debevec, Yizhou Yu, and George Borshukov. Efficient view-dependent image-based rendering with projective texture-mapping. In Rendering Techniques’ 98, pages 105–116. 1998. [41] Jiajun Deng, Shaoshuai Shi, Peiwei Li, Wengang Zhou, Yanyong Zhang, and Houqiang Li. V oxel r-cnn: Towards high performance voxel-based 3d object detection. arXiv:2012.15712, 2020. [42] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre- training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018. [43] Jiahua Dong, Yang Cong, Gan Sun, and Dongdong Hou. Semantic-transferable weakly- supervised endoscopic lesions segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 10712–10721, 2019. [44] Robert A Drebin, Loren Carpenter, and Pat Hanrahan. V olume rendering. ACM Siggraph Computer Graphics, 22(4):65–74, 1988. [45] Liang Du, Xiaoqing Ye, Xiao Tan, Jianfeng Feng, Zhenbo Xu, Errui Ding, and Shilei Wen. Associate-3ddet: Perceptual-to-conceptual association for 3d point cloud object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13329–13338, 2020. [46] Xinxin Du, Marcelo H Ang, Sertac Karaman, and Daniela Rus. A general pipeline for 3d detection of vehicles. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 3194–3200. IEEE, 2018. [47] Y Eldar. Irregular image sampling using the voronoi diagram. PhD thesis, M. Sc. thesis, Technion-IIT, Israel, 1992. [48] Yuval Eldar, Michael Lindenbaum, Moshe Porat, and Yehoshua Y Zeevi. The farthest point strategy for progressive image sampling. IEEE Transactions on Image Processing, 6(9):1305–1315, 1997. [49] Haoqiang Fan, Hao Su, and Leonidas J Guibas. A point set generation network for 3d object reconstruction from a single image. In CVPR, 2017. [50] Olivier D Faugeras and Steve Maybank. Motion from point matches: multiplicity of solu- tions. International Journal of Computer Vision, 4(3):225–246, 1990. [51] Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multiview stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(8):1362–1376, 2009. 136 [52] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropaga- tion. In International conference on machine learning, pages 1180–1189. PMLR, 2015. [53] Runzhou Ge, Zhuangzhuang Ding, Yihan Hu, Yu Wang, Sijia Chen, Li Huang, and Yuan Li. Afdet: Anchor free one stage 3d object detection. arXiv preprint arXiv:2006.12671, 2020. [54] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013. [55] R. Girdhar, D.F. Fouhey, M. Rodriguez, and A. Gupta. Learning a predictable and generative vector representation for objects. In ECCV, 2016. 
[56] Boqing Gong, Kristen Grauman, and Fei Sha. Connecting the dots with landmarks: Dis- criminatively learning domain-invariant features for unsupervised domain adaptation. In International Conference on Machine Learning, pages 222–230, 2013. [57] Guy G Goyer and Robert Watson. The laser and its application to meteorology. Bulletin of the American Meteorological Society, 44(9):564–570, 1963. [58] Benjamin Graham and Laurens van der Maaten. Submanifold sparse convolutional net- works. arXiv preprint arXiv:1706.01307, 2017. [59] Thibault Groueix, Matthew Fisher, Vladimir G. Kim, Bryan Russell, and Mathieu Aubry. AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation. In CVPR, 2018. [60] Xiuye Gu, Yijie Wang, Chongruo Wu, Yong Jae Lee, and Panqu Wang. Hplflownet: Hier- archical permutohedral lattice flownet for scene flow estimation on large-scale point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3254–3263, 2019. [61] Chenhang He, Hui Zeng, Jianqiang Huang, Xian-Sheng Hua, and Lei Zhang. Structure aware single-stage 3d object detection from point cloud. In Proceedings of the IEEE Con- ference on Computer Vision and Pattern Recognition, 2020. [62] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015. [63] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. [64] Zhenwei He and Lei Zhang. Multi-adversarial faster-rcnn for unrestricted object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 6668– 6677, 2019. [65] Peter Hedman, Pratul P Srinivasan, Ben Mildenhall, Jonathan T Barron, and Paul De- bevec. Baking neural radiance fields for real-time view synthesis. arXiv preprint arXiv:2103.14645, 2021. 137 [66] Han-Kai Hsu, Chun-Han Yao, Yi-Hsuan Tsai, Wei-Chih Hung, Hung-Yu Tseng, Maneesh Singh, and Ming-Hsuan Yang. Progressive domain adaptation for object detection. In The IEEE Winter Conference on Applications of Computer Vision, pages 749–757, 2020. [67] Peiyun Hu, Jason Ziglar, David Held, and Deva Ramanan. What you see is what you get: Exploiting visibility for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020. [68] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017. [69] Po-Han Huang, Kevin Matzen, Johannes Kopf, Narendra Ahuja, and Jia-Bin Huang. Deep- mvs: Learning multi-view stereopsis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2821–2830, 2018. [70] Qiangui Huang, Weiyue Wang, and Ulrich Neumann. Recurrent slice networks for 3d seg- mentation of point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2626–2635, 2018. [71] Tengteng Huang, Zhe Liu, Xiwu Chen, and Xiang Bai. Epnet: Enhancing point features with image semantics for 3d object detection. In European Conference on Computer Vision, pages 35–52. Springer, 2020. [72] Eldar Insafutdinov and Alexey Dosovitskiy. 
Unsupervised learning of shape and pose with differentiable point clouds. In NeurIPS, 2018. [73] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network train- ing by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015. [74] Rasmus Jensen, Anders Dahl, George V ogiatzis, Engil Tola, and Henrik Aanæs. Large scale multi-view stereopsis evaluation. In 2014 CVPR, pages 406–413. IEEE, 2014. [75] Mengqi Ji, Juergen Gall, Haitian Zheng, Yebin Liu, and Lu Fang. SurfaceNet: An end-to- end 3D neural network for multiview stereopsis. In Proc. ICCV, 2017. [76] Angjoo Kanazawa, Shubham Tulsiani, Alexei A Efros, and Jitendra Malik. Learning category-specific mesh reconstruction from image collections. In Proc. ECCV, 2018. [77] Michael Kazhdan, Matthew Bolitho, and Hugues Hoppe. Poisson surface reconstruction. In Proc. Eurographics Symposium on Geometry Processing, volume 7, 2006. [78] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. [79] Roman Klokov and Victor Lempitsky. Escape from cells: Deep kd-networks for the recog- nition of 3d point cloud models. In Proceedings of the IEEE International Conference on Computer Vision, pages 863–872, 2017. 138 [80] Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics, 36(4), 2017. [81] Artem Komarichev, Zichun Zhong, and Jing Hua. A-cnn: Annularly convolutional neural networks on point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7421–7430, 2019. [82] Hendrik Königshof, Niels Ole Salscheider, and Christoph Stiller. Realtime 3d object de- tection for automated driving using stereo vision and semantic information. In 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pages 1405–1410. IEEE, 2019. [83] Georgios Kopanas, Julien Philip, Thomas Leimkühler, and George Drettakis. Point-based neural rendering with per-view optimization. In Computer Graphics Forum, volume 40, pages 29–43. Wiley Online Library, 2021. [84] Kiriakos N Kutulakos and Steven M Seitz. A theory of shape by space carving. International Journal of Computer Vision, 38(3):199–218, 2000. [85] Shiyi Lan, Ruichi Yu, Gang Yu, and Larry S Davis. Modeling local geometric structure of 3d point clouds using geo-cnn. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 998–1008, 2019. [86] Loic Landrieu and Mohamed Boussaha. Point cloud oversegmentation with graph- structured deep metric learning. arXiv preprint arXiv:1904.02113, 2019. [87] Loic Landrieu and Martin Simonovsky. Large-scale point cloud semantic segmentation with superpoint graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4558–4567, 2018. [88] Alex H Lang, Sourabh V ora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12697–12705, 2019. [89] Christoph Lassner and Michael Zollhofer. Pulsar: Efficient sphere-based neural rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1440–1449, 2021. [90] Truc Le and Ye Duan. Pointgrid: A deep network for 3d shape understanding. In Proceed- ings of the IEEE conference on computer vision and pattern recognition, pages 9204–9214, 2018. 
[91] Marc Levoy. Display of surfaces from volume data. IEEE Computer graphics and Applica- tions, 8(3):29–37, 1988. [92] Buyu Li, Wanli Ouyang, Lu Sheng, Xingyu Zeng, and Xiaogang Wang. Gs3d: An effi- cient 3d object detection framework for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1019–1028, 2019. 139 [93] Guohao Li, Matthias Müller, Guocheng Qian, Itzel C Delgadillo, Abdulellah Abualshour, Ali Thabet, and Bernard Ghanem. Deepgcns: Making gcns go as deep as cnns. arXiv preprint arXiv:1910.06849, 2019. [94] Jiaxin Li, Ben M Chen, and Gim Hee Lee. So-net: Self-organizing network for point cloud analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9397–9406, 2018. [95] Peiliang Li, Xiaozhi Chen, and Shaojie Shen. Stereo r-cnn based 3d object detection for au- tonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7644–7652, 2019. [96] Ruihui Li, Xianzhi Li, Chi-Wing Fu, Daniel Cohen-Or, and Pheng-Ann Heng. Pu-gan: a point cloud upsampling adversarial network. In Proceedings of the IEEE International Conference on Computer Vision, pages 7203–7212, 2019. [97] Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. Pointcnn: Convolution on x-transformed points. In Advances in Neural Information Processing Sys- tems, pages 820–830, 2018. [98] Zhengqi Li, Simon Niklaus, Noah Snavely, and Oliver Wang. Neural scene flow fields for space-time view synthesis of dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6498–6508, 2021. [99] Ziyu Li, Yuncong Yao, Zhibin Quan, Wankou Yang, and Jin Xie. Sienet: Spatial infor- mation enhancement network for 3d object detection from point cloud. arXiv preprint arXiv:2103.15396, 2021. [100] Ming Liang, Bin Yang, Yun Chen, Rui Hu, and Raquel Urtasun. Multi-task multi-sensor fusion for 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7345–7353, 2019. [101] Ming Liang, Bin Yang, Shenlong Wang, and Raquel Urtasun. Deep continuous fusion for multi-sensor 3d object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 641–656, 2018. [102] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017. [103] Fayao Liu, Chunhua Shen, Guosheng Lin, and Ian Reid. Learning depth from single monoc- ular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10):2024–2039, 2016. [104] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields. arXiv preprint arXiv:2007.11571, 2020. 140 [105] Xinhai Liu, Zhizhong Han, Yu-Shen Liu, and Matthias Zwicker. Point2sequence: Learning the shape representation of 3d point clouds with an attention-based sequence to sequence network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8778–8785, 2019. [106] Yongcheng Liu, Bin Fan, Gaofeng Meng, Jiwen Lu, Shiming Xiang, and Chunhong Pan. Densepoint: Learning densely contextual representation for efficient point cloud processing. In Proceedings of the IEEE International Conference on Computer Vision, pages 5239– 5248, 2019. [107] Yongcheng Liu, Bin Fan, Shiming Xiang, and Chunhong Pan. 
Relation-shape convolu- tional neural network for point cloud analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8895–8904, 2019. [108] Zhe Liu, Xin Zhao, Tengteng Huang, Ruolan Hu, Yu Zhou, and Xiang Bai. Tanet: Robust 3d object detection from point clouds with triple attention. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 11677–11684, 2020. [109] Zhijian Liu, Haotian Tang, Yujun Lin, and Song Han. Point-voxel cnn for efficient 3d deep learning. arXiv preprint arXiv:1907.03739, 2019. [110] Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh. Neural volumes: Learning dynamic renderable volumes from images. arXiv preprint arXiv:1906.07751, 2019. [111] William E Lorensen and Harvey E Cline. Marching cubes: A high resolution 3d surface construction algorithm. In ACM siggraph computer graphics, 1987. [112] Ricardo Martin-Brualla, Noha Radwan, Mehdi SM Sajjadi, Jonathan T Barron, Alexey Dosovitskiy, and Daniel Duckworth. Nerf in the wild: Neural radiance fields for uncon- strained photo collections. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7210–7219, 2021. [113] Jon Mathews and Robert Lee Walker. Mathematical methods of physics, volume 501. WA Benjamin New York, 1970. [114] Daniel Maturana and Sebastian Scherer. V oxnet: A 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 922–928. IEEE, 2015. [115] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3d reconstruction in function space. In CVPR, 2019. [116] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3d reconstruction in function space. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4460–4470, 2019. 141 [117] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3d reconstruction in function space. Proc. CVPR, 2019. [118] Moustafa Meshry, Dan B Goldman, Sameh Khamis, Hugues Hoppe, Rohit Pandey, Noah Snavely, and Ricardo Martin-Brualla. Neural rerendering in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6878–6887, 2019. [119] Gregory P Meyer, Ankit Laddha, Eric Kee, Carlos Vallespi-Gonzalez, and Carl K Welling- ton. Lasernet: An efficient probabilistic 3d object detector for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12677–12686, 2019. [120] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoor- thi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In European conference on computer vision, pages 405–421. Springer, 2020. [121] Pietro Morerio, Jacopo Cavazza, and Vittorio Murino. Minimal-entropy correlation align- ment for unsupervised deep domain adaptation. arXiv preprint arXiv:1711.10288, 2017. [122] Don Murray and James J Little. Using real-time stereo vision for mobile robot navigation. autonomous robots, 8(2):161–171, 2000. [123] Mahyar Najibi, Guangda Lai, Abhijit Kundu, Zhichao Lu, Vivek Rathod, Thomas Funkhouser, Caroline Pantofaru, David Ross, Larry S Davis, and Alireza Fathi. 
Dops: Learning to detect 3d objects and predict their 3d shapes. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11913–11922, 2020. [124] Michael Niemeyer and Andreas Geiger. Giraffe: Representing scenes as compositional generative neural feature fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11453–11464, 2021. [125] Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. Differentiable volumetric rendering: Learning implicit 3d representations without 3d supervision. In Proc. CVPR, 2020. [126] Yue Pan, Pengchuan Xiao, Yujie He, Zhenlei Shao, and Zesong Li. Mulls: Versatile lidar slam via multi-metric linear least square. arXiv preprint arXiv:2102.03771, 2021. [127] Su Pang, Daniel Morris, and Hayder Radha. Clocs: Camera-lidar object candidates fusion for 3d object detection. arXiv preprint arXiv:2009.00784, 2020. [128] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. Deepsdf: Learning continuous signed distance functions for shape representation. arXiv preprint arXiv:1901.05103, 2019. 142 [129] Keunhong Park, Utkarsh Sinha, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Steven M Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5865– 5874, 2021. [130] Keunhong Park, Utkarsh Sinha, Peter Hedman, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Ricardo Martin-Brualla, and Steven M Seitz. Hypernerf: A higher- dimensional representation for topologically varying neural radiance fields. arXiv preprint arXiv:2106.13228, 2021. [131] Taesung Park, Alexei A Efros, Richard Zhang, and Jun-Yan Zhu. Contrastive learning for unpaired image-to-image translation. arXiv preprint arXiv:2007.15651, 2020. [132] Siméon-Denis Poisson. Mémoire sur la théorie du magnétisme en movement. L’Académie, 1826. [133] Alex D Pon, Jason Ku, Chengyao Li, and Steven L Waslander. Object-centric stereo match- ing for 3d object detection. In 2020 IEEE International Conference on Robotics and Au- tomation (ICRA), pages 8383–8389. IEEE, 2020. [134] Charles R Qi, Or Litany, Kaiming He, and Leonidas J Guibas. Deep hough voting for 3d object detection in point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9277–9286, 2019. [135] Charles R Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J Guibas. Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 918–927, 2018. [136] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017. [137] Charles R Qi, Hao Su, Matthias Nießner, Angela Dai, Mengyuan Yan, and Leonidas J Guibas. V olumetric and multi-view cnns for object classification on 3d data. In CVPR, 2016. [138] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierar- chical feature learning on point sets in a metric space. In Advances in neural information processing systems, pages 5099–5108, 2017. [139] Can Qin, Haoxuan You, Lichen Wang, C-C Jay Kuo, and Yun Fu. Pointdan: A multi- scale 3d domain adaption network for point cloud representation. In Advances in Neural Information Processing Systems, pages 7192–7203, 2019. 
[140] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning trans- ferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021. 143 [141] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022. [142] Gernot Riegler, Ali Osman Ulusoy, and Andreas Geiger. Octnet: Learning deep 3d repre- sentations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3577–3586, 2017. [143] Khaled Saleh, Ahmed Abobakr, Mohammed Attia, Julie Iskander, Darius Nahavandi, Mo- hammed Hossny, and Saeid Nahvandi. Domain adaptation for vehicle detection from bird’s eye view lidar point cloud data. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 0–0, 2019. [144] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proc. CVPR, 2016. [145] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pix- elwise View Selection for Unstructured Multi-View Stereo. In European Conference on Computer Vision (ECCV), 2016. [146] Katja Schwarz, Yiyi Liao, Michael Niemeyer, and Andreas Geiger. Graf: Generative radi- ance fields for 3d-aware image synthesis. arXiv preprint arXiv:2007.02442, 2020. [147] Steven M Seitz and Charles R Dyer. Photorealistic scene reconstruction by voxel coloring. International Journal of Computer Vision, 35(2):151–173, 1999. [148] Yuhu Shan, Wen Feng Lu, and Chee Meng Chew. Pixel and feature level based domain adaptation for object detection in autonomous driving. Neurocomputing, 367:31–38, 2019. [149] S. Shi, Z. Wang, J. Shi, X. Wang, and H. Li. From points to parts: 3d object detection from point cloud with part-aware and part-aggregation network. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2020. [150] Shaoshuai Shi, Chaoxu Guo, Li Jiang, Zhe Wang, Jianping Shi, Xiaogang Wang, and Hong- sheng Li. Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10529–10538, 2020. [151] Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li. Pointrcnn: 3d object proposal genera- tion and detection from point cloud. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–779, 2019. [152] Shaoshuai Shi, Zhe Wang, Jianping Shi, Xiaogang Wang, and Hongsheng Li. From points to parts: 3d object detection from point cloud with part-aware and part-aggregation network. arXiv preprint arXiv:1907.03670, 2019. [153] Weijing Shi and Raj Rajkumar. Point-gnn: Graph neural network for 3d object detection in a point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1711–1719, 2020. 144 [154] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In European conference on computer vision, pages 746–760. Springer, 2012. [155] Martin Simonovsky and Nikos Komodakis. Dynamic edge-conditioned filters in convolu- tional neural networks on graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3693–3702, 2017. [156] Karen Simonyan and Andrew Zisserman. 
Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. [157] Fun Shing Sin, Daniel Schroeder, and Jernej Barbiˇ c. Vega: non-linear fem deformable object simulator. In Computer Graphics Forum, volume 32, pages 36–48. Wiley Online Library, 2013. [158] Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein, and Michael Zollhofer. Deepvoxels: Learning persistent 3D feature embeddings. In Proc. CVPR, 2019. [159] Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene representation networks: Continuous 3d-structure-aware neural scene representations. arXiv preprint arXiv:1906.01618, 2019. [160] Hang Su, Varun Jampani, Deqing Sun, Subhransu Maji, Evangelos Kalogerakis, Ming- Hsuan Yang, and Jan Kautz. Splatnet: Sparse lattice networks for point cloud processing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2530–2539, 2018. [161] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. Scala- bility in perception for autonomous driving: Waymo open dataset, 2019. [162] Chengzhou Tang and Ping Tan. BA-net: Dense bundle adjustment network. In Proc. ICLR, 2019. [163] Maxim Tatarchenko, Jaesik Park, Vladlen Koltun, and Qian-Yi Zhou. Tangent convolutions for dense prediction in 3d. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3887–3896, 2018. [164] Maxim Tatarchenko, Stephan R Richter, René Ranftl, Zhuwen Li, Vladlen Koltun, and Thomas Brox. What do single-view 3d reconstruction networks learn? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3405–3414, 2019. [165] Lyne Tchapmi, Christopher Choy, Iro Armeni, JunYoung Gwak, and Silvio Savarese. Seg- cloud: Semantic segmentation of 3d point clouds. In 2017 International Conference on 3D Vision (3DV), pages 537–547. IEEE, 2017. 145 [166] OpenPCDet Development Team. Openpcdet: An open-source toolbox for 3d object detec- tion from point clouds. https://github.com/open-mmlab/OpenPCDet, 2020. [167] Hugues Thomas, Charles R Qi, Jean-Emmanuel Deschaud, Beatriz Marcotegui, François Goulette, and Leonidas J Guibas. Kpconv: Flexible and deformable convolution for point clouds. arXiv preprint arXiv:1904.08889, 2019. [168] Carlo Tomasi and Takeo Kanade. Shape and motion from image streams under orthography: a factorization method. International journal of computer vision, 9(2):137–154, 1992. [169] Roger Tsai. A versatile camera calibration technique for high-accuracy 3d machine vision metrology using off-the-shelf tv cameras and lenses. IEEE Journal on Robotics and Au- tomation, 3(4):323–344, 1987. [170] Shubham Tulsiani, Tinghui Zhou, Alexei A. Efros, and Jitendra Malik. Multi-view supervi- sion for single-view reconstruction via differentiable ray consistency. In CVPR, 2017. [171] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7167–7176, 2017. [172] Sudheendra Vijayanarasimhan, Susanna Ricco, Cordelia Schmid, Rahul Sukthankar, and Katerina Fragkiadaki. Sfm-net: Learning of structure and motion from video. arXiv preprint arXiv:1704.07804, 2017. 
[173] Chu Wang, Babak Samari, and Kaleem Siddiqi. Local spectral graph convolution for point set feature learning. In Proceedings of the European Conference on Computer Vi- sion (ECCV), pages 52–66, 2018. [174] Dominic Zeng Wang and Ingmar Posner. V oting for voting in online point cloud object detection. In Robotics: Science and Systems, volume 1, pages 10–15607, 2015. [175] Jinglu Wang, Bo Sun, and Yan Lu. MVPnet: Multi-view point regression networks for 3D object reconstruction from a single image. Proc. AAAI Conference on Artificial Intelligence, 2019. [176] Lei Wang, Yuchun Huang, Yaolin Hou, Shenman Zhang, and Jie Shan. Graph attention convolution for point cloud semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10296–10305, 2019. [177] Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. Pixel2mesh: Generating 3d mesh models from single rgb images. arXiv preprint arXiv:1804.01654, 2018. [178] Peng-Shuai Wang, Yang Liu, Yu-Xiao Guo, Chun-Yu Sun, and Xin Tong. O-CNN: Octree- based Convolutional Neural Networks for 3D Shape Analysis. ACM Transactions on Graph- ics (SIGGRAPH), 36(4), 2017. 146 [179] Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul Srinivasan, Howard Zhou, Jonathan T. Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. Ibr- net: Learning multi-view image-based rendering. In CVPR, 2021. [180] Weiyue Wang, Duygu Ceylan, Radomir Mech, and Ulrich Neumann. 3dn: 3d deformation network. In CVPR, 2019. [181] Weiyue Wang, Qiangui Huang, Suya You, Chao Yang, and Ulrich Neumann. Shape inpaint- ing using 3d generative adversarial network and recurrent convolutional networks. In ICCV, 2017. [182] Yan Wang, Xiangyu Chen, Yurong You, Li Erran Li, Bharath Hariharan, Mark Campbell, Kilian Q Weinberger, and Wei-Lun Chao. Train in germany, test in the usa: Making 3d object detectors generalize. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11713–11723, 2020. [183] Yue Wang, Alireza Fathi, Abhijit Kundu, David Ross, Caroline Pantofaru, Tom Funkhouser, and Justin Solomon. Pillar-based object detection for autonomous driving. arXiv preprint arXiv:2007.10323, 2020. [184] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (TOG), 38(5):146, 2019. [185] W Weng and X Zhu. Convolutional networks for biomedical image segmentation. IEEE Access, 2015. [186] Olivia Wiles, Georgia Gkioxari, Richard Szeliski, and Justin Johnson. Synsin: End-to- end view synthesis from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7467–7477, 2020. [187] Bichen Wu, Xuanyu Zhou, Sicheng Zhao, Xiangyu Yue, and Kurt Keutzer. Squeezesegv2: Improved model structure and unsupervised domain adaptation for road-object segmenta- tion from a lidar point cloud. In 2019 International Conference on Robotics and Automation (ICRA), pages 4376–4382. IEEE, 2019. [188] Jiajun Wu, Yifan Wang, Tianfan Xue, Xingyuan Sun, William T Freeman, and Joshua B Tenenbaum. MarrNet: 3D Shape Reconstruction via 2.5D Sketches. In NeurIPS, 2017. [189] Jiajun Wu, Chengkai Zhang, Xiuming Zhang, Zhoutong Zhang, William T. Freeman, and Joshua B. Tenenbaum. Learning shape priors for single-view 3d completion and reconstruc- tion. In NeurIPS, 2018. [190] Wenxuan Wu, Zhongang Qi, and Li Fuxin. 
Pointconv: Deep convolutional networks on 3d point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9621–9630, 2019. [191] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1912–1920, 2015. 147 [192] Sijin Chen Li Jiang Chi-Wing Fu Wu Zheng, Weiliang Tang. Cia-ssd: Confident iou-aware single-stage object detector from point cloud. In AAAI, 2021. [193] Fanbo Xiang, Zexiang Xu, Milos Hasan, Yannick Hold-Geoffroy, Kalyan Sunkavalli, and Hao Su. Neutex: Neural texture mapping for volumetric neural rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7119–7128, 2021. [194] Haozhe Xie, Hongxun Yao, Shangchen Zhou, Jiageng Mao, Shengping Zhang, and Wenxiu Sun. Grnet: Gridding residual network for dense point cloud completion. arXiv preprint arXiv:2006.03761, 2020. [195] Saining Xie, Sainan Liu, Zeyu Chen, and Zhuowen Tu. Attentional shapecontextnet for point cloud recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4606–4615, 2018. [196] Christopher B Choy Danfei Xu, JunYoung Gwak, and Kevin Chen Silvio Savarese. 3d- r2n2: A unified approach for single and multi-view 3d object reconstruction. arXiv preprint arXiv:1604.00449, 2016. [197] Hongyi Xu and Jernej Barbiˇ c. Signed distance fields for polygon soup meshes. In Proceed- ings of Graphics Interface 2014, pages 35–41. Canadian Information Processing Society, 2014. [198] Qiangeng Xu, Xudong Sun, Cho-Ying Wu, Panqu Wang, and Ulrich Neumann. Grid-gcn for fast and scalable point cloud learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5661–5670, 2020. [199] Qiangeng Xu, Weiyue Wang, Duygu Ceylan, Radomir Mech, and Ulrich Neumann. Disn: Deep implicit surface network for high-quality single-view 3d reconstruction. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 492–502. Curran Associates, Inc., 2019. [200] Qiangeng Xu, Zexiang Xu, Julien Philip, Sai Bi, Zhixin Shu, Kalyan Sunkavalli, and Ulrich Neumann. Point-nerf: Point-based neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5438–5448, 2022. [201] Qiangeng Xu, Yiqi Zhong, and Ulrich Neumann. Behind the curtain: Learning occluded shapes for 3d object detection. In Proceedings of the AAAI Conference on Artificial Intelli- gence, volume 36, pages 2893–2901, 2022. [202] Qiangeng Xu, Yin Zhou, Weiyue Wang, Charles R Qi, and Dragomir Anguelov. Spg: Un- supervised domain adaptation for 3d object detection via semantic point generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15446– 15456, 2021. [203] Yifan Xu, Tianqi Fan, Mingye Xu, Long Zeng, and Yu Qiao. Spidercnn: Deep learning on point sets with parameterized convolutional filters. In Proceedings of the European Conference on Computer Vision (ECCV), pages 87–102, 2018. 148 [204] Zhenbo Xu, Wei Zhang, Xiaoqing Ye, Xiao Tan, Wei Yang, Shilei Wen, Errui Ding, Ajin Meng, and Liusheng Huang. Zoomnet: Part-aware adaptive zooming neural network for 3d object detection. In AAAI, pages 12557–12564, 2020. 
[205] Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. Perspective trans- former nets: Learning single-view 3d object reconstruction without 3d supervision. In NeurIPS, 2016. [206] Xu Yan, Jiantao Gao, Jie Li, Ruimao Zhang, Zhen Li, Rui Huang, and Shuguang Cui. Sparse single sweep lidar point cloud segmentation via learning contextual shape priors from scene completion. arXiv preprint arXiv:2012.03762, 2020. [207] Yan Yan, Yuxing Mao, and Bo Li. Second: Sparsely embedded convolutional detection. Sensors, 18(10):3337, 2018. [208] Bin Yang, Wenjie Luo, and Raquel Urtasun. Pixor: Real-time 3d object detection from point clouds. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 7652–7660, 2018. [209] Guandao Yang, Yin Cui, Serge Belongie, and Bharath Hariharan. Learning single-view 3d reconstruction with limited pose supervision. In ECCV, 2018. [210] Jiancheng Yang, Qiang Zhang, Bingbing Ni, Linguo Li, Jinxian Liu, Mengdie Zhou, and Qi Tian. Modeling point clouds with self-attention and gumbel subset sampling. In Pro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3323– 3332, 2019. [211] Yaoqing Yang, Chen Feng, Yiru Shen, and Dong Tian. Foldingnet: Point cloud auto-encoder via deep grid deformation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 206–215, 2018. [212] Zetong Yang, Yanan Sun, Shu Liu, and Jiaya Jia. 3dssd: Point-based 3d single stage object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11040–11048, 2020. [213] Zetong Yang, Yanan Sun, Shu Liu, Xiaoyong Shen, and Jiaya Jia. Std: Sparse-to-dense 3d object detector for point cloud. In Proceedings of the IEEE International Conference on Computer Vision, pages 1951–1960, 2019. [214] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. MVSnet: Depth inference for unstructured multi-view stereo. In Proc. ECCV, pages 767–783, 2018. [215] Lior Yariv, Yoni Kasten, Dror Moran, Meirav Galun, Matan Atzmon, Basri Ronen, and Yaron Lipman. Multiview neural surface reconstruction by disentangling geometry and appearance. In Proc. NeurIPS, 2020. [216] Maosheng Ye, Shuangjie Xu, and Tongyi Cao. Hvnet: Hybrid voxel network for lidar based 3d object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1631–1640, 2020. 149 [217] Hongwei Yi, Shaoshuai Shi, Mingyu Ding, Jiankai Sun, Kui Xu, Hui Zhou, Zhe Wang, Sheng Li, and Guoping Wang. Segvoxelnet: Exploring semantic context and depth-aware features for 3d vehicle detection from point cloud. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 2274–2280. IEEE, 2020. [218] Wang Yifan, Shihao Wu, Hui Huang, Daniel Cohen-Or, and Olga Sorkine-Hornung. Patch- based progressive 3d point set upsampling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5958–5967, 2019. [219] Jin Hyeok Yoo, Yecheol Kim, Ji Song Kim, and Jun Won Choi. 3d-cvf: Generating joint camera and lidar features using cross-view spatial feature fusion for 3d object detection. arXiv preprint arXiv:2004.12636, 3, 2020. [220] Alex Yu, Ruilong Li, Matthew Tancik, Hao Li, Ren Ng, and Angjoo Kanazawa. Plenoctrees for real-time rendering of neural radiance fields. arXiv preprint arXiv:2103.14024, 2021. [221] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelnerf: Neural radiance fields from one or few images. In CVPR, 2021. 
[222] Lequan Yu, Xianzhi Li, Chi-Wing Fu, Daniel Cohen-Or, and Pheng-Ann Heng. Pu-net: Point cloud upsampling network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2790–2799, 2018. [223] Wentao Yuan, Tejas Khot, David Held, Christoph Mertz, and Martial Hebert. Pcn: Point completion network. In 2018 International Conference on 3D Vision (3DV), pages 728–737. IEEE, 2018. [224] Christopher Zach, David Gallup, Jan-Michael Frahm, and Marc Niethammer. Fast global labeling for real-time stereo using multiple plane sweeps. In VMV, pages 243–252, 2008. [225] Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun. Nerf++: Analyzing and improving neural radiance fields. arXiv preprint arXiv:2010.07492, 2020. [226] Kuangen Zhang, Ming Hao, Jing Wang, Clarence W de Silva, and Chenglong Fu. Linked dynamic graph cnn: Learning on point cloud via linking hierarchical features. arXiv preprint arXiv:1904.10014, 2019. [227] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unrea- sonable effectiveness of deep features as a perceptual metric. In CVPR, 2018. [228] Zhengyou Zhang. Microsoft kinect sensor and its effect. IEEE multimedia, 19(2):4–10, 2012. [229] Zhiyuan Zhang, Binh-Son Hua, and Sai-Kit Yeung. Shellnet: Efficient point cloud convolu- tional neural networks using concentric shells statistics. arXiv preprint arXiv:1908.06295, 2019. [230] Yiqi Zhong, Cho-Ying Wu, Suya You, and Ulrich Neumann. Deep rgb-d canonical correla- tion analysis for sparse depth completion. In NeurIPS, 2019. 150 [231] Qian-Yi Zhou and Vladlen Koltun. Color map optimization for 3D reconstruction with consumer depth cameras. ACM Transactions on Graphics, 33(4):155, 2014. [232] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magni- fication: learning view synthesis using multiplane images. ACM Transactions on Graphics, 37(4):1–12, 2018. [233] Xingyi Zhou, Arjun Karpur, Chuang Gan, Linjie Luo, and Qixing Huang. Unsupervised domain adaptation for 3d keypoint estimation via view consistency. In Proceedings of the European Conference on Computer Vision (ECCV), pages 137–153, 2018. [234] Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representations in neural networks. arXiv preprint arXiv:1812.07035, 2018. [235] Yin Zhou, Pei Sun, Yu Zhang, Dragomir Anguelov, Jiyang Gao, Tom Ouyang, James Guo, Jiquan Ngiam, and Vijay Vasudevan. End-to-end multi-view fusion for 3d object detection in lidar point clouds. In Conference on Robot Learning, pages 923–932, 2020. [236] Yin Zhou and Oncel Tuzel. V oxelnet: End-to-end learning for point cloud based 3d ob- ject detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4490–4499, 2018. [237] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE inter- national conference on computer vision, pages 2223–2232, 2017. [238] Rui Zhu, Hamed Kiani Galoogahi, Chaoyang Wang, and Simon Lucey. Rethinking repro- jection: Closing the loop for pose-aware shape reconstruction from a single image. In ICCV, 2017. 151