Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Green learning for 3D point cloud data processing
(USC Thesis Other)
Green learning for 3D point cloud data processing
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
Green Learning for 3D Point Cloud Data Processing by Pranav Kadam A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING) May 2023 Copyright 2023 Pranav Kadam ii Table of Contents List Of Tables ..................................................................................................................................v List Of Figures .............................................................................................................................. vii Abstract .......................................................................................................................................... xi Chapter 1: Introduction ..................................................................................................................1 1.1 Significance of the Research ............................................................................................ 1 1.2 Contributions of the Research .......................................................................................... 6 1.2.1 Unsupervised Point Cloud Registration............................................................... 6 1.2.2 Point Cloud Pose Estimation and SO(3) Invariant Classification ....................... 7 1.2.3 Point Cloud Odometry and Scene Flow Estimation ............................................ 8 1.3 Organization of the Dissertation ...................................................................................... 9 Chapter 2: Background Review .................................................................................................. 10 2.1 Traditional Methods for Registration ............................................................................. 10 2.1.1 Iterative Closest Point (ICP) .............................................................................. 10 2.1.2 Point-to-Plane ICP ............................................................................................. 12 2.1.3 Generalized ICP ................................................................................................. 12 2.1.4 Global Registration ............................................................................................ 15 2.2 Deep Learning Methods for Registration ....................................................................... 15 2.2.1 PointNetLK ........................................................................................................ 15 2.2.2 Deep Closest Point (DCP) ................................................................................. 19 2.2.3 3DMatch ............................................................................................................ 24 2.2.4 PPFNet ............................................................................................................... 25 2.3 Point Cloud Classification and Pose Estimation ............................................................ 29 2.4 Point Cloud Odometry and Scene Flow Estimation ...................................................... 31 2.5 Green Learning for Point Clouds ................................................................................... 33 2.5.1 PointHop ............................................................................................................ 33 2.5.2 PointHop++ ....................................................................................................... 
34 2.6 Datasets .......................................................................................................................... 36 2.6.1 ModelNet40 ....................................................................................................... 36 2.6.2 3DMatch ............................................................................................................ 36 2.6.3 KITTI and Argoverse ........................................................................................ 37 Chapter 3: Point Cloud Registration ........................................................................................... 39 iii 3.1 Introduction .................................................................................................................... 39 3.2 Salient Points Analysis (SPA) ........................................................................................ 40 3.2.1 Problem Statement ............................................................................................. 40 3.2.2 Feature Learning ................................................................................................ 41 3.2.3 Salient Points Selection ..................................................................................... 42 3.2.4 Point Correspondence and Transformation Estimation ..................................... 44 3.2.5 Experimental Results ......................................................................................... 45 3.3 R-PointHop .................................................................................................................... 49 3.3.1 Problem Statement ............................................................................................. 49 3.3.2 Feature Learning ................................................................................................ 50 3.3.3 Point Correspondences ...................................................................................... 56 3.3.4 Transformation Estimation ................................................................................ 57 3.3.5 Experimental Results ......................................................................................... 59 3.3.6 Toward Green Learning ..................................................................................... 72 3.3.7 Discussion .......................................................................................................... 73 3.4 Conclusion ..................................................................................................................... 75 Chapter 4: Point Cloud Pose Estimation and SO(3) Invariant Classification ............................. 76 4.1 Introduction .................................................................................................................... 76 4.2 PCRP Method................................................................................................................. 79 4.2.1 Methodology ...................................................................................................... 79 4.2.2 Experimental Results ......................................................................................... 84 4.2.3 Discussion .......................................................................................................... 88 4.3 S3I-PointHop Method .................................................................................................... 
88 4.3.1 Methodology ...................................................................................................... 89 4.3.2 Experiments ....................................................................................................... 93 4.3.3 Discussion .......................................................................................................... 95 4.4 Conclusion ..................................................................................................................... 96 Chapter 5: Point Cloud Odometry and Scene Flow Estimation ................................................. 98 5.1 Introduction .................................................................................................................... 98 5.2 GreenPCO Method ....................................................................................................... 102 5.2.1 Methodology .................................................................................................... 102 5.2.2 Experimental Results ....................................................................................... 108 5.2.3 Discussion ........................................................................................................ 112 5.3 PointFlowHop Method ................................................................................................. 113 5.3.1 Methodology .................................................................................................... 113 5.3.2 Experiments ..................................................................................................... 120 5.3.3 Discussion ........................................................................................................ 123 5.4 Conclusion ................................................................................................................... 125 Chapter 6: Conclusion and Future Work ...................................................................................127 6.1 Summary of the Research ............................................................................................ 127 6.2 Future Research Topics ................................................................................................ 130 6.2.1 3D Object Detection ........................................................................................ 131 iv 6.2.2 Semantic Segmentation ................................................................................... 132 Bibliography ................................................................................................................................134 ListOfTables 3.1 Registration performance comparison on ModelNet-40 with respect to unseen classes (left) and noisy input point clouds (right). . . . . . . . . . . . . . . . . . . 47 3.2 Registration performance comparison on the 3DMatch dataset. . . . . . . . . . . . 60 3.3 Registration on unseen point clouds . . . . . . . . . . . . . . . . . . . . . . . . . . 63 3.4 Registration on unseen classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 3.5 Registration on noisy point clouds . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 3.6 Registration on partial point clouds (R-PointHop* indicates choosing correspon- dences without the ratio test). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 3.7 Registration on the Stanford Bunny dataset . . . . . . . . . . . . . . . . . . . . . . 66 3.8 Ablation study on object registration. . . . . . . . . . . . . . . 
. . . . . . . . . . . 71 4.1 Performance comparison of object registration. . . . . . . . . . . . . . . . . . . . . 85 4.2 Comparison of point cloud retrieval performance. . . . . . . . . . . . . . . . . . . 86 4.3 Mean and median rotation errors in degrees. . . . . . . . . . . . . . . . . . . . . . 86 4.4 Classification accuracy comparison of PointHop-family methods. . . . . . . . . . 94 4.5 Comparison with Deep Learning Networks. . . . . . . . . . . . . . . . . . . . . . 94 4.6 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 5.1 Performance comparison between GreenPCO and five supervised DL methods on two test sequences in the KITTI dataset. . . . . . . . . . . . . . . . . . . . . . . 109 v 5.2 Ablation study on KITTI dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 5.3 The effect of different amounts of training data . . . . . . . . . . . . . . . . . . . . 112 5.4 Comparison of scene flow estimation results on the Stereo KITTI dataset, where the best performance number is shown in boldface. . . . . . . . . . . . . . . . . . 121 5.5 Comparison of scene flow estimation results on the Argoverse dataset, where the best performance number is shown in boldface. . . . . . . . . . . . . . . . . . 122 5.6 Ego-motion compensation – ICP vs. GreenPCO. . . . . . . . . . . . . . . . . . . . 122 5.7 Performance gain due to object refinement. . . . . . . . . . . . . . . . . . . . . . . 122 5.8 Performance gain due to flow refinement. . . . . . . . . . . . . . . . . . . . . . . . 123 5.9 The number of trainable parameters and training time of the proposed PointFlowHop. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 5.10 Comparison of model sizes (in terms of the number of parameters) and computational complexity of inference (in terms of FLOPs) of four benchmarking methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 vi ListOfFigures 2.1 Illustration of the ICP algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.2 PointNetLK architecture. The figure is from [2]. . . . . . . . . . . . . . . . . . . . 16 2.3 DCP architecture. The figure is from [89]. . . . . . . . . . . . . . . . . . . . . . . . 20 2.4 3DMatch architecture. The figure is from [99]. . . . . . . . . . . . . . . . . . . . . 24 2.5 Local geometry encoding in PPFNet. The figure is from [20]. . . . . . . . . . . . . 27 2.6 PPFNet patch feature construction. The figure is from [20]. . . . . . . . . . . . . . 28 2.7 PPFNet architecture. The figure is from [20]. . . . . . . . . . . . . . . . . . . . . . 28 2.8 Overview of the PointHop method. The figure is from [107]. . . . . . . . . . . . . 33 2.9 Overview of the PointHop++ method. The figure is from [106]. . . . . . . . . . . . 35 2.10 Samples from the airplane and vase class from ModelNet40 dataset. . . . . . . . . 36 2.11 Example of point clouds from 3DMatch dataset. . . . . . . . . . . . . . . . . . . . 37 2.12 Examples of point clouds from KITTI dataset. . . . . . . . . . . . . . . . . . . . . 38 3.1 The system diagram of the proposed SPA method. . . . . . . . . . . . . . . . . . . 41 3.2 M salient points (with M = 32) of several point cloud sets selected by our algorithm are highlighted. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
43 3.3 Registration of noisy point clouds (in orange) with noiseless point clouds (in blue) from ModelNet40 dataset using the SPA method, where the first row is the input and the second row is the output. . . . . . . . . . . . . . . . . . . . . . . . . 45 vii 3.4 Comparison of the mean absolute registration errors of rotation and translation for four benchmarking methods as a function of the maximum rotation angle on unseen point clouds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 3.5 Histogram of mean absolute rotation error for experiment #2 (left) and class-wise error distribution (right) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.6 The system diagram of the proposed R-PointHop method, which consists of three modules: 1) feature learning, 2) point correspondence, and 3) transformation estimation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.7 Illustration of the local reference frame (LRF). . . . . . . . . . . . . . . . . . . . . 53 3.8 Illustration of point attribute construction. . . . . . . . . . . . . . . . . . . . . . . 54 3.9 Correspondences found using R-PointHop, where the source point cloud is shown in red and the target is shown in blue. . . . . . . . . . . . . . . . . . . . . . 57 3.10 Registration of indoor point clouds from 3DMatch dataset: point clouds from 7-Scenes (the left two) and point clouds from SUN3D (the right two)) . . . . . . . 60 3.11 Registration of seven point clouds from the ModelNet40 dataset using R-PointHop. 63 3.12 (From left to right) The plots of the maximum rotation angle versus the root mean square rotation error, the mean absolute rotation error, the root mean square translation error, and the mean absolute translation error. . . . . . . . . . . 65 3.13 Registration on the Stanford Bunny dataset: the source and the target point clouds (left) and the registered result (right). . . . . . . . . . . . . . . . . . . . . . 67 3.14 Registration of point clouds from the Stanford 3D scanning repository, where the objects are (from left to right): drill bit, armadillo, Buddha, dragon and bunny. The top row shows input point clouds while the bottom row shows the registered output. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 3.15 (From left to right) The source and target point clouds to be aligned, registration with ICP only, with R-PointHop only, with R-PointHop followed by ICP. . . . . . 68 3.16 The t-SNE plot of point features, where a different number indicates a different object class of points. Some points are highlighted and their 3D location in the point cloud is shown. Features of points with a similar local neighborhood are clustered together despite of differences in their 3D coordinates. . . . . . . . . . . 69 3.17 Registration of two point cloud models, where the first two columns are input point clouds and the third column is the output after registration. . . . . . . . . . 70 viii 4.1 Summary of the proposed PCRP method. First, a similar object to the input query object (in red) is retrieved from the gallery set (top row). Then, the query object is registered with the retrieved object (bottom row) to estimate its pose. . . . . . . 77 4.2 An overview of the proposed PCRP method. . . . . . . . . . . . . . . . . . . . . . 79 4.3 (Top) Illustration of partitioning of point clouds into two symmetrical parts and (Bottom) point correspondences between symmetric parts. . . . . . . . . . . . . . 
83 4.4 An overview of the proposed S3I-PointHop method: 1) an input point cloud scan is approximately aligned with the principal axes, 2) local and global point features are extracted and concatenated followed by the Saab transform, 3) point features are aggregated from different conical and spherical volumes, 4) discriminant features are selected using DFT and a linear classifier is used to predict the object class. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 4.5 Illustration of conical and spherical aggregation. The conventional “global pooling" is shown in (a), where features of all points are aggregated at once. The proposed “regional pooling" schemes are depicted in (b)-(d), where points are aggregated only in distinct spatial regions. Only, the solid red points are aggregated. For better visual representation, cones/spheres along only one axis are shown. (b) and (c) use the conical pooling while (d) adopts spherical pooling in local regions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 4.6 An example of conical aggregation. For every point cloud object, points lying in each cone are colored uniquely. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 5.1 An overview of the GreenPCO method. . . . . . . . . . . . . . . . . . . . . . . . . 103 5.2 Comparison between random sampling (left) and geometry-aware sampling (right), where sampled points are marked in blue and red, respectively. . . . . . . 103 5.3 View-based partitioning using the azimuthal angle, where the front, rear, left and right views are highlighted in blue, green, red, and yellow, respectively. . . . 106 5.4 Sampled points at time instances t and t + 1 are marked in blue and red, respectively, while point correspondences between two consecutive scans are shown in green. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 5.5 Evaluation results on sequences 4 (left) and 10 (right) of the KITTI dataset. . . . . 108 5.6 An overview of the PointFlowHop method, which consists of six modules: 1) ego-motion compensation, 2) scene classification, 3) object association, 4) object refinement, 5) object motion estimation, and 6) scene flow initialization and refinement. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 ix 5.7 Objects clustered using the DBSCAN algorithm are shown in different colors. . . 116 5.8 Flow estimation results using PointFlowHop: input point clouds (left) and warped output using flow vectors (right). . . . . . . . . . . . . . . . . . . . . . . . 119 6.1 Overview of VoteNet. The figure is from [65]. . . . . . . . . . . . . . . . . . . . . 132 x Abstract 3D Point Cloud processing and analysis has attracted a lot of attention in present times due to the numerous applications such as in autonomous driving, computer graphics, and robotics. In this dissertation, we focus on the problems of point cloud registration, pose estimation, rotation invariant classification, odometry and scene flow estimation. These tasks are important in the realization of a 3D vision system. Rigid registration aims at finding a 3D transformation consisting of rotation and translation that optimally aligns two point clouds. The next two tasks focus on object-level analysis. For pose estimation, we predict the 6-DOF pose of an object with respect to a chosen frame of reference. Rotation invariant classification aims at classifying 3D objects which are arbitrarily rotated. 
In odometry, we want to estimate the incremental motion of an object using the point cloud scans captured by it at every instance. While the scene flow estimation task aims at determining the point-wise flow between two consecutive point clouds. 3D perception using point clouds is dominated by deep learning methods nowadays. How- ever, large scale learning on point clouds with deep learning techniques has several issues which are often overlooked. This research is based on the green learning (GL) paradigm and focuses on interpretability, smaller training times and smaller model size. Using GL, we separate the fea- ture learning process from the decision. Features are derived in an unsupervised feedforward xi manner from the statistics of the training data. For the decision part, we mainly use well estab- lished model-free techniques which are optimized during inference. When the decision process involves classification, a lightweight classifier is trained. Overall, the proposed methods can be trained within an hour on CPUs and the number of model parameters are much fewer than deep learning methods. These advantages are promising keeping in mind applications that demand low power and complexity, such as in edge computing. First, we develop two point cloud registration methods, Salient Points Analysis (SPA) and R-PointHop, targeted at local and global registration respectively. SPA combines unsupervised feature learning with traditional deterministic method to derive the 3D transformation from point correspondences. Analyzing the drawbacks of SPA in presence of large rotation angles, we modify the design in R-PointHop by introducing local rotation invariant feature learning. In the work, Point Cloud Retrieval and Pose Estimation (PCRP), the local features learned from R-PointHop are reused for two tasks, object retrieval and pose estimation simultaneously. In the following work, SO(3) invariant PointHop (S3I-PointHop), we extend PointHop, a GL based point cloud classification method for classifying unaligned objects. We replace the pose dependent modules in PointHop with rotation invariant counterparts. In addition, we simplify the PointHop pipeline using only one single hop along with multiple spatial aggregation techniques. The last two works deal with LiDAR point clouds. In Green Point Cloud Odometry (GreenPCO), we show how the R- PointHop method can be adapted to estimate the incremental motion of vehicle from point cloud data captured at each time step. In particular, we show how we can exploit prior knowledge and improve the performance. In PointFlowHop, an efficient 3D scene flow estimation method is proposed. It decomposes the scene flow estimation task into a set of subtasks, including ego- motion compensation, object association and object-wise motion estimation. xii Chapter1 Introduction 1.1 SignificanceoftheResearch A 3D point cloud is a set of points in three-dimensional space. The point cloud represents the surface of an object in 3D. Each point consists of three coordinates that uniquely identify its location with respect to three mutually orthogonal axes. Optionally, additional information such as RGB color values and surface normal can be embedded as point attributes, depending upon the sensor used to capture the point cloud. Typically, a point cloud comprises a large number of points (tens to hundreds of thousands, or even million). Unlike 2D images, which can be represented by a regular grid, 3D point clouds are unorganized with no particular order. 
This unordered nature needs to be considered when dealing with point clouds and designing methods for processing them. This, combined with the large number of points, makes the realization of real-world point-cloud-based systems challenging. Point clouds are inherently different from 2D images and other forms of 3D representations such as voxel grids and meshes. While 2D images are only projections of the 3D world on a 2D plane, point clouds give 3D information about the structure of object. Point cloud geometry 1 is more robust under different lighting conditions and even reliable in darkness. However, one advantage of 2D images is that they have a well-defined grid structure. In contrast, point clouds are unordered in nature, which means that the methods and algorithms used to process them must be invariant under any permutation of the point cloud data and insensitive to the order in which the points appear in the stored data. The 2D grid representation in images permits the use of operations like convolution with kernels. The use of convolutional neural networks (CNNs), which are based on such convolutions, has offered tremendous benefit to 2D vision problems in recent times. However, due to the irregular structure of point clouds, such convolutions are not possible directly. In most practical cases, point clouds are very sparsely populated and contain outliers and noise. Point clouds have become increasingly popular due to the rich applications and advantages they provide over conventional 2D images. Another reason for a more widespread use of point clouds is the reduction in the costs of sensors such as the Light Detection and Ranging (LiDAR). To keep pace with the demands of markets such as autonomous vehicles, robotics and virtual reality, research and development of methods for fast and accurate processing and understanding 3D world from point cloud data has been on a steady rise. With the proliferation of deep learning since the year 2012, a natural inclination of researchers in academia and industry has been to search for deep learning based solutions for point cloud perception tasks. PointNet [66] has been an early pioneering work in this regards. Prior to the advent of deep learning, there have been exemplary works on point clouds which are based on traditional handcrafted feature representations. Two noteworthy point cloud feature descriptors include the Signature Histogram of Orientations (SHOT) [78] and Fast Point Feature Histogram (FPFH) [73]. Another classical method includes the Iterative Closest Point (ICP) [6] 2 algorithm for aligning two point clouds. ICP is popular as a post-processing step in numerous point cloud applications even today. Deep learning has achieved state of the art results for the perception tasks like point cloud classification, segmentation, and detection. The advantages of deep learning in terms of perfor- mance measures like accuracy / error are unquestionable in almost all of the computer vision tasks in general, and tasks that involve point cloud data in specific. However, the performance gain has come only at a cost of massive training data, large training times, and usage of expensive GPU resources. On the flip side, the traditional methods which based on handcrafted feature de- sign, often fail to match the performance of deep learning based methods. The only advantage of these traditional approaches is that they are much lightweight in nature and have a faster compu- tation. Another important aspect of deep learning is use of supervision. 
Deep learning methods often rely on heavy supervision, whereas reliable ground truth label information cannot always be made available in a cost effective manner. Furthermore, deep learning methods are hard to interpret and fail to generalize on newer tasks or even on different datasets for the same task. Several attempts have been made to demystify deep learning and make it more explainable. The work carried out in this dissertation finds roots in one such attempts to build a new pipeline of explainable machine learning, one which is very different from deep learning. It is known as Green Learning (GL) [42]. GL decouples feature representation learning from the decision part. The feature learning process uses the statistics of the training dataset to learn the filters in an unsupervised, one pass, and feedforward manner. This is contrasting to deep learning which uses iterative non-convex optimization via the backpropagation algorithm. Once the feature rep- resentation has been obtained, the decision part can be supervised or unsupervised depending upon the task. The feature learning is unsupervised and task agnostic. 3 The GL framework is a result of several years of research aimed at developing mathematically transparent machine learning method for vision applications. Some of the milestones in this pur- suit include the view of Convolutional Neural Networks (CNNs) as Multi-layer RECOS transform [40], the Subspace Approximation with Augmented Kernels (Saak) transform [41], the Subspace Approximation with Adjusted Bias (Saab) transform [43], PixelHop [12], and PointHop [107]. In particular, PointHop and its follow-up work PointHop++ [106] offer impressive performance on the point cloud classification task. Our feature design method is inspired from these works. GL has shown remarkable results on a series of computer vision problems in point cloud segmen- tation [105, 104], face recognition [69], fake image detection [10, 113], anomaly detection [103], and texture generation [45]. In this dissertation, we focus on the problems of point cloud registration, pose estimation, rotation invariant classification, odometry, and scene flow estimation. All these problems have some commonalities and each of the proposed method share some principles among them. Reg- istration is a key step in many applications of point clouds. Given a pair of point cloud scans, registration attempts to find a rigid transformation for their optimal alignment. Multiple point cloud scans can be registered to get a complete 3D scene of the environment. Previously, the handcrafted descriptors such as SHOT [78], Spin Images [30], and FPFH [73] were were widely used for registration. These descriptors captured local neighborhood information of points. Then, the corresponding points in the two point clouds were used to solve for the optimal rotation and translation. Typically, RANSAC [22] has been used to get a robust solution free from outliers. On the other hand, deep learning methods employ end-to-end optimization where the features and the transformation are learned jointly. Unlike traditional methods which are unsupervised, 4 deep learning introduced supervised learning for registration where the training labels consisted of ground truth transformation (rotation matrix and translation vector). Our work attempts to combine the benefits of the traditional descriptors with unsupervised learning. 
One downside of the traditional descriptors apart from the fact that they are handcrafted is that they cannot capture information at different scales (or receptive fields) of the point cloud simultaneously. While, CNNs have shown us the advantages of multi-stage feature learning. One of the significance of our work is that we leverage the successive feature learning capabilities of GL and capture the short-, mid-, and long-range point cloud neighborhoods and learn point fea- tures. The point matches found based on these features are then combined with well established method [74] to get the 3D transformation. This makes our method completely unsupervised. We develop methods for local registration in the form of Salient Points Analysis (SPA) [35] and global registration in the form of R-PointHop [34]. While unsupervised deep learning methods for registration have been explored, ours stand apart in terms of being able to learn different neighborhood information in a one pass manner. In the latter works, we use the R-PointHop method and modify it for point cloud classification of arbitrarily rotated objects and pose estima- tion followed by point cloud odometry and scene flow estimation. Another significance of our research lies in the green learning aspect. As mentioned previ- ously, deep learning methods require long training times on GPUs. This leads to a large carbon footprint. In this regards, the proposed methods are cost effective and utilize only CPUs with training times of an order of magnitude less than deep learning based solutions. The model size and number of floating point operations (FLOPs) during inference are much less in comparison with deep learning methods as well. Such solutions are encouraging keeping in mind resource constrained applications such as in mobile and edge computing. 5 1.2 ContributionsoftheResearch 1.2.1 UnsupervisedPointCloudRegistration Supervised learning based point cloud registration methods outperform traditional methods by a significant margin. We believe, registration being more of a geometry based problem rather than a semantics based problem, supervision is unnecessary for registration. While, for problems such as semantic segmentation and 3D detection, supervised learning still makes sense since they involve understanding of high level scene semantics. To this end, we propose two unsupervised learning methods for 3D point cloud registration - SPA and R-PointHop. These two works focus on object-level and indoor scene registration. The main contributions of these works include – • SPA learns the point features from the distribution of its neighbors using the PointHop++ method. • In SPA, we exploit the local surface properties in selecting a set of salient points. This small subset of points are representative and can be used to estimate the rotation and translation of the entire point cloud set effectively and accurately. • Unlike SPA, R-PointHop learns point features that are invariant to point cloud rotation and translation, thereby making it suitable for global registration. The effectiveness of the proposed features for geometric registration task is demonstrated through a series of experiments on indoor point cloud scans from the 3DMatch dataset as well as synthetic and real-world models from ModelNet40 and Stanford Bunny datasets respectively. 6 • The SPA and R-PointHop methods emphasize on designing a green solution that has a smaller model size, lower memory consumption, and reduced training time as compared to state-of-the-art methods. 
1.2.2 PointCloudPoseEstimationandSO(3)InvariantClassification Point cloud pose estimation closely resembles the registration problem. For pose estimation, we predict the 6-DOF (Degree of Freedom) pose of an object by first retrieving a similar aligned object from a gallery set, and then registering the input object to the retrieved object. Inspired from the rotation invariant feature learning in R-PointHop, we propose to classify 3D objects which are arbitrarily rotated. For this, the rotation invariant local and global features are fed to a classifier. It extends the use case of PointHop and PointHop++ for classification of unaligned objects. The main contributions of the proposed works, Point Cloud Retrieval and Pose Estimation (PCRP) and SO(3) Invariant PointHop (S3I-PointHop) include – • In PCRP, we extend R-PointHop, which was originally designed for point cloud registration, to object retrieval and pose estimation. We show how features derived from R-PointHop can be aggregated to yield a global feature descriptor for object retrieval and reuse point features for pose estimation. • In PCRP, we propose a way to modify the attribute representation in R-PointHop and make it more general. As a result, any traditional point local descriptors, such as FPFH [73] and SHOT [78], can be adopted by R-PointHop. • In S3I-PointHop, the pose dependent octant partitioning operation in PointHop is replaced by an ensemble of three rotation invariant representations to guarantee SO(3) invariance. 7 • In S3I-PointHop, by exploiting the rich spatial information, we simplify multi-hop learn- ing in PointHop to one-hop learning in S3I-PointHop. Specifically, two novel aggrega- tion schemes (i.e., conical and spherical aggregations in local regions) are proposed, which makes one-hop learning possible. 1.2.3 PointCloudOdometryandSceneFlowEstimation Point cloud odometry and scene flow estimation closely resemble the registration problem. For odometry, we want to estimate the trajectory of a moving object based on the point cloud scans captured by it at every instance. We treat the problem of finding the incremental motion of the object as a registration problem that tries to estimate the rotation and translation between the two point cloud scans captured at two consecutive time steps. For scene flow estimation, we determine the per-point flow vector of points in the first point cloud. R-PointHop is a special case here, where all the points are related by a single rigid transformation. The main contributions of the proposed works, Green Point Cloud Odometry (GreenPCO) and PointFlowHop include – • In GreenPCO, we introduce the idea of geometry aware point cloud sampling for deter- mining a subset of discriminant points from large scale and sparse LiDAR point clouds. It makes use of eigen features [27] along with farthest point selection. • In GreenPCO, we use a view-based partitioning technique that divides the sampled points into four disjoint sets as belonging to the four views surrounding the object – front view, rear view, left and right side views. Then correspondences between consecutive scans are found in each set independently. This offers a powerful prior on the presence of corre- sponding points. 8 • In PointFlowHop, we decompose the scene flow in vehicle’s ego-motion and the object motion components. The ego-motion and object-wise motion are optimized during infer- ence based on point features learned using a single task-agnostic feedforward PointHop++ model. 
• In PointFlowHop, we develop a lightweight 3D scene classifier that identifies moving points and further clusters and associates them into moving object pairs. 1.3 OrganizationoftheDissertation The rest of the dissertation is organized as follows. In Chapter 2, we review the traditional and learning based methods for point cloud registration, classification, pose estimation, odometry, and scene flow estimation. We also discuss the PointHop [107] and PointHop++ [106] point cloud classification methods in particular since these methods form the basis for our work. In Chapter 3, we propose two green and unsupervised point cloud registration methods namely Salient Points Analysis (SPA) and R-PointHop. Next two chapters are based on the extension of R-PointHop for the problem of point cloud pose estimation and SO(3) invariant classification followed by point cloud odometry and scene flow estimation. In Chapter 4, we propose the PCRP method for joint point cloud retrieval and pose estimation, followed by the S3I-PointHop method for SO(3) group invariant point cloud classification. In Chapter 5, we propose a method called Green Point Cloud Odometry (GreenPCO) for point cloud odometry estimation. Then, we extend GreenPCO for 3D scene flow estimation in PointFlowHop, where GreenPCO is responsible for the sub-problem of vehicle ego-motion estimation. Finally, we conclude our work and provide future research directions in Chapter 6. 9 Chapter2 BackgroundReview In this chapter, we discuss literature relevant to our research. To begin with, we study some of the traditional and deep learning (DL) based methods for point cloud registration, classification, pose estimation, odometry, and scene flow estimation. Next, two Green Learning (GL) based point cloud classification methods - PointHop [107] and PointHop++ [106] are discussed which form a basis for our work. In the end, we summarize the datasets used during our experiments. 2.1 TraditionalMethodsforRegistration 2.1.1 IterativeClosestPoint(ICP) The classical iterative closest point (ICP) [6] algorithm alternates between two steps – finding point correspondences and estimating the transformation that minimizes the Euclidean distance between matching points. For finding point correspondences, a nearest neighbor search is used. The ICP algorithm is very basic and relies purely on the 3D coordinates of points. The ICP algo- rithm works as follows. 10 Let A = {a i } and B = {b i } be the two sets of point clouds to be registered. We are inter- ested in the transformationT that best aligns the two point clouds. T is a rigid transformation consisting of 3D rotation and translation. Usually, an initial alignment T 0 is obtained using a global alignment algorithm. Otherwise,T 0 is considered an identity transformation. In the first iteration,T is set toT 0 and point cloudB is transformed usingT . Then, for every point in point cloudA, its nearest point in the transformedB is found. For example, if every pointT· b i matches with point m i from point cloud A, we then have a ordered pair of point corresponding points, (m i ,T · b i ). These point correspondences are then used to find the optimal transformation that minimizes the Euclidean distance between corresponding points. It is formulated as T =argmin T { X i ∥T · b i − m i ∥ 2 }. (2.1) Figure 2.1: Illustration of the ICP algorithm. The optimalT is found using singular value decomposition (SVD) or a least squares technique. These steps are repeated for a fixed number of iterations or until convergence. 
There are two main drawbacks of the ICP algorithm. First, it assumes that there is a one-to-one point correspondence between the two point clouds. However, in practice, there may only be a small partial overlap between the two point clouds. Therefore, it is common practice to set a threshold distanced max 11 between two corresponding points, above which the pair of corresponding points is not reliable. Then, only the set of point correspondences with a distance between corresponding points of less thand max is used to estimate the transformation. The second drawback of ICP is that, when the initial alignment is far from the optimal solution, the algorithm can get trapped in a local minimum. In such cases, it is preferable to conduct a prior global registration using a different algorithm, after which ICP can be used to a obtain a tighter alignment. A simple illustration of the ICP algorithm for the alignment of two curves (“model” and “scene”) is shown in Fig. 2.1. 2.1.2 Point-to-PlaneICP Point-to-plane ICP [11] incorporates surface normal information in order to improve the perfor- mance of the ICP algorithm. Instead of minimizing the pointwise Euclidean distance (error term), it minimizes the projection of the error term onto the subspace spanned by the surface normal. First, the surface normal of every point in point cloudA is found. Ifη i is the surface normal for pointm i , then equation 2.1 is modified as follows T =argmin T { X i ∥η i · (T · b i − m i )∥ 2 }. (2.2) 2.1.3 GeneralizedICP Generalized ICP [76] replaces the cost function of the original ICP (equation 2.1) with a proba- bilistic model. The steps to find correspondence using a nearest neighbor search are the same as that for ICP. For example, let (a i ,b i ) i=1,··· ,N be the set of point correspondences found. It is 12 assumed that there exists a set of points ˆ A={ˆ a i } and ˆ B ={ ˆ b i } that generate the points in point cloudsA andB, respectively. The point samples are assumed to be drawn from a normal distri- bution asa i ∼ N(ˆ a i ,C A i ) andb i ∼ N( ˆ b i ,C B i ), whereC A i andC B i are the associated covariance matrices. For the correct transformation T ∗ , the presence of perfect correspondence gives the relation ˆ b i =T ∗ ˆ a i . (2.3) Letd (T) i =b i − Ta i for any transformationT . Then,d (T ∗ ) i is governed by the following distribu- tion: d (T ∗ ) i ∼ N( ˆ b i − (T ∗ )ˆ a i ,C B i +(T ∗ )C A i (T ∗ ) T ) =N(0,C B i +(T ∗ )C A i (T ∗ ) T ). (2.4) This is reduced to a zero mean Gaussian by using equation 2.3. Subsequently, maximum likeli- hood estimation (MLE) is used to iteratively solve forT : T =argmax T Y i p(d (T) i ) =argmax T X i log(p(d (T) i )). (2.5) This simplifies to T =argmin T X i d (T) T i (C B i +TC A i T T ) − 1 d (T) i . (2.6) 13 Once,T is found using equation 2.6, the algorithm repeats. The original ICP algorithm is a special case of the generalized ICP case, where C A i =0 C B i =I, (2.7) which reduces equation 2.6 to T =argmin T X i d (T) T i d (T) i =argmin T X i ∥d (T) i ∥ 2 (2.8) Similarly, the point-to-plane ICP can be thought of as finding transformation T , such that T =argmin T { X i ∥P i d i ∥ 2 }, (2.9) where P i is the projection onto the surface normal of b i . Using the property of an orthogonal projection matrix,P i =P 2 i =P T i . Then, equation 2.9 can be equated to T =argmin T { X i d T i P i d i }, (2.10) Comparing this to equation 2.6, the covariance matrices for point-to-plane ICP are given by C A i =0 C B i =P − 1 i . 
(2.11) 14 The advantage of generalized ICP is that it allows any set of covariance matrices{C A i } and {C B i } to be selected. A direct application of generalized ICP is the plane-to-plane ICP, which considers surface normal information from both point clouds. 2.1.4 GlobalRegistration The ICP algorithm fails to achieve good alignment in the absence of good initialization; that is, only locally optimal solutions are guaranteed by ICP. There are several methods that offer global registration. One popular method is globally optimal ICP or Go-ICP [96]. It is based on the Branch-n-Bound (BnB) algorithm that searches the entire SE(3) space, which represents all pos- sible transformations. The error term minimization of Go-ICP is similar to that of ICP (equation 2.1). Another method is fast global registration (FGR) [109]. FGR facilitates the global registration of multiple partially overlapping 3D surfaces. Teaser [95] proposed a certifiable algorithm that registers two point clouds with a large number of outlier correspondences. 2.2 DeepLearningMethodsforRegistration 2.2.1 PointNetLK PointNet [66] successfully demonstrated the application of Deep learning for point cloud pro- cessing, particularly for classification and semantic segmentation tasks. However, it is nontrivial to directly adopt PointNet for 3D registration. PointNetLK [2] extends PointNet for point cloud registration by extracting the global point cloud feature vector of the two point clouds to be reg- istered (source and template) using PointNet, and combines it with the classical Lucas & Kanade 15 (LK) algorithm [54] for iterative alignment. The authors considered PointNet from the perspec- tive of a trainable imaging function, which motivated them to apply the well-established image registration LK algorithm to PointNet. However, due to the unordered nature of points and lack of well-defined neighborhoods in point clouds, it is not possible to compute local gradients, as demanded by the LK algorithm. Thus, a modified LK algorithm was proposed for point cloud processing. Figure 2.2: PointNetLK architecture. The figure is from [2]. The registration problem is set up as follows. The PointNet functionϕ encodes a global vector descriptor of dimensionK for the point cloud. This is the feature vector obtained after the global max pooling operation in the classification pipeline of PointNet. P T and P S are the template and source point clouds, respectively. The goal is to find a rigid body transformation G∈ SE(3) (special Euclidean group in three dimensions) that best alignsP S toP T . G is represented as G=exp X i ξ i T i ! , ξ =(ξ 1 ,ξ 2 ,...,ξ 6 ) T , (2.12) where T i are generators with twist parameters ξ ∈ R 6 . Taking this into consideration, Point- NetLK tries to find the optimal G, such that the PointNet function of the template is equal to the 16 PointNet function of the transformed source. That is,ϕ (P T )=ϕ (G· P S ). The input and feature transformation modules (T-net) in PointNet are omitted in PointNetLK, and the fully-connected layers after global pooling are not present, as these two elements are a part of the classification framework and not required for registration. Furthermore, to reduce the computational time of each iteration, an inverse compositional (IC) formulation is used, which reversed the role of tem- plate and source. In every iteration, incremental warps to the template are found and inverse of the transformation is applied to the source. This modifies the objective to ϕ (P S )=ϕ (G − 1 · P T ). 
(2.13) The right hand side of of equation 2.13 is then linearized as ϕ (P S )=ϕ (P T )+ ∂ ∂ξ [ϕ (G − 1 · P T )]ξ, (2.14) where G − 1 is given by G = exp(− P i ξ i T i ). Further, J ∈ R K× 6 is the Jacobian matrix and J = ∂ ∂ξ [ϕ (G − 1 · P T )]. Unlike images, it is nontrivial to computeJ for point clouds. Hence, in the modified LK algorithm, every column J i is estimated using finite difference gradients as J i = ϕ (exp(− t i T i )· P T )− ϕ (P T ) t i , (2.15) wheret i is a small perturbation ofξ . The solution forξ in equation 2.14 is given by ξ =J + [ϕ (P S )− ϕ (P T )]. (2.16) 17 Here, J + is the pseudo-inverse of matrixJ. Usingξ , the one step update∆ G can be used to update the source point cloud as P S ← ∆ G· P S , ∆ G=exp X i ξ i T i ! . (2.17) The final transformation G est is a composition of all the transformation estimates found in every iteration and is given by ∆ G est =∆ G n · ...· ∆ G 1 · ∆ G 0 . (2.18) The loss function during training of PointNetLK uses the orthogonality property of the spatial transformationG and is given by ∥(G est ) − 1 · G gt − I 4 ∥ F , (2.19) whereG gt is the ground truth transformation matrix that is used for supervision,I 4 is an identity matrix, and∥·∥ F is the matrix Frobenius norm. The PointNetLK architecture is shown in Fig. 2.2. The one-time and looping computations are highlighted by the blue and orange lines, respectively. We can see that the Jacobian matrix is computed only once from the feature of the template point cloud. The twist parameters are updated in every iteration based on the incremental alignments of the source point cloud. A minimum threshold is set for ∆ G which serves as the stopping criteria. 18 2.2.2 DeepClosestPoint(DCP) Similar to PointNet [66], DGCNN [91] is another pioneering work targeted at the point cloud classification and semantic segmentation tasks. The authors of DGCNN developed a registra- tion method that closely mimics the ICP-like alignment pipeline; however, unlike the iterative path of ICP, they proposed a method that employed a one-pass end-to-end trained deep network capable of global registration. This method was termed Deep Closest Point (DCP) [89]. It is common for traditional methods (ICP and many more) to find point correspondences and use them to estimate the 3D transformation that best aligns two point clouds. DCP follows a similar correspondence-based approach. In contrast, PointNetLK uses a global point cloud feature vector to find the transformation. The method used by DCP for registration can be summarized in four steps: 1) point feature extraction using DGCNN; 2) feature transformation using Transformer; 3) soft pointer generation using a pointer network; and 4) estimating the transformation using singular value decomposition (SVD). The rigid alignment problem is formulated as follows. The two point clouds to be aligned are represented as X = {x 1 ,...,x i ,...,x N } ∈ R 3 and Y = {y 1 ,...,y i ,...,y N } ∈ R 3 , with the assumption that both point clouds contain the same number of pointsN. A rigid transformation [R XY ,t XY ] is applied toX in order to obtainY , whereR XY ∈ SO(3) is the 3D rotation matrix and t XY ∈ R 3 is the translation vector. The goal of the network is to minimize the error term E(R XY ,t XY ) which is given by E(R XY ,t XY )= 1 N N X i ∥R XY x i +t XY − y i ∥ 2 , (2.20) where(x i ,y i ) are the ordered pairs of corresponding points. 19 Figure 2.3: DCP architecture. The figure is from [89]. Next, we consider the details of each step. 
The DCP architecture is shown in Fig. 2.3. Point feature extraction. In the point feature extraction step, the point clouds X, Y ∈ R (N× 3) are mapped to a higher dimensional feature embedding space using a trainable network to obtain the point features. For this, two learning modules, PointNet and DGCNN, can be used. PointNet extracts the features of each point independently, whereas DGCNN incorporates local geometry information in the feature learning process by constructing a graph of the neighboring points, applying a nonlinearity at the edge endpoints, and finally performing a vertex-wise local aggregation. The global max pooling operations in PointNet and DGCNN are avoided, as they are relevant only to the classification task. For registration, only pointwise features are used. At the output of this step, we obtain point features denoted by F X = {x L 1 ,x L 2 ,...,x L i ,...,x L N } and F Y = {y L 1 ,y L 2 ,...,y L i ,...,y L N }, considering L layers in DGCNN or PointNet. Empirical results show that DGCNN performs better than PointNet for this step, mainly due to the incorporation of local structure information. Featuretransformation. The next step, feature transformation, captures self-attention and conditional attention. Transformer [82], which is based on the principle of attention, has shown massive success in natural language processing tasks like machine translation. So far, the features F X and F Y have been learned independently; that is, X does not influence F Y and Y does not 20 influence F X . The attention network in the form of a transformer is included with the goal of capturing contextual information between two feature embeddings. The attention model learns a functionϕ :R N× P × R N× P →R N× P , whereP is the dimension of the feature embedding from previous step. The new feature embeddings are then given by Φ X =F X +ϕ (F X ,F Y ), Φ Y =F Y +ϕ (F Y ,F X ). (2.21) This operation modifies F X → Φ X such that the point features ofX have knowledge about the structure of Y . For function Φ , the transformer network [82] is incorporated using four head attention. Pointergeneration. After feature transformation, the pointer generation step is conducted. The goal of this step is to establish point correspondences. However, unlike a hard assignment, which is non-differentiable, DCP uses a probabilistic approach to generate soft pointers that allow the computation and propagation of gradients. Accordingly, each point x i ∈ X is assigned a probability vector over the pointsy i ∈Y , and is given by m(x i ,Y)=softmax(Φ Y Φ T x i ). (2.22) Here,Φ x i denotes thei-th row ofΦ X andm(x i ,Y) is the soft pointer. 21 Transformation estimation. The final step is to predict the transformation i.e., R XY and t XY . This is done using Singular Value Decomposition (SVD). First, the soft pointers are used to generate an estimate of the matching point inY for every point inX. This is given by ˆ y i =Y T m(x i ,Y)∈R 3 , (2.23) whereY ∈R N× 3 is the matrix of input points. Then,(x i , ˆ y i ) are used as point correspondences to find the transformation. First, the centroids of X andY are found: ¯x= 1 N N X i=1 x i and ¯y = 1 N N X i=1 y i . (2.24) Then, the covariance matrix is calculated as H = N X i=1 (x i − ¯x)(y i − ¯y) T . (2.25) The matrix H is then decomposed as H = USV T using SVD, where U ∈ SO(3) is the matrix of left singular vectors,S is the3× 3 diagonal matrix of singular values andV ∈ SO(3) is the matrix of right singular vectors. 
Thereafter, the optimal[R XY ,t XY ] to minimize equation 2.20 is given by R XY =VU T and t XY =− R XY ¯x+ ¯y. (2.26) 22 The error function in equation 2.20 assumes exact point correspondences are known. Since DCP uses soft pointers, only approximate locations in theY point cloud are obtained instead. Hence, the error term can be modified to E(R XY ,t XY )= 1 N N X i ∥R XY x i +t XY − y m(x i ) ∥ 2 . (2.27) The mapping functionm(·) is learned with the objective m(x i ,Y)=argmin j ∥R XY x i +t XY − y j ∥. (2.28) The network is trained in an end-to-end manner using the ground truth rotation matrix and translation vector as supervision. The loss term is given by Loss=∥R T XY R g XY − I∥ 2 +∥t XY − t g XY ∥ 2 +λ ∥θ ∥ 2 (2.29) where superscriptg denotes the ground truth transformations andθ is a regularization term. For training and evaluation, the ModelNet40 dataset [93] is used. The experiments are mainly split into three parts—tests on unseen point clouds (test data), tests on unseen categories (object classes set aside during training), and registration of noisy point clouds. DCP outperforms Point- NetLK as well as traditional methods like ICP. The inclusion of the Transformer module further reduces the error. 23 2.2.3 3DMatch 3DMatch [99] employs deep learning to learn a mapping function ψ that encodes a feature de- scriptor for a local 3D patch around a point. A smallerl 2 distance between features of two patches is desired for corresponding points (or patches). To generate pairs of correspondence for training the network, correspondence labels from RGB-D reconstruction datasets such as 7-Scenes [77] and SUN3D [94] are used. The mapping functionψ in 3DMatch is a 3D ConvNet that outputs a 512-dimensional patch descriptor. The network weights are learned with the objective of minimizing the l 2 distance between the feature descriptors of corresponding patches, as well as maximizing thel 2 distance between the features of non-matching patches. The network architecture is illustrated in Fig. 2.4. It is a Siamese style 3D CNN comprising eight convolution layers and a pooling layer. It takes two local patches in the form of30× 30× 30 truncated distance function (TDF) voxel grids, as explained in the following paragraph, and predicts whether they correspond. Equal numbers of true matches and non-matches are fed to the network during training. Figure 2.4: 3DMatch architecture. The figure is from [99]. 24 Next, we go over the process of 3D data representation, which provides the input for the 3D CNN. For an input point, the 3D region in the local neighborhood is converted to a30× 30× 30 voxel grid of TDF values. The TDF value of every voxel represents the distance between the voxel center and the nearest 3D surface. Later, the TDF values are truncated and normalized. They lie between 0 and 1, where 1 is at the surface and 0 is far from the surface. These voxels are then aligned with respect to the camera view. The voxel representation, being an ordered grid, enables the use of 3D CNNs to process the patches. The 3DMatch network has been evaluated for keypoint matching tasks as well as geometric registration tasks. For registration, the correspondences found using 3DMatch were combined with RANSAC for robust alignment of the 3D point clouds. For the matching task, 3DMatch out- performs handcrafted features such as Spin-Images [30] and FPFH [73]. Further experiments also show that 3DMatch can be integrated into 3D reconstruction frameworks. 
2.2.4 PPFNet

PPFNet [20] employs a globally informed 3D local feature descriptor to find correspondences between two point clouds. It takes a local point cloud patch and encodes its point pair features along with the raw point coordinates and point normal information. Further, PointNet is used to learn permutation-invariant point features. The final feature is found by concatenating the local point features and the global point cloud features, followed by a further MLP layer. The mechanism of PPFNet is discussed in detail below.

PPFNet considers two point clouds, X ∈ R^3 and Y ∈ R^3, with x_i and y_i being their i-th points. Assuming a rigid transformation, X and Y are related by a permutation matrix P ∈ P_N and a rigid transformation T = {R ∈ SO(3), t ∈ R^3}. The error for registration is given by

d(X, Y | R, t, P) = (1/N) Σ_{i=1}^{N} ∥x_i − y_{i(P)} − t∥².   (2.30)

In matrix notation,

d(X, Y | R, t, P) = (1/N) ∥X − PYT^T∥².   (2.31)

PPFNet attempts to learn an effective mapping function f(·) that achieves d_f(X, Y | T, P) ≈ 0 under any transformation T and permutation P, where

d_f(X, Y | R, t, P) = (1/N) ∥f(X) − f(PYT^T)∥².   (2.32)

The function f is designed to be invariant to point permutations P and tolerant to transformations T.

Next, we go over the point pair feature (PPF) method. Given two 3D points x_1 and x_2, their PPF, ψ_12, describes the surface characteristics as follows:

ψ_12 = (∥d∥_2, ∠(n_1, d), ∠(n_2, d), ∠(n_1, n_2)),   (2.33)

where d is the difference vector between the two points and ∥d∥_2 is their l_2 distance; n_1 and n_2 are the surface normals of x_1 and x_2, respectively. ∠ is the angle between two directions that lies in [0, π) and is computed as

∠(v_1, v_2) = atan2(∥v_1 × v_2∥, v_1 · v_2).   (2.34)

The PPF is invariant under 3D rotation, translation, and reflection.

We next study the local geometry encoding process of PPFNet, which forms the basis of the network. For a reference point x_r ∈ X, a set of points {m_i} ∈ Ω ⊂ X in the local neighborhood is found. The local reference frame is found around x_r with respect to the neighboring points in Ω. Then, for every point m_i in the set Ω, its PPF ψ_ri is found by pairing it with the reference point x_r. The PPFs are concatenated with the point coordinates and point normals to obtain the local geometry encoding of the reference point. This is given by

F_r = {x_r, n_r, x_i, ..., n_i, ..., ψ_ri, ...}.   (2.35)

The local geometry encoding is shown in Fig. 2.5.

Figure 2.5: Local geometry encoding in PPFNet. The figure is from [20].

After encoding the local patch information, PointNet [66] is used to extract features from the local patches. N local patches are uniformly sampled from the point cloud and presented to the network. Max pooling is used to aggregate the features of all local patches into a global descriptor for the entire point cloud. The local patch features and the globally pooled features are then concatenated to further learn the final patch features. The feature construction process is illustrated in Fig. 2.6.

Figure 2.6: PPFNet patch feature construction. The figure is from [20].

The network is trained using an N-tuple loss that considers N pairs of patches from the two point clouds to be matched. By selecting some pairs of true correspondences and non-corresponding patches, the network learns distinctive features that preserve proximity in feature space for matching patches. These sets of true and false correspondences are selected using the ground truth pose information. The detailed architecture is shown in Fig. 2.7.

Figure 2.7: PPFNet architecture. The figure is from [20].

The performance of PPFNet for matching and geometric registration tasks has been evaluated on the 3DMatch dataset. PPFNet outperforms 3DMatch and several other handcrafted 3D descriptors.
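The point pair feature of Eqs. (2.33)-(2.34) is easy to compute directly; a minimal sketch (with unit surface normals assumed to be given) is shown below:

import numpy as np

def angle(v1, v2):
    """Angle between two directions, as in Eq. (2.34)."""
    return np.arctan2(np.linalg.norm(np.cross(v1, v2)), np.dot(v1, v2))

def point_pair_feature(x1, n1, x2, n2):
    """4D point pair feature of Eq. (2.33) for points x1, x2 with normals n1, n2."""
    d = x2 - x1
    return np.array([np.linalg.norm(d),
                     angle(n1, d),
                     angle(n2, d),
                     angle(n1, n2)])

# Toy usage.
x1, n1 = np.array([0.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])
x2, n2 = np.array([0.1, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
print(point_pair_feature(x1, n1, x2, n2))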
2.3 Point Cloud Classification and Pose Estimation

Point Cloud Classification. There has been major advancement in point cloud processing due to deep learning. PointNet [66] employed an MLP-based permutation-invariant network for point cloud classification and segmentation. It was followed by a series of works [67, 91, 51]. Besides permutation invariance, features invariant to point cloud rotation [19, 25] are desirable for applications such as data association, registration, and classification of unaligned objects. Works like PointNet [66], PointNet++ [67], DGCNN [91], and PointCNN [51] are susceptible to point cloud rotations. Designing rotation invariant networks has been popular for 3D registration when global alignment is needed. Methods such as PPFNet [20] and PPF-FoldNet [19] achieve partial and full invariance to 3D transformations, respectively. The idea behind any rotation invariant method is to design a representation that is free from the pose information. This is done by exploiting properties of 3D transformations such as the preservation of distances, relative angles, and principal components. Global and local rotation invariant features for classification were proposed in [48], which form the basis of our S3I-PointHop [32] method. Ambiguities associated with global PCA alignment were analyzed and a disambiguation network was proposed in [46]. Another approach is the design of equivariant neural networks that achieve invariance via certain pooling operations. SO(3)- and SE(3)-equivariant convolutions make networks equivariant to the 3D rotation and 3D roto-translation groups, respectively. Exemplary work includes the Vector Neurons [18] for classification and segmentation, and the results in [9, 49] for category-level pose estimation.

Object Pose Estimation. One limitation of exemplary deep networks for pair-wise registration [2, 89, 90] is the assumption that a pre-aligned reference object is available, which is the same instance of the input object whose pose is to be estimated. However, such an assumption may not hold in practice and, as a result, the pose estimation problem cannot be solved by registration. One way to avoid this difficulty is to first retrieve a similar pre-aligned object from a gallery set and then estimate the object pose by registering the input with the retrieved object, as done in CORSAIR [108]. CORSAIR uses the bottleneck layer representation of a registration network to train another network for retrieval using metric learning. It is worthwhile to mention that some pose estimation methods are reference-object free [49, 9]. They adopt point cloud equivariant networks as the backbone and do not use object retrieval for pose estimation. In our work, we first retrieve a similar aligned object and then register the input object with it. Hence, point cloud retrieval is briefly reviewed next.

3D Object Retrieval. Point cloud retrieval has been studied extensively due to its rich applications.
Noteworthy work includes 3D ShapeNets [93] for shape retrieval and PointNetVLAD [81] for place recognition. Our retrieval idea is inspired by PointNetVLAD, which aggregates pointwise features from PointNet using the VLAD feature [29]. Their VLAD feature is obtained by taking a weighted average of point features, where the weights are jointly trained with the PointNet network under supervised learning. Most retrieval methods assume that the query object and the objects from the gallery set are pre-aligned. Yet, in the context of joint object retrieval and pose estimation, this assumption does not hold, and it is essential to develop a retrieval method where the query object can possess an arbitrary rotation.

Overall, rotation invariant point cloud classification and pose estimation form the basis of object-level understanding in a 3D scene. In a given scene, a detection algorithm may isolate 3D objects, which should first be classified. Since the objects may be in arbitrary poses, a rotation invariant point cloud classification method is more robust in correctly identifying objects. Further, the object pose information may facilitate downstream tasks such as object grasping or obstacle avoidance.

2.4 Point Cloud Odometry and Scene Flow Estimation

The state-of-the-art point cloud odometry method is LOAM [101]. It is a combination of two algorithms running in parallel – one for LiDAR odometry and the other for point cloud registration. LOAM adopts a pipeline which includes scan matching, motion estimation, and mapping. Since drift correction and loop closure detection are commonly encountered in odometry and SLAM systems, several techniques have been developed to address them. V-LOAM (Visual LOAM) [102] combines LiDAR and image data by exploiting the advantages of multi-modality sensors. LoL [70] integrates LOAM and segment matching with respect to an offline map to compensate for odometry drift. ORB-SLAM [61] is a SLAM system based on the ORB descriptor [71] that leverages monocular vision. A new camera calibration technique that improves the visual odometry estimates of several high-performance methods was proposed in [23] for the KITTI dataset [17]. There are also a few geometry-based point cloud odometry methods that perform scan matching using the iterative closest point (ICP) algorithm [6] or its variants [72]. For example, the ICP-based Pose-Graph SLAM [55] uses the odometry estimate to create a pose graph that in turn updates the map. A similar idea is adopted in [7]. Besides, the fusion of IMU and GPS information with Kalman and extended Kalman filtering [31, 37] has been applied to odometry for years.

Early work on 3D scene flow estimation uses 2D optical flow estimation followed by triangulation, such as that given in [83]. The Iterative Closest Point (ICP) algorithm [6] and the non-rigid registration work NICP [1] can operate on point clouds directly. A series of seminal image- and point-based methods for scene flow estimation relying on similar ideas were proposed in the last two decades. The optical flow is combined with dense stereo matching for flow estimation in [28]. A variational framework that predicts the scene flow and depth is proposed in [3]. A piecewise rigid scene flow estimation method is investigated in [85]. Similarly, the motion of rigidly moving 3D objects is examined in [56]. Scene flow based on Lucas-Kanade tracking [54] is studied in [68]. An exhaustive survey on 2D optical flow and 3D scene flow estimation methods has been done by Zhai et al. [100].
A large number of deep learning models have been proposed for point cloud odometry. Nikolai et al. [62] project 3D point clouds to panoramic depth images, use 2D convolutional layers to extract features, and estimate motion parameters with fully-connected (FC) layers, leading to an end-to-end network design. DeepPCO [88] uses two separate networks to predict the three orientation and three translation parameters, respectively. Deep learning methods for visual odometry have been proposed as well. Konda et al. [38] use a CNN for visual odometry by relating depth and motion to velocity change. DeepVO [87] uses a recurrent CNN for visual odometry. ViNet [14] treats visual odometry as a sequence-to-sequence problem. Other noteworthy works include Flowdometry [60] and LS-VO [15].

For DL-based 3D scene flow estimation, FlowNet3D [52] adopts the feature learning operations from PointNet++ [67]. HPLFlowNet [26] uses bilateral convolution layers and projects point clouds to an ordered permutohedral lattice. PointPWC-Net [92] takes a self-supervised learning approach that works in a coarse-to-fine manner. FLOT [64] adopts a correspondence-based approach built on optimal transport. HALFlow [86] uses a hierarchical network structure with an attention mechanism. The Just-Go-With-the-Flow method [59] uses self-supervised learning with the nearest neighbor loss and the cycle consistency loss.

Deep learning methods that attempt to simplify the flow estimation problem using ego-motion and/or object-level motion have also been investigated. For example, Rigid3DSceneFlow [24] reasons about the scene flow at the object level (rather than the point level). Accordingly, the flow of the scene background is analyzed via ego-motion and that of a foreground object is described by a rigid model. RigidFlow [47] enforces a rigidity constraint in local regions and performs rigid alignment in each region to produce rigid pseudo flow. SLIM [4] uses a self-supervised loss function to separate moving and stationary points.

2.5 Green Learning for Point Clouds

In this section, we summarize the PointHop and PointHop++ methods for point cloud classification. Our feature learning process is inspired by these two methods.
2.5.1 PointHop
Figure 2.8: Overview of the PointHop method. The figure is from [107].

PointHop [107] uses the statistics of 3D points to learn point cloud features in an unsupervised, one-pass manner. It consists of four steps – local-to-global attribute building via one-hop information exchange, dimensionality reduction via the Saab transform [43], feature aggregation, and classification. First, the attributes of a point are constructed based on the distribution of points in its local neighborhood. In particular, for each point, its k nearest neighbors are found. Then, the 3D space is partitioned into eight octants around the point under consideration. The mean of the 3D coordinates of the points in each octant is calculated. All eight means are concatenated to get the 24-dimensional first-hop point attribute. All point attributes from the training data are collected and their covariance matrix is analyzed to define the Saab transform at the first PointHop unit. This process of attribute construction and dimensionality reduction is repeated, which leads to multiple PointHop units. The corresponding receptive field grows as the hop number increases. PointHop considers four hops in total. Point features from all four PointHop units are pooled to obtain the global point cloud feature vector. Four aggregation methods are used, which include max pooling, average pooling, l_1-norm pooling, and l_2-norm pooling. The global feature is then fed to a machine learning classifier such as the Random Forest for predicting object class labels. The system diagram of PointHop is shown in Fig. 2.8. PointHop outperforms PointNet in object classification on the ModelNet40 dataset [93].

2.5.2 PointHop++

In the follow-up work PointHop++ [106], two main improvements were made. First, the overall training complexity was reduced by exploiting an important property of the Saab transform. Next, a subset of discriminant features was selected for classification. Since the Saab kernels are derived from the Principal Component Analysis (PCA), the output spectral representation of the first hop is uncorrelated. Hence, each output component is processed separately. In short, instead of processing a spatial-spectral tensor of large spectral dimension, the new channel-wise Saab (c/w Saab) transform processes multiple spatial tensors of spectral dimension one. The c/w Saab transform is more effective than the Saab transform with regard to computational complexity and storage complexity (i.e., model size).
This results in a feature tree decomposition where each level in the tree represents a particular hop, the deepest level corresponding to the deepest hop. Each tree node represents a single feature dimension. Whether to split a tree node (intermediate node) or keep it as it is (leaf node) is determined based on an energy criterion, calculated as the normalized energy of the principal components. All the leaf nodes are collected as the point feature and aggregated into a global descriptor. A subset of discriminant feature dimensions is found using the cross-entropy criterion with the training labels. Finally, the features are fed into a classifier to obtain the class labels. The feature learning process in PointHop++ is illustrated in Fig. 2.9. PointHop++ beats PointHop in classification performance.

Figure 2.9: Overview of the PointHop++ method. The figure is from [106].
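To make the PointHop pipeline more concrete, the following simplified sketch builds the 24D first-hop attributes from eight-octant means and then applies a plain PCA with an energy threshold as a stand-in for the Saab transform; the actual Saab transform includes a bias term and other details omitted here, and the parameter values are illustrative:

import numpy as np
from scipy.spatial import cKDTree

def hop1_attributes(points, k=32):
    """24D first-hop attributes: mean of k-NN coordinates in each of 8 octants.

    A simplified sketch of the PointHop attribute construction; octants that
    contain no neighbor contribute a zero vector.
    """
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k + 1)          # first neighbor is the point itself
    attrs = np.zeros((len(points), 24))
    for i, nbrs in enumerate(idx[:, 1:]):
        local = points[nbrs] - points[i]          # center at the point
        octant = (local[:, 0] > 0).astype(int) * 4 + \
                 (local[:, 1] > 0).astype(int) * 2 + \
                 (local[:, 2] > 0).astype(int)
        for o in range(8):
            mask = octant == o
            if mask.any():
                attrs[i, 3 * o:3 * o + 3] = local[mask].mean(axis=0)
    return attrs

def pca_reduce(attrs, energy_threshold=1e-4):
    """PCA stand-in for the (channel-wise) Saab transform with an energy cut."""
    centered = attrs - attrs.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    order = np.argsort(eigval)[::-1]
    eigval, eigvec = eigval[order], eigvec[:, order]
    keep = (eigval / eigval.sum()) > energy_threshold
    return centered @ eigvec[:, keep]             # spectral representation

# Toy usage on a random point cloud.
rng = np.random.default_rng(0)
pts = rng.normal(size=(1024, 3))
feat = pca_reduce(hop1_attributes(pts))
print(feat.shape)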
2.6 Datasets

We conduct our experiments on three datasets – ModelNet40 (synthetic objects), 3DMatch (indoor scenes), and KITTI (outdoor scenes). They are detailed next.

2.6.1 ModelNet40

The ModelNet40 dataset [93] is a compilation of 12,308 Computer-Aided Design (CAD) models of point clouds of common objects, such as tables, chairs, sofas, airplanes, and so on. In all, the ModelNet40 dataset includes 40 object categories. The dataset is divided into 9,840 models for training and 2,468 models for testing. Each point cloud consists of 2,048 points. All the point cloud models are pre-aligned into a canonical frame of reference. ModelNet40 and its subset ModelNet10 are widely used in point cloud object classification and shape retrieval tasks. ModelNet40 is a synthetic dataset. Some point cloud models from this dataset are shown in Fig. 2.10.

Figure 2.10: Samples from the airplane and vase classes of the ModelNet40 dataset.

2.6.2 3DMatch

The 3DMatch dataset [98] is an ensemble of different indoor scenes from the SUN3D [94] and 7-Scenes [77] datasets. It comprises several indoor scenes such as kitchen, bedroom, office, lab, etc. This dataset is used for geometric registration of 3D indoor scenes. Each scene is constructed from 50 depth frames. The authors have used the correspondences from the 3D reconstruction datasets SUN3D and 7-Scenes to generate the ground truth labels for training. Two point clouds from this dataset are shown in Fig. 2.11.

Figure 2.11: Examples of point clouds from the 3DMatch dataset.

2.6.3 KITTI and Argoverse

The KITTI vision benchmark suite [23] is a dataset developed for autonomous driving purposes. The KITTI dataset is a collection of RGB images, depth maps, and point clouds of urban street scenes in Germany, closely resembling the environment for a self-driving car.
The KITTI vision challenge consists of several computer vision tasks for autonomous driving, such as stereo, odometry, object detection, tracking, and so on. One of the tasks is visual odometry/SLAM. For this task, the data consists of 22 stereo sequences: 11 with ground truth information for training and the remaining 11 for testing. The monocular/stereo images can be used jointly with the point cloud scans for visual odometry tasks. The StereoKITTI dataset [57, 58] has 142 pairs of point clouds for the scene flow estimation task. The ground truth flow of each pair is derived from the 2D disparity maps and the optical flow information. Likewise, the SemanticKITTI dataset [5] is based on the KITTI dataset for semantic scene understanding. It provides per-point annotations. Inspired by the KITTI dataset, the Argoverse dataset [8] was developed by Argo AI in the United States. It provides detailed, high-quality maps for tasks like 3D tracking and motion forecasting. The point cloud data in both of these datasets is captured using a LiDAR scanner. Two point cloud scans from the KITTI dataset are shown in Fig. 2.12.

Figure 2.12: Examples of point clouds from the KITTI dataset.

Chapter 3
Point Cloud Registration

3.1 Introduction

Point cloud registration refers to the task of finding a spatial transformation that aligns two point cloud sets. A spatial transformation can be rigid (e.g., rotation or translation) or non-rigid (e.g., scaling or non-linear mappings). In this dissertation, we focus on rigid registration. The reference and transformed point clouds are called the target and the source, respectively. The goal of registration is to find a transformation that aligns the source with the target. Further, we can categorize registration as local or global. When the two point clouds are related by only a small amount of rotation, refinement using a local registration algorithm such as ICP [6] is sufficient to get a tighter alignment. However, for larger rotation angles, a global registration algorithm is required. In this chapter, we first discuss the Salient Points Analysis (SPA) [35] method, which is well suited for local registration. Next, we present the R-PointHop [34] method, which is capable of global registration.

3.2 Salient Points Analysis (SPA)

3.2.1 Problem Statement

Let X ∈ R^3 and Y ∈ R^3 be the target and the source point clouds. Y is obtained from X through rotation and translation, which are represented by the rotation matrix R_XY ∈ SO(3) and the translation vector t_XY ∈ R^3. Given X and Y, the goal is to find the optimal rotation R*_XY and translation t*_XY for X so that they minimize the point-wise mean squared error between the transformed X and Y. The error term can be written as

E(R_XY, t_XY) = (1/N) Σ_{i=0}^{N−1} ∥R_XY x_i + t_XY − y_i∥².   (3.1)

For simplicity, X and Y are assumed to have the same number of points, which is denoted by N. If the source and target point clouds have a different number of points, only their common points would be considered in computing the error.

The system diagram of the proposed SPA method is shown in Fig. 3.1. It consists of four main steps – feature learning, salient points selection, point correspondence, and transformation estimation. The target and the source point clouds are fed into PointHop++ [106] to learn features. The output for each point is a D-dimensional feature vector. The salient points selection module selects M salient points from the set of N points.
Features of the M salient points are used to find point correspondences using the nearest neighbor search. Finally, SVD is used to estimate the transformation. The output point cloud can be used as the new source for iterative alignment. Yet, the training of PointHop++ is done only once. We elaborate on each module below.

Figure 3.1: The system diagram of the proposed SPA method.

3.2.2 Feature Learning

The target and the source point clouds are fed into the same PointHop++ to learn features for every point. We use target point clouds from the training dataset to obtain Module I of PointHop++ using a statistical approach. The training process is the same as that in [106]. We should emphasize that Modules II and III of PointHop++ are removed in our current system. They include feature aggregation and the classifier. Cross-entropy-based feature selection is also removed. These removed components are related to the classification but not the registration task. As a result, feature learning with PointHop++ is completely unsupervised. Features from the four PointHop units are concatenated to get the final feature vector. The feature dimension depends on the energy thresholds chosen in the PointHop++ design. We will demonstrate in the experiments that our method can generalize well to unseen data and unseen classes.

3.2.3 Salient Points Selection

The Farthest Point Sampling (FPS) technique is used in PointHop++ to downsample the point cloud. It offers several advantages for classification, such as reduced computation and rapid coverage of the point cloud set. However, FPS is not ideal for registration, as it leads to different points being sampled from the target and the source point clouds. We propose a novel method to select a subset of salient points based on local geometry information. The new method offers higher assurance that the selected points are the same ones sampled from the source and the target. It is detailed below.

We apply the Principal Component Analysis (PCA) to the (x, y, z) coordinates of the points of a local neighborhood centered at every point. There are N points. Thus, there are N local neighborhoods and we conduct the PCA N times. Mathematically, let p_{i,1}, p_{i,2}, ..., p_{i,K} be the K nearest neighbors of p_i. The input to this local PCA is K 3D coordinate vectors. We obtain a 3×3 covariance matrix that has three eigenvectors and three associated non-negative eigenvalues. We pay attention to the smallest eigenvalue among the three and denote it with λ_i, which is the energy of the third principal component of the local PCA at point p_i.

Before proceeding, it is worthwhile to give an intuitive explanation of this quantity. Points lying on a flat surface can be well represented by two coordinates in the transformed domain, so the energy of their third component, λ_i, is zero or close to zero. Points lying on edges between surfaces, corners, or distinct local regions have a large λ_i value. These are salient point candidates, since they are more discriminative. Thus, we rank order all points based on the energy values of their third principal component, from the largest to the smallest. The point with the largest energy λ is initialized as the first salient point:

m_0 = p_i,  s.t.  i = argmax_j λ_j.   (3.2)

If we select M points simply based on the rank order of λ_i, we may end up selecting many points from the same region of largest curvature. Yet, we need salient points that span the entire point cloud. For this reason, we use the FPS to find the farthest region that contains salient point candidates. In this new region, we again look for the point with the largest λ_i. To be more exact, we consider two factors jointly: 1) selecting the point according to the order of the λ_i value, and 2) selecting the point that has the largest distance from the set of previously selected points. This process is repeated until M salient points are determined.
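A minimal sketch of the saliency score described above, i.e., the energy λ_i of the third principal component of each local neighborhood, is given below; the neighborhood size and the toy example are illustrative:

import numpy as np
from scipy.spatial import cKDTree

def third_component_energy(points, k=32):
    """Smallest local-PCA eigenvalue for every point (the saliency score in SPA).

    Points on flat regions get values near zero, while edges and corners
    get large values.
    """
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)
    lam = np.empty(len(points))
    for i, nbrs in enumerate(idx):
        local = points[nbrs] - points[nbrs].mean(axis=0)
        cov = local.T @ local / (k - 1)
        lam[i] = np.linalg.eigvalsh(cov)[0]     # eigenvalues in ascending order
    return lam

# Toy usage: a plane folded along the line x = 0.
rng = np.random.default_rng(0)
pts = np.c_[rng.uniform(-1, 1, (2000, 2)), np.zeros(2000)]
pts[:, 2] = np.abs(pts[:, 0]) * 0.3             # crease along the y-axis
lam = third_component_energy(pts)
print(pts[np.argmax(lam)])                      # typically a point near the crease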
Several examples of salient points selected by our method are shown in Fig. 3.2.

Figure 3.2: M salient points (with M = 32) of several point cloud sets selected by our algorithm are highlighted.

3.2.4 Point Correspondence and Transformation Estimation

We get M salient points from the target and the source. We obtain their features from PointHop++ to establish point correspondences. For each selected salient point in the target point cloud, we use the nearest neighbor search technique in the feature space to find its matched point in the source point cloud. The proximity is measured in terms of the Euclidean distance between the correspondence pair. Finally, we form a pair of points, (q_m, q'_m), for the m-th matched pair.

Once we have the ordered pairs, the next task is to estimate the rotation matrix and translation vector that best align the two point clouds. We revisit Eq. (3.1), which gives the optimization criterion. A closed-form solution, which minimizes the mean squared error (MSE), can be obtained by SVD. The steps are summarized below. First, we find the centroids of the two point clouds as

x̄ = (1/N) Σ_{i=0}^{N−1} x_i,   ȳ = (1/N) Σ_{i=0}^{N−1} y_i.   (3.3)

Then, the covariance matrix is computed via

Cov(X, Y) = Σ_{i=0}^{N−1} (x_i − x̄)(y_i − ȳ)^T.   (3.4)

The covariance matrix can be decomposed in the form Cov(X, Y) = USV^T via SVD, where U is the matrix of left singular vectors, S is the diagonal matrix containing the singular values, and V is the matrix of right singular vectors. Then, the optimal rotation matrix R*_XY and translation vector t*_XY are given by [74] as

R*_XY = VU^T,   t*_XY = −R*_XY x̄ + ȳ.   (3.5)

We can then align the source point cloud to the target using this transformation. Optionally, we can use an iterative alignment method where the aligned point cloud at the i-th iteration is used as the new source for the (i+1)-th iteration. Even with the iteration, PointHop++ is only trained once.

3.2.5 Experimental Results

We use the ModelNet-40 [93] dataset in all experiments. It has 40 object classes, 9,843 training samples, and 2,468 test samples. Each point cloud sample consists of 2,048 points. Only 1,024 points are used in training. The evaluation metrics include the Mean Squared Error (MSE), the Root Mean Squared Error (RMSE), and the Mean Absolute Error (MAE), computed for the rotation angles about the X, Y, and Z axes in degrees and for the translation vector. To train PointHop++, we set the numbers of nearest neighbors to 32, 8, 8, and 8 in hops #1, #2, #3, and #4, respectively. The energy threshold of PointHop++ is set to 0.0001. Some of the registration results are presented in Fig. 3.3. The following three sets of experiments are conducted.

Figure 3.3: Registration of noisy point clouds (in orange) with noiseless point clouds (in blue) from the ModelNet40 dataset using the SPA method, where the first row is the input and the second row is the output.

Experiment #1: Unseen point clouds. We train PointHop++ on the target point clouds with training samples from all 40 classes for feature learning. Then, we apply the SPA method to test samples for the registration task.
In the experiment, we vary the maximum rotation angle about the three axes from 5° to 60° in steps of 5°. For each case, the three angles are uniformly sampled between 0° and the maximum angle. The translation along the three axes is uniformly distributed in [−0.5, 0.5]. We select 32 salient points for SPA. For performance comparison, we replace the salient points selection method in SPA with random selection and farthest points selection while keeping the rest of the procedure the same. We also compare the SPA method with ICP. For a fair comparison, every point cloud undergoes exactly the same transformation for all four benchmarking methods. The mean absolute registration errors after the first iteration are plotted as a function of the maximum rotation angle in Fig. 3.4. We see from Fig. 3.4 that the SPA method outperforms the other three methods by a clear margin. Random selection gives the worst results. Interestingly, SPA with farthest sampled points works better than ICP for larger angles. The translation error of all three SPA variants is much less than that of ICP. This validates the effectiveness of the proposed SPA method in global registration.

Figure 3.4: Comparison of the mean absolute registration errors of rotation and translation for four benchmarking methods as a function of the maximum rotation angle on unseen point clouds.

Table 3.1: Registration performance comparison on ModelNet-40 with respect to unseen classes (top) and noisy input point clouds (bottom).

Registration errors on unseen classes
Method          MSE (R)  RMSE (R)  MAE (R)  MSE (t)   RMSE (t)  MAE (t)
ICP [6]         467.37   21.62     17.87    0.049722  0.222831  0.186243
PointNetLK [2]  306.32   17.50     5.28     0.000784  0.028007  0.007203
DCP [89]        19.20    4.38      2.68     0.000025  0.004950  0.003597
SPA             354.57   18.83     6.97     0.000026  0.005120  0.004211

Registration errors on noisy input point clouds
Method          MSE (R)  RMSE (R)  MAE (R)  MSE (t)   RMSE (t)  MAE (t)
ICP [6]         558.38   23.63     19.12    0.058166  0.241178  0.206283
PointNetLK [2]  256.16   16.00     4.60     0.000465  0.021558  0.005652
DCP [89]        6.93     2.63      1.52     0.000003  0.001801  0.001697
SPA             331.73   18.21     6.28     0.000462  0.021511  0.004100

Experiment #2: Unseen classes. This experiment is used to study the generalizability of SPA. We adopt the same setup as that in [89]. We use target point clouds from the first 20 classes of ModelNet-40 for feature learning in the training phase. Source and target point clouds from the remaining 20 classes are used for registration in the testing phase. Rotation angles and the translation are uniformly sampled from [0°, 45°] and [−0.5, 0.5], respectively. We apply SPA for 10 iterations. The results are shown in Table 3.1. We see that SPA generalizes well to categories that it did not see during training. SPA outperforms ICP by a significant margin. SPA and PointNetLK have comparable performance. Although SPA is still inferior to DCP, the difference between their mean absolute rotation errors is less than 4°, while the difference between their translation errors is 0.00062.

Experiment #3: Noisy point clouds. The objective is to understand the error-resilience property of SPA against noisy input point cloud data. We train PointHop++ on the noiseless target point clouds from all 40 classes for feature learning. In the testing phase, we add independent zero-mean Gaussian noise of variance 0.01 to the three coordinates of each point. Rotation and translation are uniformly sampled from [0°, 45°] and [−0.5, 0.5], respectively. The results are shown in Table 3.1.
Again, we see that SPA outperforms ICP, has comparable performance with PointNetLK, and is inferior to DCP. Overall, SPA is robust to additive Gaussian noise.

Error Analysis. It is worthwhile to examine the error distribution of the test data, as the evaluation metrics provide only an average measure of the performance. We plot the histogram of the MAE(R) for Experiment #2 in Fig. 3.5. It is clear that a large number of point clouds align well with a very small error. Precisely, 38% of the test samples have an MAE(R) of less than 1° and 72% of them have an MAE(R) of less than 5°. In Fig. 3.5, we also plot the class-wise MAE(R) for the 20 unseen classes that are not used during training. 8 out of 20 classes have an MAE of less than 5°. The error is almost 0° for the person class. The main reason for a larger error in the case of some point clouds is that, since SPA uses the global point coordinates to construct the point attributes, it fails in the presence of large rotation angles, just like ICP. It is, however, reliable for local registration.

Figure 3.5: Histogram of the mean absolute rotation error for Experiment #2 (left) and the class-wise error distribution (right).

Comparison of Model Size and Training Complexity. The model size of SPA is only 64 kB while that of DCP is 21 MB. PointNetLK trains PointNet for the classification task first, uses its weights as the initialization, and then trains the registration network using a new loss function. Both PointNetLK and DCP demand GPU resources to speed up training. However, for SPA training, CPUs are sufficient and the training time is less than 30 minutes. Overall, SPA offers a good tradeoff between model size, computational complexity, and registration performance, thereby making it a green solution.

3.3 R-PointHop

3.3.1 Problem Statement

The point cloud registration problem is to find a rigid transformation (including rotation and translation) that optimally aligns two point clouds, where one is the target point cloud, denoted by F ∈ R^3, and the other is the source point cloud, denoted by G ∈ R^3. The source is obtained by applying an unknown rotation and translation to the target. The rotation can be expressed in the form of a rotation matrix R ∈ SO(3), where SO(3) is the special orthogonal group (i.e., the 3D rotation group in the Euclidean space). The translation vector t ∈ R^3 defines the same displacement vector for all points in the 3D space. Given F and G, the goal is to find an optimal rotation R* ∈ SO(3) and translation t* ∈ R^3 that minimize the mean squared error between matching points, given by

E(R, t) = (1/N) Σ_{i=0}^{N−1} ∥R f_i + t − g_i∥²,   (3.6)

where (f_i, g_i) denotes the i-th pair of the N selected matching points. Although the actual number of points in each point cloud could be larger than N, the error is defined over the N matching points for convenience. The system diagram of the proposed R-PointHop method is shown in Fig. 3.6. It contains three main modules: 1) feature learning, 2) point correspondence, and 3) transformation estimation. These modules are detailed below.

Figure 3.6: The system diagram of the proposed R-PointHop method, which consists of three modules: 1) feature learning, 2) point correspondence, and 3) transformation estimation.

3.3.2 Feature Learning

In the feature extraction process, a D-dimensional feature vector is learned for every point in a hierarchical manner. The feature learning function, g(·), takes input points of dimension D_0 and outputs points with feature dimension D.
Here, D_0 represents the 3D point coordinates along with optional point properties like the surface normal, color, etc. Stage h in the hierarchical feature learning process (or the h-th hop) is a function g_h(·) that takes the point feature of the previous hop of dimension D_{h−1} and outputs a feature of dimension D_h.

To find the feature f_{i,h} of the i-th point in the h-th hop, the input to g_h(·) includes the point coordinates x_i, the feature of the i-th point from the previous hop f_{i,h−1}, the coordinates of the K neighboring points x_j in hop h, the features of the neighboring points f_{j,h−1} from the previous hop, and a reference frame F. Thus, f_{i,h} is given by

f_{i,h} = g_h(x_i, x_j, f_{i,h−1}, f_{j,h−1}, F).   (3.7)

There are several choices of g_h(·), which can be determined based on whether the goal is to learn a local or a global feature. For R-PointHop, we choose g_h(·) such that

f_{i,h} = g_h(x_j − x_i, f_{i,h−1}, f_{j,h−1}, LRF(x_i, x_j)),   (3.8)

where LRF(x_i, x_j) is the local reference frame centered at x_i. This choice of g_h(·) encodes only the local patch information and loses the global shape structure. In contrast, the PointHop and SPA learning functions are given by

f_{i,h} = g_h(x_j, f_{i,h−1}, f_{j,h−1}, XYZ),   (3.9)

where XYZ denotes that points are always expressed in the original frame of reference. Although this learning function captures the global shape structure, as the spatial locations of the neighborhood patches x_j are preserved, it limits the registration performance in the presence of a large rotation angle. Instead, R-PointHop keeps the local position information with the LRF. This is desired for registration, since matching points (or patches), which could be spatially far apart, are still close in the feature space. In contrast, the global position information is vital for the classification task, since we are interested in how different local patches connect to other patches to form the overall shape. Thus, the use of global coordinates in the classification problem is justified.

Local Reference Frame (LRF). The Principal Component Analysis (PCA) of the 3D coordinates of points in a local neighborhood provides insight into the local surface structure. The third eigenvector of the PCA can be taken as a rough estimate of the surface normal. Although the local PCA computation was used in SPA [35] to select salient points, SPA pays more attention to the eigenvalues rather than the eigenvectors. Since the local PCA centered at a point is invariant under a rotation of the point cloud, the local PCAs of true corresponding points should be similar. This observation serves as the basis to derive the local reference frame (LRF) for every point. That is, we consider the K nearest neighbors of a point and conduct the PCA on their 3D coordinates. This results in three mutually orthogonal eigenvectors. They are sorted in decreasing order of the associated variances (or eigenvalues). We use X, Y, Z as a convention to represent the original reference frame in which the point clouds are defined. For the LRF, we use P, Q, R to label the three axes corresponding to the eigenvectors of the largest, middle, and smallest eigenvalues. The eigenvectors come with a sign ambiguity problem, since the negative of an eigenvector is still an eigenvector. There are various methods to tackle the sign ambiguity problem. The distribution of neighboring points at every hop is exploited in our work to handle this ambiguity, as discussed later.
Then, we can define positive eigenvectors (p+, q+, r+) and negative eigenvectors (p−, q−, r−) for each point. They are unique and serve as the LRF for every point. An example is illustrated in Fig. 3.7.

Figure 3.7: Illustration of the local reference frame (LRF).

Constructing Point Attributes. To construct the attributes of a target point, we find its K nearest neighbors. They can be the same as those in the previous step or from a larger neighborhood, depending on the point density and the total number of points. For each point in the neighborhood, we transform its XYZ coordinates to the LRF of the target point. The eigenvectors (p+, q+, r+) are used as the default axes. To address the sign ambiguity of each axis individually, we consider the 1D coordinates of the K points along an axis, find the median point, and calculate the first-order moments about the median point. Initially, we can assign p+ or p− arbitrarily. The first-order left and right moments are given, respectively, by

M_p^l = Σ_i |p_i − p_m|  ∀ p_i < p_m,   (3.10)
M_p^r = Σ_i |p_i − p_m|  ∀ p_i > p_m,   (3.11)

where p_i is the 1D coordinate of point i projected onto p+ and p_m is the projected value of the median point. If M_p^r > M_p^l, we retain the original assignment of p+/p−. Otherwise, we swap the assignment to ensure that the direction with the larger first-order moment is the positive axis. This can be implemented by post-multiplying the local data matrix of dimension K×3 with a diagonal reflection matrix, R′ ∈ R^{3×3}, whose diagonal elements are either 1 or −1 depending on the chosen sign. That is,

R′_ii = 1 if M_i^l < M_i^r, and −1 otherwise,   (3.12)
R′_ij = 0 if i ≠ j.   (3.13)

Here, we deliberately use R′ for the reflection matrix above so as to avoid confusion with the rotation matrix R.

The 3D space of the K nearest neighbor points is partitioned into eight octants of the LRF, centered at the target point. For each octant, we calculate the mean of the 3D coordinates of the points in that octant and concatenate all eight means to get a 24D vector, which forms the attributes of the target point. The same process is conducted on all points in the point cloud. The octant partitioning and grouping are similar to those of PointHop, but the difference lies in the use of the LRF in R-PointHop. The attribute construction process is illustrated in Fig. 3.8.

Figure 3.8: Illustration of point attribute construction.

Multi-hop Features. The 24D attributes of all points of the point clouds from the training set are collected, and the Saab transform [106] is conducted to obtain a 24D spectral representation. This is the output of hop #1. We compute the energy of each node as done in [106], pass the nodes with energy greater than a threshold T to the next hop, and discard the nodes with energy smaller than the threshold T. In PointHop++, the nodes with energy less than the threshold T are collected as leaf nodes. Here, we discard them to avoid mismatched correspondences. This is because hop #1 features carry more local structure information, which may be similar in different regions of the point cloud. Proceeding to the next hop, the point cloud is downsampled using Farthest Point Sampling (FPS). FPS ensures that the structure of the point cloud is preserved after downsampling. It also helps reduce computations and grow the receptive field quickly. At hop #2, the attribute construction process is repeated at every node passed on from hop #1.
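Before detailing the hop #2 processing further, the following is a minimal sketch of the LRF construction and the moment-based sign disambiguation described above for a single point; the tie-breaking when the two moments are equal is an arbitrary choice not specified in the text:

import numpy as np

def local_reference_frame(neighbors, center):
    """LRF of a point from the PCA of its k nearest neighbors, with the
    moment-based sign disambiguation sketched above.

    neighbors : (K, 3) coordinates of the k nearest neighbors.
    center    : (3,) coordinates of the point itself.
    Returns a 3x3 matrix whose columns are the axes (p, q, r).
    """
    centered = neighbors - neighbors.mean(axis=0)
    cov = centered.T @ centered / (len(neighbors) - 1)
    eigval, eigvec = np.linalg.eigh(cov)
    axes = eigvec[:, ::-1].copy()               # columns: largest -> smallest eigenvalue
    # Disambiguate the sign of each axis with first-order moments about the median.
    proj = (neighbors - center) @ axes          # (K, 3) 1D coordinates per axis
    for a in range(3):
        p = proj[:, a]
        pm = np.median(p)
        left = np.abs(p[p < pm] - pm).sum()     # M^l
        right = np.abs(p[p > pm] - pm).sum()    # M^r
        if right < left:                        # keep the larger-moment direction positive
            axes[:, a] = -axes[:, a]
    return axes

def to_lrf(neighbors, center, axes):
    """Express neighboring points in the LRF centered at the point."""
    return (neighbors - center) @ axes

# Toy usage with an anisotropic random neighborhood.
rng = np.random.default_rng(0)
nbrs = rng.normal(size=(64, 3)) * np.array([1.0, 0.5, 0.1])
frame = local_reference_frame(nbrs, center=np.zeros(3))
local = to_lrf(nbrs, np.zeros(3), frame)
print(local.shape)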
Since the spectral components passed on from hop #1 are uncorrelated, we can handle them separately and apply the channel-wise Saab transform [106] from hop #2 onward. Each dimension is treated as a node in the feature tree in the channel-wise Saab transform. The K nearest neighbors of a target point at hop #2 are found, which are different from the hop #1 neighbors due to the downsampling operation. They are represented using the LRF of the target point found in the first step. Since the set of K nearest neighbor points has changed, we have to decide the appropriate sign again. The LRF is partitioned into eight octants, in each of which we take the mean of the 1D feature of all points in that octant. The eight means are concatenated to get the 8D hop #2 attributes for a node. All the point attributes are collected and the channel-wise Saab transform is used to get the 8D spectral representation. This process is repeated for all nodes at hop #2. The multi-hop learning process continues for four hops. All 1D spectral components at the end of hop #4 are concatenated to get the feature vector of a point. The final feature dimension depends on the choice of different parameters, including the neighborhood size, the number of points to be sampled at every hop, and the energy threshold of the channel-wise Saab transform. These parameters can be different at different hops. A set of model parameters will be presented in the experiments.

The rotation/translation invariance property of R-PointHop comes from the use of the Local Reference Frame (LRF). The LRF is derived by applying PCA to the points in a local neighborhood. In the attribute building step, we collect points in a local neighborhood and project them onto the local coordinate system. This ensures that when the point cloud undergoes any rotation or translation, the coordinates of the neighboring points remain the same, since the LRF also rotates and translates accordingly. In subsequent stages, we keep projecting the neighboring points onto the LRF to ensure that the rotation/translation invariance property is preserved at every stage.

3.3.3 Point Correspondences

The trained R-PointHop model is used to extract features from the target and the source point clouds. A feature distance matrix is calculated whose ij-th element is the l_2 distance between the feature of the i-th point in the target and that of the j-th point in the source. The minimum value along the i-th row gives the point in the source which is closest to the i-th point in the target in the feature space. These pairs of points nearest in the feature space are used as an initial set of correspondences. Next, we select a subset of good correspondences. To do so, the correspondences are first ordered by increasing l_2 distance between the features of matching points. The top M_1 correspondences are selected using this criterion. We then use the ratio test to further select a smaller set of M_2 correspondences. That is, the distance to the second nearest neighbor is found as the second minimum along the row of the distance matrix. The ratio between the distance to the first neighbor and that to the second neighbor is calculated. A smaller ratio indicates a higher confidence of match. The top M_2 correspondences are selected using the ratio test. These points are used to find the rotation and translation. Instead of choosing M_1 and M_2 points explicitly, we can alternatively set two thresholds t_1 and t_2, where t_1 is for the minimum l_2 distance between matching features and t_2 for the minimum ratio. These hyper-parameters are selected empirically in our experiments.
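A minimal sketch of this correspondence selection step, combining the nearest neighbor search in feature space with the ratio test, is given below; the values of M_1 and M_2 follow the defaults used later in the object registration experiments, and the toy features are random:

import numpy as np
from scipy.spatial import cKDTree

def select_correspondences(feat_target, feat_source, m1=256, m2=128):
    """Nearest-neighbor matching in feature space followed by the ratio test.

    For every target feature, the two closest source features are found;
    matches are first ranked by feature distance (keep the best m1) and then
    by the first-to-second neighbor ratio (keep m2).
    Returns index pairs (target_idx, source_idx).
    """
    tree = cKDTree(feat_source)
    dist, idx = tree.query(feat_target, k=2)        # two nearest source features
    d1, d2 = dist[:, 0], dist[:, 1]
    order = np.argsort(d1)[:m1]                     # best m1 by feature distance
    ratio = d1[order] / np.maximum(d2[order], 1e-12)
    keep = order[np.argsort(ratio)[:m2]]            # best m2 by the ratio test
    return np.stack([keep, idx[keep, 0]], axis=1)

# Toy usage with nearly identical random features.
rng = np.random.default_rng(0)
f_t = rng.normal(size=(1024, 96))
f_s = f_t + 0.01 * rng.normal(size=(1024, 96))
pairs = select_correspondences(f_t, f_s)
print(pairs.shape)                                  # (128, 2)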
These hyper-parameters are selected empirically in our experiments. It is worthwhile to comment that SPA [35] presented an analogous method to select a subset of correspondences. The main difference between R-PointHop and SPA lies in the fact that SPA uses local PCA only to find salient points in the point cloud. It ignores the rich multi-hop spectral information. In contrast, R-PointHop uses multi-hop features to select a high-quality subset of correspondences. Some correspondences found using R-PointHop are shown in Fig. 3.9.

Figure 3.9: Correspondences found using R-PointHop, where the source point cloud is shown in red and the target is shown in blue.

3.3.4 Transformation Estimation

The ordered pairs of corresponding points (f_i, g_i) are used to estimate the optimal rotation R* and translation t* that minimize the error function given in Eq. (3.6). A closed-form solution to this optimization problem was given in [74]. It can be solved numerically using the singular value decomposition (SVD) of the data covariance matrix. The procedure is summarized below.

1. Find the mean point coordinates from the correspondences by
\bar{f} = \frac{1}{N} \sum_{i=0}^{N-1} f_i, \qquad \bar{g} = \frac{1}{N} \sum_{i=0}^{N-1} g_i. \tag{3.14}
Then, compute the covariance matrix
\mathrm{Cov}(F, G) = \sum_{i=0}^{N-1} (f_i - \bar{f})(g_i - \bar{g})^T. \tag{3.15}

2. Conduct SVD on the covariance matrix
\mathrm{Cov}(F, G) = U S V^T, \tag{3.16}
where U is the matrix of left singular vectors, S is the diagonal matrix containing the singular values, and V is the matrix of right singular vectors. In this case, U, S, and V are 3×3 matrices.

3. The optimal rotation matrix R* is given by
R^* = V U^T. \tag{3.17}
The optimal translation vector t* can be found using R* and the means \bar{f} and \bar{g}:
t^* = -R^* \bar{f} + \bar{g}. \tag{3.18}

R* and t* are then used to align the source with the target. Finally, the aligned source point cloud G' is given by
G' = R^{*T} (G - t^*), \tag{3.19}
where R^{*T} is the transpose of R*, which applies the inverse transformation. Unlike SPA, which iteratively aligns the source to the target, R-PointHop is not iterative and point cloud registration is completed in one run.
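As a compact reference for the closed-form solution above, the following NumPy sketch implements steps 1-3. It is illustrative only; the determinant check that guards against a reflection is standard practice for SVD-based alignment and is not part of the description above.

```python
import numpy as np

def estimate_rigid_transform(f, g):
    """Closed-form rotation/translation from ordered correspondences.

    f: (N, 3) corresponding points in the target.
    g: (N, 3) corresponding points in the source.
    Returns (R_star, t_star); the aligned source in row-vector form is
    G' = (G - t_star) @ R_star, i.e., R*^T applied to (G - t*).
    """
    f_bar = f.mean(axis=0)
    g_bar = g.mean(axis=0)

    # 3x3 covariance matrix, Eq. (3.15)
    cov = (f - f_bar).T @ (g - g_bar)

    # SVD and optimal rotation, Eqs. (3.16)-(3.17)
    U, S, Vt = np.linalg.svd(cov)
    R_star = Vt.T @ U.T
    if np.linalg.det(R_star) < 0:      # guard against a reflection (standard practice)
        Vt[-1, :] *= -1
        R_star = Vt.T @ U.T

    # optimal translation, Eq. (3.18)
    t_star = -R_star @ f_bar + g_bar
    return R_star, t_star
```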
3.3.5 Experimental Results

Experiments are performed on point clouds of indoor scenes and 3D objects. We begin our discussion with indoor scene registration.

Indoor Scene Registration. We trained and evaluated R-PointHop on indoor point cloud scans from the 3DMatch dataset [99]. This dataset is an ensemble of several RGB-D reconstruction datasets such as 7-Scenes [77] and SUN3D [94]. The dataset comprises various indoor scenes such as bedroom, kitchen, office, lab, and hotel. There are 62 scenes in total, which are split into 54 training scenes and 8 testing scenes. Each scene is further divided into several partially overlapping point clouds consisting of 200-700K points. During training and evaluation, 2,048 points are randomly sampled from each point cloud scan. By inspecting several examples visually, we find that the randomly selected points roughly span the entire scan; hence, computationally intensive sampling schemes such as FPS can be avoided. 256 neighboring points are used to determine the LRF. We append the point coordinates with the surface normal and geometric features (e.g., linearity, planarity, sphericity, eigen entropy, etc. [27]) obtained from the eigenvalues of local PCA as point attributes. Since local PCA is already performed in the LRF computation, these eigenvalues are readily available. Furthermore, we use RANSAC [22] to estimate the transformation. Some successful registration results are shown in Fig. 3.10.

Figure 3.10: Registration of indoor point clouds from the 3DMatch dataset: point clouds from 7-Scenes (the left two) and point clouds from SUN3D (the right two).

We compare R-PointHop with 3DMatch [99] and PPFNet [20] since they are among the early supervised deep learning methods developed for indoor scene registration. Furthermore, several model-free methods such as SHOT [78], Spin Images [30] and FPFH [73] are also included for performance benchmarking. All methods are evaluated based on 2,048 sampled points for fair comparison. By following the evaluation method given by [13], we report the average recall and precision on the test set. The results are summarized in Table 3.2. As shown in the table, R-PointHop offers the highest recall and precision. It outperforms the model-free methods by a significant margin. Its performance is slightly superior to that of PPFNet.

Table 3.2: Registration performance comparison on the 3DMatch dataset.
Method            Recall  Precision
SHOT [78]         0.27    0.17
Spin Images [30]  0.34    0.18
FPFH [73]         0.41    0.21
3DMatch [99]      0.63    0.24
PPFNet [20]       0.71    0.26
R-PointHop        0.72    0.26

Object Registration. Next, we trained R-PointHop on the ModelNet40 dataset [93]. It is a synthetic dataset consisting of 40 categories of CAD models of common objects such as car, chair, table, airplane, and person. It comprises 12,308 point cloud models in total, which are split into 9,840 training models and 2,468 testing models. Every point cloud model consists of 2,048 points. The point clouds are normalized to fit within a unit sphere. For the task of 3D registration, we follow the same experimental setup as DCP [89] and PR-Net [90] for fairness. The following set of parameters is chosen as the default of R-PointHop for object registration.
• Number of initial points: 1,024 points (randomly sampled from the original 2,048 points)
• Point Attributes: point coordinates only∗
• Neighborhood size for finding LRF: 64 nearest neighbors
• Number of points in each hop: 1024, 768, 512, 384
• Neighborhood size in each hop: 64, 32, 48, 48
• Energy threshold: 0.001
• Number of top correspondences selected: 256
• Number of correspondences selected after the ratio test: 128
∗ Although the surface normal and geometric features were included for indoor registration, they are removed in the context of object registration.

We compare R-PointHop with the following six methods:
• three model-free methods – ICP [6], Go-ICP [96] and FGR [109].
• two supervised-learning-based methods – PointNetLK [2] and DCP [89].
• one unsupervised-learning-based method – SPA [35].
For ICP, Go-ICP, and FGR, we use the open-source implementations in the Open3D library [110].

In the following experiments, we apply a random rotation to the target point cloud about its three coordinate axes. Each rotation angle is uniformly sampled in [0°, 45°]. Then, a random uniform translation in [−0.5, 0.5] is applied along the three axes to get the source point cloud. For training, only the target point clouds are used. We report the Mean Square Error (MSE), the Root Mean Square Error (RMSE) and the Mean Absolute Error (MAE) between the ground truth and the predicted rotation angles and the predicted translation vector. Further, we also align real world point clouds from the Stanford 3D scanning repository [80], [16], [39]. Later, we show that R-PointHop can be used for global registration as well as an initialization for ICP, and we explain the use of R-PointHop as a general 3D point descriptor.
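The transformation protocol and the error metrics just described can be reproduced with a few lines of NumPy/SciPy. This is an illustrative sketch, not the code used in our experiments; the 'xyz' Euler convention and the helper names are assumptions.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def make_registration_pair(target, max_angle=45.0, max_trans=0.5, rng=None):
    """Create a source cloud from a target by a random rigid transform:
    Euler angles in [0, max_angle] degrees, translation in [-max_trans, max_trans].

    target: (N, 3) array. Returns (source, R_gt, t_gt).
    """
    rng = np.random.default_rng() if rng is None else rng
    angles = rng.uniform(0.0, max_angle, size=3)                # rotation angles (deg)
    R_gt = Rotation.from_euler("xyz", angles, degrees=True).as_matrix()
    t_gt = rng.uniform(-max_trans, max_trans, size=3)
    source = target @ R_gt.T + t_gt                             # x' = R x + t (row vectors)
    return source, R_gt, t_gt

def rotation_errors(R_gt, R_pred):
    """MSE / RMSE / MAE between ground-truth and predicted Euler angles
    for one sample; in practice these are averaged over the test set."""
    ang_gt = Rotation.from_matrix(R_gt).as_euler("xyz", degrees=True)
    ang_pr = Rotation.from_matrix(R_pred).as_euler("xyz", degrees=True)
    err = ang_gt - ang_pr
    return np.mean(err**2), np.sqrt(np.mean(err**2)), np.mean(np.abs(err))
```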
Registration on Unseen Data. In this experiment, we trained R-PointHop using training samples from all 40 classes. For evaluation, registration was performed on point clouds from the test data. The results are reported in Table 3.3. We see that R-PointHop clearly outperforms all six benchmarking methods. Two sets of target and source point clouds and their registered results are shown in the first two columns of Fig. 3.11. To plot point clouds, we use the Open3D library [110].

Figure 3.11: Registration of seven point clouds from the ModelNet40 dataset using R-PointHop.

Table 3.3: Registration on unseen point clouds
Method          MSE (R)  RMSE (R)  MAE (R)  MSE (t)   RMSE (t)  MAE (t)
ICP [6]         451.11   21.24     17.69    0.049701  0.222937  0.184111
Go-ICP [96]     140.47   11.85     2.59     0.00659   0.025665  0.007092
FGR [109]       87.66    9.36      1.99     0.000194  0.013939  0.002839
PointNetLK [2]  227.87   15.09     4.23     0.000487  0.022065  0.005405
DCP [89]        1.31     1.14      0.77     0.000003  0.001786  0.001195
SPA [35]        318.41   17.84     5.43     0.000022  0.004690  0.003261
R-PointHop      0.12     0.34      0.24     0.000000  0.000374  0.000295

Registration on Unseen Classes. We derive R-PointHop only from the first 20 classes of the ModelNet40 dataset. For registration, test samples from the remaining 20 classes are used. As shown in Table 3.4, R-PointHop generalizes well to unseen classes. PointNetLK and DCP have relatively larger errors as compared to their errors in Table 3.3. This indicates that the use of object labels makes these methods biased toward the seen categories. For the first three methods, the results are comparable with those in Table 3.3 as there is no training involved. For SPA and R-PointHop, the errors on unseen classes are similar to their errors on unseen objects in Table 3.3. This demonstrates the advantage of unsupervised learning methods for the registration of unseen classes.

Table 3.4: Registration on unseen classes
Method          MSE (R)  RMSE (R)  MAE (R)  MSE (t)   RMSE (t)  MAE (t)
ICP [6]         467.37   21.62     17.87    0.049722  0.222831  0.186243
Go-ICP [96]     192.25   13.86     2.91     0.000491  0.022154  0.006219
FGR [109]       97.00    9.84      1.44     0.000182  0.013503  0.002231
PointNetLK [2]  306.32   17.50     5.28     0.000784  0.028007  0.007203
DCP [89]        9.92     3.15      2.01     0.000025  0.005039  0.003703
SPA [35]        354.57   18.83     6.97     0.000026  0.005120  0.004211
R-PointHop      0.12     0.34      0.25     0.000000  0.000387  0.000298

Registration on Noisy Data. In this experiment, we were interested in aligning a noisy source point cloud with a target that is free from noise. Gaussian noise with zero mean and a standard deviation of 0.01 was added to the source. The registration results are presented in Table 3.5. The results demonstrate that R-PointHop is robust to Gaussian noise. A fine alignment step using ICP can further reduce the error. In other words, R-PointHop can act as a coarse alignment method in the presence of noise.

Registration on Partial Data. Registration of partial point clouds is common in practical scenarios. We considered the cases where the source and target have only a subset of points in common. To generate a partial point cloud, we selected a point at random and found its N nearest neighbors. We set N to be 3/4 of the total number of points in the point cloud. In our experiment, the initial point cloud has 1,024 points, and so the number of points in the partial point cloud is 768. The number of overlapping points between the source and the target is thereby random, between 512 and 768.
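A sketch of this partial-cloud generation is given below; it is a minimal illustration, and the function and parameter names are ours.

```python
import numpy as np

def make_partial_cloud(points, keep_ratio=0.75, rng=None):
    """Crop a point cloud to a random contiguous region: pick a seed
    point at random and keep its N nearest neighbors, where
    N = keep_ratio * (total number of points).

    points: (N, 3) array. Returns the (N_keep, 3) partial cloud.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_keep = int(keep_ratio * len(points))
    seed = points[rng.integers(len(points))]            # random seed point
    dists = np.linalg.norm(points - seed, axis=1)       # distances to the seed
    nearest = np.argsort(dists)[:n_keep]                # its N nearest neighbors
    return points[nearest]
```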
Table 3.5: Registration on noisy point clouds
Method            MSE (R)  RMSE (R)  MAE (R)  MSE (t)   RMSE (t)  MAE (t)
ICP [6]           558.38   23.63     19.12    0.058166  0.241178  0.206283
Go-ICP [96]       131.18   11.45     2.53     0.000531  0.023051  0.004192
FGR [109]         607.69   24.65     10.05    0.011876  0.108977  0.027393
PointNetLK [2]    256.15   16.00     4.59     0.000465  0.021558  0.005652
DCP [89]          1.17     1.08      0.74     0.000002  0.001500  0.001053
SPA [35]          331.73   18.21     6.28     0.000462  0.021511  0.004100
R-PointHop        7.73     2.78      0.98     0.000001  0.000874  0.003748
R-PointHop + ICP  1.16     1.08      0.21     0.000001  0.000744  0.001002

Table 3.6: Registration on partial point clouds (R-PointHop* indicates choosing correspondences without the ratio test). The left six columns report registration errors on unseen objects and the right six columns report registration errors on unseen classes.
Method          MSE (R)  RMSE (R)  MAE (R)  MSE (t)  RMSE (t)  MAE (t)  |  MSE (R)  RMSE (R)  MAE (R)  MSE (t)  RMSE (t)  MAE (t)
ICP [6]         1134.55  33.68     25.05    0.0856   0.2930    0.2500   |  1217.62  34.89     25.46    0.0860   0.293     0.251
Go-ICP [96]     195.99   13.99     3.17     0.0011   0.0330    0.0120   |  157.07   12.53     2.94     0.0009   0.031     0.010
FGR [109]       126.29   11.24     2.83     0.0009   0.0300    0.0080   |  98.64    9.93      1.95     0.0014   0.038     0.007
PointNetLK [2]  280.04   16.74     7.55     0.0020   0.0450    0.0250   |  526.40   22.94     9.66     0.0037   0.061     0.033
DCP [89]        45.01    6.71      4.45     0.0007   0.0270    0.0200   |  95.43    9.77      6.95     0.0010   0.034     0.025
PR-Net [90]     10.24    3.12      1.45     0.0003   0.0160    0.0100   |  15.62    3.95      1.71     0.0003   0.017     0.011
R-PointHop*     3.58     1.89      0.58     0.0002   0.0150    0.0008   |  3.75     1.94      0.58     0.0002   0.0151    0.0008
R-PointHop      2.75     1.66      0.35     0.0002   0.0149    0.0008   |  2.53     1.59      0.37     0.0002   0.0148    0.0008

Figure 3.12: (From left to right) The plots of the maximum rotation angle versus the root mean square rotation error, the mean absolute rotation error, the root mean square translation error, and the mean absolute translation error.

Table 3.7: Registration on the Stanford Bunny dataset
Method       MSE (R)  RMSE (R)  MAE (R)  MSE (t)  RMSE (t)  MAE (t)
ICP [6]      177.35   13.32     10.72    0.0024   0.0492    0.0242
Go-ICP [96]  166.85   12.92     4.52     0.0018   0.0429    0.0282
FGR [109]    3.98     1.99      1.49     0.0397   0.1993    0.1658
DCP [89]     41.45    6.44      4.78     0.0016   0.0406    0.0374
R-PointHop   2.21     1.49      1.09     0.0013   0.0361    0.0269

The results of partial-to-partial registration are presented in Table 3.6. They are shown under two scenarios: 1) registration on unseen point clouds and 2) registration on unseen classes. R-PointHop gives the best performance in the registration of partial data, too. A critical element in registering partial point clouds is to find correspondences between overlapping points. R-PointHop handles this in the same way as discussed before, relying on its effective features to select good correspondences. Furthermore, we show the effectiveness of using the ratio test to filter out bad correspondences. The row of R-PointHop* in Table 3.6 shows the errors when the ratio test is removed. The errors are higher than those with the ratio test. Some results on partial data registration are shown in Fig. 3.11, where columns 3, 4 and 5 show the results where only the source is partial, and columns 6 and 7 show the results where both the source and the target are partial.

Test on Real World Data. We next tested R-PointHop on 3D point clouds from the Stanford Bunny dataset [80]. It consists of 10 point cloud scans. Typically, each scan contains more than 100k points.
In contrast with the synthetic ModelNet40 dataset, it is a real world dataset. We apply a random spatial transformation to generate the source point clouds. For registration, we 66 select 2,048 points randomly so that they are evenly spanned across the object. The R-PointHop derived from all 40 classes of ModelNet40 is used for feature extraction. For DCP, we use their model trained on ModelNet40 and test on the Bunny dataset. We compare R-PointHop with other methods and show the results in Table 3.7. One representative registration result is also shown in Fig. 3.13. Table 3.7 shows that R-PointHop derived from ModelNet40 can be generalized to the Bunny dataset well. In contrast, DCP does not perform so well on the Bunny dataset as com- pared with ModelNet40. We further experiment on point clouds from the Stanford 3D scanning repository, which has a collection of several categories of objects including Bunny, Buddha [16], Dragon [16], and Armadillo [39]. Some input scans and their corresponding registered results are shown in Fig. 3.14. Figure 3.13: Registration on the Stanford Bunny dataset: the source and the target point clouds (left) and the registered result (right). Local vs. Global Registration. ICP is local in nature and works only when the optimal alignment is close to the initial alignment. In this case, R-PointHop can be used as an initialization for ICP. That is, R-PointHop can be used to obtain the initial global alignment, after which ICP can be used to achieve a tighter alignment. To demonstrate this property, we plot the mean absolute error (MAE) and the root mean squared error (RMSE) for rotation and translation against the maximum rotation angle in Fig. 3.12. As shown in the figure, as the maximum rotation angle increases, the MAE and the RMSE for ICP increase steadily. In contrast, the RMSE and the MAE 67 Figure 3.14: Registration of point clouds from the Stanford 3D scanning repository, where the objects are (from left to right): drill bit, armadillo, Buddha, dragon and bunny. The top row shows input point clouds while the bottom row shows the registered output. of R-PointHop are very stable, reflecting the global registration power of R-PointHop. In Fig. 3.15, we show three registration results: 1) using ICP alone, 2) using R-PointHop alone, and 3) R-PointHop followed by ICP. We can obtain slightly better results in the third case as compared to the second case. However, without initializing with R-PointHop, ICP fails to align well. Figure 3.15: (From left to right) The source and target point clouds to be aligned, registration with ICP only, with R-PointHop only, with R-PointHop followed by ICP. 3DDescriptor. Fig. 3.16 shows the t-SNE plot of some point features obtained by R-PointHop. It is observed that the points of a similar local structure are close to each other, irrespective of 68 their spatial locations in the 3D point cloud model as well as whether they belong to the same object or the same class. To give an example, we show two point cloud models of a table and a chair in the left. The points on their legs have a similar neighborhood structure and their features are closer in the t-SNE embedding space. This demonstrates the capability of R-PointHop as a general 3D descriptor. As an application, we show the registration of two different objects of the same object class in Fig. 3.17, which has two different airplanes and cars. Although the objects are different, we can still align them reasonably well. 
This is because points in similar semantic regions are selected as correspondences. Apart from 3D correspondence and registration, the 3D descriptor can be used for a variety of applications such as point cloud retrieval, which can be a future extension of this work. Figure 3.16: The t-SNE plot of point features, where a different number indicates a different object class of points. Some points are highlighted and their 3D location in the point cloud is shown. Features of points with a similar local neighborhood are clustered together despite of differences in their 3D coordinates. 69 Figure 3.17: Registration of two point cloud models, where the first two columns are input point clouds and the third column is the output after registration. AblationStudy. The effects of the model parameters, ratio test and use of RANSAC on the object registration performance are discussed in this subsection. We report the mean absolute errors using different input point numbers and neighborhood sizes of LRF in the first section of Table 3.8. The error values are comparable for different numbers of input points and LRF neighborhood sizes. The performance slightly drops when 512 points are used. Hence, we fix 2048 points and set the neighborhood size of LRF to 128 in following experiments. Next, we consider various degree of partial overlaps and the effect of adding noise of three levels in the second and the third sections of Table 3.8, respectively. For partial registration, the error increases as the maximum overlapping region decreases. We see consistent improvement with the ratio test and RANSAC. Similarly, the performance improves after inclusion of the ratio test and RANSAC for registration with noise. To gain further insights into the difference between simple object point clouds and complex indoor point clouds, we remove the surface normal and geometric features and use only point coordinates for indoor registration. This leads to a sharp decrease in performance, to an average recall and precision of 0.39 and 0.19, respectively. Clearly, the use of point coordinates is not suf- ficient for the registration of complex indoor point clouds. We also reduce the LRF neighborhood size for indoor point clouds and see whether 64 or 128 neighbors could give similar performance 70 Table 3.8: Ablation study on object registration. Inputpoints LRF Partialoverlap Noisestd. Ratiotest RANSAC MAE(R) MAE(t) 1024 64 0.24 0.000301 1024 32 0.25 0.000314 2048 128 ✓ 0.24 0.000297 2048 64 ✓ 0.24 0.000300 1024 64 ✓ 0.24 0.000295 512 32 ✓ 0.29 0.000546 1536 96 75% 0.56 0.000856 1024 64 50% 2.41 0.001340 512 32 25% 8.67 0.031237 1536 96 75% ✓ ✓ 0.31 0.000824 1024 64 50% ✓ ✓ 0.87 0.001339 512 32 25% ✓ ✓ 6.69 0.031202 2048 128 0.01 0.99 0.003752 2048 128 0.05 1.43 0.004138 2048 128 0.1 2.81 0.007123 2048 128 0.01 ✓ ✓ 0.88 0.003711 2048 128 0.05 ✓ ✓ 1.37 0.004122 2048 128 0.1 ✓ ✓ 2.74 0.007093 71 as observed in object registration. Again, there is some performance degradation, and the best results are achieved with 256 neighbors. This is attributed to the fact that the more the number of points used to find the LRF, the more stable the local PCA against small perturbations and noise. In other words, the optimal point attributes and the hyper-parameter settings are different for registering object and indoor scene point clouds. 3.3.6 TowardGreenLearning One shortcoming of deep learning methods is that they tend to have a large model size, which make them difficult to deploy on mobile devices. 
Moreover, recent studies indicate that training deep learning models has a large carbon footprint. Along with the environmental impact, expen- sive GPU resources are needed to successfully train these networks in reasonable time. The need to search for an environmental friendly green solution to different AI tasks, or green AI [75], is on the rise. Although the use of efficiency (training time, model size etc.) as an evaluation criterion along with the usual performance measures was emphasized in [75], no specific green models were presented. R-PointHop offers a green solution in terms of a smaller model size and training time as compared with deep-learning-based methods. We trained PointNetLK and DCP methods using the open source codes provided with the default parameters set by authors. We compare the training complexity below. • DCP took about 27.7 hours to train using eight NVIDIA Quadro M6000 GPUs. • PointNetLK took approximately 40 minutes to train one epoch using one GPU while the default training setting is 200 epochs. Thus, the total training time was 133.33 hours. 72 • R-PointHop took only 40 minutes to train all model parameters using an Intel(R) Xeon(R) CPU E5-2620 v3 at 2.40GHz. The inference time of all methods was comparable. However, since ICP, Go-ICP, SPA, and PointNetLK are iterative methods, their inference time is a function of the iteration number. We observe that the required iteration number varies from model to model. The model size of R-PointHop is only 200kB compared to 630kB for PointNetLK and 21.3MB of DCP. The use of transformer makes the model size of DCP significantly larger. Although the model free methods are most favorable in terms of model sizes and training time, their registra- tion performance is much worse. Thus, R-PointHop offers a good balance when all factors are considered. 3.3.7 Discussion To determine whether the performance gain using supervised deep learning is due to large un- labeled data, data labeling, or both, we split experiments on ModelNet40 into two parts (i.e., tests on seen and unseen object classes) as a case study. Some supervised learning methods per- formed poorer on unseen classes (see Tables 3.3, 3.4, and 3.5), which indicates that they learn object categories indirectly, even though their supervision uses ground truth rotation/translation values without class labels. This behavior is not surprising since the two benchmark methods, PointNetLK and Deep Closest Point (DCP). are derived from PointNet and DGCNN, respectively, which were designed for point cloud classification. In contrast, our feature extraction is rooted in PointHop, which is unsupervised and task-agnostic. Our model does not know the downstream task. Hence, it can generalize well to unseen classes. To show this point furthermore, we use 73 the R-PointHop model learned from ModelNet40 and evaluate it on the Stanford bunny model. Its performance gap is smaller than those of supervised learning methods. These experiments indicate that the performance gain of supervised learning methods is somehow limited to similar instances of point clouds that the models have already seen and their generalization capability to unseen classes is weaker. In general, we see that R-PointHop works extremely well for the object registration case and also matches the performance with PPFNet for indoor registration. However, there exist some recent exemplary networks (e.g., [25]) that have a higher recall on the 3DMatch dataset. 
The eight octant partitioning operation in the feature construction step fails to encode better local structure information for point clouds from this dataset. Since R-PointHop is based on successive aggregation of local neighborhood information, an initial set of attributes that captures better local neighborhood structure can help improve the performance. One such choice could be the FPFH descriptor. For ModelNet40, we see that the performance of R-PointHop degrades when the amount of overlap reduces or the amount of noise increases. One reason is the stability of LRF. It is observed that a larger neighborhood number tends to compensate for noise and surface variations in our experiments on the 3DMatch dataset. However, when the number of points in a dataset is small, we cannot opt for more points in finding LRF. Although RANSAC offers a more robust solution, a fine alignment step may be necessary. That is, some ICP iterations may achieve a tighter alignment. 74 3.4 Conclusion In this chapter, first, an unsupervised point cloud registration method called Salient Points Anal- ysis (SPA) was proposed. The SPA method successfully registers two point clouds with only pairs of discriminative points. The pairing is achieved by the shortest distance in the feature domain learned via PointHop++. Next, the R-PointHop method was proposed. R-PointHop extracts point features of varying neighborhood sizes in a one-pass manner, where the neighborhood size grows as the number of hop increases. Features extracted by R-PointHop are invariant with respect to rotation and translation due to the use of the local reference frame (LRF). This enables R-PointHop to find corresponding pairs accurately in presence of partial point clouds and larger rotation angles. It was shown by experimental results that R-PointHop offers the state-of-the-art performance in point cloud registration. It is worth noting that SPA and R-PointHop do not follow the end-to-end optimization frame- work as adopted by deep learning methods nowadays. Furthermore, their training times and model sizes are less than those of deep learning methods by an order of magnitude. This makes SPA and R-PointHop green solutions. Also, it is typical that supervised learning methods outper- forms unsupervised learning methods. But, our work shows that ground truth transformations are not necessary in the point cloud registration problem. 75 Chapter4 PointCloudPoseEstimationandSO(3)Invariant Classification 4.1 Introduction Content-based point cloud object retrieval and category-level point cloud object pose estimation are two important tasks of point cloud processing. For the former, one can find similar objects from the gallery set, which can provide more information about the unknown object. For the latter, the goal is to estimate the 6-DOF (Degree of Freedom) pose of a 3D object comprising of rotation (R ∈ SO(3)) and translation (t ∈ R 3 ) , with respect to a chosen reference. The pose information can facilitate downstream tasks such as object grasping, obstacle avoidance and path planning for robotics. In a typical scene understanding problem using data from range sensors or a depth camera, this problem would arise after a 3D detection algorithm has successfully localized and labeled the objects present in the point cloud scan. Accordingly, an unsupervised point cloud object retrieval and pose estimation method, called PCRP [36], is proposed in this chapter. 76 Figure 4.1: Summary of the proposed PCRP method. 
First, a similar object to the input query object (in red) is retrieved from the gallery set (top row). Then, the query object is registered with the retrieved object (bottom row) to estimate its pose. We extend the proposed R-PointHop method to the context of point cloud object retrieval and pose estimation against a gallery point cloud set, which contains point cloud objects with known pose orientation information. Point cloud retrieval has been researched for quite some time in terms of retrieving a similar object from a database or aggregating local feature descriptors for recognizing places in the wild. Yet, retrieving objects with pose variations is less investigated. Here, we show how R-PointHop features can be reused to retrieve a similar point cloud object. Registration of two similar objects (potentially with partial overlap), which was the focus of R- PointHop, has been widely studied. Although being a related problem, estimating the pose of a single object addressed in this work is less explored. We analyze several bottlenecks of R- PointHop and propose modifications to enhance its performance for pose estimation. Built upon enhanced R-PointHop, PCRP registers the unknown point cloud object with those in the gallery set to achieve content-based object retrieval and pose estimation jointly. As shown in Fig. 4.1, PCRP consists of “object retrieval” and “pose estimation” two functions. For object 77 retrieval, it first aggregates the pointwise features learned from R-PointHop based on VLAD (Vector of Locally Aggregated Descriptors) [29] to obtain a global point cloud feature vector and then use it to retrieve a similar pre-aligned object from the gallery set. For pose estimation, the 6-DOF pose of the query object is found by registering it with the retrieved object. While PCRP assumes the object category is known for retrieval, the next work, S3I-PointHop aims at classifying point cloud objects with arbitrary rotations. Accordingly, the localized object can be first classified using S3I-PointHop and then fed to PCRP to determine its pose. The un- ordered nature of 3D point cloud demands methods to be invariant toN! point permutations for a scan of N points. It was demonstrated in the pioneering work called PointNet [66] that per- mutation invariance can be achieved using a symmetric function such as the maximum value of point feature responses. Besides permutations, invariance with respect to rotations is desirable in many applications such as 3D registration. In particular, point cloud features are invariant with any 3D transformation in the SO(3) group; namely, the group of3× 3 orthogonal matrices representing rotations in 3D. Achieving rotation invariance can guarantee that point clouds expressed in different orien- tations are regarded the same and, thereby, the classification result is unaffected by the pose. State-of-the-art methods do not account for rotations, and they perform poorly in classifying dif- ferent rotated instances of the same object. In most cases, objects are aligned in a canonical pose before being fed into a learner. Several approaches have been proposed to deal with this problem. First is data augmentation, where different rotated instances of the same object are presented to a learner. Then, the learner implicitly learns to reduce the error metric in classifying similar ob- jects with different poses. This approach leads to an increase in the computation cost and system complexity. Yet, there is no guarantee in rotational invariance. 
A more elegant way is to design 78 point cloud representations that are invariant to rotations. Thus, point cloud objects expressed in different orientations are indistinguishable to classifiers. Another class of methods are based on SO(3) equivariant networks, where invariance is obtained as a byproduct of the equivariant point cloud features. The green learning based point cloud classification works PointHop [107] and PointHop++ [106] assume that the objects are pre-aligned. Due to this assumption, these methods fail when classifying objects with different poses. The proposed S3I-PointHop method is an SO(3) group invariant member of the PointHop family. Invariance to rotations is achieved through the derivation of invariant representations by leveraging principal components, rotation invariant local/global features, and point-based eigen features. 4.2 PCRPMethod 4.2.1 Methodology Figure 4.2: An overview of the proposed PCRP method. 79 As shown in Fig. 4.2, the PCRP method consists of the following three stages. First, features of every point in the input point cloud are extracted using R-PointHop. Second, features of all points are aggregated into a global descriptor using the VLAD method. The nearest neighbor search is then used to retrieve a similar pre-aligned object from the gallery set. Finally, the input object is registered to the retrieved object to obtain the 6-DOF pose. Each of them is elaborated below. FeatureExtraction. In R-PointHop, point attributes are constructed by partitioning the 3D space around a point into eight octants using three orthogonal directions given by the local reference frame. The mean of points in each octant is concatenated to get a 24D attribute vector. Yet, we observe that the 24D attribute vector is sensitive to noise and unable to capture com- plex local surface patterns for distinction. A modified version that appends point coordinates with eigen features was used for indoor scene registration and in GPCO for odometry. Actually, histogram-based point descriptors such as SHOT (Signature Histogram of Orientations) [78] and FPFH (Fast Point Feature Histogram) [73] have been widely used to describe the local surface geometry. We may leverage them as well. One drawback of histogram-based descriptors is that they cannot capture the far-distance information since they only have a single scale. We propose to integrate histogram-based local descriptors with R-PointHop, which has a multi-scale representation capability, to get a new descriptor. Specifically, we replace the octant- based mean coordinates attributes in the original R-PointHop with the FPFH descriptor in the first hop. The rest is kept the same as the original R-PointHop. That is, the Saab transform is used to get the first-hop spectral representation, and subsequent hops still involve attribute construction by partitioning the 3D space into eight octants and getting 8D attributes for each 80 spectral component. We should emphasize that FPFH is invariant to rotations so that the rotation- invariant property of R-PointHop is preserved without any other adjustment. The output of the first-hop stemming from FPFH is more powerful than the original R-PointHop design. We give the new descriptor a name – FR-PointHop. This modification enriches R-PointHop since it can take any local rotation-invariant feature representation and generalize it to a multi-scale descriptor using the standard R-PointHop pipeline. FeatureAggregation. 
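Concretely, the hop #1 attributes of FR-PointHop can be obtained per point as sketched below with the Open3D library [110] used elsewhere in this work. This is a minimal illustration; the search radii and neighbor counts are placeholders rather than the values used in our experiments.

```python
import numpy as np
import open3d as o3d

def fpfh_attributes(points, normal_radius=0.1, fpfh_radius=0.25, max_nn=64):
    """Compute a 33D FPFH descriptor per point to serve as the
    rotation-invariant hop #1 attribute of FR-PointHop.

    points: (N, 3) NumPy array. Returns an (N, 33) array.
    """
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points)
    # FPFH requires surface normals
    pcd.estimate_normals(
        o3d.geometry.KDTreeSearchParamHybrid(radius=normal_radius, max_nn=max_nn))
    fpfh = o3d.pipelines.registration.compute_fpfh_feature(
        pcd,
        o3d.geometry.KDTreeSearchParamHybrid(radius=fpfh_radius, max_nn=max_nn))
    return np.asarray(fpfh.data).T          # Open3D stores features as (33, N)
```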
For an input point cloud, we extract point-wise features using FR- PointHop. Features of all points need to be aggregated to yield a global descriptor for object re- trieval. One choice of aggregation is to use global max/mean pooling. It has been widely adopted in point cloud classification. However, global pooling is not a good choice since point features obtained by FR-PointHop only cover the information of a local neighborhood. In the retrieval literature, Bag of Words (BoW) and vectors of locally aggregated descriptors (VLAD) [29] are popular methods in aggregating local features such as SIFT [53]. Here, we adopt VLAD [29] to aggregate point descriptors obtained by R-PointHop to yield a global feature vector that is suitable for retrieval. The global feature aggregation process is stated below. The first step is to generate a codebook of k codewords ofd dimensions, whered is the fea- ture dimension. The k-means clustering algorithm is used to achieve this objective. Learned point features from the training data are used to form clusters, whose centroids are computed. The k centroids represent k codewords in the codebook. Given an input point cloud, its global VLAD feature vector is calculated as follows. The feature vector of each point is first assigned to the nearest codeword based on the shortest distance criterion. Next, for each point, the differ- ence between its descriptor and the assigned codeword is calculated, which represents the error vector in the feature space. Then, the differences with respect to the same codeword are added 81 together. Finally, error sums of all codewords are concatenated to get the VLAD feature. For a d-dimensional feature vector, the VLAD feature is of dimensionk× d, wherek is the codeword number. In the training process, we use the generated codebook to pre-compute the VLAD features of point cloud objects in the gallery set. They are stored along with the codebook. In the inference stage, the query object is first passed through FR-PointHop to extract pointwise features. Then, its VLAD feature is calculated using the same codebook and compared with the VLAD features of point cloud objects from the gallery set. The nearest neighbor search is used to retrieve the best matching point cloud. It is worthwhile to mention that, due to the rotation invariant nature of FR-PointHop, point features and, hence, VLAD features are invariant with object’s pose, which facilities retrieval in presence of pose variations. PoseEstimation. Once an object from the gallery set is retrieved, the next step is to reg- ister the query and the retrieved objects. The process closely resembles the registration task of R-PointHop with additional aid of the object symmetry information. The pointwise features extracted from the query object for retrieval are reused here. For objects in the gallery set, we only store their VLAD descriptors rather than their pointwise features due to the high memory cost of the latter. Yet, pointwise features of the retrieved object can be computed again using FR-PointHop at run time. The cost is manageable since it is done for one retrieved object. Af- terwards, corresponding points between query and retrieved objects are found using the nearest neighbor criterion in the feature space. We exploit the object symmetry information to limit the correspondence search region. 3D objects often possess different forms of symmetry. For example, chair objects have a planar sym- metry. 
To avoid mismatched correspondences arising due to object symmetry, correspondences 82 are constrained to be among disjoint sets of points. For every object, we divide its points into two disjoint sets using its principal components. First, we calculate 1D moments of the points projected along each principal direction about the origin. Then, we take the absolute difference between sum of moments of positive coordinates and negative coordinates. Then, the principal component with the least absolute difference is used to divide the points in disjoint sets. The main intuition is to find an axis along which the projection of points is most symmetric. The search space for this axis is restricted to object’s principal components for simplicity. Yet, it yields reasonable results as seen from top row of Fig. 4.3. Afterwards, the point correspondences are found in each disjoint set (see the bottom row of Fig. 4.3) and then concatenated. Figure 4.3: (Top) Illustration of partitioning of point clouds into two symmetrical parts and (Bot- tom) point correspondences between symmetric parts. Once the point correspondence is built, we use the orthogonal procrustes method [74], which is based on the singular value decomposition (SVD) of the data covariance matrix, to estimate the rotation and translation that aligns the query object to the retrieved object. This part is identical with that in R-PointHop. The transformation is optimal in the sense that it achieves the least sum of squared errors between matching points after alignment. RANSAC [22] can be used to get a more robust solution and avoid the noise effect in building correspondences. Since objects 83 in the gallery set are pre-aligned, the obtained rotation and translation gives the 6-DOF pose of the input query object. 4.2.2 ExperimentalResults We use point cloud objects from the ModelNet40 dataset [93] in all experiments. For retrieval and pose estimation, we focus on four object categories: airplane, chair, sofa, and car. For every object, the dataset consists of 2048 points sampled from the surface. We use the original train/test split. The experiments are divided into three parts as discussed below. For R-PointHop, we use two hops and adjust the energy threshold so as to have a 200-D point feature. The farthest point sampling strategy is used after the first hop where the point cloud is downsampled by half. ObjectRegistrationwithFR− PointHop. On replacing the attributes in R-PointHop attributes with the FPFH descriptor, we see an improvement in all object registration tasks of FR- PointHop over R-PointHop. A random rotation and translation is applied to the input object. For the challenging case of partial object registration (under the assumption that the partial reference point cloud is available), we show the rotation and translation errors Table 4.1 in terms of the mean squared error (MSE), the root mean squared error (RMSE) and the mean absolute error (MAE). We conduct performance benchmarking with SPA [35] (GL-based), FGR [109] (FPFH- based), and PRNet [90]. Clearly, the modified feature representation in FR-PointHop is favorable even in presence of the reference object. ObjectRetrievalwithPCRP. We use Precision@M and the chamfer distance as two evaluation metrics for object retrieval. For every query object, we generate its ground truth by rank-ordering objects in the gallery set based on the least chamfer distance. 
The chamfer distance is calculated as the sum of Euclidean distances of every point in one point cloud to its nearest 84 Table 4.1: Performance comparison of object registration. Rotationerror (in degree) Translationerror Method MSE RMSE MAE MSE RMSE MAE SPA [35] 229.09 15.13 8.22 0.0019 0.0435 0.0089 FGR [109] 126.29 11.24 2.83 0.0009 0.0300 0.0080 PR-Net [90] 10.24 3.12 1.45 0.0003 0.0160 0.0100 R-PointHop 2.75 1.66 0.35 0.0002 0.0149 0.0008 FR-PointHop 2.68 1.64 0.33 0.0002 0.0149 0.0007 point in the other point cloud. The Precision@M is an average measure of the number of topM retrieved objects that match with the top M objects from its ground truth. We expect a higher Precision@M score while a lower chamfer distance for good retrieval methods. Ten codewords are used in the VLAD implementation. we report Precision@10 scores and Top-1 chamfer distances for two cases in Table 4.2. First, the query object is aligned with those in the gallery set. Second, a uniform random rotation and translation is applied to the query object so that it has an arbitrary pose. We provide comparisons with PointNet [66] (an exemplary deep learning method), CORSAIR [108] (on similar lines to our work, but supervised), PointHop [107] (GL-based), and FPFH [73]. For PointNet and PointHop, globally pooled features are adopted for retrieval. For FPFH, we aggregate point features using VLAD. Furthermore, we replace our VLAD aggregated feature with max pooling and report the performance separately. Results in Table 4.2 show the superiority of PCRP over others. For PointNet and PointHop, we see a drop in performance in the case of arbitrary poses. We expect similar performance for other methods that do not take pose variations into account. Since CORSAIR, FPFH and PCRP use pose invariant feature representations, their performance is robust against pose variation. PCRP is the 85 Table 4.2: Comparison of point cloud retrieval performance. Pre-aligned objects Arbitrary poses Method Precision@10 (%) Top-1 Chamfer distance Precision@10 (%) Top-1 Chamfer distance PointNet [66] 60.66 0.121 53.40 0.145 PointHop [107] 58.23 0.129 19.71 0.211 FPFH [73] 53.23 0.164 52.12 0.160 CORSAIR [108] 61.28 0.106 61.24 0.107 PCRP (max pool) 43.23 0.147 41.89 0.131 PCRP (VLAD) 63.23 0.101 63.07 0.111 best among the three. Moreover, aggregating local features with max pooling in PCRP degrades the performance significantly, thereby justifying the inclusion of VLAD in PCRP. ObjectPoseEstimationwithPCRP. We report mean and median rotation errors be- tween the predicted and ground truth pose for all four object classes in Table 4.3. For performance benchmarking, we select ICP [6] and FGR [109] among traditional methods for which we pro- vide the template point cloud. For learning-based methods, we selected Chenetal.[9] (supervised) and Lietal.[49] (self-supervised) two methods, which are based on equivariant networks. Finally, CORSAIR [108] is also included. Table 4.3: Mean and median rotation errors in degrees. Chair Airplane Car Sofa Method Mean Median Mean Median Mean Median Mean Median ICP [6] 88.92 96.28 8.11 1.22 22.76 2.94 39.00 9.69 FGR [109] 22.10 6.04 6.84 3.33 18.44 2.69 9.97 2.36 CORSAIR [108] 13.99 4.58 8.09 3.43 12.09 2.13 9.12 3.24 Chen et al.[9] 8.56 3.87 3.35 1.12 9.48 1.85 4.76 1.56 Li et al.[49] 7.05 4.55 23.09 1.66 17.24 2.13 8.87 3.22 PCRP 14.42 4.24 2.98 1.65 11.22 2.11 8.84 2.29 86 From mean and median rotation errors, PCRP is significantly better than ICP which is only good for local registration. 
It also outperforms FGR and CORSAIR. Its performance is slightly inferior to the method of Li et al.[49]. In all the cases, the mean rotation errors are higher than the median error. We study the distribution of rotation errors across all point clouds and observe that the higher error is only due to a large registration error in only a few point clouds. Actually, after plotting the CDF of the error for all point clouds, more than 90% of test samples have a rotation error less than 5 degrees. It is worthwhile to point out the advantages of PCRP. First, it combines unsupervised feature learning with established non-learning-based pose estimation using point correspondences. The training time is typically less than 30 minutes in building the Saab kernels and the VLAD code- book. The FR-PointHop model size is only 230kB along with 1.6MB to store the VLAD features and the codebook. In contrast, we find that the model size of the exemplary works is 30MB for Chenetal.and 72MB for Lietal.. This clearly highlights that PCRP is a green solution and would be favorable in resource constrained occasions. Further investigation into failure cases reveals that most of them occur when a similar match- ing point cloud is not available in the database for retrieval. This is a bottleneck of PCRP. Under the assumption that a similar object is present in the gallery set, it can estimate the pose accu- rately. One way to filter out query samples automatically is to compare the Chamfer distance to the best retrieved object. If the Chamfer distance is above a certain threshold, it cannot be treated as a reliable result. 87 4.2.3 Discussion A point cloud pose estimation method called PCRP was proposed in this section. PCRP estimates the 6-DOF pose comprising of rotation and translation of a point cloud object using a similar pre-aligned object of the same object category. It uses the R-PointHop method to extract point features from 3D point cloud objects. The features are aggregated into a global descriptor using VLAD and used to retrieve similar pre-aligned objects. Finally, the point features of the query and retrieved point clouds are used to estimate the 3D pose of the query point cloud using registration. 4.3 S3I-PointHopMethod The S3I-PointHop method assigns a class label to a point cloud scan, X, whose points are ex- pressed in an arbitrary coordinate system. Its block diagram is shown in Fig. 4.4. It comprises of object coarse alignment, feature extraction, dimensionality reduction, feature aggregation, fea- ture selection and classification steps as detailed below. Figure 4.4: An overview of the proposed S3I-PointHop method: 1) an input point cloud scan is approximately aligned with the principal axes, 2) local and global point features are extracted and concatenated followed by the Saab transform, 3) point features are aggregated from differ- ent conical and spherical volumes, 4) discriminant features are selected using DFT and a linear classifier is used to predict the object class. 88 4.3.1 Methodology PoseDependencyinPointHop. The first step in PointHop is to construct a 24-dimensional local descriptor for every point based on the distribution of 3D coordinates of the nearest neigh- bors of that point. 3D rotations are distance preserving transforms and, hence, the distance be- tween any two points remains the same before and after rotation. As a consequence, the nearest neighbors of points are unaffected by the object pose. 
However, the use of 3D coordinates makes PointHop sensitive to rotations since the 3D Cartesian coordinates of every point change with rotation. Furthermore, the 3D space surrounding the current point is partitioned into 8 octants using the standard coordinate axes. The coordinate axes change under different orientations of the point cloud scan. We align an object with its three principal axes. The PCA alignment only offers a coarse alignment, and it comes with several ambiguities as pointed out in [46]. Fur- thermore, object asymmetries may disturb the alignment since PCA does not contain semantic information. Yet, fine alignment is not demanded. Here, we develop rotation invariant features based on PCA aligned objects. FeatureExtraction. Local and global information fusion is effective in feature learning for point cloud classification [91]. To boost the performance of S3I-PointHop, three complementary features are ensembled. The first feature set contains the omni-directional octant features of points in the 3D space as introduced in PointHop. That is, the 3D space is partitioned into eight octants centered at each point as the origin. The mean of 3D coordinates of points in each octant then constitute the 24D octant feature. The second feature set is composed by eigen features [27] obtained from the covariance analysis of the neighborhood of a point. They are functions of the three eigen values derived from the Singular Value Decomposition (SVD) of the local covariance 89 matrix. The 8 eigen features comprise of linearity, planarity, anisotropy, sphericity, omnivariance, verticality, surface variation and eigen entropy. They represent the surface information in the local neighborhood. The third feature set is formed by geometric features derived from distances and angles in local neighborhoods as proposed in [48]. For simplicity, we replace the geometric median in [48] with the mean of the neighboring coordinates. The 12D feature representation is found using the K nearest neighbors, leading to a pointwise 12× K matrix. While a small network is trained in [48] to aggregate these features into a single vector, we perform a channel- wise max, mean andl 2 -norm pooling to yield a 36D vector of local geometric feature. The octant, covariance and geometric features are concatenated together to build a 68D (24+8+36 = 68) feature vector. After that, the Saab transform is performed for dimension reduction. FeatureAggregation. The point features need to be aggregated into a global point cloud feature for classification. A symmetric aggregation function such as max or average pooling is a popular choice for feature aggregation. Four aggregations (the max, mean,l 1 norm, andl 2 norm) have been used in PointHop and PointHop++. Instead of aggregating all points globally at once as shown in Fig. 4.5 (a), we propose to aggregate subsets of points from different spatial regions here. We consider regions of the 3D volume defined by cones and spheres. For conical aggregation, we consider two types of cones, one with tip at the origin and the other with tip at a unit distance along the principal axes. They are illustrated in Figs. 4.5 (b) and (c), respectively. The latter cone cuts the plane formed by the other two principal axes in a unit circle and vice versa for the former. For each principal axis, we get four such cones, two along the positive axis and two along the negative. Thus, 12 cones are formed for all three axes in total. 
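A minimal sketch of the membership test for these two cone types along one principal axis is given below. We assume the point cloud is PCA-aligned and normalized to the unit sphere, and we read the unit-circle intersection described above as a 45-degree half-angle; the helper and its arguments are illustrative.

```python
import numpy as np

def cone_masks(points, axis=0, positive=True):
    """Membership masks for the two cones along one principal axis:
    (a) tip at the origin opening outward along the axis, and
    (b) tip at unit distance opening back toward the origin.
    Both intersect the orthogonal coordinate plane in a unit circle,
    which we interpret as a 45-degree half-angle.

    points: (N, 3) PCA-aligned coordinates. Returns two boolean masks.
    """
    sign = 1.0 if positive else -1.0
    along = sign * points[:, axis]                          # signed coordinate along the axis
    radial = np.linalg.norm(np.delete(points, axis, 1), axis=1)

    tip_at_origin = (along >= 0) & (radial <= along)        # cone (a)
    tip_at_unit = (along >= 0) & (radial <= 1.0 - along)    # cone (b)
    return tip_at_origin, tip_at_unit

# feats[mask] for each cone can then be pooled (max, mean, variance, l1, l2)
# to form the regional aggregation described in this section.
```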
For each cone, only the features of points lying inside the cone are pooled together. The pooling 90 methods are the max, mean, variance,l 1 norm, andl 2 norm. This means for a single point feature dimension, we get a 5D feature vector from each cone. Figure 4.5: Illustration of conical and spherical aggregation. The conventional “global pooling" is shown in (a), where features of all points are aggregated at once. The proposed “regional pooling" schemes are depicted in (b)-(d), where points are aggregated only in distinct spatial regions. Only, the solid red points are aggregated. For better visual representation, cones/spheres along only one axis are shown. (b) and (c) use the conical pooling while (d) adopts spherical pooling in local regions. For spherical aggregation, we consider four spheres of a quarter radius centered at a distance of positive/negative one and three quarters from the origin along each principal axis. One ex- ample is illustrated in Fig. 4.5 (d). This gives 12 spheres in total. Points lying in each sphere are pooled together in a similar manner as cones. For instance, points lying in different cones for four point cloud objects are shaded in Fig. 4.6. Unlike max/average pooling, aggregating local feature descriptors into a global shape de- scriptor such as Bag of Words (BoW) or Vector of Locally Aggregated Descriptors (VLAD) [29] 91 Figure 4.6: An example of conical aggregation. For every point cloud object, points lying in each cone are colored uniquely. is common in traditional literature. On the other hand, the region-based local spatial aggrega- tion has never been explored before. These resulting features are powerful in capturing local geometrical characteristics of objects. DiscriminantFeatureSelectionandClassification . In order to select a subset of dis- criminant features for classification, we adopt the Discriminant Feature Test (DFT) as proposed in [97]. DFT is a supervised learning method that can rank features in the feature space based on their discriminant power. Since they are calculated independently of each other, the DFT com- putation can be parallelized. Each 1D featuref i of all point clouds are collected and the interval [f i min ,f i max ] is partitioned into two subspacesS i L andS i R about an optimal thresholdf i op . Then, the purity of each subspace is measured by a weighted entropy loss function. A smaller loss indi- cates stronger discriminant power. DFT helps control the number of features fed to the classifier. As shown in Sec. 4.3.2, it improves the classification accuracy significantly and prevents classifier from overfitting. In our experiments, we select top 2700 features. Finally, we train a linear least squares classifier to predict the object class. 92 4.3.2 Experiments We evaluate the proposed S3I-PointHop method for the point cloud classification task on the ModelNet40 dataset [93], which consists of 40 object classes. Objects in ModelNet40 are pre- aligned. We rotate them in the train and test sets in the following experiments. The rotation angles are uniformly sampled in[0,2π ]. We usez to denote random rotations along the azimuthal axis and SO(3) to indicate rotations about all three orthogonal axes. In Tables 4.4, 4.5 and 4.6, z/SO(3) means that the training set follows the z rotations while the test set adopts SO(3) rotations, and so on. For all experiments, we set the numbers of nearest neighbors in calculating geometric, covariance, and octant features to be 128, 32, and 64, respectively. 
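For reference, the DFT-based feature ranking used in Sec. 4.3.1 can be sketched as follows. This is a simplified illustration of the criterion in [97]: we scan an equally spaced threshold grid instead of solving for the optimal threshold, and the array names (X, y) are assumptions.

```python
import numpy as np

def dft_loss(feature, labels, n_bins=32):
    """Weighted-entropy loss of one 1D feature (smaller = more discriminant).

    feature: (M,) feature values of all training samples.
    labels:  (M,) non-negative integer class labels.
    """
    def entropy(y):
        if len(y) == 0:
            return 0.0
        p = np.bincount(y) / len(y)
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    thresholds = np.linspace(feature.min(), feature.max(), n_bins + 1)[1:-1]
    best = np.inf
    for t in thresholds:
        left, right = labels[feature <= t], labels[feature > t]
        w_left, w_right = len(left) / len(labels), len(right) / len(labels)
        loss = w_left * entropy(left) + w_right * entropy(right)
        best = min(best, loss)
    return best

# rank all feature dimensions independently and keep the most discriminant ones
# losses = np.array([dft_loss(X[:, j], y) for j in range(X.shape[1])])
# selected = np.argsort(losses)[:2700]
```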
ComparisonwithPointHop− familyMethods. Table 4.4 compares the performance of S3I-PointHop, PointHop [107], PointHop++ [106] and R-PointHop [34]. Clearly, S3I-PointHop outperforms the three benchmarking methods by a huge margin. Although R-PointHop was proposed for point cloud registration and not classification, we include it here due to its rotation invariant feature characteristics. Similar to the global aggregation in PointHop and PointHop++, we aggregate the point features of R-PointHop and train a Least Squares classifier. We also report the classification accuracy with only one hop for these methods. Both PointHop and PointHop++ perform poor since their features are not invariant to rotations. Especially, for thez/SO(3) case, there is an imbalance in the train and test sets, the accuracy is worse. R-PointHop only considers local octant features with respect to a local reference frame. Although they are invariant to rotations, they are not optimal for classification. ComparisonwithDeepLearningNetworks. We compare the performance of the pro- posed S3I-PointHop method with 4 deep-learning-based point cloud classification networks in 93 Table 4.4: Classification accuracy comparison of PointHop-family methods. Method # hops z/z z/SO(3) SO(3)/SO(3) PointHop [107] 1 70.50 21.35 45.70 4 75.12 22.85 50.48 PointHop++ [106] 1 9.11 7.90 9.09 4 82.49 20.62 57.61 R-PointHop [34] 1 53.44 53.42 53.44 4 64.87 64.86 64.86 S3I-PointHop 1 83.10 83.10 83.10 Table 4.5. They are PointNet [66], PointNet++ [67], PointCNN [51] and Dynamic Graph CNN (DGCNN) [91]. Since these methods were originally developed for aligned point clouds, we re- train them with rotated point clouds and report the corresponding results. We see from the table that S3I-PointHop outperforms these benchmarking methods significantly. These methods offer reasonable accuracy when rotations are restricted about the azimuthal (z) axis. However, they are worse when rotations are applied about all three axes. Table 4.5: Comparison with Deep Learning Networks. Method z/z z/SO(3) SO(3)/SO(3) PointNet [66] 70.50 21.35 45.70 PointNet++ [67] 75.12 22.85 50.48 PointCNN [51] 82.11 24.89 51.66 DGCNN [91] 82.49 20.62 57.61 S3I-PointHop 83.10 83.10 83.10 AblationStudy. We consider the contributions of different elements in S3I-PointHop. To do so, we conduct an ablation study and report the results in Table 4.6. From the first three rows, it is evident that the global octant features are most important, and their removal results in the highest drop in accuracy. The results also reinforce the fact that locally oriented features 94 such as those in R-PointHop are not optimal for classification. In rows 4 and 5, we compare the proposed spatial aggregation scheme (termed as local aggregation) with global pooling as done in PointHop. The accuracy sharply drops by 12% when only the global aggregation is used. Clearly, global aggregation is not appropriate in S3I-PointHop. Finally, we show in the last row that the accuracy drops to 78.56% without DFT. The is because, when the feature dimension is too high, the classifier can overfit easily without DFT. Table 4.6: Ablation Study Feature Aggregation DFT SO(3)/SO(3) Geometric Covariance Octant Local Global ✓ ✓ ✓ ✓ 82.49 ✓ ✓ ✓ ✓ 82.45 ✓ ✓ ✓ ✓ 80.75 ✓ ✓ ✓ ✓ ✓ 83.10 ✓ ✓ ✓ ✓ ✓ 71.02 ✓ ✓ ✓ ✓ 78.56 4.3.3 Discussion One advantage of S3I-PointHop is that its rotation invariant features allow it to handle point cloud data captured from different orientations. 
To further support this claim, we retrain PointHop with PCA coarse alignment as a pre-processing step during both training and testing. The test accuracy is 78.16% and 74.10% with four hops and one hop, respectively. This reinforces that PCA alignment alone does not account for the performance gain of S3I-PointHop. While efforts to learn rotation invariant features were already made in R-PointHop, we see that its lack of global features degrades its classification performance. On the other hand, appending the same global feature to R-PointHop does not help in the registration problem.

An interesting aspect of S3I-PointHop is its use of a single hop (rather than four hops as in PointHop). It is generally perceived that deeper networks perform better than shallower counterparts. However, with multiple spatial aggregations on top of a single hop, S3I-PointHop can still achieve good performance. This leads to the benefit of reduced training time and model size as explained below.

In any point cloud processing method, one of the most costly operations is the nearest neighbor search. To search the $k$ nearest neighbors of each of $N$ points, an efficient algorithm takes $O(k \log N)$ time per point. PointHop uses the nearest neighbor search in four hops and in three intermediate farthest point downsampling operations. In contrast, the nearest neighbor search is conducted only once for each point in S3I-PointHop. Another costly operation is the PCA in the Saab transform. It is performed only once in S3I-PointHop. Its model size is 900 kB, where only the one-hop Saab filters are stored.

4.4 Conclusion

In this chapter, we discussed the PCRP method for point cloud pose estimation and the S3I-PointHop method for classification of point cloud objects with arbitrary pose. These methods extend the scope of the PointHop and PointHop++ methods, which work well only with pre-aligned point clouds. PCRP determines the object pose in two steps. First, a similar aligned object is retrieved from the gallery set. Then, the input object is registered with the retrieved object. S3I-PointHop extracts local and global point neighborhood information using an ensemble of geometric, covariance, and octant features. Only a single hop is adopted in S3I-PointHop, followed by conical and spherical aggregations of point features from multiple spatial regions. PCRP assumes the input object class to be known in advance. To address this, the S3I-PointHop method can be used prior to retrieval to determine the class, following which the pose can be estimated.

Chapter 5: Point Cloud Odometry and Scene Flow Estimation

5.1 Introduction

Dynamic 3D scene understanding based on captured 3D point cloud data is a critical enabling technology in 3D vision systems. Odometry is an object localization technique that estimates the position change of an object by sensing how its surrounding environment changes over time. It finds a range of applications such as the navigation of mobile robots and autonomous vehicles. Odometry is also responsible for the localization task in a SLAM (Simultaneous Localization And Mapping) system. Established algorithms such as LOAM (LiDAR Odometry and Mapping) [101] solve the SLAM problem using two systems running in parallel: one is responsible for odometry while the other takes care of mapping. When only visual information is exploited, it is called visual odometry. Several visual odometry solutions that use monocular, monochrome, and stereo vision have been proposed and successfully deployed.
With the growing popularity of 3D point clouds, point cloud scans obtained by range sensors such as LiDAR have recently been used in odometry, which is known as point cloud odometry (PCO). In this work, we focus on the PCO problem by proposing a lightweight PCO solution called the green point cloud odometry (GreenPCO) method. GreenPCO is an unsupervised learning method that predicts object motion by matching features of two consecutive point cloud scans. It is called green since it has a smaller model size and significantly less training time as compared with other learning-based PCO methods.

It is worthwhile to mention that multi-modal data from different sensors such as motion sensors (e.g., wheel encoders), inertial measurement units (IMU), and the Global Positioning System (GPS) are often used jointly to boost localization accuracy since different sensors provide complementary information. However, the accuracy of multi-sensor odometry is still built upon that of each individual sensor. Generally, since techniques for improving PCO accuracy and for performance enhancement via multi-modal sensor fusion are of a different nature, they can be treated separately.

There is a recent trend toward designing deep neural networks for point cloud odometry. They replace the matching of traditional handcrafted features with end-to-end optimized network models. On one hand, deep learning is promising in handling several long-standing scan matching problems, e.g., matching in the presence of noisy data and outliers, matching in featureless environments, etc. On the other hand, the gain comes at the cost of collecting large datasets and training large network models. Different datasets are needed for different applications. For example, datasets for indoor robot navigation and self-driving vehicles have to be collected separately. Furthermore, these methods are based on supervised learning that demands the ground truth transformation parameters for network training, which contradicts classical methods that do not need any ground truth and are purely based on local geometric properties of point clouds.

The registration problem is to find the rigid transformation of a point cloud set that is viewed in different translated and rotated coordinates. The point cloud registration technique can, in principle, be extended to point cloud odometry. Yet, to tailor it to the odometry task, several modifications to R-PointHop are needed. To this end, we propose a new method, called green point cloud odometry (GreenPCO) [33], targeting the autonomous driving application. In the training stage, a small number of point cloud scans from the training dataset are used to learn the model parameters of GreenPCO in a feedforward one-pass manner. In the inference stage, GreenPCO finds the vehicle trajectory online by incrementally predicting the motion between two consecutive point cloud scans.

3D scene flow estimation aims at finding the point-wise 3D displacement between consecutive point cloud scans. With the increasing availability of point cloud data, especially those acquired via LiDAR scanners, 3D scene flow estimation directly from point clouds is an active research topic nowadays. It finds rich applications in 3D perception tasks such as semantic segmentation, action recognition, and inter-prediction in compressing sequences of LiDAR scans.
Today's solutions to 3D scene flow estimation mostly rely on supervised or self-supervised deep neural networks (DNNs) that learn to predict the point-wise motion field from a pair of input point clouds via end-to-end optimization. One important component of these methods is learning a flow embedding by analyzing spatio-temporal correlations among regions of the two point clouds. After the successful demonstration of such an approach in FlowNet3D [52], there has been an increasing number of papers on this topic that exploit and combine other ideas such as point convolutions and attention mechanisms.

These DL-based methods work well in environments that meet the local scene rigidity assumption. They usually outperform classical point-correspondence-based methods. On the other hand, they have a large number of parameters and rely on large training datasets. For the 3D scene flow estimation problem, it is non-trivial to obtain dense point-level flow annotations. Thus, it is challenging to adopt the heavily supervised learning paradigm with real world data. Instead, methods are typically trained first on synthetic datasets with ground truth flow information and later fine-tuned on real world datasets. This makes the training process very complicated.

In this chapter, we propose a green and interpretable 3D scene flow estimation method for the autonomous driving scenario and name it "PointFlowHop". We decompose our solution into vehicle ego-motion and object motion modules. Scene points are classified as static or moving. Moving points are grouped into moving objects and a rigid flow model is established for each object. Furthermore, the flow in local regions is refined assuming local scene rigidity. The PointFlowHop method adopts the green learning (GL) paradigm [42]. It is built upon GreenPCO [33] and preceding foundational works such as R-PointHop [34], PointHop [107], and PointHop++ [106].

The task-agnostic nature of the feature learning process in the prior art enables scene flow estimation through seamless modification and extension. Furthermore, a large number of operations in PointFlowHop are not performed during training. The ego-motion and object-level motion are optimized in inference only. Similarly, the moving points are grouped into objects only during inference. This makes the training process much faster and the model size very small. The decomposition of 3D scene flow into object-wise rigid motion and/or ego-motion components is not entirely novel. However, our focus remains on developing a GL-based solution with improved overall performance, including accuracy, model size, and computational complexity.

5.2 GreenPCO Method

5.2.1 Methodology

The GreenPCO method takes two consecutive point cloud scans at times $t$ and $t+1$ as input and predicts the 6-DOF (degree of freedom) rigid transformation parameters as output. The transformation can be written in the form $(R, t)$, where $R$ and $t$ denote the rotation matrix and the translation vector, respectively. Let $x_i$ and $y_i$ be corresponding points in the point cloud scans at $t$ and $t+1$, respectively. The objective is to find $R$ and $t$ so as to minimize the mean squared error
\[
E(R, t) = \frac{1}{N} \sum_{i=1}^{N} \| R \cdot x_i + t - y_i \|^2 \tag{5.1}
\]
between matching point pairs $(x_i, y_i)$, $i = 1, \cdots, N$. GreenPCO is an unsupervised learning method since we do not need to create different rotational and translational sequences for model training. All model parameters can be learned from raw training sequences.
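For concreteness, the objective in Eq. (5.1) can be evaluated for any candidate $(R, t)$ with a few lines of NumPy; this is only a sketch with illustrative array names, while the closed-form minimizer is given later in Eqs. (5.4)-(5.7).

```python
import numpy as np

def registration_mse(R, t, X, Y):
    """Mean squared error of Eq. (5.1) for matched N x 3 arrays X and Y."""
    residual = X @ R.T + t - Y          # R * x_i + t - y_i for every pair
    return np.mean(np.sum(residual ** 2, axis=1))

# Example: the identity transformation scores the raw misalignment.
X = np.random.rand(100, 3)
Y = X + np.array([0.5, 0.0, 0.0])       # Y is X shifted by 0.5 m along x
print(registration_mse(np.eye(3), np.zeros(3), X, Y))   # ~0.25
```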
The system diagram of GreenPCO is shown in Fig. 5.1. It consists of four steps: 1) geometry-aware point sampling, 2) view-based partitioning, 3) feature extraction and matching, and 4) motion estimation. They are elaborated below.

Figure 5.1: An overview of the GreenPCO method.

Geometry-aware point sampling. An outdoor point cloud scan captured by a LiDAR sensor typically consists of hundreds of thousands of points. It is desired, and sufficient, to select a subset of points to build the correspondence. Popular sampling methods include farthest point sampling and random sampling. Since farthest point sampling has quadratic time complexity, it is not suitable for large-scale point clouds. In contrast, random sampling is extremely fast and can be performed in constant time. Thus, we adopt random sampling in GreenPCO as the baseline. However, many points in outdoor environments are featureless and non-discriminant, and they should be excluded from random selection. To this end, we propose geometry-aware point sampling, which selects points that are spatially spread out and have nontrivial local characteristics.

Figure 5.2: Comparison between random sampling (left) and geometry-aware sampling (right), where sampled points are marked in blue and red, respectively.

The local neighborhood of a point, $p$, defines its local property. We collect the $k$ nearest neighbors of $p$ in a local region, find the covariance matrix of their 3D coordinates, and conduct eigen analysis. This defines the local PCA computation. The eigenvalues of the local PCA describe the local characteristics of a point well. For example, local features such as linearity, planarity, sphericity, and entropy can be expressed as functions of the three eigenvalues [27]. We are interested in discriminant points (e.g., those from objects like mopeds, cars, poles, trunks, etc.) rather than points from planar surfaces (e.g., buildings, roads, walls, etc.). To achieve this goal, we study the distributions of the eigenvalues and set appropriate thresholds so as to discard non-discriminant points in the pre-processing step. In our implementation, we set thresholds on linearity, planarity, and eigen entropy. They are computed as
\[
\text{Linearity} = \frac{\lambda_1 - \lambda_2}{\lambda_1}, \quad
\text{Planarity} = \frac{\lambda_2 - \lambda_3}{\lambda_1}, \quad
\text{Eigen entropy} = \sum_{i=1}^{3} \lambda_i \cdot \log\!\left(\frac{1}{\lambda_i}\right), \tag{5.2}
\]
where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the three eigenvalues of the local PCA and $\lambda_1 \geq \lambda_2 \geq \lambda_3$. After this step, we are left with around 4,000 to 5,000 points, depending on the scene. Afterwards, we use random sampling to reduce the point number to 2,048 for further processing in the next step. Points selected by geometry-aware point sampling and random sampling are compared in Fig. 5.2.

View-based partitioning. The sampled points obtained in the previous step are divided into disjoint sets based on their 3D coordinates. First, the 3D Cartesian coordinates are converted to spherical coordinates. Following the convention of the LiDAR coordinate system setup in the KITTI dataset, the positive $Z$ direction is along the direction of motion of the vehicle, while the positive $Y$ direction is vertical. We are interested in the azimuthal angle $\phi$, which is given by
\[
\phi = \arctan(z/x), \tag{5.3}
\]
where $z$ and $x$ are the point coordinates along the $Z$ and $X$ axes, respectively. The $\phi$ coordinate of every point gives the position of the point with respect to the vehicle. Then, we can define four views based on $\phi$ as follows.

• Front view: the set of points in front of the vehicle with an azimuthal angle $45^\circ \leq \phi \leq 135^\circ$.
• Rear view: the set of points behind the vehicle with an azimuthal angle $225^\circ \leq \phi \leq 315^\circ$.

• Right view: the set of points to the right of the vehicle with an azimuthal angle $-45^\circ < \phi < 45^\circ$.

• Left view: the set of points to the left of the vehicle with an azimuthal angle $135^\circ < \phi < 225^\circ$.

Since the motion between two consecutive scans is incremental, we can focus on point matching within the same view. The proposed partitioning helps in scenarios where similar instances (e.g., persons, or equally spaced poles along the road) appear in different views of the two point cloud scans. We also tried partitioning the points into six disjoint views but observed no advantage. On the contrary, the probability that correctly matching points end up in different views increases. Thus, we stick to the choice of four views. An example of view-based partitioning is shown in Fig. 5.3.

Figure 5.3: View-based partitioning using the azimuthal angle, where the front, rear, left and right views are highlighted in blue, green, red, and yellow, respectively.

Feature extraction and point matching. The Saab features of all points sampled in Step 1 are extracted using the PointHop++ architecture as described in [34, 106]. Briefly speaking, the $k$ nearest neighbors of each point are retrieved and the 3D coordinate space is partitioned into 8 octants. The mean of each octant is calculated and the eight means are concatenated to form a vector. A global PCA is conducted for dimension reduction in the first hop. Next, we perform the channel-wise Saab transform in the second hop to increase the neighborhood size. Only two hops are used, and the spatial pooling operation between hops is removed so that the total number of points remains the same. Following [27], we append the 3D coordinates to the derived channel-wise Saab features to yield the final feature vector of each point.

For point matching between two point cloud scans, we conduct the nearest neighbor search in the feature space in each of the four view groups independently. As a result, we have four sets of point correspondences, one from each view. All corresponding pairs are then combined and used for estimating the translation and rotation parameters. A subset of matched points between two consecutive scans is shown in Fig. 5.4.

Figure 5.4: Sampled points at time instances $t$ and $t+1$ are marked in blue and red, respectively, while point correspondences between two consecutive scans are shown in green.

Motion estimation and update. The matched points found in Step 3 are used to estimate the motion between the two consecutive time instances. Suppose that $(x_i, y_i)$, $i = 1, \cdots, N$, are the pairs of corresponding points. The 6-DOF motion model can be determined as follows. First, the mean coordinates of the corresponding points are
\[
\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i, \quad \bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i, \tag{5.4}
\]
and the covariance matrix of the corresponding pairs of points can be computed as
\[
K(X, Y) = \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})^T. \tag{5.5}
\]
Next, the $3 \times 3$ covariance matrix can be decomposed via SVD:
\[
K(X, Y) = U S V^T, \tag{5.6}
\]
where $U$ and $V$ are orthogonal matrices of left and right singular vectors, respectively, and $S$ is the diagonal matrix of singular values. Then, the orientation and translation motion model is given by the rotation matrix, $R$, and the translation vector, $t$, in the form
\[
R = V U^T, \quad t = -R \bar{x} + \bar{y}. \tag{5.7}
\]
The vehicle trajectory is updated with the current predicted pose. The pose at time $t$, $T_t$, with respect to the pose at time $t-1$, $T_{t-1}$, is given by
\[
T_t = T_{t-1} \begin{bmatrix} R & t \\ \mathbf{0}^T & 1 \end{bmatrix}.
\]
The pose with respect to the initial pose $T_0$ can be found accordingly. The process repeats by considering the next two point cloud scans. Furthermore, RANSAC [22] can be used to improve the robustness of point matching.
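Equations (5.4)-(5.7) together with the pose update amount to the classical SVD-based (Kabsch/Procrustes) solution. A minimal sketch is given below with illustrative variable names; the reflection check (flipping the sign when det(VU^T) < 0) is a standard safeguard that is implied rather than spelled out in the text.

```python
import numpy as np

def estimate_rigid_motion(X, Y):
    """Closed-form (R, t) minimizing Eq. (5.1) for matched N x 3 arrays X, Y,
    following Eqs. (5.4)-(5.7)."""
    x_bar, y_bar = X.mean(axis=0), Y.mean(axis=0)              # Eq. (5.4)
    K = (X - x_bar).T @ (Y - y_bar)                            # Eq. (5.5), 3 x 3
    U, S, Vt = np.linalg.svd(K)                                # Eq. (5.6)
    V = Vt.T
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(V @ U.T))])   # avoid reflections
    R = V @ D @ U.T                                            # Eq. (5.7)
    t = y_bar - R @ x_bar
    return R, t

def update_pose(T_prev, R, t):
    """Compose the incremental motion with the previous pose, T_t = T_{t-1}[R t; 0 1]."""
    T_rel = np.eye(4)
    T_rel[:3, :3], T_rel[:3, 3] = R, t
    return T_prev @ T_rel

# Toy check: recover a known motion from noiseless correspondences.
rng = np.random.default_rng(0)
X = rng.random((500, 3))
R_true = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]], dtype=float)  # 90 deg about z
t_true = np.array([1.0, 2.0, 0.5])
Y = X @ R_true.T + t_true
R_est, t_est = estimate_rigid_motion(X, Y)
print(np.allclose(R_est, R_true), np.allclose(t_est, t_true))       # True True
```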
5.2.2 Experimental Results

Figure 5.5: Evaluation results on sequences 4 (left) and 10 (right) of the KITTI dataset.

Experiments are conducted on the KITTI Visual Odometry/SLAM benchmark [23] for performance evaluation. The dataset consists of 22 sequences in total, out of which ground truth information is available for the first 11 sequences. Each sequence is a vehicle trajectory that consists of 250 to 2,000 time steps. The data at each time step contains the 3D point cloud scan captured by the LiDAR scanner as well as stereo and monocular images. Following point cloud odometry benchmarking methods such as DeepPCO [88], we use only the point cloud data in the experiments, where nine of the sequences with available ground truth are used for training and the remaining two (namely, sequences 4 and 10) for testing. It is observed that there is strong correlation between scans of the training data. For this reason, we uniformly sample 50 point clouds from the nine training sequences to learn the channel-wise Saab transform, which is needed to compute the Saab features. Furthermore, the geometry-aware sampling process is used to select discriminant points. The thresholds for the eigen features during sampling are set to 0.7 for linearity and planarity and 0.8 for eigen entropy. Points with linearity and planarity below the 0.7 threshold and with eigen entropy higher than the 0.8 threshold are retained. The neighborhood size for finding the eigen features is 48 points. Two hops are used.

Table 5.1: Performance comparison between GreenPCO and five supervised DL methods on two test sequences in the KITTI dataset.
Method | Seq. 4 Avg. translation RMSE | Seq. 4 Avg. rotation RMSE | Seq. 10 Avg. translation RMSE | Seq. 10 Avg. rotation RMSE
Two-stream [62] | 0.0554 | 0.0830 | 0.0870 | 0.1592
DeepVO [87] | 0.2157 | 0.0709 | 0.2153 | 0.3311
PointNet [66] | 0.0946 | 0.0442 | 0.1381 | 0.1360
PointGrid [44] | 0.0550 | 0.0690 | 0.0842 | 0.1523
DeepPCO [88] | 0.0263 | 0.0305 | 0.0247 | 0.0659
GreenPCO (Ours) | 0.0201 | 0.0212 | 0.0209 | 0.0628

Table 5.2: Ablation study on the KITTI dataset.
Sampling method | # sampled points | View-based partitioning | Eigen features | Translation error (%) | Rotation error (deg/m)
Random | 1024 | ✓ | ✓ | 32.42 | 0.1120
Random | 2048 | ✓ | ✓ | 32.20 | 0.1179
Random | 4096 | ✓ | ✓ | 31.47 | 0.1004
Farthest point | 1024 | ✓ | ✓ | 28.79 | 0.1065
Farthest point | 2048 | ✓ | ✓ | 28.11 | 0.0971
Farthest point | 4096 | ✓ | ✓ | 28.13 | 0.0941
Geometry-aware | 1024 | ✓ | ✓ | 3.71 | 0.0345
Geometry-aware | 2048 | ✓ | ✓ | 3.54 | 0.0271
Geometry-aware | 4096 | ✓ | ✓ | 3.54 | 0.0308
Geometry-aware | 2048 | | | 10.41 | 0.0415
Geometry-aware | 2048 | | ✓ | 4.89 | 0.0289
Geometry-aware | 2048 | ✓ | | 9.81 | 0.0401

The evaluation results for test sequences 4 and 10 are shown in Fig. 5.5. We see that GreenPCO is very effective and its predicted paths almost overlap with the ground truth ones. Based on the KITTI evaluation metric, the average sequence translation RMSE is 3.54% while the average sequence rotation error is 0.0271 deg/m.

We compare GreenPCO with five supervised DL methods in Table 5.1. They are Two-stream [62], DeepVO [87], PointNet [66], PointGrid [44], and DeepPCO [88]. The evaluation metrics are the same as those in DeepPCO [88], where relative translation and rotation errors are considered. Although GreenPCO is an unsupervised learning method, it outperforms all supervised DL methods in both the average rotation RMSE and the average translation RMSE.
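To make the geometry-aware sampling step concrete, the sketch below computes the eigen features of Eq. (5.2) over k = 48 neighbors and applies the thresholds stated above (0.7 for linearity/planarity, 0.8 for eigen entropy). The eigenvalues are normalized to sum to one before the entropy, which is the usual convention in [27] and an assumption here; scipy's KD-tree is used for the neighbor search and the function name is illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

def geometry_aware_sample(points, k=48, lin_planar_max=0.7,
                          entropy_min=0.8, n_keep=2048, seed=0):
    """Keep discriminant points (low linearity/planarity, high eigen entropy),
    then randomly sample n_keep of them, as in GreenPCO's pre-processing."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)                  # k nearest neighbors per point
    keep = []
    for i, nbrs in enumerate(idx):
        cov = np.cov(points[nbrs].T)                  # 3 x 3 local covariance
        lam = np.sort(np.linalg.eigvalsh(cov))[::-1]  # lambda_1 >= lambda_2 >= lambda_3
        lam = np.maximum(lam, 1e-12)
        lam_n = lam / lam.sum()                       # normalize before the entropy
        linearity = (lam[0] - lam[1]) / lam[0]        # Eq. (5.2)
        planarity = (lam[1] - lam[2]) / lam[0]
        entropy = -np.sum(lam_n * np.log(lam_n))
        if linearity < lin_planar_max and planarity < lin_planar_max \
                and entropy > entropy_min:
            keep.append(i)
    keep = np.asarray(keep)
    rng = np.random.default_rng(seed)
    if keep.size > n_keep:                            # random sampling down to 2,048 points
        keep = rng.choice(keep, n_keep, replace=False)
    return points[keep]
```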
We conduct an ablation study on GreenPCO to see the contribution of each component and summarize the results in Table 5.2. First, we compare the geometry-aware, random, and farthest point sampling methods. For each sampling method, we set the number of sampled points to 1024, 2048, and 4096. Geometry-aware sampling consistently outperforms the other two. The errors of random sampling are the worst, while the errors of farthest point sampling are also large. There is no advantage in using 4096 points over 2048 points for any sampling method. Thus, we set the input point number to 2048. Next, we justify the inclusion of view-based partitioning and eigen features. Errors are consistently lower when view-based partitioning is adopted. This shows that view-based partitioning makes point matching more robust. Furthermore, it reduces the search space for time efficiency. Finally, when eigen features are omitted from the feature construction process, the performance drops sharply.

Since model-free methods such as LOAM [101] and V-LOAM [102] already offer state-of-the-art results, spending a large amount of time on model training (as deep learning does) is not an optimal choice for this problem. GreenPCO is more favorable in this sense. Its training time is only 10 minutes on an Intel Xeon CPU. The rapid training of GreenPCO is attributed to two reasons. First, only the PointHop++ model requires training, where sampled points are used to learn the filter parameters. Other steps in the pipeline, such as view-based partitioning, point matching, and motion estimation, are only required at test time. Second, we use an extremely small training set since there is high correlation between consecutive point cloud scans in different sequences due to the incremental vehicular motion. Hence, skipping several scans and selecting only distant scenes is sufficient to capture the diversity of scenes within the training data. It is observed that the use of 50 point cloud scans offers performance similar to that of using the entire training dataset, which comprises 8,500 scans. This corresponds to approximately 0.6% of the training data.

In our implementation, the 50 point cloud scans are uniformly sampled from the entire training dataset so as to represent diverse scenes. Table 5.3 summarizes the test performance for different percentages of training data. As shown in the table, the translation and rotation errors are unaffected, and there is no advantage in using the entire training data. There is a steady decrease in the training time as the size of the training data is reduced. The model size is 75 kB, which is independent of the amount of training data used. The small model size and short training time make GreenPCO a green solution for point cloud odometry.

Table 5.3: The effect of different amounts of training data.
Training data used (%) | Training time (hours) | Model size | Translation error (%) | Rotation error (deg/m)
100 | 1.2 | 75 kB | 3.54 | 0.0268
50 | 0.8 | 75 kB | 3.53 | 0.0272
25 | 0.6 | 75 kB | 3.54 | 0.0271
10 | 0.5 | 75 kB | 3.54 | 0.0271
0.6 | 0.17 | 75 kB | 3.54 | 0.0271
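The view-based partitioning whose contribution was examined in Table 5.2 can be sketched as below. The azimuth of Eq. (5.3) is computed with a full-range arctangent (arctan2) and wrapped to [0°, 360°), which is an implementation assumption; the KITTI convention of z forward and x to the right is assumed as well.

```python
import numpy as np

def partition_views(points):
    """Assign each point to front/rear/right/left based on the azimuth of Eq. (5.3).
    points is an N x 3 array in the LiDAR frame (x right, y up, z forward)."""
    phi = np.degrees(np.arctan2(points[:, 2], points[:, 0])) % 360.0
    views = {
        "front": (phi >= 45) & (phi <= 135),
        "rear":  (phi >= 225) & (phi <= 315),
        "right": (phi < 45) | (phi > 315),
        "left":  (phi > 135) & (phi < 225),
    }
    return {name: points[mask] for name, mask in views.items()}

# Point matching is then run independently inside each of the four views.
parts = partition_views(np.random.randn(2048, 3))
print({k: v.shape[0] for k, v in parts.items()})
```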
5.2.3 Discussion

GreenPCO takes consecutive point cloud scans captured by the LiDAR sensor on a vehicle and estimates the 6-DOF motion of the vehicle incrementally. It first selects a small set of discriminant points using geometry-aware sampling. Then, the sampled points are divided into four disjoint sets based on the azimuthal angle. The point features are extracted using the PointHop++ architecture, and matched points are found by searching in the feature space. Finally, the corresponding points are used to estimate the rotation and translation parameters between the two positions. The same process is repeated over time. GreenPCO gives accurate vehicle trajectories when evaluated on the LiDAR scans from the KITTI dataset. It outperforms all supervised deep learning benchmarking methods with much less training data.

Figure 5.6: An overview of the PointFlowHop method, which consists of six modules: 1) ego-motion compensation, 2) scene classification, 3) object association, 4) object refinement, 5) object motion estimation, and 6) scene flow initialization and refinement.

5.3 PointFlowHop Method

5.3.1 Methodology

The system diagram of the proposed PointFlowHop method is shown in Fig. 5.6. It takes two consecutive point clouds $X_t \in \mathbb{R}^{n_t \times 3}$ and $X_{t+1} \in \mathbb{R}^{n_{t+1} \times 3}$ as the input and calculates the point-wise flow $\bar{f}_t \in \mathbb{R}^{n_t \times 3}$ for the points in $X_t$.

PointFlowHop decomposes the scene flow estimation problem into two subproblems: 1) determining the vehicle's ego-motion ($T_{ego}$) and 2) estimating the motion of each individual object (denoted by $T_{obj_i}$ for object $i$). It first proceeds by determining and compensating the ego-motion and classifying scene points as moving or static in modules 1 and 2, respectively. Next, moving points are clustered and associated as moving objects in modules 3 and 4, and the motion of each object is estimated in module 5. Finally, the flow vectors of static and moving points are jointly refined. These steps are detailed below.

Module 1: Ego-motion Compensation. Say the coordinates of the $i$-th point in $X_t$ are given by $(x^i_t, y^i_t, z^i_t)$. Suppose this point is observed at $(x^i_{t+1}, y^i_{t+1}, z^i_{t+1})$ in $X_{t+1}$. These point coordinates are expressed in the respective LiDAR coordinate systems centered at the vehicle position at times $t$ and $t+1$. Since the two coordinate systems may not overlap due to the vehicle's motion, the scene flow vector, $\bar{f}^i_t$, of the $i$-th point cannot simply be calculated as a vector difference. Hence, we begin by aligning the two coordinate systems or, in other words, we compensate for the vehicle motion (called ego-motion). The ego-motion compensation module in PointFlowHop is built upon the GreenPCO method. Ego-motion estimation in PointFlowHop involves a single iteration of GreenPCO, whereby the vehicle's motion from time $t$ to $t+1$ is estimated. Then, the ego-motion can be represented by the 3D transformation, $T_{ego}$, which consists of a 3D rotation and a 3D translation. Afterward, we use $T_{ego}$ to warp $X_t$ to $\tilde{X}_t$, bringing it into the same coordinate system as that of $X_{t+1}$. Then, the flow vector can be computed by
\[
\bar{f}^i_t = \left( x^i_{t+1} - \tilde{x}^i_t, \; y^i_{t+1} - \tilde{y}^i_t, \; z^i_{t+1} - \tilde{z}^i_t \right), \tag{5.8}
\]
where $(\tilde{x}^i_t, \tilde{y}^i_t, \tilde{z}^i_t)$ is the warped coordinate of the $i$-th point.
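A minimal sketch of Module 1 is shown below: $X_t$ is warped with the estimated $T_{ego}$ (taken here as a 4x4 homogeneous matrix) and the residual of Eq. (5.8) is formed against the nearest points of $X_{t+1}$. The nearest-neighbor pairing and the function names are illustrative assumptions rather than the exact PointFlowHop implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def warp(points, T):
    """Apply a 4x4 homogeneous rigid transformation to an N x 3 array."""
    return points @ T[:3, :3].T + T[:3, 3]

def ego_compensated_flow(X_t, X_t1, T_ego):
    """Warp X_t into the frame of X_{t+1} and form the residual of Eq. (5.8)
    against the nearest point of X_{t+1} (a simple pairing for illustration)."""
    X_t_warped = warp(X_t, T_ego)
    _, nn = cKDTree(X_t1).query(X_t_warped, k=1)
    return X_t1[nn] - X_t_warped          # per-point residual after ego-motion

# For purely static scenes this residual is close to zero,
# so only the moving points carry a non-trivial flow.
```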
Module 2: Scene Classification. After compensating for ego-motion, the resulting $\tilde{X}_t$ and $X_{t+1}$ are in the same coordinate system (i.e., that of $X_{t+1}$). Next, we coarsely classify the scene points in $\tilde{X}_t$ and $X_{t+1}$ into two classes, moving and static. Generally speaking, moving points may belong to objects such as cars, pedestrians, mopeds, etc., while static points correspond to objects like buildings, poles, etc. The scene flow of moving points is analyzed later, while static points can be assigned a zero flow (or a flow equal to the ego-motion, depending on the convention of the coordinate systems used). This means that the later stages of PointFlowHop process fewer points.

For the scene classifier, we define a set of shape and motion features that are useful in distinguishing static and moving points. These features are explained below.

• Shape features. We reuse the eigen features [27] calculated in the ego-motion estimation step. They summarize the distribution of neighborhood points using covariance analysis. The analysis provides a 4-dimensional feature vector comprising linearity, planarity, eigen sum, and eigen entropy.

• Motion feature. We first voxelize $\tilde{X}_t$ and $X_{t+1}$ with a voxel size of 2 meters. Then, the motion feature for each point in $\tilde{X}_t$ is the distance to the nearest voxel center in $X_{t+1}$, and vice versa for each point in $X_{t+1}$.

The 5-dimensional (shape and motion) feature vector is fed to a binary XGBoost classifier. For training, we use the point-wise class labels provided by the SemanticKITTI [5] dataset. We observe that the 5D shape/motion feature vector is sufficient for decent classification. The classification accuracy on the SemanticKITTI dataset is 98.82%. Furthermore, some of the misclassified moving points are reclassified in the subsequent object refinement step.

Module 3: Object Association. We simplify the problem of motion analysis on moving points by grouping moving points into moving objects. To discover objects from moving points, we use the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [21] algorithm. Simply speaking, DBSCAN iteratively clusters points based on the minimum distance ($eps$) and the minimum points ($minPts$) parameters. Parameter $eps$ gives the maximum Euclidean distance between points considered as neighbors, while $minPts$ determines the minimum number of points needed to form a cluster. Some examples of the objects discovered by PointFlowHop are colored in Fig. 5.7. Points belonging to distinct objects may occasionally get clustered together. We put the points marked as "outliers" by DBSCAN into the set of static points. The DBSCAN algorithm is run on $\tilde{X}_t$ and $X_{t+1}$ separately. Later, we use cluster centroids to associate objects between $\tilde{X}_t$ and $X_{t+1}$. That is, for each centroid in $\tilde{X}_t$, we locate its nearest centroid in $X_{t+1}$.

Figure 5.7: Objects clustered using the DBSCAN algorithm are shown in different colors.

Module 4: Object Refinement. Next, we perform an additional refinement step to recover some of the points misclassified during scene classification as well as potential inlier points missed during object association. This is done using the nearest neighbor rule within a defined radius neighborhood. For each point classified as a moving point, we re-classify static points lying within its neighborhood as moving points. The object refinement operation is conducted on both $\tilde{X}_t$ and $X_{t+1}$. The refinement step is essential for two reasons. First, an imbalanced class distribution between static and moving points usually leads the XGBoost classifier to favor the dominant class (the static points). As a result, the precision and recall for moving points can still be low in spite of the high classification accuracy. Second, in the clustering step, it is difficult to select values of $eps$ and $minPts$ that are robust in all scenarios for sparse LiDAR point clouds. This may lead to some points being marked as outliers by DBSCAN. Overall, the performance gains of our method reported in Sec. 5.3.2 result from the combination of all steps and not from any single step in particular.
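The object discovery and centroid-based association of Modules 3 and 4 can be sketched with scikit-learn's DBSCAN as follows. The eps/min_samples values and the function names are illustrative, not the values tuned in PointFlowHop.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from scipy.spatial import cKDTree

def discover_objects(moving_pts, eps=1.0, min_samples=10):
    """Group moving points into candidate objects; label -1 marks DBSCAN outliers,
    which are returned to the static set."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(moving_pts)
    objects = {l: moving_pts[labels == l] for l in set(labels) if l != -1}
    outliers = moving_pts[labels == -1]
    return objects, outliers

def associate_objects(objects_t, objects_t1):
    """Associate each object at time t with the object at t+1 whose centroid is nearest."""
    keys = list(objects_t1)
    tree = cKDTree(np.stack([objects_t1[k].mean(0) for k in keys]))
    pairs = {}
    for l, pts in objects_t.items():
        _, j = tree.query(pts.mean(0), k=1)
        pairs[l] = keys[j]
    return pairs
```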
Module 5: Object Motion Estimation. We determine the motion between each pair of associated objects in this step. For this, we follow an approach similar to that taken by R-PointHop [34]. The objective of R-PointHop is to register a source point cloud with a target point cloud. For object motion estimation in PointFlowHop, the features of the refined moving points from $\tilde{X}_t$ and $X_{t+1}$ are extracted using the trained PointHop++ model. We reuse the same model from the ego-motion estimation step here. While four hops with intermediate downsampling are used in R-PointHop, the PointHop++ model in PointFlowHop only involves two hops without downsampling to suit the LiDAR data. Since $\tilde{X}^{obj_i}_t$ and $X^{obj_i}_{t+1}$ are two sets of points belonging to object $i$, we find corresponding points between the two point clouds using the nearest neighbor search in the feature space. The correspondence set is further refined by selecting the top correspondences based on: 1) the minimum feature distance criterion and 2) the ratio test (the minimum ratio of the distances to the first and second best corresponding points). The refined correspondence set is then used to estimate the object motion as follows. First, the mean coordinates of the corresponding points in $\tilde{X}^{obj_i}_t$ and $X^{obj_i}_{t+1}$ are found by
\[
\bar{x}^{obj_i}_t = \frac{1}{N_{obj_i}} \sum_{j=1}^{N_{obj_i}} \tilde{x}^{obj_{ij}}_t, \quad
\bar{x}^{obj_i}_{t+1} = \frac{1}{N_{obj_i}} \sum_{j=1}^{N_{obj_i}} x^{obj_{ij}}_{t+1}. \tag{5.9}
\]
Then, the $3 \times 3$ covariance matrix is computed using the pairs of corresponding points as
\[
K(\tilde{X}^{obj_i}_t, X^{obj_i}_{t+1}) = \sum_{j=1}^{N_{obj_i}} \left( \tilde{x}^{obj_{ij}}_t - \bar{x}^{obj_i}_t \right) \left( x^{obj_{ij}}_{t+1} - \bar{x}^{obj_i}_{t+1} \right)^T. \tag{5.10}
\]
The singular value decomposition of $K$ gives the matrices $U$ and $V$, which are formed by the left and right singular vectors, respectively. Mathematically, we have
\[
K(\tilde{X}^{obj_i}_t, X^{obj_i}_{t+1}) = U S V^T. \tag{5.11}
\]
Following the orthogonal Procrustes formulation [74], the optimal motion of $\tilde{X}^{obj_i}_t$ can be expressed in the form of a rotation matrix $R_{obj_i}$ and a translation vector $t_{obj_i}$. They can be computed as
\[
R_{obj_i} = V U^T, \quad t_{obj_i} = -R_{obj_i} \bar{x}^{obj_i}_t + \bar{x}^{obj_i}_{t+1}. \tag{5.12}
\]
Since $(R_{obj_i}, t_{obj_i})$ form the motion model for object $i$, it is denoted as $T_{obj_i}$.

Actually, once we find the corresponding point $x^{obj_{ij}}_{t+1}$ of $\tilde{x}^{obj_{ij}}_t$, the flow vector could simply be set to $x^{obj_{ij}}_{t+1} - \tilde{x}^{obj_{ij}}_t$. However, this point-wise flow vector can be too noisy, and it is desired to use a flow model for the whole object rather than for each point. The object flow model found with the SVD after establishing correspondences is optimal in the mean square sense over all corresponding points and, hence, is more robust. It makes the reasonable assumption that a rigid transformation exists between the two observations of the object.

Module 6: Flow Initialization and Refinement. In the last module, we apply the object motion model $T_{obj_i}$ to $\tilde{X}^{obj_i}_t$ and align it with $X^{obj_i}_{t+1}$. Since the static points do not have any motion, they are not further transformed. We denote the newly transformed point cloud as $\tilde{X}'_t$. At this point, we have obtained an initial estimate of the scene flow for each point in $X_t$. For static points, the flow is given by the ego-motion transformation $T_{ego}$. For moving points, it is a composition of the ego-motion and the corresponding object's motion, $T_{ego} \cdot T_{obj_i}$. In this module, we refine the flow for all points in $\tilde{X}'_t$ using the Iterative Closest Point (ICP) [6] algorithm in small non-overlapping regions. In each region, the points of $\tilde{X}'_t$ falling within it are aligned with the corresponding points in $X_{t+1}$. The flow refinement step ensures a tighter alignment and is a common post-processing operation in several related tasks. Finally, the flow vectors for $X_t$ are calculated as the difference between the transformed and the initial coordinates. Exemplar pairs of input and scene-flow-compensated point clouds produced by PointFlowHop are shown in Fig. 5.8.

Figure 5.8: Flow estimation results using PointFlowHop: input point clouds (left) and warped output using flow vectors (right).
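The flow initialization of Module 6 can be sketched as below: every point first follows the ego-motion, and the points of object i are further moved by that object's rigid model before the flow is read off as a coordinate difference. All names are illustrative, and the subsequent region-wise ICP refinement is omitted from this sketch.

```python
import numpy as np

def transform(points, T):
    """Apply a 4x4 homogeneous transformation to an N x 3 array."""
    return points @ T[:3, :3].T + T[:3, 3]

def initialize_flow(X_t, static_mask, object_ids, T_ego, T_obj):
    """Initial scene flow for every point of X_t:
    static points follow the ego-motion only; the points of object i are further
    moved by that object's rigid model T_obj[i] (cf. Module 6)."""
    warped = transform(X_t, T_ego)              # ego-motion compensation first
    out = warped.copy()
    for i, T in T_obj.items():                  # then per-object rigid motion
        mask = (object_ids == i) & (~static_mask)
        out[mask] = transform(warped[mask], T)
    return out - X_t                            # flow = displaced - original coordinates

# `object_ids` holds the cluster label of each moving point (ignored for static ones),
# and `T_obj` maps an object label to its 4x4 motion estimated in Module 5.
```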
5.3.2 Experiments

In this section, we report experimental results on real world LiDAR point cloud datasets. We choose two datasets, stereoKITTI [57, 58] and Argoverse [8], since they represent challenging scenes in autonomous driving environments. StereoKITTI has 142 pairs of point clouds. The ground truth flow of each pair is derived from the 2D disparity maps and the optical flow information. There are 212 test samples for Argoverse, whose flow annotations were given in [63]. We use the per-point labels from the SemanticKITTI dataset [5] to train our scene classifier. Following a series of prior works, we measure the performance with the following metrics; a short computational sketch of these metrics follows the benchmarking tables below.

• 3D end point error (EPE3D). The mean Euclidean distance between the estimated and the ground truth flow.

• Strict accuracy (Acc3DS). The percentage of points for which EPE3D is less than 0.05 m or the relative error is less than 0.05.

• Relaxed accuracy (Acc3DR). The ratio of points for which EPE3D is less than 0.1 m or the relative error is less than 0.1.

• Percentage of outliers. The ratio of points for which EPE3D is greater than 0.3 m or the relative error is greater than 0.1. This is reported for the stereoKITTI dataset only.

• Mean angle error (MAE). The mean of the angle errors between the estimated and the ground truth flow over all points, expressed in radians. This is reported for the Argoverse dataset only.

Performance Benchmarking. The scene flow estimation results on stereoKITTI and Argoverse are reported in Table 5.4 and Table 5.5, respectively. For comparison, we show the performance of several representative methods proposed in the past few years. Overall, the EPE3D, Acc3DS, and Acc3DR values are significantly better on stereoKITTI than on Argoverse. This is because Argoverse is a more challenging dataset. Furthermore, PointFlowHop outperforms all benchmarking methods in almost all evaluation metrics on both datasets.

Table 5.4: Comparison of scene flow estimation results on the stereoKITTI dataset, where the best performance number is shown in boldface.
Method | EPE3D (m) ↓ | Acc3DS ↑ | Acc3DR ↑ | Outliers ↓
FlowNet3D [52] | 0.177 | 0.374 | 0.668 | 0.527
HPLFlowNet [26] | 0.117 | 0.478 | 0.778 | 0.410
PointPWC-Net [92] | 0.069 | 0.728 | 0.888 | 0.265
FLOT [64] | 0.056 | 0.755 | 0.908 | 0.242
HALFlow [86] | 0.062 | 0.765 | 0.903 | 0.249
Rigid3DSceneFlow [24] | 0.042 | 0.849 | 0.959 | 0.208
PointFlowHop (Ours) | 0.037 | 0.938 | 0.974 | 0.189

Table 5.5: Comparison of scene flow estimation results on the Argoverse dataset, where the best performance number is shown in boldface.
Method | EPE3D (m) ↓ | Acc3DS ↑ | Acc3DR ↑ | MAE (rad) ↓
FlowNet3D [52] | 0.455 | 0.01 | 0.06 | 0.736
PointPWC-Net [92] | 0.405 | 0.08 | 0.25 | 0.674
Just Go with the Flow [59] | 0.542 | 0.08 | 0.20 | 0.715
NICP [1] | 0.461 | 0.04 | 0.14 | 0.741
Graph Laplacian [63] | 0.257 | 0.25 | 0.48 | 0.467
Neural Prior [50] | 0.159 | 0.38 | 0.63 | 0.374
PointFlowHop (Ours) | 0.134 | 0.39 | 0.71 | 0.398
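The metrics listed above can be computed with a short routine such as the following. This is a minimal sketch: `flow_pred` and `flow_gt` are N x 3 arrays, the thresholds follow the definitions above, and the angle error is computed between the raw 3D vectors, which is one common convention.

```python
import numpy as np

def scene_flow_metrics(flow_pred, flow_gt, eps=1e-9):
    """EPE3D, Acc3DS, Acc3DR, outlier ratio, and mean angle error for N x 3 flows."""
    err = np.linalg.norm(flow_pred - flow_gt, axis=1)          # per-point end point error
    gt_norm = np.linalg.norm(flow_gt, axis=1)
    rel = err / (gt_norm + eps)                                 # relative error
    cos = np.sum(flow_pred * flow_gt, axis=1) / (
        np.linalg.norm(flow_pred, axis=1) * gt_norm + eps)
    return {
        "EPE3D":    err.mean(),
        "Acc3DS":   np.mean((err < 0.05) | (rel < 0.05)),
        "Acc3DR":   np.mean((err < 0.10) | (rel < 0.10)),
        "Outliers": np.mean((err > 0.30) | (rel > 0.10)),
        "MAE":      np.mean(np.arccos(np.clip(cos, -1.0, 1.0))),  # radians
    }
```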
Ablation Study. In this section, we assess the role played by each individual module of PointFlowHop, using the stereoKITTI dataset as an example.

Ego-motion Compensation. First, we may replace GreenPCO [33] with ICP [6] for ego-motion compensation. The results are presented in Table 5.6. We see a sharp decline in performance with ICP. The substitution makes the new method much worse than all benchmarking methods. While the naive ICP could be replaced with other advanced model-free methods, it is preferable to use GreenPCO since the trained PointHop++ model is still needed later.

Table 5.6: Ego-motion compensation – ICP vs. GreenPCO.
Ego-motion method | EPE3D ↓ | Acc3DS ↑ | Acc3DR ↑ | Outliers ↓
ICP [6] | 0.574 | 0.415 | 0.481 | 0.684
GreenPCO [33] | 0.037 | 0.938 | 0.974 | 0.189

Performance Gain Due to Object Refinement. Next, we compare PointFlowHop with and without the object refinement step. The results are shown in Table 5.7. We see consistent performance improvement in all evaluation metrics with the object refinement step. On the other hand, even without object refinement, the performance of PointFlowHop is still better than that of the benchmarking methods except for Rigid3DSceneFlow [24] (see Table 5.4).

Table 5.7: Performance gain due to object refinement.
Object refinement | EPE3D ↓ | Acc3DS ↑ | Acc3DR ↑ | Outliers ↓
✗ | 0.062 | 0.918 | 0.947 | 0.208
✓ | 0.037 | 0.938 | 0.974 | 0.189

Performance Gain Due to Flow Refinement. Finally, we compare PointFlowHop with and without the flow refinement step in Table 5.8. It is not surprising that flow refinement is crucial in PointFlowHop. However, one may argue that the refinement step could be included in any of the discussed methods as a post-processing operation. While this argument is valid, we see that even without flow refinement, PointFlowHop is still better than almost all benchmarking methods (see Table 5.4). Between object refinement and flow refinement, flow refinement appears slightly more important if we consider all four evaluation metrics jointly.

Table 5.8: Performance gain due to flow refinement.
Flow refinement | EPE3D ↓ | Acc3DS ↑ | Acc3DR ↑ | Outliers ↓
✗ | 0.054 | 0.862 | 0.936 | 0.230
✓ | 0.037 | 0.938 | 0.974 | 0.189

5.3.3 Discussion

Besides performance measures such as prediction accuracy and error, the complexity of a machine learning method is a valuable consideration. We discuss the complexity of PointFlowHop in more detail next. The complexity can be examined from multiple angles, including the training time, the number of model parameters (i.e., the model size), and the number of floating point operations (FLOPs) during inference. Furthermore, since some model-free methods (e.g., LOAM [101]) and the recently proposed KISS-ICP [84] can offer state-of-the-art results for related tasks such as odometry and simultaneous localization and mapping (SLAM), the complexity of learning-based methods deserves additional attention.

To this end, PointFlowHop offers impressive benefits as compared to representative DL-based solutions. Training in PointFlowHop only involves the ego-motion compensation and scene classification steps. For object motion estimation, the PointHop++ model obtained from the ego-motion compensation step is reused, while the rest of the operations in PointFlowHop are parameter-free and performed only in inference. Table 5.9 provides details about the number of parameters of PointFlowHop. It adopts the PointHop++ architecture with two hops. The first hop has 13 kernels of dimension 88, while the second hop has 104 kernels of dimension 8. For XGBoost, there are 100 decision tree estimators, each of which has a maximum depth of 3.
We also report the training time of PointFlowHop in the same table, where the training is conducted on an Intel(R) Xeon(R) CPU E5-2620 v3 at 2.40 GHz.

Table 5.9: The number of trainable parameters and the training time of the proposed PointFlowHop.
Component | Number of parameters | Training time
Hop 1 | 1,144 | 20 minutes (Hops 1 and 2)
Hop 2 | 832 |
XGBoost | 2,200 | 12 minutes
Total | 4,176 | 32 minutes

While we do not measure the training time of other methods ourselves, we use [63] as a reference to compare our training time with others. It took the authors of [63] about 18 hours to train and fine-tune the FlowNet3D [52] method for the KITTI dataset and about 3 days for the Argoverse dataset. We expect comparable times for other methods. Thus, PointFlowHop is extremely efficient in this context. While the Graph Laplacian method [63] offers a variant where the scene flow is entirely optimized at runtime (non-learning based), its performance is inferior to ours, as shown in Table 5.5.

Table 5.10: Comparison of model sizes (in terms of the number of parameters) and the computational complexity of inference (in terms of FLOPs) of four benchmarking methods.
Method | Number of parameters | FLOPs
FlowNet3D [52] | 1.23 M (308X) | 11.67 G (61X)
PointPWC-Net [92] | 7.72 M (1930X) | 17.46 G (92X)
FLOT [64] | 110 K (28X) | 54.65 G (288X)
PointFlowHop (Ours) | 4 K (1X) | 190 M (1X)

Finally, we compare the model sizes and the computational complexity of four benchmarking methods in Table 5.10. It is apparent that PointFlowHop demands significantly fewer parameters than the other methods. Furthermore, we compute the number of floating-point operations (FLOPs) of PointFlowHop analytically during inference and report it in Table 5.10. When calculating the FLOPs, we consider input point clouds containing 8,192 points. Thus, the normalized cost is 23.19 K FLOPs per point. We conclude from the above discussion that PointFlowHop offers a green and high-performance solution to 3D scene flow estimation.

5.4 Conclusion

In this chapter, we proposed two methods, namely GreenPCO and PointFlowHop, for odometry and scene flow estimation from point clouds, respectively. Overall, PointFlowHop builds on top of the GreenPCO work. It takes two consecutive LiDAR point cloud scans and determines the flow vectors for all points in the first scan. It decomposes the flow into the vehicle's ego-motion and the motion of the individual objects in the scene. Here, the ego-motion is estimated based on the GreenPCO method. The superior performance of PointFlowHop over benchmarking DL-based methods was demonstrated on the stereoKITTI and Argoverse datasets. Furthermore, PointFlowHop has the advantages of fewer trainable parameters and fewer FLOPs during inference.

The novelty of these works lies in two aspects. First, they expand the scope of existing GL-based point cloud data processing techniques. GL-based point cloud processing has so far been developed for object-level understanding [32, 35, 36, 105, 106, 107] and indoor scene understanding [34, 104]. These works address the more challenging problem of outdoor scene understanding at the point level. They also expand the application scenario of R-PointHop, where all points are transformed using one single rigid transformation; for 3D scene flow estimation, each point has its own unique flow vector. Furthermore, we show that a single model can learn features for ego-motion estimation as well as object-motion estimation, which are two different but related tasks. This allows model sharing and opens doors to related tasks such as joint scene flow estimation and semantic segmentation.
Second, these works highlight the over-parameterized nature of DL-based solutions, which demand larger model sizes and higher computational complexity in both training and testing. The overall performance of GreenPCO and PointFlowHop suggests a new point cloud processing pipeline that is extremely lightweight and mathematically transparent.

Chapter 6: Conclusion and Future Work

6.1 Summary of the Research

In this dissertation, we focus on the problems of point cloud registration, pose estimation, rotation invariant classification, odometry, and scene flow estimation. These are among the essential tasks for the realization of a 3D vision system. First, two methods for point cloud registration, namely Salient Points Analysis (SPA) and R-PointHop, are developed for local and global registration, respectively. Both methods are unsupervised and based on the Green Learning (GL) framework. Accordingly, point features are learned in a hierarchical feedforward manner by capturing the near-to-far neighborhood information successively. The learned features are further used to find point correspondences, which then lead to the optimal 3D transformation. Experiments performed on synthetic models (the ModelNet40 dataset), real world models (the Stanford Bunny dataset), and indoor scene scans (the 3DMatch dataset) highlight the effectiveness of these methods in the 3D registration task. Specifically, R-PointHop offers several advantages due to its rotation invariant nature and is a potential tool for a wide range of applications involving point cloud data.

Next, we propose the Point Cloud Retrieval and Pose Estimation (PCRP) and SO(3)-invariant PointHop (S3I-PointHop) methods. For pose estimation, we demonstrate how the R-PointHop point features can be aggregated into a global shape feature in order to retrieve a similar aligned object from a gallery set. Then, the two similar point clouds (input and retrieved) can be registered to obtain the object pose. Furthermore, a modification of the R-PointHop feature representation, namely FR-PointHop, which makes it a more generalized feature descriptor, is discussed. PCRP assumes that the object class is known during retrieval. In the subsequent work, S3I-PointHop, we show how a 3D point cloud object with arbitrary rotation can be classified using the rotation invariant feature learning techniques developed in R-PointHop. Accordingly, the input object to PCRP can first be classified to determine its correct object class, which then facilitates retrieval from the candidate objects. S3I-PointHop outperforms PointHop-like methods in point cloud classification when the objects from the ModelNet40 dataset possess different orientations. Moreover, in S3I-PointHop, we simplify the feature learning process using a single hop and complement it with multiple spatial aggregation techniques.

The GreenPCO and PointFlowHop methods demonstrate green point cloud learning abilities for LiDAR point clouds from outdoor environments. In GreenPCO, we show how R-PointHop can be used to incrementally estimate the motion of an object by successively registering the consecutive point cloud scans captured by it. We also modify the point sampling method and restrict the search for correspondences based on prior knowledge to further boost the performance. In PointFlowHop, we decompose the scene flow into the vehicle's ego-motion and object motion components. Here, GreenPCO is used in the ego-motion compensation step.
Overall, these methods outperform several recently proposed deep learning methods.

The advances in computer vision due to deep learning usually come with a high carbon footprint: the training time and GPU usage are very large, even though the performance gains are noteworthy. The methods discussed in this dissertation are based on the GL framework and offer advantages in these regards over deep learning. First, the training process is much more efficient than deep learning since GL does not involve iterative end-to-end optimization using the backpropagation algorithm. Similarly, the number of trainable parameters (i.e., the model size) is smaller as well. That being said, there is no compromise in terms of performance. This makes these works promising. Overall, the proposed methods combine the benefits of both traditional and learning-based methods. In summary, the highlights and novelties of this research, in comparison with traditional and deep learning methods, are as follows.

• Data-driven rather than handcrafted features, yet unsupervised, as opposed to most deep learning solutions.

• Task-agnostic feature learning that enables the reuse of features for multiple related tasks – registration, retrieval, and pose estimation; or odometry and scene flow estimation.

• Unsupervised feature learning that generalizes better to unseen object categories and different datasets.

• A flexible framework to learn long-range neighborhood information based on 3D local descriptors.

• A green solution to point cloud learning with a small model size and low computational complexity.

6.2 Future Research Topics

There are several ways in which this research can be extended further. One direction is to address some of the issues in the existing works and make them better and more robust. The other direction is to use the existing methods to investigate topics such as 3D object detection and semantic segmentation in outdoor scenes. As such, the green point cloud learning methods can be fused so as to provide a one-stop solution to several problems in 3D perception.

Let us first consider some of the possibilities for improving the existing works. In Chapter 4, we briefly introduced the idea that S3I-PointHop can be used to first determine the class of the object, after which its pose may be estimated using PCRP. Then, we no longer need to assume the object class for the purpose of retrieval. However, the results for such a configuration were not reported in S3I-PointHop. It is worthwhile to consider how these two methods work together. As we have seen, the task-agnostic nature of our features allows them to be reused for multiple tasks. The feasibility of having a single feature extraction method for classification, retrieval, and pose estimation needs to be investigated. Furthermore, it will be interesting to see how the error in classification propagates to the pose estimation step. Concurrently, the classification performance of S3I-PointHop itself may be improved to match state-of-the-art methods based on rotation invariant and equivariant networks. PointFlowHop can be made entirely unsupervised by bypassing the scene classifier. Usually, the static and moving points possess discriminant characteristics, and a supervised classifier may be redundant. An alternative approach that uses prior knowledge to determine whether a point is static or moving may be developed.
Next, let us briefly review the potential future topics of 3D object detection and semantic segmentation in outdoor environments, and how the green learning methods may be adapted to solve them.

6.2.1 3D Object Detection

3D object detection in an outdoor environment is one of the primary tasks undertaken by modern autonomous vehicles. These applications demand real-time and accurate processing. The generated 3D bounding boxes and object classes facilitate downstream tasks such as prediction and motion planning. Oftentimes, data from multiple sensors such as LiDAR and stereo cameras is fused together for robust detection. Considering the works proposed in this dissertation, the PointFlowHop method is well suited to be adapted to the 3D object detection task. In PointFlowHop, it was shown in the object association step that clustering can be used to group moving points into moving objects. These grouped points belonging to moving objects can then be used to initialize object bounding boxes. A bounding box refinement step may be considered for robust detection. This can be further combined with the S3I-PointHop method to determine the object class. Overall, the task-agnostic nature of green learning methods can facilitate a series of downstream tasks using a single trained model. A similar approach using Hough voting was proposed in VoteNet [65] for 3D object detection in indoor scenes. VoteNet is summarized in Fig. 6.1.

Figure 6.1: Overview of VoteNet. The figure is from [65].

6.2.2 Semantic Segmentation

The goal of semantic segmentation is to assign a per-point label to the scene points. The Green Semantic Segmentation for indoor point clouds (GSIP) [104] method, which is based on the green learning paradigm, was recently proposed. However, outdoor scenes are usually very noisy and the point clouds are sparse. They are very different from point clouds captured in indoor scenes using depth cameras. Hence, several modifications need to be made to GSIP to suit the outdoor semantic segmentation case. Moreover, GSIP does not consider temporal information. Since PointFlowHop is based on consecutive point clouds, it is worth revisiting the semantic segmentation problem and using the flow to guide the semantic segmentation. For instance, say the semantic labels for each point have been predicted by GSIP for point cloud $X_t$ at time $t$. Then, the scene flow between $X_t$ and $X_{t+1}$ found by PointFlowHop can be used to obtain the per-point labels for $X_{t+1}$. The semantic segmentation may also be conditioned on several pieces of information from PointFlowHop, such as knowledge of the ego-motion, the static and moving points, and the object-level associations.

Overall, the PointFlowHop work is promising and can be used as an initial point to develop a 3D scene understanding system that jointly works for odometry, scene flow estimation, object detection, and semantic segmentation. Another topic that has never been touched upon in the green learning framework is multimodal sensor fusion. As such, related green learning works from 2D computer vision (such as object tracking [111, 112]) can be combined with the 3D point cloud methods for more robustness.

Bibliography

[1] Brian Amberg, Sami Romdhani, and Thomas Vetter. "Optimal step nonrigid ICP algorithms for surface registration". In: 2007 IEEE conference on computer vision and pattern recognition. IEEE. 2007, pp. 1–8.

[2] Yasuhiro Aoki, Hunter Goforth, Rangaprasad Arun Srivatsan, and Simon Lucey. "Pointnetlk: Robust & efficient point cloud registration using pointnet".
Bibliography

[1] Brian Amberg, Sami Romdhani, and Thomas Vetter. “Optimal step nonrigid ICP algorithms for surface registration”. In: 2007 IEEE conference on computer vision and pattern recognition. IEEE. 2007, pp. 1–8.
[2] Yasuhiro Aoki, Hunter Goforth, Rangaprasad Arun Srivatsan, and Simon Lucey. “Pointnetlk: Robust & efficient point cloud registration using pointnet”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, pp. 7163–7172.
[3] Tali Basha, Yael Moses, and Nahum Kiryati. “Multi-view scene flow estimation: A view centered variational approach”. In: International journal of computer vision 101 (2013), pp. 6–21.
[4] Stefan Andreas Baur, David Josef Emmerichs, Frank Moosmann, Peter Pinggera, Björn Ommer, and Andreas Geiger. “SLIM: Self-supervised LiDAR scene flow and motion segmentation”. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021, pp. 13126–13136.
[5] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall. “SemanticKITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences”. In: Proc. of the IEEE/CVF International Conf. on Computer Vision (ICCV). 2019.
[6] Paul J Besl and Neil D McKay. “Method for registration of 3-D shapes”. In: Sensor fusion IV: control paradigms and data structures. Vol. 1611. International Society for Optics and Photonics. 1992, pp. 586–606.
[7] Dorit Borrmann, Jan Elseberg, Kai Lingemann, Andreas Nüchter, and Joachim Hertzberg. “Globally consistent 3D mapping with scan matching”. In: Robotics and Autonomous Systems 56.2 (2008), pp. 130–142.
[8] Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, et al. “Argoverse: 3D tracking and forecasting with rich maps”. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019, pp. 8748–8757.
[9] Haiwei Chen, Shichen Liu, Weikai Chen, Hao Li, and Randall Hill. “Equivariant Point Network for 3D Point Cloud Analysis”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, pp. 14514–14523.
[10] Hong-Shuo Chen, Mozhdeh Rouhsedaghat, Hamza Ghani, Shuowen Hu, Suya You, and C-C Jay Kuo. “Defakehop: A light-weight high-performance deepfake detector”. In: 2021 IEEE International Conference on Multimedia and Expo (ICME). IEEE. 2021, pp. 1–6.
[11] Yang Chen and Gérard Medioni. “Object modelling by registration of multiple range images”. In: Image and vision computing 10.3 (1992), pp. 145–155.
[12] Yueru Chen and C-C Jay Kuo. “Pixelhop: A successive subspace learning (ssl) method for object recognition”. In: Journal of Visual Communication and Image Representation 70 (2020), p. 102749.
[13] Sungjoon Choi, Qian-Yi Zhou, and Vladlen Koltun. “Robust Reconstruction of Indoor Scenes”. In: Proceedings of the IEEE conference on computer vision and pattern recognition (2015), pp. 5556–5565.
[14] Ronald Clark, Sen Wang, Hongkai Wen, Andrew Markham, and Niki Trigoni. “Vinet: Visual-inertial odometry as a sequence-to-sequence learning problem”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 31. 1. 2017.
[15] Gabriele Costante and Thomas Alessandro Ciarfuglia. “LS-VO: Learning dense optical subspace for robust visual odometry estimation”. In: IEEE Robotics and Automation Letters 3.3 (2018), pp. 1735–1742.
[16] Brian Curless and Marc Levoy. “A Volumetric Method for Building Complex Models from Range Images”. In: Proceedings of the 23rd annual conference on Computer graphics and interactive techniques. 1996, pp. 303–312.
[17] Igor Cvišić, Ivan Marković, and Ivan Petrović. “Recalibrating the KITTI dataset camera setup for improved odometry accuracy”. In: 2021 European Conference on Mobile Robots (ECMR). IEEE. 2021, pp. 1–6.
[18] Congyue Deng, Or Litany, Yueqi Duan, Adrien Poulenard, Andrea Tagliasacchi, and Leonidas J Guibas. “Vector neurons: A general framework for SO (3)-equivariant networks”. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021, pp. 12200–12209.
[19] Haowen Deng, Tolga Birdal, and Slobodan Ilic. “Ppf-foldnet: Unsupervised learning of rotation invariant 3d local descriptors”. In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 602–618.
[20] Haowen Deng, Tolga Birdal, and Slobodan Ilic. “Ppfnet: Global context aware local features for robust 3d point matching”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2018, pp. 195–205.
[21] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. “A density-based algorithm for discovering clusters in large spatial databases with noise.” In: kdd. Vol. 96. 34. 1996, pp. 226–231.
[22] Martin A Fischler and Robert C Bolles. “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography”. In: Communications of the ACM 24.6 (1981), pp. 381–395.
[23] Andreas Geiger, Philip Lenz, and Raquel Urtasun. “Are we ready for autonomous driving? the kitti vision benchmark suite”. In: 2012 IEEE conference on computer vision and pattern recognition. IEEE. 2012, pp. 3354–3361.
[24] Zan Gojcic, Or Litany, Andreas Wieser, Leonidas J Guibas, and Tolga Birdal. “Weakly supervised learning of rigid 3D scene flow”. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021, pp. 5692–5703.
[25] Zan Gojcic, Caifa Zhou, Jan D Wegner, and Andreas Wieser. “The Perfect Match: 3D Point Cloud Matching with Smoothed Densities”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 5545–5554.
[26] Xiuye Gu, Yijie Wang, Chongruo Wu, Yong Jae Lee, and Panqu Wang. “Hplflownet: Hierarchical permutohedral lattice flownet for scene flow estimation on large-scale point clouds”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, pp. 3254–3263.
[27] Timo Hackel, Jan D Wegner, and Konrad Schindler. “Fast semantic segmentation of 3D point clouds with strongly varying density”. In: ISPRS annals of the photogrammetry, remote sensing and spatial information sciences 3 (2016), pp. 177–184.
[28] Frédéric Huguet and Frédéric Devernay. “A variational method for scene flow estimation from stereo sequences”. In: 2007 IEEE 11th International Conference on Computer Vision. IEEE. 2007, pp. 1–7.
[29] Hervé Jégou, Matthijs Douze, Cordelia Schmid, and Patrick Pérez. “Aggregating local descriptors into a compact image representation”. In: 2010 IEEE computer society conference on computer vision and pattern recognition. IEEE. 2010, pp. 3304–3311.
[30] Andrew E Johnson. “Spin-images: a representation for 3-D surface matching”. In: (1997).
[31] Simon J Julier and Jeffrey K Uhlmann. “New extension of the Kalman filter to nonlinear systems”. In: Signal processing, sensor fusion, and target recognition VI. Vol. 3068. International Society for Optics and Photonics. 1997, pp. 182–193.
[32] Pranav Kadam, Hardik Prajapati, Min Zhang, Jintang Xue, Shan Liu, and C.-C. Jay Kuo. “S3I-PointHop: SO (3)-Invariant PointHop for 3D Point Cloud Classification”. In: arXiv preprint arXiv:2302.11506 (2023).
[33] Pranav Kadam, Min Zhang, Jiahao Gu, Shan Liu, and C-C Jay Kuo. “GreenPCO: An Unsupervised Lightweight Point Cloud Odometry Method”. In: 2022 IEEE 24th International Workshop on Multimedia Signal Processing (MMSP). IEEE. 2022, pp. 01–06.
[34] Pranav Kadam, Min Zhang, Shan Liu, and C-C Jay Kuo. “R-PointHop: A Green, Accurate, and Unsupervised Point Cloud Registration Method”. In: IEEE Transactions on Image Processing. IEEE, 2022.
[35] Pranav Kadam, Min Zhang, Shan Liu, and C-C Jay Kuo. “Unsupervised Point Cloud Registration via Salient Points Analysis (SPA)”. In: 2020 IEEE International Conference on Visual Communications and Image Processing (VCIP). IEEE. 2020, pp. 5–8.
[36] Pranav Kadam, Qingyang Zhou, Shan Liu, and C-C Jay Kuo. “Pcrp: Unsupervised point cloud object retrieval and pose estimation”. In: 2022 IEEE International Conference on Image Processing (ICIP). IEEE. 2022, pp. 1596–1600.
[37] Rudolph Emil Kalman. “A new approach to linear filtering and prediction problems”. In: (1960).
[38] Kishore Reddy Konda and Roland Memisevic. “Learning visual odometry with a convolutional network”. In: VISAPP (1). 2015, pp. 486–490.
[39] Venkat Krishnamurthy and Marc Levoy. “Fitting Smooth Surfaces to Dense Polygon Meshes”. In: Proceedings of the 23rd annual conference on Computer graphics and interactive techniques. 1996, pp. 313–324.
[40] C-C Jay Kuo. “The CNN as a guided multilayer RECOS transform [lecture notes]”. In: IEEE signal processing magazine 34.3 (2017), pp. 81–89.
[41] C-C Jay Kuo and Yueru Chen. “On data-driven saak transform”. In: Journal of Visual Communication and Image Representation 50 (2018), pp. 237–246.
[42] C-C Jay Kuo and Azad M Madni. “Green learning: Introduction, examples and outlook”. In: Journal of Visual Communication and Image Representation. Elsevier, 2022, p. 103685.
[43] C-C Jay Kuo, Min Zhang, Siyang Li, Jiali Duan, and Yueru Chen. “Interpretable convolutional neural networks via feedforward design”. In: Journal of Visual Communication and Image Representation 60 (2019), pp. 346–359.
[44] Truc Le and Ye Duan. “Pointgrid: A deep network for 3d shape understanding”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2018, pp. 9204–9214.
[45] Xuejing Lei, Ganning Zhao, Kaitai Zhang, and C-C Jay Kuo. “TGHop: an explainable, efficient, and lightweight method for texture generation”. In: APSIPA Transactions on Signal and Information Processing 10 (2021).
[46] Feiran Li, Kent Fujiwara, Fumio Okura, and Yasuyuki Matsushita. “A Closer Look at Rotation-Invariant Deep Point Cloud Analysis”. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021, pp. 16218–16227.
[47] Ruibo Li, Chi Zhang, Guosheng Lin, Zhe Wang, and Chunhua Shen. “Rigidflow: Self-supervised scene flow learning on point clouds by local rigidity prior”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, pp. 16959–16968.
[48] Xianzhi Li, Ruihui Li, Guangyong Chen, Chi-Wing Fu, Daniel Cohen-Or, and Pheng-Ann Heng. “A rotation-invariant framework for deep point cloud analysis”. In: IEEE Transactions on Visualization and Computer Graphics. IEEE, 2021.
[49] Xiaolong Li, Yijia Weng, Li Yi, Leonidas J Guibas, A Abbott, Shuran Song, and He Wang. “Leveraging se (3) equivariance for self-supervised category-level object pose estimation from point clouds”. In: Advances in Neural Information Processing Systems 34 (2021), pp. 15370–15381.
[50] Xueqian Li, Jhony Kaesemodel Pontes, and Simon Lucey. “Neural scene flow prior”. In: Advances in Neural Information Processing Systems 34 (2021), pp. 7838–7851.
[51] Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. “Pointcnn: Convolution on x-transformed points”. In: Advances in neural information processing systems 31 (2018).
[52] Xingyu Liu, Charles R Qi, and Leonidas J Guibas. “Flownet3d: Learning scene flow in 3d point clouds”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, pp. 529–537.
[53] David G Lowe. “Distinctive image features from scale-invariant keypoints”. In: International journal of computer vision 60.2 (2004), pp. 91–110.
[54] Bruce D Lucas, Takeo Kanade, et al. “An iterative image registration technique with an application to stereo vision”. In: Vancouver, British Columbia. 1981.
[55] Ellon Mendes, Pierrick Koch, and Simon Lacroix. “ICP-based pose-graph SLAM”. In: 2016 IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR). IEEE. 2016, pp. 195–200.
[56] Moritz Menze and Andreas Geiger. “Object scene flow for autonomous vehicles”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2015, pp. 3061–3070.
[57] Moritz Menze, Christian Heipke, and Andreas Geiger. “Joint 3D estimation of vehicles and scene flow”. In: ISPRS annals of the photogrammetry, remote sensing and spatial information sciences 2 (2015), p. 427.
[58] Moritz Menze, Christian Heipke, and Andreas Geiger. “Object scene flow”. In: ISPRS Journal of Photogrammetry and Remote Sensing 140 (2018), pp. 60–76.
[59] Himangi Mittal, Brian Okorn, and David Held. “Just go with the flow: Self-supervised scene flow estimation”. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020, pp. 11177–11185.
[60] Peter Muller and Andreas Savakis. “Flowdometry: An optical flow and deep learning based approach to visual odometry”. In: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE. 2017, pp. 624–631.
[61] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. “ORB-SLAM: a versatile and accurate monocular SLAM system”. In: IEEE transactions on robotics 31.5 (2015), pp. 1147–1163.
[62] Austin Nicolai, Ryan Skeele, Christopher Eriksen, and Geoffrey A Hollinger. “Deep learning for laser based odometry estimation”. In: RSS workshop Limits and Potentials of Deep Learning in Robotics. Vol. 184. 2016, p. 1.
[63] Jhony Kaesemodel Pontes, James Hays, and Simon Lucey. “Scene flow from point clouds with or without learning”. In: 2020 international conference on 3D vision (3DV). IEEE. 2020, pp. 261–270.
[64] Gilles Puy, Alexandre Boulch, and Renaud Marlet. “Flot: Scene flow on point clouds guided by optimal transport”. In: European conference on computer vision. Springer. 2020, pp. 527–544.
[65] Charles R Qi, Or Litany, Kaiming He, and Leonidas J Guibas. “Deep hough voting for 3d object detection in point clouds”. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019, pp. 9277–9286.
[66] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. “Pointnet: Deep learning on point sets for 3d classification and segmentation”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017, pp. 652–660.
[67] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. “Pointnet++: Deep hierarchical feature learning on point sets in a metric space”. In: Advances in neural information processing systems 30 (2017).
[68] Julian Quiroga, Frédéric Devernay, and James Crowley. “Scene flow by tracking in intensity and depth data”. In: 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. IEEE. 2012, pp. 50–57.
[69] Mozhdeh Rouhsedaghat, Yifan Wang, Xiou Ge, Shuowen Hu, Suya You, and C-C Jay Kuo. “Facehop: A light-weight low-resolution face gender classification method”. In: International Conference on Pattern Recognition. Springer. 2021, pp. 169–183.
[70] Dávid Rozenberszki and András L Majdik. “LOL: Lidar-only Odometry and Localization in 3D point cloud maps”. In: 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE. 2020, pp. 4379–4385.
[71] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. “ORB: An efficient alternative to SIFT or SURF”. In: 2011 International conference on computer vision. IEEE. 2011, pp. 2564–2571.
[72] Szymon Rusinkiewicz and Marc Levoy. “Efficient variants of the ICP algorithm”. In: Proceedings third international conference on 3-D digital imaging and modeling. IEEE. 2001, pp. 145–152.
[73] Radu Bogdan Rusu, Nico Blodow, and Michael Beetz. “Fast point feature histograms (FPFH) for 3D registration”. In: 2009 IEEE international conference on robotics and automation. IEEE. 2009, pp. 3212–3217.
[74] Peter H Schönemann. “A generalized solution of the orthogonal procrustes problem”. In: Psychometrika 31.1 (1966), pp. 1–10.
[75] Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren Etzioni. “Green AI”. In: Communications of the ACM 63.12 (2020), pp. 54–63. issn: 0001-0782. doi: 10.1145/3381831.
[76] Aleksandr Segal, Dirk Haehnel, and Sebastian Thrun. “Generalized-icp.” In: Robotics: science and systems. Vol. 2. 4. Seattle, WA. 2009, p. 435.
[77] Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. “Scene coordinate regression forests for camera relocalization in RGB-D images”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2013, pp. 2930–2937.
[78] Federico Tombari, Samuele Salti, and Luigi Di Stefano. “Unique signatures of histograms for local surface description”. In: European conference on computer vision. Springer. 2010, pp. 356–369.
[79] Bill Triggs, Philip F McLauchlan, Richard I Hartley, and Andrew W Fitzgibbon. “Bundle adjustment—a modern synthesis”. In: International workshop on vision algorithms. Springer. 1999, pp. 298–372.
[80] Greg Turk and Marc Levoy. “Zippered Polygon Meshes from Range Images”. In: Proceedings of the 21st annual conference on Computer graphics and interactive techniques. 1994, pp. 311–318.
[81] Mikaela Angelina Uy and Gim Hee Lee. “Pointnetvlad: Deep point cloud based retrieval for large-scale place recognition”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 4470–4479.
[82] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention is all you need”. In: Advances in neural information processing systems 30 (2017).
[83] Sundar Vedula, Simon Baker, Peter Rander, Robert Collins, and Takeo Kanade. “Three-dimensional scene flow”. In: Proceedings of the Seventh IEEE International Conference on Computer Vision. Vol. 2. IEEE. 1999, pp. 722–729.
[84] Ignacio Vizzo, Tiziano Guadagnino, Benedikt Mersch, Louis Wiesmann, Jens Behley, and Cyrill Stachniss. “KISS-ICP: In Defense of Point-to-Point ICP Simple, Accurate, and Robust Registration If Done the Right Way”. In: IEEE Robotics and Automation Letters (2023).
[85] Christoph Vogel, Konrad Schindler, and Stefan Roth. “Piecewise rigid scene flow”. In: Proceedings of the IEEE International Conference on Computer Vision. 2013, pp. 1377–1384.
[86] Guangming Wang, Xinrui Wu, Zhe Liu, and Hesheng Wang. “Hierarchical attention learning of scene flow in 3D point clouds”. In: IEEE Transactions on Image Processing 30 (2021), pp. 5168–5181.
[87] Sen Wang, Ronald Clark, Hongkai Wen, and Niki Trigoni. “Deepvo: Towards end-to-end visual odometry with deep recurrent convolutional neural networks”. In: 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE. 2017, pp. 2043–2050.
[88] Wei Wang, Muhamad Risqi U Saputra, Peijun Zhao, Pedro Gusmao, Bo Yang, Changhao Chen, Andrew Markham, and Niki Trigoni. “Deeppco: End-to-end point cloud odometry through deep parallel neural network”. In: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE. 2019, pp. 3248–3254.
[89] Yue Wang and Justin M Solomon. “Deep closest point: Learning representations for point cloud registration”. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019, pp. 3523–3532.
[90] Yue Wang and Justin M Solomon. “PRNet: Self-Supervised Learning for Partial-to-Partial Registration”. In: Advances in Neural Information Processing Systems. 2019, pp. 8814–8826.
[91] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. “Dynamic graph cnn for learning on point clouds”. In: Acm Transactions On Graphics (tog) 38.5 (2019), pp. 1–12.
[92] Wenxuan Wu, Zhi Yuan Wang, Zhuwen Li, Wei Liu, and Li Fuxin. “Pointpwc-net: Cost volume on point clouds for (self-) supervised scene flow estimation”. In: European conference on computer vision. Springer. 2020, pp. 88–107.
[93] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. “3d shapenets: A deep representation for volumetric shapes”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2015, pp. 1912–1920.
[94] Jianxiong Xiao, Andrew Owens, and Antonio Torralba. “Sun3d: A database of big spaces reconstructed using sfm and object labels”. In: Proceedings of the IEEE international conference on computer vision. 2013, pp. 1625–1632.
[95] Heng Yang, Jingnan Shi, and Luca Carlone. “Teaser: Fast and certifiable point cloud registration”. In: IEEE Transactions on Robotics 37.2 (2020), pp. 314–333.
[96] Jiaolong Yang, Hongdong Li, Dylan Campbell, and Yunde Jia. “Go-ICP: A globally optimal solution to 3D ICP point-set registration”. In: IEEE transactions on pattern analysis and machine intelligence 38.11 (2015), pp. 2241–2254.
[97] Yijing Yang, Wei Wang, Hongyu Fu, C-C Jay Kuo, et al. “On supervised feature selection from high dimensional feature spaces”. In: APSIPA Transactions on Signal and Information Processing. Vol. 11. 1. Now Publishers, Inc., 2022.
[98] Andy Zeng, Shuran Song, Matthias Nießner, Matthew Fisher, Jianxiong Xiao, and Thomas Funkhouser. “3DMatch: Learning Local Geometric Descriptors from RGB-D Reconstructions”. In: CVPR. 2017.
[99] Andy Zeng, Shuran Song, Matthias Nießner, Matthew Fisher, Jianxiong Xiao, and Thomas Funkhouser. “3dmatch: Learning local geometric descriptors from rgb-d reconstructions”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017, pp. 1802–1811.
[100] Mingliang Zhai, Xuezhi Xiang, Ning Lv, and Xiangdong Kong. “Optical flow and scene flow estimation: A survey”. In: Pattern Recognition 114 (2021), p. 107861.
[101] Ji Zhang and Sanjiv Singh. “LOAM: Lidar Odometry and Mapping in Real-time”. In: Robotics: Science and Systems. Vol. 2. 9. 2014.
[102] Ji Zhang and Sanjiv Singh. “Visual-lidar odometry and mapping: Low-drift, robust, and fast”. In: 2015 IEEE International Conference on Robotics and Automation (ICRA). IEEE. 2015, pp. 2174–2181.
[103] Kaitai Zhang, Bin Wang, Wei Wang, Fahad Sohrab, Moncef Gabbouj, and C-C Jay Kuo. “Anomalyhop: an ssl-based image anomaly localization method”. In: 2021 International Conference on Visual Communications and Image Processing (VCIP). IEEE. 2021, pp. 1–5.
[104] Min Zhang, Pranav Kadam, Shan Liu, and C-C Jay Kuo. “Gsip: Green semantic segmentation of large-scale indoor point clouds”. In: Pattern Recognition Letters 164 (2022), pp. 9–15.
[105] Min Zhang, Pranav Kadam, Shan Liu, and C-C Jay Kuo. “Unsupervised feedforward feature (uff) learning for point cloud classification and segmentation”. In: 2020 IEEE International Conference on Visual Communications and Image Processing (VCIP). IEEE. 2020, pp. 144–147.
[106] Min Zhang, Yifan Wang, Pranav Kadam, Shan Liu, and C-C Jay Kuo. “Pointhop++: A lightweight learning model on point sets for 3d classification”. In: 2020 IEEE International Conference on Image Processing (ICIP). IEEE. 2020, pp. 3319–3323.
[107] Min Zhang, Haoxuan You, Pranav Kadam, Shan Liu, and C-C Jay Kuo. “PointHop: An explainable machine learning method for point cloud classification”. In: IEEE Transactions on Multimedia 22.7 (2020), pp. 1744–1755.
[108] Tianyu Zhao, Qiaojun Feng, Sai Jadhav, and Nikolay Atanasov. “Corsair: Convolutional object retrieval and symmetry-aided registration”. In: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE. 2021, pp. 47–54.
[109] Qian-Yi Zhou, Jaesik Park, and Vladlen Koltun. “Fast global registration”. In: European Conference on Computer Vision. Springer. 2016, pp. 766–782.
[110] Qian-Yi Zhou, Jaesik Park, and Vladlen Koltun. “Open3D: A Modern Library for 3D Data Processing”. In: arXiv preprint arXiv:1801.09847 (2018).
[111] Zhiruo Zhou, Hongyu Fu, Suya You, and C-C Jay Kuo. “Gusot: Green and unsupervised single object tracking for long video sequences”. In: 2022 IEEE 24th International Workshop on Multimedia Signal Processing (MMSP). IEEE. 2022, pp. 1–6.
[112] Zhiruo Zhou, Hongyu Fu, Suya You, C-C Jay Kuo, et al. “UHP-SOT++: An Unsupervised Lightweight Single Object Tracker”. In: APSIPA Transactions on Signal and Information Processing 11.1 ().
[113] Yao Zhu, Xinyu Wang, Hong-Shuo Chen, Ronald Salloum, and C-C Jay Kuo. “A-pixelhop: A green, robust and explainable fake-image detector”. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2022, pp. 8947–8951.