OBJECT DETECTION AND RECOGNITION FROM 3D POINT CLOUDS

by

Jing Huang

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

December 2016

Copyright 2016 Jing Huang

Acknowledgements

I would like to thank my advisor, Professor Suya You, for all his help and advice during my studies and research at the University of Southern California. I had a great time at the Computer Graphics and Immersive Technologies lab, and I hope our cooperation will continue in the future.

I would like to thank Prof. Ulrich Neumann, Prof. C.-C. Jay Kuo, Prof. Aiichiro Nakano and Prof. Hao Li for taking their precious time to serve on the committee of my qualifying exam and thesis defense.

Over the past six years it has been my great honor to collaborate with the current and previous CGIT lab members: Guan Pang, Rongqi Qiu, Qiangui Huang, Weiyue Wang, Jiaping Zhao, Wei Guan, Qianyi Zhou, Zhenzhen Gao, Quan Wang, Yi Li, Tanasai Sucontphunt, Sang Yun Lee, Luciano Nocera and Kelvin Chung. It was also a great pleasure to communicate with all other members in the graphics and vision labs at USC: Zhuoliang Kang, Yijing Li, Yili Zhao, Bohan Wang, Danyong Zhao, Hongyi Xu, Liwen Hu, Chongyang Ma, Lingyu Wei, Tianye Li, Chen Sun, Song Cao, Weijun Wang, Bor-Jeng Chen, Anh Tran, Matthias Hernandez and many others. I thank all my friends for making my life here a happy and precious experience.

I am grateful for the support, encouragement and patience of my parents, Chunqing Huang and Lijing Huang, and other family members in China.

Finally, special thanks to my fiancée Anna Baiyu Zhu, a beautiful and smart girl. I am very grateful for all her love and tenderness.

Table of Contents

Acknowledgements
List Of Tables
List Of Figures
Abstract
Chapter 1  Introduction
Chapter 2  Related Work
  2.1 3D Feature Descriptor
  2.2 3D Object Detection and Recognition
  2.3 3D Convolutional Neural Networks
Chapter 3  3D Feature and Description
  3.1 Self-Similarity
  3.2 Extending Self-Similarity to 3D Domain: Range Image
  3.3 Extending Self-Similarity to 3D Domain: 3D Self-Similarity Descriptor
Chapter 4  Detecting Industrial Parts from 3D Point Cloud
  4.1 System Overview
  4.2 Candidate Extraction based on Point Classification
  4.3 Matching based on 3D Self-Similarity Descriptor
  4.4 Experimental Results
  4.5 Refinement of the Object Detection Methods
  4.6 Experimental Results for the Refined System
Chapter 5  Detecting Object-Level Changes from 3D Point Cloud
  5.1 Related Work
  5.2 System Overview
  5.3 Point Cloud Alignment
  5.4 3D Object Detection
  5.5 Change Detection
  5.6 Experimental Results
Chapter 6  Detecting and Classifying Urban Objects from 3D Point Cloud
  6.1 Related Work
  6.2 System Overview
  6.3 Candidate Localization
  6.4 Candidate Segmentation
  6.5 Pole Classification
  6.6 Experimental Results
Chapter 7  Object Detection using Deep Convolutional Neural Network
  7.1 Introduction
  7.2 Large-Scale Connection Removal
  7.3 Knowledge-based Segmentation
  7.4 Classification using OV-CNN
  7.5 Experiments
Chapter 8  Scene Labeling using 3D Convolutional Neural Network
  8.1 Introduction
  8.2 System Overview
  8.3 Voxelization
  8.4 3D Convolutional Neural Network
  8.5 Label Inference
  8.6 Experiments
Chapter 9  Conclusion and Future Work
  9.1 Conclusion
  9.2 Future Work
References

List Of Tables

3.1 Different self-similarity correlation functions.
3.2 Robustness of the Normal Self-Similarity descriptor based on features detected by MoPC (average L2 distance between descriptors at corresponding points). Average number of points: 518.
3.3 Robustness of the Curvature Self-Similarity descriptor based on features detected by MoPC (average L2 distance between descriptors at corresponding points). Average number of points: 518.
3.4 Evaluation of matching results with different configurations (weights) of united self-similarity. Pure normal similarity performs the best overall, but curvature/photometry similarity can do better for specific data.
3.5 Parameters used in the experiments. The first several rows are the number of bins in each dimension; r is the radius of the support region from which the descriptors are generated.
3.6 Descriptor quality of different kinds of descriptors. Average number of features is 4447.
3.7 Average running time for computing one descriptor and doing one comparison in matching.
4.1 Cross-validation result for pipes of different sizes. The left column indicates the training data, and the top row the testing data. Y means at least 80% of the testing points are classified as pipe; N means the opposite.
4.2 The linearity of some of the representative targets we are interested in. There are 1223 clusters in total in this ranking.
4.3 Statistics of the trained primitive-shape classifiers. The ratio of the support vectors means Positive/Negative.
4.4 Remaining points after removal of each point category.
4.5 Statistics of detection. There are 8 big categories, 33 sub-categories and 127 instances (ground truth) of targets in the scene. Among them 62 are correctly identified (TP = True Positive), while 35 detections are wrong (FP = False Positive) and 65 instances are missed (FN = False Negative).
4.6 Statistics of detection. There are 13 categories and 129 instances (ground truth) of targets in the scene. Our method correctly identifies 89 instances among them, while 16 detections are wrong and 40 instances are missed. The missed instances are mainly due to large (> 50%) occlusion. These results, especially the recall, are substantially better than the results of the system proposed in [38].
5.1 Causes of object-level changes.
5.2 Statistics of the change detection results on the synthetic dataset. The result is aggregated from the 10 pairs of randomly generated scenes in each dataset.
5.3 Statistics of the change detection results on the real dataset.
6.1 Qualitative relationship between the class of object and the class of point classification.
6.2 Statistics of localization.
6.3 Evaluation of the pole-like object detection.
6.4 Evaluation of the pole-like object classification.
7.1 Comparison of segmentation results.
7.2 Evaluation result of different numbers of kernels in the two convolutional layers.
7.3 Evaluation of various view patch sizes. n_c1 must be divisible by 4 to satisfy the constraints.
7.4 Comparison of three CNN architectures and SVM.
8.1 Comparison of different numbers of kernels in the two 3D convolutional layers.
List Of Figures

1.1 (a) Part detection and recognition from industrial data. (b) Object detection and recognition from urban data.
1.2 Model training based on gradient descent.
3.1 Two point clouds representing two LiDAR scans of the same area captured at different times and from different aspects. The proposed method can find precise matches for the points in the overlapping area of the two point clouds.
3.2 From original local image region (left) to correlation surface (middle), then quantized using log-polar coordinates (right).
3.3 Comparison of different SSIM correlation functions on multimodal data. The horizontal axis corresponds to different pairs of LiDAR intensity and depth images, while the vertical axis shows the ratio between correct matches and total matches. The threshold ratio is 0.8 in this experiment.
3.4 Matching results on synthetic data with illumination, orientation and scale changes.
3.5 Matching result between aerial image and LiDAR depth image.
3.6 Matching result between image pairs sensed differently.
3.7 Matching result between LiDAR intensity images of a large area at different times.
3.8 Matching result between LiDAR depth image and aerial image of the same urban area.
3.9 Matching result between low-texture surfaces.
3.10 Matching result between LiDAR intensity images at different times.
3.11 Matching result between LiDAR intensity image and LiDAR depth image.
3.12 Illustration of self-similarities. Column (a) shows three point clouds of the same object and (b) their normal distributions. There is much noise in the 2nd point cloud, which leads to quite different normals from the other two. However, the 2nd point cloud shares a similar intensity distribution with the 1st point cloud, which ensures that their self-similarity surfaces (c), quantized bins (d) and thus descriptors (e) are similar to each other. On the other hand, while the intensity distribution of the 3rd point cloud is different from the other two, it shares similar normals with the 1st point cloud (3rd row vs. 1st row in column (b)), which again ensures that their self-similarity surfaces, quantized bins and descriptors are similar to each other.
3.13 Illustration of the local reference frame and quantization.
3.14 Detected salient features (highlighted) with the proposed multi-scale detector. Different sizes/colors of balls indicate different scales at which the key points are detected. These features turn out to be distinctive, repeatable and compact for matching.
3.15 SHREC benchmark dataset. The transformations are (from left to right): isometry, holes, micro holes, scale, local scale, noise, shot noise, topology and sampling.
3.16 Matching result between human point clouds with rotation, scale, affine transformation, holes and rasterization.
3.17 (a) Matching result with the dense 3D Self-Similarity descriptor on different poses of a cat. (b) Matching result with the MoPC feature-based 3D Self-Similarity descriptor on different poses of a wolf.
3.18 (a) Precision-recall curve for the 3D Self-Similarity descriptor between two wolf models from the TOSCA high-resolution data. (b) Precision-recall curve for the 3D Self-Similarity descriptor on Vancouver Richard Corridor data.
3.19 Matching result of aerial LiDAR from two scans of the Richard Corridor area in Vancouver. (b) is the filtered result of (a).
3.20 (a) Visualization of the N-SSIM surface and (b) of the C-SSIM surface. The purple point is the reference point x0. The self-similarity values are represented by colors, among which red > yellow > green > blue.
3.21 (a) and (b) are visualizations of self-similarity surfaces. (c) and (d) are visualizations of the corresponding quantization results.
3.22 Aggregated results for the bin number test.
3.23 The transformed models used in our experiments from the SHREC benchmark. The first column contains the original model, followed by transformations with holes, sampling, noise, shot noise and rasterization, respectively. Note that only the human model contains the rasterization from the SHREC 2011 benchmark.
3.24 The vehicle models used in our experiments from the Princeton 3D Shape Benchmark [83].
3.25 The ROC curves for matching evaluation of descriptors on data with (a) noise, (b) rasterization, (c) holes, and (d) sampling.
3.26 Some of the query scans. The azimuth angle for the scanner is random.
3.27 The recognition rate of descriptors.
3.28 Matching results between the recognition results. Within each pair of matchings, the top object (in blue) is the query, while the bottom object (in red) is the matched reference scan. The first row shows matching results between correctly recognized targets and the corresponding reference scans from the database. The second row shows the matching results of failed cases, which seem reasonable since their models have quite similar shapes.
3.29 Matching results between templates and the real data. Within each pair of matchings, the object in red is the template car, while the object in blue is the real scan in the urban environment.
4.1 (a) The CAD models of the part templates, which are converted into point clouds by a virtual scanner in pre-processing. (b) The colored visualization of the highly complicated scene point cloud we are dealing with.
4.2 Flowchart of the proposed object detection system.
4.3 Illustration of Point Feature Histogram calculation.
4.4 Classification result of pipe points (in green), plane points (in yellow), edge points (in blue), thin-pipe points (in dark green) and the others (in red).
4.5 Segmentation result of the remaining candidate points.
4.6 Parts detection in an assembly.
4.7 Detected parts, including hand wheels, valves, junctions and flanges.
4.8 Detected parts on top of the tanks.
4.9 Precision-recall curve of the industrial part detection.
4.10 Detection and alignment result from the cluttered scene. The chef is detected twice from two different sub-parts.
4.11 Matching a single part to a big cluster.
4.12 The process of adaptive segmentation.
4.13 Illustration of the overlapping score. (a) shows the alignment of two point clouds and (b) highlights the overlapping area within a certain threshold. The overlapping score reflects the ratio between the size of the overlapping area and one of the original point clouds.
4.14 Comparison of alignment results using (a) the maximum-number-of-matches criterion and (b) the overlapping score criterion. The point clouds in (a) are misaligned.
4.15 Comparison of segmentation results using adaptive segmentation (a), as well as fixed-tolerance segmentation with tolerance 0.03 (b) and 0.01 (c). The congested area highlighted by the bounding box in (b) is successfully separated into several smaller part-like clusters by adaptive segmentation (a), while over-segmentation (c) is avoided.
4.16 Comparison of segmentation / clustering methods. The vertical axis represents the number of correct detections using different methods. The first column is adaptive segmentation, while the remaining columns are segmentation with a fixed tolerance t.
4.17 Detection result in an industrial scene. The detected parts are shown in purple and inside the bounding boxes.
4.18 Detection results by the proposed method. The left column shows the original point clouds, while the right column shows the classification and detection results. The planes are shown in yellow, pipes in green, edges in light purple, and the detected parts in dark purple. Despite the presence of a large number of pipes (1st/2nd row), planes (3rd row) and clutter (2nd/4th/5th row), our system is capable of detecting all kinds of targets ranging from flanges and hand wheels (2nd/4th row) to tripods (3rd row) and large ladders (5th row).
5.1 An illustration of possible causes of changes between old data (top) and new data (bottom). (1) New - the new object appears; (2) Missing - the old object is missing; (3)-(5) Pose - the pose of the object is changed due to translation and rotation; (6) Replacement - the old object is replaced by the new object.
5.2 System pipeline for object change detection. The input contains the reference data (left) and the target data (right). Results from major stages including global alignment, object detection and change detection are shown. In the illustration of object detection, detected objects are highlighted with green color and bounding boxes. The visualization of change detection results is explained in Section 5.6.
5.3 Decision flow of pairwise change estimation.
5.4 (a)(b) Original point clouds. (c) Global alignment result. (d)(e) Detection results. (f) Change detection results. The visualization of the change detection results is explained in Section 5.6.
5.5 (a) Original mesh model. (b) Original point cloud. (c) Global alignment result. (d)(e) Detection results. (f) Change detection results. The visualization of the change detection results is explained in Section 5.6.
6.1 Representative poles. From left to right: streetlights, flags, utility poles, signs, meters and traffic lights, respectively.
6.2 The pipeline of the proposed pole detection and classification system.
6.3 The step-by-step results of candidate localization. (a) The sliced result. (b) The clustering result for each slice. (c) The segments satisfying the pole criteria.
6.4 Ground and disconnected region trimming process. (a) The original bucket-augmented candidate. (b) The red part is the above-the-seed cluster, while the blue part is the under-the-seed, or ground-connected, cluster. (c) When ground trimming is done, the bottom part of the pole is successfully extended. (d) Finally, the disconnected components with respect to the seed trunks are removed.
6.5 Distribution of height of poles.
6.6 Comparison of point classification results of a facade patch before and after smoothing. (a) shows the unsmoothed point classification result, in which most points are correctly classified as plane. However, the smoothed point classification result (b) wrongly turns the planar points into linear points.
6.7 Point classification results on different objects. The meanings of the colors are: blue - vertical linear, red - planar, black - volumetric, orange - wire, green - other linear.
6.8 Pole detection and classification result of the whole area.
6.9 Typical failure cases of our method. The left is a false alarm due to the sparsity of the tree, while the right is a false negative caused by a light with many bulbs.
6.10 Close-ups of pole detection and classification results by our method. We use yellow to denote the lights, blue to denote utility poles and red to denote signs.
7.1 System overview.
7.2 Results for cluster orientation estimation. (a) Orientation estimation result for candidate clusters along the road. The red line is the principal axis and the blue line is the secondary axis. The estimation is quite close to human perception for the cars. (b) For clusters containing multiple cars (typically in the parking lots), the orientation can be estimated as well.
7.3 Results for gap segmentation. In (a), the large cluster is divided into 1×5 sub-regions. Also, the ground that failed to be removed is correctly excluded in the result. In (b), the connected cars along the road are separated.
7.4 Orthogonal-View CNN with fusion at the fully-connected hidden layer, which performs the best among the three architectures. All networks have 6 layers.
7.5 Vehicle detection results. The vehicles are highlighted in red. (a) shows the example of vehicles parking along the street. (b) and (c) show examples of large parking lots. (d) shows that our method can handle both cars and trucks.
8.1 Objects of interest. The buildings are colored orange, planes (including ground and roof) yellow, trees green, cars red, poles blue, and wires black. The points in the other categories are colored light gray.
8.2 The labeling system pipeline, including the offline training module and the online testing module.
8.3 Illustration of dense voxelization. The input point cloud (a) is parsed through the voxelization process, which generates the dense voxel representation depicted in (b).
8.4 3D Convolutional Neural Network. The numbers on the top denote the number of nodes in each layer. The input is the voxel grid of size 20^3, followed by a convolutional layer with 20 feature maps of size 5×5×5 resulting in 20×16^3 outputs, a max pooling layer with 2×2×2-sized non-overlapping divisions resulting in 20×8^3 outputs, a second convolutional layer with 20 feature maps of size 5×5×5 resulting in 20×4^3 outputs, a second max pooling layer with 2×2×2-sized non-overlapping divisions resulting in 20×2^3 outputs, a fully connected layer with 300 hidden nodes, and the final output based on a softmax over 8 labels (including 7 categories and an empty label).
8.5 Confusion matrix for different categories. The entry at the i-th row and the j-th column denotes the percentage of points of the j-th truth category that are classified as the i-th category.
8.6 Labeling result for a large urban area through the 3D-CNN.
8.7 Close-ups of the labeling result. (a) The ground planes, buildings, trees and cars. (b) The street view with various poles including light poles, sign poles, utility poles and flag poles. (c) The street view with utility poles and wires. (d) The parking lot scene.

Abstract

Object detection and recognition are two highly related fundamental problems in computer vision. While most existing works have been in the 2D image domain, 3D data have been gaining popularity in recent years thanks to the development of 3D sensors. This work focuses on object detection and recognition from 3D point clouds, which involves various stages of point cloud processing including feature description, matching, segmentation, localization, classification, and labeling.

We explore two ways of using feature descriptors in the detection and recognition tasks. The first way is to compute a set of descriptors on the local neighborhood of feature points and use them in the matching-based framework.
The second way is to compute a descriptor for a whole candidate cluster, and then apply machine learning techniques to classify the clusters without knowing the exact poses.

Matching of 3D data is challenging in that there is usually an enormous number of 3D points and the coordinate system can vary in terms of translation, 3D rotation and scaling. Point positions are generally not coincident; noise and occlusions are common due to incomplete scans; and objects are attached to each other and to background structures such as the ground and pipes. Furthermore, many datasets may not contain any photometric information, such as intensity, beyond the point positions. Given these problems, matching methods that rely solely on photometric properties will fail, and conventional techniques or simple extensions of 2D methods are no longer feasible. The unique nature of point clouds requires methods and strategies different from those for 2D images.

We first present a novel representation specifically designed for matching of 3D point clouds. In particular, our approach is based on the concept of self-similarity, which aims at capturing the internal geometric layout of local patterns. We develop a generalized self-similarity estimation method in the 3D domain that can incorporate normal, curvature, intensity and other properties of the point cloud. Furthermore, we design the 3D self-similarity descriptor using a local reference frame based on the normal and principal curvature directions. The resulting 3D self-similarity descriptor is compact and view-independent. We also present a feature selection method based on maxima of principal curvature.

We apply the 3D self-similarity descriptor to build a complete feature-based matching subsystem, where we propose a new correspondence selection method that imposes rigid-body constraints. The subsystem is further incorporated into the object detection and recognition system.
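As a toy illustration of the descriptor construction, the sketch below bins the similarity between each neighbor's normal and the center point's normal over radial shells of the support region. This is a heavy simplification, assuming plain radial binning in place of the full local-reference-frame quantization described above; the function name, bin counts and similarity mapping are illustrative, not taken from the actual implementation.

```python
import numpy as np

def self_similarity_descriptor(points, normals, center_idx, radius=1.0,
                               n_radial=4, n_sim=4):
    """Toy 3D self-similarity descriptor: a normalized 2D histogram over
    (radial shell, normal-similarity value) inside the support region."""
    center = points[center_idx]
    d = np.linalg.norm(points - center, axis=1)
    mask = (d > 0) & (d < radius)          # neighbors, excluding the center
    # Normal-based self-similarity, mapped from [-1, 1] to [0, 1].
    sims = (normals[mask] @ normals[center_idx] + 1.0) / 2.0
    shells = np.minimum((d[mask] / radius * n_radial).astype(int), n_radial - 1)
    bins = np.minimum((sims * n_sim).astype(int), n_sim - 1)
    desc = np.zeros((n_radial, n_sim))
    np.add.at(desc, (shells, bins), 1.0)   # accumulate histogram counts
    norm = np.linalg.norm(desc)
    return desc.ravel() / norm if norm > 0 else desc.ravel()
```

A real implementation would additionally orient the quantization by the local reference frame (normal and principal curvature directions), which is what makes the actual descriptor view-independent.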
The whole system combines learning-based point classification, adaptive segmentation, feature-based matching and an iterative process for multi-object detection in large-scale point clouds. The framework not only takes global and local information into account, but also benefits from both learning-based and empirical methods.

We further apply our object detection system to another interesting and challenging problem: given two datasets of the same site scanned or modeled at different times, how can we tell the difference between the two? We formulate this as the 3D change detection problem, and propose a novel method for detecting object-level changes. In general, we observe that changes can be viewed as inconsistencies between the global alignment and the local alignment. Therefore, we devise a change detection framework that comprises global alignment, local object detection and a novel change detection method. Specifically, we design a series of change evaluation functions for pairwise change inference, based on which we formulate the many-to-many object change correlation problem as a weighted bipartite matching problem that can be solved efficiently.

The matching-based approach works well for point clouds with rich details and high intra-class correlation, e.g., industrial parts or different scans of the same object. However, for urban objects, such as vehicles, poles and trees, the intra-class variation is much higher, requiring a different strategy. In particular, we find learning-based approaches to be very effective, with good generalizability.

For pole-like objects, we find that they differ more in global appearance than in local appearance. We employ a three-stage system comprising localization, segmentation and classification. The slicing-based localization algorithm takes advantage of the unique characteristics of pole-like objects and avoids the heavy per-point feature computation of traditional methods.
Then, the bucket-shaped neighborhood of the segments is integrated and trimmed with region-growing algorithms, reducing the noise within each candidate's neighborhood. Finally, we introduce a representation of six attributes, based on the height and five PCA-based features, and apply an SVM to classify the candidate objects into four categories: lights, utility poles and signs, as well as the non-pole category.

For vehicles, the PCA features are not enough to tell them apart from other planar objects such as buildings. To this end, we apply deep Convolutional Neural Networks (CNN) to the orthogonal-view information from the candidates. The system consists of a segmentation stage and a classification stage. Prior knowledge about vehicles and the urban environment is utilized to help the detection process. Specifically, we incorporate curb detection and removal in the segmentation stage. Our approach is able to estimate the orientation of the candidates and use it to handle difficult cases such as vehicles in parking lots. To distinguish the vehicles from the other segments among the 3D point cloud candidates, we develop and compare three CNN architectures based on the fusion of orthogonal-view projections of the candidates.

Finally, we observe that projecting 3D data to 2D representations such as depth images and then applying 2D techniques can sometimes lead to the loss of important structural information embedded in the 3D representation. Inspired by the success of deep learning on 2D problems, we present a voxel-based fully-3D Convolutional Neural Network for the point cloud labeling problem. Our approach minimizes the use of prior knowledge and does not require a segmentation step or hand-crafted features, as most previous approaches did. We also present solutions for handling large data during the training and testing process.

We demonstrate the proposed object detection and recognition systems through experiments on point clouds from industrial datasets and large-scale urban datasets.
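The voxelization stage that feeds such a 3D-CNN can be sketched as follows: points falling inside a cubic region of interest are rasterized into a dense occupancy grid (20×20×20 here, matching the 20^3 input grid mentioned above; the binary-occupancy encoding and the function name are illustrative assumptions, not necessarily the exact scheme used in this work).

```python
import numpy as np

def voxelize(points, origin, voxel_size, grid_dim=20):
    """Convert a point cloud to a dense binary occupancy grid of shape
    (grid_dim, grid_dim, grid_dim). Points outside the cube
    [origin, origin + grid_dim * voxel_size) are ignored."""
    idx = np.floor((points - origin) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < grid_dim), axis=1)
    grid = np.zeros((grid_dim,) * 3, dtype=np.float32)
    ix, iy, iz = idx[inside].T
    grid[ix, iy, iz] = 1.0          # mark occupied voxels
    return grid
```

The resulting grid can be passed directly to a 3D convolutional network; variants could store point counts or average intensities per voxel instead of binary occupancy.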
Chapter 1
Introduction

Object detection and object recognition are two of the most basic tasks in computer vision. There are numerous works focusing on the detection of various types of objects. Until recently, however, most methods worked on 2D image data. As sensor technology has developed rapidly in recent years, 3D point clouds of real scenes have become increasingly popular and precise, while there are far fewer methods trying to directly detect or recognize objects from the 3D data.

The goal of my research is to detect objects of relatively small scale from large scene point clouds and recognize their specific categories. Figure 1.1 shows two typical datasets we are dealing with. Our approach aims at detecting and recognizing industrial parts, such as valves, flanges, handwheels and connectors, from industrial point clouds (Figure 1.1(a)), as well as urban objects, such as different kinds of pole-like objects including street lights, utility poles and street signs, and vehicles, from urban point clouds (Figure 1.1(b)).

Figure 1.1: (a) Part detection and recognition from industrial data. (b) Object detection and recognition from urban data.

Compared to 2D image data, 3D point cloud data have some advantages. For example, the spatial geometric structure of the points is explicit. However, there are also several challenges for 3D point cloud detection. Firstly, the texture information is not as clear as in 2D images; secondly, just like 2D data, 3D data can be affected by noise and occlusions; thirdly, the objects have more degrees of freedom in 3D space, increasing the difficulty of alignment; finally, point cloud data are typically very large, with millions of points in a single scene, so efficiency is vital for any algorithm.

We explore two ways of using feature descriptors in detection and recognition tasks. The first is to compute a set of descriptors on the local neighborhood of feature points and use them in a matching-based framework.
The second way is to compute a descriptor for a whole cluster, and classify the cluster as a whole, without knowing the exact pose, using a learning-based framework.

In the first part of the work we emphasize the first approach, i.e., the matching-based strategy, which works well for point clouds with rich details and high intra-class correlation, e.g., the industrial parts or the different scans of the same object.

Matching is a process of establishing precise correspondences between two or more datasets acquired, for example, at different times, from different aspects, or even from different sensors or platforms. We address the challenging problem of finding precise point-to-point correspondences between two 3D point clouds.

Specifically, inspired by the 2D self-similarity descriptor [82], we find that the self-similarity characteristics can represent the intrinsic structure of data, regardless of the modality of data. We extend this idea in two scopes: converting 3D data to 2D range images and applying a feature-based 2D self-similarity descriptor, and a generalized representation of self-similarity evaluation in the 3D domain that can incorporate normal, curvature, intensity and other properties of the point cloud. The representation efficiently captures distinctive geometric signatures embedded in point clouds. Furthermore, we design the 3D self-similarity descriptor using the local reference frame based on the normal and principal curvature directions. The resulting 3D self-similarity descriptor is compact and view-independent.

We apply the 3D self-similarity descriptor to build a complete feature-based matching subsystem, where we propose a new correspondence selection method that imposes a rigid-body constraint. The subsystem is further incorporated in the object detection and recognition system. The system combines learning-based point classification, adaptive segmentation, feature-based matching and an iterative process for multi-object detection in continuous point clouds.
The framework not only takes global and local information into account, but also benefits from both learning and empirical methods. We also show that applying two different local descriptors (FPFH [74] and 3DSSIM [37]) in different phases of processing can be the best choice for the whole system.

We further apply our object detection system to another interesting and challenging problem: given two data of the same site scanned/modeled at different times, how can we tell the difference between the two? We formulate this problem as the 3D change detection problem, and propose a novel method for detecting object-level changes. In general, we notice that the changes can be viewed as the inconsistency between the global alignment and the local alignment. Therefore, we propose a change detection framework that comprises global alignment, local object detection and a novel change detection method. Specifically, we propose a series of change evaluation functions for pairwise change inference, based on which we formulate the many-to-many object change correlation problem as the weighted bipartite matching problem, which can be solved efficiently.

In the second part of the work, we emphasize the second strategy, i.e., the learning-based approach, which works better for point clouds with high intra-class variation, e.g., urban objects such as vehicles, poles and trees, and has good generalizability.

We first explore the global feature-based approach in another object detection and recognition task, for pole-like objects from point clouds obtained in urban areas, since such objects differ more in their global appearance than in their local appearance. The system consists of three stages: localization, segmentation and classification. The localization algorithm, based on slicing, clustering, pole seed generation and bucket augmentation, takes advantage of the unique characteristics of pole-like objects and avoids the heavy per-point feature computation of traditional methods.
Then, the bucket-shaped neighborhood of the segments is integrated and trimmed with region growing algorithms, reducing the noise within each candidate's neighborhood. Finally, we introduce a representation of six attributes based on the height and five point classes closely related to the pole categories, and apply SVM to classify the candidate objects into four categories, including three pole categories (light, utility pole and sign) and the non-pole category.

Then, we find Convolutional Neural Networks very effective in solving both the object classification problem and the point labeling problem. Generally, the Neural Network-based learning methods we use follow the training pattern depicted in Figure 1.2.

Figure 1.2: Model training based on gradient descent.

In the vehicle detection problem, we apply deep Convolutional Neural Networks (CNN) to the orthogonal-view information from the candidates. The system consists of a segmentation stage and a classification stage. Prior knowledge of vehicles and the urban environment is used to help the detection process. Specifically, during the segmentation stage we incorporate ground removal using normal-constrained region growing, building removal using height information, and curb detection using horizontal linear point clustering. We employ the idea of normal voting to estimate the orientation of the candidates and use it to handle difficult segmentation problems such as separating vehicles in parking lots. To distinguish the vehicles from other segments among the 3D point cloud candidates, we develop and compare three architectures of CNN based on the fusion of orthogonal view projections of the candidates.
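To make the orthogonal-view idea concrete, the sketch below projects a candidate point cloud onto three axis-aligned occupancy images, the kind of 2D input a CNN branch could consume. This is a minimal illustration only: the grid resolution and the binary occupancy encoding are assumptions of this sketch, not the exact design used in the thesis.

```python
import numpy as np

def orthogonal_views(points, grid=64):
    """Project an (N, 3) candidate point cloud onto three orthogonal
    binary occupancy images (xy, xz and yz planes)."""
    # Normalize the candidate into the unit cube.
    mins = points.min(axis=0)
    span = np.maximum(points.max(axis=0) - mins, 1e-6)
    idx = ((points - mins) / span * (grid - 1)).astype(int)

    views = np.zeros((3, grid, grid), dtype=np.float32)
    for axis, (u, v) in enumerate([(0, 1), (0, 2), (1, 2)]):
        views[axis, idx[:, u], idx[:, v]] = 1.0  # mark occupied cells
    return views

# A toy candidate: random points in a car-like 2m x 1m x 0.5m box.
pts = np.random.rand(500, 3) * [2.0, 1.0, 0.5]
v = orthogonal_views(pts)
print(v.shape)  # (3, 64, 64)
```

The three view images could then be fed to parallel convolutional branches and fused, which is the spirit of the architecture comparison described above.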
Finally, to avoid the loss of important structural information embedded in the 3D representation that occurs when projecting 3D data to 2D representations such as depth images and then applying 2D techniques, and inspired by the success of deep learning on 2D problems, we present a voxel-based fully-3D Convolutional Neural Network for the point cloud labeling problem. Our approach minimizes the use of prior knowledge and does not require a segmentation step or hand-crafted features as most previous approaches did. We also present solutions for large data handling, balancing of training data, and data augmentation during the training and testing process.

The remainder of the thesis is organized as follows. Chapter 2 summarizes the related work on feature descriptors, object detection and recognition systems, and 3D Convolutional Neural Networks. Chapters 3-5 mainly discuss the first strategy, i.e., the matching-based approach. Chapter 3 illustrates the concept of self-similarity as well as the formation of the 3D Self-Similarity descriptor. Chapter 4 describes the whole object detection and recognition system that employs the 3D Self-Similarity feature-based matching module. Chapter 5 further builds a change detection system based on the object detection system in Chapter 4. Chapters 6-8 mainly discuss the second strategy, i.e., the learning-based approach. Chapter 6 presents a pole-like object detection and classification system that applies PCA-based features and a Support Vector Machine in the classification task. Chapter 7 introduces the Orthogonal-View Convolutional Neural Networks (OV-CNN) and their application to the vehicle detection problem. Chapter 8 further extends the deep learning method to the fully 3D Convolutional Neural Network and utilizes it to solve the 3D point cloud labeling problem through a voxelization process. Finally, conclusions and future work are summarized in Chapter 9.
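The voxel-based input described in this chapter can be sketched in a few lines: a binary occupancy grid built from the raw points, followed by one naive 3D convolution, the elementary operation that a fully-3D CNN stacks into layers. Grid size and kernel are arbitrary illustrative choices, not the parameters of the network presented in Chapter 8.

```python
import numpy as np

def voxelize(points, grid=32):
    """Convert an (N, 3) point cloud into a binary occupancy grid,
    the minimal-prior-knowledge input representation."""
    mins = points.min(axis=0)
    span = np.maximum(points.max(axis=0) - mins, 1e-6)
    idx = ((points - mins) / span * (grid - 1)).astype(int)
    vox = np.zeros((grid, grid, grid), dtype=np.float32)
    vox[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return vox

def conv3d(vol, kernel):
    """Naive single-channel 'valid' 3D convolution (stride 1),
    written without any deep learning framework for clarity."""
    k = kernel.shape[0]
    out = np.zeros([s - k + 1 for s in vol.shape], dtype=np.float32)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for l in range(out.shape[2]):
                out[i, j, l] = np.sum(vol[i:i + k, j:j + k, l:l + k] * kernel)
    return out

vox = voxelize(np.random.rand(1000, 3))
feat = conv3d(vox, np.ones((3, 3, 3), dtype=np.float32))
print(vox.shape, feat.shape)  # (32, 32, 32) (30, 30, 30)
```

In practice such a layer would of course be expressed in a deep learning framework with multiple channels, nonlinearities and learned weights; the sketch only shows the data flow from points to voxels to 3D feature maps.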
Chapter 2
Related Work

2.1 3D Feature Descriptor

3D data matching has recently been widely addressed in both the computer vision and graphics communities. A variety of methods have been proposed, but the approaches based on local feature descriptors demonstrate superior performance in terms of accuracy and robustness [41,90]. In the local feature-based approach, the original data are transformed into a set of distinctive local features, each representing a quasi-independent salient region within the scene. The features are then characterized with robust descriptors containing local surface properties that are supposedly repeatable and distinctive for matching. Finally, registration methods can be employed to figure out the global arrangement.

One of the most well-known descriptors, the Spin image, is designed for 3D surface representation and matching [41]. The key element of spin image generation is the oriented point, or 3D surface point with an associated direction. Once the oriented point is defined, the surrounding cylindrical region is compressed to generate the spin image as the 2D histogram of the number of points lying in different distance grids. By using a local object-oriented coordinate system, the spin image descriptor is view and scale independent. Several variations of the spin image have been suggested. For example, [73] proposed a spherical spin image for 3D object recognition, which can capture the equivalence classes of spin images derived from linear correlation coefficients.

Many works also attempt to generalize concepts from 2D to 3D, such as 3D SURF [44] extended from SURF [5] and 3D Shape Context [46] extended from 2D Shape Context [7].

Taxonomy Tombari et al. [93] proposed to categorize the descriptors into two classes, i.e., signatures and histograms.
The histogram-based descriptors include Spin Image (SI) [41], Local Surface Patches (LSP) [17], 3D Shape Context (3DSC) [28], Intrinsic Shape Signatures (ISS) [107], Tensor [60], Point Feature Histograms (PFH) [77], Fast Point Feature Histograms (FPFH) [74], Unique Shape Context (USC) [92], Fast Integral Normal 3D (FINDDD) [72] and Normal Aligned Radial Feature (NARF) [87]. The signature-based descriptors include Structural Indexing (StInd) [88], Point Signature (PS) [18], 3D Point's Fingerprint (3DPF) [91], Exponential Mapping (EM) [62] and 3D Speeded Up Robust Features (3DSURF) [44]. SHOT [93] uses both histogram and signature representations. Using this taxonomy, our method (3D Self-Similarity, 3DSSIM) [37] lies in the signature-based category. Through our discussion in Section 3.3.5, we find that, in general, the histogram-based methods are more robust to the difference between non-existence and existence, while the non-histogram-based methods are more robust across different levels of noise.

Local Reference Frame Besides the conceptual idea, the selection of the Local Reference Frame (LRF) is also vital for a reliable rotation-invariant descriptor. Tombari et al. [93] summarize the uniqueness and unambiguity of the LRF for several descriptors, including EM [62] and ISS [107], which are unique but ambiguous, as well as StInd [88], PS [18], 3DPF [91], 3DSC [28] and Tensor [60], which are not unique but use multiple descriptors to remove the ambiguity. Our LRF is similar to that of ISS [107] and can have up to 4 orientations due to the directions of the eigenvectors. We will discuss this topic in Section 3.3.5 as well.

Most of the descriptors mentioned above are based on the assumption of a local neighborhood and a rigid shape, which is also the main focus of our work. However, it is worth mentioning that there are descriptors for non-rigid shapes and for global shapes.

Descriptor for Non-Rigid Shape The Heat Kernel Signature (HKS) proposed by Sun et al.
is a type of shape descriptor targeted at matching objects under non-rigid transformation [90]. The idea of HKS is to make use of the heat diffusion process on the shape to generate an intrinsic local geometry descriptor. It is shown that HKS can capture much of the information contained in the heat kernel and characterize shapes up to isometry. Further, Bronstein and Kokkinos [14] improved HKS to achieve scale invariance, and developed an HKS local descriptor that can be used in the bag-of-features framework for shape retrieval in the presence of a variety of non-rigid transformations.

Global Feature Descriptor Most descriptors are applied locally in the neighborhood of a keypoint. While local descriptors can be used as global descriptors by extending the neighborhood, a down-sampling process is needed for efficiency. On the other hand, there are also descriptors designed to be computed over a whole cluster, e.g., the Viewpoint Feature Histogram (VFH) [75]. From a generalized perspective, the proposed 6-D cluster attributes used in pole classification in Chapter 6 can be viewed as a global feature descriptor.

Benchmarks There have been some benchmarks on 3D descriptors. An evaluation of 3D descriptors for object and category recognition can be found in [1]. A detailed performance evaluation and benchmark on 3D shape matching were reported in SHREC'10 [13], which simulates the feature detection, description and matching stages of feature-based matching and recognition algorithms. The benchmark tests the performance of shape feature detectors and descriptors under a wide variety of conditions. After that, SHREC'11 [11] tested the performance of descriptors under several different transformations. Note that the descriptors in [11] are mostly based on mesh data. In our work, we use the SHREC benchmarks to test and evaluate our approach.
Self-Similarity Recently, the concept of self-similarity has drawn much attention and been successfully applied to image matching and object recognition. Shechtman and Irani [82] proposed the first algorithm that explicitly employs self-similarity to form a descriptor for image matching. They used the intensity correlation computed in a local region as the resemblance measure to generate the local descriptor. Later on, several extensions and variants were proposed. For example, Chatfield et al. [16] combine a local self-similarity descriptor with the bag-of-words framework for image retrieval of deformable shape classes; Maver [57] uses the local self-similarity measurement for interest point detection.

2.2 3D Object Detection and Recognition

Previous works in point cloud processing can be divided into several categories based on their focus: segmentation, classification, matching, modeling, registration and detection. Object detection in scene point clouds is a systematic task that typically requires multiple techniques in different aspects.

Scene and Object There are numerous works focusing on different sorts of point cloud data. One big category focuses on urban areas. For example, Patterson et al. [64], Frome et al. [28] and Matei et al. [54] try to detect vehicles; Yokoyama et al. [105] and Lehtomäki et al. [50] concentrate on detecting poles. A detailed review of pole-like object detection and classification can be found in Chapter 6. Golovinskiy et al. [31] try to detect both vehicles and posts as well as other small urban objects. Another category is indoor scenes. Steder et al. [86] use a model-based method to detect chairs and tables in the office. Nüchter et al. [63] aim at classifying objects such as office chairs. Koppula et al. [45] use a graphical model to capture various features of indoor scenes. Rusu et al. [78] try to obtain 3D object maps from scanned point clouds, mainly focusing on indoor household objects.
However, very few of them work on industrial scene point clouds: Vosselman et al. [98] review some techniques used for recognition in industrial as well as urban scenes. Schnabel et al. [80] present a RANSAC algorithm that can detect basic shapes, including planes, spheres, cylinders, cones and tori, in point clouds. However, it does not deal directly with the complex shapes that might represent a part.

Learning-based Frameworks Learning-based methods are widely used in detection and/or classification tasks. One of the most successful learning-based frameworks is the boosted cascade framework [97], which is based on AdaBoost [27] and selects the dominating simple features to form rapid classifiers. Other common techniques include SVM (e.g. [36]), ISM (e.g. [96]), Conditional Random Fields (e.g. [102]) and Markov networks (e.g. [2,81]), among which SVM is famous for its simple construction yet robust performance. MRF-based methods focus on the segmentation of 3D scan data, which could be an alternative to the segmentation and clustering process in our framework.

3D Object Detection and Labeling Pinheiro et al. applied a Recursive Neural Network to scene parsing [65]. Habermann et al. present a 3D object recognition approach based on a Multiple Layer Perceptron (MLP) on 2D projected data [33]. However, their method requires data segmentation. Koppula et al. take a large-margin approach to perform 3D labeling classification based on various features [45]. PCA analysis and the dimensionality feature based on it have been applied to point-level classification tasks [22,105]. Still, the parameter selection for these features is highly empirical, and we show in this work that a 3D-CNN based on the simplest occupancy voxels can achieve comparable effects without any task-specific features.

Segmentation In order to reduce the size of the problem, segmentation techniques are often employed in 3D point cloud pre-processing. Many methods are based on graph cuts.
For example, Golovinskiy and Funkhouser [30] use a min-cut algorithm for outdoor urban scans. Douillard et al. [24] present a set of segmentation methods for different types of point clouds. They also proposed an evaluation metric that can quantify the performance of segmentation algorithms. In our work, we aim to decompose the point clouds into meaningful regions and separate the parts from each other, while avoiding breaking one part into multiple pieces, with an adaptive segmentation scheme.

Matching and Correspondence Selection The extensively applied strategy, feature-based matching, performs feature extraction and then computes descriptors at each of the extracted keypoints. For example, Patterson et al. [64] obtain the keypoints by placing a 3D grid and compute the Spin Image descriptor [41]. There are several complex correspondence selection schemes, most of which originate from RANSAC [25]. For example, the Maximum Likelihood Estimation by Sample Consensus (MLESAC) scheme proposed in [95] is also applied in [94] for image matching. Chum and Matas [19] propose the Progressive Sample Consensus (PROSAC), which reduces the computation with a classification score. They further present a randomized model verification strategy for RANSAC, based on Wald's theory of sequential decision making [20]. A comprehensive survey of RANSAC techniques can be found in [71]. We do not apply a randomized scheme, but instead take advantage of prior knowledge to prune the outliers during the selection procedure.

For the whole system, Patterson et al. [64] propose an object (mainly car) detection system for urban areas with bottom-up and top-down descriptors. Mian et al. [59] also propose a segmentation and matching scheme to deal with cluttered scenes, but it is fully based on models instead of point clouds.
Since models can be conveniently converted to point clouds through a virtual scanner but not vice versa, and the scanned data are usually in the point cloud format, we provide a viable solution to process 3D data, especially data without edge information.

2.3 3D Convolutional Neural Networks

We introduce 3D Convolutional Neural Networks to solve the urban point cloud labeling problem. 3D CNNs were first proposed for video data analysis, because videos can be seen as a temporal 3D extension of 2D images and have well-defined grid values. Ji et al. proposed a 3D-CNN for human action recognition in video data [40]. For 3D point clouds, Maturana and Scherer applied a 3D-CNN to landing zone detection from LiDAR point clouds [55]. Prokhorov presented a 3D-CNN for categorization of segmented point clouds [67]. 3D ShapeNets applied a 3D CNN to learn the representation of 3D shapes [100]. VoxNet integrated a volumetric Occupancy Grid representation with a 3D CNN [56]. All of these approaches require pre-segmented objects before applying the 3D-CNN method. To incorporate the localization problem into the 3D CNN framework, Song and Xiao proposed deep sliding shapes for 3D object detection in depth images [84] and RGB-D images [85], with a 3D Region Proposal Network (RPN). However, they use the Manhattan world assumption to define the bounding box orientation of indoor objects, which is not feasible for the outdoor objects in our case.

Chapter 3
3D Feature and Description

Figure 3.1 shows an example of point cloud data acquired by an airborne LiDAR sensor. The two point clouds represent two LiDAR scans of the same area (downtown Vancouver) acquired at different times and from different viewpoints. The goal is to find precise matches for the points in the overlapping areas of the two point clouds.
Matching of point clouds is challenging in that there are usually an enormous number of 3D points, and the coordinate system can vary in terms of translation, 3D rotation and scale. Point positions are generally not coincident; noise and occlusions are common due to incomplete scans; and objects are attached to each other and/or the ground. Furthermore, many data may not contain any photometric information, such as intensity, other than point positions.

Given the problems above, matching methods that solely rely on photometric properties will fail, and conventional techniques or simple extensions of 2D methods are no longer feasible. The unique nature of point clouds requires methods and strategies different from those for 2D images.

Figure 3.1: Two point clouds represent two LiDAR scans of the same area captured at different times and from different aspects. The proposed method can find precise matches for the points in the overlapping area of the two point clouds.

In real applications, most point clouds are sets of geometric points representing external surfaces or shapes of 3D objects. We therefore treat the geometry as the essential information. We need a powerful descriptor as a way to capture geometric arrangements of points, surfaces and objects. The descriptor should be invariant to object translation, scaling and rotation. In addition, the high-dimensional structure of 3D points must be collapsed into something manageable.

We present a novel technique specifically designed for the matching of 3D point clouds. In particular, our approach is based on the concept of self-similarity. Self-similarity is an attractive image property that has recently found its way into matching in the form of local self-similarity descriptors [82]. It captures the internal geometric layout of local patterns at a level of abstraction. Locations with a self-similarity structure of a local pattern are distinguishable from locations in their neighborhood, which can greatly facilitate the matching process.
Several works have demonstrated the value of self-similarity for image matching and related applications [16,57]. From a totally new perspective, we design a descriptor that can efficiently capture distinctive geometric signatures embedded in point clouds. The resulting 3D self-similarity descriptor is compact and view/scale-independent, and hence can produce a highly efficient feature representation. We apply the developed descriptor to build a complete feature-based matching system to provide high-performance matching between point clouds.

3.1 Self-Similarity

Originating from fractals and topological geometry, self-similarity is the property held by those parts of a data or object that resemble themselves in comparison to other parts of the data. The resemblance can be in photometric properties, geometric properties or their combinations.

Most data, regardless of dimensionality, have a certain degree of intrinsic continuity. The continuity itself means that there is a large chance that similar patterns appear in a relatively small area. Specifically, certain properties based on the data will exhibit such similarity patterns. This is a trivial inheritance in case the data already have the patterns; however, even if the data look quite different at first glance, the underlying property might still be similar.

Different from approaches using hand-crafted functions to generate the descriptors, we find self-similarity to be one of the most natural ways of describing the characteristics of data, regardless of the modality of data. Given a property function f : X → P from the set of points X to the property space P, we define the similarity function s_f : X × X → R between two points x, y ∈ X as:

s_f(x, y) = d(f(x), f(y)),  (3.1)

where d : P × P → R is a metric or distance function on the property space P. Note that s_f is not a distance function on the set X, since it does not satisfy the coincidence axiom.
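As a concrete illustration, Eq. (3.1) can be sketched in a few lines. The property function (a point's height) and the metric (Euclidean distance) below are illustrative stand-ins, not the specific choices made in this thesis; the example also shows why the coincidence axiom fails, since two distinct points with the same property value have s_f = 0.

```python
import numpy as np

def similarity(x, y, f, d=lambda a, b: float(np.linalg.norm(a - b))):
    """s_f(x, y) = d(f(x), f(y)) from Eq. (3.1).
    f maps a point to a property vector; d is a metric on the
    property space. Both are illustrative choices here."""
    return d(f(x), f(y))

# Toy property: the point's height (z coordinate).
f_height = lambda p: np.array([p[2]])

p1, p2 = np.array([0.0, 0.0, 1.0]), np.array([3.0, 4.0, 1.0])
print(similarity(p1, p2, f_height))  # 0.0 -- distinct points, same property
```

Any per-point property, such as a normal vector, curvature or intensity, can be plugged in as f without changing the formulation.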
Now, fixing a point x_0, we want to know the repeated patterns of a certain property with regard to the central point x_0. Based on the similarity function, we define the self-similarity transformation ss_{f;x_0} : X → R as:

ss_{f;x_0}(x) = s_f(x, x_0).  (3.2)

Note that the self-similarity concept we introduce here is different from the scale-invariant patterns in fractal geometry.

3.2 Extending Self-Similarity to 3D Domain: Range Image

The simple extension is to convert the 3D data into a range image and apply the 2D self-similarity descriptor. However, there are several issues not addressed in the original framework [82]. It lacks rotation invariance and a scale detection process, which means we can hardly apply the approach to match images with different orientations, scales and viewpoints. In addition, the approach is computation-intensive, since it requires the dense computation of the self-similarity description (based on correlation computation) every few pixels. To address these problems, we present a feature-based self-similarity matching approach that ensures rotation and scale invariance as well as computational efficiency.

We first use feature extraction methods to obtain a set of distinctive image features, together with scale and orientation estimates. Next, for each extracted feature point, we apply the correlation function combined with the scale and orientation to compute the local correlation surface. After that, the correlation surface is transformed into log-polar bins, with the maximum/minimum value representing the value of each bin, and then scaled such that the maximum value over all bins is equal to 1. At this point we have obtained the feature-based self-similarity descriptor. Finally, we use the Nearest Neighbor Distance Ratio matching process to obtain the correspondences between the two images. The remainder of this section introduces each of these steps in detail.
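The pipeline just outlined (correlation surface, log-polar binning, normalization) can be sketched as follows. This is a simplified illustration, not the thesis implementation: the patch size and search radius are arbitrary, and RMSE is used here as one possible correlation function.

```python
import numpy as np

def patch(img, y, x, half=2):
    # (2*half+1)^2 patch; the caller keeps (y, x) away from the border.
    return img[y - half:y + half + 1, x - half:x + half + 1]

def self_similarity_descriptor(img, q, radius=20, n_rad=4, n_ang=12):
    """Build an RMSE correlation surface around feature point q, then
    max-pool it into log-polar bins (4 radial x 12 angular = 48 dims)
    and scale so the largest bin equals 1."""
    cq = patch(img, *q)
    desc = np.zeros(n_rad * n_ang)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            r = np.hypot(dx, dy)
            if r < 1 or r > radius:
                continue
            # Correlation between center patch and travelling patch.
            rmse = np.sqrt(np.mean((patch(img, q[0] + dy, q[1] + dx) - cq) ** 2))
            rb = min(int(np.log(r) / np.log(radius) * n_rad), n_rad - 1)
            ab = int((np.arctan2(dy, dx) + np.pi) / (2 * np.pi) * n_ang) % n_ang
            desc[rb * n_ang + ab] = max(desc[rb * n_ang + ab], rmse)
    return desc / (desc.max() + 1e-12)

img = np.random.rand(64, 64).astype(np.float32)
d = self_similarity_descriptor(img, (32, 32))
print(d.shape)  # (48,)
```

The compact 48-dimensional result is what makes the later Euclidean-distance matching inexpensive.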
3.2.1 Feature-based Self-Similarity Descriptor with Rotation and Scale Invariance

3.2.1.1 Feature Extraction

There are several categories of feature extraction methods. The most common ones are Harris corner-based [35], Laplacian-based and Hessian-based. The representative Laplacian-based method is SIFT [51], with an approximation of the Difference of Gaussians (DoG), while the representative Hessian-based method is SURF [5]. For simplicity, we try the methods used in SIFT and SURF, respectively, to extract highly distinctive local features in scale-space. The output of the feature extraction phase includes not only the position, but also the dominant orientation and scale of the feature point. It turns out that the results from the two different methods are analogous.

Figure 3.2: From original local image region (left) to correlation surface (middle), then quantized using log-polar coordinates (right).

3.2.1.2 Correlation Surface

The core part of self-similarity is the correlation surface, which we expect to be invariant to different sensor modalities. Although the correlation surface can in general stretch as far as the whole image, for a better estimation of locality we typically focus on the correlation surface defined within a local region centered at a feature point q:

S_q(x) := Similarity(P(q), P(x)), ∀x ∈ Region(q),  (3.3)

where P(q) is the center patch around q, and P(x) is the travelling patch decided by the position of point x in this local region of q (Figure 3.2).

The definition of the correlation function in [82] is:

S_q(x, y) = exp(−SSD_q(x, y) / max(var_noise, var_auto(q))),  (3.4)

where var_auto is the maximum SSD result of the nearest patches to the center patch, expected to account for variance of sharpness, and var_noise is a constant threshold.
By our notation, we can rewrite it as

S_q(x) = exp(−SSD(P(q), P(x)) / max(var_noise, var_auto(q))).  (3.5)

However, in practice we find that this complicated exponential model does not always best describe the degree of similarity, in either optical images or multimodal images. We test a few correlation functions in our experiments, including the original exponential formula (EXP), the Sum of Squared Differences (SSD), the Root Mean Square Error (RMSE), the difference of the Average function (AVG), and the Normalized Root Mean Square Error (NRMSE) (Table 3.1). NRMSE means that the RMSE surface is treated as a set of numbers and normalized such that the standard deviation equals 1.

Correlation Function | Mathematical Representation
EXP   | S_q(x) = exp(−SSD(P(q), P(x)) / max(var_noise, var_auto(q)))
SSD   | S_q(x) = SSD(P(q), P(x))
RMSE  | S_q(x) = RMSE(P(q), P(x)) = √(SSD(P(q), P(x)))
AVG   | S_q(x) = |Average(P(q)) − Average(P(x))|
NRMSE | S_q(x) = Normalized(RMSE(P(q), P(x)))

Table 3.1: Different self-similarity correlation functions.

The results are obtained with densely sampled feature points and without scale and orientation changes, so that the results are not mixed up with the feature extraction phase and thus reflect the true performance of the correlation functions. Figure 3.3 shows the evaluation result of different SSIM correlation functions on multimodal data. It turns out that the simple formula (RMSE) works best in most cases.

Figure 3.3: Comparison of different SSIM correlation functions on multimodal data. The horizontal axis corresponds to different pairs of LiDAR intensity and depth images, while the vertical axis shows the ratio between correct matches and total matches. The threshold ratio is 0.8 in this experiment.

3.2.1.3 Combining Scale and Orientation with the Descriptor

To deal with the general matching problem with scale and orientation changes, we can use the detected feature scales and orientations to "normalize" each feature patch to the feature's dominant scale and orientation.
In detail, let θ be the orientation at (x, y) (relative to the center point); then we assign the intensity at (x, y) of scale γ (defining the original scale as 1) to be:

I_γ(x, y) := I(x′, y′),   (3.6)

where

(x′, y′)ᵀ = M(−θ)(x, y)ᵀ = [ γcosθ  γsinθ ; −γsinθ  γcosθ ] (x, y)ᵀ.   (3.7)

Note that we can use Gaussian or bilinear interpolation to obtain values at fractional coordinates if necessary.

3.2.1.4 Quantization

Following the method described in [82], the correlation surface is transformed into a binned log-polar representation to generate a local self-similarity descriptor (Figure 3.2, right). The value of each bin is the maximum of the correlation function values in that bin. In our experiments there are 4 logarithmic radial bins and 12 angular bins, resulting in a descriptor of only 48 dimensions, which supports fast computation of Euclidean distances in the matching phase without much loss of distinctiveness.

3.2.2 Experimental Results and Discussion

In this section, we present some results of image matching using the proposed self-similarity descriptor with root-mean-square error as the correlation function.

3.2.2.1 Matching

We use Nearest Neighbor Distance Ratio (NNDR) matching with the Euclidean distance as the basic metric. The best match for each feature is found by identifying its nearest neighbor (with minimum Euclidean distance) in the matched image. If we denote the nearest neighbor as N_1 and the second nearest neighbor as N_2, then A is matched to N_1 if and only if dist(A, N_1)/dist(A, N_2) is smaller than a threshold. In our experiments the threshold is typically set between 0.6 and 0.8; unless stated otherwise, the threshold is 0.65. In our visualization, each correspondence is shown as a line connecting two points with the same random color in the two images.
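The log-polar quantization step described in Section 3.2.1.4 (4 radial × 12 angular bins, maximum value per bin) can be sketched as follows. The logarithmic bin spacing via `log1p` and the region radius are assumptions, since the exact spacing is not specified in the text.

```python
import math

def log_polar_descriptor(surface, center, n_radial=4, n_angular=12, r_max=8.0):
    """Quantize a correlation surface into a binned log-polar descriptor.
    `surface` maps (x, y) -> correlation value; each bin keeps the maximum
    value it sees. 4 radial x 12 angular bins give 48 dimensions."""
    desc = [0.0] * (n_radial * n_angular)
    cx, cy = center
    for (x, y), v in surface.items():
        dx, dy = x - cx, y - cy
        r = math.hypot(dx, dy)
        if r == 0 or r > r_max:
            continue  # skip the center pixel and points outside the region
        # logarithmic radial bin: finer resolution near the center
        ri = min(n_radial - 1, int(n_radial * math.log1p(r) / math.log1p(r_max)))
        theta = math.atan2(dy, dx) % (2 * math.pi)
        ai = min(n_angular - 1, int(n_angular * theta / (2 * math.pi)))
        i = ri * n_angular + ai
        desc[i] = max(desc[i], v)
    return desc
```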
For each point, a smaller circle denotes the exact center position and a larger circle indicates the scale on which the correlation surface is computed and quantized for this point. The white (bright) lines show the matches whose NNDR is below 0.6, indicating high matching confidence, while the gray ones indicate matches with NNDR over 0.6 but below the defined threshold, meaning relatively lower confidence.

Figure 3.4: Matching results on synthetic data with illumination, orientation and scale changes.

3.2.2.2 Synthesized Data

For this part of the testing, we manually adjust the intensity, or rotate/scale the original image to form a new image, and match the new image to the original one using our feature-based self-similarity descriptor. Figure 3.4 shows the robustness of the descriptor to illumination, orientation and scale changes.

3.2.2.3 Real Data

With fixed scale and orientation. In this part, the scale and orientation are fixed, i.e. we assume a unified scale and orientation in the computation of the descriptors. The idea is that the descriptor itself is still powerful, but we have to admit that the bottlenecks of feature-based matching on multimodal data, or even on images that are not purely optical, lie not only in the descriptor, but also in the orientation, the scale and even the feature extraction itself. For example, in the image pair of Figure 3.5, SIFT feature extraction finds as many as 364 feature points in the optical image, while the textureless depth image contains only 20 feature points.

Figure 3.5: Matching result between an aerial image and a LiDAR depth image.

A similar situation arises with most current feature extraction methods. Therefore, it is extremely difficult for the descriptor to proceed given so few available, noisy features. Another example is given in Figure 3.6.
We can see that when we manually provide the knowledge that the two images share the same orientation, the self-similarity descriptor can match the corresponding regions perfectly; however, if we use the orientation given by SIFT, SURF or a simple gradient-based method for each region, there is hardly any plausible correspondence. Figure 3.7 shows the matching result between LiDAR intensity images of a large area in Vancouver captured at different times, which demonstrates the capability of the self-similarity descriptor on large data sets.

Figure 3.6: Matching result between image pairs sensed differently.

Figure 3.7: Matching result between LiDAR intensity images of a large area at different times.

With arbitrary scale and orientation. Although for the data mentioned above the absence of a suitable feature extraction method forces us to fix the orientation manually and use features densely distributed over the image, in other practical cases our feature-based self-similarity descriptor achieves favorable results. Figure 3.8 shows the matching result between a LiDAR depth image and an aerial image of the same urban area. Figure 3.9 shows the matching result between low-texture surfaces. Figure 3.10 shows the matching result on the southernmost part of the data of Figure 3.7. Figure 3.11 shows the matching result between the corresponding LiDAR intensity image and depth image.

Figure 3.8: Matching result between a LiDAR depth image and an aerial image of the same urban area.

Figure 3.9: Matching result between low-texture surfaces.

Figure 3.10: Matching result between LiDAR intensity images at different times.

Figure 3.11: Matching result between a LiDAR intensity image and a LiDAR depth image.

3.2.3 Conclusion

We explored a new image description and matching process based on image internal self-similarity.
Our contributions include: (1) we generalized the idea and framework of the self-similarity descriptor, especially for multimodal data; (2) we defined a "better" self-similarity function, increasing the robustness of image matching under different imaging conditions; (3) we extended self-similarity to a feature-based approach, while enhancing orientation and scale invariance. The feature-based characteristic and the low dimensionality make our matching procedure extremely fast and able to run in real-time applications. Our approach provides an alternative for applications such as geospatial feature matching, recognition and fusion.

During our experiments we also found that common feature extraction methods, as well as the orientation/scale assignment, can fail under some of the challenging imaging conditions. A new feature extraction and orientation assignment technique that is more invariant to modality change is needed to address this problem.

We would like to mention that, in general, the correlation function can be pixel-based, gradient-based, edge-based or even shape-based. It is not even required that the two self-similarity descriptors be calculated using the same function, given a certain normalization procedure. As a primitive example, in Table 3.1 the monotonicity with respect to similarity or dissimilarity of the exponential form differs from the other four, but this does not matter until we try to match two images using different correlation functions. In more general cases, the result of the correlation function can be a conditional value, or a multi-value vector, in order to give a robust estimation over a large variety of domains. On the other hand, there can be many other forms of self-similarity; for example, self-similarity-based feature extraction [57] is another kind of generalized usage of self-similarity.
In the future, it is possible for us to develop generalized self-similarity descriptors and other forms of usage, improve the feature extraction and orientation/scale measures on certain data types such as range images, and test our approach on data from more modalities.

3.3 Extending Self-Similarity to the 3D Domain: 3D Self-Similarity Descriptor

Since converting 3D data to 2D images can lose important geometric information, we seek a generalized representation of the self-similarity property directly in the 3D domain. Photometric properties such as color, intensity or texture are useful and necessary to measure the similarity between imagery data; however, they are no longer as reliable on point cloud data. In many situations, the data may contain only point positions without any photometric information. Therefore, geometric properties such as surface normals and curvatures are treated as the essential information for point cloud data. In particular, we found that the surface normal is the most effective geometric property that enables human visual perception to distinguish local surfaces or shapes in point clouds. Normal similarity has been shown to be sufficiently robust to the wide range of variations that occur within disparate object classes. Furthermore, a point and its normal vector form a simple local coordinate system that can be used to generate a view/scale-independent descriptor.

Curvature is another important geometric property that should be considered in similarity measurement. The curvature illustrates the changing rate of tangents. Curved surfaces always have varying normals, yet many natural shapes such as spheres and cylinders preserve curvature consistency. Therefore, we incorporate the curvature in the similarity measurement to characterize the local geometry of a surface. Since there are many possible directions of curvature in 3D, we consider the direction in which the curvature is maximized, i.e. the principal curvature, to keep it unique.
We also consider photometric information in our algorithm development to generalize the problem. We assume the case where both photometric and geometric information are available in the dataset. We propose to use both properties as similarity measurements and combine them under a unified framework.

3.3.1 Constructing the 3D Self-Similarity Descriptor

Given an interest point and its local region, there are two major steps to construct the descriptor: (1) generating the self-similarity surface using the defined similarity measurements, and (2) quantizing the self-similarity surface in a rotation-invariant manner. In this work, we consider similarity measurements on surface normals, curvatures and photometric properties. Once the similarity measurements are defined, the local region is converted to a self-similarity surface centered at the interest point, with multiple/united property similarities at each point. We can then construct the 3D local self-similarity descriptors to generate signatures of surfaces embedded in the point cloud.

3.3.1.1 Generating the self-similarity surface

Assume that there are property functions f_1, f_2, ..., f_n defined on a point set X, which map any point x ∈ X to property vectors f_1(x), f_2(x), ..., f_n(x). For 2D images, the properties can be intensities, colors, gradients or textures. In the 3D situation that we are dealing with, the property set can further include normals and curvatures, besides intensities/colors.

Figure 3.12: Illustration of self-similarities. Column (a) shows three point clouds of the same object and (b) shows their normal distributions. There is much noise in the 2nd point cloud, leading to quite different normals from the other two. However, the 2nd point cloud shares a similar intensity distribution with the 1st point cloud, which ensures that their self-similarity surfaces (c), quantized bins (d) and thus descriptors (e) are similar to each other.
On the other hand, while the intensity distribution of the 3rd point cloud is different from the other two, it shares similar normals with the 1st point cloud (3rd row vs. 1st row in column (b)), which again ensures that their self-similarity surfaces, quantized bins and descriptors are similar to each other.

For each property function f that is defined on two points x and y, we can induce a pointwise similarity function s(x, y, f). The united similarity can then be defined as the combination of the similarity functions of all the properties. Figure 3.12 gives the intuition of how combined self-similarity works for different data.

Normal similarity. The normal gives the most direct description of the shape information, especially for a surface model. One of the most significant characteristics of the normal distribution is its continuity, which means the normal similarity is usually positively correlated with the distance between the two points. However, any non-trivial shape disturbs the distribution of normals, and this is what gives the normal similarity its descriptive power. There are several traditional methods for normal estimation; we use the method described in [78] to extract the normals. Figure 3.12(b) shows examples of the distribution of surface normals.

The property function of the normal is a 3D function f_normal(x) = n⃗(x). Assuming that the normals are normalized, i.e. ∥n⃗(x)∥ = 1, we can define the normal similarity between two points x and y through the angle between the normals:

s(x, y, f_normal) = [π − cos⁻¹(f_normal(x) · f_normal(y))]/π = [π − cos⁻¹(n⃗(x) · n⃗(y))]/π.   (3.8)

It is easy to see that when the angle is 0, the function returns 1; whereas when the angle is π, i.e. the normals are opposite to each other, the function returns 0.
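Equation 3.8 can be sketched directly. This is a minimal illustration; the clamp on the dot product is an added guard against floating-point rounding, not part of the formula.

```python
import math

def normal_similarity(n1, n2):
    """s(x, y, f_normal) = [pi - arccos(n(x) . n(y))] / pi  (Eq. 3.8).
    Expects unit normals: returns 1.0 when they coincide, 0.0 when opposite."""
    dot = sum(a * b for a, b in zip(n1, n2))
    dot = max(-1.0, min(1.0, dot))  # guard against values just outside [-1, 1]
    return (math.pi - math.acos(dot)) / math.pi
```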
We should be careful to use a locally stable normal estimation method here to ensure that the directions of the normals are consistent with each other, because flipping one normal leads to the opposite result of the function.

Curvature similarity. The curvature illustrates the changing rate of tangents. Curved surfaces always have varying normals, yet many natural shapes such as spheres and cylinders preserve curvature consistency. Therefore, it is worthwhile to incorporate the curvature information in the measurement of similarity. Since there are infinitely many possible directions of curvature in 3D, we only consider the direction in which the curvature is maximized, i.e. the principal curvature.

The principal curvature direction can be approximated as the eigenvector corresponding to the largest eigenvalue of the covariance matrix C of the normals projected onto the tangent plane. The property function of curvature is defined as the single-value function

f_curv(x) = max{λ : det(C − λI) = 0}.   (3.9)

Assuming that the values are scaled to range from 0 to 1, we can define the curvature similarity between two points x and y through the absolute difference between them:

s(x, y, f_curv) = 1 − |f_curv(x) − f_curv(y)|.   (3.10)

Again, the function returns 1 when the curvature values are identical, and approaches 0 when they are maximally different.

Photometric similarity. Photometric information is an important clue for our understanding of the world. For example, when we look at a gray image, we can infer the 3D structure through the observation of changes in intensity. Some point clouds, besides point positions, also contain photometric information such as intensity or other reflective values generated by sensors. Such information is a combined result of geometric structure, material, lighting and even shadows. While it is not as generally reliable as geometric information for point clouds, it can be helpful in specific situations, so we also incorporate it in the similarity measurement.
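The principal-curvature approximation of Eq. 3.9 (largest eigenvalue of the covariance of tangent-plane-projected normals, with the corresponding eigenvector as the principal direction) can be sketched with NumPy. This is an illustrative reading of the text under stated assumptions, not the thesis code.

```python
import numpy as np

def principal_curvature(normals, n0):
    """Approximate f_curv (Eq. 3.9) at a point: the largest eigenvalue of
    the covariance matrix C of neighborhood normals projected onto the
    tangent plane of the reference normal n0, plus the principal direction."""
    n0 = np.asarray(n0, dtype=float)
    n0 = n0 / np.linalg.norm(n0)
    N = np.asarray(normals, dtype=float)
    P = N - np.outer(N @ n0, n0)   # project each normal onto the tangent plane
    C = np.cov(P.T)                # 3x3 covariance of the projected normals
    w, v = np.linalg.eigh(C)       # eigenvalues returned in ascending order
    return w[-1], v[:, -1]
```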
We try to use the photometric similarity to model this complicated situation, as it is invariant to lighting to some extent, given similar material properties. The property function of photometry is the single-value function f_photometry(x) = I(x), where I(x) is the intensity function. With the range normalized to [0, 1], we can define the photometric similarity between two points x and y as their absolute difference:

s(x, y, f_photometry) = 1 − |f_photometry(x) − f_photometry(y)| = 1 − |I(x) − I(y)|.   (3.11)

United similarity. Given a set of properties, we combine them to measure the united similarity:

s(x, y) = Σ_{p ∈ PropertySet} w_p · s(x, y, f_p).   (3.12)

The weights w_p ∈ [0, 1] can be experimentally determined according to the availability and contribution of each considered property. For example, when dealing with point clouds converted from mesh models, we let w_photometry = 0 since there is no intensity information in the data. As another example, when we know that there are many noise points in the data, which makes the curvature estimation unstable, we can reduce its weight accordingly.

Once the similarity measurements are defined, the construction of the self-similarity surface is straightforward. First, the point clouds are converted to 3D point positions with the defined properties. Then, the self-similarity surface is constructed by comparing each point's united similarity to that of the surrounding points within a local spherical volume. The radius of the sphere is 4 times the detected scale at which the principal curvature reaches its local maximum. The choice of this size determines whether the algorithm operates on a local region or on a more global region. We found by experiments that the performance is best when the ratio is around 4.
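Equation 3.12 then reduces to a weighted sum of the three pointwise similarities. A minimal sketch follows; the dict-based point representation is an assumption made for illustration.

```python
import math

def normal_sim(n1, n2):
    """Eq. 3.8: angle-based similarity between unit normals."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(n1, n2))))
    return (math.pi - math.acos(dot)) / math.pi

def scalar_sim(a, b):
    """Eqs. 3.10/3.11: 1 - |difference| for values pre-scaled to [0, 1]."""
    return 1.0 - abs(a - b)

def united_similarity(x, y, w):
    """s(x, y) = sum_p w_p * s(x, y, f_p)  (Eq. 3.12), for points given as
    dicts with 'normal', 'curv' and 'intensity' entries."""
    return (w['normal'] * normal_sim(x['normal'], y['normal'])
            + w['curv'] * scalar_sim(x['curv'], y['curv'])
            + w['photo'] * scalar_sim(x['intensity'], y['intensity']))
```

Setting `w['photo'] = 0` reproduces the mesh-derived case above, where no intensity information is available.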
3.3.1.2 Forming the Descriptor

Our approach tries to make full use of all the geometric information of the point cloud, mainly the normal and the curvature, which can be seen as first-order and second-order differential quantities. Since we are dealing with point clouds, which do not admit straightforward differential calculation, we need to make certain approximations; open-source libraries such as the Point Cloud Library (PCL) provide such approximations.

The descriptor is invariant to view and scale transformations. This is achieved by using the local reference system (Fig. 3.13) of each given key point: the origin is placed at the key point; the z-axis is the direction of the normal; the x-axis is the direction of the principal curvature; and the y-axis is the cross product of the z and x directions.

In order to reduce the dimension while tolerating small distortions of the data, we quantize the correlation space into bins. We have N_r radial bins, N_ϕ bins in longitude ϕ and N_θ bins in latitude θ, and replace the values in each cell with the average similarity value of all points in the cell, resulting in a descriptor of N_r × N_ϕ × N_θ dimensions (Figure 3.13).

Figure 3.13: Illustration of the local reference frame and quantization.

The index of each dimension can be represented by a triple (Index(r), Index(ϕ), Index(θ)), ranging from (0, 0, 0) to (N_r − 1, N_ϕ − 1, N_θ − 1). Each index component is calculated as follows:

Index(r) = ⌊N_r · r/scale⌋
Index(ϕ) = ⌊N_ϕ · ϕ/(2π)⌋
Index(θ) = ⌊N_θ · θ/π⌋   (3.13)

In the final step, the descriptor is normalized by scaling the dimensions so that the maximum value is 1.

3.3.2 Point Cloud Matching

We apply the developed descriptor to build a complete feature-based point cloud matching system. Point clouds often contain hundreds of millions of points, yielding a large high-dimensional feature space to search, index and match. Selecting the most distinctive and repeatable features for matching is therefore a necessity.
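The index computation of Eq. 3.13, for a neighborhood point already expressed in the local reference frame, can be sketched as follows. The bin counts are illustrative, and clamping boundary values into the last bin is an assumption the text leaves open.

```python
import math

def bin_index(point, scale, n_r=4, n_phi=8, n_theta=4):
    """Return (Index(r), Index(phi), Index(theta)) per Eq. 3.13 for a point
    (x, y, z) given in the key point's local reference frame."""
    x, y, z = point
    r = math.sqrt(x * x + y * y + z * z)
    phi = math.atan2(y, x) % (2 * math.pi)      # longitude in [0, 2*pi)
    theta = math.acos(z / r) if r > 0 else 0.0  # latitude in [0, pi]
    i_r = min(n_r - 1, int(n_r * r / scale))
    i_phi = min(n_phi - 1, int(n_phi * phi / (2 * math.pi)))
    i_theta = min(n_theta - 1, int(n_theta * theta / math.pi))
    return i_r, i_phi, i_theta
```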
3.3.2.1 Multi-Scale Feature Detector

Feature (key point) extraction is a necessary step before the calculation of the 3D descriptor because (1) 3D data always have too many points to compute the descriptor on all of them, and (2) distinctive and repeatable features largely enhance the accuracy of matching. Many feature detection methods are evaluated in [79]. Our approach detects salient features with a multi-scale detector, where 3D peaks are detected in both scale-space and spatial space. Inspired by [58], we propose to extract key points based on the local Maxima of Principal Curvature (MoPC), which provide relatively stable interest regions compared to a range of other interest point detectors.

The first step is to set up several layers of different scales. Assume the diameter of the input point cloud is d. We choose one tenth of d as the largest scale and one sixtieth of d as the minimum scale. The intermediate scales are interpolated so that the ratios between adjacent scales are constant.

Next, for each center point p and scale s, we calculate the principal curvature using the points that lie within s units of p. The calculation process is discussed in 3.3.1.1.

Finally, if the principal curvature of point p at scale s is larger than its value at adjacent scales and larger than the value of all points within one third of s units of p at scale s, meaning that the principal curvature reaches a local maximum across both scale and the local neighborhood of p, then p is added to the key point set with scale s. Note that the same point may appear at multiple scales. Figure 3.14 shows the feature points detected in a model.

Figure 3.14: Detected salient features (highlighted) with the proposed multi-scale detector. Different sizes/colors of balls indicate the different scales at which the key points are detected. These features turn out to be distinctive, repeatable and compact for matching.
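The detector logic above can be sketched abstractly. Here `curvature` and `neighbors` are hypothetical callables standing in for the per-scale principal-curvature computation and the s/3 radius query; they are assumptions made to keep the sketch self-contained.

```python
def mopc_keypoints(curvature, scales, neighbors):
    """Sketch of the Maxima-of-Principal-Curvature (MoPC) detector:
    keep (p, s) when the principal curvature of p at scale s exceeds its
    value at adjacent scales and at every spatial neighbor at scale s.
    curvature: dict mapping (point, scale) -> principal curvature value.
    neighbors: callable (point, scale) -> points within s/3 of the point."""
    points = sorted({p for (p, _) in curvature})
    keys = []
    for p in points:
        for i, s in enumerate(scales):
            c = curvature[(p, s)]
            adjacent = [scales[j] for j in (i - 1, i + 1) if 0 <= j < len(scales)]
            if any(curvature[(p, t)] >= c for t in adjacent):
                continue  # not a maximum across scale
            if any(curvature[(q, s)] >= c for q in neighbors(p, s) if q != p):
                continue  # not a maximum across the spatial neighborhood
            keys.append((p, s))
    return keys
```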
3.3.2.2 Matching criteria

Since there can be multiple regions (and thus descriptors) that are similar to each other, we follow the Nearest Neighbor Distance Ratio (NNDR) method, i.e. a key point in point cloud X is matched to a key point in point cloud Y if and only if

dist(x, y_1)/dist(x, y_2) < threshold,   (3.14)

where y_1 is the nearest neighbor of x in point cloud Y and y_2 is the 2nd nearest neighbor of x in point cloud Y (in the feature space). Balancing the number of true positives against false positives, the threshold is typically set to 0.75 in our experiments.

Figure 3.15: SHREC benchmark dataset. The transformations are (from left to right): isometry, holes, micro holes, scale, local scale, noise, shot noise, topology and sampling.

3.3.2.3 Outlier Removal

After a set of local matches is selected, we can perform outlier removal using global constraints. If it is known that there are only rigid-body and scale transformations, a 3D RANSAC algorithm is applied to determine the transformation that allows the maximum number of matches to fit. Figure 3.19 (b) shows the filtered result for Fig. 3.19 (a).

3.3.3 Experimental Results and Evaluation

The proposed approach has been extensively tested and evaluated on various datasets, including both synthetic data from standard benchmarks and our own datasets, covering a wide variety of objects and conditions. We evaluated the effectiveness of our approach in terms of distinctiveness, robustness and invariance.

3.3.3.1 SHREC data

We use the SHREC feature descriptor benchmark [11, 13] and convert the mesh models to point clouds by keeping only the vertices. This benchmark includes shapes with a variety of transformations such as holes, micro holes, scale, local scale, noise, shot noise, topology, sampling and rasterization (Fig. 3.15).
Table 3.2: Robustness of the Normal Self-Similarity descriptor based on features detected by MoPC (average L2 distance between descriptors at corresponding points). Average number of points: 518.

Transform. \ Strength | 1    | ≤2   | ≤3   | ≤4   | ≤5
Holes                 | 0.11 | 0.22 | 0.33 | 0.45 | 0.52
Local scale           | 0.40 | 0.58 | 0.67 | 0.77 | 0.83
Sampling              | 0.31 | 0.46 | 0.57 | 0.67 | 0.81
Noise                 | 0.58 | 0.65 | 0.70 | 0.74 | 0.77
Shot noise            | 0.13 | 0.23 | 0.28 | 0.32 | 0.35
Average               | 0.34 | 0.45 | 0.53 | 0.60 | 0.67

Tables 3.2 and 3.3 show the average normalized L2 error of the SS-Normal and SS-Curvature descriptors at corresponding points detected by MoPC. Note that only the transformations with a clear ground truth are shown here. The value is calculated at corresponding points x_k and y_j:

d_kj = ∥f_k − g_j∥_2 / ( (1/(|F(X)|² − |F(X)|)) Σ_{k, j≠k} ∥f_k − g_j∥_2 ),   (3.15)

and then summed up using:

Q = (1/|F(X)|) Σ_{k=1}^{|F(X)|} d_kj = (|F(X)| − 1) · Σ_{k=1}^{|F(X)|} ∥f_k − g_j∥_2 / Σ_{k, j≠k} ∥f_k − g_j∥_2.   (3.16)

The results are competitive with the state-of-the-art methods compared in [13] and [11]. Figure 3.16 (a) and (b) show the robustness of the proposed feature detector and descriptor under rotational and scale transformations. Figure 3.16 (c), (d) and (e) show matching results between human point clouds from the SHREC'10 [13] dataset under affine transformation, with holes and after rasterization. The feature extraction and descriptor calculation take about 1-2 minutes on typical data with around 50,000 points.

Table 3.3: Robustness of the Curvature Self-Similarity descriptor based on features detected by MoPC (average L2 distance between descriptors at corresponding points). Average number of points: 518.

Transform. \ Strength | 1    | ≤2   | ≤3   | ≤4   | ≤5
Holes                 | 0.10 | 0.21 | 0.31 | 0.42 | 0.49
Local scale           | 0.38 | 0.56 | 0.65 | 0.75 | 0.81
Sampling              | 0.44 | 0.55 | 0.63 | 0.71 | 0.87
Noise                 | 0.55 | 0.62 | 0.67 | 0.72 | 0.75
Shot noise            | 0.13 | 0.22 | 0.27 | 0.31 | 0.33
Average               | 0.32 | 0.43 | 0.51 | 0.58 | 0.65

Another set of test data is from TOSCA High-resolution [12].
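The normalized distance of Eqs. 3.15-3.16 can be sketched as follows. This is a minimal reading of the formulas, assuming F[k] and G[k] are the corresponding descriptor pairs at matched points.

```python
import math

def normalized_l2_errors(F, G):
    """d_kj of Eq. 3.15: the L2 distance between corresponding descriptors,
    normalized by the mean L2 distance over all non-corresponding pairs."""
    def l2(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    n = len(F)
    mean_noncorr = (sum(l2(F[k], G[j]) for k in range(n) for j in range(n) if j != k)
                    / (n * n - n))
    return [l2(F[k], G[k]) / mean_noncorr for k in range(n)]

def robustness_score(F, G):
    """Q of Eq. 3.16: the average normalized error over all feature points."""
    d = normalized_l2_errors(F, G)
    return sum(d) / len(d)
```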
Figure 3.17 (a) is a typical matching example using the dense self-similarity descriptor and Figure 3.17 (b) is a matching example using the MoPC feature-based self-similarity descriptor. Figure 3.18 (a) shows the precision-recall curve for the wolf data.

Figure 3.16: Matching results between human point clouds with rotation, scale, affine transformation, holes and rasterization.

Figure 3.17: (a) is the matching result with the dense 3D Self-Similarity descriptor on different poses of a cat. (b) is the matching result with the MoPC feature-based 3D Self-Similarity descriptor on different poses of a wolf.

3.3.3.2 LiDAR point clouds

Table 3.4 shows the comparison of different configurations of similarity on 50 pairs of randomly sampled data in a randomly selected 300m × 300m × 50m area of the Richard Corridor in Vancouver. The precision is the ratio between the number of correctly identified matches (TP) and the number of all matches (TP + FP). The running time is about 15s per pair.

Table 3.4: Evaluation of matching results with different configurations (weights) of the united self-similarity. Pure normal similarity performs the best overall, but curvature/photometry similarity can do better on specific data.

Property                    | Precision
Normal                      | 55%
Curvature                   | 49%
Photometry                  | 49%
Normal+Curvature+Photometry | 51%

Figure 3.18: (a) is the precision-recall curve for the 3D Self-similarity descriptor between two wolf models from the TOSCA high-resolution data. (b) is the precision-recall curve for the 3D Self-similarity descriptor on the Vancouver Richard Corridor data.

We also perform experiments on different point clouds of the same region. Figure 3.18 (b) shows the precision-recall curve for the 3D Self-similarity descriptor working on data
chopped out from the LiDAR data (600m × 600m × 60m, 100k points) of the Richard Corridor in Vancouver.

In real applications there are tremendous amounts of data that may spread across large scales. Our framework can also deal with large-scale data by divide-and-conquer, since we only require local information to calculate the descriptors. Figure 3.19 shows the matching result of aerial LiDAR from two scans of the Richard Corridor area in Vancouver.

Figure 3.19: Matching result of aerial LiDAR from two scans of the Richard Corridor area in Vancouver. (b) is the filtered result of (a).

3.3.4 Refinement of the 3D Self-Similarity Descriptor

Although the proposed 3D Self-Similarity descriptor has achieved fairly good performance, several issues need to be addressed. For example, the original definition does not account for the case where a region is empty, which is quite common in 3D data. Also, the normalization by the maximum value is redundant and causes over-invariance. Last but not least, we find that the quantization scheme could be improved to reduce the vulnerability to noise. In this section, we make several improvements over the original 3D Self-Similarity descriptor.

3.3.4.1 Joint Self-Similarity

We define the following joint self-similarity function for the case of multiple property functions f_1, f_2, ..., f_n:

ss_{f_1, f_2, ..., f_n; x_0}(x) = ( Σ_{k=1}^{n} (w_k · ss_{f_k; x_0}(x))² )^{1/2}.   (3.17)

That is to say, instead of summing up linearly, we treat the value of each property as one dimension of a large n-dimensional property vector. The {w_k}_{k=1}^{n} are the weights assigned to each property, so that each property gets a corresponding influence on the joint self-similarity.

3.3.4.2 Redefinition of the Self-Similarity Correlation Functions

Given the normal function n⃗(·), the normal self-similarity (N-SSIM) is defined as the angle between the normal at every point in the neighborhood of the reference point and the normal at the reference point itself (cf. 3.8).
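Equation 3.17 combines the per-property self-similarities in quadrature rather than linearly; together with the redefined N-SSIM, it can be sketched as follows (a minimal illustration of the formulas, not the thesis implementation).

```python
import math

def n_ssim(n, n0):
    """N-SSIM: the angle between a neighborhood normal and the reference
    normal, scaled by pi; 0 for identical normals, 1 for opposite ones."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(n, n0))))
    return math.acos(dot) / math.pi

def joint_self_similarity(ss_values, weights):
    """Eq. 3.17: ss(x) = sqrt(sum_k (w_k * ss_k(x))^2), treating each
    property's self-similarity as one dimension of a property vector."""
    return math.sqrt(sum((w * v) ** 2 for w, v in zip(weights, ss_values)))
```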
This definition is similar to the deviation angle defined in [26]; however, the concept of self-similarity incorporates this definition and provides a meaningful interpretation. Figure 3.20(a) illustrates a typical distribution of N-SSIM.

ss_{n⃗; x_0}(x) = cos⁻¹(n⃗(x) · n⃗(x_0))/π.   (3.18)

Given the curvature function c(·), the curvature self-similarity (C-SSIM) is defined as the difference between the curvature value at every point in the neighborhood of the reference point and the curvature at the reference point itself (cf. 3.10). Figure 3.20(b) is a visualization of C-SSIM. Note that we do not intentionally normalize the values to lie between 0 and 1.

ss_{c; x_0}(x) = |c(x) − c(x_0)|.   (3.19)

Figure 3.20: (a) is the visualization of the N-SSIM surface and (b) is the visualization of the C-SSIM surface. The purple point is the reference point x_0. The self-similarity values are represented by colors, among which red > yellow > green > blue.

Given the intensity function p(·), the photometric self-similarity (P-SSIM) is defined as the difference between the intensity value at every point in the neighborhood of the reference point and the intensity at the reference point itself:

ss_{p; x_0}(x) = |p(x) − p(x_0)|.   (3.20)

If no RGB or intensity is given, all intensity values are treated as the same and P-SSIM becomes a degenerate form of position self-similarity, simply indicating the positions of the points in the neighborhood. In such circumstances, P-SSIM is optional in the joint self-similarity estimation.

Note that we do not intentionally normalize the self-similarity transformations. Instead, we just ensure that when the properties at two points are the same, the similarity value is zero; otherwise the value is proportional to the absolute difference.

3.3.4.3 Quantization

Recall that we have defined the local reference frame as in (3.21):
z⃗ = n⃗/|n⃗|,  x⃗ = c⃗_p/|c⃗_p|,  y⃗ = z⃗ × x⃗,   (3.21)

where n⃗ is the normal and c⃗_p is the vector in the direction of the principal curvature, which is by definition perpendicular to n⃗. Using feature selection methods such as Maxima of Principal Curvature (MoPC) [37], we obtain a scale for each key point; this scale determines the size of the neighborhood used to compute the descriptor. For a given key point p, we denote the scale as s.

Then, we have two ways of quantization: (1) divide the bins in the radial (r), elevation (θ) and azimuth (ϕ) dimensions, with N_r, N_θ and N_ϕ bins, respectively; (2) instead of θ, use the height H dimension, i.e. the z value in the local reference frame, with N_H bins. Analysis in Section 3.3.5.4 shows that the second scheme is more robust than the first, so we write the bin boundaries r_i, H_j, ϕ_k as:

r_i = i·s/N_r,  i = 0, 1, 2, ..., N_r
H_j = j·π/N_H,  j = 0, 1, 2, ..., N_H
ϕ_k = k·2π/N_ϕ,  k = 0, 1, 2, ..., N_ϕ   (3.22)

These boundaries further define the bins:

B_ijk = {q : r_{i−1} ≤ q_r ≤ r_i ∧ H_{j−1} ≤ q_H ≤ H_j ∧ ϕ_{k−1} ≤ q_ϕ ≤ ϕ_k},   (3.23)

where q_r, q_H and q_ϕ are the converted local cylindrical coordinates of point q. Then, the similarity value of the bin with index {i, j, k} is evaluated as:

S_ijk = (1/|B_ijk|) Σ_{q ∈ B_ijk} ss_{f;p}(q)  if B_ijk ≠ ∅,  and  S_ijk = 0  if B_ijk = ∅.   (3.24)

Note that the case where B_ijk is an empty set is non-trivial. In the original definition of 3D self-similarity, S_ijk would be assigned the large value 1 instead of 0. However, since missing data/occlusion is very common in 3D data, treating empty regions as regions of great difference instead of regions of little importance has more negative effects. This modification alone yields a boost in descriptor performance.

Moreover, in the original form, normalization is performed by dividing by the maximum value.
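The refined quantization of Eqs. 3.22-3.25 (averaged bins, zero-valued empty bins, no max-normalization) can be sketched as follows. Mapping the height coordinate over [−s, s] is an assumption, since the extracted text leaves the H_j bounds ambiguous.

```python
import math

def refined_descriptor(points, ss_values, s, n_r=4, n_h=4, n_phi=8):
    """Bins in radius r, height H (local z) and azimuth phi; each bin stores
    the average self-similarity of its points (Eq. 3.24), empty bins stay 0,
    and no normalization by the maximum is applied (Eq. 3.25)."""
    size = n_r * n_h * n_phi
    sums, counts = [0.0] * size, [0] * size
    for (x, y, z), ss in zip(points, ss_values):
        r = math.hypot(x, y)
        if r > s or abs(z) > s:
            continue                                    # outside the support
        i = min(n_r - 1, int(n_r * r / s))
        j = min(n_h - 1, int(n_h * (z + s) / (2 * s)))  # height bin over [-s, s]
        phi = math.atan2(y, x) % (2 * math.pi)
        k = min(n_phi - 1, int(n_phi * phi / (2 * math.pi)))
        idx = i + j * n_r + k * n_h * n_r               # flat index of Eq. 3.25
        sums[idx] += ss
        counts[idx] += 1
    return [t / c if c else 0.0 for t, c in zip(sums, counts)]
```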
However, this is superfluous, because it eliminates the absolute difference between the similarity values in different regions, causing over-invariance. In the extreme case, the N-SSIM on any local patch of a small ball and the N-SSIM on any local patch of a large ball would be similar if we performed the normalization. Therefore, we discard the normalization process and directly obtain the final descriptor representation:

d(i + j·N_r + k·N_H·N_r) = S_ijk. (3.25)

Figure 3.21: (a) and (b) are visualizations of self-similarity surfaces. (c) and (d) are visualizations of the corresponding quantization results.

Figure 3.21 visualizes the self-similarity surfaces and their quantization results. The two patches are from similar local regions of two different cars.

3.3.5 Discussion of Descriptor Design

In this section we discuss and compare several characteristics and techniques involved in the descriptor design process.

3.3.5.1 Histogram-based vs. Non-histogram-based

An important feature that the Spin Image, Shape Context and FPFH descriptors have in common is the use of a histogram that counts the number of points in a bin. For the self-similarity descriptor, if we use the number of points, the sum of differences or density information as the similarity evaluation function, a histogram is also implicitly used. In general, histogram-based methods are largely affected by the number of points lying in a bin. With a normalization step, the influencing factor becomes the relative number of points lying in each bin.

On the other hand, if the self-similarity descriptor uses the min/max/average of differences of properties such as normals and curvatures, then as long as no bin changes from zero points (empty region) to one or more points, or from one or more points to zero, the values of the result will only change a little. However, if there are such changes, the values of the result will be quite different.
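Putting together the construction of Sections 3.3.4.2-3.3.4.3 (Eqs. 3.18, 3.21-3.25), the descriptor can be sketched as below. This is a minimal numpy sketch, not the author's implementation; the function names, the default bin counts, and the assumed coordinate ranges (q_H in [0, π], q_ϕ in [0, 2π)) are illustrative choices.

```python
import numpy as np

def n_ssim(normals, ref_idx):
    """Normal self-similarity of Eq. 3.18: angle between each (unit)
    normal and the reference normal, scaled to [0, 1]."""
    dots = np.clip(normals @ normals[ref_idx], -1.0, 1.0)
    return np.arccos(dots) / np.pi

def local_reference_frame(normal, principal_dir):
    """Local reference frame of Eq. 3.21: z is the normal, x the
    principal-curvature direction (re-projected to be exactly
    perpendicular to z), and y = z x x."""
    z = np.asarray(normal, float)
    z = z / np.linalg.norm(z)
    cp = np.asarray(principal_dir, float)
    cp = cp - (cp @ z) * z          # enforce c_p perpendicular to n
    x = cp / np.linalg.norm(cp)
    return x, np.cross(z, x), z

def bin_index(q_r, q_H, q_phi, s, N_r=5, N_H=5, N_phi=6):
    """Bin assignment following the boundaries of Eq. 3.22."""
    if q_r > s:
        return None                  # outside the support radius
    i = min(int(q_r / s * N_r), N_r - 1)
    j = min(int(q_H / np.pi * N_H), N_H - 1)
    k = min(int(q_phi / (2 * np.pi) * N_phi), N_phi - 1)
    return i, j, k

def build_descriptor(bin_ids, ss_values, N_r=5, N_H=5, N_phi=6):
    """Average self-similarity per bin (Eq. 3.24), empty bins left at 0,
    flattened with the index i + j*N_r + k*N_H*N_r of Eq. 3.25."""
    sums = np.zeros((N_r, N_H, N_phi))
    counts = np.zeros((N_r, N_H, N_phi))
    for (i, j, k), v in zip(bin_ids, ss_values):
        sums[i, j, k] += v
        counts[i, j, k] += 1
    S = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    d = np.zeros(N_r * N_H * N_phi)
    for (i, j, k) in np.ndindex(N_r, N_H, N_phi):
        d[i + j * N_r + k * N_H * N_r] = S[i, j, k]
    return d
```

Note how empty bins stay at 0 rather than 1, and no maximum-value normalization is applied, matching the two modifications discussed above.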
Therefore, in general, the histogram-based methods are more robust to the difference between non-existence and existence, while the non-histogram-based methods are more robust across different levels of noise.

3.3.5.2 Achieving Rotation Invariance

For two unaligned regions of 3D data, the most important invariance is rotation invariance. Almost all 3D descriptors take the normal as the reference direction and, either explicitly or implicitly, consider the direction of the normal relatively stable and build the descriptor upon it. For the second direction of the local reference frame, however, different methods make quite different choices, and traditionally the second direction is considered unstable and avoided in the formation of the descriptor. For example, Spin Image [41] proposed the idea of the oriented point, which collapses the 3D space along all the potential second directions to obtain a 2D space. 3D Shape Context [28], on the other hand, enumerates all the potential second directions and thus forms a group of descriptors at one keypoint, leaving the dimension-reduction task either to the matching stage, through a rotated comparison, or to the post-processing stage, with a spherical harmonic transformation [43].

The 3D Self-Similarity descriptor, however, considers that the second direction, though not as stable as the first direction defined by the normal, is still useful. Since the second direction must be perpendicular to the first, all possible second directions form the so-called tangent plane. By differential geometry, there are two principal curvatures with corresponding principal directions. Therefore, the principal direction with the larger eigenvalue is selected as the second direction.

3.3.5.3 Uni-centric vs. Multi-centric

Spin Image and SSIM are uni-centric, i.e.
the keypoint (together with its normal) strongly influences all the calculations of the descriptor (although the effect might be weakened through a series of normalization steps), while Shape Context and PFH are multi-centric, i.e. all pairs of points within the neighborhood of the keypoint are treated equally in the computation, and the keypoint only takes effect through the definition of the neighborhood boundary. The uni-centric approach has advantages in the intuitive representation of the local neighborhood and in fast computation, but it may also suffer from the bin division issue discussed in Section 3.3.5.4.

3.3.5.4 Quantization and Bin Numbers

The number of bins in quantization is an important, sometimes even vital, factor in descriptor performance. Generally speaking, increasing the number of bins increases the detail in the descriptor representation, but may also decrease the invariance in that dimension.

We perform an exhaustive experiment on the bin numbers of the three dimensions. Specifically, we run 10 × 10 × 10 = 1000 combinations of the bin divisions in radius (R), height (H) and azimuth (ϕ), each from 1 to 10. The precision is computed as the number of correct matches divided by the number of total ground-truth matches (Sec. 3.3.6.3). Figure 3.22 shows the aggregated results of the mean precision for each dimension. For example, when evaluating the mean precision with regard to a certain H, we compute the average over all precision values with R and ϕ ranging from 1 to 10. In this way we obtain three precision-dimension curves showing the influences of the number of bins.

Generally, the mean precision increases as the bin number increases. However, we can clearly see the bump in the height curve: the performance increases faster when the number of height bins is odd. This can be explained by a characteristic of the uni-centric descriptor. Let us consider the small neighborhood of the keypoint.
If the number of height bins is even, this neighborhood, which is in fact a surface patch perpendicular to the normal, lies exactly on the boundary between one or more pairs of adjacent bins in H. In such a case, any turbulence or noise in the area results in quite different assignments of points to the bins. If we use an odd number of bins in H, however, this small neighborhood has a much smaller chance of lying on a boundary in H.

Figure 3.22: Aggregated results for the bin number test (mean precision versus the number of bins for R (radius), H (height) and ϕ (azimuth)).

Applying a similar analysis, we can see why we use height (H) instead of elevation (θ) in quantization. The elevation division would split the region around the keypoint into many small bins, which is vulnerable to noise. One could ask whether azimuth (ϕ) has a similar issue; however, since the boundary planes of ϕ are usually perpendicular to the local surface and each intersection is only a line instead of a plane, the division in ϕ does not have such a problem.

We also performed a similar test with height (H) replaced by elevation (θ), and the mean precision generally decreased by roughly 0.8 percent, which confirms the superiority of height with odd bins.

3.3.5.5 Time Complexity

If we define t_n as the average time per point for computing one normal, t_c as the average time per point for computing one curvature, t_d as the average time per (feature) point for computing one descriptor, t_f as the average time per point during feature extraction, n_p as the number of points in the point cloud, and n_f as the number of feature points, then the total running time can be approximately written as:

t = t_n·n_p + t_c·n_p + t_f·n_p + t_d·n_f. (3.26)

Note that t_f and t_d are similar for different sizes of data, while t_n and t_c differ a lot.
This is due to the use of a kd-tree (which contributes O(log(n_p))) and some other non-linear factors in the computation of normals and curvatures.

3.3.6 Evaluation of the Refined 3D Self-Similarity Descriptor

We extensively evaluate the refined 3D Self-Similarity descriptor with various datasets and conditions, from three aspects: quality of descriptor, performance for point-wise matching, and performance for feature-based recognition, in comparison with state-of-the-art descriptors including Spin Image, 3D Shape Context and Fast Point Feature Histograms. The joint normal-curvature self-similarity descriptor is denoted as CN-SSIM, with the weight ratio w_c : w_n = 4 : 1.

The parameters we use are listed in Table 3.5. The division of SSIM is determined according to the discussion in Sec. 3.3.5.4.

Table 3.5: Parameters used in the experiments. The first several rows are the numbers of bins in each dimension. r is the radius of the support region from which the descriptors are generated.

Parameter    SSIM  SI    FPFH  3DSC  HKS
Radius R     5     9     -     15    -
Azimuth ϕ    6     -     -     12    -
Height H     5     17    -     -     -
Elevation θ  -     -     -     11    -
Time T       -     -     -     -     101
Angle        -     -     11    -     -
Angle Type   -     -     3     -     -
Dimensions   150   153   33    1980  101
r            9.6   9.6   9.6   9.6   -

3.3.6.1 Testing Datasets

Two datasets are used in our experiments. The first is the SHREC (SHape REtrieval Contest) benchmark dataset, which has been widely used in descriptor and feature detector evaluations [13] [11]. Figure 3.23 shows the data used in our experiment. Note that all descriptors except HKS depend merely on the vertex information of the dataset.

The second is the Princeton 3D shape dataset, from which 61 vehicle models are used in our experiment (Figure 3.24). The models include the following categories: jeep/SUV, pickup/truck, sedan, race car, sports car and antique car.
In urban robot navigation systems such as [47], a module for 3D obstacle detection could be incorporated; we thus try to demonstrate a possible application of the proposed descriptor to car detection in urban areas.

Figure 3.23: The transformed models used in our experiments from the SHREC Benchmark. The first column contains the original model, followed by transformations with holes, sampling, noise, shot noise and rasterization, respectively. Note that only the human model contains the rasterization from the SHREC 2011 benchmark.

Figure 3.24: The vehicle models used in our experiments from the Princeton 3D Shape Benchmark [83].

3.3.6.2 Quality of Descriptor

In this section, we evaluate the robustness of descriptors under a set of transformations, including noise, rasterization, etc., with the descriptor quality defined in [11]; that is, given the set of detected features, the quality of the descriptor is calculated at corresponding points x_k and y_j:

d_kj = ||f_k − g_j||² / [ (1/(|F(X)|² − |F(X)|)) Σ_{k,j≠k} ||f_k − g_j||² ], (3.27)

and then the overall quality for the descriptor is computed as the average ratio:

Q = (1/|F(X)|) Σ_{k=1..|F(X)|} d_kj = (|F(X)| − 1) · Σ_{k=1..|F(X)|} ||f_k − g_j||² / Σ_{k,j≠k} ||f_k − g_j||². (3.28)

Table 3.6 shows the statistics for descriptor quality under different transformations, based on the MoPC feature detector. In principle, the smaller the quality value is, the better the descriptor is against the corresponding transformation. However, we will see that the matching results in the next section tell a different story.

3.3.6.3 Pointwise Matching

Assume that the two sets of keypoints are X = {x_i | i = 1, 2, ..., N_X} and Y = {y_j | j = 1, 2, ..., N_Y}. A point pair (x_i, y_j), x_i ∈ X, y_j ∈ Y, is considered a ground-truth match if the L2-distance is smaller than a predefined threshold Θ:

Table 3.6: Descriptor quality of different kinds of descriptors. Average number of features is 4447.
Transform.   C-SSIM  N-SSIM  CN-SSIM  Spin Image  FPFH   3DSC   HKS*
Holes        0.323   0.267   0.288    0.281       0.177  0.373  0.135
Sampling     0.650   0.560   0.589    0.543       0.438  0.895  0.547
Noise        0.407   0.395   0.399    0.312       0.163  0.457  0.258
Shot noise   0.171   0.147   0.158    0.030       0.017  0.240  0.528
Rasterize    0.568   0.442   0.489    0.521       0.496  0.812  -

S_G = {(x_i, y_j) : ||x_i − y_j||_2 < Θ}. (3.29)

This point pair is considered a detected match if their descriptors satisfy the Nearest Neighbor Distance Ratio (NNDR) condition:

S_D = {(x_i, y_j) : ||des(x_i) − des(y_j)||_2 < λ||des(x_i) − des(y_k)||_2, ∀k ∈ {1, 2, ..., N_Y}, k ≠ j}. (3.30)

Here λ is the NNDR value. Thus the true-positive match set S_TP is the intersection of the detected match set S_D and the ground-truth match set S_G, and the precision and recall can be computed by (3.31) and (3.32):

Precision = |S_TP| / |S_D| (3.31)

Recall = |S_TP| / |S_G| = |S_D ∩ S_G| / |S_G| (3.32)

Figure 3.25: The ROC curves for matching evaluation of the descriptors (C-SSIM, N-SSIM, CN-SSIM, SI, FPFH, SC) on data with (a) noise, (b) rasterization, (c) holes, and (d) sampling.

The precision-recall performance of the different descriptors is summarized as well. The tolerance threshold is Θ = 3 units in the SHREC data. Figure 3.25 shows the ROC curves of the descriptors on data with noise, rasterization, holes and sampling. We do not show the result for shot noise here since all descriptors perform nearly perfectly on it.

3.3.6.4 Feature-based Object Recognition

In this section, we apply the descriptors to the recognition of vehicles from the Princeton Shape Benchmark (PSB) [83].
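The two evaluation measures used so far, descriptor quality (Eqs. 3.27-3.28) and NNDR-based precision/recall (Eqs. 3.29-3.32), can be written compactly as follows. This is a hedged sketch rather than the benchmark's code; it assumes descriptors are stored as rows of numpy arrays and that corresponding features share the same index.

```python
import numpy as np

def descriptor_quality(F, G):
    """Descriptor quality of Eqs. 3.27-3.28: squared distance between
    each corresponding descriptor pair, normalized by the mean squared
    distance over all non-corresponding pairs, averaged over features.
    Assumes row F[k] corresponds to row G[k]."""
    n = len(F)
    D = ((F[:, None, :] - G[None, :, :]) ** 2).sum(-1)  # all ||f_k - g_j||^2
    off_mean = (D.sum() - np.trace(D)) / (n * n - n)    # mean over k != j
    return float(np.mean(np.diag(D) / off_mean))

def precision_recall(X, Y, desc_X, desc_Y, theta=3.0, lam=0.8):
    """Ground-truth matches by 3D distance (Eq. 3.29), detected matches
    by the NNDR test (Eq. 3.30), precision/recall per Eqs. 3.31-3.32."""
    S_G = {(i, j) for i, x in enumerate(X) for j, y in enumerate(Y)
           if np.linalg.norm(x - y) < theta}
    S_D = set()
    for i, d in enumerate(desc_X):
        dists = np.linalg.norm(desc_Y - d, axis=1)
        order = np.argsort(dists)
        if dists[order[0]] < lam * dists[order[1]]:     # NNDR condition
            S_D.add((i, int(order[0])))
    tp = S_D & S_G
    precision = len(tp) / len(S_D) if S_D else 0.0
    recall = len(tp) / len(S_G) if S_G else 0.0
    return precision, recall
```

A smaller quality value and a higher precision/recall both indicate a better descriptor, as discussed above.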
We pick out 61 vehicle models from the PSB for the experiment. Similar to [28], we generate a set of reference scans and a set of query scans from the models. (Footnote to Table 3.6: due to the limit of memory, the human data could not be handled by the HKS program, so the evaluation for HKS is based on the two smaller datasets, dog and horse.) Four scans are generated for each model, with the scanner at 45, 135, 225 and 315 degrees in azimuth and 45 degrees in elevation. The query scans are generated with the scanner at 30 degrees in elevation and a random degree in azimuth. Moreover, Gaussian noise with standard deviation σ is added to the query scans along the line from the scanner to each point (Figure 3.26).

Figure 3.26: Some of the query scans. The azimuth angle for the scanner is random.

We randomly select 300 features on each of the reference and query scans and compute descriptors at all the detected keypoints. Then we perform matching between each query scan and each reference scan. The matching score is computed as the maximum overlapping score estimated from the 20 most probable transformations defined by the top 20 matches used as seeds. The reference scan with the best matching score to the query scan is considered the corresponding object, and the query scan is classified into the corresponding category. Depending on how many top reference scans are considered a correct classification, we can plot the recognition rate versus the number of top reference scans (Figure 3.27).

Figure 3.27: The recognition rate of the descriptors (C-SSIM, N-SSIM, SI, FPFH, SC) versus the number of top reference scans.

From the figure we can infer that most descriptors achieve 100 percent correctness if the top 3 reference scans are counted as a match. The maximum variation occurs in the top-1 result, where N-SSIM is beaten by 3DSC by a tiny margin. However, since 3DSC needs substantially longer time for comparison, SSIM is still competitive overall.
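The recognition-rate curve of Figure 3.27 can be computed from a matrix of matching scores as sketched below. This is an illustrative sketch, not the original evaluation code; the function name and the score-matrix layout are assumptions.

```python
import numpy as np

def recognition_rate(scores, query_labels, ref_labels, k):
    """Fraction of query scans whose true category appears among the k
    reference scans with the best matching scores (higher is better).
    scores: (n_queries, n_refs) matrix of matching scores."""
    hits = 0
    for q, row in enumerate(scores):
        top_k = np.argsort(-row)[:k]          # indices of the k best refs
        if any(ref_labels[j] == query_labels[q] for j in top_k):
            hits += 1
    return hits / len(scores)
```

Evaluating this for k = 1, 2, 3, ... yields the curve plotted in Figure 3.27.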
Looking carefully at the wrongly recognized ones, we can see that they are quite reasonably recognized as close models. The visualization of matching results (Figure 3.28) further confirms this observation. We then perform experiments for template matching on real urban LiDAR data captured in Los Angeles and show the matching results in Figure 3.29.

3.3.6.5 Processing Time

We record the average time needed to compute one descriptor as well as the average time needed to compare a pair of descriptors in matching. The results are displayed in Table 3.7. We can see that the uni-centric descriptors, SSIM and Spin Image, are the fastest, with less than 3 ms per descriptor. The 3D Shape Context descriptor is extremely slow in matching due not only to its large dimensionality (1980), but also to the need to compare 12 rotations of descriptors [28].

Figure 3.28: Matching results between the recognition results. Within each pair, the top object (in blue) is the query, while the bottom object (in red) is the matched reference scan. The first row shows matching results between correctly recognized targets and the corresponding reference scans from the database. The second row shows the matching results of failed cases, which seem reasonable since their models have quite similar shapes.

Figure 3.29: Matching results between templates and the real data. Within each pair, the object in red is the template car, while the object in blue is the real scan in the urban environment.

Table 3.7: Average running time for computing one descriptor and performing one comparison in matching.

Running Time (ms)  SSIM   SI     FPFH   3DSC  HKS
Descriptor         2.93   2.66   12.5   6.49  4.33
Comparison         0.242  0.221  0.047  105   0.135

Chapter 4 Detecting Industrial Parts from 3D Point Cloud

In this chapter, we propose an object detection and recognition system based on the feature-based matching module proposed in Chapter 3.
Figure 4.1(a) shows some examples of target objects, and a scene point cloud from LiDAR data is illustrated in Figure 4.1(b). Our main contributions include: (1) combining almost all aspects of point cloud processing to form a workable hierarchical scene understanding system that can deal with complicated point clouds; (2) applying SVM to the FPFH descriptor [74] for pipe/plane/edge point classification; (3) proposing a variant of RANSAC that can efficiently deal with the rigid-body constraint.

4.1 System Overview

We depict the pipeline briefly in Figure 4.2. The target library can contain both mesh models and point clouds. If a mesh, or CAD model, is given, a virtual-scanning process is used to create a corresponding point cloud in the target library. The virtual scanner simulates the way a real scanner works, using a Z-buffer scan conversion [89] and back-projection to eliminate points on hidden or internal surfaces. The point clouds in the target library are pre-processed to detect features, and their representations are stored for efficient matching.

Figure 4.1: (a) The CAD models of the part templates, which are converted into point clouds by a virtual scanner in pre-processing. (b) The colored visualization of the highly complicated scene point cloud we are dealing with.

Figure 4.2: Flowchart of the proposed object detection system.

Another part of the offline processing is the training of the SVM for the so-called regular points, including plane, pipe and edge points. We calculate the Fast Point Feature Histogram [74] for each point, then feed several trunks of positive/negative clusters into the SVM trainer.

During online processing, we first calculate the FPFH on a neighborhood of each point and apply the SVM test to them. Points are thus classified as one of the four regular categories (e.g. plane) or others. Large connected components of the same category are then extracted and fitted, while the remaining points are clustered based on Euclidean distance.
Next, each cluster is passed through the sub-module of the cluster filter. The cluster filter is designed to consist of one or several filters, depending on the application, that can rule out or set aside clusters with or without certain significant characteristics. The filters should be extremely fast while able to filter out a good number of impossible candidates. Our implementation currently uses one filter, a linearity filter.

The clusters that pass the filter are then matched against the templates in the library. The descriptors for the targets generated in the offline processing are compared against the descriptors for the candidate clusters generated during the online processing, and the transformation is estimated if possible. Note that the features and the descriptors are not computed twice, for efficiency.

The key matching step is the feature comparison, the process of comparing the feature representations with point descriptors between the candidate clusters and part library targets. Initially, all nearest-neighbor correspondences, or pairs of features, with their Nearest Neighbor Distance Ratio (NNDR) values, are computed; then a greedy filtering strategy is used to look for the top four correspondences that fit the distance constraint. A transformation is estimated based on all correspondences that fit the hypothesis. The transformation is refined through Gram-Schmidt orthogonalization. The percentage of aligned points is used as the matching score. If the matching score between a cluster and a target is higher than some threshold, the cluster is considered a detected instance of the target. In case there are multiple targets in a single cluster, we iteratively remove the aligned part and examine the remaining part of the cluster until it is too small to be matched.

4.2 Candidate Extraction based on Point Classification

We initially observed that, in our dataset of a large outdoor industrial scene, a large portion of the points belong to basic geometrical shapes, mainly planes (e.g.
ground, ladders and boxes) and pipe-shapes (cylindrical pipes, bent connections, posts). Therefore, removing large clusters of such points largely eases and accelerates our processing and helps us focus on the objects of interest that we would like to detect.

Figure 4.3: Illustration of Point Feature Histogram calculation.

4.2.1 Fast Point Feature Histogram

Since we aim at classifying points with simple geometry, it is natural to use simple descriptors of low dimension. We select the 33-dimensional Fast Point Feature Histogram (FPFH) as our descriptor.

FPFH is an approximate and accelerated version of the Point Feature Histogram (PFH) [77] formulation. PFH uses a histogram to encode the geometrical properties of a point's neighborhood by generalizing the mean curvature around the point. The descriptor is designed to be invariant to the 6D pose of the underlying surface. The histogram representation is quantized based on all the relationships between the points and their estimated surface normals within the neighborhood (Figure 4.3). The local frame for computing the relative difference between two points p_s and p_t is defined in Equation 4.1:

⃗u = ⃗n_s
⃗v = ⃗u × (p_t − p_s)/||p_t − p_s||_2
⃗w = ⃗u × ⃗v  (4.1)

With this frame, the difference between the point-normal pairs can be represented by the following angles (Eq. 4.2):

α = ⃗v · ⃗n_t
ϕ = ⃗u · (p_t − p_s)/||p_t − p_s||_2
θ = arctan(⃗w · ⃗n_t, ⃗u · ⃗n_t)  (4.2)

These angles are then quantized to form the histogram. FPFH [74] reduces the computational complexity of PFH from O(nk²) to O(nk), where k is the number of neighbors for each point p in point cloud P, without losing much of the discriminative power of PFH:

FPFH(p_q) = PFH(p_q) + (1/k) Σ_{i=1..k} (1/w_k) · PFH(p_k). (4.3)

In our experiment, we use the version implemented in the open-source Point Cloud Library (PCL) [76].
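The pair features of Eqs. 4.1-4.3 can be sketched as below. This is a minimal numpy illustration, not PCL's implementation; the function names are hypothetical, and taking the weight w_k as the distance to neighbor k in the FPFH sum is a common convention assumed here.

```python
import numpy as np

def pfh_angles(p_s, n_s, p_t, n_t):
    """Darboux frame of Eq. 4.1 and the three PFH angles of Eq. 4.2
    for a pair of points with unit normals."""
    p_s, n_s = np.asarray(p_s, float), np.asarray(n_s, float)
    p_t, n_t = np.asarray(p_t, float), np.asarray(n_t, float)
    d = (p_t - p_s) / np.linalg.norm(p_t - p_s)
    u = n_s                     # first axis: source normal
    v = np.cross(u, d)          # second axis
    w = np.cross(u, v)          # third axis
    alpha = v @ n_t
    phi = u @ d
    theta = np.arctan2(w @ n_t, u @ n_t)
    return alpha, phi, theta

def fpfh(spfh, q, neighbor_ids, weights):
    """FPFH weighting of Eq. 4.3: the point's own simplified histogram
    plus the average of the neighbors' histograms scaled by 1/w_k."""
    acc = sum(spfh[i] / w for i, w in zip(neighbor_ids, weights))
    return spfh[q] + acc / len(neighbor_ids)
```

For two coplanar points with parallel normals, all three angles are zero, which is why flat regions produce sharply peaked histograms.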
4.2.2 Classification by SVM

In the offline stage, we manually select and label about 75 representative small trunks of point cloud, adding up to around 200k labeled points. For the support vector machine, we use the LIBSVM package [15], and the kernel function is the radial basis function (RBF). The parameters in our experiment are C = 8 and γ = 0.5. Details of SVM can be found in [15]; here we focus on the selection of the training data that determines the properties of the classifier.

Table 4.1: Cross-validation results for pipes of different sizes. The left column is the training data, and the top row is the testing data. Y means at least 80% of the testing points are classified as pipe, while N means the opposite.

Training/Testing  5 cm  10 cm  15 cm  20 cm
5 cm              Y     N      N      N
10 cm             N     Y      Y      Y
15 cm             N     Y      Y      Y
20 cm             N     Y      Y      Y

4.2.3 More Regular Categories: Edge and Thin Pipe

During experiments, however, we found that near places where two or more planes intersect, some points that are actually on one of the planes would not be classified as plane points due to the interference of another plane in their neighborhood. On the other hand, these points obviously do not belong to parts when they group together as a large cluster. Therefore, we assign them to another category, namely the Edge category.

Besides edges, we also found some thin pipes missing in pipe detection. Experiments show that simply adding them to the training dataset might have negative effects on pipes of larger sizes, which suggests that they may need to be regarded as a category separate from pipes (partially due to the neighborhood size of the FPFH descriptor). To judge the distinctiveness of pipes of different sizes, we perform a series of cross-validations, summarized in Table 4.1. We can see that the 10/15/20-cm pipe classifiers can classify the 10/15/20-cm pipes interchangeably, while the 5-cm pipe classifier distinguishes the 5-cm pipe from the others.
This evidence also supports separating the Thin-Pipe category from the Pipe category. If we want to distinguish between 10/15/20-cm pipes, however, we need to add the other sizes as negative examples to get more precise boundaries between them.

Figure 4.4: Classification result of pipe points (in green), plane points (in yellow), edge points (in blue), thin-pipe points (in dark green) and the others (in red).

We perform SVM four times, once for each category. Points are thus labeled as plane, pipe, edge, thin-pipe or others. Figure 4.4 shows the classified pipe points and plane points. Note that some pipe-shaped objects, e.g. tanks, are so huge that locally they resemble planes. In our experiments, we found it better for segmentation to label the large tanks with small curvature as planes rather than cylinders.

4.2.4 Segmentation and Clustering

The clustering algorithm is based on Euclidean distance. We iteratively select a random seed point that is not yet in any cluster and try to expand it to the neighboring points that are also not yet in any cluster, using the classical flood-fill algorithm. The neighbor threshold is set according to the granularity of the input point cloud. It is easy to see that after finitely many steps the residual point cloud is divided into a number of disjoint clusters.

We apply the clustering routine five times. First we cluster the points labeled as one of the four categories, obtaining lists of plane/pipe/edge/thin-pipe clusters, respectively. Then we subtract the big clusters from the original point cloud. This is important since we do not want to remove small areas of regular shape that might lie on a big part. Finally, clustering is performed on the remaining points, which we believe to be part points. Using S(C) to denote the set of points in category C, we can approximately write the algorithm as filtering out non-candidates:

S(Candidate) := S(All) − S(Plane) − S(Pipe) − S(Edge) − S(ThinPipe). (4.4)

Finally, we can formally give an extensible definition of the Target Candidate by Equation 4.5:

S(Candidate) := S(Irregular) = S(All) − S(Regular) = S(All) − ∪_i S(Regular_i), (4.5)

where Regular_i can be any connected component with repetitive patterns and large size. This definition also provides the possibility of discovering candidates of new targets that are not actually in the database. Figure 4.5 shows the segmentation and clustering result from the previous steps.

Figure 4.5: Segmentation result of the remaining candidate points.

4.2.5 Cluster Filter

We observed that not all clusters are worth the detailed matching. In fact, most clusters in a scene will not be what we are interested in, even at first glance. Therefore, the input clusters are first passed through a filter. The goal of the filter is to quickly rule out or set aside clusters with or without certain significant characteristics. The filters should be extremely fast while able to filter out a good number of impossible candidates. Currently we have implemented a linearity filter.

The linearity filter is independent of the query target (from the part library). The linearity is evaluated by the absolute value of the correlation coefficient r in the least-squares fitting of the 2D points of the three projections onto the x-y, y-z and z-x planes. If |r| > 0.9 in one of the projections, the cluster is considered a linear cluster. Note that planes and thin pipes may fall into the linear category, but since both of them should already have been removed in the classification step, any remaining linear clusters are considered lines, missed pipes, missed planes or noise. Table 4.2 lists the linearity of some of the representative targets we are interested in, out of 1223 clusters in total in this ranking; note that none of them fall in the linear range.

Table 4.2: The linearity of some of the representative targets we are interested in.

Linearity Rank  Correlation Coefficient
563             0.638
656             0.545
669             0.527
821             0.362
1057            0.153
1064            0.145
1071            0.137
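The candidate subtraction of Eqs. 4.4-4.5 and the linearity filter can be sketched as follows. This is an illustrative sketch with hypothetical function names; the |r| > 0.9 threshold is from the text, and degenerate projections (a constant coordinate) are not handled here.

```python
import numpy as np

def candidate_points(all_ids, regular_sets):
    """Candidate extraction of Eqs. 4.4-4.5: subtract every set of
    'regular' point ids (plane/pipe/edge/thin-pipe) from the cloud."""
    candidates = set(all_ids)
    for s in regular_sets:
        candidates -= s
    return candidates

def is_linear(points, r_thresh=0.9):
    """Linearity filter: absolute correlation coefficient of the
    least-squares fit on the x-y, y-z and z-x projections; the cluster
    is linear if any projection exceeds the threshold."""
    points = np.asarray(points, float)
    for a, b in [(0, 1), (1, 2), (2, 0)]:
        r = np.corrcoef(points[:, a], points[:, b])[0, 1]
        if abs(r) > r_thresh:
            return True
    return False
```

A cluster stretched along a diagonal line passes the test in all three projections, while a roughly isotropic blob of points does not.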
In one of our experiments, only 1491 candidates remained out of 1905 initial clusters from the previous steps after ruling out the most linear clusters.

4.3 Matching based on the 3D Self-Similarity Descriptor

4.3.1 Feature Extraction and Descriptor Generation

We follow the method in Chapter 3 [37] to match the template point cloud and the candidate cluster. However, we only use a simplified version, namely the normal similarity, since there is typically no intensity information in the template point clouds. The feature extraction process detects 3D peaks of local maxima of principal curvature in spatial-space.

Given an interest point and its local region, there are two major steps to construct the descriptor. First, the self-similarity surface, which is the 3D extension of the 2D self-similarity surface described in [82], is generated using the similarity measurements across the local region, where the similarity measurement can be the normal similarity. The property function of the normal is a 3D function f_n(x) = ⃗n(x). Assuming that the normals are normalized, i.e. ||⃗n(x)|| = 1, the normal similarity between two points x and y is defined by the angle between the normals, as Equation 3.8 suggests.

Then, the self-similarity surface is quantized along log-spherical coordinates to form the 3D self-similarity descriptor in a rotation-invariant manner. This is achieved by using the local reference system of each given keypoint: the origin is placed at the keypoint; the z-axis is the direction of the normal; the x-axis is the direction of the principal curvature; and the y-axis is the cross product of the z and x directions. The correlation space is quantized into bins in order to reduce the dimension while introducing only small distortion of the data.
In this application we set N_r = 5 radial bins, N_ϕ = 5 bins in longitude ϕ and N_θ = 5 bins in latitude θ, and replace the values in each cell with the average similarity value of all points in the cell, resulting in a descriptor of 5 × 5 × 5 = 125 dimensions. The dimension is greatly reduced without loss of performance, which is another important difference from [37]. In the final step, the descriptor is normalized by scaling the dimensions so that the maximum value is 1. The output of this phase contains a group of feature points, each of which is assigned a point descriptor, i.e. a 125-dimensional vector.

4.3.2 Correspondence Selection

During detailed matching, the descriptors for the targets generated in the offline processing are compared against the descriptors for the candidate clusters generated during the
We thus propose a greedy 4-round strategy to find the 4 correspondences. Initially, all nearest-neighbor correspondences with any NNDR value are in the candidate set. At the beginning of round i, the correspondence with the minimum NNDR value, c_i = (a_i, b_i), is added to the final correspondence set and removed from the candidate set. Then, for each correspondence c' = (a', b') in the candidate set, we calculate the 3D Euclidean distances dist(a', a_i) and dist(b', b_i), and if the ratio of the (squared) distances is larger than a threshold (1.1), c' is removed from the candidate set.

If the candidate set becomes empty within 4 rounds, we regard the match as a failed rigid-body match; otherwise, after 4 rounds, at least 4 distance-preserving correspondences have been selected and the transformation can be estimated. The size of the candidate set after 4 rounds is also used as the matching score.

To avoid the matching being sensitive to the first correspondence, we try multiple initial seeds and keep only the transformation with the highest overlapping score.

Finally, the transformation between the candidate cluster and the target is recalculated based on the combined correspondences. Specifically, a 3 × 3 affine transformation matrix and a 3D translation vector are solved from the equations formed by the correspondences. Then the rigid-body constraint is used again to refine the result, through the Gram-Schmidt orthogonalization of the base vectors R = (u_1, u_2, u_3):

u'_1 = u_1
u'_2 = u_2 − proj_{u'_1}(u_2)
u'_3 = u_3 − proj_{u'_1}(u_3) − proj_{u'_2}(u_3)   (4.7)

which are then normalized using:

e'_i = u'_i / ||u'_i||   (i = 1, 2, 3).   (4.8)

A point p in cluster A is said to be aligned with cluster B if the nearest neighbor to p in cluster B under the transformation is within some threshold. The alignment score is thus defined as the percentage of aligned points.
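The alignment score just defined might be computed as follows. This is a brute-force sketch; in practice a kd-tree would replace the inner loop, and the threshold value is a free parameter.

```python
def alignment_score(cloud_a, cloud_b, threshold):
    """Fraction of points in cloud_a whose nearest neighbor in cloud_b
    lies within `threshold` (cloud_a is assumed already transformed)."""
    t2 = threshold * threshold
    aligned = 0
    for p in cloud_a:
        # squared distance to the nearest point of cloud_b (brute force)
        best = min(sum((pi - qi) ** 2 for pi, qi in zip(p, q)) for q in cloud_b)
        if best < t2:
            aligned += 1
    return aligned / len(cloud_a)
```

Note the asymmetry: the score counts points of the first cloud only, so in general alignment_score(A, B, t) differs from alignment_score(B, A, t).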
If the alignment score between a cluster and a target is higher than some threshold, the cluster is considered to be a detected instance of the target.

4.3.3 Iterative Detection

In case there are multiple targets in a single cluster, we iteratively remove the aligned part through the cloud subtraction routine, and examine the remaining part of the cluster until it is too small to be matched.

Here is an evaluation of our algorithm against the case of multi-target detection in a single cluster: we cut an assembly into several pieces of parts and use them as the query. We successfully detect all of them one-by-one through our alignment and iterative detection process (Fig. 4.6). Note that no segmentation process is involved, and the descriptors are not trivially the same between the same location of the part and the assembly.

4.4 Experimental Results

4.4.1 SVM Classifier Training

Generally speaking, we have five categories for the training clusters: Plane, Pipe, Edge, Thin-Pipe and Part. Since we are using two-class classification, when we train one kind of classifier, all clusters labeled as that class are regarded as the positive examples, while the negative samples are selected from the remaining categories.

Figure 4.6: Parts detection in an assembly.

Classifier    Training Clusters    Training Points    Support Vectors
Plane         14/23                94602              2069/2063
Pipe          14/9                 91613              1496/1503
Edge          9/24                 94838              1079/1121
Thin-Pipe     8/22                 83742              1020/1035

Table 4.3: Statistics of the trained primitive shape classifiers. The ratios denote Positive/Negative.

We summarize the statistics of the training data in Table 4.3. The number of support vectors indicates the complexity of distinguishing a category from the others. Table 4.4 shows the number of points in the residual point cloud after removing each of the four categories.
Nearly half of the points are classified as plane; after further removing pipes, about one third of the points remain; and finally only about one fifth of the original points need to be considered in detection, which shows the effectiveness of the point classification step.

          Original    Plane      Pipe      Edge      Thin-Pipe
Points    25,135k     14,137k    8,767k    6,015k    5,534k
(%)       100.0%      56.2%      34.9%     23.9%     22.0%

Table 4.4: Remaining points after removal of each point category.

4.4.2 Ground LiDAR

In this section, we show the testing results of our method on the industrial scene point cloud.

A result for the sub-scene is shown in Figure 4.7, where detected parts are highlighted with colors and bounding boxes. Figure 4.8 shows another result with respect to the ground truth: red means false negative, i.e., the object is missed or the point on the candidate cluster is misaligned; blue means false positive, i.e., there is no target at the position but the algorithm detected one, or the point on the target is misaligned. Yellow and cyan both mean true positive, i.e., the aligned points are close enough.

Table 4.5 and Figure 4.9 show the statistical results for the industrial scene. From the table we can see that balls, flanges, tanktops and valves are detected more successfully than the other objects.

4.4.3 Publicly Available Data

The experimental results show that our method works with virtual point clouds almost as well as with real point clouds. Since virtual point clouds can be automatically generated from mesh models with the virtual scanner, we can expect fully automatic matching between point clouds and mesh models.

Figure 4.7: Detected parts, including hand wheels, valves, junctions and flanges.

Figure 4.8: Detected parts on top of the tanks.
Category      Sub-categories    Truth    TP    FP    FN
Ball          1                 22       11    4     11
Connection    2                 3        0     0     3
Flange        4                 32       20    3     12
Handle        6                 10       3     1     7
Spotlight     1                 6        1     0     5
Tanktop       2                 4        3     3     1
T-Junction    5                 25       7     0     18
Valve         12                25       17    24    8
All           33                127      62    35    65

Table 4.5: Statistics of detection. There are 8 big categories, 33 sub-categories and 127 instances (ground truth) of targets in the scene. Among them 62 are correctly identified (TP = True Positive), while 35 detections are wrong (FP = False Positive), and 65 instances are missed (FN = False Negative).

Figure 4.9: Precision-recall curve of the industrial part detection.

Figure 4.10: Detection and alignment result from the cluttered scene. The chef is detected twice from two different sub-parts.

To compare our method with the other methods, we have also tested our algorithm on some publicly available data. Figure 4.10 shows one detection result on the cluttered scene [59]. Note that only one face is present in the scene point cloud, and occlusions lead to discontinuity of some parts, which makes the data quite challenging. Moreover, our point classification phase does not contribute to the result in this case.

4.5 Refinement of the Object Detection Methods

In the previous sections we have shown the potential of a matching-based system; in this section we aim for substantially better results by carefully examining the bottlenecks of [38] in terms of the recall rate.

Since matching a small part against a large region wastes much time in unnecessary comparisons and also causes confusion, a segmentation step is necessary. The simplest idea is to segment the cloud based on Euclidean distance. However, in complex industrial scenes, if the clustering tolerance is large, several parts might stay connected in one cluster. On the other hand, if the tolerance is small, a single part can be broken into several pieces.
To solve this problem, we propose an adaptive segmentation method that generates clusters of relatively small size for matching while the continuity of a single part is retained.

Another key component of detection-by-matching approaches is matching itself. Rather than the feature extraction and descriptor computation, we focus on the correspondence selection / outlier removal algorithm, which could otherwise be a bottleneck in the whole detection system. In fact, the new outlier removal algorithm largely reduces the dependency on the quality of descriptors.

RANSAC [25] is a widely used strategy for outlier removal. However, traditional RANSAC is too general to retain high efficiency. Instead, we found that many outliers can be pruned in the early stage of matching under various assumptions/constraints. For example, we can impose a rigid-body constraint, where a rigid-body coefficient is specified for different applications.

Finally, we explore what makes a successful matching by comparing two different evaluation criteria. Figure 4.11 shows one example of the matching result between a single part and a big region using the proposed correspondence selection algorithm with the overlapping-score criterion, which was impossible for the method in [38] within the same time.

Figure 4.11: Matching a single part to a big cluster.

4.5.1 Adaptive Segmentation

As introduced in previous sections, given the point class labels, we are able to filter out the backbone points through the following subtraction process: we perform a Euclidean clustering for points within each of the four backbone categories, and remove them from the original cloud only if they belong to a sufficiently large cluster (≥ 1000 points). In this way, we obtain the residual point cloud.

In our original method [38], the residual point cloud is segmented by a one-time Euclidean clustering. However, there are some problems in the segmentation results: 1) Big clusters can still exist after segmentation.
2) There are many parts in such big clusters, and it is difficult for the matching stage to handle clusters that are too big.

As the solution, we propose the following adaptive segmentation procedure (Figure 4.12). Instead of fixing the tolerance in the Euclidean clustering, we iteratively decrease the tolerance for large clusters so that each cluster contains no more than a fixed number of points. In other words, only large clusters are re-segmented. This process generates a tree of clusters. The tolerance is smaller for deeper layers, and the leaf nodes are the final result of the segmentation.

Figure 4.12: The process of adaptive segmentation.

Although parts that are close to each other can become connected due to noise or incomplete removal of backbone points, which can lead to wrong clustering by Euclidean distance, we observe that in most cases these connections are cut off earlier than the connections within the same object, which ensures the feasibility of the algorithm.

4.5.2 Matching and Pose Extraction

After the features and their descriptors are extracted and computed, we are able to estimate the pose transformation between any two point clouds. The general idea is, instead of extracting only the largest group of matches among different hypotheses that obey the rigid-body constraint, we first obtain all possible rigid transformations among the different hypotheses of matches, and then evaluate the quality of the transformations (Equation 4.14) and extract the best one (Equation 4.15).

4.5.2.1 Correspondence Selection

The goal of correspondence selection is to select feasible sets of correspondences from which possible transformation hypotheses can be computed.

Formally, we define a correspondence between keypoint x in point cloud X and keypoint y in point cloud Y as a triple c = (x, y, f), where f is the confidence value.

Given a keypoint x, we are only interested in the keypoint y_1 that minimizes the difference of descriptors ||d(x) − d(y)||.
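Finding this nearest keypoint, together with the second-nearest one needed for the ratio test that follows, is a straightforward search. A brute-force sketch (the function name is ours, and at least two candidate descriptors are assumed):

```python
def two_nearest(dx, descriptors_y):
    """Return the indices of the two keypoints in Y whose descriptors are
    closest to dx in Euclidean distance (y1 and y2 for the NNDR ratio).
    Assumes len(descriptors_y) >= 2."""
    def d2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    order = sorted(range(len(descriptors_y)), key=lambda j: d2(dx, descriptors_y[j]))
    return order[0], order[1]
```

In practice this search would also be accelerated with a kd-tree over the descriptor vectors.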
Also, we need to ensure that it is outstanding enough to beat the other candidates, so the negative of the Nearest Neighbor Distance Ratio (NNDR) is used as the confidence value, such that the higher the confidence value, the more outstanding the top correspondence (Equation 4.9):

f(c) = − ||d(x) − d(y_1)|| / ||d(x) − d(y_2)||,   (4.9)

where y_1 and y_2 are the top two corresponding points in Y that minimize ||d(x) − d(y)||.

Now, we can build the set of initial correspondences generated from the keypoints in X with sorted confidence values, i.e., C_0 = {c_i ; f(c_1) ≥ f(c_2) ≥ ... ≥ f(c_n)}.

For any correspondence set C, we define its seed correspondence α(C) as the correspondence with the highest confidence value (Equation 4.10):

α(C) = t ∈ C such that ∀c ∈ C, f(t) ≥ f(c).   (4.10)

In the rare case that multiple correspondences share the highest confidence value, any one of them can be the seed.

In the k-th hypothesis, we assume that the k-th correspondence is correct, while the first to (k−1)-th correspondences are wrong. Formally, we only consider the subset C_k^(0) = {c_i ∈ C_0 ; i ≥ k}, and the remaining correspondence with the highest confidence, c_k = α(C_k^(0)), is the seed correspondence.

Starting from c_k, we can gradually add feasible correspondences to the final correspondence set, denoted as S_k. Initially we have S_k^(0) = ∅. Then, in each round we add one more seed to the final correspondence set:

S_k^(j+1) = S_k^(j) ∪ {α(C_k^(j))} = ∪_{l=0}^{j} {α(C_k^(l))}.   (4.11)

Meanwhile, we only preserve correspondences that obey the rigid-body constraint, i.e., that the rigid-body transformation is distance-preserving (Equation 4.12):

C_k^(j+1) = {c ∈ C_k^(j) \ {α(C_k^(j))} ; ∀s ∈ S_k^(j) : 1/γ < ||x(c) − x(s)|| / ||y(c) − y(s)|| < γ},   (4.12)

where γ is the rigid-ratio threshold; for an ideal rigid body γ = 1.
Using the fact that the elements in C_k^(j) must already satisfy the constraint with the elements in S_k^(j−1), we can further simplify the computation from O(n^2) to O(n):

C_k^(j+1) = {c ∈ C_k^(j) \ {α(C_k^(j))} ; 1/γ < ||x(c) − x(α(C_k^(j)))|| / ||y(c) − y(α(C_k^(j)))|| < γ}.   (4.13)

This process is carried on until C_k^(j) = ∅. Based on each self-consistent correspondence set S_k, we can estimate a transformation matrix T_k that minimizes the squared differences between the correspondences, and normalize T_k with the Gram-Schmidt process.

Figure 4.13: Illustration of the overlapping score. (a) shows the alignment of two point clouds and (b) highlights the overlapping area within a certain threshold. The overlapping score reflects the ratio between the size of the overlapping area and one of the original point clouds.

4.5.2.2 Evaluation Criteria

We define the overlapping score between two point clouds A and B, with overlapping threshold θ, as the proportion of the points in A having at least one point of B in their θ-neighborhood (Equation 4.14) (Figure 4.13):

Ω(A, B) = |{x ∈ A ; ∃y ∈ B, ||x − y|| < θ}| / |A|.   (4.14)

Note that in most cases Ω(A, B) ≠ Ω(B, A). In the implementation, a kd-tree is built for each of the point clouds A and B, so that each find-nearest-neighbor operation runs in O(log|A|) or O(log|B|) time; the time complexity of the routine is thus O(|A|·log|B| + |B|·log|A|).

Based on the overlapping score, we define the following criterion (Equation 4.15) for extracting the best transformation:
The range of the Ω function is [0,1], and its value would reach 1 even if P t is only a subset of P c , meaning that the partial matching is enabled. In such cases, we would remove the overlapping part from P c and iteratively perform the matching on the remaining point cloud until it’s small enough (e.g. less than 500 points). Here are some insights why our evaluation criterion is better than the traditional maximum-number-of-matching criterion: the criterion ensures the nearly optimum so- lution under the final evaluation criterion of to what extent the two point clouds are overlapping with each other, while not consuming much more computation time. Also, it is a kind of the quality control of matching at the final stage, meaning that once a good transformation has been generated, it would not be contaminated by other steps. Moreover, this framework of 1) getting all possible transformations, 2) evaluating overlapping scores and 3) extracting best transformations based on overlapping scores could be generalized to non-rigid pose estimation as well. 93 4.6 Experimental Results for the Rened System 4.6.1 Rigid Ratio Threshold The rigid ratio threshold γ, or the maximum distance ratio between two selected corre- spondences, is an important threshold that imposes the distance consistency during the correspondence selection. Generallyspeaking,thesizeofthefinalcorrespondencesetwilldecreaseasthethresh- old decreases, while the accuracy of the correspondences will increase. If the threshold is too tight, there will be too few correspondences to estimate the transformation, while if the threshold is too loose, the accuracy will decrease. We empirically optimize this threshold and set γ =1.08 for all the tests. 
4.6.2 Number of Matching Attempts

Though the best correspondences typically have the best confidence values, we find that making more attempts does improve the results, since our evaluation mechanism, in most cases, ensures the monotonicity of performance with respect to the number of attempts, while adding only a negligible increase in the computational complexity of the method. In the following experiments, we try 50 different hypotheses during best pose extraction.

4.6.3 Comparison of Evaluation Criteria

In this section, we perform the experiment with the parameters suggested in Sections 4.6.1 and 4.6.2, i.e., rigid ratio = 1.08 and number of attempts = 50. The only variable is the scoring method, i.e., the matching score (number of inlying matches after correspondence selection) proposed in [38] vs. the overlapping score proposed in this paper.

Figure 4.14: Comparison of alignment results using (a) the maximum-number-of-matches criterion and (b) the overlapping-score criterion. The point clouds in (a) are misaligned.

Here is an example illustrating the difference (Figure 4.14). If we use the number of matches as the score, the result on the left is selected since it has more matches (143) than the result on the right (108). However, if we apply the transformations computed from the two sets of matches, we find that the alignment on the left is worse than the one on the right. In a word, the overlapping score is a more direct, final evaluation of the quality of alignment and thus gives a better result.

4.6.4 Evaluation of Segmentation

We use a difficult scenario (Figure 4.15) to compare the adaptive segmentation with fixed-tolerance segmentation. For adaptive segmentation we use the following parameters: initial tolerance τ_0 = 0.03, decay coefficient d = 0.9, and upper bound of cluster size N = 50000. Except for the segmentation method, the matching techniques applied are exactly the same in this experiment.
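With these parameters, the adaptive segmentation of Section 4.5.1 might be sketched as follows. Brute-force single-linkage clustering stands in for the region-growing implementation, and the `min_tol` recursion guard is our addition (the text does not specify one).

```python
def euclidean_clusters(points, tol):
    """Single-linkage clustering by Euclidean distance (brute force)."""
    clusters, unvisited = [], set(range(len(points)))
    while unvisited:
        stack = [unvisited.pop()]
        cluster = []
        while stack:
            i = stack.pop()
            cluster.append(i)
            near = [j for j in unvisited
                    if sum((a - b) ** 2
                           for a, b in zip(points[i], points[j])) <= tol * tol]
            for j in near:
                unvisited.remove(j)
            stack.extend(near)
        clusters.append([points[i] for i in cluster])
    return clusters

def adaptive_segmentation(points, tol=0.03, decay=0.9, max_size=50000,
                          min_tol=1e-4):
    """Re-segment clusters larger than max_size with a smaller tolerance;
    the leaves of the recursion are the final segments."""
    result = []
    for c in euclidean_clusters(points, tol):
        if len(c) > max_size and tol * decay >= min_tol:
            result.extend(adaptive_segmentation(c, tol * decay, decay,
                                                max_size, min_tol))
        else:
            result.append(c)
    return result
```

Each recursion level corresponds to one layer of the cluster tree described in Section 4.5.1: the tolerance shrinks geometrically, and only oversized clusters descend further.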
Figure 4.15 shows the comparison results for a cluttered scene between adaptive segmentation and fixed-tolerance segmentation. The clusters obtained from adaptive segmentation are more evenly distributed and properly segmented without disconnecting single parts.

Figure 4.16 shows the number of correct detections by the different methods. When we simply reduce the fixed tolerance to break down the clusters, the detection rate does not increase as hoped; instead it drops dramatically when the tolerance becomes comparable to the precision of the point cloud, resulting in over-segmentation of parts. Adaptive segmentation avoids such situations by keeping parts from being broken down when they are already small enough.

4.6.5 Comparison of Systems

In this section, we run our detection system on a large industrial scene dataset containing over 15 million points. Figure 4.17 presents the detection result, in which the point clouds of the detected parts are replaced with the corresponding templates. The pipe models are generated with the method described in [69]. This result also demonstrates the potential application of our detection system in a point cloud modeling system.

Table 4.6 summarizes the statistics for the full industrial scene using our method versus the method in [38]. Our method is significantly better in terms of the recall rate. More results are shown in Figure 4.18, containing several pairs of original point clouds and the classification and detection results.

Figure 4.15: Comparison of segmentation results using adaptive segmentation (a), as well as fixed-tolerance segmentation with tolerance 0.03 (b) and 0.01 (c). The congested area highlighted by the bounding box in (b) is successfully separated into several smaller part-like clusters by adaptive segmentation (a), while over-segmentation (c) is avoided.

Figure 4.16: Comparison of segmentation / clustering methods.
The vertical axis represents the number of correct detections using different methods. The first column is adaptive segmentation, while the remaining columns are segmentation with a fixed tolerance t.

Figure 4.17: Detection result in an industrial scene. The detected parts are shown in purple and inside the bounding boxes.

Figure 4.18: Detection results by the proposed method. The left column shows the original point clouds, while the right column shows the classification and detection results. Planes are shown in yellow, pipes in green, edges in light purple, and the detected parts in dark purple. Despite the presence of a large number of pipes (1st/2nd row), planes (3rd row) and clutter (2nd/4th/5th row), our system is capable of detecting all kinds of targets, ranging from flanges and hand wheels (2nd/4th row) to tripods (3rd row) and large ladders (5th row).

Category             Truth    Refined method                Original method
                              TP    FP    Recall  Prec.     TP    FP    Recall  Prec.
Cap                  4        2     0     0.50    1.00      1     0     0.25    1.00
Cross Junction-1     5        2     1     0.40    0.67      0     0     0.00    -
Cross Junction-2     2        2     0     1.00    1.00      1     0     0.50    1.00
Flange               66       45    12    0.68    0.79      18    4     0.27    0.82
Flange-Cross         4        4     0     1.00    1.00      1     0     0.25    1.00
Flange-L             19       13    2     0.68    0.87      8     1     0.42    0.89
Ladder               7        3     0     0.43    1.00      0     1     0.00    0.00
Handwheel-L-4        13       11    0     0.85    1.00      8     0     0.62    1.00
Handwheel-M-4        1        0     0     0.00    -         0     0     0.00    -
Junction-Pneumatic   1        1     0     1.00    1.00      1     0     1.00    1.00
Tripod               5        4     1     0.80    0.80      3     0     0.60    1.00
Valve-Handwheel-3    1        1     0     1.00    1.00      0     2     0.00    0.00
Valve-L-4            1        1     0     1.00    1.00      1     0     1.00    1.00
Summary              129      89    16    0.69    0.85      42    8     0.33    0.84

Table 4.6: Statistics of detection. There are 13 categories and 129 instances (ground truth) of targets in the scene. Our method correctly identifies 89 instances, while 16 detections are wrong and 40 instances are missed. The missed instances are mainly due to large (> 50%) occlusion. These results, especially the recall, are substantially better than the results of the system proposed in [38].
Chapter 5

Detecting Object-Level Changes from 3D Point Cloud

Change detection plays a vital role in a variety of applications such as surveillance, tracking, data updating and smart environments. Most previous change detection work has been done on 2D data, i.e., images taken at different times or different frames in a video. However, since 2D data are projected data that do not explicitly reflect the 3D structures, recovering the 3D pose from 2D data is often needed, which faces many challenges such as illumination and occlusion problems. On the other hand, 3D data have the advantage that they are intrinsically correct in terms of the 3D pose, and thus avoid many problems caused by 2D distortions. Although 3D data have become easier to acquire nowadays, most change detection in 3D data is done manually by professionals, which is rather time-consuming, as [29] pointed out. Therefore, it is necessary to develop a change detection approach specialized for 3D data.

Intuitively, change detection needs to be based on the alignment of the two data. Once the data are aligned, there are multiple ways to detect the changes. A naive way is using an octree to compare the spatial difference [29]. For specific applications, different methods can be applied, e.g., 3D change detection for buildings [42]. In this work, however, we focus on the 3D change detection of objects in industrial sites. We investigate how change detection can be enhanced beyond the traditional align-and-compare approaches.

Figure 5.1: An illustration of possible causes of changes between old data (top) and new data (bottom). (1) New - a new object appears; (2) Missing - the old object is missing; (3)-(5) Pose - the pose of the object is changed due to translation and rotation; (6) Replacement - the old object is replaced by a new object.

Specifically, we are interested in changes of certain types of objects, since in many applications only objects of interest are considered important.
Also, in many built facilities, changes to certain types of objects are observed more often than other changes. For example, in an industrial scene, the pose of a handwheel changes frequently, and the difference is vital since it can determine whether a valve is open or closed. Therefore, in this work we focus on object-level changes, whereas the changes of other large-scale structures such as pipes and planes are not highlighted in our system.

Figure 5.1 illustrates the typical causes of changes, including appearance/disappearance, pose change (translation and/or rotation) and replacement. Each type of change is rendered with a different color. In this work, these causes are naturally incorporated in the design of our change detection algorithm.

Note that 3D changes not only occur between point clouds scanned at different times, but exist between point clouds and mesh models as well. The mesh model can either be the original design of the site, or data manually modeled from real scans. Since we can apply a virtual scanner to convert the mesh model to the point cloud format, our point cloud based approach can be applied in both scenarios.

Case           Explanation
Existence      The object appears or disappears.
Pose           The object is translated and/or rotated.
Replacement    The object is replaced by another object.

Table 5.1: Causes of object-level changes.

Given that most industrial parts are rigid, we make the assumption that each object is a rigid body. If part of an object can be transformed (e.g., the handwheel on a valve), it is considered a separate part. Table 5.1 summarizes the possible causes of change of rigid objects. For any single object, there can be two types of changes: existence change and pose change. There are two possible existence changes: appearing, i.e., the object is not present in the first data but appears in the second data, and disappearing, i.e., the object is present in the first data but disappears in the second data.
The pose change can be caused by translation, rotation or a combination of translation and rotation. For example, a handwheel can be rotated to control the flow in a pipe. Finally, for the inter-object scenario, i.e., when the two objects are not the same object, there can be a replacement change. Our goal is thus to detect changes caused by these reasons.

We summarize our major contributions as follows:

(1) We propose an object-level change detection algorithm based on the estimation of the degree of change in the pairwise scenario and non-trivial change selection based on weighted bipartite matching.

(2) We propose and integrate a complete change detection system consisting of three essential modules: alignment, object detection and change detection.

(3) To the best of our knowledge, this work is the first to address the change detection problem on 3D data for complex industrial scenes at the object level.

5.1 Related Work

Change detection has been one of the basic topics in computer vision. A systematic survey of 2D image change detection algorithms can be found in [70], which focuses on changes between general raw images instead of application-specific change detection. [53] performed tests with six change detection procedures on land cover images. They conclude that classification-based methods are less sensitive and more robust. We make an analogy to this idea in our method: given the categories of objects that stand for a small region of point clouds, we can obtain results more robustly, together with more semantic meaning. Another object-based approach [99] classifies groups of pixels corresponding to existing objects in the GIS database. The term object there denotes geo-objects rather than the industrial objects of this work.

There are some common issues across both 2D and 3D change detection, such as alignment, classification, detection and occlusion handling. However, the differences are also obvious. For example, one of the key challenges in 2D change detection is the illumination factor, while 3D change detection has no such issue. There have been benchmarks such as [32] for detecting changes among frames in videos. We can extend some ideas from the 2D scenario. For example, background subtraction is the key idea
For example, one of the key challenges in 2D change detection is the illumination factor, while 3D change detection does not have such issue. There have been benchmarks such as [32] for detecting changes among frames in videos. We can extend some ideas from the 2D scenario. For example, background subtraction is the key idea 104 of many image/video change detection approaches (e.g. [4]). In this work, we treat the objects with relatively small scales in the scene as foreground, and the backbone objects such as planes and pipes as the background. Through this analogy, we apply the object detection method [39] that contains a backbone object subtraction step. On the other hand, there have not been many existing works dealing with the change detection in 3D data. One of the earliest work in 3D data was proposed by Girardeau- Montaut et al. [29]. They compare several octree-based comparing strategies including average distance, best fitting plane orientation and Hausdorff distance. The overlapping score used in our work has the similar effects to Hausdorff distance. Recently, Xiao et al. [101] presents a change detection system for point clouds of urban scenes. Their method is based on the consistency of the occupancies of scan rays. Our method, however, does not need knowledge of the scan rays and can thus be applied to any point clouds. 5.2 System Overview The whole framework consists of three stages: global data alignment, object detection and change detection. Figure 5.2 shows the pipeline with the step-by-step results for the proposed system. The input of the system contains reference data and target data, both of which can be either 3D point clouds or mesh models. If a mesh model is given, it could be automatically converted to a point cloud by a virtual scanner. In the first stage, the SparseIterativeClosestPoint(SICP)methodproposedin[10]isusedtocomputeaglobal alignment of the two point clouds. 
The target data can thus be transformed to the same pose as the reference data.

Figure 5.2: System pipeline for object change detection. The input contains the reference data (left) and the target data (right). Results from the major stages, including global alignment, object detection and change detection, are shown. In the illustration of object detection, detected objects are highlighted with green color and bounding boxes. The visualization of change detection results is explained in Section 5.6.

Then, in the object detection stage, we apply the method proposed in [39] to detect the objects of interest; as a result, the detected objects are locally aligned with the library objects. In other words, our change detection system is based on the inconsistency between the global alignment and the local alignment. Finally, pairwise changes are estimated based on several change estimation functions, and we explore two approaches to solve the change selection problem: Greedy Nearest Neighbor (GNN) and Weighted Bipartite Matching (WBM).

5.3 Point Cloud Alignment

Since we are dealing with industrial data, where the global structures have inherent rigid properties, it is suitable to apply rigid registration techniques. The major difficulty for traditional ICP [9] is that it is very sensitive to outliers, yet missing data are often observed in 3D scans for various reasons.

Sparse Iterative Closest Point (SICP) [10] is a recently proposed extension of traditional ICP. Specifically, the authors formulate the general sparse ℓp optimization problem and empirically select p = 0.4 as a balance between performance and robustness. SICP has several advantages. First, it is heuristic-free, compared to previous variations of ICP. Second, it has only one free parameter, which is easy to tune for specific applications. Therefore, we apply SICP for the global alignment of data in this work.
5.4 3D Object Detection

Given the aligned point clouds, we follow the idea of the 3D object detection method proposed by [39] to detect objects from each of them, respectively. Detecting 3D objects from a point cloud is itself a challenging fundamental problem. For industrial datasets, a majority of points belong to planes and pipes. Therefore, it's necessary to first filter out such points, which have little probability of being in an object. For each point we compute a normalized Fast Point Feature Histograms (FPFH) descriptor [74]. Five categories are defined, i.e., four backbone shapes (pipe, plane, edge and thin-pipe), as well as the remaining points denoted as the part category, since they are most likely to be part of an industrial part. We train four SVM classifiers using LIBSVM [15] over these categories: pipe vs non-pipe, plane vs non-plane, edge vs non-edge and thin-pipe vs non-thin-pipe. Finally, each point is assigned four labels corresponding to the four classification results.

The points in each category are clustered using the region growing algorithm based on the Euclidean distance between the points, respectively. Then, large connected components of the primitive shapes are removed from the original point cloud while the smaller clusters are kept.

After that, all the remaining points are iteratively clustered using adaptive segmentation [39] so that we obtain a list of candidate clusters. For each of the point clouds from the templates in the database and the candidate clusters, we perform feature extraction using the Maxima of Principal Curvature (MoPC) method proposed in [37] to obtain a subset of points as the key-points. For each key-point, we compute the 3D Self-Similarity descriptor [37] with curvature similarity.

To determine which candidate clusters belong to which object category in the database, we apply the greedy matching scheme based on rigid-body constraints and overlapping scores described in [39].
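The Euclidean-distance region growing used for clustering can be sketched as follows (a brute-force BFS over the "within radius" neighbor graph; a simplified stand-in for the actual implementation, with illustrative names):

```python
import numpy as np

def euclidean_clusters(points, radius):
    """Region growing: BFS over the neighbor graph in which two points are
    connected iff their Euclidean distance is below `radius`.
    Returns one cluster id per point."""
    n = len(points)
    labels = -np.ones(n, dtype=int)
    cur = 0
    for seed in range(n):
        if labels[seed] != -1:
            continue
        stack = [seed]
        labels[seed] = cur
        while stack:
            i = stack.pop()
            d = np.linalg.norm(points - points[i], axis=1)
            for j in np.where((d < radius) & (labels == -1))[0]:
                labels[j] = cur
                stack.append(j)
        cur += 1
    return labels
```

In practice a k-d tree would replace the brute-force neighbor search, but the growing logic is the same.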
5.5 Change Detection

From the previous stages we obtain two groups of detected objects, along with their transformations with respect to the library objects. In a naive change detection system, all the detected object locations would be reported as auxiliary information and the user would be required to manually decide the actual changes. Instead, we propose a method that can automatically infer the changes among the objects. In order to figure out the changes between two groups of objects, we first consider the simplified case where each group contains exactly one object.

5.5.1 Pairwise Change Estimation

Given any two detected objects X and Y at different times, we can make several change evaluations between them. The most fundamental evaluation is the category to which each of them belongs. It's natural to define that, if they belong to the same category, the category change is 0; otherwise the change is 1 (Eq. 5.1):

C_c(X, Y) = \begin{cases} 0, & Cat(X) = Cat(Y) \\ 1, & Cat(X) \neq Cat(Y) \end{cases}  (5.1)

Another significant change between two objects one would observe immediately is the difference in locations. We can thus define the location change as the distance between the centers of the objects (Equation 5.2):

C_l(X, Y) = ||X − Y||.  (5.2)

Even if the locations of the two objects are close, there could be a rotational change between them, e.g., a handwheel could be rotated even if it stays in the same location. Let T be the template of X and Y in the database. From the object detection module we know:

X = R_1 T + b_1,  Y = R_2 T + b_2  (5.3)

where R_1 and R_2 are rotation matrices and b_1 and b_2 are translation vectors. By eliminating T we have:

Y = R_2 T + b_2 = R_2 [R_1^{−1}(X − b_1)] + b_2 = R_2 R_1^{−1} X + b_3.  (5.4)

Therefore, we can directly compute the rotation matrix between X and Y as

R := R_2 R_1^{−1}.  (5.5)

The remaining problem is how we can measure the degree of rotation.
We use the Euler angles (yaw, pitch, roll) derived from the rotation matrix (Equation 5.6):

α = atan2(R_{21}, R_{11}),
β = atan2(−R_{31}, (R_{32}^2 + R_{33}^2)^{1/2}),
γ = atan2(R_{32}, R_{33})  (5.6)

Then, the rotation change can be represented by the norm of (α, β, γ) (Eq. 5.7):

C_r(X, Y) = \sqrt{α^2 + β^2 + γ^2}.  (5.7)

Finally, if the objects are close enough, we can measure their degree of overlap. Specifically, we compute the overlapping score between the two point clouds as proposed by [39]. The overlapping score with threshold θ = 0.005 is defined as the proportion of the points in point cloud A for which there exists a point of point cloud B within their θ-neighborhood (Equation 5.8):

Ω(A, B) = |{x ∈ A : ∃y ∈ B, ||x − y|| < θ}| / |A|  (5.8)

The overlap change is then defined by Equation 5.9:

C_Ω(X, Y) = \sqrt{(1 − Ω(X, Y))^2 + (1 − Ω(Y, X))^2}.  (5.9)

Finally, if we presume that the two objects have a relationship, we can decide the reasons for the changes between them using the similarity scores proposed above. Figure 5.3 illustrates the decision flow of pairwise change estimation. If the categories of the two objects are different, then there's a replacement change; otherwise, if the overlap change is smaller than θ_Ω (in our experiments θ_Ω ≈ 0.7, corresponding to the case where the overlapping rate is 50%, i.e., Ω(X, Y) = Ω(Y, X) = 0.5), we claim that there's no change between the objects. Otherwise, the location change and the rotation change are checked by comparing them to the thresholds θ_l = 0.1 and θ_r = \sqrt{0.2^2 + 0.2^2 + 0.2^2} ≈ 0.346.

5.5.2 Change Selection

Given the pairwise change estimation results, we now consider the problem of selecting the most convincing changes. A degenerate case is the one-to-many problem: given an object X and several candidate objects Y_j, we need to figure out which object among the Y_j is the most probable one corresponding to X.

A straightforward method is to use a greedy algorithm that selects the best candidate for each object based on the distance. If the distance of the nearest neighbor is too large (e.g.
> 1 m), then the candidate is considered to have no matched object. We refer to this method as the Greedy Nearest Neighbor (GNN) method.

Figure 5.3: Decision flow of pairwise change estimation.

However, the minima might not be achieved simultaneously, which requires that we strike a balance among the measurements. Also, instead of considering only the location proximity, qualitatively we expect the object Y_j with the same category, the minimum location change, the minimum rotation change as well as the minimum overlap change with respect to X to be the most probable candidate, i.e., the one with the minimum changes. In order to answer this question, we need a comparable measurement of similarity between X and each of the Y_j. Also, the best candidate might not exist, as there could be missing/disappearing or new/appearing objects. For example, if the location change is too large, then the objects are more likely to be different ones, even if their category and pose are similar. In the GNN method a threshold is put on the distance; here, however, we propose a smoother solution by using the Gaussian function (Equation 5.10):

g(x; µ, σ) = \frac{1}{σ\sqrt{2π}} e^{−\frac{1}{2}\left(\frac{x−µ}{σ}\right)^2}.  (5.10)

Formally, we define the change estimation between the two objects X and Y as Equation 5.11, where C_c, C_l, C_r and C_Ω are the estimations of category change, location change, rotation change and overlap change, respectively:

C = (g(C_l; 0, σ_{l0}) · (1 + g(C_r; 0, σ_r)) + g(C_Ω; 0, σ_Ω)) · (1 − C_c) + g(C_l; 0, σ_{l1}) · C_c.  (5.11)

Here is the explanation of Eq. 5.11. The standard deviations σ_{l0}, σ_{l1}, σ_r and σ_Ω can be viewed as soft thresholds for deciding correlation and non-correlation. The category change estimation C_c acts as an indicator function. If the objects are not in the same category, then the rotation change and the overlap change are not well defined, and thus only the location change takes effect.
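The pairwise measures of Eqs. 5.1-5.9 above can be sketched directly with NumPy (function names are illustrative; the brute-force nearest-neighbor search in the overlap score is a simplification for small clouds, not the thesis implementation):

```python
import numpy as np

def category_change(cat_x, cat_y):
    """Eq. 5.1: 0 if the categories agree, 1 otherwise."""
    return 0 if cat_x == cat_y else 1

def location_change(cx, cy):
    """Eq. 5.2: distance between the object centers."""
    return float(np.linalg.norm(np.asarray(cx) - np.asarray(cy)))

def rotation_change(R1, R2):
    """Eqs. 5.5-5.7: Euler-angle norm of R = R2 R1^{-1}
    (the inverse of a rotation matrix is its transpose)."""
    R = R2 @ R1.T
    alpha = np.arctan2(R[1, 0], R[0, 0])                      # yaw
    beta = np.arctan2(-R[2, 0], np.hypot(R[2, 1], R[2, 2]))   # pitch
    gamma = np.arctan2(R[2, 1], R[2, 2])                      # roll
    return float(np.sqrt(alpha**2 + beta**2 + gamma**2))

def overlap_change(A, B, theta=0.005):
    """Eqs. 5.8-5.9: symmetric overlap change from the overlapping scores."""
    def omega(P, Q):
        d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2)
        return float(np.mean(d.min(axis=1) < theta))
    return float(np.hypot(1.0 - omega(A, B), 1.0 - omega(B, A)))
```

For example, a pure rotation about z by 0.3 rad between the two detections gives C_r = 0.3, and two identical clouds give C_Ω = 0.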
If the objects are in the same category, the rotation change takes effect only when the location change is small, while the overlap change acts independently, since it inherently requires a close distance. Moreover, the distance threshold for the same category σ_{l0} is larger than that for different categories σ_{l1}, which gives a larger search range for objects of the same category. Generally speaking, objects of the same category are preferred to be correlated.

To deal with the case where there is no suitable candidate, we set a cutoff threshold C_0 for the change estimation. The value is set to be the change estimation value with respect to two dummy objects Y_{D1}, Y_{D2} at a location that is far enough away. We empirically define Y_{D1} to be an object with a different category at a distance of 0.5, and Y_{D2} to be an object with the same category at a distance of 1. C_0 is thus defined as Eq. 5.12:

C_0 = max(g(0.5; 0, σ_{l1}), g(1; 0, σ_{l0})).  (5.12)

Instead of a single value for the pairwise change evaluation, we now have a matrix of change evaluation values representing all the pairwise relationships between the two sets of objects. Next, we need to figure out which relations should be kept and which should not. We formulate the problem as the following classical Weighted Bipartite Matching (WBM).

5.5.2.1 Weighted Bipartite Matching (WBM)

Given a graph G = (V, E), G is bipartite if there exists a partition V = V_x ∪ V_y s.t. V_x ∩ V_y = ∅ and E ⊆ V_x × V_y. M ⊆ E is a matching on G if for every v ∈ V there is at most one edge in M that connects v. We can define a weight function w : E → R on the edges. The weight of M is given by

w(M) = \sum_{e ∈ M} w(e).  (5.13)

The goal of maximum weight bipartite matching is thus to find

argmax_M w(M).  (5.14)

The maximum weight bipartite matching can be solved efficiently in O(V^2 log V + V E) with the Hungarian algorithm using a Fibonacci heap.

5.5.2.2 Conversion to WBM Problem

We can now convert our problem to the maximum weight bipartite matching problem.
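A runnable sketch of the change estimation (Eqs. 5.10-5.12) and the dummy-padded matching it feeds into follows. The σ values are assumptions for illustration (the thesis tunes its own), and SciPy's `linear_sum_assignment` stands in for the Hungarian algorithm:

```python
import math
import numpy as np
from scipy.optimize import linear_sum_assignment

def gauss(x, sigma):
    """g(x; 0, sigma) from Eq. 5.10."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# sigma values below are illustrative assumptions, not the thesis settings
S_L0, S_L1, S_R, S_OM = 0.5, 0.25, 0.35, 0.7

def change_score(Cc, Cl, Cr, Com):
    """Eq. 5.11: a higher score means more similar (less change)."""
    same = gauss(Cl, S_L0) * (1 + gauss(Cr, S_R)) + gauss(Com, S_OM)
    return same * (1 - Cc) + gauss(Cl, S_L1) * Cc

C0 = max(gauss(0.5, S_L1), gauss(1.0, S_L0))   # Eq. 5.12 cutoff

def match_objects(C):
    """Pad the n x m score matrix with m + n dummy vertices of weight C0 and
    solve maximum-weight matching (scipy minimizes, so negate the weights)."""
    n, m = C.shape
    W = np.full((n + m, n + m), C0)
    W[:n, :m] = C
    rows, cols = linear_sum_assignment(-W)
    pairs = [(i, j) for i, j in zip(rows, cols) if i < n and j < m]
    missing = [i for i, j in zip(rows, cols) if i < n and j >= m]
    new = [j for i, j in zip(rows, cols) if i >= n and j < m]
    return pairs, missing, new
```

With this padding, a real-real edge is kept only when its score beats the cutoff; otherwise both objects fall back to dummies and are reported as missing or new.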
Each detected object X_i (1 ≤ i ≤ n) in the first dataset and Y_j (1 ≤ j ≤ m) in the second dataset can be represented by a vertex, and each weighted edge represents the change evaluation value between the objects:

w(i, j) = C(X_i, Y_j),  1 ≤ i ≤ n ∧ 1 ≤ j ≤ m.  (5.15)

Note that when the change is more significant, by the Gaussian-based definition of the function C (Equation 5.11) the weight is smaller, so maximizing the weights is equivalent to minimizing the changes.

Since n and m might not be equal, we need to add dummy vertices as well as dummy edges to balance the graph. The traditional method is to assign zero weight to the dummy edges; in our case, however, not all objects need to be matched to another object, meaning that certain objects could be preferably matched to a dummy object even if there are unmatched real objects. In the extreme case, all the objects are matched to dummy objects. Therefore, we add exactly m dummy vertices to the X side and n dummy vertices to the Y side, and assign the weights of the dummy edges to be the cutoff threshold:

w(i, j) = C_0,  n < i ≤ n + m ∨ m < j ≤ n + m.  (5.16)

Finally, for each matched pair (X, Y), if X is a dummy vertex while Y is not, then Y is considered a new object; if Y is a dummy vertex while X is not, then X is considered a missing object. If neither X nor Y is a dummy, we can apply the algorithm described in Fig. 5.3 to decide the change between them.

5.6 Experimental Results

We demonstrate the feasibility of our approach on both a synthetic dataset and a real dataset.

5.6.1 Synthetic Data

We create a data generator that can automatically generate pairs of changed scenes. Each scene is a combination of randomly placed objects without backbone objects such as pipes and planes.

We automatically generate two synthetic datasets, Sparse and Dense, each containing 10 pairs of randomly generated scenes. For each pair of data, 5-15 objects are unchanged, 10-20 objects have pose changes (including translation, rotation as well as both), 0-5 objects are replaced, 0-5 objects are missing and 0-5 objects are new.
For those with pose changes, we limit the position difference to be no more than 0.6 (unit: m). All objects are restricted to an l × l × l area to enable contacts among the objects. For the Sparse dataset l = 4 m and for the Dense dataset l = 3 m. Therefore, the objects in the Dense dataset have more chance of contact, and thus Dense is more challenging than the Sparse dataset. The process of data generation ensures that the global alignment and object detection are nearly perfect. Therefore, the actual performance of the change detection algorithm is evaluated.

Each change is represented as a tuple (5.17):

τ = (T, C_r, C_t, P_r, P_t),  (5.17)

where T is the type of change, C_r and C_t are the categories of the correlated objects in the reference data and the target data, and P_r and P_t represent the locations of the objects. The judgment of whether a detected change is true or not depends on the proximity of the tuples (5.18):

τ_1 = τ_2 ⇔ (T_1 = T_2) ∧ (C_{r1} = C_{r2}) ∧ (C_{t1} = C_{t2}) ∧ (||P_{r1} − P_{r2}|| < ϵ) ∧ (||P_{t1} − P_{t2}|| < ϵ),  (5.18)

where ϵ = 0.1 is the tolerance of localization error.

Data   | Truth | WBM: TP / FP / FN / Precision / Recall | GNN: TP / FP / FN / Precision / Recall
Sparse | 212   | 201 / 4 / 11 / 98.0 / 94.8             | 199 / 18 / 13 / 91.7 / 93.9
Dense  | 212   | 183 / 23 / 29 / 88.8 / 86.3            | 182 / 46 / 30 / 79.8 / 85.8

Table 5.2: Statistics of the change detection results on the synthetic dataset. The result is aggregated from the 10 pairs of randomly generated scenes in each dataset.

Table 5.2 shows the statistics of the two change detection methods on the synthetic datasets. WBM outperforms GNN on both datasets. Specifically, while WBM and GNN have similar recall, WBM has significantly higher precision.

5.6.2 Real Data

We test the methods on both cloud-to-cloud and mesh-to-cloud data of real industrial sites, containing 37 changes in total in the ground truth. The real dataset is more challenging than the synthetic dataset in that the global alignment and object detection are not perfect.
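The truth-judgment rule of Eq. 5.18 can be sketched as a small helper (the tuple layout follows Eq. 5.17; `math.dist` needs Python 3.8+):

```python
import math

def same_change(t1, t2, eps=0.1):
    """Eq. 5.18: tuples (T, Cr, Ct, Pr, Pt) describe the same change iff the
    change type and both categories agree and both locations are within eps."""
    T1, Cr1, Ct1, Pr1, Pt1 = t1
    T2, Cr2, Ct2, Pr2, Pt2 = t2
    return (T1 == T2 and Cr1 == Cr2 and Ct1 == Ct2
            and math.dist(Pr1, Pr2) < eps and math.dist(Pt1, Pt2) < eps)
```

A detected change is counted as a true positive when some ground-truth tuple satisfies this predicate; otherwise it is a false alarm.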
For example, if an object fails to be detected in the object detection stage, its change cannot be detected in the change detection stage either. Therefore, we compare 4 combinations of methods: the two change detection algorithms, WBM and GNN, each combined with either automatic or manual object detection. The statistics are summarized in Table 5.3, where we list the precision and recall on the real dataset for either perfect (manual) object detection or automatic object detection combined with WBM or GNN.

Method                           | Precision | Recall
Perfect Object Detection + GNN   | 68.4      | 70.3
Perfect Object Detection + WBM   | 100.0     | 100.0
Automatic Object Detection + GNN | 42.6      | 54.1
Automatic Object Detection + WBM | 61.4      | 73.0

Table 5.3: Statistics of the change detection results on the real dataset.

Obviously, perfect object detection gives better results than automatic object detection, which can miss some objects. Still, it is surprising that, given the perfect object detection results, WBM achieves perfect precision and recall on our test dataset. On the other hand, even with the perfect object detection results, GNN only achieves around 70% in both precision and recall.

Figure 5.4 illustrates an example of cloud-to-cloud change detection and Figure 5.5 shows an example of mesh-to-cloud change detection. In the final results, the reference data is shown as the baseline in light blue, the objects in the reference data that are found to have a change are highlighted in deep blue, and the objects in the new data that are found to have a change are highlighted in red. Each detected object has a bounding box, and the ones that are not highlighted in deep blue or red are the ones found to be unchanged. Four labels are assigned to highlight the changes: New, Miss, Pose and Replace. For pose changes and replacement changes, a green line represents the correspondence between the objects.
From the figures and the table we can see that the false alarms are mainly due to the imperfectness of the first two steps. For example, the false alarms of the flanges in the first data case (Figure 5.4) are due to false alarms and mis-detections in the object detection module. The false alarms of the pose change in the second data case (Figure 5.5) are due to the small displacement of the global alignment (Figure 5.5 (c)). It's also important to note that the unchanged objects are not counted in the statistics as ground truth, while our method successfully identifies many such cases, as in Figure 5.4.

Figure 5.4: (a) (b) Original point clouds. (c) Global alignment result. (d) (e) Detection results. (f) Change detection results. The visualization of the change detection results is explained in Section 5.6.

Figure 5.5: (a) Original mesh model. (b) Original point cloud. (c) Global alignment result. (d) (e) Detection results. (f) Change detection results. The visualization of the change detection results is explained in Section 5.6.

Chapter 6 Detecting and Classifying Urban Objects from 3D Point Cloud

There are several categories of objects in a typical urban scene, including large-scale objects such as buildings, moving objects such as cars, natural objects such as trees and man-made objects such as poles. In this chapter, we focus on the detection of pole-like objects, including utility poles, street lights, traffic lights, road signs, flag poles and parking meters. With correctly identified instances, there can be many potential applications such as navigation for robots, autonomous driving and urban modeling.

Early works use images or videos to detect pole-like objects [23]. As 3D data have become popular, it has been realized that 3D data can avoid problems such as illumination and background confusion in 2D.
On the other hand, the fact that 3D sensed datasets contain a large number of points calls for efficient algorithms. While we can classify and extract all linear clusters from arbitrary directions (see Sec. 6.5.1), most pole-like objects are in the upright direction, even if the terrain is steep, due to safety and usability requirements. In case the data come without a regular, well-aligned coordinate system, we can either first roughly fit the ground and obtain the upright direction, or manually select the z-direction. After that, we are able to apply a fast vertical bounding-box-based method to extract all possible locations of pole-like objects.

On the other hand, simple clustering methods based on spatial proximity are not feasible, since the ground always connects everything together. Also, in large-scale datasets such as urban scenes, the majority of points belong to large-scale planes including the ground and building facades, which are not the focus of pole detection. Therefore, most previous works tend to use a plane-fitting technique to remove the ground. If the ground is removed perfectly, then all the objects above it are properly segmented. Unfortunately, the ground is not always planar, meaning that a brute-force fitting may fail. Another approach is to classify the local shapes that are planar, which yields accurate classification of the points. However, this method spends too much time computing features for every point. In contrast, our method uses simple horizontal slicing to separate the ground and then applies clustering based on Euclidean distance within each layer, avoiding plane fitting and the computation of features for every point. The problem left is to reassemble the broken parts while avoiding the interference of the ground and nearby objects. To this end, we propose a bucket augmentation method, followed by a segmentation stage consisting of ground trimming and disconnected component trimming.
Finally, in order to filter out other objects that contain pole-like structures, we introduce a validation process based on a statistical pole descriptor and SVM-based classification.

6.1 Related Work

Slicing-based Methods. Several works use a slicing-based method on point cloud data to find vertical objects, in order to reduce the influence of structures attached to the vertical trunk. Luo and Wang [52] use slicing to detect pillars from a point cloud. Pu et al. [68] extend the approach of Luo and Wang [52] and propose a percentile-based method to detect pole-like objects. However, their validation method is mainly based on the deviation of the neighboring subparts, which is capable of dealing with pole-like objects with attachments at the bottom or on top of the trunks, but not with pole-like objects with rich attachments in the middle.

Cylindrical Shape-based Methods. One of the earliest works aiming at detecting poles from ranged laser data uses Hough voting to detect circles [66]. Similar approaches have been extensively applied to tree trunk detection [3]. However, the circular characteristic of poles is obvious only in indoor environments or close-range scans such as [52].

Segmentation. Due to the presence of the ground and facades, which connect all objects into one huge cluster, a filtering and segmentation step is usually needed. Yokoyama et al. [105] assume that the ground in the input data has been removed. Golovinskiy et al. [31] use iterative plane fitting to remove the ground, and a graph-cut-based method to separate the foreground from the background. We do not, however, rely on ground fitting in the pre-processing step. Instead, we do the ground trimming after localization, which involves only a small section of the ground.

Point Classification. Point classification is a standard technique to analyze the composition of a point cloud. For example, Lalonde et al.
[48] use a Gaussian Mixture Model (GMM) with the Expectation Maximization algorithm to learn a model of the three saliency features derived from the eigenvalues of the local covariance matrix, i.e., linear, planar and volumetric. Demantké et al. [22] propose dimensionality features based on the same saliency features, which are exhaustively computed at each point and scale. Behley et al. [6] apply spectrally hashed logistic regression to quickly classify the points. Hadjiliadis and Stamos [34] propose a sequential online algorithm for classification between vegetation and non-vegetation and between vertical and horizontal surfaces. Our method is closest to that of Yokoyama et al. [105], but we make a more delicate classification by distinguishing between wire points and general linear points.

Pole Classification. Surprisingly, despite the essential trunk, the shapes of poles are quite diverse. Figure 6.1 illustrates some representative poles. Most existing works classify the poles according to their usage, shape and even height. Pu et al. [68] make a detailed classification of the signs according to the planar shapes. However, they do not distinguish between non-planar poles, e.g., lights and utility poles. Golovinskiy et al. [31] make a mixed categorization of usage and height, resulting in seven categories including short posts, lamp posts, signs, light standards, traffic lights, tall posts and parking meters, while utility poles are not reported. Yokoyama et al. [105] classify the poles into three categories: street lights, utility poles and signs. In this work, we follow the categorization of Yokoyama et al. [105], since these three categories represent the most common usage of poles, while further classification depending on height could easily be achieved.

Figure 6.1: Representative poles. From left to right are street lights, flags, utility poles, signs, meters and traffic lights, respectively.
6.2 System Overview

The system pipeline for detection and classification is illustrated in Figure 6.2. The input is a large-scale point cloud of an urban area, and we output all possible candidates of pole-like objects and classify them into 3 basic categories, i.e., street lights, utility poles and signs. Specifically, there are three major stages of processing. The first stage is localization, where all possible locations of pole-like objects are extracted. In this stage, we make use of a unique characteristic of pole-like objects, i.e., the local parts of pole-like objects are also pole-like when they are broken down. The second stage is segmentation, in which the ground and other disconnected components are trimmed at the candidate locations. Finally, we compute statistical attributes for each candidate based on the extended distribution features and classify the candidates with a support vector machine. The classification step also helps filter out other objects that contain local pole-like structures, such as trees, pedestrians and parts of buildings. In the following sections, we will discuss each of these stages in detail.

Figure 6.2: The pipeline of the proposed pole detection and classification system.

6.3 Candidate Localization

We denote the set of pole-like object point clouds as P. Given a scene point cloud Y, any object in it can be seen as a subset of Y; our goal is to find the set of pole-like objects

P_Y = {P_i | P_i ⊂ Y ∧ P_i ∈ P, i = 1, 2, ..., n}.

Each pole-like object P_i ∈ P can be divided into two parts: a trunk T_i and the remaining points R_i. The trunk T_i is what makes an object pole-like, while the remaining points R_i can be used to tell what class of pole the object belongs to. We define the location of a pole-like object as the location of its stem. We focus on pole-like objects with vertical or near-vertical trunks.
If the original input is not well oriented, then a rough ground fitting of the point cloud can be used to generate the upright direction. In fact, most trunks of pole-like objects should not be tilted too much regardless of the terrain. In the first step, we would like to find all possible locations of pole-like objects. We employ a slicing strategy to obtain a fast raw localization.

6.3.1 Point Cloud Slicing and Clustering

One important property of the trunk in a real scene point cloud is that, if it is horizontally sliced, each of its slices is still a trunk-like segment. We first slice the input cloud along the z (upright) direction. The height of each layer is H = 1 (unit: meter). The slicing result is illustrated in Figure 6.3 (a). For each slice, we perform the clustering algorithm based on Euclidean distance to get a list of clusters. The clustering result within each slice is illustrated in Figure 6.3 (b).

6.3.2 Pole Seed Generation

Given a horizontal slice of a pole-like object, there are two obvious characteristics: first, the cross section is relatively small; second, the length is long enough. Moreover, the bounding box of a piece of trunk is very close to the original shape. Therefore, we have the following two criteria regarding the cross-section area and the segment length, based on the bounding box of each candidate cluster (Equation 6.1):

L_x × L_y < A,  L_z ≥ λ · H  (6.1)

In Equation 6.1, we assume that the size of the bounding box is L_x × L_y × L_z, A is the maximum area of a cross section that would be considered a trunk, H is the slicing height, and λ is the ratio coefficient of the length. We empirically set A = 0.7^2 = 0.49, H = 1 and λ = 0.5. Figure 6.3 (c) highlights the segments satisfying the trunk criteria.

Figure 6.3: The step-by-step results of candidate localization. (a) The sliced result. (b) The clustering result for each slice. (c) The segments satisfying the pole criteria.
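A minimal sketch of the slicing and the trunk criteria of Eq. 6.1, assuming a NumPy array of (x, y, z) points (helper names are illustrative):

```python
import numpy as np

def slice_cloud(P, H=1.0):
    """Horizontal slicing: group points by the layer index floor(z / H)."""
    idx = np.floor(P[:, 2] / H).astype(int)
    return {int(k): P[idx == k] for k in np.unique(idx)}

def is_trunk_segment(cluster, A=0.49, H=1.0, lam=0.5):
    """Eq. 6.1: a slice cluster is a trunk-seed candidate when its bounding
    box has a small horizontal cross section (Lx * Ly < A) and a tall
    enough vertical extent (Lz >= lam * H)."""
    Lx, Ly, Lz = cluster.max(axis=0) - cluster.min(axis=0)
    return bool(Lx * Ly < A and Lz >= lam * H)
```

Applying `is_trunk_segment` to every cluster of every slice yields the segments highlighted in Figure 6.3 (c).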
After all possible trunk segments are generated, we go through them from the lower layers to the upper layers to check if there is a trunk overlap. Two trunks are said to be overlapping if and only if the bounding boxes of the two trunks projected onto the x-y plane overlap. Once a trunk overlap is detected, the trunk segment in the upper layer is added to the trunk list of the lower segment. The lowest trunk segments are considered the seeds of the poles.

6.3.3 Pole Bucket Augmentation

In order to extend the pole candidate from the trunks to their attached structures, we perform bucket augmentation on the pole seeds. Specifically, the region within a constant horizontal distance r_b = 3 from the center of the seed trunk segment is considered the pole bucket, and all points within this range are appended to the candidate pole cluster.

6.4 Candidate Segmentation

While the bucket augmentation limits the range of the pole-like object, the enclosed area does not necessarily belong to the pole. These outliers could be the ground, or points from other objects.

6.4.1 Ground Trimming

In general, the lowest part of the pole is connected with the cluster of the ground, so it's necessary to trim the ground from the cluster. The idea is that we can extend the bottom part of the pole as long as the number of points in the inner circle is larger than in the outer circle. Specifically, we only consider the points below the seed trunk, i.e., the ground-connected cluster G_i = {p ∈ C_i | z_p < z_m}, where z_m = min_{q∈T_i} z_q is the lower bound of the z value of the seed trunk. Then, we attempt to extend the bottom of the trunk at a step of δ = 0.2. In the k-th step, we check if inequality (6.2) holds:

\frac{|{p ∈ G_{ik} | r_{inner} < r_p < r_{outer}}|}{|{p ∈ G_{ik} | r_p < r_{inner}}|} < λ_G  (6.2)

where G_{ik} is the sliced ground-connected cluster (Equation 6.3), the inner radius r_{inner} = r_b / 4 = 0.75, the outer radius r_{outer} = 2 r_{inner} = 1.5, and λ_G = 0.5 is the trunk-ground ratio.
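A sketch of the ground-trimming loop around Eq. 6.2, with the ratio oriented as in the text (keep extending while the inner circle holds more points than the outer annulus). The seed axis is assumed to sit at the origin, which is a simplification of this sketch, and the function name is illustrative:

```python
import numpy as np

def trim_ground(G, z_m, r_b=3.0, delta=0.2, lam_g=0.5, max_steps=25):
    """Walk downward from the seed-trunk bottom z_m in steps of delta and
    keep the inner-circle points of a slice while the annulus/inner
    point-count ratio stays below lam_g; stop otherwise."""
    r_in, r_out = r_b / 4.0, r_b / 2.0       # 0.75 and 1.5 for r_b = 3
    r = np.hypot(G[:, 0], G[:, 1])           # horizontal radius from the axis
    kept = np.zeros(len(G), dtype=bool)
    for k in range(1, max_steps + 1):
        in_slice = (z_m - k * delta <= G[:, 2]) & (G[:, 2] < z_m - (k - 1) * delta)
        n_in = np.count_nonzero(in_slice & (r < r_in))
        n_ann = np.count_nonzero(in_slice & (r_in < r) & (r < r_out))
        if n_in == 0 or n_ann / n_in >= lam_g:   # slice no longer trunk-like
            break
        kept |= in_slice & (r < r_in)            # extend the trunk bottom
    return G[kept]
```

On a synthetic candidate whose trunk continues for two slices below the seed before a wide ground disc appears, the loop keeps exactly the trunk slices and stops at the ground.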
If the inequality holds, the points within the inner radius are added to the candidate cluster. The process stops when inequality (6.2) is no longer satisfied.

G_{ik} = {p ∈ G_i | z_m − kδ ≤ z_p < z_m − (k−1)δ}.  (6.3)

Figure 6.4: Ground and disconnected region trimming process. (a) The original bucket-augmented candidate. (b) The red part is the above-the-seed cluster, while the blue part is the under-the-seed, or ground-connected, cluster. (c) When ground trimming is done, the bottom part of the pole is successfully extended. (d) Finally, the disconnected components with respect to the seed trunks are removed.

6.4.2 Disconnected Region Trimming

To remove the other objects that lie within the range of the bucket, we need to trim the disconnected components. Specifically, we perform the region growing algorithm from each of the seed trunks in the current candidate point cloud, and filter out the points that are unreachable. Figure 6.4 shows the ground and disconnected region trimming process.

6.5 Pole Classification

The constraints enforced in the localization step are relatively weak, so as to keep as many potential poles as possible. As a side effect, many candidates are actually parts of buildings containing pole structures, trees with trunks, or even pedestrians. Therefore, it's necessary to validate them again after segmentation. Meanwhile, since most poles belong to lights, utility poles and signs, we try to identify which of these three categories the poles belong to at the same time. To this end, we classify the pole candidates into four categories: lights, utility poles, signs and others. The ones classified as others are removed from the pole detection result.

Figure 6.5: Distribution of height of poles.

Before applying the classifier, we need to extract some attributes from the candidates. The most straightforward attribute of a pole candidate is the height h = z_max − z_min.
The height is meaningful because the pole-like objects we would like to detect are man-made objects and thus have fixed heights for certain sub-categories. Figure 6.5 illustrates the distribution of heights of poles from the manually labeled ground truth. There are multiple peaks in different intervals, which suggests that large numbers of poles of certain types are present in the region.

6.5.1 Point Classification

The usage of a pole is highly dependent on its local shapes or attachments, which can be simplified as linear, planar and volumetric components. All pole-like candidates contain a vertical linear trunk. Besides that, lights may contain some linear branches or volumetric bulbs, utility poles contain linear wires, and signs contain planar components. These components are further composed of local point-level patches of the same property, so we just need to make a classification on the neighborhood of each point. The traditional Principal Component Analysis (PCA) is applied here. Suppose that λ_1 ≥ λ_2 ≥ λ_3 are the eigenvalues of the variance-covariance matrix M_p (6.4):

M_p = \frac{1}{|U_p|} \sum_{q ∈ U_p} (q − \bar{q})(q − \bar{q})^T,  (6.4)

where U_p is the set of points lying in the sphere of radius r_U = 0.5 centered at the point p, and \bar{q} is the barycenter of U_p. λ_1 ≫ λ_2 ≃ λ_3 indicates that there is one principal direction and the distribution is linear; λ_1 ≃ λ_2 ≫ λ_3 indicates that there are two principal directions and the distribution is planar; λ_1 ≃ λ_2 ≃ λ_3 suggests that there is no obvious principal direction and the distribution is thus volumetric. There are multiple forms of determining which criterion is satisfied [22, 48, 105]. We apply the form of distribution features as in [105] (6.5):

S_1 = λ_1 − αλ_2,  S_2 = λ_2 − λ_3,  S_3 = βλ_3  (6.5)

Then, the dimensionality feature d = argmax_{i ∈ {1,2,3}} S_i is introduced to indicate which feature is the most significant.
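Eqs. 6.4-6.5 translate directly into a few lines of NumPy (a sketch with an illustrative function name; α = 4 and β = 2 are the relaxed values discussed in the text):

```python
import numpy as np

def dimensionality(neighborhood, alpha=4.0, beta=2.0):
    """Eqs. 6.4-6.5: eigenvalues of the local covariance give the
    distribution features; returns 1 (linear), 2 (planar) or 3 (volumetric)."""
    Q = neighborhood - neighborhood.mean(axis=0)
    M = Q.T @ Q / len(neighborhood)              # covariance matrix M_p
    lam = np.sort(np.linalg.eigvalsh(M))[::-1]   # lambda_1 >= lambda_2 >= lambda_3
    S = (lam[0] - alpha * lam[1],                # S_1: linear
         lam[1] - lam[2],                        # S_2: planar
         beta * lam[2])                          # S_3: volumetric
    return int(np.argmax(S)) + 1                 # d = argmax_i S_i
```

A straight segment, a flat grid and a random cube of points fall into the linear, planar and volumetric classes, respectively.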
Note that, although Definition (6.5) does not have a normalization coefficient as in [22], this form is actually more general, because only the relative relationship among the S_i matters, and the form in [22] corresponds to the case in which α = 1 and β = 1.

Figure 6.6: Comparison of the point classification result of a facade patch before and after smoothing. (a) shows the unsmoothed point classification result, in which most points are correctly classified as planar. However, the smoothed point classification result (b) wrongly turns the planar points into linear points.

In [105], the cluster is smoothed using the endpoint-preserving Laplacian smoothing in order to recognize linear shapes of different radii. Therefore, very strict parameters are applied for the smoothed data, i.e., α = 10 and β = 100. However, we find that smoothing can cause problems in sparse regions (Figure 6.6), and is time-consuming. Moreover, without smoothing, a relaxed condition with α = 4 and β = 2 is enough for recognizing most linear shapes, and different values of α help distinguish their radii. This is particularly useful given the observation that, different from the case in [105], street lights and utility poles cannot be distinguished simply by the proportion of linear points and volumetric points, because both categories can have similar numbers of linear points. On the other hand, what makes a pole a utility pole is that it carries wires. Therefore, the key is to distinguish the wire points from the other linear points. While the wire points are classified as linear points by the distribution feature, two distinct properties of wires are enforced: (1) the wires are thinner than other linear parts; (2) the directions of the wires are typically horizontal. To judge the first property, we evaluate a strict condition with α′ = 10, meaning that S′_1 can be larger than S_2 and S_3 only if the linearity is significant enough.
The second property requires that the principal direction be nearly horizontal; in other words, the eigenvector v⃗_1 corresponding to λ_1 is roughly perpendicular to the upright direction (Equation 6.6, θ_w = 0.2):

S′_1 = λ_1 − α′λ_2 > S_2, S_3;  |(v⃗_1 / ||v⃗_1||) · (0,0,1)| < θ_w.  (6.6)

Similar to [105], the principal direction can be used to identify the linear points lying on the vertical trunk by applying the constraint (6.7) (θ_t = 0.8):

|(v⃗_1 / ||v⃗_1||) · (0,0,1)| > θ_t.  (6.7)

Figure 6.7 shows the classification result of all five categories of points on different types of objects.

6.5.2 Pole Component Analysis

From Figure 6.7 we can qualitatively summarize the relationship between the object classes and the point classification in Table 6.1.

Figure 6.7: Point classification results on different objects. The meanings of the colors are: blue - vertical linear, red - planar, black - volumetric, orange - wire, green - other linear.

Class           | Vertical Linear | Wire | Volumetric | Planar | Other Linear
Pole - Light    | +               | -    | -/+        | -      | -/+
Pole - Utility  | +               | +    | -/+        | -      | -/+
Pole - Sign     | +               | -    | -/+        | +      | -
Others - Tree   | -/+             | -    | -          | -      | ++
Others - Facade | -/+             | -    | -          | ++     | -
Others - Others | -               | -/+  | -/+        | -/+    | -/+

Table 6.1: Qualitative relationship between the class of object and the class of point classification.

In the table, '-' means the object category contains few points of the class in the corresponding column, '+' means it contains many such points, '-/+' means it may contain few or many such points depending on the subcategory (e.g., Figure 6.7(b) and 6.7(c)), while '++' means it contains a large number of such points. We can see that, without the wire class, it is hard to distinguish between the lights and the utility poles.

From Table 6.1 we can infer some heuristic conditions, based on the number of points belonging to the different classes, to do the classification.
However, since the variance within a category can be large, it is better to apply a learning-based classifier.

In general, we have six attributes for any candidate cluster: the height h and the number of points in each of the five categories. However, since some joint points on the trunk may be classified as non-linear points, it is better not to count them in the statistics. Similar to the attached-part recognition in [105], we apply RANSAC to fit the vertical linear points as the trunk. The points lying within distance σ = 0.2 are considered to be on the fitted line regardless of their class. The process is continued until the number of unfitted vertical linear points is smaller than 50. Another observation is that the non-trunk features on the very bottom of the candidates are typically unrelated (Figure 6.7). Therefore, we exclude the non-trunk points on the lowest 0.1 × h part of the cluster. We thus obtain a refined count of the points in each class, i.e., the number of vertical linear points on the fitted trunk n_1, the number of wire points n_2, the number of other linear points n_3, the number of planar points n_4 and the number of volumetric points n_5. Note that n_2, n_3, n_4 and n_5 exclude the points on the fitted trunk and on the bottom part of the cluster. Finally, we normalize them by (6.8):

d_i = n_i / N,  i = 1, 2, 3, 4, 5.  (6.8)

6.5.3 Classification by SVM

To classify and validate the poles, we train a 4-class Support Vector Machine (SVM) [15] on the six attributes. The four classes are lights, utility poles, signs and others (non-poles). We use 10-fold cross-validation and grid search for the best C and γ. The training data contain 6 lights, 5 utility poles, 3 signs and 8 instances of non-poles, including 4 trees, 3 facade segments and 1 pedestrian. We find that even with such a small training set, the classification result outperforms the heuristic method. The results are presented in Section 6.6.
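The attribute extraction and SVM stage can be sketched as follows. This is an illustrative reimplementation using scikit-learn's SVC (a LIBSVM wrapper), not the thesis code; the candidate numbers below are invented toy data, and the cross-validation fold count is reduced to fit them:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def candidate_attributes(h, counts):
    """Six attributes per candidate: the height h plus the five refined
    per-class point counts n_1..n_5 normalized by the total N (Eq. 6.8)."""
    counts = np.asarray(counts, dtype=float)
    return np.r_[h, counts / counts.sum()]

# Toy candidates (height, [n_1..n_5]); labels: 0 light, 1 utility, 2 sign, 3 other.
samples = [
    (9.0,  [300, 5, 80, 10, 40], 0),   (8.5,  [280, 4, 90, 12, 55], 0),
    (12.0, [350, 120, 30, 15, 20], 1), (11.0, [330, 140, 25, 10, 25], 1),
    (3.5,  [150, 2, 5, 90, 5], 2),     (4.0,  [160, 3, 8, 110, 6], 2),
    (6.0,  [60, 3, 400, 20, 50], 3),   (7.0,  [50, 5, 380, 30, 60], 3),
]
X = np.array([candidate_attributes(h, c) for h, c, _ in samples])
y = np.array([lab for _, _, lab in samples])

# Grid search over C and gamma with cross-validation, as in Section 6.5.3
# (cv=2 here only because the toy set is tiny; the thesis uses 10 folds).
clf = GridSearchCV(SVC(kernel="rbf"), {"C": [1, 10, 100], "gamma": [0.1, 1, 10]}, cv=2)
clf.fit(X, y)
```

Candidates predicted as class 3 ("others") would be dropped from the detection result.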
6.6 Experimental Results

The scanned data used in our experiments cover a 700m × 700m area of Ottawa from the Wright State 100 dataset [31]. The provided data are merged from one airborne scanner and four car-mounted TITAN scanners facing the left, the right, the ground and the sky, respectively. The accuracy of the airborne and TITAN data fusion is 0.05 meters. We manually labeled over 450 poles in the test region, which contains 45 blocks of 100m × 100m with over 70 million points. Since many poles are ambiguous in the point cloud, we verified the ground truth with Google Street View. Figure 6.1 shows a collection of representative poles. The most common category is the light, followed by the sign and the utility pole.

Number of blocks               | 45
Number of slices               | 1287
Number of clusters             | 114102
Clusters with H > 0.3          | 79225
Clusters with A < 0.49         | 53080
Clusters fitting both criteria | 30336
Localized candidates           | 2448

Table 6.2: Statistics of localization.

Table 6.2 shows the statistics for localization. The minimum z value is 40.53 and the maximum z value is 125.76. After slicing each block, there are 1287 slices in total. After clustering within each slice, there are 114102 clusters in total. If only the height criterion is applied, 79225 candidate clusters remain; if only the area criterion is enforced, 53080 clusters remain. 30336 clusters pass both criteria, and after merging the segments, there are 2448 candidate locations.

The 2448 candidate locations are verified through the validation based on the attributes and the SVM classifier, and finally 470 instances are identified as one of the three pole categories (lights, utility poles and signs). Table 6.3 shows the evaluation of the detection of pole-like objects. Among the 451 pole-like objects, 340 are successfully detected, with a recall rate of 75%. There are 144 false alarms, most of which are trees with small crowns, pedestrians, and sparse building facades with pole-like structures. Table 6.4 shows the comparison of pole classification results between our method and that of Yokoyama et al. [105]. Our method performs considerably better. There are various reasons why the method of [105] fails on this dataset. First, their segmentation step does not take the buildings into account; second, the variation within one category, e.g., the lights, is not considered; third, the wires connecting almost all the utility poles make their segmentation unsuccessful; fourth, the smoothing step brings more noise than gains on this dataset; and finally, there are many parameters that need to be tuned in their method, so some of the parameters might not fit this dataset. Our method, on the other hand, solves or partially solves the above problems and thus achieves superior performance on this challenging dataset.

Method                | Predicted | Correct | Precision | Recall
Yokoyama et al. [105] | 483       | 202     | 42%       | 45%
Proposed method       | 470       | 340     | 72%       | 75%

Table 6.3: Evaluation of the pole-like object detection.

Method                | Category | Predicted | Correct | Precision | Recall
Yokoyama et al. [105] | Light    | 221       | 73      | 33%       | 33%
                      | Utility  | 95        | 2       | 2%        | 5%
                      | Sign     | 167       | 27      | 16%       | 25%
                      | Summary  | 483       | 102     | 21%       | 27%
Proposed method       | Light    | 213       | 119     | 56%       | 63%
                      | Utility  | 144       | 72      | 50%       | 68%
                      | Sign     | 113       | 54      | 48%       | 68%
                      | Summary  | 470       | 245     | 52%       | 65%

Table 6.4: Evaluation of the pole-like object classification.

Figure 6.8 shows the pole detection and classification result of the whole area, and Figure 6.10 shows some close-ups. The overall running time is around 3 hours, which is fast and demonstrates the feasibility of our method for large-scale urban data. Figure 6.9 shows the typical failure cases of our method: trees with small crowns can easily be confused with lights that have many bulbs. This could possibly be solved by taking more contextual information into account, which is part of our future work.

Figure 6.8: Pole detection and classification result of the whole area.

Figure 6.9: Typical failure cases of our method.
The left is a false alarm due to the sparsity of the tree, while the right is a false negative caused by a light with many bulbs.

Figure 6.10: Close-ups of the pole detection and classification results of our method. We use yellow to denote lights, blue to denote utility poles and red to denote signs.

Chapter 7

Object Detection using Deep Convolutional Neural Network

7.1 Introduction

Urban object recognition can be seen as a basic task in a variety of applications, including urban modeling at a fine level [21,31], robot navigation and autonomous driving. Specifically, we focus on the vehicle detection problem in this chapter. Previous works on vehicle detection were mostly conducted on images or videos. For point clouds, the standard approach to vehicle detection is object-based, which involves two steps: candidate generation and verification. For candidate generation, Patterson et al. [64] used spin images to classify the points and then performed clustering; however, their method only deals with cars parked along the roads. Yao et al. [103] apply vehicle-top detection to identify the candidates. Velizhev et al. [96] use domain knowledge to generate the hypotheses, while our method does not rely on the street axes as they do. To verify the candidates, one way is to perform a matching process based on local/global descriptors with templates [54,64]; another way is to compute a few statistics from the candidates [104,106]; feature voting can also be applied [96]. Unlike these methods, we base the classification on the straightforward visual appearance without hand-crafted descriptors. Golovinskiy et al. [31] applied a generic shape-based 3D object detection framework, but the emphasis was on pole-like objects, and the results for cars were weaker since they did not utilize the properties of vehicles.
Yao [104] compared two methods for vehicle extraction and showed that the 3D method gives more accurate results than the grid-cell-based method. They used an SVM with only five attributes in the classification step, which is capable of dealing with simpler cases, but might fail in complicated environments.

On the other hand, deep learning techniques have recently achieved great success in image classification. To this end, we apply a deep CNN to orthogonal-view information from the objects. Our detection and classification system is depicted in Figure 7.1. The method begins with the removal of large-scale background objects, including the terrain, the upper parts of the buildings and the curbs. We then exploit knowledge-based segmentation to generate the candidates. The orientation of the vehicles is a useful yet robust cue, and thus plays a vital role in fine segmentation and classification. We further introduce a gap segmentation method for the difficult case of parking lots, which uses the vehicle orientation to limit the number of gap examinations. Finally, we compute the three views of the objects with the orientation and classify them in a trained CNN with orthogonal view projections as input. Note that most 3D extensions of deep neural networks have been toward the temporal domain [40]. There are also a few attempts to incorporate multi-view information in deep models [108]. However, they focus on face recognition and reconstruction from multiple views.

Figure 7.1: System overview.

Therefore, the major contributions of our work are:

(1) We extend the deep learning approach from the 2D image to the 3D domain, by proposing several orthogonal-view-based architectures of CNN for the classification of 3D point cloud data. While our focus is on vehicles, the classification framework is generic and could be applied to any other 3D detection task.
(2) We construct a novel system for segmenting and detecting 3D objects, especially vehicles, in urban areas, including knowledge-based techniques such as curb removal and gap segmentation for vehicles in parking lots.

7.2 Large-Scale Connection Removal

7.2.1 Ground and Building Removal

Most points in the data are in fact ground points, which often connect everything together. We remove the ground through the following normal-based algorithm. Given the input point cloud P, we first compute the normal for each point based on the points lying in its neighborhood. We extract all points whose normals have a z component larger than a threshold θ_G. Then, we perform the region growing algorithm within these points with upright normals. The ground point set is defined as the union of all connected components with more than 5000 points, so as to avoid removing the top surfaces of the vehicles. The upper parts of the buildings that are more than 10 meters above the local ground level are also removed.

7.2.2 Curb Removal

Vehicles are typically seen either on the roads or in the parking lots. Many vehicles on the road are parked along the road side, namely along the curbs. Our approach removes the curbs so that the clusters along the curbs are disconnected from each other. In fact, the problem is similar to the linear segment detection problem. We apply principal component analysis on the points close to the ground level, and extract those points with linear properties, i.e., one eigenvalue is much larger than the other two.

7.3 Knowledge-based Segmentation

7.3.1 Adaptive Segmentation

Once the large-scale background is removed, it is natural to perform clustering. However, with a large margin the clusters would be too big, while a smaller margin could result in the over-segmentation of the vehicles where the scan is incomplete or sparse.
Therefore, we apply the adaptive segmentation method [39], which starts with a large threshold, clusters once, then multiplies the threshold by a decay factor of 0.9 and performs clustering again on the large clusters. This process is repeated until the number of points in each cluster is smaller than a predefined value.

Figure 7.2: Results for cluster orientation estimation. (a) Orientation estimation result for candidate clusters along the road. The red line is the principal axis and the blue line is the secondary axis. The estimation is quite close to human perception for the cars. (b) For clusters containing multiple cars (typically in the parking lots), the orientation can be estimated as well.

7.3.2 Orientation Estimation

With the previous steps, while few cars are wrongly removed, the vehicles in the parking lots are usually connected by noise. We notice that the normal distribution is maximized at three orthogonal directions, one of which is upright, for either a single vehicle or multiple adjacent vehicles in a parking lot or on the street. Therefore, we extract the maximum of the horizontal normal distribution histogram to solve the problem. Specifically, we compute the normals for all points in the cluster. Then, we project the ones that point in near-horizontal directions (n_z < 0.2) onto the horizontal plane. After that, we quantize the directions and make a histogram of the number of normals lying in each bin (the ones with opposite directions are added up). Finally, the bin with the maximum number of normals is treated as the principal bin, and the direction perpendicular to it as the secondary bin. The accurate principal direction is computed as the average of the normalized projected normals in the principal bin and the adjacent bins. Figure 7.2 shows the results of orientation estimation.

Figure 7.3: Results for gap segmentation. In (a), the large cluster is divided into 1×5 sub-regions.
Also, the ground that failed to be removed is correctly excluded in the result. In (b), the connected cars along the road are separated.

7.3.3 Gap Segmentation

Using the orientation information, we can now fit a more accurate bounding box and improve the segmentation result, since cars typically have rectangular shapes. Furthermore, when viewing from one of the estimated axes, gaps can be observed between nearby cars. Therefore, we develop the following gap detection method for each cluster: (1) Compute the principal and secondary axes; (2) Along each axis, compute the maximum local height within a quantization unit (interval); (3) Record whether the heights of the intervals are larger than a threshold; (4) If several consecutive intervals are higher than the threshold (bounded by intervals lower than the threshold), we consider it a possible block.

Then, we combine the feasible consecutive intervals. Specifically, if we have M and N feasible consecutive intervals in the principal and the secondary directions, respectively, we can generate M × N sub-regions from the region. Figure 7.3 shows some gap segmentation results.

Finally, the separated area is segmented again using a more accurate local orientation, since the orientation of a larger cluster could be slightly different from that of its sub-clusters. We would also like to remove areas outside the boxes, or boxes that are almost empty. This is done through the iterative version of the algorithm, which starts with the process of orientation estimation and gap segmentation, then filters the generated candidates and iterates the process on the remaining sub-clusters until no significant change is made.

7.4 Classification using OV-CNN

Once the vehicle candidates are segmented in the previous stages, we employ a learning-based classifier to distinguish between the vehicle clusters and the non-vehicle clusters, the latter mainly including facades, bushes, trees, poles and roofs.
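The interval-based gap detection of Section 7.3.3 can be sketched as follows. The function name and the height profile below are illustrative, not from the thesis:

```python
import numpy as np

def feasible_runs(heights, thresh):
    """Steps (3)-(4) of the gap detection in Section 7.3.3: given the maximum
    local height per quantization interval along one axis, return the (start,
    end) index ranges of consecutive above-threshold intervals, i.e. possible
    blocks bounded by below-threshold gaps."""
    runs, start = [], None
    for i, h in enumerate(heights):
        if h > thresh and start is None:
            start = i                      # a block begins
        elif h <= thresh and start is not None:
            runs.append((start, i))        # a gap ends the block
            start = None
    if start is not None:
        runs.append((start, len(heights)))
    return runs

# Height profile along the principal axis: two cars separated by a gap.
profile = np.array([0.1, 1.4, 1.5, 1.3, 0.1, 0.2, 1.6, 1.5, 0.1])
blocks = feasible_runs(profile, thresh=0.5)   # two blocks: (1, 4) and (6, 8)
```

With M such runs found on the principal axis and N on the secondary axis, the cluster is divided into M × N candidate sub-regions.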
7.4.1 Orthogonal View Projection

For each instance, we generate 3 images (n×n pixels) from 3 orthogonal views on the XY-plane, XH-plane and YH-plane. Note that we restrict each dimension to be no more than l meters. The intensity is proportional to the number of points lying in the corresponding bins. The center of the three views is located at (x̄, ȳ, z_min + l), where x̄ and ȳ are the averages of the coordinates. One of the major difficulties in 3D projection is that the direction is unknown. However, since we have estimated the principal and secondary orientations of the clusters in Section 7.3.2, we can directly employ them as the projection directions. A brief test showed that the classifier using the oriented projection performs better than the projection using world coordinates. This is reasonable since we focus on vehicles, which have well-defined orientations.

Figure 7.4: Orthogonal-View CNN with fusion at the fully-connected hidden layer, which performs the best among the three architectures. All networks have 6 layers.

7.4.2 Network Architecture

In terms of the layout of a single-view network, our work is based on the success of LeNet [49]. Several structures of CNN that fuse information from orthogonal views are explored.

CNN Combined with Voting. This is the simplest architecture, which counts the votes cast by the classification results of three identical CNNs on separate views. This acts as a baseline method for evaluating the effect of combining information from different views. The architecture can be represented as 3 × (C(n_c1, d_c1, f_c1) − P(n_p1) − C(n_c2, d_c2, f_c2) − P(n_p2) − FC(n_f1) − LR(n_f2)) − Σ_≥2(3).
Here C(n, d, f) represents a convolutional layer with input size n×n and d filters of size f×f; P(n) represents a max-pooling layer with input size n×n; FC(n) represents a fully-connected layer with input size n; LR(n) represents a logistic regression layer with input size n; and Σ(n) represents a simple summing operation over the outputs of the logistic regression layers. In other words, the final output is based on the sum of the outputs (0 or 1) of the three parallel single-view CNNs, and gets activated only when at least two single-view CNNs output 1. The CNNs are trained using all three views without knowing exactly from which direction each view is projected, and thus share identical weights.

Fused CNN at LR Layer. In this architecture, the outputs of the fully-connected layers do not explicitly decide the class of each view, but are instead concatenated and used as the input to one single final logistic regression layer. In this way, different views have the chance to affect each other in the very last step. Using the notation above, the architecture can be written as 3 × (C(n_c1, d_c1, f_c1) − P(n_p1) − C(n_c2, d_c2, f_c2) − P(n_p2) − FC(n_f1)) − LR(n_f2 × 3).

Fused CNN at Fully-Connected Hidden Layer. In this architecture, the outputs of the second max-pooling layers are not used in a per-view decision, but are concatenated and used as the input to one unified fully-connected hidden layer. The architecture can be written as 3 × (C(n_c1, d_c1, f_c1) − P(n_p1) − C(n_c2, d_c2, f_c2) − P(n_p2)) − FC(n_f1 × 3) − LR(n_f2) (Figure 7.4).

7.5 Experiments

7.5.1 Experimental Protocol

We evaluate our detection system on a large Lidar point cloud dataset of the urban area of Ottawa. The data are a fusion of one airborne scanner and four car-mounted scanners, which provides higher density along the streets but lower density far from the streets (e.g., in the parking lots).
We implement the networks using the Theano library [8] and train them with the Stochastic Gradient Descent method. The size of a mini-batch is 10 examples (with 30 views), and the learning rate is 0.1 with a decay rate of 0.95. The training and validation dataset, containing 436 examples, is annotated from segmented clusters outside the 50 blocks of the testing dataset. We implemented an automatic program to evaluate the methods. We manually annotated 50 blocks of data of 100m×100m area, containing 728 instances of cars and over 30 million points. If a detected instance has an overlap of 50% with a ground truth instance, we count it as a correct detection.

Method                     | TP | FP | Precision | Recall
Baseline                   | 58 | 39 | 54.2%     | 58.0%
Baseline+Curb              | 70 | 43 | 56.9%     | 70.0%
Baseline+Curb+Adaptive     | 88 | 89 | 47.1%     | 88.0%
Baseline+Curb+Adaptive+Gap | 93 | 89 | 48.4%     | 93.0%

Table 7.1: Comparison of segmentation results.

7.5.2 Evaluation of Segmentation

Table 7.1 demonstrates the progress of the segmentation methods, including the baseline method using ground removal and clustering, and three incremental additions (curb removal, adaptive segmentation and gap detection).

7.5.3 Parameter Selection

We experimented with a few parameter settings. For clarity, we only list the parameter evaluation for the OV-CNN fused at the FC layer, which is superior in the comparison. The parameter selection process for the other two networks is similar.

d_c1 \ d_c2 | 10    | 20    | 30    | 40
10          | 80.0% | 80.0% | 80.0% | 77.7%
20          | 80.8% | 83.8% | 82.3% | 80.8%
30          | 78.5% | 79.2% | 78.5% | 77.7%

Table 7.2: Evaluation results for different numbers of kernels in the two convolutional layers.

(d_c1/d_c2) \ n_c1 | 24    | 28    | 32    | 36
(20/20)            | 80.0% | 83.8% | 80.0% | 80.0%
(20/30)            | 78.5% | 82.3% | 80.8% | 70.8%

Table 7.3: Evaluation of various view patch sizes. n_c1 must be divisible by 4 to satisfy the constraints.

Kernels. We fix the kernel size to 5×5 and evaluate how different numbers of kernels affect the performance.
We can see from Table 7.2 that the best performance is achieved when d_c1 = d_c2 = 20.

Patch size. To determine the view patch size, we fix the two numbers of kernels that perform the best in the evaluation above. Table 7.3 shows how different patch sizes affect the performance. We can see that the best performance is achieved when the input view patch size is n_c1 = 28.

7.5.4 Evaluation of Classification Architectures

To evaluate the performance of the three architectures, we apply the best parameters selected in the previous subsection and compare the results. We also apply an SVM on the concatenated representation of the three views as the baseline method. From Table 7.4 we can see that, compared to the SVM, the voting-based CNN does not show better performance, while the two fused architectures give better results. The best OV-CNN is the one that fuses at the fully-connected hidden layer. This is reasonable since the connections among different orthogonal views are more likely to be modeled through the early-fused layers. The best architecture turns out to be 3 × (C(28,20,5) − P(24) − C(12,20,5) − P(8)) − FC(16×3) − LR(300).

Classifier | SVM   | Voting | LR-Fused | FC-Fused
Correct%   | 77.5% | 75.9%  | 80.0%    | 83.8%

Table 7.4: Comparison of the three CNN architectures and SVM.

Figure 7.5: Vehicle detection results. The vehicles are highlighted in red. (a) shows an example of vehicles parked along the street. (b) and (c) show examples of large parking lots. (d) shows that our method can handle both cars and trucks.

Figure 7.5 shows the final detection results.¹ The whole system is highly efficient. For one block of data (100m×100m), the segmentation step takes less than 5 minutes, and the classification step takes less than 1 minute.

¹ We have included a supplementary MP4 file which contains intermediate and final results of our method. This will be available at http://ieeexplore.ieee.org.
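To make the view-based pipeline concrete, here is a minimal sketch of the orthogonal view projection (Section 7.4.1) and the Σ_≥2(3) voting rule (Section 7.4.2). The helper names and parameters are illustrative, the cluster is assumed already rotated into the estimated principal/secondary/up frame, and the per-view CNN classifier is stubbed out:

```python
import numpy as np

def view_image(a, b, n=28, half=7.0):
    """One n x n orthogonal view: bin two coordinates of the cluster around
    their mean; intensity is proportional to the number of points per bin."""
    img, _, _ = np.histogram2d(a - a.mean(), b - b.mean(), bins=n,
                               range=[[-half, half], [-half, half]])
    return img

def three_views(P):
    """XY, XH and YH projections of a (k x 3) cluster, H being height."""
    return [view_image(P[:, 0], P[:, 1]),
            view_image(P[:, 0], P[:, 2]),
            view_image(P[:, 1], P[:, 2])]

def vote(per_view_outputs):
    """Sigma_{>=2}(3): the candidate is accepted as a vehicle only when at
    least two of the three single-view classifiers output 1."""
    return int(sum(per_view_outputs) >= 2)

P = np.random.default_rng(0).normal(size=(500, 3))  # stand-in cluster
views = three_views(P)
```

In the voting architecture, each view image would be fed to the shared-weight single-view CNN and the three 0/1 outputs passed to `vote`; the fused architectures instead concatenate intermediate activations of the three branches.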
Chapter 8

Scene Labeling using 3D Convolutional Neural Network

8.1 Introduction

Point cloud labeling is an important task in computer vision and object recognition. As a result of labeling, each point is assigned a predefined label, which further serves as a cue for scene analysis and understanding. The classes of interest include the most common objects in the urban scenario (Figure 8.1): large-scale planes (including ground and roofs), buildings, trees, cars, poles as well as wires.

Traditionally, hand-crafted features are widely used in existing methods [45,84,105]. However, recent progress in deep learning techniques shows that even the simplest features, like pixels, can be directly combined with neural networks and trained together. On the other hand, simply projecting 3D data to 2D representations such as depth images and then applying 2D techniques can easily lead to the loss of important structural information embedded in the 3D representation. Inspired by the success of deep learning on 2D image problems, we present a voxel-based, fully-3D Convolutional Neural Network for the point cloud labeling problem.

In most existing approaches, segmentation is a necessary step before performing tasks such as detection and classification [33]. Our method does not require prior knowledge, such as the segmentation of the ground and/or buildings, precomputed normals, etc. Everything is based on the voxelized data, which is a straightforward representation. From another point of view, our approach works as an end-to-end segmentation method. Our work shows the power of neural networks to capture, by themselves, the essential features needed for distinguishing different categories.

Despite the conceptually straightforward idea of representing point clouds as voxels, there are many underlying challenges. First, a dense voxel representation would quickly exceed the memory limit of any computer. Secondly, it would require too much time without proper optimization of the algorithm.
Also, the classifier could easily be biased towards some dominating categories (e.g., buildings) without deliberately balancing the training data.

Our contributions mainly include:

(1) We introduce a framework of 3D Convolutional Neural Networks (3D-CNN) and design effective algorithms for labeling complex 3D point cloud data.

(2) We present solutions for efficiently handling large data during the voxelization, training and testing of the 3D network.

Figure 8.1: Objects of interest. The buildings are colored orange, planes (including ground and roofs) are colored yellow, trees are colored green, cars are colored red, poles are colored blue, and wires are colored black. The points in the other categories are colored light gray.

8.2 System Overview

Our labeling system is depicted in Figure 8.2. The system is composed of an offline training module and an online testing module.

The offline training takes the annotated training data as input. The training data are parsed through a voxelization process that generates occupancy voxel grids centered at a set of keypoints. The keypoints are generated randomly, and their numbers are balanced across the different categories. The label of a voxel grid is decided by the dominating category in the cell around its keypoint. Then, the occupancy voxels and the labels are fed to a 3D Convolutional Neural Network, which is composed of two 3D convolutional layers, two 3D max-pooling layers, a fully-connected layer and a logistic regression layer (Section 8.4). The best parameters obtained during training are saved.

The online testing takes a raw point cloud without labels as input. The point cloud is parsed through a dense voxel grid, resulting in a set of occupancy voxels centered at every grid center. The voxels are then used as the input to the trained 3D convolutional network, and every voxel grid produces exactly one label.
The inferred labels are then mapped back to the original point cloud to produce a pointwise labeling result (Section 8.5).

Note that, due to the different requirements of the training and testing modules, their voxelization processes are quite different, apart from sharing parameters such as the grid size and voxel number. We discuss the details of voxelization in Section 8.3.

Figure 8.2: The labeling system pipeline, including the offline training module and the online testing module.

8.3 Voxelization

We turn the point cloud into 3D voxels through the following process. We first compute the bounding box of the whole point cloud. Then, we describe how the local voxelization is obtained once a center point from the point cloud is chosen. The choice of the center differs depending on whether we are in the training process or the testing process, and will be discussed in the experiment section.

Given the center point (x, y, z), we set up a cubic bounding box of radius R around it, i.e., [x−R, x+R] × [y−R, y+R] × [z−R, z+R]. Then, we subdivide the cube into an N×N×N grid of cells. In our experiment, R = 6 and N = 20, resulting in cells of size 0.3×0.3×0.3 and 8000 cells in total. We then go through the points that lie within the cubic box and assign them to cells by their integer indices. The result of a local voxelization is thus an 8000-dimensional vector.

There are several ways of computing the value of each cell. The simplest is to compute the occupancy value, i.e., the value becomes 1 if there is a point inside the cell, and 0 otherwise. A slightly more complicated version is to compute a density value, which can be realized by counting how many points lie within each cell. In our experiments, we find that the occupancy value is enough to generate a good result. By moving the center point around, we can generate a dictionary of different local voxelization results. Figure 8.3 shows the original input point cloud and the generated voxels. The process above is enough for the testing process.
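The local occupancy voxelization can be sketched as follows; this is a minimal illustration (the function name is ours), with the box geometry simply following the [x−R, x+R] × [y−R, y+R] × [z−R, z+R] definition above:

```python
import numpy as np

def occupancy_voxels(points, center, R=6.0, N=20):
    """Occupancy voxelization of Section 8.3: points inside the cubic box of
    radius R around `center` are binned into an N x N x N grid; a cell gets
    value 1 if at least one point falls inside it, 0 otherwise. Returns a
    flat N^3-dimensional vector (8000-dimensional for N = 20)."""
    rel = np.asarray(points, dtype=float) - np.asarray(center, dtype=float)
    inside = np.all(np.abs(rel) < R, axis=1)
    cell = 2.0 * R / N                                  # cell edge length
    idx = np.clip(((rel[inside] + R) / cell).astype(int), 0, N - 1)
    grid = np.zeros((N, N, N), dtype=np.uint8)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1           # occupancy, not density
    return grid.ravel()

# A single point at the center occupies exactly one central cell.
v = occupancy_voxels(np.array([[10.0, 20.0, 30.0]]), center=(10.0, 20.0, 30.0))
```

Replacing the assignment of 1 with a per-cell point count would give the density variant mentioned above.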
However, for training purposes, we need to provide a unique label for each generated voxel grid. We define the label of the entire voxel grid as the label of the cell around its center, i.e., [x−r, x+r]×[y−r, y+r]×[z−r, z+r]. In our experiment r = 0.3/2 = 0.15, so that this cell has the same size as the cells constructing the voxel grid. In most cases, points of multiple categories lie within a single cell. We apply a voting approach to decide the label of the cell: the category with the most points in the cell is treated as the representative category of the cell. In the rare case where two or more categories have an equal number of points, we pick one of them at random.

8.4 3D Convolutional Neural Network

After generating the voxels, we feed them to our 3D convolutional neural network. The following are the essential building blocks of the 3D CNN.

Figure 8.3: Illustration of dense voxelization. The input point cloud (a) is parsed through the voxelization process, which generates the dense voxel representation depicted in (b).

8.4.1 3D Convolutional Layer

A 3D convolutional layer can be represented as C(n, d, f), meaning a convolutional layer with input size n×n×n and d feature maps of size f×f×f. Formally, the output at position (x, y, z) on the m-th feature map of 3D convolutional layer l is

v_{lm}^{xyz} = b_{lm} + \sum_{q} \sum_{i=0}^{f-1} \sum_{j=0}^{f-1} \sum_{k=0}^{f-1} w_{lmq}^{ijk} \, v_{(l-1)q}^{(x+i)(y+j)(z+k)},    (8.1)

where b_{lm} is the bias of the feature map, q runs over the feature maps in the (l−1)-th layer, and w_{lmq}^{ijk} is the weight at position (i, j, k) of the kernel applied to the q-th feature map. The weights and the biases are obtained through the training process.

8.4.2 3D Pooling Layer

A 3D pooling layer can be represented as P(n, g), meaning a pooling layer with input size n×n×n and a pooling kernel of size g×g×g. In this approach, we use max pooling.
Formally, the output at position (x, y, z) on the m-th feature map of 3D max-pooling layer l is

v_{lm}^{xyz} = \max_{i,j,k \in \{0,1,\dots,g-1\}} v_{(l-1)m}^{(gx+i)(gy+j)(gz+k)}.    (8.2)

To increase nonlinearity, we use the hyperbolic tangent (tanh(·)) activation function after each pooling layer.

Figure 8.4: 3D Convolutional Neural Network. The numbers on the top denote the number of nodes in each layer. The input is the voxel grid of size 20³, followed by a convolutional layer with 20 feature maps of size 5×5×5 resulting in 20×16³ outputs, a max-pooling layer with 2×2×2 non-overlapping divisions resulting in 20×8³ outputs, a second convolutional layer with 20 feature maps of size 5×5×5 resulting in 20×4³ outputs, a second max-pooling layer with 2×2×2 non-overlapping divisions resulting in 20×2³ outputs, and a fully connected layer with 300 hidden nodes; the final output is based on a softmax over 8 labels (including the 7 categories and an empty label).

8.4.3 Network Layout

The layout of our network is based on the success of the 2D LeNet [49], which is composed of 2 convolutional layers, 2 pooling layers and 1 fully-connected layer. We replace the 2D convolutional layers and 2D pooling layers with 3D convolutional layers and 3D pooling layers, respectively, and obtain our architecture (Figure 8.4).

The architecture can be represented as C(n_{c1}, d_{c1}, f_{c1}) − P(n_{p1}, g_{p1}) − C(n_{c2}, d_{c2}, f_{c2}) − P(n_{p2}, g_{p2}) − FC(n_{f1}) − LR(n_{f2}), where FC(n) represents a fully-connected layer with input size n and LR(n) represents a logistic regression layer with input size n. The final output denoting the labeling result is an integer l ∈ {0, 1, 2, ..., L} produced by the softmax. In this case L = 7, representing the 7 categories specified in Figure 8.1, while the label 0 denotes the degenerate case where the central region of the voxel grid contains no point.

8.5 Label Inference

Given the trained network, we can perform voxel-level classification on the cells.
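Equations (8.1) and (8.2) can be sanity-checked with a direct, unoptimized NumPy forward pass. This is an illustrative sketch, not the Theano implementation used in this work; stacking the two operations reproduces the 20³ → 20×16³ → 20×8³ → 20×4³ → 20×2³ shape progression of Figure 8.4:

```python
import numpy as np

def conv3d(v_prev, w, b):
    """Eq. (8.1): v_prev has shape (Q, n, n, n) (Q input feature maps),
    w has shape (D, Q, f, f, f), b has shape (D,). Valid convolution
    (no padding); output shape (D, n-f+1, n-f+1, n-f+1)."""
    n, f = v_prev.shape[1], w.shape[2]
    D, m = w.shape[0], n - f + 1
    out = np.empty((D, m, m, m))
    for d in range(D):
        for x in range(m):
            for y in range(m):
                for z in range(m):
                    # sum over q, i, j, k of w[d,q,i,j,k] * v_prev[q, x+i, y+j, z+k]
                    patch = v_prev[:, x:x+f, y:y+f, z:z+f]
                    out[d, x, y, z] = b[d] + np.sum(w[d] * patch)
    return out

def maxpool3d(v, g):
    """Eq. (8.2): non-overlapping g x g x g max pooling per feature map."""
    D, n = v.shape[0], v.shape[1]
    m = n // g
    v = v[:, :m*g, :m*g, :m*g].reshape(D, m, g, m, g, m, g)
    return v.max(axis=(2, 4, 6))
```

Applying `np.tanh` after each pooling layer, as described above, completes one forward pass through the convolutional part of the network.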
We densely sample the center points of the voxel grids with a spacing of 0.3. Making the center spacing coincide with the cell size produces a non-overlapping, compact division of the whole space. Given the labeling result of a local voxel grid, we label all points in the cell at the center of the grid with the corresponding category.

This compact division labels every point in the data exactly once. However, it also restricts the precision of the labeling to the cell size. In our experiments, we find this granularity sufficient to produce quite a good result (see Section 8.6) with negligible boundary artifacts. In practice, when there are enough computational resources, we can shift the centers around, redo the classification, and finally employ a voting scheme to decide which label is assigned to each point.

8.6 Experiments

8.6.1 Dataset and Training Setup

The labeling system is evaluated on a large Lidar point cloud dataset of the urban area of Ottawa. The data come from a fusion of one airborne scanner and four car-mounted scanners. We implement the 3D convolutional network using the Theano library [8], and the network is trained with the Stochastic Gradient Descent method. The mini-batch size is 30 examples, and the learning rate is 0.1 with a decay rate of 0.95. We manually labeled a few chunks of the entire data, which could generate up to 500k features and labels. However, we only selected 50k of them as training data and 20k of them as validation data under a random and balanced scheme.

8.6.2 Balance of Training Samples

During training, we find that a dense sampling of the training data leads to an undesired bias towards common categories such as buildings, at the expense of less common categories such as wires. Therefore, we implement a balanced random sampling over the different categories of the training data.
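A minimal sketch of this balanced random sampling (the function name, per-class quota and NumPy Generator usage are illustrative assumptions, not the thesis code):

```python
import numpy as np

def balanced_keypoints(labels, per_class, seed=0):
    """Pick the same number of keypoint indices from every category so
    that common classes (e.g. buildings) cannot dominate training.

    labels : (M,) integer category label of each candidate keypoint
    returns: 1-D array of selected candidate indices
    """
    rng = np.random.default_rng(seed)
    chosen = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        take = min(per_class, idx.size)   # a rare class may offer fewer candidates
        chosen.append(rng.choice(idx, size=take, replace=False))
    return np.concatenate(chosen)
```

The selected indices then serve as the centers of the voxelized training regions.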
Specifically, we extract the same number of keypoints from each category as the centers of the voxelized regions. Experiments show that, despite the vastly uneven numbers of points per category in the real data, balancing the training data across categories contributes a key boost to the performance of the method.

8.6.3 Parameter Selection

We perform experiments with a few parameter settings. We fix the kernel size to 5×5×5 and evaluate how the numbers of kernels affect the performance. From Table 8.1 we can see that the best performance is achieved when d_{c1} = d_{c2} = 20; generally speaking, this parameter does not have a huge impact on the labeling result.

    d_{c1} \ d_{c2}    10       20       40
    20                 91.9%    93.0%    92.7%

Table 8.1: Comparison of different numbers of kernels in the two 3D convolutional layers.

8.6.4 Qualitative Result

Figure 8.6 shows the labeling result of a large area in the city of Ottawa produced by our approach, and Figure 8.7 shows close-ups. We can see that, although we do not have a segmentation step for this task, the result clearly distinguishes the labels of different objects.

8.6.5 Quantitative Result

We evaluate our approach by comparing the results against the ground-truth labels with an automatic evaluation program. The points are sorted according to their coordinates in O(n log n) time so that the labels can be compared in O(n) time, where n is the number of points.

Figure 8.5 shows the confusion matrix across the categories. The entry at the i-th row and the j-th column denotes the percentage of points of the j-th truth category that are classified as the i-th category, and the background of the cells is color-coded so that 1 is mapped to black, 0 is mapped to white, and anything in between is mapped to the corresponding gray value.
From the table we can see that the cars and planes are well classified, with accuracies higher than 95%, while buildings, poles and wires have accuracies between 80% and 90%. The accuracy for trees is a little below 80%, mainly due to confusion with the others category, which contains many scattered clusters such as humans and bushes. In some areas, parallel wires have a high density and lead to some confusion with the horizontal planes. The overall precision for point labeling over all categories is 93.0%.

Figure 8.5: Confusion matrix for different categories. The entry at the i-th row and the j-th column denotes the percentage of points of the j-th truth category that are classified as the i-th category.

Despite the high dimensionality of the voxel representation and the 3D CNN, our system is highly efficient in terms of both space and time. The training process takes around 2 hours on a PC with an NVIDIA GeForce GTX 980M GPU. For one chunk of data (100m×100m) with a voxel size of 0.3m×0.3m×0.3m, the voxelization process takes less than 5 minutes, and the classification step takes less than 3 minutes.

Figure 8.6: Labeling result for a large urban area through the 3D-CNN.

Figure 8.7: Close-ups of the labeling result. (a) The ground planes, buildings, trees and cars. (b) The street view with various poles including light poles, sign poles, utility poles and flag poles. (c) The street view with utility poles and wires. (d) The parking lot scene.

Chapter 9 Conclusion and Future Work

9.1 Conclusion

This research investigates the problem of object detection and recognition from 3D point clouds. We explored two different strategies, i.e., the matching-based approach and the learning-based approach, to solve the problem on both the industrial dataset and the urban dataset.

In Chapter 3, we extended the 2D self-similarity descriptor to the 3D spatial domain. The new feature-based 3D descriptor is invariant to scale and orientation changes.
We carefully examine the design principles of the descriptor, and propose the new 3D self-similarity descriptor with improved robustness to empty regions and noise. The discussion and comparison will hopefully benefit the future design of general 3D descriptors. We also extensively compare the performance of different 3D shape descriptors in terms of dimensionality, running time, descriptor quality, precision and recall for point-wise matching, and recognition rate. The results show that the 3D self-similarity descriptor, especially the one based on normal-similarity, achieves better or equally good results compared to state-of-the-art descriptors under various transformations and for different applications, with low time and memory consumption. Since meshes or surfaces can be sampled and transformed into point cloud or voxel representations, the method can easily be adapted to matching models to point clouds, or models to models.

In Chapter 4, we present an object detection framework for 3D scene point clouds, using a combinational approach containing SVM-based point classification, segmentation, clustering, filtering, the 3D self-similarity descriptor and rigid-body RANSAC. The SVM+FPFH method for pipe/plane/edge point classification gives a nice illustration of how a descriptor can be combined with training methods. The adaptive segmentation algorithm greatly reduces the burden of matching by limiting the size of the clusters. Applying two local descriptors (FPFH and 3D Self-Similarity) in different phases of processing shows that different descriptors can each be exploited best under different circumstances. The proposed variant of RANSAC incorporating the rigid-body constraint also shows how prior knowledge can be built into the system. Moreover, by applying the correspondence selection algorithm with the overlapping score criterion, the point cloud matching module has become robust enough that most pairs of well-segmented clouds of the same category of objects can be matched properly.
The experimental results show the effectiveness of our method, especially for large cluttered industrial scenes.

Based on the object detection system, we further define and address the problem of object-level change detection in point clouds in Chapter 5. We present a new approach in which global alignment, local object detection and object change inference are combined to achieve robust detection of changes. In particular, we propose change evaluation functions for pairwise change estimation, and use weighted bipartite matching to solve the many-to-many object correlation problem. As far as we know, this is the first work addressing the 3D change detection problem in complex industrial scenes at the object level. In the future, the scheme could be extended to detect changes of large-scale non-part objects such as pipes and planes.

The matching-based approach works well for point clouds with rich details and high intra-class correlation, e.g., industrial parts or different scans of the same object. However, for urban objects such as vehicles, poles and trees, the intra-class variation is much higher, thus requiring a different strategy. In particular, we find learning-based approaches to be very effective and to generalize well. We propose an efficient and easy-to-implement algorithm for pole detection from 3D scanned point clouds of urban areas in Chapter 6. We introduce a series of slicing, combination and filtering strategies, and propose a five-class point classification method to help validate as well as classify the poles. There are some possible improvements for future work. For example, a plane removal step could be incorporated into the pre-processing. Moreover, as indicated in the failure case analysis, contextual information could be beneficial for identifying the outliers. Also, we are working on a pole classification method that can further classify the poles into more detailed subcategories.
We introduce deep learning techniques into our research first by developing a CNN that can deal with 3D point cloud classification using orthogonal views in Chapter 7. Combining it with knowledge-based segmentation techniques, we can efficiently perform vehicle detection from urban point clouds. We evaluate and demonstrate the performance of our method through detailed experiments. Although this work focuses on vehicles, the OV-CNN architecture could easily be applied to other 3D object detection and classification tasks.

Finally, we propose a 3D point cloud labeling system based on a fully 3D Convolutional Neural Network in Chapter 8. Our approach does not need prior knowledge for segmentation. In the future, there are a few directions we can explore. First, the current labels can be used as input to the network again, which resembles the idea of recursive neural networks. Also, a multi-resolution version of the network might work even better for objects with large scale variance.

9.2 Future Work

Possible future work lies in the following directions:

Combination of Matching and Learning. We have mainly discussed two different kinds of strategies: the matching-based strategy and the learning-based strategy. In complicated scenarios, it would be better if we could combine the two strategies. In fact, we have already made some attempts in this direction, by combining learning-based point classification with matching-based detection. However, there are many other possibilities. For example, we could first use a learning-based approach to detect the possible locations of the objects, and then apply a matching-based approach to refine the detection results as well as obtain pose information. Another possibility is to learn how to perform the matching from training examples drawn from existing matching results. This requires a good abstraction and representation of the matching process.
Cooperative Alignment and Detection in Change Detection. In the current framework of change detection, the global alignment and the local detection are two independent stages. In the future, the scheme could be extended so that the two stages benefit from each other: (1) Enhancement of global alignment. The accuracy of the global alignment largely affects the accuracy of the whole system; therefore, it is vital to ensure a good global alignment. The object detection module does not rely on the global alignment; moreover, it can provide information that helps the initial setup of the point clouds, to which the global alignment is sensitive. The first idea is to align the backbone objects separately. Note that ground planes, which contain many points, typically provide only one orientation cue, while the pipes, acting as the skeleton of an industrial site, might be easier to align. This is more significant in the case where the ground is present in only one of the two datasets. The second idea is to use the detected objects as hyper-keypoints, while their categories can be viewed as a one-dimensional descriptor. The global alignment then turns into a hyper-matching problem among the objects. (2) Enhancement of object detection. Conversely, object detection could benefit from the global alignment as well, especially when we are not able to perform a perfect detection on either dataset due to noise and occlusions. For each object detected in one dataset, we can search more thoroughly in the nearby area of the other dataset after the two have been globally aligned.

Enhancing the 3D CNN by Using Recursive Networks. Currently the 3D CNN generates the point labels in one single pass. However, we can consider this label output as the input of another 3D CNN, and refine the labels again. This makes sense because smoothness information has proven to be helpful in many labeling problems.
There have been some works using RNNs on 2D problems, so 3D RNNs and/or similar structures are a promising direction.

Applying 3D CNNs to Broader 3D Vision Tasks. We have employed the 3D CNN in the point cloud labeling problem, which classifies all the points into one of seven common categories. This approach has a lot of potential. For example, we can use another set of training data whose labels are the bounding boxes of the objects, and use this framework for detection. Such extensions would again demonstrate the flexibility and generalizability of our 3D CNN framework.

References

[1] Luís A Alexandre. 3d descriptors for object and category recognition: a comparative evaluation. In Workshop on Color-Depth Camera Fusion in Robotics at the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vilamoura, Portugal. Citeseer, 2012.

[2] Dragomir Anguelov, Ben Taskar, Vassil Chatalbashev, Daphne Koller, Dinkar Gupta, Geremy Heitz, and Andrew Ng. Discriminative learning of markov random fields for segmentation of 3d scan data. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 2, pages 169–176. IEEE, 2005.

[3] T Aschoff and H Spiecker. Algorithms for the automatic detection of trees in laser scanner data. International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences, 36(Part 8):W2, 2004.

[4] Olivier Barnich and Marc Van Droogenbroeck. Vibe: A universal background subtraction algorithm for video sequences. Image Processing, IEEE Transactions on, 20(6):1709–1724, 2011.

[5] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. In Computer Vision–ECCV 2006, pages 404–417. Springer, 2006.

[6] Jens Behley, Kristian Kersting, Dirk Schulz, Volker Steinhage, and Armin B Cremers. Learning to hash logistic regression for fast 3d scan point classification. In Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International Conference on, pages 5960–5965.
IEEE, 2010.

[7] Serge Belongie, Jitendra Malik, and Jan Puzicha. Shape matching and object recognition using shape contexts. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 24(4):509–522, 2002.

[8] James Bergstra, Frédéric Bastien, Olivier Breuleux, Pascal Lamblin, Razvan Pascanu, Olivier Delalleau, Guillaume Desjardins, David Warde-Farley, Ian Goodfellow, Arnaud Bergeron, et al. Theano: Deep learning on gpus with python. In NIPS 2011, BigLearning Workshop, Granada, Spain, 2011.

[9] Paul J Besl and Neil D McKay. Method for registration of 3-d shapes. In Robotics-DL tentative, pages 586–606. International Society for Optics and Photonics, 1992.

[10] Sofien Bouaziz, Andrea Tagliasacchi, and Mark Pauly. Sparse iterative closest point. In Computer Graphics Forum, volume 32, pages 113–123. Wiley Online Library, 2013.

[11] Edmond Boyer, Alexander M Bronstein, Michael M Bronstein, Benjamin Bustos, Tal Darom, Radu Horaud, Ingrid Hotz, Yosi Keller, Johannes Keustermans, Artiom Kovnatsky, et al. Shrec 2011: robust feature detection and description benchmark. In Proceedings of the 4th Eurographics conference on 3D Object Retrieval, pages 71–78. Eurographics Association, 2011.

[12] Alexander M Bronstein, Michael M Bronstein, and Ron Kimmel. Numerical geometry of non-rigid shapes. Springer, 2008.

[13] AM Bronstein, MM Bronstein, B Bustos, U Castellani, M Crisani, B Falcidieno, LJ Guibas, I Kokkinos, V Murino, M Ovsjanikov, et al. Shrec 2010: robust feature detection and description benchmark. Proc. 3DOR, 2(5):6, 2010.

[14] Michael M Bronstein and Iasonas Kokkinos. Scale-invariant heat kernel signatures for non-rigid shape recognition. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 1704–1711. IEEE, 2010.

[15] Chih-Chung Chang and Chih-Jen Lin. Libsvm: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.

[16] Ken Chatfield, James Philbin, and Andrew Zisserman.
Efficient retrieval of deformable shape classes using local self-similarities. In Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th International Conference on, pages 264–271. IEEE, 2009.

[17] Hui Chen and Bir Bhanu. 3d free-form object recognition in range images using local surface patches. Pattern Recognition Letters, 28(10):1252–1262, 2007.

[18] Chin Seng Chua and Ray Jarvis. Point signatures: A new representation for 3d object recognition. International Journal of Computer Vision, 25(1):63–85, 1997.

[19] Ondrej Chum and Jiri Matas. Matching with prosac-progressive sample consensus. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 220–226. IEEE, 2005.

[20] Ondrej Chum and Jiří Matas. Optimal randomized ransac. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 30(8):1472–1482, 2008.

[21] Nico Cornelis, Bastian Leibe, Kurt Cornelis, and Luc Van Gool. 3d urban scene modeling integrating recognition and reconstruction. International Journal of Computer Vision, 78(2-3):121–141, 2008.

[22] Jérôme Demantké, Clément Mallet, Nicolas David, and Bruno Vallet. Dimensionality based scale selection in 3d lidar point clouds. International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences, Laser Scanning, 2011.

[23] Petr Doubek, Michal Perdoch, Jiří Matas, and J Sochman. Mobile mapping of vertical traffic infrastructure. In Proceedings of the 13th Computer Vision Winter Workshop, pages 115–122, 2008.

[24] Bertrand Douillard, James Underwood, Noah Kuntz, Vsevolod Vlaskine, Alastair Quadros, Peter Morton, and Alon Frenkel. On the segmentation of 3d lidar point clouds. In Robotics and Automation (ICRA), 2011 IEEE International Conference on, pages 2798–2805. IEEE, 2011.

[25] Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography.
Communications of the ACM, 24(6):381–395, 1981.

[26] A Flint, A Dick, and A Van den Hengel. Local 3d structure recognition in range images. IET Computer Vision, 2(4):208–217, 2008.

[27] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Computational learning theory, pages 23–37. Springer, 1995.

[28] Andrea Frome, Daniel Huber, Ravi Kolluri, Thomas Bülow, and Jitendra Malik. Recognizing objects in range data using regional point descriptors. In Computer Vision–ECCV 2004, pages 224–237. Springer, 2004.

[29] D Girardeau-Montaut, M Roux, R Marc, and G Thibault. Change detection on points cloud data acquired with a ground laser scanner. International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences, 36(part 3):W19, 2005.

[30] Aleksey Golovinskiy and Thomas Funkhouser. Min-cut based segmentation of point clouds. In Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th International Conference on, pages 39–46. IEEE, 2009.

[31] Aleksey Golovinskiy, Vladimir G Kim, and Thomas Funkhouser. Shape-based recognition of 3d point clouds in urban environments. In Computer Vision, 2009 IEEE 12th International Conference on, pages 2154–2161. IEEE, 2009.

[32] Nil Goyette, Pierre-Marc Jodoin, Fatih Porikli, Janusz Konrad, and Prakash Ishwar. Changedetection.net: A new change detection benchmark dataset. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on, pages 1–8. IEEE, 2012.

[33] Danilo Habermann, Alberto Hata, Denis Wolf, and Fernando Santos Osorio. Artificial neural nets object recognition for 3d point clouds. In Intelligent Systems (BRACIS), 2013 Brazilian Conference on, pages 101–106. IEEE, 2013.

[34] Olympia Hadjiliadis and Ioannis Stamos. Sequential classification in point clouds of urban scenes. In Proc. 3DPVT, 2010.

[35] Chris Harris and Mike Stephens. A combined corner and edge detector.
In Alvey vision conference, volume 15, page 50. Manchester, UK, 1988.

[36] Michael Himmelsbach, Andre Müller, Thorsten Lüttel, and Hans-Joachim Wünsche. Lidar-based 3d object perception. In Proceedings of 1st international workshop on cognition for technical systems, volume 1, 2008.

[37] Jing Huang and Suya You. Point cloud matching based on 3d self-similarity. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on, pages 41–48. IEEE, 2012.

[38] Jing Huang and Suya You. Detecting objects in scene point cloud: A combinational approach. In 3DTV-Conference, 2013 International Conference on, pages 175–182. IEEE, 2013.

[39] Jing Huang and Suya You. Segmentation and matching: Towards a robust object detection system. In Winter Conference on Applications of Computer Vision (WACV), 2014.

[40] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3d convolutional neural networks for human action recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(1):221–231, 2013.

[41] Andrew Edie Johnson and Martial Hebert. Recognizing objects by matching oriented points. In Computer Vision and Pattern Recognition, 1997. Proceedings., 1997 IEEE Computer Society Conference on, pages 684–689. IEEE, 1997.

[42] Zhizhong Kang and Zhao Lu. The change detection of building models using epochs of terrestrial point clouds. In Multi-Platform/Multi-Sensor Remote Sensing and Mapping (M2RSM), 2011 International Workshop on, pages 1–6. IEEE, 2011.

[43] Michael Kazhdan, Thomas Funkhouser, and Szymon Rusinkiewicz. Rotation invariant spherical harmonic representation of 3d shape descriptors. In Proceedings of the 2003 Eurographics/ACM SIGGRAPH symposium on Geometry processing, pages 156–164. Eurographics Association, 2003.

[44] Jan Knopp, Mukta Prasad, Geert Willems, Radu Timofte, and Luc Van Gool. Hough transform and 3d surf for robust three dimensional classification. In Computer Vision–ECCV 2010, pages 589–602. Springer, 2010.
[45] Hema S Koppula, Abhishek Anand, Thorsten Joachims, and Ashutosh Saxena. Semantic labeling of 3d point clouds for indoor scenes. In Advances in Neural Information Processing Systems, pages 244–252, 2011.

[46] Marcel Körtgen, Gil-Joo Park, Marcin Novotni, and Reinhard Klein. 3d shape matching with 3d shape contexts. In The 7th central European seminar on computer graphics, volume 3, pages 5–17, 2003.

[47] Rainer Kummerle, Michael Ruhnke, Bastian Steder, Cyrill Stachniss, and Wolfram Burgard. A navigation system for robots operating in crowded urban environments. In Robotics and Automation (ICRA), 2013 IEEE International Conference on, pages 3225–3232. IEEE, 2013.

[48] Jean-François Lalonde, Nicolas Vandapel, Daniel F Huber, and Martial Hebert. Natural terrain classification using three-dimensional ladar data for ground robot mobility. Journal of Field Robotics, 23(10):839–861, 2006.

[49] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[50] Matti Lehtomäki, Anttoni Jaakkola, Juha Hyyppä, Antero Kukko, and Harri Kaartinen. Detection of vertical pole-like objects in a road environment using vehicle-based laser scanning data. Remote Sensing, 2(3):641–664, 2010.

[51] David G Lowe. Object recognition from local scale-invariant features. In Computer vision, 1999. The proceedings of the seventh IEEE international conference on, volume 2, pages 1150–1157. IEEE, 1999.

[52] De-an Luo and Yan-min Wang. Rapid extracting pillars by slicing point clouds. In Proc. XXI ISPRS Congress, IAPRS, volume 37, pages 215–218. Citeseer, 2008.

[53] J-F Mas. Monitoring land-cover changes: a comparison of change detection techniques. International journal of remote sensing, 20(1):139–152, 1999.

[54] Bogdan Matei, Ying Shan, Harpreet S Sawhney, Yi Tan, Rakesh Kumar, Daniel Huber, and Martial Hebert.
Rapid object indexing using locality sensitive hashing and joint 3d-signature space estimation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 28(7):1111–1126, 2006.

[55] Daniel Maturana and Sebastian Scherer. 3d convolutional neural networks for landing zone detection from lidar. In Robotics and Automation (ICRA), 2015 IEEE International Conference on, pages 3471–3478. IEEE, 2015.

[56] Daniel Maturana and Sebastian Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 922–928. IEEE, 2015.

[57] Jasna Maver. Self-similarity and points of interest. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 32(7):1211–1226, 2010.

[58] Ajmal Mian, Mohammed Bennamoun, and R Owens. On the repeatability and quality of keypoints for local feature-based 3d object retrieval from cluttered scenes. International Journal of Computer Vision, 89(2-3):348–361, 2010.

[59] Ajmal S Mian, Mohammed Bennamoun, and Robyn Owens. Three-dimensional model-based object recognition and segmentation in cluttered scenes. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 28(10):1584–1601, 2006.

[60] Ajmal S Mian, Mohammed Bennamoun, and Robyn A Owens. A novel representation and feature matching algorithm for automatic pairwise registration of range images. International Journal of Computer Vision, 66(1):19–40, 2006.

[61] Krystian Mikolajczyk and Cordelia Schmid. A performance evaluation of local descriptors. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(10):1615–1630, 2005.

[62] John Novatnack and Ko Nishino. Scale-dependent/invariant local 3d shape descriptors for fully automatic registration of multiple sets of range images. In Computer Vision–ECCV 2008, pages 440–453. Springer, 2008.

[63] Andreas Nüchter, Hartmut Surmann, and Joachim Hertzberg. Automatic classification of objects in 3d laser range scans. In Proc. 8th Conf.
on Intelligent Autonomous Systems, page 963, 2004.

[64] Alexander Patterson IV, Philippos Mordohai, and Kostas Daniilidis. Object detection from large-scale 3d datasets using bottom-up and top-down descriptors. In Computer Vision–ECCV 2008, pages 553–566. Springer, 2008.

[65] Pedro HO Pinheiro and Ronan Collobert. Recurrent convolutional neural networks for scene parsing. arXiv preprint arXiv:1306.2795, 2013.

[66] P Press and David Austin. Approaches to pole detection using ranged laser data. In Proceedings of Australasian Conference on Robotics and Automation. Citeseer, 2004.

[67] Danil Prokhorov. A convolutional learning system for object classification in 3-d lidar data. Neural Networks, IEEE Transactions on, 21(5):858–863, 2010.

[68] Shi Pu, Martin Rutzinger, George Vosselman, and Sander Oude Elberink. Recognizing basic structures from mobile laser scanning data for road inventory studies. ISPRS Journal of Photogrammetry and Remote Sensing, 66(6):S28–S39, 2011.

[69] Rongqi Qiu, Qian-Yi Zhou, and Ulrich Neumann. Pipe-run extraction and reconstruction from point clouds. In Computer Vision–ECCV 2014, pages 17–30. Springer, 2014.

[70] Richard J Radke, Srinivas Andra, Omar Al-Kofahi, and Badrinath Roysam. Image change detection algorithms: a systematic survey. Image Processing, IEEE Transactions on, 14(3):294–307, 2005.

[71] Rahul Raguram, Jan-Michael Frahm, and Marc Pollefeys. A comparative analysis of ransac techniques leading to adaptive real-time random sample consensus. In Computer Vision–ECCV 2008, pages 500–513. Springer, 2008.

[72] Arnau Ramisa, Guillem Alenya, Francesc Moreno-Noguer, and Carme Torras. Finddd: A fast 3d descriptor to characterize textiles for robot manipulation. In Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on, pages 824–830. IEEE, 2013.

[73] Salvador Ruiz-Correa, Linda G Shapiro, and Marina Melia. A new signature-based method for efficient 3-d object recognition.
In Computer Vision and Pattern Recog- nition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Con- ference on, volume 1, pages I–769. IEEE, 2001. [74] RaduBogdanRusu,NicoBlodow,andMichaelBeetz. Fastpointfeaturehistograms (fpfh) for 3d registration. In Robotics and Automation, 2009. ICRA'09. IEEE International Conference on, pages 3212–3217. IEEE, 2009. [75] Radu Bogdan Rusu, Gary Bradski, Romain Thibaux, and John Hsu. Fast 3d recognition and pose using the viewpoint feature histogram. In Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International Conference on, pages 2155– 2162. IEEE, 2010. [76] Radu Bogdan Rusu and Steve Cousins. 3d is here: Point cloud library (pcl). In Robotics and Automation (ICRA), 2011 IEEE International Conference on, pages 1–4. IEEE, 2011. [77] Radu Bogdan Rusu, Zoltan Csaba Marton, Nico Blodow, and Michael Beetz. Per- sistent point feature histograms for 3d point clouds. In Proc 10th Int Conf Intel Autonomous Syst (IAS-10), Baden-Baden, Germany, pages 119–128, 2008. [78] RaduBogdanRusu,ZoltanCsabaMarton,NicoBlodow,MihaiDolha,andMichael Beetz. Towards 3d point cloud based object maps for household environments. Robotics and Autonomous Systems, 56(11):927–941, 2008. [79] Samuele Salti, Federico Tombari, and Luigi Di Stefano. A performance evalua- tion of 3d keypoint detectors. In 3D Imaging, Modeling, Processing, Visualization and Transmission (3DIMPVT), 2011 International Conference on, pages 236–243. IEEE, 2011. [80] Ruwen Schnabel, Roland Wahl, and Reinhard Klein. Efficient ransac for point- cloud shape detection. In Computer graphics forum, volume 26, pages 214–226. Wiley Online Library, 2007. [81] Roman Shapovalov and Alexander Velizhev. Cutting-plane training of non- associative markov network for 3d point cloud segmentation. In 3D Imaging, Mod- eling, Processing, Visualization and Transmission (3DIMPVT), 2011 International Conference on, pages 1–8. IEEE, 2011. [82] Eli Shechtman and Michal Irani. 
Matching local self-similarities across images and videos. In Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on, pages 1–8. IEEE, 2007. [83] Philip Shilane, Patrick Min, Michael Kazhdan, and Thomas Funkhouser. The princeton shape benchmark. In Shape Modeling Applications, 2004. Proceedings, pages 167–178. IEEE, 2004. [84] Shuran Song and Jianxiong Xiao. Sliding shapes for 3d object detection in depth images. In Computer Vision{ECCV 2014, pages 634–651. Springer, 2014. 182 [85] Shuran Song and Jianxiong Xiao. Deep sliding shapes for amodal 3d object detec- tion in rgb-d images. arXiv preprint arXiv:1511.02300, 2015. [86] Bastian Steder, Giorgio Grisetti, Mark Van Loock, and Wolfram Burgard. Robust on-line model-based object detection from range images. In Intelligent Robots and Systems, 2009. IROS 2009. IEEE/RSJ International Conference on, pages 4739– 4744. IEEE, 2009. [87] Bastian Steder, Radu Bogdan Rusu, Kurt Konolige, and Wolfram Burgard. Point feature extraction on 3d range scans taking into account object boundaries. In Robotics and automation (icra), 2011 ieee international conference on, pages 2601– 2608. IEEE, 2011. [88] Fridtjof Stein and G´ erard Medioni. Structural indexing: Efficient 3-d object recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2):125–145, 1992. [89] W StraBer. Schnelle Kurven-und Flachendarstellung auf graphischen Sichtgeraten. PhD thesis, Ph. D.-Thesis, 1974. [90] JianSun, MaksOvsjanikov, andLeonidasGuibas. Aconciseandprovablyinforma- tive multi-scale signature based on heat diffusion. In Computer Graphics Forum, volume 28, pages 1383–1392. Wiley Online Library, 2009. [91] Yiyong Sun and Mongi A Abidi. Surface matching by 3d point’s fingerprint. In Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Con- ference on, volume 2, pages 263–269. IEEE, 2001. [92] Federico Tombari, Samuele Salti, and Luigi Di Stefano. Unique shape context for 3d data description. 
In Proceedings of the ACM workshop on 3D object retrieval, pages 57–62. ACM, 2010. [93] Federico Tombari, Samuele Salti, and Luigi Di Stefano. Unique signatures of his- tograms for local surface description. In Computer Vision{ECCV 2010, pages 356– 369. Springer, 2010. [94] Ben J Tordoff and David W Murray. Guided-mlesac: Faster image transform esti- mationbyusingmatchingpriors. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(10):1523–1535, 2005. [95] PhilipHSTorrandAndrewZisserman. Mlesac: Anewrobustestimatorwithappli- cation to estimating image geometry. Computer Vision and Image Understanding, 78(1):138–156, 2000. [96] Alexander Velizhev, Roman Shapovalov, and Konrad Schindler. Implicit shape models for object detection in 3d point clouds. Proc. ISPRS Congr, pages 1–6, 2012. 183 [97] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, volume 1, pages I–511. IEEE, 2001. [98] GeorgeVosselman,BenGHGorte,GeorgeSithole,andTahirRabbani. Recognising structure in laser scanner point clouds. International archives of photogrammetry, remote sensing and spatial information sciences, 46(8):33–38, 2004. [99] Volker Walter. Object-based classification of remote sensing data for change detec- tion. ISPRS Journal of photogrammetry and remote sensing, 58(3):225–238, 2004. [100] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1912–1920, 2015. [101] Wen Xiao, Bruno Vallet, and Nicolas Paparoditis. Change detection in 3d point clouds acquired by a mobile mapping system. ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences, 1(2):331–336, 2013. 
[102] Xuehan Xiong and Daniel Huber. Using context to create semantic 3d models of indoor environments. In BMVC, pages 1–11, 2010. [103] Wei Yao, Stefan Hinz, and Uwe Stilla. Automatic vehicle extraction from airborne lidar data of urban areas aided by geodesic morphology. Pattern Recognition Let- ters, 31(10):1100–1108, 2010. [104] WeiYaoandUweStilla. Comparisonoftwomethodsforvehicleextractionfromair- borne lidar data toward motion analysis. Geoscience and Remote Sensing Letters, IEEE, 8(4):607–611, 2011. [105] Hiroki Yokoyama, Hiroaka Date, Satoshi Kanai, and Hiroshi Takeda. Detection and classification of pole-like objects from mobile laser scanning data of urban environments. International Journal of CAD/CAM, 13(2), 2013. [106] Jixian Zhang, Minyan Duan, Qin Yan, and Xiangguo Lin. Automatic vehicle ex- tractionfromairbornelidardatausinganobject-basedpointcloudanalysismethod. Remote Sensing, 6(9):8405–8423, 2014. [107] Yu Zhong. Intrinsic shape signatures: A shape descriptor for 3d object recognition. In Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th International Conference on, pages 689–696. IEEE, 2009. [108] ZhenyaoZhu,PingLuo,XiaogangWang,andXiaoouTang. Multi-viewperceptron: a deep model for learning face identity and view representations. In Advances in Neural Information Processing Systems, pages 217–225, 2014. 184