3D Object Detection in Industrial Site Point Clouds

by

Guan Pang

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

August 2016

Acknowledgement

I would like to express my sincere thanks to my advisor, Prof. Ulrich Neumann. His insightful advice and knowledge have guided me through my six years of study and research towards my defense and dissertation, and will continue to guide me in my future career. I would like to thank Prof. Suya You for his invaluable suggestions on many of my research works. I would like to thank Prof. Aiichiro Nakano, Prof. C.-C. Jay Kuo and Prof. Hao Li for taking their precious time to serve on the committee of my qualifying exam and/or thesis defense. I would like to thank my labmates Jing Huang and Rongqi Qiu for their inspiring discussions, and all current and previous CGIT lab members I have worked with or shared knowledge with. I am grateful to Prof. Xinggang Lin and Prof. Guijin Wang from Tsinghua University for their guidance and advice on my early research work, which prepared me for PhD study.

My PhD study and research were supported by Chevron Corp. under the joint project Center for Interactive Smart Oilfield Technologies (CiSoft) at the University of Southern California. I would like to thank Amir Anvar, Christopher Fisher, Michael Brandon Casey and Lanre Olabinjo from Chevron, and Sorin Marghitoiu from CiSoft, for their constructive input and help.

A very special thanks to my parents for their constant care and encouragement, and to my fiancee Chuanchuan Tang for her patience and love in accompanying me through this journey. I am very grateful for all the support and help I have received, and I will carry it into my future career with dedication and determination.

Contents

1 Introduction
2 Related Work
2.1 3D Object Recognition in Point Clouds
2.2 3D Point Cloud Modeling
2.3 CNN for Object Classification and Detection
2.4 Image Matching based on Keypoint Descriptors
3 Multi-Modal Image Matching with the Gixel Array Descriptor
3.1 Introduction to Multi-Modal Image Matching
3.2 Introduction to Gixel
3.2.1 Additive Edge Scoring
3.2.2 Circular Gixel Array
3.2.3 Descriptor Localization and Matching
3.3 Analysis of Advantages
3.3.1 Smoothness
3.3.2 Multi-Modal Matching
3.4 Experiments
3.4.1 Evaluation Method
3.4.2 Multi-Modal Matching
3.4.3 Rotation and Scale Invariance
3.4.4 Public Dataset
3.4.5 Processing Time
3.5 Summary
4 Training-based Object Detection in Cluttered 3D Point Clouds
4.1 Algorithm Introduction
4.1.1 Detector Training and Detection
4.1.2 Pre-Processing
4.1.3 Features
4.2 Analysis
4.2.1 False Alarm Reduction
4.2.2 Speed-Up Processing Time
4.2.3 Rotation and Scale
4.3 Experiments
4.3.1 Industrial Data
4.3.2 Street Data
4.3.3 Public Data
4.3.4 Run-Time Statistics
4.4 Summary
5 Application in Automatic 3D Industrial Point Cloud Modeling
5.1 Individual Modules
5.1.1 Pipe-Run Modeling
5.1.2 SVM-based Plane Point Classification
5.2 Integration Issues and Solutions
5.2.1 Object Alignment
5.2.2 Pipe Generation in Gaps
5.2.3 Flange Detection
5.2.4 Other System Features
5.3 Results
5.3.1 Display Modes
5.3.2 Comparison
5.3.3 Complete Reconstructed Scene
5.4 Summary
6 3D Object Detection by 2D Multi-View Projection
6.1 Algorithm Introduction
6.1.1 Multi-View Recognition Framework
6.1.2 3D to 2D Projection
6.1.3 2D Object Detection
6.1.4 2D to 3D Results Re-Projection
6.2 Optimization
6.2.1 Speed-Up with Binary Conversion
6.2.2 Robustness with Smoothed Processing
6.3 Analysis
6.3.1 Rotation Invariance
6.3.2 Multi-View Stability
6.3.3 Depth Section in Cluttered Scene
6.4 Experiments
6.4.1 Industrial Data
6.4.2 Street Data
6.4.3 Comparative Experiments Setting
6.4.4 Precision-Recall Evaluation
6.4.5 Time Efficiency Evaluation
6.5 Summary
7 3D Object Detection with Multi-View Convolutional Neural Network
7.1 3D Object Detection with CNN
7.1.1 Multi-View 3D Object Detection
7.1.2 Detect 2D Projections with CNN
7.1.3 Training Sample Generation
7.2 Concatenated Networks
7.2.1 Concatenated CNN for Fast Negative Rejection
7.2.2 Speed Analysis
7.3 Experiments
7.3.1 Experiment Settings
7.3.2 Precision-Recall Evaluation
7.3.3 Time Efficiency Evaluation
7.4 Summary
8 Conclusion
References

List of Figures

1 Challenges of detecting objects in 3D point clouds.
2 Examples of 3D object detection in point clouds.
3 The recognition method should be robust and stable under various situations, including (a) occlusion; (b) rotation; (c) noise.
4 Automatic 3D point cloud modeling system with a 3D point cloud object recognition module to identify and model complex shapes and structures.
5 Road map of research sub-topics in this thesis. Notice that each sub-topic focuses on enhancing several different desirable properties and requirements stated above.
6 Matching "Lena" to its "pencilized" version, which is produced by Photoshop through a "pencil drawing effect". The two images exhibit vastly different pixel intensities, but the GAD finds a large number of correct correspondences.
7 (a) Small gap and zigzag artifacts that appear in many edge images. (b) Small gap appearing in edge detection. (c) Additive edge scoring makes short dashed line segments achieve a similar score as a long connected line. (d) Score of a line segment given its length l and distance d to the Gixel. (e) Illustration of Manhattan distance.
8 (a) Circular Gixel array - each yellow point is a Gixel. (b) A Gixel array can be adjusted for rotation and scale invariance.
9 Descriptor similarity score distribution when moving a keypoint in the neighborhood. (a) GAD; (b) SURF.
10 Performance comparison between GAD and SIFT, SURF, BRIEF, ORB when matching aerial images to building rooftop outlines generated from 3D models for urban model texturing. (a) Matching with SIFT. (b) Matching with SURF. (c) Matching with BRIEF. (d) Matching with ORB. (e) Matching with GAD (no RANSAC). (f) recall vs. 1-precision curves. (g) Matching with GAD (after RANSAC). (h) Registered aerial image and 3D wireframes using GAD.
11 Performance comparison when matching a photo and a drawing. (a) Matching with SIFT. (b) Matching with SURF. (c) Matching with BRIEF. (d) Matching with ORB. (e) Matching with GAD. (f) recall vs. 1-precision curves.
12 Performance comparison when matching an intensity image and a depth image. (a) Matching with SIFT. (b) Matching with SURF. (c) Matching with BRIEF. (d) Matching with ORB. (e) Matching with GAD. (f) recall vs. 1-precision curves.
13 Potential multi-modal matching applications of the GAD. (a) Matching maps to aerial images. (b) Medical image registration.
14 Matching performance of GAD with rotation and scale changes under multi-modal situations. (Curve comparisons are not included since other descriptors barely work in multi-modality, as Sec. 3.4.2 already showed.) (a) Matching between images 1 and 2 from the "Graffiti" series [1], where one of the images is rendered with a "glowing edges effect". (b) Matching between images 1 and 4 from the "Boat" series [1], where one of the images is rendered with a "pencil drawing effect". (The pointer on each keypoint indicating its orientation shows that almost all keypoints are aligned and matched correctly.)
15 More results for matching with GAD (left) and recall vs. 1-precision curve comparison (right). (a) Matching with JPEG compression (images 1 and 2 from the "Ubc" series [1]). (b) Matching with illumination change (images 1 and 4 from the "Leuven" series [1]).
16 Flow diagram of the training-based 3D object recognition algorithm.
17 (a) Illustration of the 3D summed area table [50]. The eight regions are A, B, C, D, E, F, G, H, respectively. Assume V_ijk is the value of the 3D summed area table at A_ijk (i, j, k = 0, 1). Then V_111 = A+B+C+D+E+F+G+H, V_110 = A+B+C+D, V_101 = A+B+E+F, V_011 = A+C+E+G, V_100 = A+B, V_010 = A+C, V_001 = A+E, V_000 = A. Thus H = V_111 - V_110 - V_101 - V_011 + V_100 + V_010 + V_001 - V_000. (b) Illustration of 2D [46] and 3D Haar-like features. The feature value is the normalized difference between the sum of pixels (voxels) in the bright area and the sum of pixels (voxels) in the shaded area.
18 The first 5 features selected by the Adaboost training procedure. (a) For the T-junction object; (b) for the oil tank object.
19 Illustration of the binary occupancy based feature.
20 Valve detection result.
21 Principal direction detection by PCA to align objects with arbitrary rotation changes.
22 Object detection results in industrial scenes.
23 (Upper) False alarm examples in an industrial scene. (Lower) False alarm examples in a street scene.
24 Object detection results in street scenes.
25 Left: Object recognition in a cluttered scene with high occlusion. Right: Recognition rate vs. occlusion compared to spin images [39], tensor matching [34] and keypoint matching [49].
26 (a) Original industrial point clouds. (b) 3D models automatically reconstructed by the integrated 3D modeling system. (c) Models by commercial software [51], completely devoid of any object. (d) Hand-made models by professional modelers.
27 The pipe-run modeling module [59] takes raw 3D point clouds as input, extracts pipes and reconstructs joints in the scene into a complete pipe network.
28 Plane point classification [60] result. The yellow points are classified as plane, while the red ones are classified as non-plane. Note that the small clusters (e.g. those on the ladder) will no longer be considered as plane clusters after segmentation.
29 System diagram - relationship of each module and system integration elements.
30 (a) Original industrial point clouds. (b) Unaligned results after merging pipe models and detected 3D objects (notice the small displacement between objects). (c) Objects are aligned to pipes. (d) Pipe segments are predicted and modeled in small gaps between objects on the same axis.
31 (a) Flange detection with pipe information. (b) "Triple flange" detection.
32 (a) Residual point clouds after removing pipe and plane points, used to reduce the search space. (b) Original point clouds used to compute and evaluate the object detector. (c) The "object hierarchy", where the bigger object is composed of two smaller ones, helpful in choosing overlaps.
33 (a) Results by the integrated system, where detected objects are rendered in point clouds. (b) Results by the system, where detected objects are rendered in mesh models. (c) Results by the system, where detected objects are replaced by data at the corresponding location from the original point clouds. (d) Original point clouds. (e) Manual CAD models by professional modelers, with one extra (mistaken) pipe line and other small errors. (f) Models by commercial software [51] with lots of errors but none of the objects.
34 (a) Complete industrial scene reconstruction by the integrated system. (b) Original point clouds. (c) Models by commercial software [51]. (d) CAD models hand-made by professionals, cleaner but losing too much detail. (e) Results by the system on more industrial datasets.
35 Flow of the proposed algorithm: first project the 3D point clouds into 2D images from multiple views, then detect objects in each view separately, and finally re-project all 2D results back into 3D for a fused 3D object location estimate.
36 3D input point clouds of the data scene and object are projected into 2D depth images at multiple views.
37 The scene is segmented into four sections according to depth along one axis and projected into 2D binary images (converted to binary for efficiency - please refer to Sec. 6.2.1). Notice the occluded tank object in Fig. 36 can now be seen in a separate view.
38 Dominant gradients for all 8x8 grids are computed for every projected image for both object and scene (most scene grid lines are hidden). Image and object patches are matched grid-by-grid for dominant gradients during detection.
39 (a) 2D object detection confidence score distribution for the 4 images in Fig. 37. (b) Filtered confidence map with only positives. (c) Objects are detected at local maxima in the filtered confidence map. The green number marks the confidence. (d) Precision-recall curves comparing the three different 2D object detection methods.
40 2D detection results are re-projected into 3D space and combined to obtain the 3D object locations.
41 Demonstration of rotation invariance, where the rotated object (right) is successfully recognized in the scene at different orientations.
42 The algorithm searches over both viewpoint rotation and in-plane rotation to achieve rotation invariance.
43 (a) Detection performance improvement as the number of projection views increases from 2 to 8. Notice there are more true positives and fewer false alarms. (b) Precision-recall curves with 2, 4, 6 and 8 projection views for the 3D scene.
44 (a) There are 6 pumps on the 3 top pipe lines in this heavily cluttered scene. With only 2 depth sections, 4 pumps on the outside can be detected, with some false alarms. With 4 depth sections, all 6 pumps are detected correctly. (b) Precision-recall curves with 2, 3 and 4 depth sections in a cluttered scene.
45 (a) Object detection results in industrial scenes. (b) Precision-recall curves on industrial data compared to the training-based method.
46 (a) Object detection results in street scenes. (b) Precision-recall curves on street data compared to the training-based method.
47 Examples of various test cases and recognition results.
48 Precision-recall curves on various test cases, compared to Spin Images [39], FPFH [44], SHOT [68] and 3D window-scanning [62]. (a) Small segments with two or three object instances and few background points. (b) Large scenes with more than five object instances and many background points. (c) Industrial site scans. (d) Street-level LiDAR. (e) Occluded scenes with partially scanned objects. (f) Cluttered scenes in which several objects are close to each other, so they may interfere with each other's detections.
49 Comparison of pipelines for the original multi-view 3D point cloud object detection algorithm [65] and the proposed algorithm with concatenated CNNs.
50 Two types of CNN network structures used. Note the different number of classes in the output.
51 Some examples of CNN training samples: (a) positive; (b) "easy" negative; (c) "hard" negative.
52 Concatenating two levels of early rejection networks for fast negative window filtering, before the final level of the multi-class detection network.
53 (a) Negative windows rejected in the first level of the early rejection network are mostly simpler backgrounds. (b) Negative windows rejected in the second level of the early rejection network are more complicated non-objects.
54 Precision-recall curves on various test cases, compared to the original multi-view method [65], Spin Images [39], FPFH [44], SHOT [68] and 3D window-scanning [62]. (a) Small segments with two or three object instances and few background points. (b) Large scenes with more than five object instances and many background points. (c) Industrial site scans. (d) Street-level LiDAR. (e) Occluded scenes with partially scanned objects. (f) Noisy scenes with random noisy points.

List of Tables

1 Performance comparison of the 3D Haar-like feature and the binary occupancy based feature.
2 The change in performance as more negative samples are added.
3 Performance of the detector on each type of object in the industrial data.
4 Performance of the detector on each type of object in the street data.
5 Speed-up before and after binary conversion.
6 Statistical 3D recognition performance on 12 industrial object types.
7 Statistical 3D recognition performance on street object types.
8 Detection time comparison between the new multi-view algorithm and the 3D descriptors.
9 Speed analysis for 3D object detection methods.
10 Time comparison for detecting 6 object classes.

Abstract

Detection of three-dimensional (3D) objects in point clouds is a challenging problem. Existing methods either focus on a specific type of object or scene, or require prior segmentation, both of which are usually inapplicable in real-world industrial applications.
This thesis describes three methods to tackle the problem, with gradually improving performance and efficiency. The first is a general-purpose 3D object detection method that combines Adaboost with 3D local features and requires no prior object segmentation. Experiments demonstrated competitive accuracy and robustness to occlusion, but the method suffers from limited rotation invariance. As an improvement, a second method is presented with a multi-view detection approach that projects the 3D point clouds into several 2D depth images from multiple viewpoints, transforming the 3D problem into a series of 2D problems, which reduces complexity, stabilizes performance, and achieves rotation invariance. Its drawback is the huge number of projected views and rotations that must be detected individually, which limits the complexity and performance of the 2D algorithm that can be chosen. The third method solves this with a convolutional neural network, which can handle all viewpoints and rotations for the same class of object together, and can predict multiple classes of objects with the same network, without the need for an individual detector for each object class. Detection efficiency is further improved by concatenating two extra levels of early rejection networks with binary outputs before the multi-class detection network.

3D object detection in point clouds is crucial for 3D industrial point cloud modeling. Prior efforts focus on primitive geometry, street structures or indoor objects, but industrial data has rarely been pursued. We integrate several algorithm components into an automatic 3D modeling system for industrial site point clouds, including modules for pipe modeling, plane classification and object detection, and solve the technology gaps revealed during the integration. The integrated system is able to produce classified models of large and complex industrial scenes with a quality that outperforms leading commercial software and is comparable to professional hand-made models.

This thesis also describes an earlier work in multi-modal image matching, which inspired the later research on 3D object detection by 2D projections. Most existing 2D descriptors only work well on images of a single modality with similar texture. This thesis presents a novel basic descriptor unit called a Gixel, which uses an additive scoring method to sample surrounding edge information. Several Gixels in a circular array create the Gixel Array Descriptor, which excels in multi-modal image matching with dominant line features.

1 Introduction

Detection of three-dimensional (3D) objects in point clouds is a crucial precondition for a significant number of applications, including virtual training and tourism, city planning, multimedia entertainment, robot localization and navigation, and scene understanding. It is a challenging problem for several reasons, as illustrated in Fig. 1:

• 3D scanners only sample objects at discrete points, so small-scale details are aliased or completely lost.
• Scans may contain significant noise or gaps, degrading detection performance.
• Many object types are similar in both shape and size or contain repeating patterns.
• In cluttered scenes, nearby objects may occlude surfaces or interfere with the detection process.

Figure 1: Challenges of detecting objects in 3D point clouds.

3D object detection in point clouds, as shown in Fig. 2, has attracted extensive research attention for many different kinds of applications.
However, these methods either require prior segmentation or are based on point classification, neither of which is suitable for industrial site scans, which remain much neglected in the research field. While primitive shapes like pipe cylinders and planes are prevalent in industrial sites, they are often connected to valves, pumps, and a wide variety of instrumentation with complex shapes and structures. Recognition and modeling of these objects in 3D point cloud scans are important for many industrial applications such as maintenance and simulation, but a practical algorithm verified on large-scale, real-world industrial data has yet to be seen.

Figure 2: Examples of 3D object detection in point clouds.

A 3D object recognition method for industrial point clouds is desired with the following properties:

• Performance: The objects should be recognized with a high recall rate (percentage of object instances successfully retrieved from the input scene) while maintaining a low false alarm rate (percentage of wrong recognitions), in order to minimize manual post-processing.
• Efficiency: The recognition method must execute within a reasonable time to allow large-scale and repeatable industrial applications.
• Robustness or Invariance: The recognition method should be robust and stable under various situations and obstacles, as illustrated in Fig. 3:
  – Occlusion: Industrial sites are usually cluttered with dense objects occluding each other, resulting in partially scanned point clouds. Limited viewpoints during 3D scanning may also produce partial scans, equivalent to occlusions.
  – Rotation: An industrial object may rotate freely along the base pipe, or even with three-dimensional freedom in some situations. The recognition method should be insensitive to rotation changes.
  – Noise: Noisy points are common in 3D scans. The recognition method should be able to maintain a decent performance even under heavy noise.
• Practicability: The recognition method should be verified on different real-world datasets. It is also desirable that the method is versatile and applicable in more circumstances.

Figure 3: The recognition method should be robust and stable under various situations, including (a) occlusion; (b) rotation; (c) noise.

With these requirements in mind, this thesis first presents a 3D point cloud object recognition method using a machine-learning Adaboost training procedure [45]. Unlike many traditional methods, prior segmentation is not required, and local features are used instead of global features to describe the objects, which ensures robustness to occlusion. The local feature is a 3D Haar-like feature [46], working together with a 3D summed area table [50] for efficient computation. The 3D point clouds are resampled into a 3D image for the 3D feature and summed area table calculation. This method is verified on both dense industrial data and sparse street data containing several different types of objects, showing competitive performance, efficiency and robustness to occlusion.
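To make the feature computation concrete, the following is a minimal numpy sketch of a 3D summed area table built over a voxelized point cloud segment, and of a two-box 3D Haar-like feature evaluated from it. The function names, the 64x64x64 grid and the example feature geometry are illustrative assumptions for this sketch, not the implementation evaluated in Chapter 4 (which additionally selects such features with Adaboost).

import numpy as np

def integral_volume(vol):
    # 3D summed area table, zero-padded so box queries need no bounds checks.
    sat = vol.cumsum(axis=0).cumsum(axis=1).cumsum(axis=2)
    return np.pad(sat, ((1, 0), (1, 0), (1, 0)))

def box_sum(sat, x0, y0, z0, x1, y1, z1):
    # Sum of vol[x0:x1, y0:y1, z0:z1] from eight table lookups (inclusion-exclusion).
    return (sat[x1, y1, z1] - sat[x0, y1, z1] - sat[x1, y0, z1] - sat[x1, y1, z0]
            + sat[x0, y0, z1] + sat[x0, y1, z0] + sat[x1, y0, z0] - sat[x0, y0, z0])

def haar_feature_3d(sat, start, size, axis=0):
    # Split the box in half along `axis`; the feature is the difference between the
    # "bright" and "shaded" half sums, normalized by the total sum of the box.
    s = np.asarray(start)
    e = s + np.asarray(size)
    mid = e.copy()
    mid[axis] = s[axis] + size[axis] // 2
    first = box_sum(sat, *s, *mid)
    second = box_sum(sat, *[mid[i] if i == axis else s[i] for i in range(3)], *e)
    total = box_sum(sat, *s, *e)
    return (first - second) / (total + 1e-9)

# Example: a hypothetical 64x64x64 occupancy grid resampled from a point cloud segment.
occupancy = np.zeros((64, 64, 64), dtype=np.float32)
sat = integral_volume(occupancy)
feature = haar_feature_3d(sat, start=(10, 10, 10), size=(16, 16, 8), axis=2)

With the table precomputed, any box feature costs a constant eight lookups regardless of its size, which is what makes exhaustive window scanning affordable.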
The practicability of this solution to 3D object recognition in point clouds is further demonstrated through its application in an automatic 3D point cloud modeling system. Existing automatic 3D modeling systems are usually primitive-based and severely limit the variety of structures they can model, requiring heavy human interaction. This thesis describes a system that treats industrial point clouds as a congregation of pipes, planes and objects. The system integrates three main modules to handle each part of the data. Two of them are primitive extraction processes [59, 60] that detect cylinders and planar geometry in the scene and estimate models and parameters to fit the evidence. The third module utilizes the above-mentioned 3D object recognition method to identify and model complex shapes and structures, by matching clusters of 3D points to 3D models of objects stored in a pre-built object library. The best-matched library object is used to represent the point cluster. The combination of primitive extraction and object detection processes completes a 3D model for a complex scene consisting of a plurality of planes, cylinders and general objects, as demonstrated in Fig. 4. Creating such models in industry usually requires extensive time by skilled modelers, even using the best software tools available today. With the separately modeled pipes, planes and objects, the combined results display is freely switchable between mesh models for efficiency and point clouds for accuracy, which would not be achievable in traditional data-driven or simple primitive-fitting methods. The capability of 3D object recognition in point clouds also opens up the possibility of labeling objects with metadata and manipulating them separately for augmented reality.

Figure 4: Automatic 3D point cloud modeling system with a 3D point cloud object recognition module to identify and model complex shapes and structures.

However, the first solution for 3D object recognition in point clouds still has some noticeable disadvantages, including only partial rotation invariance and inconvenient user accessibility, since it requires prior object training or even repeated training, and manual tuning of various parameters. An improved solution through 3D-to-2D projections is suggested, inspired by earlier research work in 2D image matching. Even though 3D object recognition in point clouds is still a new and active topic, arising from the availability of 3D scanners, object detection in 2D images has already seen extensive research development. This thesis therefore transforms the object recognition problem from 3D space into a series of 2D detection problems. To achieve this, a multi-view recognition framework is described, which first projects the 3D point clouds into 2D depth images from multiple views, then detects objects in each view separately using a 2D object detection algorithm, and finally re-projects all the 2D detection results back into 3D to estimate combined 3D detection locations. This new algorithm effectively reduces the problem complexity from 3D to 2D, stabilizes the performance and achieves rotation invariance with multiple views, without any requirement for a priori object segmentation or descriptor training. The new method is tested on a combination of industrial data and street data [29, 74] containing various types of objects and scene conditions. In comparison with state-of-the-art 3D recognition methods, our method has competitive overall performance with a one-order-of-magnitude speed-up.
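The multi-view framework can be summarized by the following simplified Python sketch, which projects the cloud orthographically around a single axis, runs a plug-in 2D detector on every depth image, and fuses the re-projected hits by greedy proximity voting. The resolution, number of views, merge radius and the detect_2d callable are illustrative assumptions; the actual method of Chapter 6 additionally uses depth sections, in-plane rotations and a binary conversion for speed.

import numpy as np

def rot_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def project(points, R, res=0.05, size=256):
    # Orthographic depth image of the rotated cloud, viewed along the rotated y axis.
    p = points @ R.T
    off = p.min(axis=0)
    col = ((p[:, 0] - off[0]) / res).astype(int)
    row = ((p[:, 2] - off[2]) / res).astype(int)
    keep = (col < size) & (row < size)
    depth = np.full((size, size), np.inf)
    np.minimum.at(depth, (row[keep], col[keep]), p[keep, 1])
    return depth, off

def unproject(row, col, depth, off, R, res=0.05):
    # Map a 2D detection back to a 3D location in the original frame.
    cam = np.array([col * res + off[0], depth[row, col], row * res + off[2]])
    return R.T @ cam

def detect_3d(points, detect_2d, n_views=8, merge_radius=0.5):
    # detect_2d(depth_image) -> [((row, col), score), ...] stands in for any 2D detector.
    votes = []
    for k in range(n_views):
        R = rot_z(2.0 * np.pi * k / n_views)
        depth, off = project(points, R)
        for (r, c), score in detect_2d(depth):
            if np.isfinite(depth[r, c]):
                votes.append((unproject(r, c, depth, off, R), score))
    fused = []  # greedy fusion: keep the strongest vote, drop nearby duplicates
    for loc, score in sorted(votes, key=lambda t: -t[1]):
        if all(np.linalg.norm(loc - g) > merge_radius for g, _ in fused):
            fused.append((loc, score))
    return fused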
The second method still has its own issues. In order to transform the 3D problem into 2D space, the 3D point cloud is projected based on depth information at multiple viewpoints and rotations. This generates a large number (at least 10,000) of 2D detection tasks between projections of scenes and objects, which places a requirement on the speed of the single 2D detection algorithm: it must be very fast in processing all 2D images to finish the overall 3D detection task in a reasonable time. As a result, this limits the complexity of the 2D detection algorithm, and thus the overall performance of 3D detection. To solve this speed-complexity trade-off, we propose to use a convolutional neural network (CNN) for 2D detection, which has already been proven [66, 67, 75, 76] to be the most powerful method for 2D detection. We use a CNN to handle multiple viewpoints and rotations for the same class of object together with a single pass through the network, thus reducing the total number of 2D detection tasks dramatically. Moreover, while the existing strategies usually require an individual detector for each class of object, a CNN can be trained with a multi-class output, further saving tremendous processing time when there are multiple objects to detect. To enable the multi-class CNN to detect object classes with varied sizes, we unify the training sample sizes with padded boundaries so the detector searches for all object classes in a uniform-sized window. On top of these, we further improve detection efficiency by concatenating two extra levels of early rejection networks with binary outputs, simplified architectures and smaller image sizes, before the final multi-class detection network. Experiments show that our method has competitive overall performance with at least a one-order-of-magnitude speed-up in comparison with state-of-the-art 3D point cloud object detection methods.

Figure 5 shows a road map illustrating the progression of research sub-topics in this thesis. Some earlier works also cover research in the area of 2D image matching, for which a basic descriptor unit called a Gixel is presented, using an additive scoring method to sample surrounding edge information. Several Gixels in a circular array create a powerful descriptor called the Gixel Array Descriptor (GAD), which excels in multi-modal image matching by exploiting line features. These serve as background knowledge for the later works on related 3D topics, as well as inspiration in developing the improved 3D object recognition method through projection into 2D space.

Figure 5: Road map of research sub-topics in this thesis. Notice that each sub-topic focuses on enhancing several different desirable properties and requirements stated above.

To better evaluate the proposed object detection algorithms, additional work was done to construct a test dataset from many different sources. It contains three types of data, including single objects, street data and industrial data, incorporating some existing public datasets, such as the 3D Keypoint Detection Benchmark [34] and UWA 3D Object Dataset [49] for object retrieval, and the CMU Oakland 3-D Point Cloud Dataset [29] and Washington Urban Scenes 3D Point Cloud Dataset [74] for street data. For industrial data, there is no public data available, so some private data is used. Note that the object retrieval data [34, 49] come as mesh models, so a virtual scanner is used to produce point clouds. Original point clouds are used for all street and industrial data. The dataset contains varied data sizes and object densities to reflect different scales and complexities, including small segments or clusters with two or three object instances and few background data points, and large scenes with more than five object instances and many background data points such as pipes, walls and planes (only for street and industrial data). Some scan conditions are also tested, including occlusions that create partially scanned objects, and clutter in which several objects are close to each other, so they may interfere with each other's detections.

2 Related Work

2.1 3D Object Recognition in Point Clouds

Various methods have been introduced to tackle the 3D object recognition problem, though many of them only focus on a specific type of object, e.g. cars, trees, rooftops, etc. For those methods that handle multiple types of 3D objects, many require segmented input data. For example, 3D object recognition on street LiDAR data [19] requires the scene to be pre-processed based on ground estimation, so that candidate objects are segmented before applying recognition algorithms. Since many methods use a global description of the target object, segmentation is necessary.

Most of the existing methods focus on street data [19, 29, 30] and indoor data [31], neglecting other applications such as industrial parts recognition [39, 48, 60, 62], where objects are often more densely arranged and segmentation is much harder or impossible. Some existing methods for street and indoor data perform point-based classification [29-31], which can label general categories of points such as ground, facades or vegetation. If applied to industrial data, these methods might be able to differentiate between primitive shapes like pipes or planes, but not complex industrial object types. The few existing methods on industrial data either focus on mesh models [48] or deal with only simple synthetic test scenarios [39]. A method capable of object recognition in 3D point cloud scans of real-world, large-scale industrial sites has yet to be seen.

A few methods utilize machine learning techniques to select the best description for a specific type of 3D object, so it can be recognized efficiently and repeatedly in a large scene. As a common object class that has variations within a generally similar size and shape, cars are a typical target for training-based methods. Often a set of car objects is defined first to train either a set of local features [16, 17] or a bag of words [18]. Golovinskiy et al. [19] extend the targets to over twenty types of street objects, using classifiers trained with global features. Wessel et al. [20] present a supervised learning approach that uses arbitrary local features for 3D shape retrieval. Several works [21-24] make use of SVMs, first describing the target object with a feature vector and then training an SVM classifier on it. Novotni et al. [22] use an SVM and kernel-based methods to realize relevance feedback for 3D shape retrieval using global Zernike descriptors. Akgul et al. [23] present a linear score fusion approach, using an SVM in the training step in order to learn how to optimally combine similarity values between two objects arising from different global shape descriptors. Hou et al. [24] also make use of SVMs, learning conditional object class probabilities given a set of global shape descriptors. The most similar to the training-based method presented in this thesis is the work of Laga et al. [25], who introduce the Adaboost training framework to select a combination of weighted weak classifiers based on Light Field Descriptors [26]. Training-based methods are also seen in other applications like segmentation [32] or scene analysis [27, 28].
Song et al. [77] use depth maps for object detection with a 3D detector scanning in 3D space, which is similar to the depth-based projection method presented in this thesis, but their method focuses on RGB-D data rather than point clouds, and it is very time-consuming due to the extensive cost of detector training. The strategy of reducing a 3D problem into 2D space has similarly been employed for 3D object retrieval. Chen et al. [26] use 2D shapes and silhouettes to retrieve 3D object mesh models. Ohbuchi et al. [87] extract 2D multi-scale local features from range images to aid in 3D object model retrieval. Shang and Greenspan [88] use view sphere sampling to extract features from the minima of the error surfaces for 3D recognition. Aubry et al. [78] also apply the 3D-to-2D idea to align 3D CAD chair models to 2D images with trained mid-level visual elements. However, these methods focus on matching individual clean object mesh models, which is much simpler than the unsegmented, noisy, large-scale 3D point clouds in our case, with tremendously increased complexity.

Another type of approach focuses on 3D shape descriptors. Many local and global 3D descriptors have been proposed. The most popular ones are spin images (SI) [39] and extended Gaussian images [40], followed by various others like spherical spin images [41], regional point descriptors [42], angular spin images [43] and FPFH [44]. However, existing 3D descriptors are usually designed for mesh models and are hard to adapt to raw point clouds with noisy points and cluttered objects.

For the methods described in this thesis, point clouds are transformed into 3D volumetric data during pre-processing, either to turn them into a 3D image or to prepare for 2D projection. 3D volumetric data has already been used in some existing works for feature extraction and description, though commonly only for 3D shape models or medical applications. Gelfand et al. [33] compute an integral volume descriptor for each surface point and pick those with unique descriptors. Mian et al. [34] describe partial surfaces and 3D models by defining three-dimensional grids over a surface and computing the surface area that passes through each grid. Knopp et al. [35] extend the 2D SURF descriptor to 3D shape classification by converting meshes into volumetric data. Features based on volumetric data are also used in medical applications, such as 3D SIFT [36] for CT scan volumes and MSV (3D MSER) [37] for MRIs. Yu et al. [38] provide an evaluation of several interest points on 3D scalar volumetric data.

2.2 3D Point Cloud Modeling

3D point cloud modeling is a very time-consuming process, requiring manual labeling and creation of surfaces and their connections. Primitive-based computer-aided editing systems increase productivity, but extensive human interaction is typically required. Automatic systems exist, but severely limit the variety of structures they can model. In the case of aerial LiDAR (Light Detection And Ranging) data, existing systems focus on modeling buildings and ground terrain. Ground-based LiDAR 3D point cloud scans are often used to model planar surfaces [54, 55, 57], or to classify specific categories of street objects [16-18] and indoor objects [28, 58]. In contrast, industrial site scans are a much neglected scenario, including a wide variety of shapes and structures. While pipes and cylinders are prevalent, their junctions are complex, and often connect to valves, pumps, and instrumentation.
Existing systems for industrial point cloud modeling [39, 48] cannot detect and model primitive shapes such as cylinders, pipes, and planar surfaces together with general-shape objects such as valves, pumps and instrumentation, as well as the interconnections between them. Specialized commercial software [51] only attempts to model pipes, and completely ignores other scene objects.

2.3 CNN for Object Classification and Detection

Since the work of Krizhevsky et al. [67] on ImageNet, the convolutional neural network (CNN) [66] has become the most successful method for the image classification problem. For object detection, R-CNN [75] is the state of the art on 2D RGB images, while Depth-RCNN [76] expanded the R-CNN algorithm to adapt to RGB-D images. In the 3D domain, 3D ShapeNet [80] represented geometric 3D shapes on 3D volumetric grids and applied a CNN for classification. For 3D CAD model classification, Su et al. [79] took a view-based deep learning approach by rendering 3D shapes as 2D images. This method shares some similarities with ours, but 3D mesh model classification and retrieval is a much different problem from 3D object detection in large point clouds. For 3D point clouds, Prokhorov [81] and Habermann et al. [82] explored 3D point cloud object classification with CNNs, but only focused on pre-segmented street objects. VoxNet [83] converted 3D point clouds into volumetric data and trained a 3D CNN to classify them, but the method still focused only on point cloud segments from mostly street data, instead of the more complex large-scale industrial point clouds that we are working on.

2.4 Image Matching based on Keypoint Descriptors

The existing keypoint descriptors can be roughly classified into two categories: those based on intensity or gradient distribution, and those based on pixel comparison or ordering. The distribution-based descriptors are more traditional and can be traced back to the works of Zabih and Woodfill [3], Johnson and Hebert [4] and Belongie et al. [5]. An important representative descriptor in this category is SIFT (Scale-Invariant Feature Transform) [6], which codes the orientation and location information of gradients in the descriptor region into a histogram, weighted by magnitude. Ke and Sukthankar [7] reduced the complexity of SIFT with PCA, leading to PCA-SIFT, which is faster but also less distinctive than SIFT. Mikolajczyk and Schmid [1] proved the great performance of SIFT, and proposed a new descriptor based on SIFT by changing its location grid and reducing redundancy through PCA. The resulting GLOH is more distinctive than SIFT, but at the same time more expensive to compute. Bay et al. [2] proposed SURF (Speeded-Up Robust Features), using integral images to reduce the computation time, while still retaining the same gradient distribution histograms as SIFT. SURF [2] has proved to be one of the best descriptors in various circumstances.

Recently, comparison-based descriptors have been attracting attention. These descriptors use the relative comparison or ordering of pixel intensities rather than the original intensities, resulting in much more efficient descriptor computation, while still maintaining performance competitive with distribution-based descriptors such as SIFT or SURF. Tang et al. [10] proposed a 2D histogram in the intensity ordering and spatial sub-division spaces, resulting in a descriptor called OSID, which is invariant to complex brightness changes as long as they are monotonically increasing.
Calonder et al. [8] proposed BRIEF, which uses a binary string consisting of the results of pair-wise pixel intensity comparisons at pre-determined locations; it is very fast in descriptor extraction and matching, but also very susceptible to rotation and scale changes. Rublee et al. [9] improved BRIEF into ORB (Oriented BRIEF), making it rotation invariant and also resistant to noise. There are several other descriptors within this category [11] [12] [13]. However, this type of descriptor relies heavily on extensive texture, so they are not suitable for multi-modal matching. The Gixel-based descriptor presented in the following chapter belongs to the distribution-based category.

3 Multi-Modal Image Matching with the Gixel Array Descriptor

This chapter describes some earlier research work focusing on 2D multi-modal image matching. The knowledge of 2D descriptors for multi-modal images and the idea of 2D image matching inspired the subsequent research on 3D object recognition in point clouds. The capability of the described Gixel Array Descriptor might even lead to a more powerful 3D object recognition method if combined with the existing framework in future work.

3.1 Introduction to Multi-Modal Image Matching

Finding precise correspondences between image pairs is a fundamental task in many computer vision applications. A common solution is extracting dominant features to describe a complex scene, and comparing the similarity between feature descriptors to determine the best correspondence. However, most descriptors only perform well with images of the same modality. Successful matching depends on similar distributions of texture, intensity or gradient.

Multi-modal images usually exhibit different patterns of gradient and intensity information, making it hard for existing descriptors to find correspondences. Examples include matching images from SAR, visible light, infrared devices, LiDAR sensors, etc. A special case of multi-modal matching is to match a traditional optical image to an edge-dominant image with little texture, such as matching optical images to 3D wireframes for model texturing, matching normal images to pictorial images for image retrieval, or matching aerial images to maps for urban localization. These tasks are even harder for existing descriptors, since they often don't have any texture distribution information.

To match multi-modal images, especially edge-dominant images, this chapter presents a novel gradient-based descriptor unit called a "Gixel", abbreviated from "Gradient Pixel". A Gixel is a sample point for the gradient information in a small local region. It serves as a basic unit of the complete descriptor - the Gixel Array Descriptor (GAD), which consists of several Gixels in a circular array. Each Gixel is used to sample and score gradients in its neighborhood. The scores are normalized to represent the relative gradient distribution between the Gixel regions, forming a complete descriptor vector. Owing to the circular array, the descriptor is easily adjusted to any orientation and size, allowing invariance in both rotation and scale.

Line features are the most important and reliable features in multi-modal matching applications. A key property of a Gixel is that it samples the information of a long connected line with much the same result as it samples a series of short dashed line segments. This is achieved by an additive scoring method. The score obtained with small line segments of similar orientation and distance to the Gixel is similar to the score obtained from a long connected line of the same orientation and distance. In multi-modal data, due to different sensor characteristics, line features may appear with gaps or zig-zag segments. The Gixel's additive scoring method samples broken or solid lines with similar scores. This is essential for multi-modal matching. Figure 6 shows an example matching result of the Gixel-based descriptor, between an ordinary image and a "pencil drawing" stylized image.

Figure 6: Matching "Lena" to its "pencilized" version, which is produced by Photoshop through a "pencil drawing effect". The two images exhibit vastly different pixel intensities, but the GAD finds a large number of correct correspondences.

The main contributions in this chapter include:

• Observe that line features are the most important and reliable features in many multi-modal matching applications.
• Present the Gixel Array Descriptor (GAD) to extract line features, and thus perform robustly for matching multi-modal images, especially edge-dominant images with little texture.
• Demonstrate that the GAD can achieve good matching performance in various multi-modal applications, while maintaining a performance comparable to several state-of-the-art descriptors on single-modality matching.

3.2 Introduction to Gixel

A Gixel, or "Gradient Pixel", is a sample point for the gradient information in a small local region on an image. A traditional pixel captures the pictorial information of its neighborhood, and produces an intensity value or a 3D vector (R, G, B). Likewise, a Gixel extracts and summarizes the gradient information of its neighborhood. All line segments or pixelized gradients within the neighborhood of a Gixel are sampled (or scored) to produce a 2D vector of both x and y gradient data.

Figure 7: (a) Small gap and zigzag artifacts that appear in many edge images. (b) Small gap appearing in edge detection. (c) Additive edge scoring makes short dashed line segments achieve a similar score as a long connected line. (d) Score of a line segment given its length l and distance d to the Gixel. (e) Illustration of Manhattan distance.

3.2.1 Additive Edge Scoring

For each line segment or pixelized gradient (a pixel gradient is considered as a line segment of length 1), a Gixel creates a score, which encodes three elements of information: orientation, length, and distance to the Gixel. The longer a line segment is, the higher it is scored. Similarly, the nearer a line segment is to the Gixel, the higher it is scored. The score is then divided into x and y components according to the orientation of the line segment or gradient. In addition, sensor noise or inaccurate edge detection usually results in small gaps, zigzags, or other artifacts along the line, as shown in Fig. 7(a, b). The design of the scoring function makes it additive, so that a sequence of short line segments achieves a similar score as a long connected line, as shown in Fig. 7(c).

The function f(x) = 1/(1+x) has the properties that f(0) = 1, f(+∞) = 0, and f(x) decreases as the input value x increases. The scoring function is constructed from f(x). Given a line segment's length l and its distance d to the Gixel, the score is defined as:

    Score(l, d) = f(d) - f(d + l) = 1/(1+d) - 1/(1+d+l)    (1)

Figure 7(d) illustrates the scoring function. Since f(x) is monotonically decreasing, Score(l, d) increases as l increases. Moreover, f'(x) is always negative and monotonically increasing, so Score(l, d) decreases as d increases.
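A minimal Python sketch of the scoring rule in equation (1), together with a numeric check of the additivity property described above; the segment lengths, distances and helper names are illustrative values only, not taken from the thesis experiments.

def f(x):
    return 1.0 / (1.0 + x)

def gixel_score(length, dist):
    # Score(l, d) = f(d) - f(d + l), higher for longer and nearer segments.
    return f(dist) - f(dist + length)

# Additivity check: two collinear segments separated by a small gap score almost
# the same as one connected segment spanning the combined extent.
whole = gixel_score(length=20.0, dist=5.0)
parts = gixel_score(length=9.0, dist=5.0) + gixel_score(length=10.0, dist=15.0)
print(whole, parts)   # about 0.1282 vs 0.1240 -- nearly equal despite the gap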
The score is then broken into x and y components, reflecting the orientation of the line segment. Finally, to achieve the property that short dashed line segments are scored similarly to a long connected line, the Manhattan distance is used for d instead of the Euclidean distance. It is defined as the distance from the Gixel to the intersection point between the Gixel and the line, plus the distance from the intersection point to the nearest point on the line segment (0 if the intersection point is on the line segment). It might seem that using the Manhattan distance causes line segments pointing towards the Gixel to be scored higher than perpendicular ones. This is actually intended: when a Gixel lies along the extension of a line segment, that line should have more importance to the Gixel score. Thus Gixels will also differentiate lines by their orientations.

For example, as shown in Fig. 7(e), AB is a line segment, C and C' are two points on AB near each other, G is the Gixel, and F is the intersection point from G to AB. To show that AB and AC + C'B contribute similar scores to the Gixel G:

    Score_AB = Score(AB, GF + FA) = f(GF + FA) - f(GF + FB)    (2)

    Score_AC + Score_C'B = Score(AC, GF + FA) + Score(C'B, GF + FC')
                         = f(GF + FA) - f(GF + FC) + f(GF + FC') - f(GF + FB)
                         ≈ f(GF + FA) - f(GF + FB) = Score_AB    (3)

The two short line segments AC and C'B will be scored similarly to the whole line segment AB, overcoming the impact of the gap CC'. This way, the impacts of multiple end-to-end short segments are aggregated for improved matching of multi-modal images. Lastly, note that the scoring system is smooth and does not introduce thresholds or discontinuities as a function of segment parameters.

3.2.2 Circular Gixel Array

All line segments or pixelized gradients within the neighborhood of a Gixel are scored, summed up, and split into a 2D vector with x and y components to encode orientation information. However, a single 2D vector will not provide enough discriminative power for a complex image. Therefore, for a complete descriptor, several Gixels are put in a region to compute their gradient scores individually and concatenate their scores into a larger vector. The vector is then normalized to encode the relative gradient strength distribution among the Gixels, as well as to reduce the effect of illumination change.

While many Gixel array distributions are possible, the Gixels are put in a circular array to form the complete descriptor, which is named the Gixel Array Descriptor (GAD), as shown in Fig. 8(a). The circular array is very helpful in achieving rotation and scale invariance (Sec. 3.4.3). The Gixels are placed on concentric circles with different radii extending from the center Gixel. There are three parameters involved: the number of circles, the number of Gixels on each circle, and the distance between circles (Figure 8 shows an example of 2 circles and 8 Gixels on each circle). The circular layout is similar to the DAISY descriptor [14], but the computation of the two descriptors is completely different since they aim at different applications.

Figure 8: (a) Circular Gixel array - each yellow point is a Gixel. (b) A Gixel array can be adjusted for rotation and scale invariance.
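As a rough illustration of this layout, the following sketch generates the sample positions of a circular Gixel array. The helper name is hypothetical, the default arguments mirror the parameter values reported for the experiments below (3 circles, 8 Gixels per circle, 50-pixel spacing), and the angle and scale arguments correspond to the rotation and radius adjustments used for rotation and scale invariance.

import numpy as np

def gixel_positions(center, n_circles=3, per_circle=8, spacing=50.0,
                    angle=0.0, scale=1.0):
    # Center Gixel plus `per_circle` Gixels on each of `n_circles` concentric circles.
    cx, cy = center
    pts = [(cx, cy)]
    for c in range(1, n_circles + 1):
        r = c * spacing * scale
        for k in range(per_circle):
            a = angle + 2.0 * np.pi * k / per_circle
            pts.append((cx + r * np.cos(a), cy + r * np.sin(a)))
    return np.array(pts)   # 25 positions with the default parameters

Each returned position accumulates the additive x and y edge scores of its neighborhood, and the 25 concatenated 2D scores are normalized into the 50-dimensional descriptor.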
On the other hand, too few or too sparse a Gixel array will reduce the discriminative power of the descriptor, and result in low feature dimension or low correlation between Gixels. Experiments use fixed parameters of 3 circles, 8 Gixels on each circle, and the distance between each circle is 50 pixels (which means the radius is 50, 100, 150 from the center Gixel). Thus there are3∗8+1 = 25 Gixels in total, resulting in a feature dimension of 50. A circular Gixel array also has several other benefits. It’s easy to generate and reproduce. Each Gixel in the array samples the region evenly. Most importantly, it can be adapted to achieve rotation and scale invariance by rotating Gixels on the circle and adjusting circle radius (refer to Sec.3.4.3), as shown in Fig.8(b). 3.2.3 Descriptor Localization and Matching Unlike descriptors such as SIFT or SURF, which require a keypoint detection step, GAD can be computed with any detector or at any arbitrary location, while still maintaining good performance even under multi- modal situations, due to its smoothness property (Sec.6.2.2). In practice a simple keypoint detection step like 16 the Harris corner detector can be used to identify areas with gradients to speed up the matching process. The distance between two descriptors is computed as the square root of squared difference of the two vectors. Two descriptors on two images are considered matched if the nearest neighbor distance ratio [1] is larger than a threshold. The nearest neighbor distance ratio is defined as the ratio of distance between the first and the second nearest neighbors. 3.3 Analysis of Advantages 3.3.1 Smoothness The design of GAD takes special care to achieve its smoothness property, both in how gradients are scored, and in how the neighborhood size of each Gixel is determined. The scoring function is defined as a smooth function monotonically decreasing with length and distance. The neighborhood of each Gixel is cut off smoothly, so that the score is negligible at its boundary, resulting in a circular neighborhood of 50-pixel radius. A small change in a line segment’s length, orientation, distance or the location of a Gixel sample will only result in a small change in the descriptor. Moreover, the additive scoring property of a Gixel also contributes to the smoothness by alleviating the impact of noise and line gaps. When matching images of a single modality, it is desirable to use a score function with a sharp peak. Smoothness may not be an important issue in this case. However, for multi-modal matching, even corre- sponding regions may not have exactly the same intensity or gradient distribution. The smoothness property of a Gixel makes it more robust to noise and thus better suited for multi-modal applications. In addition, a Gixel-based descriptor relies less on accurate selection of a keypoint position. This means that keypoint selection is not critical. Even when keypoints have small offsets between two images, which is common in multi-modal matching, Gixels can still match them correctly. On the other hand, traditional descriptors doesn’t exhibit smoothness, as they usually divides pixels and gradients into small bins and makes numerous binary decision or selections. Comparison-based descriptors are even worse, because noise or keypoint variations might change the pixel position and ordering completely. Figure 9 shows the distribution of matching similarity score between two descriptors in corresponding regions (of Fig.10), when one of the descriptor is moving. 
The Gixel descriptor (Fig. 9(a)) demonstrates better smoothness and distinctiveness (a single peak) than SURF (Fig. 9(b)). Figure 9: Descriptor similarity score distribution when moving the keypoint in the neighborhood. (a) GAD; (b) SURF. 3.3.2 Multi-Modal Matching Line features are the most important, and sometimes the only available, features in multi-modal matching problems. Each Gixel in the descriptor array samples and scores the line features in its neighborhood from a different distance and orientation. In other words, the Gixels sample overlapping regions of edge information, but from different locations. This spatial distribution of samples gives the final descriptor its discriminating power. On the other hand, traditional distribution-based descriptors tend to break a long line feature into individual grid or bin contributions. Noisy line rendering and detection might impact the distribution statistics heavily, making the matching process unreliable. Descriptors based on pixel-wise comparison may also suffer in multi-modal matching, as the lack of texture does not provide enough distinctive pixel pairs to work robustly. 3.4 Experiments 3.4.1 Evaluation Method The presented descriptor is evaluated on both public test data and multi-modal data. For public test data, the dataset provided by Mikolajczyk and Schmid [1] is chosen, which is a standard dataset for descriptor evaluation used by many researchers. It contains several image series with various transformations, including zoom and rotation (Boat), viewpoint change (Graffiti), illumination change (Leuven), and JPEG compression (Ubc). On the other hand, multi-modal matching is still a relatively new research area, so there is no established public test dataset yet. SIFT [6], SURF [2], BRIEF [8] and ORB [9] are used for comparison during evaluation. They are good representatives of existing descriptors, since they cover both the distribution-based and comparison-based categories of descriptors, and are usually compared to other descriptors, producing similar performance. To make the comparison fair, SURF's feature detector is also used to determine keypoint locations for GAD. After feature extraction, all descriptors are put through the same matching process to determine matched pairs (note that GAD, SIFT and SURF all use L2 distance, while BRIEF and ORB use Hamming distance). The latest build of OpenSURF 1 is used for the implementation of SURF. For SIFT, BRIEF and ORB, the latest version of OpenCV 2 is used. Edges are extracted by a Canny detector. To evaluate the performance objectively, recall and 1−precision are computed based on the number of correct matches, false matches, and total correspondences [1]. Two keypoints are correctly matched if the distance between their locations under the ground-truth transformation is below a certain threshold. Correspondence is defined as the total number of possible correct matches. recall and 1−precision can be computed by [1]: recall = #correct matches / #correspondences (4) 1−precision = #false matches / (#correct matches + #false matches) (5) By changing the matching threshold, different values of recall and 1−precision are obtained. A performance curve is then plotted as recall versus 1−precision. 3.4.2 Multi-Modal Matching Figure 10 shows an important matching problem that occurred in an actual industrial application. Aerial images need to be matched to 3D wireframes for urban model texturing.
One of the images contains only the building rooftop outlines generated from 3D models, which is a binary image without any intensity or texture information. Figure 10(a,b,c,d) are the matching results of SIFT, SURF, BRIEF, ORB, respectively. None of these establish enough good matches to even apply RANSAC. Figure 10(e) shows the matching result obtained by GAD. Most matches are correct, thus the incorrect ones can be filtered out via RANSAC , as shown in Fig.10(g). Using the refined matches, the true transformation homography can be estimated 1 http://www.chrisevansdev.com/computer-vision-opensurf.html 2 http://opencv.willowgarage.com/wiki/ 19 Figure 10: Performance comparison between GAD and SIFT, SURF, BRIEF, ORB when matching aerial images to building rooftop outlines generated from 3D models for urban models texturing. (a) Matching with SIFT. (b) Matching with SURF. (c) Matching with BRIEF. (d) Matching with ORB. (e) Matching with GAD (no RANSAC). (f)recall vs. 1−precision curves. (g) Matching with GAD (after RANSAC). (h) Registered aerial image and 3D wireframes using GAD. between the two images, and applied to reproject the aerial images onto the 3D wireframes for texturing, as shown in Fig.10(h). Figure 10(f) provides therecall vs. 1−precision curve comparison, which shows that traditional descriptors barely works with different texture distribution patterns, as can be seen from their low recall value in the curves, while GAD exhibit robust performance. Figure 11 demonstrates another possible application of multi-modal matching, to match an ordinary image with an artistic image for image retrieval. One image is a photo of the Statue of Liberty, and the other one is a drawing of the Statue. Figure 11(a,b,c,d) are the matching results of SIFT, SURF, BRIEF, ORB, respectively. None of them exhibits many good matches, while GAD can find a good number of correct matches, in spite of completely different image modality and the slight viewpoint change between the two images, as shown in Fig.11(e). Figure 11(f) provides therecall vs. 1−precision curve comparison. Figure 12 shows two images taken on the same area, but one is an intensity image, and the other is a depth image, with visibly different visual patterns. Figure 12(a,b,c,d) are the matching results of SIFT, SURF, BRIEF, ORB, respectively. None of them manages to find enough correct matches, if any. On the 20 Figure 11: Performance comparison when matching a photo and a drawing. (a) Matching with SIFT. (b) Matching with SURF. (c) Matching with BRIEF. (d) Matching with ORB. (e) Matching with GAD. (f)recall vs. 1−precision curves. Figure 12: Performance comparison when matching an intensity image and a depth image. (a) Matching with SIFT. (b) Matching with SURF. (c) Matching with BRIEF. (d) Matching with ORB. (e) Matching with GAD. (f)recall vs. 1−precision curves. 21 Figure 13: Potential multi-modal matching applications of the GAD. (a) Matching maps to aerial images. (b) Medical image registration. other hand, GAD is able to find many correct matches, as shown in Fig.12(e). Figure 12(f) provides the recall vs. 1−precision curve comparison. The GAD has many other potential multi-modal matching applications, such as matching maps to aerial images, or medical image registration [15], as shown in Fig.13(a,b). 3.4.3 Rotation and Scale Invariance The GAD is easily adjusted to achieve both rotation and scale invariance. 
The smoothness property of Gixels introduced in Sec. 6.2.2 makes the descriptor robust to small scale or rotation changes, as shown in Fig. 14(a). This allows a search over scale and rotation changes: the Gixel sample point positions are simply adjusted for each search step in scale or rotation. As introduced in Sec. 3.2.2, Gixels are organized in a circular array, with several Gixels placed evenly on each circle. Rotating the array w.r.t. the center Gixel aligns the descriptor to other orientations, so that all Gixels still sample corresponding regions when the center position is matched. In addition, for each Gixel, the score components (x_old, y_old) are converted through a trigonometric transformation to account for the rotation change into (x_new, y_new): x_new = |x_old cos α + y_old sin α|, y_new = |y_old cos α − x_old sin α| (6) Scale invariance is achieved similarly. First, the radius of each circle in the circular formation is adjusted according to the scale. Then the size of each Gixel's neighborhood is adjusted so it covers the same region under a different image size. Finally, the distance and length values of all gradients within the neighborhood are also scaled linearly. It should be noted that the algorithm is not just compensating for a known rotation or scale change. Instead, it searches over possible changes, and compensates for each searched change using the methods described above. Figure 14(b) uses two images from the "Boat" series [1] with both rotation and scale changes, the second of which is further rendered with a "pencil drawing effect" via Photoshop. GAD exhibits rotation and scale invariance by finding a good number of correct matches. Figure 14: Matching performance of GAD with rotation and scale changes under multi-modal situations. (Curve comparisons are not included since other descriptors barely work across modalities, as Sec. 3.4.2 already showed.) (a) Matching between images 1 and 2 from the "Graffiti" series [1], where one of the images is rendered with a "glowing edges effect". (b) Matching between images 1 and 4 from the "Boat" series [1], where one of the images is rendered with a "pencil drawing effect". (The pointer on each keypoint indicating its orientation shows that almost all keypoints are aligned and matched correctly.) 3.4.4 Public Dataset Figures 14(a, b) both come from the public dataset [1], where GAD achieves good performance even under multi-modal conditions. Two more examples are presented in Fig. 15(a, b), including the "Ubc" series with JPEG compression and the "Leuven" series with illumination change, again both from the public dataset [1]. Due to space limitations, the matching results of other descriptors are not shown here (except their curves); they can be found in previous literature. Figure 15: More results for matching with GAD (left) and recall vs. 1−precision curve comparison (right). (a) Matching with JPEG compression (images 1 and 2 from the "Ubc" series [1]). (b) Matching with illumination change (images 1 and 4 from the "Leuven" series [1]). Figure 15(a) shows that the GAD outperforms other descriptors on the "Ubc" data with JPEG compression. A possible explanation is that matching JPEG-compressed images to normal images is similar to a multi-modal problem, as JPEG compression produces more "blob-like" areas with quantization edges and deteriorated texture quality. The smoothness property (Sec. 6.2.2) of Gixels also likely helps in reducing the impact of JPEG compression.
In Fig.15(b) with illumination change, the GAD has a recall rate slightly inferior to other descriptors, but still finds a large number of correct matches with almost no errors. 3.4.5 Processing Time GAD’s computation process is time-consuming compared to state-of-the-art descriptors, but no efforts at optimization have been made yet. For examples, Fig.6 (size 512x512) takes GAD 8.9 seconds, while SURF needs 0.7s; Fig.15(b) (size 900x600) takes GAD 19.5s, while SURF needs 1.3s. Speed is not a primary concern during this research, so there is still potential for optimizations. 24 3.5 Summary This chapter introduces a novel descriptor unit called a Gixel, which uses an additive scoring method to extract surrounding edge information. It is shown that a circular array of Gixels will sample edge information in overlapping regions to make the descriptor more discriminative and it can be invariant to rotation and scale. Experiments demonstrate the superiority of the Gixel array descriptor (GAD) for multi-modal matching, while maintaining a performance comparable to state-of-the-art descriptors on traditional single modality matching. 25 4 Training-based Object Detection in Cluttered 3D Point Clouds This chapter describes the first method for 3D object detection in point clouds, using a machine learning Adaboost training procedure [45]. Segmentation is not required and local features are used instead of global features to describe the objects. The local feature is a 3D Haar-like feature [46], working together with 3D summed area table [50] for efficient computation. The 3D point clouds are resampled into a 3D image, for 3D features and summed area table calculation. The method is tested on both dense industrial data and sparse street data containing several different types of objects, demonstrating competitive performance, efficiency and robustness to occlusion. The main contributions in this chapter include: • Combine Adaboost training procedures and local features for 3D object detection. • Use Scanning search without a priori segmentation to deal with cluttered object scenes. • Convert point clouds into volume data (3D image) as basis for feature computation. 4.1 Algorithm Introduction 4.1.1 Detector Training and Detection The algorithm to detect objects in 3D point clouds is divided into two modules, a training module and a de- tection module. A detector is trained for each class of object, using the Adaboost training procedure [45] with training samples generated from a pre-labeled object library. The detector exhaustively scans and evaluates the point clouds, returning the maximum positive responses as detected object locations. Figure 16 shows an overall flow diagram of the training-based algorithm. The object detector consists of N (default N = 30) weak classifiersc i (sec.4.1.3), each with a weightα i . Each weak classifier evaluates a subset of the candidate region, and returns a binary decision. The object detector, or strong classifier, is a combination of all weighted weak classifiers Σ i α i c i , which is compared to a predetermined thresholdt (= 0.5Σ i α i by default) to determine whether the candidate region is a positive match. The differenceΣ i α i c i −t is also used to estimate a detection confidence. The object detector is trained using an Adaboost training procedure [45]. 
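Before turning to the training samples, the decision rule just described can be summarized with a minimal Python sketch (the class and function names are illustrative, not from the original implementation); each weak classifier is assumed to return 0 or 1 for a candidate window, with the weights alpha_i supplied by Adaboost.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class WeakClassifier:
    evaluate: Callable[[object], int]  # returns 0 or 1 for a candidate window
    alpha: float                       # weight assigned by the Adaboost training

def strong_classify(window, weak_classifiers: List[WeakClassifier]):
    """Weighted vote of the weak classifiers, thresholded at 0.5 * sum(alpha)."""
    total = sum(wc.alpha * wc.evaluate(window) for wc in weak_classifiers)
    threshold = 0.5 * sum(wc.alpha for wc in weak_classifiers)
    confidence = total - threshold      # > 0 means a positive detection
    return confidence > 0, confidence

The confidence value returned here is what the later non-maximum suppression step ranks.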
Positive training samples are obtained from pre-labeled library objects by random sampling, with optional additional noise and occlusions (synthesized by selecting a random region in which all points are removed). Negative training samples are produced from negative point cloud regions (regions without the target object) by randomly sampling a subset with the size of the target object. Figure 16: Flow diagram of the training-based 3D object recognition algorithm. The input of the detection module is a region of 3D point cloud, which is pre-processed (sec. 4.1.2) into a 3D summed area table for efficient computation. A 3D detection window is moved across the 3D image, evaluating the match between each point cloud cluster within the detection window and the target object using the detector trained during the training phase. After the scanning window exhaustively evaluates the input point cloud, all detected positive match instances are further processed by non-maximum suppression to identify the target object with the best match and a confidence above a threshold. 4.1.2 Pre-Processing The raw point cloud data often contains much more detail than object detection needs; a point distribution is usually enough for the task. Therefore, a voxelization process is performed to convert the point clouds into volumetric data, or a 3D image. Each voxel in the 3D image corresponds to a grid cell of the original point cloud. However, only the number of points within each grid cell is stored at each voxel; all point coordinate information is discarded. To smooth the sampling effect during grid conversion, each point contributes to more than one voxel through linear interpolation. The size of a grid cell is empirically set to about 1/100 of the average object size. Feature computation based on the 3D image is accelerated with a 3D version of the summed area table [47, 50], which computes rectangular features in constant time (e.g. the 3D Haar-like feature in sec. 4.1.3). The 3D summed area table element at (x, y, z) holds the sum of all elements with coordinates no greater than x, y, z: ii(x, y, z) = Σ_{x'≤x, y'≤y, z'≤z} i(x', y', z') (7) where ii(x, y, z) is the 3D summed area table and i(x, y, z) is the original 3D image. Using the recursive equations: s(x, y, z) = s(x, y, z−1) + i(x, y, z) (8) ss(x, y, z) = ss(x, y−1, z) + s(x, y, z) (9) ii(x, y, z) = ii(x−1, y, z) + ss(x, y, z) (10) ii(x, y, z) can be computed from i(x, y, z) in one pass in linear time. (s(x, y, z) and ss(x, y, z) are cumulative sums, with s(x, y, −1) = 0, ss(x, −1, z) = 0, ii(−1, y, z) = 0.) Any 3D rectangular sum is obtained with eight array references (see fig. 17(a)). 3D Haar-like features (sec. 4.1.3) require twelve array references for the two neighboring rectangular regions. 4.1.3 Features Each weak classifier is based on a feature computed from the 3D image. The 3D Haar feature closely resembles its 2D version proposed in [46]. The feature value is the sum of voxels in half the region minus the sum in the other half, with possible orientations aligned to the x, y or z-axis. Essentially, this feature captures the boundary of the object, since that is where the two blocks differ most. Figure 17(b) illustrates the computation of the 2D Haar-like feature [46] and its 3D counterpart. The feature may vary in both location and size, randomly generated in the weak classifier pool and optimally selected for the best error rate by the Adaboost training procedure (sec. 4.1.1). Figure 17: (a) Illustration of the 3D summed area table [50]. The eight regions are A, B, C, D, E, F, G, H, respectively.
Assume V_ijk is the value of the 3D summed area table at corner A_ijk (i, j, k = 0, 1). Then V_111 = A+B+C+D+E+F+G+H, V_110 = A+B+C+D, V_101 = A+B+E+F, V_011 = A+C+E+G, V_100 = A+B, V_010 = A+C, V_001 = A+E, V_000 = A. Thus H = V_111 − V_110 − V_101 − V_011 + V_100 + V_010 + V_001 − V_000. (b) Illustration of 2D [46] and 3D Haar-like features. The feature value is the normalized difference between the sum of pixels (voxels) in the bright area and the sum of pixels (voxels) in the shaded area. Figure 18 shows the first 5 features selected for the T-junction and the oil tank object. The features are located on the object surface, thus capturing the object boundary, and the Adaboost training procedure selects the most distinctive regions of an object as its features. Figure 18: The first 5 features selected by the Adaboost training procedure. (a) For the T-junction object; (b) For the oil tank object. One key point of the algorithm is a framework combining the Adaboost training procedure with local features. In this sense, any local feature can be used as long as it can support a weak classifier. As an illustration, another local feature based on binary occupancy is implemented, in addition to the 3D Haar-like feature already introduced. Figure 19: Illustration of the binary occupancy based feature. The occupancy feature is computed by converting the 3D image (sec. 4.1.2) into a binary 3D image through simple thresholding, then using the target object as a template to test the percentage of overlapping grid cells w.r.t. the matching candidates. A binary AND operation is used to speed up the computation, while a dilate operation makes the matching more resistant to noise. Figure 19 illustrates a simple version of the binary feature. Table 1 provides a performance comparison between the two features. The 3D Haar-like feature performs better, so it is used for the remainder of the experiments. Nevertheless, the binary occupancy feature illustrates that other local features may be integrated into the framework to pursue improved classification performance.
Table 1: Performance comparison of the 3D Haar-like feature and the binary occupancy based feature.
                                         Haar-like Feat.    Binary Occupancy Feat.
Valves in Industrial Scene
  Conf. of 2 True Positives              90%, 60%           90%, 30%
  # False Alarms                         0                  0
  Processing Time                        2 sec              2 min
Cars in Street Scene
  Conf. of 2 True Positives              100%, 70%          80%, 60%
  # False Alarms                         8                  22
  Processing Time                        5 sec              4 min
4.2 Analysis 4.2.1 False Alarm Reduction The negative training samples described in sec. 4.1.1 are randomly cropped, which means they are usually very different from the target object, so the features selected during training are not very discriminative. As a result, the detector often produces a large number of false positive detections.
Table 2: The change in performance as more negative samples are added.
# Negative Samples   # Detections   # False Alarms   Conf. of 2 True Positives   Highest False Detection Conf.
30                   25             23               90%, 40%                    50%
34                   21             19               80%, 50%                    30%
44                   8              6                90%, 20%                    20%
66                   1              0                70%, < 0                    < 0
To reduce the number of false detections, or to train the detector for more discriminative power, the idea is to use false detections (relatively "hard" samples) as additional negative training samples to re-train the detector. The false detections used for re-training are detected purely from other negative scenes that are guaranteed to contain no target object. The re-training process can be repeated several times to further reduce false detections.
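A minimal sketch of this re-training loop in Python is given below; train_adaboost and detect stand in for the training and scanning procedures described above and are otherwise hypothetical.

def retrain_with_hard_negatives(positives, negatives, negative_scenes,
                                train_adaboost, detect, rounds=3):
    """Iteratively add false detections from object-free scenes as negatives."""
    detector = train_adaboost(positives, negatives)
    for _ in range(rounds):
        hard_negatives = []
        for scene in negative_scenes:          # scenes guaranteed to contain no target
            hard_negatives.extend(detect(detector, scene))  # every detection is a false alarm
        if not hard_negatives:
            break                              # no false alarms left to learn from
        negatives = negatives + hard_negatives
        detector = train_adaboost(positives, negatives)     # re-train with the harder set
    return detector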
Figure 20 shows a scene in which a valve is to be detected (as shown in the top-left corner). Table 2 shows the change in performance as more negative samples are added. As the number of false detections is gradually reduced, and their max confidence is also reduced. One of the detected valves is the same as the reference object, which is always positively identified and keeps a high confidence. The other “positive” valve actually is a bit different from the reference object, since it has a different shaped base and handle, with a ball attached to it. It gets lower and lower confidence as more negative samples are added, until finally identified as negative. This shows that the detector is more precise and restrictive after re-training. Clearly, as hard negatives are filtered, the assumed “positive” is also filtered, which is understandable. Figure 20: Valve detection result. 31 4.2.2 Speed-Up Processing Time The exhaustive scanning search method has brought significant amount of computation in the object detection phase. Without any optimization, the original processing speed is about 400 detection windows per second. Two ideas are employed to speed-up the process. Firstly, when a detection window has too few number of points, the algorithm skip its evaluation since it won’t contain the target object. Equivalently, this means deciding whether the sum of 3D image voxels within a detection window is smaller than a threshold, thus the 3D summed area table (sec.4.1.2) can be used to obtain the sum efficiently. This helps raise the processing speed to about 100000 detection windows per second, because in a real scene most area in the point clouds is actually blank. Moreover, detection window evaluation occupies many processing time, which involves computing fea- tures and applying weak classifiers. As introduced in sec.4.1.2, each 3D Haar-like feature requires only twelve array reference with the help of 3D summed area table. As a result, the final optimized processing time is about 3000000 windows per second, nearly 10000 times or 4 orders of magnitude speed-up from the original. 4.2.3 Rotation and Scale Scale changes are accommodated by resizing the object detector and searching at different scales. Evaluating varying scales uses the same 3D summed area table to speed-up the computation, so the extra time introduced by multiple scale searches is linear. Figure 21: Principle direction detection by PCA to align objects with arbitrary rotation changes. The experiments constrain rotation changes to perpendicular rotations (90 or 180 degrees along x, y, z- axes), so the same 3D summed area table can be used. The assumption (that rotations are usually 90 or 180 degrees) actually holds true for many industrial and street applications. To cope with arbitrary rotations, a principle direction detector using PCA can be applied at each window position before matching evaluation, as shown in fig.21. The detected direction is used to align the candidate object to the same orientation as 32 the library object. However, this will render the 3D summed area table ineffective, thus these cases increase processing time. 4.3 Experiments Data from several intrinsically different sources are chosen to demonstrate the range and performance of the detector, including industrial data, street data, and a public 3D object recognition dataset [49] containing 3D models of cluttered objects with high occlusion. 
3D object recognition in point cloud is still a relatively new research topic, arising from the availability of 3D scanners. Many existing public datasets are only for mesh models. There are a few public datasets available in point clouds [29,31] for street and indoor data, but they are labeled based on point classification, with results provided in percentage of correctly labeled points, so they are not really comparable with the algorithm describe in this proposal. Industrial data exists [48], but still only for mesh models. This proposal aims at developing a method for 3D object recognition in industrial site point clouds, for which there’s yet to be seen a real competitor. Experiments presented in this section serve to demonstrate the performance and efficiency of the 3D object recognition algorithm, with a comparison on public data [49] only to show the robustness to occlusion. 4.3.1 Industrial Data Industrial applications often focus on objects with simple shapes such as T-junctions, valves, etc. The chal- lenge in detecting such objects is that they usually appear connected to or nearby other objects of similar shape, making it challenging for segmentation algorithms to isolate them and extract global features. Figure 22 provides some object detection results in industrial scenes for six types of objects. The chosen objects are all simple structures and positioned in areas cluttered with other objects. The detector manages to identify all of them successfully. Table 3 lists statistical results for nine categories of objects from the industrial dataset, including the number of instances for each type of objects, the number of total detections, and number of correct detections (true positives) and wrong detections (false alarms), and recall / precision rate based on the numbers. The detector finds most of the instances, achieving a combined recall rate of 83.8%. Most false alarms are from the “T-Junction” and “Warning Board” categories that have a very generic shape, which is easily confused with similar structures, as shown in fig.23. False alarms usually have a detection confidence much lower than the true positives, making it possible to filter them out by ranking or adjusting the threshold. Moreover, 33 Figure 22: Object detection results in industrial scenes. since the detectors for different objects are executed in parallel, low-confidence false alarm of one object overlapping with true positive of another object will automatically be ignored. Table 3: Performance of the detector on each type of objects in the industrial data. Object Category # appearance # detected # correct Recall # false alarms Precision Big-based Valve 3 4 3 100% 1 75% Small-based Valve 5 6 4 80% 2 66.7% T-Junction 9 12 7 77.8% 5 58.3% Long Handle Tank-top Tube 2 2 2 100% 0 100% Short Handle Tank-top Tube 2 1 1 50% 0 100% Warning Board 9 15 8 88.9% 7 53.3% Lights 3 4 2 66.7% 2 50% Tube 2 2 2 100% 0 100% Ladder 2 5 2 100% 3 40% Total 37 51 31 83.8% 20 60.8% 4.3.2 Street Data Street data applications often focus on a specific class of objects, such as trees or cars. Such objects are relatively easily segmented from the ground, so many detection methods recognize the whole cluster using global features. Experiments show that different types of street objects are recognized without the prerequisite of segmentation. 34 Figure 23: (Upper) False alarm examples in industrial scene. (Lower) False alarm examples in street scene. Figure 24 provides some object detection results in street scenes. 
Table 4 lists statistical results for five kinds of street objects. All detections are achieved by a scanning search without a segmentation process. Almost all the street objects are located, with a combined recall rate of 94.1%, even though the point density is very low and the samples are highly biased to one side. However, as in the previous experiments, generically shaped objects and sparsely sampled data result in several false alarms. Some examples are shown in fig.23. As in the industrial data, most of the false alarms have a confidence much lower than the true positives, and thus can possibly be filtered out. Figure 24: Object detection results in street scenes. There is no public street data available in point clouds with labeled categories of objects. Existing methods on street data require either prior segmentation [19], or point classification [29,31], thus not comparable with the algorithm presented in this proposal. In fact, latest street object recognition method [19] reported a recognition precision of 58% and recall rate of 65%, while the training-based method produces a precision of 43.2% and recall rate of 94.1%. Note that the training-based method is tuned to emphasize more on recall 35 Table 4: Performance of the detector on each type of objects in the street data. Object Category # appearance # detected # correct Recall # false alarms Precision Car 4 7 4 100% 3 57.1% Lamp Post 3 5 3 100% 2 60% Tree 4 8 3 75% 5 37.5% Roadblock 4 11 4 100% 7 36.4% Stop Sign 2 6 2 100% 4 33.3% Total 17 37 16 94.1% 21 43.2% Figure 25: Left: Object recognition in cluttered scene with high occlusion. Right: Recognition rate vs. Occlusion compared to spin image [39], tensor matching [34] and keypoint matching [49]. rate because false alarms can usually be removed with easy human interactions. Even though the results are not produced on the same data or procedure, they still show the training-based object recognition method is at least competitive in performance. 4.3.3 Public Data The last experiment uses a public dataset [34] for 3D object recognition in cluttered scenes. The dataset has 50 scenes, each containing 4-5 3D models of randomly placed objects. Since the objects are placed together, and the scene is captured only by a single-sided scan, high occlusion and partial data are common. The data are presents in 3D mesh models, which are transformed into point clouds with a virtual scanner, then apply the detector. The results are compared to several state-of-the-art algorithms, including spin image [39], tensor match- ing [34] and keypoint matching [49]. Figure 25 provides the recognition results comparison. A recognition rate vs. occlusion graph is plotted as in [49]. Even though the detector is global, the ensemble of locally trained weak classifiers achieves comparable performance to local descriptor matching methods for these highly occluded scenes with very limited partial views, demonstrating the robustness to occlusion. 36 4.3.4 Run-Time Statistics The optimized object detector can efficiently process large-size point clouds data. For example, in the first scene of fig.22, to detect a T-junction (with 39917 points, size of 36× 33× 38 after voxelization) in an industrial scene (with 557528 points, size of 500× 482× 445 after voxelization), 107245000 detection windows are scanned, with 98543382 of them skipped due to too few number of points. 
This leaves 8701618 windows to be evaluated, leading to 261048540 features and weak classifiers to be computed and evaluated, which cost 17.544 seconds. The pre-processing of 3D summed area table itself takes only 1.852 seconds. The total detection time is 36.315 seconds. The complete industrial dataset contains a middle-size industry scene with about 14 millions of points, which can take less than half an hour to process. In another example, the first scene of fig.24, 5.406 seconds are spent to detect a car (2590 points) in a street scene (213920 points), with 3.968 seconds on feature computation and 0.143 seconds on pre-processing. 4.4 Summary This chapter describe a general purpose 3D object detection method that detects 3D objects in various 3D point cloud scenes, including industrial scenes and street scenes. An Adaboost training procedure is intro- duced to train detectors with 3D Haar-like features, whose computation receives significant speed-up from 3D summed area table. The framework can support other feature types, as illustrated by the implementa- tion of a 3D Haar-like feature and a binary occupancy feature. Experiments on various object types from industrial data, street data, and mesh model public data demonstrate competitive performance, efficiency and robustness to occlusion. 37 5 Application in Automatic 3D Industrial Point Cloud Modeling With the first solution to 3D point cloud object detection as described in the previous chapter, this chapter demonstrates describes the practicability of the algorithm in an integrated system for automatic generation of a 3D classified model of an industrial scene from 3D point cloud data. The system consider industrial point clouds to be a congregation of pipes, planes and objects, and inte- grates three main modules to handle each part of the data. Two of them are primitive extraction processes that detect cylinder and planar geometry in the scene and estimate models and parameters to fit the evidence. The third module utilizes the 3d point cloud object detection techniques to match clusters of 3D points to 3D models of objects stored in a prebuilt object library. The best-matched library object is used to represent the point cluster. Combination of primitive extraction and object detection processes complete a 3D model for a complex scene consisting of a plurality of planes, cylinders and general objects. Creating such models currently requires extensive time by skilled modelers, even using the best software tools available today. Figure 26: (a) Original industrial point clouds. (b) 3D models automatically reconstructed by the integrated 3D modeling system. (c) Models by commercial software [51], completely devoid of any object. (d) Hand- made models by professional modelers. Fig. 26(a) shows an example industrial point cloud scene with a few pipes and industrial objects. Special- ized commercial software [51] only attempts to model pipes, and completely ignores other scene objects, as shown in fig. 26(c). Of course hand-made models can achieve arbitrary levels of detail and fidelity, however, as shown in the pumps in fig. 26(d), human mistakes are also common in such models due to their complexity as well as fatigue and other variations in human performance. Fig. 26(b) presents results by the described 38 system, where most pipes and objects are reconstructed. 
With the separately modeled pipes, planes and objects, the combined results display are freely switchable between mesh model for efficiency or point clouds for accuracy, which won’t be achievable in traditional data- driven or simple primitive-fitting methods. Several industrial point cloud datasets are used in experiments to demonstrate the performance of the complete system. The completeness and quality of the produced models is compared to a leading commercial automatic modeling product [51] as well as hand-made models created by professional service providers. The main contributions in this chapter include: • Describe an integrated system with pipe modeling, plane classification, and object detection that auto- matically classifies and models 3D industrial point clouds. • Describe issues that arise in the integration and their solutions. • Present results for several challenging data sets and a multi-mode display capability, uniquely achiev- able with the integrated 3D modeling system. 5.1 Individual Modules The 3D point cloud modeling system treats an industrial point cloud as a congregation of pipes, planes and objects, and integrates three main modules for pipe modeling, plane classification and 3D object detection. Primitive shapes such as pipes and planes are extracted first, and afterwards more complicated objects will be detected by a 3D object detection module, using the algorithm described in the previous chapter. The primitive shape extraction modules are developed by Qiu [59] and Jing [60], and described more in detail in the cited publications. For the completeness of this proposal, this section briefly summarizes these two modules. 5.1.1 Pipe-Run Modeling Since pipe-runs are critical and prevailing shapes in typical industrial sites, they explicitly detected in the first place. The pipe-run modeling module is based on the work by Qiu et al. [59]. This module takes 3D raw point clouds as input and extracts pipes and reconstruct joints in the scene into a complete pipe network, as shown in fig. 27. This module has three parts: global similarity acquisition, cylinder fitting and joint detection. Instead of following the fit-primitive-then-refine strategy in traditional primitive-based methods, this module detect the global similarities of primitives based on statistical analysis of point normals. These global similarities 39 Figure 27: The pipe-run modeling module [59] takes 3D raw point clouds as input and extracts pipes and reconstruct joints in the scene into a complete pipe network. are then used as constraints in fitting primitives to the point clouds. Then, joints are reconstructed to connect cylinders into a fully-connected pipe network. 5.1.2 SVM-based Plane Point Classification Many points in an industrial scene are part of planes (e.g. ground, platforms and boxes). Therefore, identify- ing clusters of such points will greatly ease and accelerate the overall modeling process and help the algorithm focus on objects with fine details. Different from usual indoor/outdoor data, the industrial scene contains not only planes, but pipes as well, with large trunks of continuous yet changing normals. To overcome this challenge, the planar surface points classification scheme presented in [60] is applied, which combines the SVM-based classification with the Fast Point Feature Histogram (FPFH) [61] as descriptor to characterize the neighborhood of each point. After the local point-level classification, points belonging to relatively big planar surfaces are segmented. 
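As a rough sketch of this per-point classification step, assuming FPFH descriptors (typically 33-dimensional) have already been computed for every point, an off-the-shelf SVM can be trained and applied as follows; the scikit-learn usage is illustrative and not the exact pipeline of [60].

import numpy as np
from sklearn.svm import SVC

def train_plane_classifier(fpfh_train: np.ndarray, labels: np.ndarray) -> SVC:
    # fpfh_train: (N, 33) FPFH descriptors for labeled points,
    # labels: 1 for plane points, 0 for non-plane points (pipes, objects, ...).
    clf = SVC(kernel="rbf", C=1.0, gamma="scale")
    clf.fit(fpfh_train, labels)
    return clf

def classify_points(clf: SVC, fpfh_scene: np.ndarray) -> np.ndarray:
    # Returns a boolean mask of points predicted to lie on planar surfaces.
    return clf.predict(fpfh_scene) == 1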
Given the classified plane points, a random unvisited seed point is iteratively selected and expanded into unvisited neighboring points using the classical Flood-Fill algorithm, resulting in a set of disjoint clusters within finite steps. Only clusters with enough points (roughly more than 1k) will be considered as a plane surface candidate to preserve small planar area that might be part of an object. Fig. 28 shows an example of plane classification. In experiment, plane point removal reduces the size of input from 25,135k to 14,137k, meaning only half of the original points need to be considered in the following steps, which shows the effectiveness of the plane classification module. 40 Figure 28: Plane point classification [60] result. The yellow points are classified as plane, while the red ones are classified as non-plane. Note that the small clusters (e.g. those on the ladder) will no longer be considered as plane clusters after segmentation. 5.2 Integration Issues and Solutions The complete model can be produced by merging the results from the three modules. However, some issues arise in the integration of individual modules into a complete system, which are described and solved in this chapter. The relationship of each modules and system integration elements are illustrated in the diagram in fig. 29. Figure 29: System diagram - relationship of each modules and system integration elements. 41 Figure 30: (a) Original industrial point clouds. (b) Unaligned results after merging pipe models and detected 3D objects (notice the small displacement between objects). (c) Objects are aligned to pipes. (d) Pipe segments are predicted and modeled in small gaps between objects on the same axis. 5.2.1 Object Alignment In object detection, each type of objects is detected independently by exhaustive window searching with a discrete step size, and each object instance is detected independently as a local confidence maximum. As a result, the detected locations of different object types, or different instances of the same object may appear with certain displacement between each other like in fig. 30(b). After merged with the pipe models, the visual artifacts caused by these small displacements are worsen, because most objects are supposed to be aligned with pipes. This issue is solved with the information gained from pipe models, by trying to align the detected objects with one of the pipe segments. Each modeled pipe segment has an axis vector, one point on the axis, min and max value along the axis to define end points of the segment, and a radius. When building the object library, each type of object supposed to be connected to a pipe will also be defined with an axis and two endpoints, similar as a pipe segment. After results are merged, each detected object will be attempted to align with a nearby pipe segment in similar orientation, following Algorithm 1. As shown in fig. 30(c), all objects in the scene are aligned with a pipe segment. Notice that the objects may appear slightly overlapped because library 42 models from source data is not accurate - bigger than actual point clouds. 
Algorithm 1 Align detected objects Require: n pipe segmentsP i ,i = 1,...,n, axisl pi m detected objectsO j ,j = 1,...,m, axisl oj axis angle thresholdα th distance thresholdD th for allO j do for allP i do ifangle(l pi ,l oj ) <α th then ifdistance(O j ,l pi )<D th then D ij ←distance(O j ,P i ) end if end if end for k← minimize k D kj ,k = 1,...,n X kj ,Y kj ← closest point pairs onO j ,l pk T kj ←transformation(X kj →Y kj ) A j ←T kj (O j ) end for return m aligned objectsA j ,j = 1,...,m 5.2.2 Pipe Generation in Gaps In fig. 30(c), there is a single object on the left separated from others, but still aligned. It seems a pipe segment is supposed to connect the object to others, but the pipe modeling module failed to do it because it’s not long enough to establish a confident pipe segment there. Nonetheless, it’s still long enough to become a visual artifact. It’s not easy to solve this by the pipe modeling module alone, but luckily, the integrated system provide extra information in detected and aligned objects. Since two objects on the same axis has a small gap in between, the algorithm can predict there’s a good chance that a pipe segment should connect them. The predicted pipe segment will then be verified with the original point clouds. Algorithm 2 describe the steps to fill these gaps with pipe. Fig. 30(d) shows the gap is filled successfully. 5.2.3 Flange Detection Flange is a common structure in industrial scenes. It’s a special type of object that is too complex to be captured by pipe modeling, but yet too simple to be recognized by object detection with both good precision and false alarm rate. However, under an integrated system, the extra pipe information can help the object detection module locate the flanges successfully. The key ideas for this include: 43 Algorithm 2 Connect gaps with pipe segments Require: set{P} ofn pipe segmentsP i , axisl pi set{A} ofm aligned objectsA j , axisl aj axis angle thresholdα th distance thresholdD min andD max while{A}6=φ do get next objectsA k from{A} for allA j do ifangle(l ak ,l aj ) <α th then {A 0 }←A j X jk1 ,X jk2 ← projected end points ofA j onl ak {E}←X jk1 ,X jk2 end if end for for allP i do ifangle(l ak ,l pi ) <α th then {P 0 }←P i Y ik1 ,Y ik2 ← projected end points ofP i onl ak {E}←Y ik1 ,Y ik2 end if end for sort{E} while{E} is not traversed do get next neighboring point pairse a ,e b from{E} if the space betweene a ,e b is empty then D←distance(e a ,e b ) ifD min <D <D max then generate pipe segmentsS temp betweene a ,e b ifS temp fits with the original point cloud then {A}← S temp end if end if end if end while {A}←{A}−{A 0 } {P}←{P}−{P 0 } {E}←φ end while return set{S} of generated pipe segments 44 Figure 31: (a) Flange detection with pipe information. (b) “Triple flange” detection. • Treat the flanges more than the “bump” itself, but include a section of pipe beside it when creating the flange library object to include more features. • Since the flange is small, reducing the voxel size when computing 3D image preserves more details for better feature description. • Flanges should always align with a pipe, thus searching only along the established pipe axis (but not limited to modeled pipe segments) can ensure good precision while effectively reducing false alarms. An example flange detection result is shown in fig. 31(a). 
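A minimal sketch of restricting the flange search to positions along a modeled pipe axis is shown below; the pipe-segment fields mirror those listed in Sec. 5.2.1, and the function is illustrative rather than the original implementation.

import numpy as np

def candidate_centers_along_pipe(axis_point, axis_dir, t_min, t_max, step):
    """Sample detection-window centers along a modeled pipe axis.

    axis_point: a 3D point on the axis; axis_dir: axis direction vector;
    t_min, t_max: extent of the segment along the axis; step: spacing
    between candidate centers (e.g. a fraction of the flange width).
    """
    axis_dir = axis_dir / np.linalg.norm(axis_dir)
    ts = np.arange(t_min, t_max + step, step)
    return [axis_point + t * axis_dir for t in ts]

# The flange detector is then evaluated only at these centers instead of
# scanning the full 3D volume, which keeps precision high and false alarms low.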
Another noticeable structure in the scene is the “triple flange”, which won’t be recognized as three separate flanges because there is no small pipe section between each individual flange, and is thus defined as a different type of object. Using similar idea as flange detection, triple flanges can be successfully detected as well, as shown in fig. 31(b). 5.2.4 Other System Features In 3D object detection module, exhaustive searching in 3D space is very time-consuming. Luckily, the integrated system provides primitive shape points from two other modules, and pipes or planes are usually not considered as objects. Therefore, to reduce search space, all the points belonging to pipes or planar surfaces will be removed from the original point cloud, resulting in a residual point clouds. The object detector can then limit its search space according to the residual points. However, many industrial objects usually contain part of pipe or plane, such as T-junctions or pumps, which might be incomplete after points removal. The solution is to use both point clouds: the residual for locating search windows, and the original 45 Figure 32: (a) Residual point clouds after removing pipe and plane points, used to reduce search space. (b) Original point clouds used to compute and evaluate object detector. (c) The “object hierarchy” where the bigger object is composed of two smaller ones, helpful in choosing overlaps. for computing and evaluating the detector. Fig. 32(a) shows the residual point clouds of an object, where points from its base pipe are mostly removed. Fig.32(b) shows the original point clouds of the same object, used for detector evaluation. When detecting multiple object types, each object detector is applied separately. All detection results are then combined to produce the final list of detected objects. If two overlapping object instances are detected, usually the one with higher confidence is kept. This may happen because one object actually belong to the other, for which the concept of ”object hierarchy” is introduced. A bigger object might be composed of several smaller ones, as shown in fig. 32(b, c). Object hierarchy can be used to reassess the confidence in detected objects, and decide which to kept when there are overlaps. 5.3 Results 5.3.1 Display Modes Fig. 33(a, b, c) shows the classified models created by the integrated 3D modeling system for a large industrial scene. Results from the three modules, namely pipes, planes and objects, are rendered in different colors, proving they are successfully detected and extracted. The system design make sure that pipes, planes and each object types are individually detected and labeled, opening the possibility for extra metadata and interaction in virtual environment. The result display can also be freely switched between mesh model for efficiency or point clouds for accuracy. All of these benefits are not achievable in traditional data-driven or simple primitive-fitting methods. In fig. 33(a, b, c), three display modes are demonstrated, with the first (fig. 33(a)) displaying the detected objects in point clouds, another (fig. 33(b)) displaying the objects in 3D mesh model, 46 Figure 33: (a) Results by the integrated system, where detected objects are rendered in point clouds. (b) Results by the system, where detected objects are rendered in mesh models. (c) Results by the system, where detected objects are replaced by data at corresponding location from the original point clouds.(d) Original point clouds. 
(e) Manual CAD models by professional modelers with one extra (mistaken) pipe line and other small errors. (f) Models by commercial software [51] with lots of errors but none of the objects. 47 and the final one (fig. 33(c)) displaying the detected objects in data at corresponding location from the original point clouds. The reason for different display modes is that highly precise mesh model is hard to come by due to time and effort required to create them. While mesh models are more efficient, it’s usually less accurate than displaying in point clouds. The integrated system provides separated pipes, planes and objects, thus the system user can choose to display the more accurately modeled primitives in mesh model, while leaving the more complex objects in point clouds. Rendering part of the results in point clouds mode is not just a simple clone of the input, but on the other hand more precisely reconstruct the original point cloud, while still preserving the meta-information of the detected objects and classified primitives. 5.3.2 Comparison As introduced above, fig. 33(a, b, c) present the modeling results by the integrated 3D modeling system in different display modes. Fig. 33(d) provides the original point clouds for reference. Fig. 33(e) shows the CAD models bundled with the source data, manually created by professional service providers. The hand-made model has 4 pipe lines on the top platform, which is wrong compared to the original point clouds who only has 3. On the other hand, the integrated system correctly modeled the 3 pipe lines, showing that automatic system sometimes even outperform human eyes in complex scenes. In addition, some objects in the CAD model are very inaccurate, such as the small valves. The problem will not present if objects are rendered in point clouds as in fig. 33(a, c). The hand-made model is also worse on rails and platforms. Fig. 33(f) shows the result by a leading commercial automatic modeling product, ClearEdge 3D [51], which is only able to correctly model some pipes but none of the objects, together with lots errors. 5.3.3 Complete Reconstructed Scene Fig. 34(a) shows a full industrial scene reconstruction result by the integrated 3D modeling system, demon- strating its capability in handling large-scale complex data. Fig. 34(c) shows the result by ClearEdge 3D [51], almost unrecognizable. Fig. 34(d) shows the hand-made model by professionals, which does look cleaner but loses too much details compared to the system result, and contains some inaccuracies. Fig. 34(b) is the original point clouds. Fig. 34(e) presents more results of the system on other industrial datasets. 5.4 Summary This chapter presents a 3D classification and modeling system to automatically create models from 3D indus- trial point clouds, integrating three modules for pipe modeling, plane classification and object recognition, 48 Figure 34: (a) Complete industrial scene reconstruction by the integrated system. (b) Original point clouds. (c) Models by commercial software [51]. (d) CAD models hand-made by professionals, cleaner but lose too much details. (e) Results by the system on more industrial datasets. 49 demonstrating the practicability of the training-based 3D object recognition algorithm. Several issues when integrating the modules into the complete system are addressed. 
The system is designed so that pipes, planes and different types of objects are recognized separately, allowing freely switchable display in mesh models and/or point clouds for balanced efficiency and accuracy. Experiments show that the integrated 3D model- ing system can successfully model large complex industrial point clouds, outperforming leading commercial automatic modeling software, and comparable to professional hand-made models. 50 6 3D Object Detection by 2D Multi-View Projection This chapter describes another solution for 3D object detection in point clouds, improved upon the previous one using Adaboost and 3D features, which suffers from only partial rotation and unsatisfying speed. Two recent trends motivate this work. Growing availability and use of 3D scanners has spurred interest in 3D object recognition. Also, 2D object detection in images has improved dramatically. These observations motivate a transformation of the 3D object recognition problem into a series of 2D detection problems. This 3D-to-2D strategy is similar to those used for 3D object model retrieval [26,78,87], but the target of this work is unsegmented noisy large-scale 3D point cloud which is much more complex. The key idea of the improved solution is to transform the object recognition problem from 3D space into a series of 2D detection problems. To achieve this, a multi-view recognition framework is described, which first project the 3D point clouds into 2D depth images of multiple views, then detect object in each views separately using a 2D algorithm, and finally re-project all the 2D detection results back into 3D to estimate combined 3D detection locations. This algorithm effectively reduces the problem complexity from 3D to 2D, stabilizes the performance and achieves rotation invariance with multiple views, without any requirement for a priori object segmentation or descriptor training. The multi-view projection process also stabilizes performance in cluttered and occluded scenes and provides rotation invariance. This algorithm is demonstrated on both dense industrial data and sparse street data containing several different types of objects, in comparison to the previous algorithm. It is also tested on a combination of in- dustrial data and street data [29,74] containing various types of objects and scene conditions. In comparisons with state-of-the-art 3D recognition methods and the previous method, this improved method has competitive overall performance with one-order of magnitude speed-up. The main contributions in this chapter include: • Transforming the 3D point cloud object recognition problem into a series of 2D detection problems to reduce search complexity. • Employing multi-view projection to provide rotation invariance and stabilize performance in cluttered and occluded scenes. • Achieving one-order of magnitude speed-up compared to state-of-the-art 3D recognition algorithms. • Removing all requirements for prior object segmentation or detector training needed in other algo- rithms. 51 Figure 35: Flow of the proposed algorithm: First project the 3D point clouds into 2D images from multiple views, then detect object in each view separately, and finally re-project all 2D results back into 3D for a fused 3D object location estimate. 
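Before the detailed description, the flow of Figure 35 can be summarized as a Python skeleton; every helper passed into it is a placeholder for a step described later in this chapter, not part of the original code.

def detect_3d_multiview(scene_pc, object_pc, views,
                        project_scene, project_object, detect_2d,
                        back_project, fuse):
    """Skeleton of the multi-view pipeline in Figure 35.

    Helpers supplied by the caller: project_scene(scene_pc, view) yields 2D
    depth-section images, project_object(object_pc, view) yields a 2D object
    image, detect_2d returns (x, y, confidence) tuples, back_project maps a 2D
    hit back into 3D, and fuse merges the 3D votes into object locations.
    """
    object_views = [project_object(object_pc, v) for v in views]
    votes = []
    for view in views:
        for scene_img in project_scene(scene_pc, view):      # one image per depth section
            for obj_img in object_views:
                for (x, y, conf) in detect_2d(scene_img, obj_img):
                    votes.append((back_project(x, y, scene_img, view), conf))
    return fuse(votes)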
6.1 Algorithm Introduction 6.1.1 Multi-View Recognition Framework The core idea of the improved 3D detection algorithm is to transform a 3D detection problem into a series of 2D detection problems, thereby reducing the complexity of an exhaustive 3D search to a fixed number of 2D searches. This is achieved by projecting the 3D point clouds at multiple viewpoints to decompose them into a series of 2D images, which works like the reverse of multi-view stereo reconstruction [89], where 2D images from multiple viewpoints are fused to reconstruct 3D information. To ensure that the original 3D information is not lost, the 3D-to-2D projection is done at multiple viewing angles (evenly chosen on a sphere). Depth information is utilized when projecting the 2D image for each view, and kept for the later re-projection back into 3D that fuses the 2D results. As shown in the algorithm flow in fig. 35, after the input 3D point cloud is projected into 2D images from multiple views, each view is used to locate the target object. Lastly, all 2D detection results are re-projected back into 3D space for a fused 3D object location estimate. The multi-view recognition process fits human intuition as well. When human eyes look at a scene, the two eyes capture two views, which are usually enough to identify many objects in the scene; more looks from different views make the identification task even easier. The multi-view recognition framework attempts to simulate this process. The benefits of this multi-view projection are three-fold. Firstly, each view can compensate for the others' missing information, yielding a pseudo-3D recognition process with reduced complexity. Secondly, target objects are also projected from multiple views and detected in all projected scene views, making the recognition process invariant to rotation changes. Thirdly, multiple independent 2D detection processes stabilize the final fused 3D object locations, filtering out the discrete location offsets common in 2D detection. 6.1.2 3D to 2D Projection The projection of the 3D object is performed at multiple viewpoints evenly distributed on a sphere to help ensure invariance to rotation (sec. 6.3.1). The scene point cloud is also projected at several viewpoints to stabilize performance, especially in cluttered and occluded scenes (sec. 6.3.2). For each view direction, the 3D input point cloud is projected into a 2D image, with each image pixel's intensity recording depth information, as shown in fig. 36. Both the data scene and the search-target object are projected for 2D detection. The 3D space is discretized into cells. The cell size is fixed for each dataset according to point density and object size. Cells with at least one point are considered occupied. The cell size is set so that occupied cells have 10-20 points on average. Parallel projection rays from each view then sample the scene by extending rays from the pixel array. The occupied cell closest to the viewpoint sets the depth value of that pixel, similar to the idea of z-buffering. However, a point cloud scene is usually too crowded to be rendered in one depth image without objects occluding each other. The scene in fig. 36 contains two tanks, but in one view the projected depth image only shows one of them because the other is occluded. In larger scenes this happens frequently, which severely damages the performance and stability of the algorithm.
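Before turning to that occlusion problem, the per-view projection just described can be made concrete. The sketch below rasterizes a point cloud into a depth image for one view direction; the orthographic ray model, nearest-depth rule and fixed cell size follow the description above, while the particular frame construction, default cell value and demo inputs are assumptions.

```python
import numpy as np

def project_depth_image(points, view_dir, cell=0.05):
    """Orthographic depth image of an (N, 3) point array seen along view_dir.
    Pixels store the depth of the closest occupied cell (z-buffer style);
    empty pixels are left at +inf."""
    d = np.asarray(view_dir, dtype=float)
    d /= np.linalg.norm(d)
    # Build an orthonormal frame (u, v, d); u and v span the image plane.
    helper = np.array([1.0, 0.0, 0.0]) if abs(d[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    u = np.cross(d, helper); u /= np.linalg.norm(u)
    v = np.cross(d, u)
    pu, pv, pd = points @ u, points @ v, points @ d            # view-frame coordinates
    iu = np.floor((pu - pu.min()) / cell).astype(int)           # pixel columns
    iv = np.floor((pv - pv.min()) / cell).astype(int)           # pixel rows
    img = np.full((iv.max() + 1, iu.max() + 1), np.inf)
    for r, c, depth in zip(iv, iu, pd):                         # keep the nearest depth
        if depth < img[r, c]:
            img[r, c] = depth
    return img

if __name__ == "__main__":
    pts = np.random.rand(1000, 3)
    depth = project_depth_image(pts, view_dir=[1.0, 1.0, 0.5])
    print(depth.shape, np.isfinite(depth).sum(), "occupied pixels")
```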
The solution to the occlusion problem is to decompose the depth image in each view into different sections according to depth. This helps split the data into multiple layers, reducing occlusions during the 3D-to-2D projection while still keeping some relative depth information between sections. Within each section the depth change is very limited, so the image is converted to binary without depth information, which enables efficient binary image operations that greatly reduce the computation time (refer to sec. 6.2.1). Figure 36: 3D input point clouds of the data scene and object are projected into 2D depth images at multiple views. Preferably, the splitting points should be decided automatically with minimal cutting. However, the point cloud scenes are usually too cluttered and noisy to efficiently locate optimal splitting points. Instead, the split is performed at a fixed depth span, which is much simpler and faster. Although an object may sometimes be cut in half, its silhouette remains the same for 2D projections in directions perpendicular to the cut. In addition, the multi-view detection mechanism ensures the object can still be detected in most other views and successfully retrieved in the final merged 3D detection result. Fig. 37 shows the scene segmented into 4 sections according to depth along one axis and projected into 2D binary images. For the scene point cloud $C_{Scene}$, projection at viewing angle $\theta_i$ and depth section $Sec_j$ results in a 2D image $I_{Scene}(i,j)$, as in eq. 11. The object point cloud is similarly projected into a 2D image $I_{Obj}(k)$, though not segmented by depth, because objects are much smaller.

$I_{Scene}(i,j) = \mathrm{Proj}(C_{Scene}, \theta_i, Sec_j), \qquad I_{Obj}(k) = \mathrm{Proj}(C_{Obj}, \theta_k)$   (11)

Though it might seem that the dimension reduction from 3D to 2D is compromised by the multitude of projection views and depth sections that even out the search complexity, it should be noted that all feature extraction and evaluations are also reduced to 2D at every search location, so the total computation cost is still reduced significantly. Figure 37: The scene is segmented into four sections according to depth along one axis and projected into 2D binary images (converted to binary for efficiency - please refer to sec. 6.2.1). Notice the occluded tank object in fig. 36 can now be seen in a separate view. 6.1.3 2D Object Detection A 2D detection algorithm is needed to detect each projected object image $I_{Obj}(k)$ within each projected scene image $I_{Scene}(i,j)$, over every combination of views and depth sections, as in eq. 12.

$(x,y)_{ijk} = \mathrm{Detect}(I_{Scene}(i,j), I_{Obj}(k))$   (12)

Here $(x,y)_{ijk}$ are the 2D coordinates of the detected object $I_{Obj}(k)$ in the projected scene $I_{Scene}(i,j)$. The algorithm must exploit the fact that all projected images are binary, and it must be fast, because it will be executed many times in the multi-view recognition framework. At the same time, the algorithm should maintain a good detection rate. Three different algorithms are implemented and tested against these objectives. The first algorithm is based on template matching. The projected object is used as a template and searched exhaustively across the whole projected scene image. At each search location, the template and the image patch within the search window are compared pixel-wise to produce a confidence score. The binary images have only two pixel values, 1 for point and 0 for no point, and the different match cases are assigned different weights heuristically, as in the short sketch below.
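This is a minimal sketch of the weighted pixel-wise comparison; the thesis states only that the weights are heuristic, so the particular values here are assumptions chosen to reflect the asymmetry discussed next.

```python
import numpy as np

# Illustrative weights for the four binary match cases (assumed values).
W_BOTH_SET   =  1.0   # object point over a scene point: strong support
W_OBJ_ONLY   = -1.0   # object point over an empty scene pixel: strong penalty
W_SCENE_ONLY = -0.2   # extra scene point (clutter around the object): mild penalty
W_BOTH_EMPTY =  0.0   # background agrees with background: neutral

def template_score(scene_patch, template):
    """Pixel-wise weighted score between two equally sized binary (0/1) arrays."""
    t, s = template.astype(bool), scene_patch.astype(bool)
    score = (W_BOTH_SET   * np.sum(t & s)
             + W_OBJ_ONLY   * np.sum(t & ~s)
             + W_SCENE_ONLY * np.sum(~t & s)
             + W_BOTH_EMPTY * np.sum(~t & ~s))
    return score / t.size

def slide_template(scene_img, template):
    """Exhaustive window search; returns a map of template_score values."""
    H, W = scene_img.shape
    h, w = template.shape
    conf = np.full((H - h + 1, W - w + 1), -np.inf)
    for y in range(conf.shape[0]):
        for x in range(conf.shape[1]):
            conf[y, x] = template_score(scene_img[y:y + h, x:x + w], template)
    return conf

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scene = (rng.random((60, 80)) < 0.1).astype(np.uint8)
    templ = (rng.random((20, 20)) < 0.3).astype(np.uint8)
    conf = slide_template(scene, templ)
    print(np.unravel_index(conf.argmax(), conf.shape), conf.max())
```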
For example, an object point matched to an empty pixel in the data has a larger negative impact than the reverse case, because other objects are often present around the target in the data. The second algorithm also follows the exhaustive window-searching procedure of the first one. However, the method for evaluating each search window is adapted from the BRIEF descriptor [8]. 128 point pairs are randomly generated from each object template following a Gaussian distribution with the center of the object as mean and area/25 as standard deviation. A 4-by-4 area sum around each point produces a value, and the binary test comparing the sum values of each point pair produces a bit in the descriptor. The 128 point pairs thus result in a 128-bit descriptor, which can then be compared to the descriptor extracted from the input image efficiently using Hamming distance and binary shifting operations. Since the binary images in this case are much simpler than the optical images used in the BRIEF paper [8], a threshold on the minimum difference in sum values is enforced when selecting the descriptor point pairs, so that each point pair is discriminative enough, as well as when extracting the descriptor on the input image, to filter out noise. In the experiments shown in fig. 39(d), the BRIEF-based algorithm demonstrates better overall detection performance than the template-based algorithm. Speed-wise, to detect a 104x152 object patch on a 400x226 image, the template-matching algorithm takes 0.015 seconds, while the BRIEF-based algorithm takes 0.01 seconds. However, considering that the 2D detection algorithm will be executed many times (more than 10,000) in the 3D detection process, the speed of the BRIEF-based algorithm is still not very satisfying. A third 2D detection algorithm is therefore proposed, based on gradients computed from raw image pixels. A gradient is computed at each pixel by comparing two neighboring pixels in both the x- and y-directions, producing eight discrete gradient directions. Gradients within a small grid (size 8x8) are summed into a histogram of gradients, and the dominant gradients (with intensity above a threshold) are determined, as shown in fig. 38. This use of gradients reduces the number of values to be compared while representing the most descriptive features of the object. During pre-processing, dominant gradients for all 8x8 grids are computed for every projected image. During detection, scene gradients $G_S$ and object gradients $G_O$ are matched grid-by-grid and summed to produce the match confidence $Conf(i,j)$, weighted by the local gradient intensity $Int(G_O)$, as in eq. 13. In experiments, this algorithm can perform each 2D detection in 0.004 seconds, more than twice as fast as the BRIEF-based algorithm. As shown in fig. 39(d), it is also comparable to the BRIEF-based algorithm in performance. Therefore, the gradient-based method is chosen for the rest of the experiments in this chapter.

$Conf(i,j) = \dfrac{\sum_{i',j'} Int(G_O(i',j')) \cdot G_S(i+i',j+j') \cdot G_O(i',j')}{\sum_{i',j'} Int(G_O(i',j'))}$   (13)

Fig. 39(a) shows the confidence score distribution for the 4 images produced in fig. 37. The confidence score maps are then filtered with an empirically set threshold, as shown in fig. 39(b). Finally, non-maximum suppression is applied to locate the local maxima as the detected object locations, as shown in fig. 39(c). Figure 38: Dominant gradients for all 8x8 grids are computed for every projected image, for both object and scene (most scene grid lines are hidden).
Image and object patches are matched grid-by-grid on dominant gradients during detection. Figure 39: (a) 2D object detection confidence score distribution for the 4 images in fig. 37. (b) Filtered confidence map with only positives. (c) Objects are detected at local maxima in the filtered confidence map. The green number marks the confidence. (d) Precision-Recall curves comparing the three different 2D object detection methods. Figure 40: 2D detection results are re-projected into 3D space and combined to obtain the 3D object locations. 6.1.4 2D-to-3D Results Re-Projection After the 2D images of all views and depth sections are processed independently for 2D object detection, the results are re-projected back into 3D space to estimate the 3D object location. Fig. 40 shows an illustrative re-projection from 2 views. During re-projection, each view is weighted based on its number of gradient pixels, normalized across all views, in order to focus on the views with more descriptive power. The weights are applied during the fusion of the individual 2D results, where the combined 3D object location is estimated from the detections in all views, as in eq. 14.

$(x,y,z)_{3D} = \mathrm{Avr}_{i,j,k}\left( w_k \cdot \mathrm{Reproj}((x,y)_{ijk}) \right)$   (14)

Here $(x,y,z)_{3D}$ are the 3D world coordinates of the detected objects. Each 2D detection result provides two coordinates, while the third coordinate can be estimated from the depth information produced and kept during the 3D-to-2D projection. In the multi-view recognition framework, a 3D object detection is established only if enough re-projected 2D detections occur in close proximity, which ensures detection stability. The final confidence is computed by averaging over the total number of possible views (views without a detection are counted as zero confidence) and then filtered by an empirically set threshold. Different possible rotations are estimated separately (refer to sec. 6.3.1). 6.2 Optimization 6.2.1 Speed-Up with Binary Conversion During the 3D-to-2D projection, the scene point cloud is decomposed in each view into different depth sections to reduce occlusion effects. In each section, the depth variation is limited, so it is considered constant and the image is converted into a binary representation without depth. These binary images allow bit-wise operations that greatly reduce computation time.

Table 5: Speed-up before and after binary conversion
                   2D Detection *     3D Recognition **
Full Grayscale     2.09 seconds       1800 seconds
Binary             0.01 seconds       46 seconds
* Detect a 104x152 object patch on a 400x226 image.
** Detect a 10k-point object in a 100k-point scene.

For 2D detection in the projected images, the gradients are encoded as bits (e.g. 1 for a dominant gradient, 0 otherwise). The gradient histogram within a grid, with its eight gradient directions, can be encoded into an 8-bit integer, enabling efficient binary operations such as bit-compare, shifting and bit-count during grid matching. When matching two grids of gradients, the match score is determined by the number of matching 1 bits. Table 5 compares the speed before and after binary conversion. Bitwise optimizations lead to speed-up factors of 100 for pure 2D projection-image detection, and factors of about 40 in the overall 3D object recognition algorithm. Experiments in sec. 6.4.4 show that the algorithm maintains good detection accuracy even after binary conversion. 6.2.2 Robustness with Smoothed Processing The effects of the data simplifications and conversions made for fast processing must be smoothed to ensure that the detection results are stable and robust.
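Before turning to those smoothing refinements, the bit-level grid matching behind Table 5 can be made concrete. In the minimal sketch below, each grid becomes an 8-bit mask of dominant gradient directions and two grids are scored by counting matching 1 bits; the dominance threshold and encoding details are assumptions.

```python
# Sketch of 8-direction dominant-gradient encoding and bit-count matching (sec. 6.2.1).
# Each grid becomes one byte: bit d is set when direction d is dominant in that grid.

def encode_grid(direction_hist, thresh=4):
    """Pack an 8-bin gradient histogram into an 8-bit mask of dominant directions."""
    mask = 0
    for d, count in enumerate(direction_hist):
        if count >= thresh:          # direction d is dominant in this grid (assumed rule)
            mask |= 1 << d
    return mask

def grid_match_score(mask_a, mask_b):
    """Number of dominant directions shared by two grids (matched 1 bits)."""
    return bin(mask_a & mask_b).count("1")

if __name__ == "__main__":
    obj_grid   = encode_grid([0, 6, 5, 0, 0, 0, 7, 1])   # dominant directions 1, 2, 6
    scene_grid = encode_grid([1, 6, 0, 0, 5, 0, 8, 0])   # dominant directions 1, 4, 6
    print(grid_match_score(obj_grid, scene_grid))         # -> 2 shared directions
```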
Gaussian filtering during gradient computation ensures smooth variations at edges. A 1-bit gradient direction in each grid for either existing or non-existing gradients was found insufficient. Better performance is obtained from 2-bits (4 levels) for gradient intensity: no gradient (00) / weak gradient (01) / medium gradient (11) / strong gradient (10). This better distinguishes between pure background or foreground (no gradient), weak gradients, and strong gradients. When weak and strong gradients are matched to a medium gradient, the match score is halved, thus counting as a weak match. In addition, The multi-view detection threshold is smoothed, with three thresholds involved: Scores from each view must pass a weak threshold; The best score from all views must pass a strong threshold, because in observation, true positives usually have a very high score in at least one of the views; Then in the final 3D merge, the combined score (weighted average score from all views) must pass another threshold to establish a 3D match. 59 Figure 41: Demonstration of rotation invariance, where the rotated object (right) are successfully recognized in the scene at different orientations. 6.3 Analysis 6.3.1 Rotation Invariance The multi-view projections enable the algorithm to recognize objects with arbitrary rotation relative to the search-target object. The projection of 3D object is performed at multiple viewpoints evenly distributed on a sphere, to evaluate all possible scene-object relative rotations. The 3D scene point cloud is also projected at several viewpoints to cope with occlusions and stabilize performance (the 3D scene data is also projected for varied depth layers). 2D detection of a projected object image on a projected scene image is performed for every combination of view and depth images. This may result in huge computational complexity, but the efficient 2D binary image detection algorithm (refer to sec. 6.2.1) can still ensure a reasonable total detection time. Fig. 41 shows some examples where the target object are successfully located under rotation changes. Multi-view projection handles viewpoint rotation, then during 2D detection, the projected image will be rotated within projected plane to search for in-plane rotation, as shown in fig. 42. The results of each relative rotation between the scene and the object views are fused separately when re-projecting 2D results back into 3D space. When multiple rotations are detected at the same location, the one detected with higher combined confidence (usually more views) is selected as the correct result. This way, the recognition results are invariant to rotation changes because all possible rotations are evaluated, without using rotation-invariant descriptors. In the experiments, 3D object is projected in 46 viewpoints (so that neighboring angles are about 30 degree), 3D scene in 6 viewpoints and 3 depth sections. In-plane rotation is searched with a 30 degree step- size for 12 total searches. These result in 46× 6× 3× 12 = 9936 times of 2D detection. The binary image conversion and fast 2D detection ensure a reasonable total detection time under this complexity. To be specific, the average 2D detection time for recognizing a 10k-point object in a 100k-point scene is about 60 Figure 42: The algorithm search for both viewpoint rotation and in-plane rotation to achieve rotation invari- ance. 
0.004 seconds (faster than the number listed in table 5 because many projected images or views are almost blank and can be skipped), resulting theoretically in 9936 × 0.004 = 39.7s spent on 2D detection (in reality it is even faster due to various optimizations, e.g. sharing the gradients computed during the in-plane rotation search). With the multi-view recognition framework, the algorithm can handle arbitrary rotation changes between the data and the object, achieving rotational invariance. This is an important advantage compared to the previous training-based 3D object recognition algorithm, which can only manage object orientations along perpendicular axes efficiently, due to its use of a 3D summed area table. 6.3.2 Multi-View Stability The multi-view projection of the 3D object ensures rotation invariance. For projection of the 3D scene, theoretically 2 viewpoints are enough, because one detection in a 2D image provides two coordinates, so two detections in two views suffice to establish a 3D detection. However, with only two views, if the detection in one view is missing or wrong due to occlusion, noise or simply bad performance, the final 3D detection will also fail. In the multi-view recognition algorithm, the object and scene point clouds are each projected at multiple views. Even if detections in some views fail, the remaining ones are still more than enough to obtain a correct combined 3D result. Detections in different views can verify each other, and a 3D detection is established only if the object is detected in many views, making the process much more stable. A wrong detection usually occurs in only one or a few views, and is thus easily filtered out during the final re-projection and merge. The detection fusion thresholds (described in sec. 6.2.2) applied to results from multiple viewpoints ensure that detections only occur when several views confirm them, increasing algorithm stability and performance. Fig. 43(a) shows the improvement in the final 3D object recognition result as the number of views increases. Figure 43: (a) Detection performance improvement as the number of projection views increases from 2 to 8. Notice there are more true positives and fewer false alarms. (b) Precision-Recall curves with 2, 4, 6 and 8 projection views for the 3D scene. Fig. 43(b) compares the precision-recall curves with 2, 4, 6 and 8 projection views for the 3D scene. More views effectively stabilize algorithm performance and reduce false alarms. However, it should be noted that too many views may make the performance worse, not only because they slow down the processing, but also because neighboring views become too similar and may bias each other's results. For the best balance, 6 scene projection viewpoints are chosen (distributed on a 1/8 sphere). 6.3.3 Depth Sections in Cluttered Scenes Scenes such as industrial sites are often densely populated or cluttered. Projecting a complete 3D point cloud into a single 2D depth image would produce heavy occlusions and interference. As discussed in sec. 6.1.2, splitting the point cloud into depth sections and then projecting within each section effectively reduces occlusions and improves overall performance, as illustrated in fig. 44(a). Ideally, a minimal set of splitting planes should be decided automatically based on some criteria. However, scenes are complex and varied, so identifying optimal splitting criteria is difficult and perhaps unrealistic. Instead, the scene is simply split at fixed depth intervals, as in the short sketch below.
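This is a minimal sketch assuming equal-depth slabs along the viewing direction; the per-dataset section count is passed in rather than derived.

```python
import numpy as np

def split_depth_sections(points, view_dir, n_sections=3):
    """Split an (N, 3) point array into n_sections slabs of equal depth span
    along view_dir, as used before the per-section 2D projection."""
    d = np.asarray(view_dir, dtype=float)
    d /= np.linalg.norm(d)
    depth = points @ d                                   # signed depth along the view
    edges = np.linspace(depth.min(), depth.max(), n_sections + 1)
    edges[-1] += 1e-9                                    # include the farthest point
    return [points[(depth >= lo) & (depth < hi)]
            for lo, hi in zip(edges[:-1], edges[1:])]

if __name__ == "__main__":
    pts = np.random.rand(5000, 3)
    sections = split_depth_sections(pts, view_dir=[0.0, 0.0, 1.0], n_sections=3)
    print([len(s) for s in sections])   # three slabs covering all 5000 points
```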
Although scene objects may be cut, their 2D projection-image silhouettes still represent their shapes. Depending on the relative sizes of the scene and objects, increasing the number of depth sections does not necessarily improve performance, and it certainly increases processing time. Fig. 44(b) compares the precision-recall curves with various numbers of depth sections in a cluttered scene. The number of depth sections is picked empirically, with the general guideline $\lceil \frac{AverageSceneSize}{AverageObjectSize \times 2} \rceil$. For the experiments, 3 sections are used as the best balance between speed and accuracy. The use of multiple view projections and depth sections stabilizes the algorithm's performance. Similar to a classical ensemble classifier, the individual views or sections serve as weak classifiers, which are combined into a strong classifier with stable merged 3D recognition results. Figure 44: (a) There are 6 pumps on the 3 top pipe lines in this heavily cluttered scene. With only 2 depth sections, 4 pumps on the outside can be detected, with some false alarms. With 4 depth sections, all 6 pumps are detected correctly. (b) Precision-Recall curves with 2, 3 and 4 depth sections in a cluttered scene. 6.4 Experiments The experiment section begins with the same industrial data and street data used to test the previous algorithm, for comparison. Then, for a better evaluation, a more comprehensive dataset is described, and the multi-view algorithm is evaluated against several state-of-the-art 3D recognition methods. 6.4.1 Industrial Data Industrial point clouds usually involve small industrial components like valves or pumps. They often appear cluttered with other industrial structures or buried in noise, which may cause occlusions and make a priori segmentation difficult. In addition, the objects' shapes are usually too simple and generic to extract stable local descriptors. Table 6 presents statistical results for 12 types of industrial objects, including the number of object appearances (ground truth), actual detections, true positives and false alarms, and finally the recall and precision computed from these numbers. The algorithm recognizes most of the objects, with a combined recall of 88.7%. Simpler objects like the "T-Junction" or "Pump" categories, with very generic shapes, are more easily confused with similar structures and thus tend to produce more false alarms. However, false alarms usually have low detection confidence compared to true positives. Moreover, since detectors for different objects are executed in parallel, low-confidence false alarms overlapping with a true positive of another object type are removed.

Table 6: Statistical 3D recognition performance on 12 industrial object types.
Object Category    # appearance    # detected    # correct    Recall    # false alarms    Precision
Valve              11              14            10           90.9%     4                 71.4%
Pump Type 1        6               8             4            66.7%     4                 50%
Pump Type 2        3               5             3            100%      2                 60%
Pump Type 3        1               1             1            100%      0                 100%
Pump Type 4        3               5             2            66.7%     3                 40%
Pump Type 5        4               4             4            100%      0                 100%
Tank Type 1        4               4             4            100%      0                 100%
Tank Type 2        3               3             3            100%      0                 100%
Tank Type 3        2               2             2            100%      0                 100%
Ladder Type 1      8               13            8            100%      5                 61.5%
Ladder Type 2      1               1             1            100%      0                 100%
T-Junction         16              23            13           81.3%     10                56.5%
Total              62              83            55           88.7%     28                66.3%

Fig. 45(a) provides some actual 3D object detection results in industrial scenes. Most objects appear in simple shapes and cluttered with others, but the algorithm still manages to identify all of them successfully. Fig.
45(b) shows the precision-recall curves on industrial data compared to the training-based method, in which the new multi-view algorithm performs better in most cases. 6.4.2 Street Data Street object types are generally much easier to be segmented from the ground, thus they are targeted by many detection methods based on global features with the prerequisite of segmentation. The new multi-view method doesn’t have the prerequisite, but can still successfully recognize several different types of street objects. Table 7 presents statistical results for 4 kinds of street objects, achieving a combined recall rate of 88.9% even under highly biased and sparse point cloud scans. However, simple shapes like cars or roadblocks and sparse scans result in several false alarms. Fig. 46(a) provides some actual 3D object detection results in street scenes. Fig. 46(b) shows the precision-recall curves on street data compared to the training-based method, again demonstrating the advantage of the new method. Table 7: Statistical 3D recognition performance on street object types. Object Category # appearance # detected # correct Recall # false alarms Precision Car 4 9 4 100% 5 44.4% Lamp Post 3 5 3 100% 1 60% Tree 7 6 6 85.7% 0 100% Roadblock 4 9 3 75% 6 33.3% Total 18 29 16 88.9% 13 55.2% 64 Figure 45: (a) Object detection results in industrial scenes. (b) Precision-Recall curves on industrial data compared to the training-based method. Figure 46: (a) Object detection results in street scenes. (b) Precision-Recall curves on street data compared to the training-based method. 65 Figure 47: Examples of various test cases and recognition results. 6.4.3 Comparative Experiments Setting To better evaluate the object detection algorithm, a test dataset is constructed from many different sources. It contains three types of data, including single objects, street data and industrial data, and incorporates existing public datasets, such as the 3D Keypoint Detection Benchmark [34], UWA 3D Object Dataset [49] for object retrieval, and CMU Oakland 3-D Point Cloud Dataset [29], Washington Urban Scenes 3D Point Cloud Dataset [74] for street data. For industrial data, there is no public data available, so some of our own data is used. Note that object retrieval data [34, 49] come as mesh models, so a virtual scanner is used to produce point clouds. Original point clouds are used for all street and industrial data. The dataset contains varied data size and object density to reflect different scale and complexity, including small segments or clusters with two or three object instances without too many background data points, and large scenes with more than five object instances and many background data points such as pipes, walls and planes (only for street and industrial data). Some scan conditions are also tested, including occlusions that create partially scanned objects, and clutter in which several objects are close to each other, so they may interfere with their detections. The algorithm is compared with three state-of-the-art 3D point cloud descriptors, the majority way of recognizing 3D point cloud object today, including Spin Images [39], FPFH [44] and SHOT [68]. The PCL 1.6.0 [73] implementations of these descriptors are used, with scene segmentation, feature extraction and matching as described in [60]. Besides descriptor-based methods, the 3D window-scanning method using Adaboost and 3D Haar-like features described in chp. 
4 is also compared, which is more similar to the 66 Figure 48: Precision-Recall Curves on various test cases, compared to Spin Images [39], FPFH [44], SHOT [68] and 3D window-scanning [62]. (a) Small segments with two or three object instances and few background points. (b) Large scenes with more than five object instances and many background points. (c) Industrial sites scan. (d) Street level LiDAR. (e) Occluded scene with partially scanned objects. (f) Cluttered scene in which several objects are close to each other, so they may interfere with their detections. structure of the new algorithm. All methods compared, assume the object scale is known (fixed), which is a valid assumption for most real-world industrial and street objects. The resulting statistics are compared in recall rate and precision. Recall is defined as the percentage of object instances successfully detected, while precision is the percentage of detections that are actually correct. 6.4.4 Precision-Recall Evaluation The first set of experiments compares the new algorithm with others on different sizes of data. As shown in fig. 54(a)(b), the new algorithm performs at about the same level (while achieving a higher recall rate) as state- of-the-art algorithms on smaller data scenes (fig. 47(a)), but better than them on larger scenes (fig. 47(b)). This is because larger data creates increasing numbers of feature points to be matched for descriptor-based methods, especially when there is no good criteria for prior segmentation. The new algorithm detects objects as a whole so there is no increasing feature-matching complexity or the need for prior-segmentation. The second set of experiments compares the new algorithm with others on specific types of data, either from industrial sites (fig. 47(c)) or urban street LiDAR (fig. 47(d)). As shown in figure 54(c)(d), the new 67 Table 8: Detection time comparison between the new multi-view algorithm and the 3D descriptors Time Multi-View FPFH [44] SI [39] SHOT [68] 3DSC [42] USC [69] 3D Scan [62] * 3D Recognition 54s 395s 355s 450s 6500s 5100s 80s Segment & Train 0 235s 225s 240s 280s 265s 20s * Only supports limited rotation changes, thus is not really comparable to (while still slower than) ours. algorithm performs much better than others on industrial data, since the generic scene shapes result in lower descriptive power for descriptor-based algorithms. On street data, the new algorithm achieves similar levels of detection performance as the other algorithms. The final set of experiments compares the new algorithm with others under occlusion (fig. 47(e)) or clutter (fig. 47(f)), both very common in real world scan data. As shown in figure 54(e)(f), the new algorithm performs noticeably better than others under occlusion, and achieves a much higher recall rate in cluttered scenes, thanks to the mechanisms of multi-view projections and depth sections. 6.4.5 Time Efficiency Evaluation An important goal in the design of the new algorithm is fast detection speed while maintaining a good de- tection rate. The previous section already demonstrates that the new algorithm performs at least on the same level as the state-of-the-art 3D descriptors, even better in some cases. This section compare the detection time of the algorithms. Table 10 lists the detection times of the new algorithm, the descriptor-based algorithms and the 3D window-scanning algorithm (Adaboost + 3D Haar-like) for detecting a pump object (18.9k points) in mid- size scene (159k points). 
The 3D Shape Context (3DSC) [42] and Unique Shape Context (USC) [69] are also included, though they are not included in the precision-recall comparison because they are too slow. The descriptor-based algorithms require prior segmentation of background points to reduce number of keypoints for descriptor computation and matching, while the window-scanning algorithm require training for window detector. Since the new algorithm does not require prior segmentation or detector training at all, the time spent on these steps are also provided for other algorithms as a comparison. As shown in table 10, the speed of the new method is at least one-order of magnitudes faster than all the descriptor-based algorithms. Even though the 3D window-scanning algorithm [62] is closer in speed, it only supports very limited rotations (those perpendicular to axes), thus its processing time is not comparable to the new method which handles any rotation. Compared to Song et al.’s latest depth-based recognition algorithm [77], the new algorithm not only limits the search in 2D space instead of 3D, but also has a two- 68 orders of magnitudes faster 2D detection in each view (0.01 seconds as shown in sec. 6.2.1 versus 2 seconds as described in [77]) thanks to the binary conversion. Their algorithm also requires thousands of hours of detector training while the new algorithm needs no training at all. The new method provides significant speed advantage compared to state-of-the-art methods, especially in large-scale applications such as industrial site or urban street data processing, without sacrificing recognition performance. 6.5 Summary This chapter describes an algorithm capable of recognizing 3D target objects in 3D point cloud scenes, in- cluding industrial scenes and street scenes. It’s the second solution for the problem, improving upon the first one using Adaboost and 3D features, which suffers from only partial rotation and unsatisfying speed. A multi-view projection recognition algorithm transforms the 3D recognition problem into a series of 2D detec- tion problems, which reduces complexity, stabilizes performance, and significantly speeds up the recognition process, without any requirement for prior scene segmentation or detector training. Experiments show fast and robust performance on various object types from both industrial and street data, with better overall results and one-order of magnitude speed-up compared to several recent 3D recognition algorithms. 69 7 3D Object Detection with Multi-View Convolutional Neural Net- work This chapter describes the third method for 3D object detection in point clouds, introducing convolutional neural network to handle multiple viewpoints and object classes. Following the previous chapter, to transform the 3D problem into 2D space, 3D point cloud will be projected based on depth information at multiple viewpoints and rotations. This will generate a large amount (at least 10,000) of 2D detection task between projections of scenes and objects, which put a requirement on the speed of single 2D detection algorithm that it must be very fast in processing all 2D images to finish the overall 3D detection task in a reasonable time. As a result, this limits the complexity of 2D detection algorithm, and thus the overall performance of 3D detection. To solve this speed-complexity trade-off, the convolutional neural network (CNN) is proposed for 2D detection, which has already been proved [66, 67, 75, 76] to be the most powerful method for 2D detection. 
CNN is used to handle multiple viewpoints and rotations for the same class of object together with a single pass through the network, thus reducing the total amount of 2D detection tasks dramatically. Moreover, while the existing strategies usually require an individual detector for each class of object, CNN can be trained with a multi-class output, further saving tremendous processing time when there are multiple objects to detect. To enable multi-class CNN to detect object classes with varied sizes, the training sample sizes are unified with padded boundary so the detector will search for all object classes in a uniform-sized window. On top of these, the detection efficiency is further improved by concatenating two extra levels of early rejection networks with binary outputs, simplified architecture and smaller image sizes, before the final multi-class detection network. Experiments show that the new method has competitive overall performance with at least one-order of magnitude speed-up in comparisons with state-of-the-art 3D point cloud object detection methods. The main contributions in this chapter include: • Introduce CNN to process all viewpoints together in multi-view projection-based 3D object detection. • Unify detector for multiple object classes with multi-class CNN and uniform-size training samples. • Increase detection speed by concatenating two early rejection networks with binary outputs, simplified architecture and smaller image sizes. 70 7.1 3D Object Detection with CNN 7.1.1 Multi-View 3D Object Detection This algorithm is designed to improve the previous algorithm of multi-view projection-based 3D detection method as described in chp. 6, though many details have to be changed to adapt to the introduction of CNN. The core idea of the previous algorithm [65] is to transform a 3D detection problem into a series of 2D detection problem, thereby reducing the complexity of an exhaustive 3D search into a fixed number of 2D searches. As shown in the (green-shaded) algorithm flow in fig. 49, this is achieved by projection of 3D point clouds at multiple viewpoints to decompose it into a series of 2D images. To ensure that the original 3D information is not lost, the 3D to 2D projection is done at multiple viewing angles (evenly chosen on a sphere). Depth information is utilized when projecting 2D images for each view, and kept for later re-projection back into 3D for fusion of 2D results. After the input 3D point cloud is projected into 2D images from multiple views, each view is used to locate the target object. Lastly, all 2D detection results are re-projected back into 3D space for a fused 3D object location estimate. 7.1.2 Detect 2D Projections with CNN In the original multi-view 3D detection method [65], the 3D object will be projected in about 50 viewpoints and searched with 12 in-plane rotations to approximately cover all possible orientation, and each of these view projections and rotations need to be detected individually, resulting in thousands of 2D detection tasks. To alleviate this complexity, convolutional neural network (CNN) is used for the task of 2D detection. CNN can be trained with all viewpoints and rotations together for one object class. CNN actually synergize with the multi-view projection method very well, because the large amount of projections will produce e- nough training samples for the CNN to train with. 
In the training phase, projections in all viewpoints and rotations will be produced and supplied as training samples for CNN so it can learn all possibilities. Howev- er, in the detection phase, CNN-based detection can handle all viewpoints and rotations for one object class together, saving tremendous amount of detection tasks. Figure 49 shows the new training and detection flow based on the original flow. In most existing 3D object detection algorithms [62] [65], each object class has its own detector or classi- fier, and they have to be applied individually, costing proportionally more time when there’re several different object classes to detect. CNN-based detection is used to improve this as well, because CNN can be trained with a multi-class output, capable of detecting all object classes together in one pass. 71 Figure 49: Comparison of pipelines for the original multi-view 3D point cloud object detection algorithm [65] and the proposed algorithm with concatenated CNNs. 72 Figure 50: Two types of CNN network structures used. Note the different number of classes in output. In implementation, two types of network architecture are used, as shown in fig. 50. One type of network has less layers with a 2-class output for fast object/non-object classification, in order to efficiently reject most non-object negative windows. The other is a more complicated network with multi-class output to decide the specific class of objects, and reject much harder non-objects. The relationship and setup of the two types of networks are explained in sec. 7.2.1. 7.1.3 Training Sample Generation The positive training samples are generated by projecting a raw 3D point cloud object instance into 2D images at different viewpoints with different in-plane rotations. The projections are performed based on depth as seen from each viewpoint. The 3D space is discretized into cells. Cells with at least one point are considered occupied. The cell size is set so that each projected object has roughly 100-150 pixels in size on average, and fixed for each dataset. Parallel projection rays from each view then sample the scene by extending rays from pixel array. The occupied cell closest to the viewpoint sets the depth value of that pixel, similar to the idea of z-buffering. CNN generally requires a large amount of training samples to be effective. Therefore, the size and diversity of positive samples are further expanded with more random ”jittering”. This includes adding random depth shift, small in-plane translation, noise, dilation or erosion on the edges, and synthetic occlusions. Implementation projects the objects at 100 viewpoints distributed on a 3D sphere, and rotates each projection in 20 rotations, then generates 5 samples for each rotations with random ”jittering”. This gives 10000 positive 73 Figure 51: Some examples of CNN training samples: (a) Positive; (b) ”Easy” negative; (c) ”Hard” negative. samples for each raw 3D object instance. Figure 51(a) shows some examples of the positive training samples. In order to perform detection for multi-view and multi-class together, an uniform search window size must be enforced for all viewpoints and classes. This requires the training samples to all have the same size without distorting the object. The solution is to find a minimal window size that can enclose all views and classes, then pad empty pixels on image border so that all positive training samples are enlarged to that size. The empty pixels may further be filled during the random ”jittering” process. 
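To make the uniform-size padding and random jittering concrete, the sketch below turns one projected object image into a small batch of fixed-size positive samples. The jitter types follow the list above, but their magnitudes (and the omission of dilation/erosion and synthetic occlusion) are assumptions, and the outer loops over 100 viewpoints and 20 rotations are left out.

```python
import numpy as np

def pad_to_uniform(img, out_size):
    """Center a projected object image inside a fixed out_size x out_size canvas
    by padding empty pixels, so all classes and views share one window size."""
    canvas = np.zeros((out_size, out_size), dtype=img.dtype)
    y0 = (out_size - img.shape[0]) // 2
    x0 = (out_size - img.shape[1]) // 2
    canvas[y0:y0 + img.shape[0], x0:x0 + img.shape[1]] = img
    return canvas

def jitter(img, rng, max_shift=3, noise_prob=0.02):
    """One randomly perturbed copy: small translation, depth shift and pixel noise."""
    shift = tuple(rng.integers(-max_shift, max_shift + 1, size=2))
    out = np.roll(img, shift, axis=(0, 1))                  # small in-plane translation
    out = out + (out > 0) * rng.normal(0.0, 0.05)           # random depth shift
    flip = rng.random(out.shape) < noise_prob               # sparse pixel noise
    return np.where(flip, rng.random(out.shape), out)

def make_positives(obj_img, out_size, n_jitter=5, seed=0):
    rng = np.random.default_rng(seed)
    base = pad_to_uniform(obj_img, out_size)
    return np.stack([jitter(base, rng) for _ in range(n_jitter)])

if __name__ == "__main__":
    samples = make_positives(np.random.rand(120, 90), out_size=160)
    print(samples.shape)    # (5, 160, 160)
```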
Negative training samples are generated in point clouds without any object class.The negative point cloud is projected according to depth in the same way at random viewpoints, then cropped at random location of the required size into negative samples, as shown in fig. 51(b). These randomly generated negative samples are usually not very representative. They are used with the positive samples to train CNN, then perform detection on negative projected image, and all false alarms are considered ”harder” negative samples, as shown in fig. 51(c), which are fed back into the training set to fine-tune the network. This process will be repeated 2-3 times until the network has been trained with good classification capability. 7.2 Concatenated Networks 7.2.1 Concatenated CNN for Fast Negative Rejection Object detection in the projected 2D images is executed as an exhaustive scanning window search. However, due to the nature of the point cloud data, many search windows are actually almost blank or contain mostly primitive shapes. The CNN classifier will perform the same amount of convolutions and spend the same time no matter the complexity of the search window. Therefore, multiple CNNs are concatenated and trained with different network architectures and training samples, aiming for different objectives. 74 Figure 52: Concatenate two levels of early rejection network for fast negative window filtering, before the final level of multi-class detection network. A three-level structure is currently used with three networks, as shown in fig. 52. The first two levels use gradually smaller training images (the same size of search window but resized for fast computation), and the network has less layers with an output of only two classes, object or non-object. The final level is a more complicated CNN with more layers and multi-class output, in order to decide the final classification for different object classes or much harder negatives. The first level of CNN use 32x32 sized images mostly for fast rejection of very simple negative windows. The second level of CNN use 128x128 sized images to deal with slightly harder negative windows. Figure 53 shows some examples of the negative windows rejected by the first two levels of early rejection networks, demonstrating different patterns between them, with the first level filtering out mostly simpler backgrounds, and the second level dealing with more complicated negatives. During training, the class probability threshold is set so that 99 percent of positive samples must pass through the first two CNNs. Experiments show that the first CNN can efficiently reject about 65 percent of total negative windows, while the second CNN can reject roughly half of the remaining ones, which means a total of 85 percent of negative windows are rejected in the first two layers. 75 Figure 53: (a) Negative windows rejected in the first level of early rejection network are mostly simpler back- grounds. (b) Negative windows rejected in the second level of early rejection network are more complicated non-objects. Table 9: Speed Analysis for 3D object detection methods (Speed in seconds) Multi-view [65] Single CNN Multi-level CNN Single 2D projection 0.005 5 0.8 Single class 3D 60 120 25 6-class 3D 350 130 28 * Detect objects with about 20k points in a 500k-point scene. 7.2.2 Speed Analysis This section analyze the speed of the original multi-view projection-based method [65], the single-level CNN method, and multi-level concatenated CNN method. 
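As a concrete picture of the multi-level cascade of sec. 7.2.1 before comparing its speed, here is a minimal PyTorch sketch. The layer counts, thresholds and the shared small_net builder are assumptions standing in for the two architectures of fig. 50; only the 32x32 / 128x128 input sizes, the two binary rejection levels and the final multi-class level follow the description above.

```python
import torch
import torch.nn as nn

def small_net(in_size, n_classes):
    """A shallow CNN used as a stand-in for the networks of fig. 50."""
    return nn.Sequential(
        nn.Conv2d(1, 16, 5, padding=2), nn.ReLU(), nn.MaxPool2d(4),
        nn.Conv2d(16, 32, 5, padding=2), nn.ReLU(), nn.MaxPool2d(4),
        nn.Flatten(),
        nn.Linear(32 * (in_size // 16) ** 2, n_classes),
    )

reject1 = small_net(32, 2)     # level 1: 32x32 windows, object vs. non-object
reject2 = small_net(128, 2)    # level 2: 128x128 windows, object vs. non-object
final   = small_net(128, 7)    # level 3: 6 object classes + background (assumed)

def cascade(window, t1=0.5, t2=0.5):
    """Run one search window through the three-level cascade; return a class id
    or None if an early level rejects the window."""
    w32  = nn.functional.interpolate(window, size=(32, 32))
    w128 = nn.functional.interpolate(window, size=(128, 128))
    if torch.softmax(reject1(w32), dim=1)[0, 1] < t1:       # level 1 rejection
        return None
    if torch.softmax(reject2(w128), dim=1)[0, 1] < t2:      # level 2 rejection
        return None
    return int(final(w128).argmax(dim=1))                   # level 3: final class

if __name__ == "__main__":
    window = torch.rand(1, 1, 160, 160)                     # one projected search window
    print(cascade(window))
```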
Table 9 provides some numbers for the three methods. The original multi-view method is much faster in terms of detection speed in single 2D projected image because it utilized binary operations. The CNN-based methods are much slower due to the complexity of neural network, but the early rejection networks can still bring a significant speed improvement. However, the multi-view method has hundreds of different viewpoints and rotations that all require in- dividual detections, while the CNN-based methods can handle them all together. Therefore, for detection of a single-class object, the multi-view method need about 1 minute, while single-level CNN need about 2 minutes, which is already comparable to multi-view method. With early rejections, it’s further reduced to about 25 seconds, faster than multi-view method even for single-class detection. The first two levels of early rejection networks spend about 5 seconds out of 25, to reject roughly 85 percent of negative windows. When there are much more classes of objects to detect, the time cost of multi-view method scales up proportionally, since each class requires a individual detector. Single CNN can handle multi-class problem 76 in almost the same time as single-class problem just with a multi-class softmax layer as output. multi-level CNN is even faster, with one-order of magnitude speed-up compared to the original multi-view algorithm. 7.3 Experiments 7.3.1 Experiment Settings The evaluation dataset follows the same settings as the previous chapter, which consists three types of da- ta, including single objects, street data and industrial data, with varied data size, object density and scan conditions. The algorithm is compared with three state-of-the-art 3D point cloud descriptors, including Spin Im- ages [39], FPFH [44] and SHOT [68]. The PCL 1.6.0 [73] implementations of these descriptors are used, with scene segmentation, feature extraction and matching implemented following [60]. Besides descriptor-based methods, the comparison also includes the previous two methods, the original multi-view projection-method without CNN described in chp. 6, and window-scanning method using Adaboost and 3D Haar-like features described in chp. 4. The resulting statistics are compared in recall rate and precision curves, and also detec- tion speed. Recall is defined as the percentage of object instances successfully detected, while precision is the percentage of detections that are correct. 7.3.2 Precision-Recall Evaluation The first set of experiments compares the new algorithm with others on different data sizes. As shown in fig. 54(a)(b), the new algorithm performs at about the same level as others, with more advantages on larger scenes. This is because larger data creates increasing numbers of feature points to be matched for descriptor- based methods, especially when there is no good criteria for prior segmentation. Compared to the original multi-view algorithm, the new algorithm shows more improvement on smaller segments, thanks to the more stable CNN-based 2D detection. The second set of experiments compares the new algorithm with others on specific types of data, either from industrial sites or urban street LiDAR. As shown in figure 54(c)(d), the new algorithm performs much better than others on industrial data, since the generic scene shapes result in lower descriptive power for descriptor-based algorithms. On street data, the new algorithm achieves similar levels of detection perfor- mance as others. 
The new algorithm is generally an improvement in both scenarios compared to the original multi-view algorithm. 77 Figure 54: Precision-Recall Curves on various test cases, compared to the original multi-view method [65] Spin Images [39], FPFH [44], SHOT [68] and 3D window-scanning [62]. (a) Small segments with two or three object instances and few background points. (b) Large scenes with more than five object instances and many background points. (c) Industrial sites scan. (d) Street level LiDAR. (e) Occluded scene with partially scanned objects. (f) Noisy scene with random noisy points. 78 Table 10: Time comparison for detecting 6 object classes Time Multi-level CNN Multi-View [65] 3D-Scan [62] FPFH [44] 6-Class 3D 28s 350s 450s 2400s SpinImage [39] SHOT [68] 3DSC [42] USC [69] 2100s 2700s 39000s 30000s The final set of experiments compares the new algorithm with others under occlusion or noise, both very common in real world scan data. As shown in figure 54(e)(f), the new algorithm performs noticeably better than others under occlusion, thanks to the mechanisms of multi-view projections and depth sections. The original multi-view algorithm performs quite badly under noise, as its simple 2D detection algorithm is susceptible to noise. Instead, CNN is trained with samples containing noise, and thus bringing a significant improvement under noise. 7.3.3 Time Efficiency Evaluation An important goal in the design of the new algorithm is fast detection speed while maintaining a good detec- tion rate. Table 10 lists the detection times of the new CNN-based algorithm, the original multi-view method without CNN [65], the 3D window-scanning method [62] and the descriptor-based methods. The task is to detect 6 classes of objects ( 20k points each) in a mid-size scene ( 500k points). The 3D Shape Context (3DSC) [42] and Unique Shape Context (USC) [69] are also included, though they are not included in the precision-recall comparison because they are too slow. As shown in table 10, the speed of the new method is at least two-order of magnitudes faster than all the descriptor-based methods, and one-order of magnitudes faster than the previous two methods, thanks to the use of CNN and concatenated early rejection networks. 7.4 Summary This chapter describes an algorithm to use convolutional neural network (CNN) for 2D detection, improving the previous algorithm which transforms 3D object detection problem into a series of 2D detection problems. A trained network can handle all viewpoints and rotations together for the same object class, as well as predicting multiple object classes, without the need for individual detector for each object class, thus reducing the amount of 2D detection tasks dramatically. To make the multi-class CNN able to detect object classes with varied sizes, the training sample sizes are unified with padded boundary so the detector will search for all object classes in a uniform-sized window. In addition, the detection efficiency is further improved by 79 concatenating two extra levels of early rejection networks with binary outputs, simplified architecture and smaller image sizes, before the final multi-class detection network. Experiments show that the new method has competitive overall performance with at least one-order of magnitude speed-up in comparisons with latest 3D point cloud detection methods. 80 8 Conclusion 3D object detection in industrial site point clouds is a long neglected but yet important problem. 
Solution algorithm is desired with requirements in performance, efficiency, practicability, accessibility, as well as robustness to occlusion, rotation and noise. This thesis describes three methods to tackle the issue, with gradually improving performance and ef- ficiency. The first is a general purpose 3D object detection method that combines Adaboost with 3D local features, without requirement for prior object segmentation. Experiments demonstrated competitive accuracy and robustness to occlusion, but this method suffers from limited rotation invariance. As an improvement, another method is presented with a multi-view detection approach that projects the 3D point clouds into sev- eral 2D depth images from multiple viewpoints, transforming the 3D problem into a series of 2D problems, which reduces complexity, stabilizes performance, and achieves rotation invariance. The problem is the huge amount of projected views and rotations that need to be individually detected, limiting the complexity and performance of 2D algorithm choice. Thus the third method is proposed to solve this with the introduction of convolutional neural network, because it can handle all viewpoints and rotations for the same class of object together, as well as predicting multiple classes of objects with the same network, without the need for individual detector for each object class. The detection efficiency is further improved by concatenating two extra levels of early rejection networks with binary outputs before the multi-class detection network. The practicability of the 3D point cloud object detection method is shown in an automatic 3D industrial point cloud modeling system. Prior efforts in such systems focus on primitive geometry, street structures or in- door objects, but industrial data has rarely been pursued. This thesis integrates several algorithm components into an automatic 3D modeling system for industrial site point clouds, including modules for pipe modeling, plane classification and object detection, and solves the technology gaps revealed during the integration. The integrated system is able to produce classified models of large and complex industrial scenes with a quality that outperforms leading commercial software and comparable to professional hand-made models. This thesis also describes an earlier work in multi-modal image matching which inspires later research in 3D object detection by 2D projections. Most existing 2D descriptors only work well on images of a single modality with similar texture. This thesis presents a novel basic descriptor unit called a Gixel, which uses an additive scoring method to sample surrounding edge information. Several Gixels in a circular array create the Gixel Array Descriptor, excelling in multi-modal image matching with dominant line features. 81 References [1] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(10):1615–1630, 2005. [2] H. Bay, A. Ess, T. Tuytelaars, L. Van Gool. SURF: Speeded Up Robust Features. Computer Vision and Image Understanding, 110(3):346–359, 2008. [3] R. Zabih and J. Woodfill. Non-parametric local transforms for computing visual correspondance. Proceedings of the European Conference on Computer Vision, pp.151–158, 1994. [4] A. Johnson and M. Hebert. Object recognition by matching oriented points. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.684–689, 1997. [5] S. Belongie, J. Malik, and J. Puzicha. 