Explainable and Green Solutions to
Point Cloud Classification and Segmentation
by
Min Zhang
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
December 2022
Copyright 2023 Min Zhang
Table of Contents
Table of Contents ii
List Of Tables iv
List Of Figures vi
Abstract x
Chapter 1: Introduction 1
1.1 Significance of the Research . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Contributions of the Research . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Explainable and Green Point Cloud Classification . . . . . . . . . . . 5
1.2.2 Explainable and Green Point Cloud Segmentation . . . . . . . . . . . 6
1.2.3 Local and Global Aggregation in Point Cloud Classification and Seg-
mentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Organization of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . 9
Chapter 2: Background Review 10
2.1 3D Point Clouds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Traditional Point Cloud Analysis . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Deep Learning on Point Clouds . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.1 PointNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.2 PointNet++ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.3 RandLA-Net . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Successive Subspace Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.4.1 Saak Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4.2 Saab Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.5 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5.1 ModelNet40 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5.2 ShapeNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.5.3 S3DIS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Chapter 3: Explainable and Green Point Cloud Classification 33
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 PointHop Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.3 PointHop++ Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.3.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.3.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.3.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Chapter 4: Explainable and Green Point Cloud Segmentation 66
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.2 UFF Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.2.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.2.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.3 GSIP Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.3.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.3.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.3.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Chapter 5: Local and Global Aggregation in Point Cloud Classification and
Segmentation 92
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.2 SR-PointHop Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.2.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.2.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.3 GreenSeg Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.3.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.3.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.3.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
Chapter 6: Conclusion and Future Work 123
6.1 Summary of the Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.2 Future Research Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.2.1 Semantic segmentation of large-scale outdoor point clouds . . . . . . 127
6.2.2 Fast 3D Object Detection . . . . . . . . . . . . . . . . . . . . . . . . 129
Bibliography 131
List Of Tables
3.1 Results of ablation study with 256 sampled points as the input to the PointHop
system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.2 Ensembles of five PointHops with changed hyper-parameter settings and their
corresponding classification accuracies. . . . . . . . . . . . . . . . . . . . . . 47
3.3 Comparison of classification accuracy on ModelNet40, where the proposed
PointHop system achieves 89.1% test accuracy, which is 0.1% less than Point-
Net [65] and 3.1% less than DGCNN [81]. . . . . . . . . . . . . . . . . . . . 48
3.4 Comparison of time complexity between PointNet/PointNet++ and PointHop. 49
3.5 Comparison of per-class classification accuracy on the ModelNet40. . . . . . 52
3.6 Comparison of classification results on ModelNet40, where the class-Avg ac-
curacy is the mean of the per-class accuracy, and FS and ES mean “feature
selection” and “ensemble”, respectively. . . . . . . . . . . . . . . . . . . . . . 62
3.7 Comparison of time and model complexity, where the training and inference
time units are in hour and ms, respectively.. . . . . . . . . . . . . . . . . . . 63
4.1 Comparison of classification results on ModelNet40. . . . . . . . . . . . . . . 72
4.2 Performance comparison on the ShapeNetPart segmentation task with semi-
supervised DNNs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.3 Ablation study of the object-wise segmentation. . . . . . . . . . . . . . . . . 74
4.4 Performance comparison on the ShapeNetPart segmentation task with unsu-
pervised DNNs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.5 Comparison of data statistics of three pre-processing methods. . . . . . . . . 85
4.6 Comparison of semantic segmentation performance (%) for S3DIS. . . . . . . 86
4.7 Class-wise semantic segmentation performance (%) of GSIP for S3DIS. . . . 87
4.8 Comparison of time and model complexities. . . . . . . . . . . . . . . . . . . 89
4.9 Performance comparison of two feature extractors (%). . . . . . . . . . . . . 90
4.10 Impact of quantized point-wise features (%). . . . . . . . . . . . . . . . . . . 90
5.1 Comparison of classification results on ModelNet40. . . . . . . . . . . . . . . 100
5.2 Comparison of time and model complexities. . . . . . . . . . . . . . . . . . . 101
5.3 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.4 Comparison of part segmentation performance (%) for ShapeNetPart . . . . 113
5.5 Comparison of semantic segmentation performance (%) for S3DIS. . . . . . . 113
5.6 Class-wise semantic segmentation performance (%) of GSIP for S3DIS. . . . 116
5.7 Ablation Study for ShapeNetPart Segmentation. . . . . . . . . . . . . . . . . 117
5.8 Ablation Study for Semantic Segmentation of S3DIS Area 5. . . . . . . . . 119
5.9 Comparison of time and model complexities for semantic segmentation of
S3DIS Area 5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.10 ModelNet40 Classification Performance (%). . . . . . . . . . . . . . . . . . . 121
6.1 Comparison of traditional, deep learning (DL) and green learning (GL) methods. 126
List Of Figures
1.1 An overview of the basic tasks to solve. Taking a raw point cloud (a set of points)
as input, the goal is to label every point cloud as one of the object categories
for classification, label every point as one of the object part categories for
part segmentation, and label every point as one of the semantic categories for
semantic segmentation. The figure is from [65]. . . . . . . . . . . . . . . . . 2
2.1 Illustration of feature descriptors. PFH connects neighbors fully which means
it captures the surface variation for each point pair in the LRF, while FPFH
only connects partial neighbors. SHOT encodes neighborhood information
directly by building histogram for each volume of the space in the LRF. The
figures are from [73, 72, 79]. . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 A traditional Semantic Segmentation Pipeline. The first step is a standard
pointwise classification process. The figure is from [49]. . . . . . . . . . . . . 13
2.3 PointNet architecture. The top branch shows the classification network and
the bottom branch is for segmentation, where the two networks share a great
portion of structures. The figure is from [65]. . . . . . . . . . . . . . . . . . . 17
2.4 PointNet++ architecture. Each set abstraction level consists of a sampling
layer, grouping layer, and PointNet layer. The figure is from [67]. . . . . . . 19
2.5 Local feature aggregation module. DRB is a stack of multiple LocSE and
attentive pooling units with a skip connection. The figure is from [32].. . . . 21
2.6 RandLA-Net architecture overview. There are four encoding layers, four de-
coding layers, three fully connected layers and a dropout layer. The figure is
from [32]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.7 Overview of the multi-stage Saak transform. The downward arrows repre-
sent the Saak transform, while the upward arrows represent the inverse Saak
transform. The figure is from [45]. . . . . . . . . . . . . . . . . . . . . . . . . 25
2.8 Two-layer Saab transform in the FF design of the first two convolutional layers
of the LeNet-5. The figure is from [48]. . . . . . . . . . . . . . . . . . . . . . 26
2.9 ModelNet40 dataset. From left to right: person, cup, stool, and guitar. . . . 30
2.10 Examples of ShapeNetPart point clouds with annotations. . . . . . . . . . . 31
2.11 Details of S3DIS Dataset. The figure is from [63]. . . . . . . . . . . . . . . . 32
3.1 Comparison of existing deep learning methods and the proposed PointHop
method. Top: Point cloud data are fed into deep neural networks in the feed-
forward pass and errors are propagated in the backward direction. This pro-
cess is conducted iteratively until convergence. Labels are needed to update
all model parameters. Bottom: Point cloud data are fed into the PointHop
system to build and extract features in one fully explainable feedforward pass.
No labels are needed in the feature extraction stage (i.e. unsupervised feature
learning). The whole training of PointHop can be efficiently performed on a
single CPU at much lower complexity than deep-learning-based methods. . . 35
3.2 Random sampling of a point cloud of 2,048 points into simplified models of
(a) 256 points, (b) 512 points, (c) 768 points and (d) 1,024 points. They are
called the random dropout point (DP) models and used as the input to the
PointHop system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3 An overview of the PointHop method. The input point cloud has N points
with 3 coordinates (x,y,z). It is fed into multiple PointHop units in cascade
and their outputs are aggregated by M different schemes to derive features.
All features are cascaded for object classification. . . . . . . . . . . . . . . . 37
3.4 Illustration of the PointHop unit. The red point is the center point while the
yellow points represent its K nearest neighbors. . . . . . . . . . . . . . . . . 37
3.5 Determination of the number of Saab filters in each of the PointHop units,
where the red dot in each subfigure indicates the selected number of Saab filters. 40
3.6 The classification accuracy as a function of the sampled point number of the
input model to the PointHop system as well as different pooling methods. . . 45
3.7 Robustness to sampling density variation: comparison of test accuracy as a
function of sampled point numbers of different methods. . . . . . . . . . . . 50
3.8 Visualization of learned features in the first-stage PointHop unit. . . . . . . . 51
3.9 The label under each point cloud is its predicted class. Many flower pots are
misclassified to the plant and the vase classes. Also, quite a few cups are
misclassified to the vase class. . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.10 Illustration of the PointHop++ method, where the upper-left enclosed subfigure
shows the operation in the first PointHop unit, and N and N_i denote the
number of points of the input and in the nth hop, respectively. Due to little
correlation between channels, we can perform channel-wise (c/w) subspace
decomposition to reduce the model size. A subspace with its energy larger
than threshold T proceeds to the next hop while others become leaf nodes of
the feature tree in the current hop. . . . . . . . . . . . . . . . . . . . . . . . 56
3.11 Illustration of the impact of (a) values of the energy threshold and (b) the
number of cross-entropy-ranked (CE) or energy-ranked (E) features. . . . . . 60
3.12 Robustness against different sampling densities of the test model. . . . . . . 63
3.13 Visualization of (a) the correlation matrix at the first hop and (b) feature
clustering in the T-SNE plot. . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.1 An overview of the proposed unsupervised feedforward feature (UFF) learning
system, which consists of a fine-to-coarse encoder and a coarse-to-fine decoder. 68
4.2 Visualization of part segmentation results (from left to right): PointNet, UFF,
the ground truth. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.3 An overview of the proposed GSIP method, where the upper left block shows
the data pre-processing, the upper right block shows the local attribute con-
struction process and the lower block shows the encoder-decoder architecture
for large-scale point cloud semantic segmentation. . . . . . . . . . . . . . . . 78
4.4 Comparison of three data pre-processing methods. . . . . . . . . . . . . . . . 82
4.5 Qualitative results of the proposed GSIP method. . . . . . . . . . . . . . . . 88
5.1 An overview of the proposed SR-PointHop method. . . . . . . . . . . . . . . 95
5.2 Comparison of classification accuracy with different pooling schemes versus
different point numbers of point cloud scans. . . . . . . . . . . . . . . . . . . 102
5.3 An overview of the proposed GreenSeg method, where the upper left block
shows the feature extraction process for point cloud segmentation, and the
blocks in solid line and dotted line show the difference between the pipelines
for semantic segmentation and part segmentation. . . . . . . . . . . . . . . . 104
5.4 Visualization of object part segmentation results on the ShapeNetPart dataset.
From top to bottom: ground truth, PointNet, UFF, GreenSeg. . . . . . . . . 112
5.5 Visualization of semantic segmentation results on the S3DIS dataset. From
top to bottom: input point cloud with real color, ground truth, GSIP, GreenSeg. 115
6.1 Semantic segmentation results of PointNet++ [67], SPG [50] and RandLA-Net [32]
on SemanticKITTI [7]. RandLA-Net takes only 0.04s to directly process a large
point cloud with 10^5 points over 150× 130× 10 meters in 3D space, which is up to
200× faster than SPG. Red circles highlight the superior segmentation accuracy
of RandLA-Net. The figure is from [32]. . . . . . . . . . . . . . . . . . . . . . 128
6.2 Comparison of some state-of-the-art 3D object detection methods in efficiency
and performance. M: MV3D [16], A: AVOD [42], C: ContFuse [56], V: Voxel-
Net [103], F: Frustum PointNet [64], S: SECOND [85], P+: PIXOR++ [87],
PP: PointPillars [51]. The figure is from [51]. . . . . . . . . . . . . . . . . . . 130
Abstract
Point cloud processing is a fundamental but challenging research topic in the field of 3D
computer vision. In this thesis, we specifically study two point cloud processing related
problems — point cloud classification and point cloud segmentation. Given a point cloud as
the input, the goal of classification is to label every point cloud as one of the object categories
and the goal of segmentation is to label every point as one of the semantic categories. State-
of-the-art point cloud classification and segmentation methods are based on deep neural
networks. Although deep-learning-based methods provide good performance, their working
principle is not transparent. Furthermore, they demand huge computational resources (e.g.,
long training time even with GPUs). Since it is challenging to deploy them in mobile or
terminal devices, their applicability to real world problems is hindered. To address these
shortcomings, we design explainable and green solutions to point cloud classification and
segmentation.
First, we develop explainable and green methods for point cloud classification. A new
and explainable machine learning method, called the PointHop method, is proposed for 3D
point cloud classification. PointHop is mathematically transparent and its feature extraction
is an unsupervised procedure since no class labels are needed in this stage. PointHop has
an extremely low training complexity because it requires only one forward pass to learn
parameters of the system. As a follow-up, we analyze the shortcomings of the PointHop
method and further propose a lightweight learning model for point cloud classification, called PointHop++. The lightweight design comes in two flavors: 1) its model complexity is reduced in terms of the model parameter number by constructing a tree-structure feature learning system and 2) features are ordered and discriminant ones are selected automatically based on the cross-entropy criterion. Experimental results on the ModelNet40 dataset show that
PointHop and PointHop++ offer classification performance that is comparable with state-
of-the-art methods while demanding much lower training complexity.
Second, we extend the PointHop method to do explainable and green point cloud segmentation since point cloud segmentation is more complex and requires much more computational resources. First, an unsupervised feedforward feature (UFF) learning scheme for joint classification and part segmentation of 3D point cloud objects is proposed. UFF is green since it can do multiple tasks in one pass and has good generalization ability. It learns global shape features through the encoder and local point features through the concatenated encoder-decoder architecture. Experimental results on the ShapeNet and ShapeNet Part datasets show that the UFF method has good generalization ability. The above methods only focus on small-scale point clouds, so we further extend our green learning strategy to do real large-scale point cloud segmentation, resulting in a method called GSIP. GSIP is an efficient semantic segmentation method for large-scale indoor point clouds. GSIP is green since it has significantly lower computational complexity and a much smaller model size to process large-scale point clouds. It contains two novel ingredients: a new room-style method for data pre-processing and a new point cloud feature extractor which is extended from PointHop with
lower memory and computational costs while preserving the segmentation performance. Ex-
periments on the S3DIS dataset show that GSIP outperforms the pioneering work in terms
of performance accuracy, model sizes and computational complexity.
Third, we rethink the role of local and global aggregation in point cloud classification and
segmentation. First, we argue that the classification performance is not related to hop num-
bers (i.e., multi-resolution point cloud representation) and propose a SR-PointHop method
for green point cloud classification, which is built upon the single resolution point cloud
representation. SR-PointHop simplifies the PointHop model by reducing the model depth
to a single hop and enriches the information of a point cloud object with more geometric
aggregations of various local representations. SR-PointHop is capable of classifying
point clouds using a much smaller model size and can run efficiently on the CPU. Extensive
experiments on the ModelNet40 benchmark dataset show the advantages of SR-PointHop in
a simple architecture, a good classification performance, a much smaller model size and a
faster training speed. Second, the previous two green learning based segmentation methods, i.e., the UFF method and the GSIP method, target small-scale objects and large-scale scenes, respectively. This motivates us to develop a unified structure that can segment both small-scale and large-scale point clouds efficiently. We identify the weaknesses of UFF and GSIP in point
cloud segmentation and propose a novel green point cloud segmentation method, GreenSeg.
GreenSeg adopts a green and simple local aggregation strategy to enrich the local context
and provides the option for object-wise segmentation if object labels are available. Exten-
sive experiments are conducted on the ShapeNetPart dataset and the S3DIS dataset, showing that GreenSeg is comparable with deep learning methods, which need very complex local aggregation and backpropagation.
Chapter 1
Introduction
1.1 Significance of the Research
Point cloud data are popular due to easy access and complete description in the 3D space, so they are widely used in the fields of computer-aided design (CAD), augmented and
virtual reality (AR/VR), robot navigation and perception, and advanced driver assistance
systems (ADAS). However, point cloud data is sparse, irregular, and unordered by nature.
In addition, the sensor typically produces a large number (tens to hundreds of thousands)
of raw data points, which brings new challenges, as many applications require real-time
processing. Hence, point cloud processing is a fundamental but challenging research topic in
the field of 3D computer vision. In this thesis, we focus on two basic tasks in this field, i.e.,
point cloud classification and point cloud segmentation.
The tasks are illustrated in Fig. 1.1. Point cloud classification aims at recognizing the
global shape of small-scale objects to label every point cloud as one of the object categories,
which lays the foundation for other applications. Point cloud segmentation is more complex
than point cloud classification, as it needs both the global shape information and the fine-grained details of the point cloud to perform point-wise classification. Point cloud segmentation can be further divided into two branches: part segmentation and semantic segmentation. Part segmentation labels every point as one of the object parts, so it is still small-scale, while semantic segmentation labels every point as one of the semantic categories in a scene, which is usually large-scale.
(a) Classification (b) Part Segmentation (c) Semantic Segmentation
Figure 1.1: An overview of the basic tasks to solve. Taking a raw point cloud (a set of points) as input, the goal is to label every point cloud as one of the object categories for classification,
label every point as one of the object part categories for part segmentation, and label every
point as one of the semantic categories for semantic segmentation. The figure is from [65].
Before the exploration of deep learning, traditional point cloud classification and segmentation relied on the results of handcrafted feature extraction processes. Such feature extraction does not involve any training data but relies on local geometric properties of points. FPFH [72] and SHOT [79] are exemplary feature descriptors. The traditional methods are interpretable, and they can be generalized to different tasks easily. Thus, some of the point cloud processing techniques are still widely used in real-world applications. State-of-the-art point cloud classification
and segmentation methods extract point cloud features by building deep neural networks
(DNNs) and using backpropagation to update model parameters iteratively. For example,
the pioneering PointNet [65] uses multi-layer perceptrons (MLPs) to extract features for each
point separately followed by a symmetric function to accumulate all point features. Deep
learning methods can achieve more outstanding performance, but require massive amounts
of training data (likely labelled data in a supervised learning setting).
The key difference between deep learning methods and traditional methods is that the
former can perform end-to-end learning. Given a point cloud as the input, the output will be
what we want, e.g., a label for each point in segmentation. The process is like a black box,
which reduces the work on the human end. The existence of DNNs and the optimization of
parameters by gradient descent helps to realize this end-to-end purpose. With the help of
GPUs, these methods can be very successful. However, deep learning methods are often criticized for their reliance on big data, end-to-end training and GPUs. First, deep learning relies too much on big data and supervision. Without big data and expensive label annotation, it is hard to say whether deep learning methods would still obtain such good results. Second, the time and monetary costs are huge, so it is very expensive to do deep learning. Third, the end-to-end learning mechanism also limits the generalization ability of the method. Task-agnostic deep learning methods are being developed; however, most deep learning methods are still task-specific. All these concerns impede reliable and flexible applications of the deep learning
solution in 3D vision.
Recently, a sequence of research work [43, 44, 45, 19, 48, 17, 18, 89] has been conducted
to unveil the mystery of convolutional neural networks (CNNs) and reduce their complexity, and it is named successive subspace learning (SSL). An important subspace approximation idea was
formed in Kuo's early works [43, 44, 45, 19, 48]. Specifically, he and his co-authors introduced
the Saak transform and the Saab transform in [19] and [48], respectively, and conducted
them in multi-stages successively. The subspace approximation idea plays a role similar to
the convolution layer of CNNs. Yet, no backpropagation (BP) is needed in the design of Saak
and Saab filters. They are derived from statistics of pixels of input images without any label
information. A series of unsupervised feature extraction methods were then designed for 2D
images following this principle [17, 18, 89, 71, 14, 95, 53, 52, 90]. In this way, SSL helps
reduce the model size and computation complexity, and offers mathematical transparency.
Compared with deep learning solutions, SSL can provide an explainable and green so-
lution by nature and it can guarantee much better performance than traditional methods
at the same time. Hence, it is possible to generalize the SSL principle to do explainable and green point cloud learning, i.e., with mathematical transparency, smaller model sizes and models that can be trained on a CPU with much less time and complexity. The main saving comes from the one-pass feedforward feature learning mechanism, where neither backpropagation nor supervision labels are needed. The unsupervised feature learning is built upon statistical
analysis of points in a local neighborhood of a point cloud set.
In this thesis, we design explainable and green solutions to point cloud classification and
segmentation based on the SSL principle. We first propose an explainable machine learning
method, PointHop [99], for point cloud classification and further improve its model com-
plexity and performance in PointHop++ [98]. Then, we extend the PointHop method to
do explainable and green point cloud segmentation. Specifically, an unsupervised feedfor-
ward feature (UFF) learning scheme for joint classification and part segmentation of 3D
point clouds [96] and an efficient solution to semantic segmentation of large-scale indoor
scene point clouds (i.e., the GSIP method) [97] are proposed. Finally, we rethink local and
global aggregation in point cloud classification and segmentation and propose SR-PointHop
for green point cloud classification using single resolution representation and GreenSeg for
segmenting both small-scale and large-scale point clouds efficiently and effectively.
1.2 Contributions of the Research
1.2.1 Explainable and Green Point Cloud Classification
Research on point cloud learning has been gaining interest recently due to the easy access and complete description of point cloud data in the 3D space. However, point cloud data is sparse,
irregular, and unordered by nature. Although deep-learning-based methods provide good
classification performance, their working principle is not transparent and they demand huge
computational resources. Therefore, we propose a new method PointHop and its extension
PointHop++ to classify point cloud objects efficiently and effectively.
• We propose an explainable machine learning method called the PointHop method for
point cloud classification. It builds attributes of higher dimensions at each sampled
point through iterative one-hop information exchange. The problem of unordered point cloud data is addressed using a novel space partitioning procedure. Furthermore, we use the Saab transform to reduce the attribute dimension in each PointHop unit. In the classification stage, we feed the feature vector to a classifier and explore ensemble
methods to improve the classification performance.
• We further improve the PointHop method to be more lightweight. The resulting
method is called PointHop++. The lightweight design comes in two flavors: 1) its model
complexity is reduced in terms of the model parameter number by constructing a tree-
structure feature learning system, where one scalar feature is associated with each
leaf node and 2) features are ordered and discriminant ones are selected automatically
based on the cross-entropy criterion.
• With experiments conducted on the ModelNet40 benchmark dataset, we show that the PointHop and PointHop++ methods perform on par with deep learning solutions and surpass other unsupervised feature extraction methods. Moreover, the experimental results show that the training complexity of our methods is significantly lower than
that of state-of-the-art deep-learning-based methods.
1.2.2 Explainable and Green Point Cloud Segmentation
Compared with the point cloud classification problem, point cloud segmentation demands a good understanding of the complex global structure as well as the local neighborhood of each point. Since point cloud segmentation requires much more computational resources, we propose two explainable and green solutions: one does joint classification and part segmentation of small-scale point cloud objects, and the other does efficient semantic segmentation of large-scale indoor scene point clouds.
• We propose an unsupervised feedforward feature (UFF) learning scheme for joint clas-
sification and segmentation of 3D point clouds. UFF is green since it can do multiple tasks in one pass and has good generalization ability. The UFF method exploits statistical
correlations of points in a point cloud set to learn both shape and point features in a
one-pass feedforward manner through a cascaded encoder-decoder architecture. The
extracted features of an input point cloud are fed to classifiers for shape classification
and part segmentation.
• Experiments are conducted to evaluate the performance of the UFF method. For shape
classification, the UFF is superior to existing unsupervised methods and on par with
state-of-the-art DNNs. For part segmentation, the UFF outperforms semi-supervised
methods.
• We propose an efficient solution to semantic segmentation of large-scale indoor scene
point clouds, named GSIP (Green Segmentation of Indoor Point clouds). GSIP is green since it has significantly lower computational complexity and a much smaller model size for a large-scale segmentation task. It contains two novel ingredients: a new room-
style method for data pre-processing and a new point cloud feature extractor which is
extended from PointHop with lower memory and computational costs while preserving
the segmentation performance.
• Experimental results on a representative large-scale benchmark, the Stanford 3D In-
door Segmentation (S3DIS) dataset, show that the GSIP method outperforms PointNet
in terms of performance accuracy, model sizes and computational complexity.
1.2.3 Local and Global Aggregation in Point Cloud Classification
and Segmentation
Observing limited classification performance gain in deep learning and green learning, it
appears unnecessary to build a large and complex model for point cloud classification. We
propose SR-PointHop for green point cloud classification, which enriches the information
with more geometric aggregations of various single resolution local representations. Since
the UFF and GSIP methods target small-scale objects and large-scale scenes separately, we are motivated to develop a green point cloud segmentation method, GreenSeg. Through a novel
local aggregation method, GreenSeg segments both small-scale and large-scale point clouds
efficiently and effectively.
• We argue that the classification performance is not related to hop numbers (i.e., multi-
resolution point cloud representation) and propose a SR-PointHop method for green
point cloud classification, which is built upon the single resolution point cloud repre-
sentation. SR-PointHop simplifies the PointHop model by reducing the model depth
to a single hop and enriches the information of a point cloud object with more geo-
metric aggregations of various local representations. SR-PointHop is capable of classifying point clouds using a much smaller model size and can run efficiently on
the CPU.
• Extensive experiments conducted on the ModelNet40 benchmark dataset show the
advantages of SR-PointHop in a simple architecture, good classification performance,
much smaller model size and faster training speed.
• We propose a novel green point cloud segmentation method, called GreenSeg. Different from UFF and GSIP, which target small-scale objects and large-scale scenes respectively, GreenSeg is developed to segment both small-scale and large-scale point clouds efficiently. The weaknesses of UFF and GSIP are identified, and a green and simple
local aggregation strategy is adopted to enrich the local context and learn fine-grained
details.
• Experimental results on the ShapeNetPart dataset and the S3DIS dataset show that GreenSeg is comparable with deep learning methods, which need very complex local aggregation and backpropagation.
1.3 Organization of the Dissertation
The rest of the dissertation is organized as follows. In Chapter 2, we review the research
background, including traditional point cloud processing techniques, deep learning methods
for point cloud classification and segmentation, and green learning methods. In Chapter 3,
we propose two explainable and green methods (i.e., PointHop and PointHop++) for 3D
point cloud classification. In Chapter 4, we propose explainable and green solutions (i.e.,
UFF and GSIP) for point cloud segmentation. In Chapter 5, we rethink the local and global
aggregation in point cloud classification and segmentation and propose SR-PointHop for
point cloud classification and GreenSeg for point cloud segmentation. Finally, concluding
remarks and future research directions are given in Chapter 6.
Chapter 2
Background Review
In this chapter, we give a background introduction related to our research. First, we in-
troduce traditional point cloud analysis techniques. Then we discuss state-of-the-art deep
learning methods for point cloud classification and segmentation. Finally, we summarize the
research on successive subspace learning.
2.1 3D Point Clouds
With the demand for 3D understanding and the proliferation of deep learning, many deep
networks have been designed to process 3D objects. A 3D object can be represented in one
of the following four forms: a voxel grid, a 3D mesh, multi-view camera projections, and a
point cloud. A point cloud is represented by a set of points with 3D coordinates {x, y, z}. It
is the most straightforward format for 3D object representation since it can be acquired by
the LiDAR and the RGB-D sensors directly. However, points in a point cloud are irregular
and unordered so they cannot be easily handled by regular 2D convolutional neural networks
(CNNs). Therefore, the raw data points acquired by sensors are usually converted to other
representations to be further processed by deep neural networks, e.g., [66, 93, 69, 62].
Voxel grids use occupancy cubes to describe 3D shapes. Some methods [60, 9] extend the
2D convolution to the 3D convolution to process the 3D spatial data. Multi-view image data
are captured by a set of cameras from different angles. A weight-shared 2D CNN is applied
to each view, and results from different views are fused by a view aggregation operation
in [76, 24]. For instance, a group-view CNN (GVCNN) is proposed in [24] for 3D objects,
where discriminability of each view is learned and used in the 3D representation. The 3D
mesh data contains a collection of vertices, edges and faces. The MeshNet [25] treats faces
of a mesh as the basic unit and extracts their spatial and structural features individually to
offer the final semantic representation. By considering multimodal data, Zhang et al. [100]
proposed a hypergraph-based inductive learning method to recognize 3D objects, where
complex correlation of multimodal 3D representations is explored.
However, the conversion-based methods bring extra costs and information loss as com-
pared with methods using raw point clouds as the input. The point cloud data has a more complete description of 3D objects than other forms. Besides, the converted representations demand additional memory and computation. In real-world scenarios, point clouds can be deployed in various applications ranging from 3D environment analysis [50, 3] to autonomous driving [86, 87, 51].
Therefore, point cloud data processing and analysis have attracted increasing attention from
the research community in recent years. Extracting features of point clouds effectively is a
key step to 3D vision tasks.
2.2 Traditional Point Cloud Analysis
Traditional features for 3D point clouds are extracted using a hand-crafted solution for
specific tasks. The statistical attributes are encoded into point features, which are often
invariant under shape transformation. Kernel signature methods were used to model intrinsic
local structures in [77, 10, 6]. Signature of histogram of orientations (SHOT) [79] is a
representative signature-based method which computes point distributions based on a local
reference frame (LRF). The LRF is a canonical pose of the local neighborhood, which ensures rotation and translation invariance. Histogram-based methods encode local geometric
variations into a histogram. The point feature histogram (PFH) and the fast point feature
histogram (FPFH) were introduced in [73, 72] for point cloud registration. The three classical feature descriptors PFH, FPFH and SHOT are shown in Fig. 2.1. It was proposed in [13]
to project 3D models into different views for retrieval. Multiple features can be combined to
meet the need of several tasks.
(a) PFH (b) FPFH (c) SHOT
Figure 2.1: Illustration of feature descriptors. PFH connects neighbors fully, which means it captures the surface variation for each point pair in the LRF, while FPFH only connects partial neighbors. SHOT encodes neighborhood information directly by building a histogram
for each volume of the space in the LRF. The figures are from [73, 72, 79].
To do the point cloud classification task, keypoints are first detected and each keypoint is described by a feature descriptor such as FPFH or SHOT. Then, the extracted features are concatenated into feature vectors and fed into a classifier such as a support vector machine (SVM) [59] or a random forest (RF) [12]. The classifier's training stage does need representative labels. In the inference stage, the classifier will assign a class label to each point
of the target point cloud set. Semantic segmentation [30, 49] is formulated as a point-wise
classification problem, so the typical pipeline is similar to that of classification without the
keypoint detection step. An overview of [49] for the semantic labeling of 3D point clouds is
shown in Fig. 2.2. The first step is a standard pointwise classification process, as illustrated
above. The second step describes a method for smoothing the initial labeling by structured
regularization. This second part is not discussed here, as it is not the focus of our research.
Figure 2.2: A traditional Semantic Segmentation Pipeline. The first step is a standard
pointwise classification process. The figure is from [49].
2.3 Deep Learning on Point Clouds
Processing and analysis of 3D point clouds are challenging since the 3D spatial coordinates of
points are irregular so that 3D points cannot be properly ordered to be fed into deep neural
networks (DNNs). To deal with the order problem, a certain transformation is needed
in the deep learning pipeline. Transformation of a point cloud into another form often
leads to information loss. Therefore, deep learning on point sets for 3D classification and segmentation has attracted a lot of attention nowadays.
Small-scale Methods
The irregular and orderless problem was addressed by the pioneering work on the Point-
Net [65]. PointNet learns features of each point individually using multi-layer perceptrons
(MLPs) and aggregates all point features with a symmetric function to address the order
problem. Yet, PointNet fails to capture the local structure of a point. State-of-the-art DNNs
for point cloud classification and segmentation are variants of the PointNet [65], including
PointNet++ [67], DGCNN [81], PointCNN [55], PointSIFT [33] and so on. They incorporate
the information of neighboring points to learn more powerful local features.
For example, PointNet++ [67] applies PointNet to the local region of each point, and
then aggregates local features in a hierarchical architecture. DGCNN [81] exploits another
idea to learn better local features. It uses point features to build a dynamic graph and
updates neighbor regions at every layer. A dynamic graph can capture better semantic
meaning than a fixed graph. PointCNN [55] learns an X-transformation from an input point
cloud to get weights of neighbors of each point and permute points based on a latent and
potentially canonical order. To make local features invariant to permutations, PointSIFT
[33] discards the max pooling idea but designs an orientation-encoding unit to encode the
eight orientations, which is inspired by the well-known 2D SIFT descriptor [58].
As to unsupervised learning, autoencoders were proposed in [1, 88] for feature learning.
FoldingNet [88] trains a graph-based feature encoder and adopts a folding-based decoder to
deform a 2D grid to the underlying object surface.
Large-scale Methods
Although the above-mentioned point-based deep learning methods work well for small-scale
point clouds, achieving impressive performance in small-scale point cloud classification and
segmentation tasks, they cannot be generalized to handle large-scale point clouds directly due to the memory and time constraints. Large-scale point cloud semantic segmentation methods
are often evaluated on the S3DIS dataset [5] for the indoor scene and the Semantic3D [29] or
SemanticKITTI [7] datasets for the outdoor scenes. These datasets have millions of points,
covering up to 200× 200 meters in 3D real-world space. In order to feed such a large amount
of data to deep learning networks, it is essential to conduct pre-processing steps on the data
such as block partitioning in [65].
Recently, efforts have been made in [50, 68, 27, 32] to tackle large-scale point clouds directly. SPG [50] builds super graphs composed of super points in a pre-processing step.
Then, it learns semantics for each super point rather than a point. FCPN [68] and PCT [27]
combine voxelization and point-based networks to process large-scale point clouds. However,
graph construction and voxelization are computationally heavy so that these solutions are
not suitable for mobile or edge devices. Recently, RandLA-Net [32] revisits point-based
deep learning methods. It replaces the farthest point sampling (FPS) method with random
sampling (RS) to save time and memory cost. In this way, the number of points that can be processed at one time is increased by 10 times, from 4,096 points to 40,960 points. However,
RS may discard key features, so it is not as accurate as FPS. To address this problem, a
new local feature aggregation module is adopted to capture complex local structures, which learns important local features through the attention mechanism. A shared MLP followed by
the softmax operation is used to compute attention scores. Then, attention-weighted local
features are summed together.
Three methods, PointNet [65], PointNet++ [67] and RandLA-Net [32], are introduced in
the rest of this section.
2.3.1 PointNet
PointNet was the first work that employs deep learning directly to consume 3D point clouds,
providing a unified architecture for applications ranging from object classification, part seg-
mentation, to scene semantic parsing. For example, PointNet directly takes point clouds as
input and outputs either class labels for the entire input or per point segment/part labels
for each point of the input.
As we know, point clouds are unordered, so the network has to learn to be invariant to the order in which the points are fed to the network. Specifically, the network should be invariant
to N! permutations of points for a point cloud with N points. Besides, the interactions
of points in a local region should also be taken into consideration because the points are
not isolated entities and points in a neighborhood define a local structure. Finally, it is
desirable for the model to learn to output the same label or semantic category under rotation,
translation, and any affine transformation of the point cloud. PointNet was designed to
maintain these properties.
Key to PointNet is the use of a single symmetric function, max pooling, since the max
pooling operator does not depend on the order of points. Moreover, PointNet uses multi-
layer perceptron (MLP), which is a universal function approximator, to approximate the set
function of assigning a class label or semantic category to a point. Effectively the network
learns a set of optimization functions/criteria that select interesting or informative points of
the point cloud and encode the reason for their selection. The final fully connected layers
of the network aggregate these learnt optimal values into the global descriptor for the entire
shape as mentioned above (shape classification) or are used to predict per-point labels (shape
segmentation).
Figure 2.3: PointNet architecture. The top branch shows the classification network and
the bottom branch is for segmentation, where the two networks share a great portion of
structures. The figure is from [65].
An overview of the PointNet architecture is shown in Fig. 2.3. The network takes
n points, represented by their 3D coordinates, and first applies an input transform using a
T-Net. The purpose of this is to ensure the input is invariant to geometric transformations.
The T-Net is like a mini-PointNet network that learns a 3× 3 affine transformation matrix.
A sequence of pointwise MLPs transforms points to a higher-dimensional feature space. Then, a separate feature transform is applied with the same purpose, to make the features invariant to transformations. The point features are aggregated using max pooling to obtain a 1024-dimensional global feature vector. For the classification task, this feature vector is further fed to an MLP classifier, which outputs a k-dimensional probability vector for k classes. For the segmentation task, the global feature is fused with the pointwise features to further learn
the output labels for each point. The concatenation of local and global features enables the
network to leverage both local geometry and global semantics.
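A minimal NumPy sketch of this design is given below: a shared per-point MLP (with random weights here) lifts each point to a higher-dimensional feature, and max pooling yields a global descriptor that does not depend on the point order. The T-Net alignment and the classifier head are omitted, so this is only a schematic of the PointNet idea, not the original implementation.
```python
import numpy as np

def shared_mlp(points, weights, biases):
    """Apply the same fully connected + ReLU stack to every point independently."""
    x = points                                   # (N, 3)
    for W, b in zip(weights, biases):
        x = np.maximum(x @ W + b, 0.0)
    return x                                     # (N, 1024)

rng = np.random.default_rng(0)
dims = [3, 64, 128, 1024]
Ws = [rng.normal(scale=0.1, size=(dims[i], dims[i + 1])) for i in range(len(dims) - 1)]
bs = [np.zeros(d) for d in dims[1:]]

pts = rng.normal(size=(2048, 3))                 # a toy point cloud of 2,048 points
global_feat = shared_mlp(pts, Ws, bs).max(axis=0)                    # symmetric max pooling
shuffled_feat = shared_mlp(pts[rng.permutation(len(pts))], Ws, bs).max(axis=0)
assert np.allclose(global_feat, shuffled_feat)   # same descriptor for any point order
```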
2.3.2 PointNet++
PointNet shows impressive performance for point cloud processing tasks like classification
and segmentation. However, it does not capture information about the local context of
points at different scales. In a follow-up work, termed PointNet++ [67], the researchers
proposed a hierarchical feature learning framework to resolve some limitations of PointNet by progressively abstracting larger and larger local regions along the hierarchy. The hierarchical
learning process is achieved by a series of set abstraction levels. At each level, a set of points
is processed and abstracted to produce a new set with fewer elements. Each set abstraction
level consists of a sampling layer, grouping layer, and PointNet layer. The PointNet++
architecture is shown in Fig. 2.4.
Figure 2.4: PointNet++ architecture. Each set abstraction level consists of a sampling layer, grouping layer, and PointNet layer. The figure is from [67].
In the sampling layer, a subset of m points {x_i1, x_i2, ..., x_im} is sampled from the input n points. The iterative farthest point sampling (FPS) technique is used here, which provides uniform coverage of the entire point cloud. These sampled m points define the set of centroids for the grouping layer. The grouping layer then constructs local region sets by finding "neighboring" points around the centroids. It takes the input point set of size N× (d+C), where d corresponds to the dimension of the coordinates and C is the feature dimension, and the coordinates of the centroids of dimension N′× d from the sampling step. All the points lying inside a sphere of a certain radius around each centroid are collected. The output grouped point set is of size N′× K× (d+C), where K is the number of points lying within the sphere. K varies for every point depending on the density of points in the local region. For each set of grouped points, the point coordinates are first translated to a local system centered at the centroid. Then, the PointNet operation is performed in the local region to encode local region patterns into feature vectors. The features of all K points are aggregated using a local max pooling operation. The output is of size N′× (d+C).
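The sketch below walks through one set abstraction level in NumPy under the description above: a naive farthest point sampling loop picks centroids, a radius-limited nearest-neighbor query groups points, coordinates are translated to the local frame, and a max pool stands in for the mini-PointNet. The sizes, radius, and simplified features (raw local coordinates) are illustrative assumptions, not the settings of PointNet++ itself.
```python
import numpy as np

def farthest_point_sampling(xyz, m):
    """Naive FPS: iteratively pick the point farthest from the already chosen set."""
    chosen = [0]
    dist = np.full(xyz.shape[0], np.inf)
    for _ in range(m - 1):
        dist = np.minimum(dist, np.linalg.norm(xyz - xyz[chosen[-1]], axis=1))
        chosen.append(int(dist.argmax()))
    return np.array(chosen)

def set_abstraction(xyz, m, radius, k):
    """One sampling + grouping + (reduced) PointNet layer."""
    centroids = xyz[farthest_point_sampling(xyz, m)]   # (m, 3) sampled centroids
    feats = []
    for c in centroids:
        d = np.linalg.norm(xyz - c, axis=1)
        idx = np.argsort(d)[:k]                        # k nearest candidates
        idx = idx[d[idx] <= radius]                    # keep those inside the ball
        local = xyz[idx] - c                           # translate to the local frame
        feats.append(local.max(axis=0))                # max pool stands in for PointNet
    return centroids, np.stack(feats)

xyz = np.random.default_rng(0).normal(size=(1024, 3))
new_xyz, new_feat = set_abstraction(xyz, m=256, radius=0.5, k=32)
print(new_xyz.shape, new_feat.shape)                   # (256, 3) (256, 3)
```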
2.3.3 RandLA-Net
There are three reasons for the incapability of the above approaches (i.e., PointNet and PointNet++) to directly process massive numbers of points. First, the point sampling methods adopted currently are computationally expensive or memory inefficient; second, the existing local feature learners rely on expensive kernelization or graph construction; third, these local feature learners are unable to capture complex structures in large-scale point clouds due to their limited size of receptive fields. The design of RandLA-Net targets two aspects: the sampling method and the local feature learner.
To directly process large-scale point clouds in a single pass, the sampling method should be both memory and computationally efficient so that it can be processed by GPUs. Farthest point sampling (FPS) is the most commonly used sampling method in current point-based deep learning methods because it has a good coverage of the entire point cloud. However, FPS chooses the point farthest from the selected point set iteratively, so its time complexity is O(N^2). The computation time increases dramatically as the number of points grows; FPS takes up to 200 seconds to process millions of points on a single GPU, acting as a significant bottleneck to real-time processing. In contrast, random sampling (RS) does not depend on the number of points; its time complexity is O(1). RandLA-Net saves time and memory cost by replacing FPS with RS in the feature learning process. Therefore, the input scale can be increased by 10 times.
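The snippet below illustrates the random sampling step with the numbers quoted above (roughly 10^5 input points reduced to 40,960); the point count and array contents are synthetic. Unlike the quadratic FPS loop sketched in Section 2.3.2, drawing random indices requires no distance computations, which is why RandLA-Net pairs it with a stronger local feature aggregation module.
```python
import numpy as np

rng = np.random.default_rng(0)
cloud = rng.normal(size=(100_000, 3))                          # synthetic cloud at the ~10^5 scale
idx = rng.choice(cloud.shape[0], size=40_960, replace=False)   # random sampling (RS)
subset = cloud[idx]                                            # no distance computations involved
```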
Despite the advantages of RS, it is not as accurate as the other sampling methods, as
prominent point features may be dropped by chance. To overcome this issue, a new local
feature aggregation module that increases the receptive field size in each layer progressively was designed so that complex local structures can be learned effectively. As shown in Fig. 2.5, each layer comprises a dilated residual block (DRB), which is a stack of multiple local spatial encoding (LocSE) and attentive pooling units with a skip connection. For each point, the module first observes its K nearest neighbors after one LocSE and attentive pooling unit, and then observes the K^2 neighboring points after the second unit. Experimentally, two units are stacked to achieve both efficiency and effectiveness.
Figure 2.5: Local feature aggregation module. DRB is a stack of multiple LocSE and attentive pooling units with a skip connection. The figure is from [32].
Given a center point p_i, LocSE first gathers its K nearest neighboring points, and then the relative point positions are encoded. The encoded relative point position r_i^k and its corresponding point features f_i^k are concatenated as an augmented feature f̂_i^k. Instead of applying max/mean pooling to the neighboring features as in PointNet/PointNet++, which leads to a lot of information loss, attentive pooling is adopted to learn important local features. In general, the LocSE unit aggregates the geometric information with the features of the local region, and attentive pooling learns to pool more informatively. Thus, stacking multiple LocSE and attentive pooling units increases the receptive field of each layer to learn complex local structures.
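A small NumPy sketch of the attentive pooling idea follows: per-neighbor scores are computed from the augmented features, normalized with a softmax over the K neighbors, and used to form a weighted sum instead of a max or mean. The single random linear map stands in for the shared MLP of RandLA-Net, so this is a schematic rather than the actual module.
```python
import numpy as np

def attentive_pooling(neighbor_feats, w_score):
    """Softmax-normalized, feature-wise attention over the K neighbors."""
    scores = neighbor_feats @ w_score                 # (K, D) scores from a linear map
    scores = np.exp(scores - scores.max(axis=0))
    scores = scores / scores.sum(axis=0)              # softmax across the K neighbors
    return (scores * neighbor_feats).sum(axis=0)      # attention-weighted sum, (D,)

rng = np.random.default_rng(0)
K, D = 16, 32
f_hat = rng.normal(size=(K, D))                       # augmented neighbor features of one point
pooled = attentive_pooling(f_hat, rng.normal(size=(D, D)))
print(pooled.shape)                                   # (32,)
```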
Figure 2.6: RandLA-Net architecture overview. There are four encoding layers, four decoding
layers, three fully connected layers and a dropout layer. The figure is from [32].
The details of the RandLA-Net architecture are shown in Fig. 2.6. The network follows
the commonly used encoder-decoder architecture with skip connections. The input point cloud, which has a 10^5 scale with 3D coordinates and color, is fed into a shared MLP to extract per-point features. Then, four encoding layers are used to reduce the size of the point cloud to 10^2 while increasing the feature dimensions. Next, four decoding layers are
employed using nearest neighbor interpolation to upsample the point cloud and concatenate
with the features in the corresponding encoding layer. Finally, three fully connected (FC)
layers and a dropout layer are used to obtain a semantic prediction.
2.4 Successive Subspace Learning
As we know, DNNs rely on expensive labeled data. Furthermore, due to the end-to-end
optimization, deep features are learned iteratively via backpropagation. To save both labeling
and computational costs, it is desired to obtain features in an unsupervised and feedforward
one-pass manner. A series of research works on successive subspace learning (SSL) for 2D images [43, 44, 45, 19, 48, 17, 18, 89] lays the foundation for explainable and green point cloud analysis. SSL offers a light-weight unsupervised feature learning method based on the
inherent statistical properties of data units. The model size is significantly smaller than that
of DNNs and it is more computationally efficient. SSL has proved to be effective in different
applications such as image classification [17, 18, 89], face gender classification [71], deepfake
image detection [14], image anomaly detection [95] and point cloud registration [35, 37], and
so on.
In this section, we will introduce some early works on using SSL to analyze 2D images to
explain the core design principle. Back in 2016, Kuo [43] noted that there was an issue with
sign confusion arising from the cascade of hidden layers in convolutional neural networks
(CNNs), and argued the need for nonlinear activation to eliminate this problem. Kuo [45]
later noted that all filters in one convolutional layer form a subspace, which means each
convolutional layer corresponds to a subspace approximation of the input. However, the
analysis of subspace approximation is still complicated due to the existence of nonlinear
activation. Therefore,itisdesirabletosolvethesignconfusionproblembyothermeans. The
Saak (Subspace Approximation with Augmented Kernels) [19, 45] and the Saab (Subspace
Approximation with Adjusted Bias) [48] transforms were proposed to avoide sign confusion
while simultaneously preserving the subspace spanned by the filters fully.
2.4.1 Saak Transform
As its name suggests, the Saak transform [45] consists of two components: subspace approximation and kernel augmentation. To build the optimal linear subspace approximation, the second-order statistics of the input vectors are analyzed and the orthonormal eigenvectors of the covariance matrix are selected as transform kernels. This is the Karhunen-Loève transform (KLT). Since the complexity of the KLT increases dramatically when the input dimension is large, the Saak transform first decomposes high-dimensional vectors into multiple lower-dimensional sub-vectors and repeats the process recursively. However, there is a sign confusion problem if two or more transforms are cascaded directly. A rectified linear unit (ReLU) is inserted in between to solve the problem, which introduces rectification loss. To eliminate this loss, kernel augmentation is proposed by augmenting each transform kernel with its negative vector. Both the original and augmented kernels are used. With ReLU, one half of a transformed pair passes through while the other half of the pair is blocked. The integration of kernel augmentation and ReLU is equivalent to the sign-to-position (S/P) format conversion. Multiple Saak transforms are then cascaded to transform images of a larger size. The multi-stage Saak transforms offer a family of joint spatial-spectral representations between the full spatial-domain representation and the full spectral-domain representation.
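A minimal numerical sketch of the kernel augmentation idea (a toy illustration of my own, not code from the Saak papers): pairing each kernel with its negation and applying ReLU stores the sign of the projection in the position of the nonzero response, so the signed value can be recovered without rectification loss.

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.standard_normal(16)                    # one transform kernel
x = rng.standard_normal(16)                    # one input vector

proj = w @ x                                   # signed projection onto the kernel
pos = np.maximum(proj, 0)                      # ReLU response of the original kernel
neg = np.maximum(-w @ x, 0)                    # ReLU response of the augmented (negated) kernel

# exactly one of the pair is nonzero; their difference recovers the signed projection
print(pos, neg, np.isclose(pos - neg, proj))
```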
An overview of the multistage Saak transform is presented in Fig. 2.7. Images are
decomposed into four quadrants recursively to form a quad-tree structure with its root being the full image and its leaves being small patches of size 2 × 2 pixels. The first-stage Saak transform is applied at the leaf nodes. Then, multi-stage Saak transforms are applied from all leaf nodes to their parents, stage by stage, until the root node is reached. Specifically, the KLT is conducted on non-overlapping local cuboids of size 2 × 2 × K_0 at the first stage, where K_0 = 1 for a monochrome image and K_0 = 3 for a color image. The horizontal and vertical spatial dimensions of the input image are reduced by one half. Thereafter, the KLT coefficients are augmented so that the spectral dimension is doubled to give K_1 = 2^3 K_0. In the next stage, the KLT is conducted on non-overlapping local cuboids of size 2 × 2 × 2^3 K_0, which yields an output of spectral dimension K_2 = 2^3 K_1 after kernel augmentation. The whole process is stopped when the kernel size reaches 1 × 1 × K_f. If the image is of size 2^P × 2^P, we have K_f = 2^(3P).
Figure 2.7: Overview of the multi-stage Saak transform. The downward arrows represent the Saak transform, while the upward arrows represent the inverse Saak transform. The figure is from [45].
The Saak transform allows both forward and inverse transforms. This means that it can be used for image analysis as well as synthesis (or generation). The inverse Saak transform is conducted by performing a position-to-sign (P/S) format conversion before the inverse KLT. In general, the Saak transform converts the spatial variation to the spectral variation, while the inverse Saak transform converts the spectral variation to the spatial variation.
2.4.2 Saab Transform
FF-designed CNN
An interpretable feedforward (FF) design without any backpropagation (BP) was proposed in [48] to obtain the model parameters. The FF design is a data-centric approach that derives the network parameters of the current layer based on data statistics from the output of the previous layer in a one-pass manner. In our interpretation, each CNN layer corresponds to a vector space transformation. A sample distribution of the input space exists in the training data. To determine a proper transformation from the input to the output using the input data distribution, two steps are employed: 1) dimension reduction through subspace approximations and/or projections, and 2) training sample clustering and remapping. The former is used in the construction of convolutional layers, while the latter is adopted to build fully connected (FC) layers.
The convolutional layers offer a sequence of spatial-spectral filtering operations. A new signal transform, called the Saab transform, has been developed to construct convolutional layers. It is a variant of the principal component analysis (PCA) with an added bias vector to annihilate the nonlinearity of the activation, and it contributes to dimension reduction. Multiple Saab transforms in cascade yield multiple convolutional layers. An example of a two-layer Saab transform is shown in Fig. 2.8.
Figure 2.8: Two-layer Saab transform in the FF design of the first two convolutional layers
of the LeNet-5. The figure is from [48].
The FC layers provide a sequence of sample clustering and high-to-low dimensional mapping operations. They are constructed with a three-level hierarchy: the feature space, the sub-class space, and the class space. Linear least-squares regression (LLSR) guided by pseudo-labels is adopted to map from the feature space to the sub-class space. Then, LLSR guided by true labels is used to map from the sub-class space to the class space. The FC layers formed by multi-stage LLSRs in cascade correspond to a multi-layer perceptron (MLP). The design principle is not only to reduce the dimensions of the intermediate spaces, but also to gradually increase the discriminability of some dimensions. The multilayer transformations eventually reach the output space with strong discriminability.
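A minimal sketch of one LLSR stage (my own illustration with placeholder dimensions; the pseudo-label stage is analogous, with cluster-derived labels replacing the true ones): the mapping is obtained by solving a least-squares problem from features to one-hot label vectors.

```python
import numpy as np

def llsr_fit(X, labels, num_classes):
    """Fit one LLSR stage: features X (n x d) -> one-hot targets (n x c)."""
    Y = np.eye(num_classes)[labels]                  # one-hot encoding of the guiding labels
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])    # append a bias column
    W, *_ = np.linalg.lstsq(Xa, Y, rcond=None)       # least-squares solution
    return W

def llsr_apply(X, W):
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xa @ W                                    # soft class (or sub-class) scores

rng = np.random.default_rng(0)
X, y = rng.standard_normal((200, 32)), rng.integers(0, 10, 200)
W = llsr_fit(X, y, num_classes=10)
pred = llsr_apply(X, W).argmax(axis=1)               # predicted class indices
```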
Generally speaking, the Saab transform is more advantageous than the Saak transform, because the number of Saab filters is only one half of the number of Saak filters. Besides the interpretation of the cascade of convolutional layers as a sequence of approximating spatial-spectral subspaces, the fully connected layers act as a sequence of label-guided least-squares regression processes. As a result, all the model parameters of CNNs can be determined in an FF one-pass fashion. This is known as an FF-designed CNN (FF-CNN).
Saab Transform Algorithm
For an input v = (v_0, v_1, ..., v_{N−1})^T of dimension N, the one-stage Saab transform can be written as

y_k = Σ_{n=0}^{N−1} a_{k,n} v_n + b_k = a_k^T v + b_k,   k = 0, ..., K − 1,   (2.1)

where y_k is the kth Saab coefficient, a_k = (a_{k,0}, a_{k,1}, ..., a_{k,N−1})^T is the weight vector and b_k is the bias term of the kth Saab filter. The Saab transform has a particular rule for choosing the filter weight a_k and the bias term b_k.
Let us focus on the filter weights first. When k = 0, the filter is called the DC (direct current) filter, and its filter weight is

a_0 = (1/√N) (1, ..., 1)^T.

By projecting the input v onto the DC filter, we get its DC component v_DC = (1/√N) Σ_{n=0}^{N−1} v_n, which is nothing but the local mean of the input. We can derive the AC component of the input via

v_AC = v − v_DC.

When k > 0, the filters are called the AC (alternating current) filters. To derive the AC filters, we conduct PCA on the AC components, v_AC, and choose the first (K − 1) principal components as the AC filters a_k, k = 1, ..., K − 1. Finally, the DC filter and the K − 1 AC filters form the set of Saab filters.
Next, we discuss the choice of the bias term, b_k, of the kth filter. In CNNs, there is an activation function at the output of each convolutional operation, such as the ReLU (Rectified Linear Unit) and the sigmoid. In the Saab transform, we demand that all bias terms be the same so that they contribute to the DC term in the next stage. Besides, we choose the bias large enough to guarantee that the response y_k is always non-negative before the nonlinear activation operation. Thus, the nonlinear activation plays no role and can be removed. It is shown in [48] that b_k can be selected using the following rule:

b_k = constant ≥ max_v ||v||,   k = 0, ..., K − 1.
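A minimal numerical sketch of this one-stage Saab construction (my own simplification; the function names, the number of AC filters and the exact bias choice are illustrative assumptions, not code from [48]):

```python
import numpy as np

def saab_fit(X, num_ac):
    """Derive Saab filters from training vectors X of shape (num_samples, N)."""
    N = X.shape[1]
    dc = np.ones((1, N)) / np.sqrt(N)                 # DC filter a_0
    X_ac = X - (X @ dc.T) @ dc                        # remove the DC (constant) subspace from each vector
    eigval, eigvec = np.linalg.eigh(np.cov(X_ac, rowvar=False))
    order = np.argsort(eigval)[::-1]
    ac = eigvec[:, order[:num_ac]].T                  # leading principal components as AC filters
    filters = np.vstack([dc, ac])                     # K = num_ac + 1 Saab filters
    bias = np.max(np.linalg.norm(X, axis=1))          # a constant no smaller than max ||v||
    return filters, bias

def saab_transform(X, filters, bias):
    return X @ filters.T + bias                       # responses stay non-negative, so ReLU is a no-op

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 24))
filters, bias = saab_fit(X, num_ac=7)
print(saab_transform(X, filters, bias).shape)         # (500, 8)
```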
Pixels in images have a decaying correlation structure. The correlation between local
pixels is stronger and the correlation becomes weaker as their distance becomes larger. To
exploit this property, we conduct the first-stage PCA in a local window for dimension reduc-
tion to get a local spectral vector. It will result in a joint spatial-spectral cuboid where the
spatial dimension denotes the spatial location of the local window and the spectral dimen-
sion provides the spectral components of the corresponding window. Then, we can perform
the second-stage PCA on the joint spatial-spectral cuboid. The multi-stage PCA is better
than the single-stage PCA since it handles decaying spatial correlations in multiple spatial
resolutions rather than in a single spatial resolution.
2.5 Datasets
Public datasets are critical when it comes to the research and development of algorithms. They provide a common ground for evaluating algorithms and methods and enable the performance of different methods to be compared. In this section, we introduce the point cloud datasets that are used to evaluate our methods, i.e., ModelNet40 [84], ShapeNet [11, 92] and S3DIS [5].
2.5.1 ModelNet40
The ModelNet40 dataset [84] is a compilation of 12,311 computer-aided design (CAD) models of point clouds of common objects, such as tables, chairs, sofas, airplanes, and so on. In all, the ModelNet40 dataset includes 40 object categories. The dataset is divided into 9,843 models for training and 2,468 models for testing. Each point cloud consists of 2,048 points. All the point cloud models are pre-aligned into a canonical frame of reference. ModelNet40 and its subset ModelNet10 are widely used in point cloud object classification and shape retrieval tasks. ModelNet40 is a synthetic dataset. Some point cloud models from the ModelNet40 dataset are shown in Fig. 2.9.
Figure 2.9: ModelNet40 dataset. From left to right: person, cup, stool, and guitar.
2.5.2 ShapeNet
The ShapeNet core dataset [11] contains 57,448 CAD models of man-made objects (airplane, bag, car, chair, etc.) in 55 categories. Each CAD model is sampled to 2,048 points with three Cartesian coordinates. The ShapeNet core dataset is not fully annotated. The ShapeNetPart dataset [92] is a subset of the ShapeNet core dataset used to predict a part category for each point. The ShapeNetPart dataset has 16,881 CAD models in 16 object categories, which are each sampled at 2,048 points to generate point clouds. Each object category is annotated with two to six parts, and there are 50 parts in total. The dataset is divided into three sections: 12,137 shapes for training, 1,870 shapes for validation, and 2,874 shapes for testing. Different models of airplane, bag, earphone and car with annotations are shown in Fig. 2.10.
Figure 2.10: Examples of ShapeNetPart point clouds with annotations.
2.5.3 S3DIS
The Stanford 3D Indoor Segmentation (S3DIS) dataset [5] is a subset of the Stanford 2D-3D-Semantics dataset. It is one of the benchmark datasets for point cloud semantic segmentation tasks. The S3DIS dataset contains point clouds scanned from 6 indoor areas with 271 rooms. There are 13 categories in total, such as ceiling, floor, wall, door, etc. Each point has 9 dimensions including XYZ, RGB and normalized XYZ. Details of the dataset are shown in Fig. 2.11. Different from ShapeNet, the dataset is labeled by object categories instead of object part categories. The dataset is usually pre-processed by block partitioning. That is, each room is split into 1 × 1 meter blocks, where each block is randomly sampled to 4,096 points for training, while all points can be used for testing depending on the memory of the computing devices. The K-fold strategy is used for training and testing.
Figure 2.11: Details of S3DIS Dataset. The figure is from [63].
Chapter 3
Explainable and Green Point Cloud Classification
3.1 Introduction
Research on point cloud learning has been gaining interest recently due to the easy access and complete description of point cloud data in the 3D space. However, point cloud data is sparse, irregular, and unordered by nature. Traditional methods for point cloud classification and segmentation tasks usually use handcrafted feature descriptors, which tend to be geometric and/or shallow. Nevertheless, these methods do not need to be supervised, and they are quite efficient and readily interpretable. In contrast, deep learning methods provide good classification performance and their learned features are more semantic, but they require end-to-end supervision and incur high costs in computational resources such as GPUs. The high time and memory costs also make it challenging to deploy these methods on mobile or terminal devices. In addition, these methods are often criticized for their lack of interpretability. To address these problems, we propose explainable and green methods for point cloud classification, which are data-driven like deep learning methods while learning features in a single feedforward pass like traditional methods.
The rest of this chapter is organized as follows. A new and explainable machine learning method, called the PointHop method, is proposed for 3D point cloud classification in Sec. 3.2. As a follow-up, we analyze the shortcomings of the PointHop method and further propose a lightweight learning model for point cloud classification, called PointHop++, in Sec. 3.3. Finally, concluding remarks and future research directions are given in Sec. 3.4.
3.2 PointHop Method
We compare PointHop with deep-learning-based methods in Fig. 3.1. PointHop is math-
ematically transparent and it requires only one forward pass to learn parameters of the
system. Furthermore, its feature extraction is an unsupervised procedure since no class
labels are needed in this stage.
3.2.1 Methodology
The source point cloud model typically contains a large number of points of high density,
and its processing is very time-consuming. We can apply random sampling to reduce the
number of points with little degradation in classification performance. As shown in Fig.
3.2, an exemplary point cloud model of 2,048 points is randomly sampled and represented
by four different point numbers. They are called the random dropout point (DP) models.
A model with more sampled points provides higher representation accuracy at the cost of
higher computational complexity. We will use the DP model as the input to the proposed
PointHop system, and show the classification accuracy as a function of the point numbers
of a DP model in Sec. 3.2.2.
Figure 3.1: Comparison of existing deep learning methods and the proposed PointHop
method. Top: Point cloud data are fed into deep neural networks in the feedforward pass
and errors are propagated in the backward direction. This process is conducted iteratively
until convergence. Labels are needed to update all model parameters. Bottom: Point cloud
data are fed into the PointHop system to build and extract features in one fully explainable
feedforward pass. No labels are needed in the feature extraction stage (i.e. unsupervised
feature learning). The whole training of PointHop can be efficiently performed on a single
CPU at much lower complexity than deep-learning-based methods.
Figure 3.2: Random sampling of a point cloud of 2,048 points into simplified models of (a)
256 points, (b) 512 points, (c) 768 points and (d) 1,024 points. They are called the random
dropout point (DP) models and used as the input to the PointHop system.
A point cloud of N points is defined as P = {p_1, ..., p_N}, where p_n = (x_n, y_n, z_n) ∈ R^3, n = 1, ..., N. There are two distinct properties of point cloud data:
• Unordered data in the 3D space
Point clouds comprise a set of points in the 3D space without a specific order, which
is different from images where pixels are defined in a regular 2D grid.
• Disturbance in scanned points
For the same 3D object, different point sets can be acquired with uncertain position
disturbance. For example, different scanning methods are applied to the surface of the
same object or the scanning device is at different locations.
An overview of the proposed PointHop method is shown in Fig. 3.3. A point cloud, P,
is taken as the input, and PointHop outputs the corresponding class label. The PointHop
mechanism consists of two stages: 1) local-to-global attribute building through multi-hop
information exchange; and 2) classification and ensembles. The input point cloud has N
points with three coordinates (x,y,z). It is fed into multiple PointHop units in cascade and
their outputs are aggregated by M different schemes to derive features. All features are
cascaded for object classification.
Local-to-Global Attribute Building
The attribute building stage addresses the problem of unordered point cloud data using
a space partitioning procedure and developing a robust descriptor that characterizes the
relationship between a point and its one-hop neighbors. Initially, the attributes of a point are its 3D coordinates. When multiple PointHop units are performed in cascade, the attributes
Figure 3.3: An overview of the PointHop method. The input point cloud has N points with 3 coordinates (x, y, z). It is fed into multiple PointHop units in cascade and their outputs are aggregated by M different schemes to derive features. All features are cascaded for object classification.
of a point will grow by considering its relationship with one-hop neighbor points iteratively. The attributes of a point evolve from a low-dimensional vector into a high-dimensional vector through this module. To control this rapid dimension growth, the Saab transform is applied for dimension reduction, so that the dimension grows at a slower rate. All these operations are conducted inside the PointHop processing unit. The local descriptor is robust since its construction takes the issues of unordered 3D data and the disturbance of scanned points into account.
Figure 3.4: Illustration of the PointHop unit. The red point is the center point while the
yellow points represent its K nearest neighbors.
37
The PointHop unit is shown in Fig. 3.4. For each point p_c = (x_c, y_c, z_c) in the point cloud P, we search for its K nearest neighbor points in the point cloud, including the point itself, where the distance is measured by the Euclidean norm. The point and its K nearest neighbors form a local region:

KNN(p_c) = {p_{c1}, ..., p_{cK}},   p_{c1}, ..., p_{cK} ∈ P.   (3.1)

We treat p_c as a new origin for the local region centered at p_c, so that we can partition the local region into eight octants ξ_j, j = 1, ..., 8, based on the value of each coordinate (i.e., greater or less than that of p_c).
The centroid of the point attributes in each octant is computed via

a_c^j = (1/K_j) Σ_{i=1}^{K_j} t_{ci}^j a_{ci},   j = 1, ..., 8,   (3.2)

where a_{ci} is the attribute vector of point p_{ci} and

t_{ci}^j = 1 if x_{ci} ∈ ξ_j, and t_{ci}^j = 0 if x_{ci} ∉ ξ_j,   (3.3)

is the coefficient that indicates whether point x_{ci} is in octant ξ_j, and K_j is the number of K-NN points in octant ξ_j. Finally, all centroids of attributes a_c^j, j = 1, ..., 8, are concatenated to form a new descriptor of sampled point p_c:

a_c = Concat{a_c^j}_{j=1}^{8}.   (3.4)
This descriptor is robust with respect to disturbance in the positions of acquired points because of the averaging operation in each quadrant. We use the 3D coordinates, (x, y, z), as the initial attributes of a point. They are called the 0-hop attributes. The dimension of the 0-hop attributes is 3. The local descriptor given in Eq. (3.4) has a dimension of 3 × 8 = 24. We adopt the local descriptor as the new attributes of a point, which takes its relationship with its KNN neighbors into account. They are called the 1-hop attributes. Note that the 0-hop attributes can be generalized to (x, y, z, r, g, b) for point clouds with color information (r, g, b) at each point.
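A minimal sketch of this one-hop descriptor computation for a single point (my own illustration; the octant indexing convention and the handling of empty octants are assumptions not spelled out above):

```python
import numpy as np

def one_hop_descriptor(points, attrs, center_idx, k=64):
    """points: (N, 3) xyz coordinates; attrs: (N, D) current attributes of each point."""
    center = points[center_idx]
    dists = np.linalg.norm(points - center, axis=1)
    nn = np.argsort(dists)[:k]                         # K nearest neighbors (incl. the point itself)
    rel = points[nn] - center                          # treat the center as the new origin
    octant = (rel[:, 0] > 0) * 4 + (rel[:, 1] > 0) * 2 + (rel[:, 2] > 0)   # octant index 0..7
    D = attrs.shape[1]
    desc = np.zeros((8, D))
    for j in range(8):
        mask = octant == j
        if mask.any():                                 # empty octants stay zero
            desc[j] = attrs[nn][mask].mean(axis=0)     # centroid of attributes in octant j
    return desc.reshape(-1)                            # concatenation -> 8*D dimensional descriptor
```

With the initial 3D coordinates as attributes (D = 3), the returned vector has dimension 24, matching the 1-hop attribute dimension above.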
If p_B is a member of KNN(p_A), we say that p_B is a 1-hop neighbor of p_A. If p_C is a 1-hop neighbor of p_B and p_B is a 1-hop neighbor of p_A, we say that p_C is a 2-hop neighbor of p_A if p_C is not a 1-hop neighbor of p_A. The dimension of the attribute vector of each point grows from 3 to 24 due to the change of local descriptors from 0-hop to 1-hop. We can build another local descriptor based on the 1-hop attributes of each point. This descriptor defines the 2-hop attributes of dimension 24 × 8 = 192. In general, the n-hop attributes characterize the relationship of a point with its m-hop neighbors, m ≤ n.
As n becomes larger, the n-hop attributes offer a larger coverage of points in a point cloud model, which is analogous to a larger receptive field in deeper layers of CNNs. Yet, the dimension grows at a fast rate. It is desired to reduce the dimension of the n-hop attribute vector before reaching out to the neighbors of the (n + 1)-hop. The Saab transform [48] is used to reduce the attribute dimension of each point.
Each PointHop unit has a one-stage Saab transform. For L PointHop units in cascade, we need L-stage Saab transforms. We set L = 4 in the experiments. Each Saab transform contains three steps: 1) DC/AC separation, 2) PCA, and 3) bias addition. The number of AC Saab filters is determined by the energy plot of the PCA coefficients as shown in Fig. 3.5. We choose the knee location of the curve, as indicated by the red point in each subfigure.
Figure 3.5: Determination of the number of Saab filters in each of the PointHop units, where the red dot in each subfigure indicates the selected number of Saab filters.
The system diagram of the proposed PointHop method is shown in Fig. 3.3. It consists of multiple PointHop units. Four PointHop units are shown in the figure. For the ith PointHop unit output, we use N_i × D_i to characterize its two parameters; namely, it has N_i points and each of them has D_i attributes.
For the ith PointHop unit, we aggregate (or pool) each individual attribute of the N_i points into a single feature vector. To enrich the feature set, we consider multiple aggregation/pooling schemes such as the max pooling [65], the mean aggregation, the l1-norm aggregation and the l2-norm aggregation. Then, we concatenate them to obtain a feature vector of dimension M × D_i, where M is the number of attribute aggregation methods, for the ith PointHop unit. Finally, we concatenate the feature vectors of all PointHop units to form the ultimate feature vector of the whole system.
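A minimal sketch of this multi-aggregation step (my own illustration; the PointHop unit output is stubbed with a random array):

```python
import numpy as np

def aggregate_features(unit_output):
    """unit_output: (N_i, D_i) attributes of the ith PointHop unit."""
    pooled = [
        unit_output.max(axis=0),                      # max pooling
        unit_output.mean(axis=0),                     # mean aggregation
        np.abs(unit_output).sum(axis=0),              # l1-norm aggregation
        np.sqrt((unit_output ** 2).sum(axis=0)),      # l2-norm aggregation
    ]
    return np.concatenate(pooled)                     # vector of dimension M x D_i with M = 4

rng = np.random.default_rng(0)
feat = aggregate_features(rng.standard_normal((128, 40)))
print(feat.shape)                                     # (160,)
# the final feature is the concatenation of such vectors over all PointHop units
```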
To reduce the computational complexity and speed up the growth of coverage, we adopt a spatial sampling scheme between two consecutive PointHop units so that the number of points to be processed is reduced. This is achieved by the farthest point sampling (FPS) scheme [40, 23, 61], since it captures the geometrical structure of a point cloud model better. For a given set of input points, the FPS scheme first selects the point closest to the centroid. Afterwards, it iteratively selects the point that has the farthest Euclidean distance to the existing points in the selected subset until the target number is reached. The advantage of the FPS scheme will be illustrated in Sec. 3.2.2.
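A minimal sketch of FPS as described above (my own implementation; the greedy minimum-distance bookkeeping is a standard trick and is an assumption rather than a detail given in the text):

```python
import numpy as np

def farthest_point_sampling(points, m):
    """points: (N, 3) array; returns the indices of m sampled points."""
    centroid = points.mean(axis=0)
    first = int(np.argmin(np.linalg.norm(points - centroid, axis=1)))  # start closest to the centroid
    selected = [first]
    dist = np.linalg.norm(points - points[first], axis=1)              # distance to the selected subset
    for _ in range(m - 1):
        nxt = int(np.argmax(dist))                                     # farthest from the subset
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(selected)

rng = np.random.default_rng(0)
idx = farthest_point_sampling(rng.standard_normal((1024, 3)), 128)
print(idx.shape)                                                       # (128,)
```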
Classification and Ensembles
Upon obtaining the feature vector, we adopt well-known classifiers such as the support vector machine (SVM) [21] and the random forest (RF) [8] classifiers for the classification task. The SVM classifier performs classification by finding gaps that separate different classes. Test samples are then mapped to one side of a gap and predicted to have the label of that side. The RF classifier first trains a number of decision trees, and each decision tree gives an output. Then, the RF classifier ensembles the outputs from all decision trees to give the mean prediction. Both classifiers are mature and easy to use.
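A minimal usage sketch with scikit-learn (my own illustration; the feature matrices below are random stand-ins for the PointHop features, and the hyper-parameters are arbitrary choices, not the settings used in the experiments):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# toy stand-ins for the PointHop feature vectors and ModelNet40 labels
train_features = np.random.rand(100, 512)
train_labels = np.random.randint(0, 40, 100)
test_features = np.random.rand(10, 512)

rf = RandomForestClassifier(n_estimators=128)   # illustrative number of trees
rf.fit(train_features, train_labels)
pred = rf.predict(test_features)
```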
Ensemble methods fuse the results from multiple weak classifiers to get a more powerful one [22, 70, 20, 94]. Ensembles are adopted to improve the classification performance further. We consider the following two ensemble strategies.
1. Decision ensemble. Multiple PointHop units are individually used as base classi-
fiers and their decision vectors are concatenated to form a new feature vector for the
ensemble classifier.
2. Feature ensemble. Features from multiple PointHop units are cascaded to form the
final vector for the classification task.
It is our observation that the second strategy offers better classification accuracy at the cost of a higher complexity if the feature dimension is large. We choose the second strategy for its higher accuracy. With the feature ensemble strategy, it is desired to increase the diversity of PointHop to enrich the feature set. We use the following four schemes to achieve this goal. First, we augment the input data by rotating it by a certain degree. Second, we change the number of Saab filters in each PointHop unit. Third, we change the K value in the KNN scheme. Fourth, we vary the numbers of points in the PointHop units.
3.2.2 Experimental Results
ModelNet40 Dataset. We conduct experiments on a popular 3D object classification dataset called ModelNet40 [84]. The dataset contains 40 categories of CAD models of objects such as airplanes, chairs, benches, cups, etc. Each initial point cloud has 2,048 points and each point has three Cartesian coordinates. There are 9,843 training samples and 2,468 testing samples.
Experimental Setting. We adopt the following default setting in our experiments.
• The number of sampled points into the first PointHop unit: 256 points.
• The sampling method from the input point cloud model to that as the input to the
first PointHop unit: random sampling.
• The number of K in the KNN: K = 64.
• The number of PointHop units in cascade: 4.
• The number of Saab AC filters in the ith PointHop unit: 15 (i = 1), 25 (i = 2), 40
(i = 3) and 80 (i = 4).
• The sampling method between PointHop units: Farthest Point Sampling (FPS).
• The number of sampled points in the 2nd, 3rd and 4th PointHop units: 128, 128 and
64.
• The aggregation method: mean pooling.
• The classifier: the random forest classifier.
• Ensembles: No.
Ablation Study on PointHop Unit
We show classification accuracy values under various parameter settings in Table 3.1. We see from the table that it is desired to use features from all stages, the FPS between PointHop units, ensembles of all pooling schemes, the random forest classifier and the Saab transform. As shown in the last row, we can reach a classification accuracy of 86.1% with 256 randomly selected points as the input to the PointHop system. The whole training time is only 5 minutes. The FPS not only contributes to higher accuracy but also reduces the computation time dramatically since it enlarges the receptive field at a faster rate. The RF classifier has a higher accuracy than the SVM classifier. Besides, it is much faster.
Table 3.1: Results of ablation study with 256 sampled points as the input to the PointHop
system.
Feature used | FPS | Pooling | Classifier | Dimension Reduction | Accuracy (%)
(All stages / Last stage) | (Yes / No) | (Max / Mean / l1 / l2) | (SVM / Random Forest) | (PCA / Saab) |
✓ ✓ ✓ ✓ ✓ 77.5
✓ ✓ ✓ ✓ ✓ 77.4
✓ ✓ ✓ ✓ ✓ 79.6
✓ ✓ ✓ ✓ ✓ 79.9
✓ ✓ ✓ ✓ ✓ 78.8
✓ ✓ ✓ ✓ ✓ 80.2
✓ ✓ ✓ ✓ ✓ 84.5 (default)
✓ ✓ ✓ ✓ ✓ 84.8
✓ ✓ ✓ ✓ ✓ 85.6
✓ ✓ ✓ ✓ ✓ ✓ 85.3
✓ ✓ ✓ ✓ ✓ ✓ 85.7
✓ ✓ ✓ ✓ ✓ ✓ 85.1
✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ 86.1
✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ 85.6
We study the classification accuracy as a function of the number of sampled points of all point cloud models as well as different pooling methods in Fig. 3.6, where the x-axis shows the number of sampled points, which is the same in training and testing. Corresponding to Fig. 3.2, we consider the following four settings: 256 points, 512 points, 768 points and 1,024 points. Different color curves are obtained by different pooling schemes. We compare eight cases: the four individual schemes, three ensembles of two (the mean, l1 and l2 poolings each combined with the max pooling), and one ensemble of all four. We see that the maximum pooling and the mean pooling give the worst performance. Their ensemble does not perform well, either. The performance gap is small for the remaining five schemes when the point number is 1,024. The ensemble of all pooling schemes gives the best results in all four cases. The highest accuracy is 88.2% when we use 768 or 1,024 points with the ensemble of all four pooling schemes.
Figure 3.6: The classification accuracy as a function of the sampled point number of the
input model to the PointHop system as well as different pooling methods.
Ensembles of PointHop Systems
Under the default setting, we consider ensembles of five PointHop systems with changed hyper-parameters (HP) to increase the diversity. They are summarized in Table 3.2. The hyper-parameters of concern include the following four.
• HP-A. We augment each point cloud model by rotating it by 45 degrees four times.
• HP-B. We use different numbers of AC filters in the PointHop units.
• HP-C. We adopt different K values in the KNN query in the PointHop units.
• HP-D. We take point cloud models of different point numbers as the input to the
PointHop units in four stages.
For HP-B, HP-C and HP-D, the four numbers in the table correspond to those in the first, second, third and fourth PointHop units, respectively. To get the ensemble results of HP-A, we keep HP-B, HP-C and HP-D the same (say, Setting 1). The same procedure applies to getting the ensemble results of HP-B, HP-C and HP-D. Furthermore, we can derive the ensemble results of all cases as shown in the last column. We see from the table that the simplest and most effective ensemble result is achieved by rotating point clouds, where we can reach a test accuracy of 88%. Thus, we focus on this ensemble method only in later experiments.
Comparison with State-of-the-Art Methods
We first compare the classification accuracy of the proposed PointHop system with those
of several state-of-the-art methods such as PointNet [65], PointNet++ [67], PointCNN [55]
and DGCNN [81] in Table 3.3. All of these works (including ours) are based on the model
Table 3.2: Ensembles of five PointHops with changed hyper-parameter settings and their corresponding classification accuracies.

Hyper-parameter | Setting 1 | Setting 2 | Setting 3 | Setting 4 | Setting 5 | Ensemble accuracy (%)
HP-A | 0 | 45 | 90 | 135 | 180 | 88.0
HP-B | (15, 25, 40, 80) | (15, 25, 35, 50) | (18, 30, 50, 90) | (20, 40, 60, 100) | (20, 40, 70, 120) | 87.0
HP-C | (64, 64, 64, 64) | (32, 32, 32, 32) | (32, 32, 64, 64) | (96, 96, 96, 96) | (128, 128, 128, 128) | 87.8
HP-D | (512, 128, 128, 64) | (512, 256, 128, 64) | (512, 256, 256, 128) | (512, 256, 256, 256) | (512, 128, 128, 128) | 86.8
Ensemble of all cases: 88.0
of 1,024 points. The column of "average accuracy" shows the average of the per-class classification accuracies, while the column of "overall accuracy" shows the best overall result obtained. Our PointHop baseline, a single model without any ensembles, achieves 88.65% overall accuracy. With the ensemble, the overall accuracy is increased to 89.1%. The performance of PointHop is worse than that of PointNet [65] and DGCNN [81] by 0.1% and 3.1%, respectively. On the other hand, our PointHop method performs better than other unsupervised methods such as LFD-GAN [2] and FoldingNet [88].
Table 3.3: Comparison of classification accuracy on ModelNet40, where the proposed
PointHop system achieves 89.1% test accuracy, which is 0.1% less than PointNet [65] and
3.1% less than DGCNN [81].
Method | Feature extraction | Average accuracy (%) | Overall accuracy (%)
PointNet [65] | Supervised | 86.2 | 89.2
PointNet++ [67] | Supervised | - | 90.7
PointCNN [55] | Supervised | 88.1 | 92.2
DGCNN [81] | Supervised | 90.2 | 92.2
PointNet baseline (Handcraft, MLP) | Unsupervised | 72.6 | 77.4
LFD-GAN [2] | Unsupervised | - | 85.7
FoldingNet [88] | Unsupervised | - | 88.4
PointHop (baseline) | Unsupervised | 83.3 | 88.65
PointHop | Unsupervised | 84.4 | 89.1
Next, we compare the time complexity in Table 3.4. As shown in the table, the training
time of the PointHop system is significantly lower than deep-learning-based methods. It
takes 5 minutes and 20 minutes in training a PointHop baseline of 256-point and 1,024-point
cloud models, respectively, with a CPU. Our CPU is an Intel(R) Xeon(R) CPU E5-2620 v3 at 2.40GHz. In contrast, PointNet [65] takes more than 5 hours to train using one GTX1080 GPU. Furthermore, we compare the inference time in the test stage. PointNet++ demands 163.2 ms to classify a test sample of 1,024 points while our PointHop method only needs 108.4 ms. The most time-consuming module in the PointHop system is the KNN query that compares the distances between points. It is possible to lower the training/testing time even more by speeding up this module.
Table 3.4: Comparison of time complexity between PointNet/PointNet++ and PointHop.
Method | Total training time | Inference time (ms) | Device
PointNet (1,024 points) | ~5 hours | 25.3 | GPU
PointNet++ (1,024 points) | - | 163.2 | GPU
PointHop (256 points) | ~5 minutes | 103 | CPU
PointHop (1,024 points) | ~20 minutes | 108.4 | CPU
In Fig. 3.7, we examine the robustness of the classification performance with respect to models of four point numbers, i.e., 256, 512, 768 and 1,024. For the first scenario, the numbers of points in training and testing are the same. It is indicated by DP in the legend. The PointHop method and the PointNet vanilla are shown as violet and yellow lines. The PointHop method with DP is more robust than PointNet vanilla with DP. For the second scenario, we train each method based on 1,024-point models and then apply the trained model to point clouds of the same or fewer point numbers in the test. For the latter, there is a point cloud model mismatch between training and testing. We see that the PointHop method is more robust than PointNet++ (SSG) in the mismatched condition. The PointHop method also outperforms DGCNN in the mismatched condition of the 256-point models.
Figure 3.7: Robustness to sampling density variation: comparison of test accuracy as a
function of sampled point numbers of different methods.
Feature Visualization
The learned features of the first-stage PointHop unit are visualized in Fig. 3.8 for six highly varying point cloud models. We show the responses of different channels, which are normalized into [0, 1] (or from blue to red in color). We see that many common patterns are learned, such as the corners of tents/lamps and the planes of airplanes/beds. The learned features provide a powerful and informative description in the 3D geometric space.
Error Analysis
The average accuracy of the PointHop method is worse than PointNet [65] by 1.8%. To
provide more insights, we show per-class accuracy on ModelNet40 in Table 3.5. We see that
PointHop achieves equal or higher accuracy in 18 classes. On the other hand, it has low
Figure 3.8: Visualization of learned features in the first-stage PointHop unit. From left to right: airplane, bed, lamp, tent, toilet, plant.
accuracy in several classes, including flower-pot (10%), cup (55%), radio (65%) and sink
(60%). Among them, the flower pot is the most challenging one.
We conduct error analysis on two object classes, “flower pot” and “cup”, in Figs. 3.9 (a)
and (b), respectively. The total test number of the flower pot class is 20. Eleven, six and
one of them are misclassified to the plant, the vase and the lamp classes, respectively. There
are only two correct classification cases. We show all point clouds of the flower pot class
in Fig. 3.9 (a). Only the first point cloud has a unique flower pot shape while others have
both the flower pot and the plant or are similar to the vase in shape. As to the cup class
classification, six are misclassified to the vase class, one misclassified to the bowl class and
another one misclassified to the lamp class. There are twelve correct classification results.
Table 3.5: Comparison of per-class classification accuracy on the ModelNet40.
Network airplane bathtub bed bench bookshelf bottle bowl car chair cone
PointNet 100.0 80.0 94.0 75.0 93.0 94.0 100.0 97.9 96.0 100.0
PointHop 100.0 94.0 99.0 70.0 96.0 95.0 95.0 97.0 100.0 90.0
cup curtain desk door dresser flower pot glass box guitar keyboard lamp
PointNet 70.0 90.0 79.0 95.0 65.1 30.0 94.0 100.0 100.0 90.0
PointHop 55.0 85.0 90.7 90.0 83.7 10.0 95.0 99.0 95.0 75.0
laptop mantel monitor night stand person piano plant radio range hood sink
PointNet 100.0 96.0 95.0 82.6 85.0 88.8 73.0 70.0 91.0 80.0
PointHop 100.0 91.0 98.0 79.1 80.0 82.0 76.0 65.0 91.0 60.0
sofa stairs stool table tent toilet tv stand vase wardrobe xbox
PointNet 96.0 85.0 90.0 88.0 95.0 99.0 87.0 78.8 60.0 70.0
PointHop 96.0 75.0 85.0 82.0 95.0 97.0 82.0 84.0 70.0 75.0
The errors are caused by shape/functional similarity. To overcome this challenge, we may need to supplement the data-driven approach with a rule-based approach to improve the classification performance further. For example, the height-to-radius ratio of a flower pot is smaller than that of a vase. Also, if the object has a handle, it is more likely to be a cup rather than a vase.
3.2.3 Discussion
An explainable machine learning method called the PointHop method was proposed for
point cloud classification in this section. It builds attributes of higher dimensions at each
sampled point through iterative one-hop information exchange. This is analogous to a larger
receptive field in deeper convolutional layers in CNNs. The problem of unordered point
cloud data was addressed using a novel space partitioning procedure. Furthermore, we
Figure 3.9: The label under each point cloud is its predicted class. Many flower pots are
misclassified to the plant and the vase classes. Also, quite a few cups are misclassified to the
vase class.
used the Saab transform to reduce the attribute dimension in each PointHop unit. In the
classification stage, we fed the feature vector to a classifier and explored ensemble methods
to improve the classification performance. It was shown by experimental results that the
training complexity of the PointHop method is significantly lower than that of state-of-the-
art deep-learning-based methods with comparable classification performance. We conducted
error analysis on hard object classes and pointed out a future research direction for further
performance improvement by considering data-driven and rule-based approaches jointly.
3.3 PointHop++ Method
There are two shortcomings of PointHop. First, it has a large spatial dimension and a small spectral dimension at the beginning of the pipeline. Each point has a small receptive field. As we move to further hops (or stages), the receptive field increases in size, and the system trades a larger spatial dimension for a higher spectral dimension. We use n_t = n_a × n_e to denote the tensor dimension at a certain hop, where n_a and n_e are the spatial and spectral dimensions, respectively. Under the SSL framework, we need to conduct the principal component analysis (PCA) on the input tensor space. That is, we compute the covariance matrix of the vectorized tensors, which has a dimension of n_t × n_t. Then, if we want to find d principal components, the complexity is O(d n_t^2 + d^3). Since n_t > d, the first term dominates. To make the learning model smaller, it is desired to lower the input tensor dimension so as to reduce the filter size. Second, the loss function minimization plays an important role in deep-learning-based methods. However, it was not incorporated in PointHop. To get a lightweight model and leverage the loss function for better performance, we present new ideas to improve PointHop.
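To make the first shortcoming concrete, here is a rough back-of-the-envelope calculation of my own (the value d = 40 is only an illustrative choice, borrowed from the third PointHop unit's filter count in the default setting of Sec. 3.2.2): the dominant O(d n_t^2) term grows quickly with PointHop's hop-wise attribute dimensions, which are n_t = 24 after one hop and n_t = 192 after two hops (Sec. 3.2.1).

```python
# Illustrative only: growth of the dominant d * n_t^2 term across PointHop hops.
d = 40                        # number of principal components kept (an assumed, illustrative value)
for n_t in (24, 192):         # 1-hop and 2-hop attribute dimensions from Sec. 3.2.1
    print(n_t, d * n_t ** 2)  # 24 -> 23,040 and 192 -> 1,474,560, a 64x increase
```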
3.3.1 Methodology
An overview of the proposed PointHop++ method is illustrated in Fig. 3.10. A point cloud set, P, which consists of N points denoted by p_n = (x_n, y_n, z_n), 1 ≤ n ≤ N, is taken as input to the feature learning system to obtain a powerful feature representation. After that, linear least-squares regression (LLSR) is conducted on the obtained features to output a 40-D probability vector, from which the predicted class label is obtained.
Initial Feature Space Construction
Given a point cloud P = {p_1, p_2, ..., p_N}, where p_n = (x_n, y_n, z_n) ∈ R^3, N is the size of the point set. To extract the local feature of each point p_c ∈ P, we follow the same design principle as the PointHop unit. The k nearest neighbor points of point p_c are retrieved to build a neighboring point set

Neighborhood(p_c) = {p_{c_1}, p_{c_2}, ..., p_{c_k}},

including p_c itself. The neighborhood set excluding p_c is partitioned into eight quadrants according to their relative spatial coordinates. Then, mean pooling is used to generate a D-dimensional attribute vector for each quadrant. Mathematically, we have the following mapping:

g: R^D × ··· × R^D (k copies) → R^D × ··· × R^D (8 copies),   (3.5)

where D = 3 for the first hop and D = 1 for the remaining hops. The operation in the first PointHop unit is shown in the upper-left enclosed subfigure of Fig. 3.10.
Figure 3.10: Illustration of the PointHop++ method, where the upper-left enclosed subfigure shows the operation in the first PointHop unit, and N and N_i denote the number of points of the input and in the ith hop, respectively. Due to little correlation between channels, we can perform channel-wise (c/w) subspace decomposition to reduce the model size. A subspace with its energy larger than threshold T proceeds to the next hop while others become leaf nodes of the feature tree in the current hop.
In words, the averaged attribute of all points in a quadrant is selected as the representative attribute of that quadrant. For the first hop, we use the spatial coordinates p_n = (x_n, y_n, z_n) as the attributes. For the remaining hops, we use a one-dimensional (1D) spectral component as the attribute of the retrieved points. This is possible since we apply the c/w subspace decomposition to the output of the previous hop. It is worthwhile to point out that, instead of using max pooling as the symmetric function, we adopt mean pooling as the symmetric function here. This ensures that the attributes of points are invariant under permutations of the points in the point cloud while the local structure is retained at the same time. The attributes of all eight quadrants are concatenated to become a ∈ R^{8D}, which represents the attribute of the selected point p_c before the c/w subspace decomposition.
Channel-Wise (C/W) Subspace Decomposition
The Saab transform [48] is a variant of the PCA [82] designed to overcome the sign confusion problem [43] when multiple PCA stages are in cascade. It is used as a dimension reduction tool in PointHop. All Saab transform coefficients are grouped together and used as the input to the next hop unit in PointHop. Here, we would like to prove that the Saab coefficients of different channels are weakly correlated. Then, we can decompose the Saab coefficient vector of dimension 8D into 8D one-dimensional (1D) subspaces. Each 1D subspace represents a spatial-spectral localized representation of the point set. Besides its physical meaning, this representation demands less computation in the next hop. For ease of implementation, all components after the Saab transform are kept in PointHop++.
To validate the c/w subspace decomposition, we compute the correlation of the Saab coefficients. The input to the Saab transform is

A = [a_1, ..., a_N]^T ∈ R^{N×8D},

where a_n is the 8D attribute vector of point p_n, and the filter weight matrix is

W = [w_1, w_2, ..., w_{8D}] ∈ R^{8D×8D},

where w_1 = (1/√(8D)) [1, 1, ..., 1]^T and the others are eigenvectors of the covariance matrix of A, ranked by their associated eigenvalues λ_i from the largest to the smallest. The output of the Saab transform is

B = A · W = [b_1, ..., b_{8D}],

where b_i ∈ R^{N×1}, i = 1, ..., 8D. Hence, the correlation between Saab coefficients of different channels is

Cor(b_i, b_j) = (1/N) (A · w_i)^T (A · w_j) = (1/N) (λ_i w_i)^T (λ_j w_j) = 0,   (3.6)

where i ≠ j. The last equality comes from the orthogonality of eigenvectors in the PCA analysis. This justifies the decomposition of the joint feature space into multiple uncorrelated 1D subspaces as

R^{8D} → R^1 × ··· × R^1 (8D copies).   (3.7)
We should point out that, because of the special choice of the first filter weight w
1
, the
above analysis is only an approximation. In practice, we observe very weak correlation
between Saab coefficients (in the order of 10
− 4
) as compared to the diagonal term (i.e.
self-correlation).).
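This weak correlation can also be checked numerically. The sketch below builds a Saab-style filter bank (one DC vector plus the PCA eigenvectors of the DC-removed data) for random toy attributes and prints the off-diagonal correlations of the resulting coefficients. It is an illustrative verification under simplifying assumptions (random Gaussian attributes), not the PointHop++ implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
N, dim = 2000, 24                                            # e.g., 8D with D = 3
A = rng.normal(size=(N, dim)) @ rng.normal(size=(dim, dim))  # correlated toy attributes

# Saab-style filters: one DC filter plus PCA eigenvectors of the DC-removed attributes.
w_dc = np.ones((dim, 1)) / np.sqrt(dim)
A_ac = A - A.mean(axis=1, keepdims=True)                     # remove per-sample DC component
eigval, eigvec = np.linalg.eigh(np.cov(A_ac, rowvar=False))  # ascending eigenvalues
W = np.hstack([w_dc, eigvec[:, ::-1][:, :dim - 1]])          # (dim, dim) filter matrix

B = A @ W                                                    # Saab coefficients, (N, dim)
corr = np.corrcoef(B, rowvar=False)
ac = np.abs(corr[1:, 1:] - np.eye(dim - 1))                  # AC-vs-AC off-diagonal terms
print("max |corr| among AC channels:", ac.max())             # essentially zero, per Eq. (3.6)
print("max |corr| between DC and AC:", np.abs(corr[0, 1:]).max())  # nonzero in general, hence the approximation
```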
Channel Split Termination and Feature Priority Ordering
We compute the energy of each subspace as
$$E_i = E_p \times \frac{\lambda_i}{\sum_{j=1}^{8D} \lambda_j}, \quad (3.8)$$
where $i = 1, \cdots, 8D$ and $E_p$ is the energy of its parent node. If the energy of a node is less than a pre-set threshold, T, we terminate its further split and keep it as a leaf node of the feature tree at the current hop. Other nodes will proceed to the next hop. All leaf nodes are collected as the feature representation after the feature tree construction is completed.
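A compact way to see how Eq. (3.8) drives the tree construction is the recursive sketch below: each node carries an energy, each split distributes the parent energy in proportion to the eigenvalues, and a child whose energy falls below the threshold T becomes a leaf. It is a schematic of the bookkeeping only, with assumed toy eigenvalue spectra, not the full PointHop++ pipeline.

```python
def split_energies(parent_energy, eigvals):
    """Eq. (3.8): child energy = parent energy * lambda_i / sum_j lambda_j."""
    total = sum(eigvals)
    return [parent_energy * lv / total for lv in eigvals]

def build_tree(parent_energy, eigvals_per_hop, hop, T, leaves):
    """Recursively decide which 1D subspaces proceed to the next hop."""
    if hop == len(eigvals_per_hop):              # last hop: everything becomes a leaf
        leaves.append((hop, parent_energy))
        return
    for e in split_energies(parent_energy, eigvals_per_hop[hop]):
        if e < T:
            leaves.append((hop + 1, e))          # terminate: keep as a leaf feature
        else:
            build_tree(e, eigvals_per_hop, hop + 1, T, leaves)

# Toy eigenvalue spectra for 3 hops (assumed numbers, just to trace the logic).
eigvals_per_hop = [[8.0, 4.0, 2.0, 1.0], [5.0, 3.0, 1.0], [2.0, 1.0]]
leaves = []
build_tree(1.0, eigvals_per_hop, 0, T=0.02, leaves=leaves)
print(len(leaves), "leaf features; total energy =", sum(e for _, e in leaves))  # energies sum to 1
```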
To determine the threshold value T, the training and validation accuracy curves are plotted as a function of T in Fig. 3.11 (a). We see that the training accuracy keeps increasing as T decreases from 0.1 to 0.00001. Yet, the overall validation accuracy reaches its maximum value of 90.3% at T = 0.0001. After that, the validation accuracy decreases. Thus, we choose T = 0.0001.

Figure 3.11: Illustration of the impact of (a) values of the energy threshold (train/validation accuracy versus -log(T)) and (b) the number of cross-entropy-ranked (CE) or energy-ranked (E) features.

Once the feature tree is constructed, it is desired to order features based on their discriminant power and select them accordingly to avoid overfitting. A feature is more discriminant if its cross entropy is lower. The cross entropy can be computed for each feature at a leaf node. We follow the same process as described in [48]. That is, a clustering algorithm [80]
is adopted to partition the 1D subspace into J intervals. Then, the majority vote is used to predict the label for each interval. Based on the ground-truth labels, the probability that each sample belongs to a class can be obtained. Mathematically, we have
$$L = \sum_{j=1}^{J} L_j, \qquad L_j = -\sum_{c=1}^{M} y_{j,c} \log(p_{j,c}), \quad (3.9)$$
where M is the class number, $y_{j,c}$ is a binary indicator showing whether sample j is correctly classified, and $p_{j,c}$ is the probability that sample j belongs to class c.
We compare the training and validation accuracy curves using features that are ranked by the cross-entropy values and the energy values, respectively, in Fig. 3.11 (b), where the x-axis indicates the total number of top-ranked features. We see that overfitting is alleviated by both methods. The cross-entropy-ranked method performs better when the total feature number is smaller.
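The cross-entropy ranking of Eq. (3.9) can be sketched as follows: each 1D leaf feature is partitioned into J intervals with a clustering algorithm (here KMeans, in the spirit of [48]), the class distribution of each interval yields per-sample probabilities, and the summed cross entropy is used to rank features (lower is more discriminant). The toy data and the choice J = 32 are illustrative assumptions, not the exact settings of PointHop++.

```python
import numpy as np
from sklearn.cluster import KMeans

def feature_cross_entropy(feat_1d, labels, n_classes, J=32, eps=1e-12):
    """Cross entropy of one 1D feature (lower = more discriminant), Eq. (3.9)-style."""
    bins = KMeans(n_clusters=J, n_init=10, random_state=0).fit_predict(
        feat_1d.reshape(-1, 1))
    total = 0.0
    for j in range(J):
        y = labels[bins == j]
        if y.size == 0:
            continue
        p = np.bincount(y, minlength=n_classes) / y.size   # class distribution in interval j
        total += -np.log(p[y] + eps).sum()                 # per-sample -log p_{j,c} inside interval j
    return total / labels.size

# Toy example: feature 0 is informative about the labels, feature 1 is pure noise.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=500)
good = labels + 0.2 * rng.normal(size=500)
bad = rng.normal(size=500)
ce = [feature_cross_entropy(f, labels, 2) for f in (good, bad)]
print(ce)                         # the informative feature has the lower cross entropy
rank = np.argsort(ce)             # rank features from most to least discriminant
```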
3.3.2 Experimental Results
Experiments are conducted on the ModelNet40 dataset [84], which contains 40 object classes. 1,024 points are sampled randomly from the original point cloud set as the input to PointHop++. The depth of the feature tree is set to four hops. The farthest point sampling [23] is used to downsample points from one hop to the next to increase the receptive field and speed up the computation.
Classification Performance. The classification accuracy of different methods is compared in Table 3.6. PointHop++ (baseline), which uses an energy threshold of 0.0001 without feature selection or ensembles, gives 90.3% overall accuracy and 85.6% class-avg accuracy. By incorporating the feature selection tool discussed in Sec. 3.3.1, PointHop++ (FS) improves the overall and class-avg accuracy results by 0.5% and 0.9%, respectively. Furthermore, we rotate the point clouds by 45 degrees eight times and conduct LLSR to get a 40D feature each time. Then, these features are concatenated and fed into another LLSR. The ensemble method has an overall accuracy of 91.1% and a class-avg accuracy of 87%. The PointHop++ method achieves the best performance among unsupervised feature extraction methods. It outperforms PointHop [99] by 2% in overall accuracy. As compared with deep networks, PointHop++ outperforms PointNet [65] and PointNet++ [67]. It has a gap of 1.1% against PointCNN [55] and DGCNN [81].
Table 3.6: Comparison of classification results on ModelNet40, where the class-avg accuracy is the mean of the per-class accuracy, and FS and ES mean "feature selection" and "ensemble", respectively.

Method                    | class-avg (%) | overall (%)
Supervised
  PointNet [65]           | 86.2          | 89.2
  PointNet++ [67]         | -             | 90.7
  PointCNN [55]           | 88.1          | 92.2
  DGCNN [81]              | 90.2          | 92.2
Unsupervised
  LFD-GAN [2]             | -             | 85.7
  FoldingNet [88]         | -             | 88.4
  PointHop [99]           | 84.4          | 89.1
  PointHop++ (baseline)   | 85.6          | 90.3
  PointHop++ (FS)         | 86.5          | 90.8
  PointHop++ (FS+ES)      | 87.0          | 91.1
Comparison of Model and Time Complexities. A comparison of the time complexity and model sizes of different methods is given in Table 3.7. The four deep networks were trained on a single GeForce GTX TITAN X GPU. They took at least 7 hours to train a 1,024-point cloud model, while PointHop++ only took 25 minutes on an Intel(R) Xeon(R) CPU. As to the inference time per sample, both PointHop and PointHop++ took about 100 ms while DGCNN took 163 ms. The number of model parameters is also computed to show the space complexity. The Saab filter size of PointHop++ is 4X less than that of PointHop. The total number of model parameters of PointHop++ is 20X less than that of PointNet [65] and 10X less than those of DGCNN [81] and PointCNN [55].
Comparison of Robustness. We compare the robustness of different models against sampling density variation in Fig. 3.12. All models are trained on point clouds with 1,024 points. The test models are randomly downsampled to 768, 512, and 256 points, respectively. We see that PointHop++ is more robust than PointHop [99], PointNet++ (SSG) [67] and DGCNN [81] under mismatched sampling densities.

Table 3.7: Comparison of time and model complexity, where the training and inference time units are hours and ms, respectively.

Method            | Training (h) | Inference (ms) | Filter (M) | Classifier (M) | Total (M)
PointNet [65]     | 7            | 10             | -          | -              | 3.48
PointNet++ [67]   | 7            | 14             | -          | -              | 1.48
DGCNN [81]        | 21           | 154            | -          | -              | 1.84
PointHop [99]     | 0.33         | 108            | 0.037      | -              | -
PointHop++        | 0.42         | 97             | 0.009      | 0.15           | 0.159

Figure 3.12: Robustness against different sampling densities of the test model (classification accuracy versus the number of test points for PointHop, PointNet vanilla, PointNet++ (SSG), DGCNN, and PointHop++).
Other Visualization. Finally, we show the correlation matrix of the AC components at the first hop in Fig. 3.13. It verifies the claim that different AC components are uncorrelated. Furthermore, we visualize the feature distribution with the t-SNE plot, where the dimension is reduced to 2D. We visualize the features of the 10 object classes from ModelNet10 [84], which is a subset of ModelNet40 [84]. We see that most features of the same category are clustered together, which demonstrates the discriminant power of the features selected by PointHop++.
Figure 3.13: Visualization of (a) the correlation matrix of AC components at the first hop and (b) feature clustering of the ten ModelNet10 classes (bathtub, bed, chair, desk, dresser, monitor, night_stand, sofa, table, toilet) in the t-SNE plot.
3.3.3 Discussion
A tree-structured unsupervised feature learning system was proposed in this section, where
one scalar feature is associated with each leaf node and features are ordered based on their
discriminant power. The tree-structured feature construction process at each hop can be
summarized as follows.
• Use the knn algorithm to retrieve neighbor points;
• Use the decoupled attribute to perform the Saab transform;
• If the energy of a node is greater than a pre-set threshold, perform the c/w subspace
decomposition and obtain decoupled attributes as the input to the next hop.
The above process is repeated until the last hop is reached. Once the feature tree construction is completed, each leaf node contains a scalar feature. These features are ranked according to their energy and cross entropy. Finally, the LLSR is adopted as the classifier. The resulting PointHop++ method achieves state-of-the-art classification performance while demanding a significantly smaller learning model, which is ideal for mobile computing.
3.4 Conclusion
Explainable and green methods, PointHop and PointHop++, are proposed for point cloud classification. PointHop is an explainable machine learning method. It is mathematically transparent and its feature extraction is an unsupervised procedure since no class labels are needed in this stage. PointHop is also a green method: it has an extremely low training complexity because it requires only one forward pass to learn the parameters of the system. We further analyze the shortcomings of the PointHop method and propose a lightweight learning model for point cloud classification, called PointHop++, that targets these shortcomings. Its light weight comes in two flavors: 1) the model complexity is reduced in terms of the number of model parameters by constructing a tree-structured feature learning system, and 2) features are ordered and discriminant ones are selected automatically based on the cross-entropy criterion. With experiments conducted on the ModelNet40 benchmark dataset, we show that the PointHop and PointHop++ methods perform on par with deep learning solutions and surpass other unsupervised feature extraction methods. Moreover, the experimental results show that the training complexity of our methods is significantly lower than that of state-of-the-art deep-learning-based methods.
Chapter 4
Explainable and Green Point Cloud Segmentation
4.1 Introduction
Explainable and green machine learning methods work well for the basic point cloud classification task, which motivates us to generalize them to more complex tasks. Therefore, we extend the PointHop method to perform explainable and green point cloud segmentation. First, point cloud segmentation is more complex than point cloud classification: it needs both the global shape information and the fine-grained details of the point cloud so that it can perform point-wise classification. Second, point cloud segmentation requires more computational resources in order to do point-wise classification. Point cloud segmentation can be further divided into two branches, part segmentation and semantic segmentation. Part segmentation labels every point as one of the parts of an object, which is small-scale, while semantic segmentation labels every point as one of the semantic categories in a scene, which is usually large-scale.
The rest of this chapter is organized as follows. An unsupervised feedforward feature (UFF) learning scheme for joint classification and part segmentation of 3D point cloud objects is proposed in Sec. 4.2. UFF only focuses on small-scale point clouds. We further extend our green learning strategy to real large-scale point cloud segmentation, resulting in a method called GSIP, which is an efficient semantic segmentation method for large-scale indoor point clouds and is proposed in Sec. 4.3. Finally, concluding remarks and future research directions are given in Sec. 4.4.
4.2 UFF Method
By generalizing PointHop, we propose a new solution for joint point cloud classification and part segmentation here. Our main contribution is the development of an unsupervised feedforward feature (UFF) learning system with an encoder-decoder architecture. UFF exploits the statistical correlation between points in a point cloud set to learn shape and point features in a one-pass feedforward manner. It obtains the global shape features with an encoder and the local point features using the encoder-decoder cascade. The shape/point features are then fed into classifiers for shape classification and point classification (i.e., part segmentation).
4.2.1 Methodology
System Overview. An illustration of the UFF system is given in Fig. 4.1. It takes a point cloud as the input and generates its shape and point features as the output. The UFF system is composed of a fine-to-coarse (F2C) encoder and a coarse-to-fine (C2F) decoder which are in cascade. Such an architecture is frequently used in image segmentation. The encoder provides global shape features while the concatenated encoder-decoder generates local point features. The parameters of the encoder and the decoder are obtained in a feedforward one-pass manner. They are computed using the correlation of spatial coordinates of points in a point cloud set. Since it is a statistics-centric (rather than optimization-centric) approach, no label (or iterative optimization via backpropagation) is needed.
Figure 4.1: An overview of the proposed unsupervised feedforward feature (UFF) learning system, which consists of a fine-to-coarse encoder and a coarse-to-fine decoder (PointHop units with feature aggregation, graph coarsening, graph interpolation, skip connections, and classifiers fed by shape and point features).
Encoder Architecture. We design the UFF system with the joint classification and segmentation tasks presented in Sec. 4.2.2 in mind. The same design principle can be easily generalized to different contexts. The encoder has four layers, where each layer is a PointHop unit [99]. A PointHop unit is used to summarize the information of a center point and its neighbor points. The main operations include: 1) local neighborhood construction using the nearest-neighbor search algorithm, 2) 8-quadrant 3D spatial partitioning and averaging of point attributes from the previous layer in each quadrant for feature extraction, and 3) feedforward convolution using the Saab transform for dimension reduction. A point pooling operation is adopted between two consecutive PointHop units based on the farthest point sampling (FPS) principle. By applying the FPS iteratively, we can reduce the sampled points of a point cloud and enlarge the receptive field from one layer to the next. The PointHop units from the first to the fourth layers summarize the structures of 3D neighborhoods of short-, mid- and long-range distances, respectively. For more details, we refer to [99].
Roles of Encoder/Decoder. For point-wise segmentation, we need to find discriminant features for all points in the original input point cloud. The down-sampled resolution has to be interpolated back to a finer resolution layer by layer. The spatial coordinates and attributes of points at each layer are recorded by the encoder. The decoder is used to generate new attributes of points layer by layer in a backward fashion. It is important to emphasize the difference between the attributes of a point at the encoder and at the decoder. An attribute vector of a point at the encoder is constructed using a bottom-up approach. It does not have a global view in earlier layers. An attribute vector of a point at the decoder is constructed using a bottom-up approach followed by a top-down approach. It has the global information built in automatically. For convenience, we order the layers of the decoder backwards. The innermost layer is the 4th layer, followed by the 3rd, 2nd, and 1st. The outputs of the corresponding layers (with the same scale) between the encoder and the decoder are skip-connected as shown in Fig. 4.1.
Decoder Architecture. The decoder is used to obtain features of points at the (l-1)-th layer based on point features at the l-th layer. Its operations are similar to those of the encoder with minor modifications. For every point at the (l-1)-th layer, we perform the nearest neighbor search to find its neighbor points located at the l-th layer. Then, we conduct the 8-quadrant spatial partitioning and averaging of point attributes at the l-th layer in each quadrant for feature extraction. Finally, we perform the feedforward convolution using the Saab transform for dimension reduction after aggregating the features from all quadrants. It is worthwhile to emphasize the difference between our decoder and that of PointNet++ [67]. The latter calculates the weighted sum of the features of the neighbors according to their normalized spatial distances.
Feature Aggregation. Feature aggregation was introduced in [99] to reduce the dimension of a feature vector while preserving its representation power. For a D-dimensional vector a = (a_1, ..., a_D)^T, M aggregated values can be used to extract its key information, where M < D. Then, we can reduce the dimension of the vector from D to M. Four (M = 4) aggregation schemes were adopted in [99]. They include the mean, l_1-norm, l_2-norm and l_∞-norm (i.e., max-pooling) of the input vector. We will apply the same feature aggregation scheme here. Feature aggregation is denoted by A in Fig. 4.1.
Let N_l and D_l be the point number and the attribute dimension per point at the l-th layer, respectively. For the encoder, the raw feature map of a point cloud at the l-th layer is a 2D tensor of dimension N_l × D_l. Feature aggregation is conducted along the point dimension, and the aggregated feature map is a 2D tensor of dimension M × D_l, where M = 4 is the number of aggregation methods. For the decoder, feature aggregation is conducted on points of a finer scale. At the l-th layer, after interpolation and point pooling, the raw feature map at each point is a 2D tensor of dimension S × D_{l-1}, where S = 8 is the number of quadrants. Feature aggregation is conducted along the S dimension, and the aggregated feature map is a 2D tensor of dimension M × D_{l-1} with M = 4.
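The four aggregation schemes can be written in a few NumPy lines. The sketch below aggregates an N_l × D_l encoder feature map (or one point's S × D_{l-1} decoder map) along the stated dimension into an M × D tensor with M = 4; the array names and toy sizes are placeholders, and this is a simplified illustration rather than the released UFF code.

```python
import numpy as np

def aggregate(feat, axis=0):
    """Aggregate a feature map along `axis` with M = 4 schemes: mean, l1, l2, max."""
    return np.stack([
        feat.mean(axis=axis),                    # mean pooling
        np.abs(feat).sum(axis=axis),             # l1-norm pooling
        np.sqrt((feat ** 2).sum(axis=axis)),     # l2-norm pooling
        feat.max(axis=axis),                     # l-infinity (max) pooling
    ])

# Encoder side: (N_l, D_l) -> (4, D_l), aggregated over the N_l points.
encoder_feat = np.random.default_rng(0).normal(size=(256, 64))
print(aggregate(encoder_feat, axis=0).shape)     # (4, 64)

# Decoder side: one point's (S = 8 quadrants, D_{l-1}) raw map -> (4, D_{l-1}).
decoder_feat = np.random.default_rng(1).normal(size=(8, 32))
print(aggregate(decoder_feat, axis=0).shape)     # (4, 32)
```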
Integration with Classifiers. The application of learned shape and point features to point cloud classification and segmentation tasks is also illustrated in Fig. 4.1. For point cloud classification, the responses from all layers of the encoder are concatenated and aggregated as shape features. They are then fed into a classifier. No decoder is needed for the classification task. For part segmentation, attributes of a point in the output layer of the decoder are concatenated to get point features. We use the predicted object label to guide the part segmentation task. That is, for each object class, we train a separate classifier for part segmentation. Although feature learning is unsupervised, class and segmentation labels are needed to train the classifiers in final decision making.
4.2.2 Experimental Results
Experiments are conducted to demonstrate the power of UFF features in this section.
Model Pre-training. We obtain UFF model parameters (i.e., the feedforward convolu-
tional filter weights of the Saab transform) from the ShapeNet dataset [11]. It contains
55 categories of man-made objects (e.g., airplane, bag, car, chair, etc.) and 57,448 CAD
models in total. Each CAD model is sampled to 2048 points initially, where each point has
three Cartesian coordinates. No data augmentation is used. To show generalizability of the
UFF method, we apply the learned UFF model to other datasets without changing the filter
weights. This is called the pre-trained model in the following.
Shape Classification. We obtain the shape features from the ModelNet40 dataset [84] using the pre-trained model. All models are initially sampled to 2048 points. We train a random forest (RF) classifier, a linear SVM and a linear least squares regressor (only the best one is reported) on the learned features and report the classification accuracy on the ModelNet40 dataset in Table 4.1. We compare the performance of our model with respect to unsupervised and supervised methods. As shown in the table, our UFF model achieves 90.4% overall accuracy, which surpasses existing unsupervised methods. It is also competitive with state-of-the-art supervised models.
Table 4.1: Comparison of classification results on ModelNet40.

Method                  | OA (%)
Supervised (a)
  PointNet [65]         | 89.2
  PointNet++ [67]       | 90.2
  PointCNN [55]         | 92.2
  DGCNN [81]            | 92.2
Unsupervised (a)
  PointHop [99]         | 89.1
  PointHop++ [98]       | 91.1
Unsupervised (b)
  FoldingNet [88]       | 88.9
  PointCapsNet [102]    | 88.9
  MultiTask [31]        | 89.1
  Ours                  | 90.4

(a) Learning on the ModelNet40 data.
(b) Transfer learning from the ShapeNet on the ModelNet40 data.
Part Segmentation. Part segmentation is typically formulated as a point-wise classifica-
tion task. We need to predict a part category for each point. We conduct experiments on
the ShapeNetPart dataset [92], which is a subset of the ShapeNet core dataset, to evaluate
the learned point features. The ShapeNetPart dataset contains 16,881 shapes from 16 ob-
ject categories. Each object category is annotated with two to six parts, and there are 50
parts in total. Each point cloud is sampled from CAD object models and has 2048 points.
The dataset is split into three parts: 12,137 shapes for training, 1,870 shapes for validation
and 2,874 shapes for testing. We follow the evaluation metric in [65], which is the mean
Intersection-over-Union (mIoU) between point-wise ground truth and prediction. We first
compute a shape's IoU by averaging the IoUs of all parts in the shape and, then, obtain the mIoU for a category by averaging over all shapes in the same category. Finally, the instance mIoU (Ins. mIoU) is computed by averaging over all shapes while the category mIoU (Cat. mIoU) is computed by averaging over the mIoUs of all categories.
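For concreteness, the following is a small sketch of the evaluation protocol described above: per-part IoUs are averaged inside each shape, the instance mIoU averages over all shapes, and the category mIoU averages the per-category means. The toy arrays are assumptions, and the sketch follows the common convention of counting a part as IoU = 1 when it is absent from both prediction and ground truth.

```python
import numpy as np

def shape_iou(pred, gt, part_ids):
    """Average IoU over the parts of one shape (pred, gt: per-point part labels)."""
    ious = []
    for p in part_ids:
        inter = np.sum((pred == p) & (gt == p))
        union = np.sum((pred == p) | (gt == p))
        ious.append(1.0 if union == 0 else inter / union)   # absent part counts as 1
    return float(np.mean(ious))

def mean_ious(shapes):
    """shapes: list of (pred, gt, part_ids, category). Returns (Ins. mIoU, Cat. mIoU)."""
    per_shape, per_cat = [], {}
    for pred, gt, part_ids, cat in shapes:
        iou = shape_iou(pred, gt, part_ids)
        per_shape.append(iou)
        per_cat.setdefault(cat, []).append(iou)
    ins_miou = float(np.mean(per_shape))
    cat_miou = float(np.mean([np.mean(v) for v in per_cat.values()]))
    return ins_miou, cat_miou

# Toy usage: two shapes of one category with parts {0, 1} and 2048 points each.
rng = np.random.default_rng(0)
gt = rng.integers(0, 2, size=2048)
pred = np.where(rng.random(2048) < 0.9, gt, 1 - gt)         # 90% correct prediction
print(mean_ious([(pred, gt, [0, 1], "airplane"), (gt, gt, [0, 1], "airplane")]))
```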
By following [102], we randomly sample 1% and 5% of the training data and extract point features from the sampled training and testing data with the pre-trained model to examine both the capability of learning from limited data and the generalization ability on the segmentation task. Here we use the predicted labels of the test objects to guide the part segmentation task. Specifically, we first classify the object labels of the test data using the shape features. Then, we train 16 different random forests on the extracted point features of the sampled training data, one for each object class, and evaluate each of them on the point features of the testing data with the corresponding predicted label. We show the part segmentation results in Table 4.2 and compare them with three state-of-the-art semi-supervised works. It shows that our method has a better performance.
Table 4.2: Performance comparison on the ShapeNetPart segmentation task with semi-supervised DNNs.

Method (*)           | 1% train data (OA / mIoU, %) | 5% train data (OA / mIoU, %)
SO-Net [54]          | 78.0 / 64.0                  | 84.0 / 69.0
PointCapsNet [102]   | 85.0 / 67.0                  | 86.0 / 70.0
MultiTask [31]       | 88.6 / 68.2                  | 93.7 / 77.7
Ours                 | 88.7 / 68.5                  | 94.5 / 78.3

(*) Transfer learning from the ShapeNet on the ShapeNetPart data.
We further conduct an ablation study to validate the object-wise segmentation method in Table 4.3. We follow the above setting and randomly sample 5% of the ShapeNetPart data. It shows that using the predicted test label to guide the part segmentation increases the Ins. mIoU by 3.4%. Besides, training multiple classifiers reduces the computational complexity.
Table 4.3: Ablation study of the object-wise segmentation.
Object-wise Object label Cat. mIoU (%) Ins. mIoU (%)
No - 71.5 74.9
Yes Predicted 76.2 78.3
Yes Ground truth 78.1 81.5
We compare the UFF method with other state-of-the-art supervised methods in Table 4.4. Here, we train the model with 5% of the ShapeNetPart data (rather than using the pre-trained one). We see a performance gap between our model and DNN-based models. As compared with Table 4.2, pre-training boosts the performance. This is consistent with the classification results in Table 4.1.
Table 4.4: Performance comparison on the ShapeNetPart segmentation task with supervised DNNs.

Method (*)        | % train data | Cat. mIoU (%) | Ins. mIoU (%)
PointNet [65]     | 100%         | 80.4          | 83.7
PointNet++ [67]   | 100%         | 81.9          | 85.1
DGCNN [81]        | 100%         | 82.3          | 85.1
PointCNN [55]     | 100%         | 84.6          | 86.14
UFF               | 5%           | 73.9          | 76.9

(*) Learning on the ShapeNetPart data.
[Per-category part IoU (%) over the 16 ShapeNetPart classes (aero, bag, cap, car, chair, earphone, guitar, knife, lamp, laptop, motor, mug, pistol, rocket, skateboard, table) for the compared methods; the row labels of this table were lost in extraction.]
Some part segmentation results of PointNet, UFF and the ground truth are visualized
in Fig. 4.2. In general, visualization results are satisfactory although our model may fail to
classify fine-grained details in some hard examples.
Figure 4.2: Visualization of part segmentation results (from left to right): PointNet, UFF, the ground truth.
4.2.3 Discussion
A UFF method, which exploits statistical correlations of points to learn shape and point features of a point cloud in a one-pass feedforward manner, was proposed in this section. It learns global shape features through an encoder and local point features through the cascade of an encoder and a decoder. Experiments were conducted on joint point cloud classification and part segmentation to demonstrate the power of the features learned by UFF.
4.3 GSIP Method
As compared with the point cloud classification problem, which often targets small-scale objects, a high-performance point cloud semantic segmentation method demands a good understanding of the complex global structure as well as the local neighborhood of each point. Meanwhile, efficiency, measured by computational complexity and memory complexity, is important for practical real-time systems.
GSIP has two novel components: 1) a room-style data pre-processing method and 2) an enhanced PointHop feature extractor. For the former, we compare existing data pre-processing techniques and identify the most suitable data preparation method. In deep-learning methods, which are implemented on GPUs, data pre-processing is used to ensure that input units contain the same number of points so that the throughput is maximized by exploiting the GPU's built-in parallel processing capability. For the S3DIS dataset, a unit has 4,096 points in PointNet, which adopts the block-style pre-processing, while a unit has 40,960 points in RandLA-Net, which adopts the view-style pre-processing. Since feature extraction for green point cloud learning runs on a CPU, we are able to relax this constraint so that each input unit can have a different number of points. We propose a new room-style pre-processing method for GSIP and show its advantages. For the latter, we point out some weaknesses of PointHop's feature extraction when extending it to large-scale point clouds and fix them with a simpler processing pipeline.
4.3.1 Methodology
An overview of the proposed GSIP method is given in Fig. 4.3. A raw point cloud set is first
voxel-downsampled, where each downsampled point has 9D attributes (i.e., XYZ, RGB, and
normalized XYZ). Each sampled point is augmented with 12D additional attributes. They
include: surface normals (3D), geometric features (6D, planarity, linearity, surface variation,
etc.) [30] and HSV color space (3D, converted from RGB). Surface normals and geomet-
ric features are commonly used in traditional point cloud processing. As compared with
the XYZ coordinates, they describe local geometric patterns and can be easily computed.
Here, we include them as additional features to the XYZ coordinates. For points that have
the same geometric pattern but belong to different objects (e.g., wall and blackboard), the
RGB plus HSV color spaces work better than RGB alone. Consequently, each point has
21D input features. The point-wise features are not powerful enough since they are local-
ized in the space. To obtain more powerful features, we need to group points based on
different neighborhood sizes (e.g., points in a small neighborhood, a mid-size neighborhood
and a large neighborhood). By borrowing the terminology from graph theory, we call these
neighborhoods hops, where hop 1 denotes the smallest neighborhood size. To carry out the
semantic segmentation task, we need to extract features at various hops, which is achieved
by an unsupervised feature extractor. The feature extractor adopts the encoder-decoder
architecture. It has four encoding hops followed by four decoding hops. Finally, a classifier
is trained and used to classify each point into a semantic category based on its associated
hop features.
Figure 4.3: An overview of the proposed GSIP method, where the upper left block shows the data pre-processing (voxel downsampling of the raw point cloud with coordinate, surface normal, color and geometric attributes), the upper right block shows the local attribute construction process (random sampling, K-NN grouping, relative positional features and max pooling), and the lower block shows the encoder-decoder architecture (four random-sampling hops followed by four upsampling stages) for large-scale point cloud semantic segmentation.
Data Pre-processing
Data pre-processing is used to prepare the input data in a proper form so that it can be fed into the learning pipeline for effective processing. Although the generic principle holds for any large-scale point cloud, the implementation detail depends on the application context. Data pre-processing techniques targeting large-scale indoor scene point clouds are discussed here. Specifically, we use the S3DIS dataset [5], which is a subset of the Stanford 2D-3D-Semantics dataset [4], as an example. S3DIS is one of the benchmark datasets for point cloud semantic segmentation. It contains point clouds scanned from 6 indoor areas with 271 rooms. There are 13 object categories in total, including ceiling, floor, wall, door, etc. Each point has 9 dimensions, i.e., XYZ, RGB and normalized XYZ. Data pre-processing techniques can be categorized into the following three styles.
Block Style. Block partitioning was proposed by PointNet and adopted by its follow-ups. The 2D horizontal plane of a room is split into 1 × 1 meter blocks while its 1D vertical direction is kept as it is to form an input unit, as shown in Fig. 4.4(a). Each unit contains 4,096 points, which are randomly sampled from its raw block. In the inference stage, the 4,096 points of each unit are classified and, then, their predicted labels are propagated to their neighbors. Typically, the k-fold strategy is adopted for training and testing. For example, if area 6 is selected as the test area, the remaining 5 areas are used for training. Under this setting, the block-style data pre-processing has 20,291 training units and 3,294 testing units, where each unit has 4,096 points.
View Style. View-style data pre-processing was adopted by RandLA-Net but not explained in the paper. We obtained its details from the code and describe its process below. It first partitions the 3D space of a room into voxel grids and randomly selects one point per voxel. For instance, the first conference room of area 1 has about one million points. One can obtain 70k points by setting the voxel grid size to 0.04; around 7.7% of the points are kept in this example. This is a commonly used point downsampling procedure. Then, it iteratively selects a fixed number of points to generate input units that facilitate GPU parallel processing. For initialization, it randomly selects a room and a point in the room as the reference point. Then, it finds the 40,960 nearest points around the reference point with the K-NN algorithm to form the first unit. This process is repeated to get a sequence of input units until it reaches the target unit number. For the S3DIS dataset, the target training and test unit numbers for each fold are set to 3K and 2K, respectively. To reduce the overlap between different units, it assigns a probability to every point randomly at the beginning and updates the probabilities of the selected points in each round to be inversely proportional to their distances to the reference point. Thus, unselected points are more likely to be chosen as the next reference point. Four examples are shown in Fig. 4.4(b) to visualize points inside the same unit. We see that they offer certain views of a room.
There are, however, several problems with the above two data pre-processing methods. First, they do not have a global view of the whole room, resulting in inaccurate nearest neighbor search at unit boundaries. Second, the view-style method generates 3K units for 200 rooms (i.e., 15 units per room on average) in training. There are redundant points in the intersection of views of the same room. The total number of training points increases from 80M (≈ 20,291 × 4,096) for the block style to 120M (≈ 3,000 × 40,960) for the view style. We may ask whether it is essential to have so many training points.
Room Style. It is desired to increase the input scale and reduce the total number of training points while keeping the same segmentation performance. To achieve this goal, we propose a room-style pre-processing method and adopt a flexible feature extraction pipeline, which will be discussed in Sec. 4.3.1. The most distinctive aspect of the room style is that each unit can have a different number of points. Thus, we include all points in a room in one unit and call it room-style pre-processing. This is a possible solution since our point cloud feature extractor can be implemented on the CPU effectively. By relaxing the constraint that all units have the same number of points, which is demanded by the GPU implementation, the data pre-processing problem can be greatly simplified. By following the first step of RandLA-Net's pre-processing, we downsample raw points by the voxel downsampling method with a fixed grid size. Afterwards, the number of points for each room ranges from 10K to 200K. Taking rooms in areas 1-5 for training and rooms in area 6 for testing as an example, we have 224 rooms (or 224 units) for training and 48 rooms (or 48 units) for testing. The training and testing sets contain 15M and 2M points, respectively. The total number of training points is much smaller than those of the block-style and the view-style methods while the input scale is much larger.
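A minimal voxel-grid downsampling step, in the spirit of the room-style pre-processing (one point kept per occupied voxel of a fixed grid size), can be sketched as follows. The sketch keeps the first point that falls into each voxel, whereas other implementations may average or randomly pick the point per voxel; all names here are illustrative assumptions.

```python
import numpy as np

def voxel_downsample(points, grid_size=0.04):
    """Keep one representative point per occupied voxel of side length `grid_size`.

    points: (N, C) array whose first three columns are XYZ; extra columns (e.g., RGB) ride along.
    """
    voxel_idx = np.floor(points[:, :3] / grid_size).astype(np.int64)
    # Map each unique voxel to the index of the first point seen in that voxel.
    _, keep = np.unique(voxel_idx, axis=0, return_index=True)
    return points[np.sort(keep)]

# Toy room: 100k points with XYZ + RGB; each downsampled room becomes one input unit.
rng = np.random.default_rng(0)
room = np.hstack([rng.uniform(0, 5, size=(100_000, 3)),      # coordinates in meters
                  rng.uniform(0, 1, size=(100_000, 3))])     # colors
unit = voxel_downsample(room, grid_size=0.04)
print(room.shape, "->", unit.shape)   # each room keeps a different number of points
```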
Feature Extractor and Classifier
Points in a room-style input unit are fed into a feature extractor to obtain point-wise features as shown in Fig. 4.3. The upper right block shows the local attribute construction process and the lower block shows the encoder-decoder architecture for large-scale point cloud semantic segmentation. It is developed upon our previous work, PointHop. The reason for developing a new feature learner is a set of shortcomings that arise when extending PointHop from small-scale to large-scale point cloud learning. They are detailed below.
PointHop and PointHop++ Feature Extractor. PointHop [99] and PointHop++ [98] are unsupervised feature extractors proposed for small-scale point cloud classification. They have been successfully applied to joint point cloud classification and part segmentation [96] and point cloud registration [35, 37]. PointHop constructs attributes from local to global by stacking several hops, covering small-, mid- and long-range neighborhoods. In each hop, the local neighborhood of each point is divided into eight octants using the 3D coordinates of the local points. Then, point attributes in each octant are aggregated and concatenated to form a local descriptor, which keeps more information than naive max pooling. Due to the fast dimension growth, the Saab transform [48] is used for dimension reduction. Between two consecutive units, FPS is used to downsample the point cloud to increase the speed as well as the receptive field of each point. PointHop++ [98] is an extension of PointHop. It has a lower model size and training complexity by leveraging the observation that spectral features are uncorrelated after PCA. Thus, we can conduct PCA on each spectral feature independently, which is called the channel-wise Saab transform.

Figure 4.4: Comparison of three data pre-processing methods: (a) block style, (b) view style (four views of a room), and (c) room style (a conference room with 77k points and an office with 60k points).
There are three shortcomings for the pipeline used in PointHop and PointHop++ when
it is ported to large-scale point cloud data. First, the computational efficiency is limited by
FPS between two consecutive hops. Second, the memory cost of eight-octant partitioning
and grouping is high. Third, the covariance matrix computation in the Saab transform is
computationally intensive with a higher memory cost. To address them, we propose several
changes.
Proposed Feature Extractor. As shown in Fig. 4.3 (enclosed by the orange block), the new processing module contains random sampling (RS), K-NN, relative point positional encoding [32], max pooling, and point feature standardization. First, we use RS rather than FPS between hops for faster computation on large-scale point clouds. Second, to reduce the memory cost of the eight-octant partitioning and grouping, we adopt max pooling. Since the feature dimension remains the same with max pooling, no dimension reduction via the Saab transform is needed, which further helps save memory and time. However, max pooling may drop important information occasionally. To address this, we add relative point positional encoding before max pooling to ensure that point features are aware of their relative spatial positions. Specifically, for point $p_i$ and its K neighbors $\{p_i^1, \cdots, p_i^k, \cdots, p_i^K\}$, the relative point position $r_i^k$ of each neighbor $p_i^k$ is encoded as
$$r_i^k = p_i \oplus p_i^k \oplus (p_i - p_i^k) \oplus \|p_i - p_i^k\|, \quad (4.1)$$
where $\oplus$ denotes concatenation and $\|\cdot\|$ is the Euclidean distance. We will show that the new pipeline is much more economical than PointHop, PointHop++ and deep-learning methods in terms of memory consumption and computational complexity in Sec. 4.3.2.
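Equation (4.1) followed by max pooling can be sketched with scikit-learn's K-NN search and NumPy as below. The helper name, the value of K, and the toy dimensions are assumptions, and the point feature standardization step is omitted for brevity.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_aggregation(xyz, feat, k=16):
    """Relative positional encoding of Eq. (4.1) + max pooling over each point's K neighbors.

    xyz:  (N, 3) point coordinates;  feat: (N, D) point features.
    Returns (N, D + 10) features: max-pooled [neighbor feature || r_i^k].
    """
    nbr_idx = NearestNeighbors(n_neighbors=k).fit(xyz).kneighbors(xyz, return_distance=False)
    p_i = xyz[:, None, :].repeat(k, axis=1)                   # (N, K, 3) center coordinates
    p_k = xyz[nbr_idx]                                        # (N, K, 3) neighbor coordinates
    diff = p_i - p_k
    dist = np.linalg.norm(diff, axis=-1, keepdims=True)
    r = np.concatenate([p_i, p_k, diff, dist], axis=-1)       # Eq. (4.1): (N, K, 10)
    grouped = np.concatenate([feat[nbr_idx], r], axis=-1)     # (N, K, D + 10)
    return grouped.max(axis=1)                                # max pooling, no Saab transform

# Toy usage with 21D input features, matching the (N, 21) -> (N, 31) hop-1 block of Fig. 4.3.
rng = np.random.default_rng(0)
xyz, feat = rng.normal(size=(1000, 3)), rng.normal(size=(1000, 21))
print(local_aggregation(xyz, feat).shape)                     # (1000, 31)
```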
Classifier. The S3DIS dataset is highly imbalanced in point labels. Among the 13 object categories, the ceiling, floor and wall classes account for around 75% of the data. We adopt the XGBoost classifier [15] to help reduce the influence of the imbalanced data. Other classifiers such as linear least squares regression, random forest and the support vector machine (SVM) demand higher computational and memory costs for the large-scale point cloud segmentation problem. Overall, XGBoost can handle the large-scale point cloud data with good performance, fast speed and a lower memory requirement.
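Training the point-wise classifier then reduces to fitting XGBoost on the concatenated hop features. A hedged sketch using the public xgboost scikit-learn API is given below; the tree count and depth follow the parameter analysis later in this section (128 trees of maximum depth 6), while the feature dimension (the 205-D decoder output of Fig. 4.3) and the toy data are assumptions.

```python
import numpy as np
from xgboost import XGBClassifier

# Assumed point-wise features and the 13 S3DIS semantic labels (toy random data).
rng = np.random.default_rng(0)
train_feat = rng.normal(size=(20_000, 205)).astype(np.float32)
train_label = rng.integers(0, 13, size=20_000)

clf = XGBClassifier(n_estimators=128, max_depth=6,   # 128 trees, depth 6, as in Sec. 4.3.2
                    tree_method="hist")              # fast histogram-based training
clf.fit(train_feat, train_label)
pred = clf.predict(rng.normal(size=(1_000, 205)).astype(np.float32))
print(pred.shape)                                    # one semantic label per test point
```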
4.3.2 Experimental Results
Experimental Setup. We adopt the following setting to evaluate the semantic segmentation performance on the S3DIS dataset. The grid size is 0.04 in voxel-based downsampling. The feature extractor has 4 hops. We set K = 64 in the K-NN search. Some methods choose a smaller K (e.g., 16, 20 or 32) to save computational complexity, but our method can afford a larger K. Because the input is a large-scale indoor scene, a larger K leads to a larger receptive field, which helps learn the structure of the scene. Thus, we choose 64 here. The subsampling ratios between two consecutive hops are 0.25, 0.25, 0.5 and 0.5, respectively. The features of the three nearest neighbors are used for interpolation in the upsampling module. For example, to upsample from N/64 points to N/32 points (see Fig. 4.3), we first find the three nearest points in the N/64 point set for each point in the N/32 point set. Then, we average the features of the three points. In this way, we get the point features for the N/32 points. The output features are truncated to 32 bits before being fed into the XGBoost classifier. The standard 6-fold cross validation is used in the experiment, where one of the six areas is selected as the test area in each fold. By following [65], we consider two evaluation metrics: the mean Intersection-over-Union (mIoU) and the Overall Accuracy (OA) over the total 13 classes.
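The upsampling step described above (averaging the features of the three nearest coarse points for each fine point) is a short routine; a possible NumPy/scikit-learn sketch is given below, with all names and toy sizes as illustrative assumptions. Note that some pipelines use inverse-distance weights instead; the plain average is what the text describes.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def upsample_features(coarse_xyz, coarse_feat, fine_xyz, k=3):
    """For each fine point, average the features of its k nearest coarse points."""
    idx = NearestNeighbors(n_neighbors=k).fit(coarse_xyz).kneighbors(
        fine_xyz, return_distance=False)              # (N_fine, k) indices into the coarse set
    return coarse_feat[idx].mean(axis=1)              # (N_fine, D)

# Toy usage mirroring Fig. 4.3: upsample from N/64 points back to N/32 points.
rng = np.random.default_rng(0)
coarse_xyz, coarse_feat = rng.normal(size=(512, 3)), rng.normal(size=(512, 61))
fine_xyz = rng.normal(size=(1024, 3))
print(upsample_features(coarse_xyz, coarse_feat, fine_xyz).shape)   # (1024, 61)
```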
Comparison of Data Pre-processing Methods. The statistics of the S3DIS dataset under the three data pre-processing methods are compared in Table 4.5. The proposed room-style method has more points in each input unit, i.e., a larger input scale, ranging from 10K to 200K. We also compare the data size when the training data are collected from areas 1-5 and the test data are collected from area 6. The total number of training points of the room-style method is 18.75% of that of the block-style method and 12.5% of that of the view-style method. Points inside each unit of the three methods are visualized in Fig. 4.4, which includes a conference room and an office from area 1. The block-style method loses the structure of a complete room. As to the view-style method, the four views of a conference room overlap a lot with each other, producing significant redundancy. The units obtained by the room-style method look more natural. They offer a view of the whole room while keeping a small data size.
Table 4.5: Comparison of data statistics of three pre-processing methods.

Method                          | Block | View | Room
Input Scale (×10^3)             | 4     | 40   | 10-200
Total Data Size (×10^6), Train  | 80    | 120  | 15
Total Data Size (×10^6), Test   | 10    | 80   | 2
Semantic Segmentation Performance. We compare the semantic segmentation performance of PointNet and the proposed GSIP on the S3DIS dataset in Table 4.6, where the better results are shown in bold. It is common to use area 5 for testing. Thus, we show the performance for area 5 as well as for the 6-fold setting. As shown in the table, GSIP outperforms PointNet in mIoU by 2.7% and 0.9% for area 5 and the 6-fold setting, respectively. The cross validation results of all 6 areas of GSIP are shown in Table 4.7. We see that area 6 is the easiest one (64.5% mIoU and 86.5% OA) while area 2 is the hardest one (31% and 68.8%). The mIoU and OA over 13 classes averaged over the 6 folds are 48.5% and 79.8%, respectively. Visualization of GSIP's segmentation results and the ground truth of two rooms in area 6 is given in Fig. 4.5. The ceiling is removed to show the inner objects clearly.
It is worthwhile to point out that some categories have extremely low mIoU (close to 0%) in more than two areas. For example, sofa obtained 1.7%, 6.4%, 4.4%, and 3.4% mIoU in areas 2, 3, 4 and 5. This is attributed to data imbalance. The data imbalance problem is commonly seen in large-scale segmentation tasks. For example, in a regular room, it is highly likely that more points come from the wall, ceiling and floor, while fewer points come from chairs, desks, and other small objects. The beam, column, sofa, and board are even less common. Without seeing enough samples during training, a decrease in prediction performance is expected.
Table 4.6: Comparison of semantic segmentation performance (%) for S3DIS.

Area 5:
Method   | OA   | mIoU | ceiling | floor | wall | beam | column | window | door | table | chair | sofa | bookcase | board | clutter
PointNet | -    | 41.1 | 88.0    | 97.3  | 69.8 | 0.05 | 3.9    | 42.3   | 10.8 | 58.9  | 52.6  | 5.9  | 40.3     | 26.4  | 33.2
GSIP     | 79.9 | 43.8 | 89.2    | 97.0  | 72.2 | 0.1  | 18.4   | 37.3   | 22.5 | 64.3  | 59.5  | 3.4  | 47.2     | 22.9  | 35.7

6-fold:
Method   | OA   | mIoU | ceiling | floor | wall | beam | column | window | door | table | chair | sofa | bookcase | board | clutter
PointNet | 78.5 | 47.6 | 88.0    | 88.7  | 69.3 | 42.4 | 23.1   | 47.5   | 51.6 | 54.1  | 42.0  | 9.6  | 38.2     | 29.4  | 35.2
GSIP     | 79.8 | 48.5 | 91.8    | 89.8  | 73.0 | 26.3 | 24.0   | 44.6   | 55.8 | 55.5  | 51.1  | 10.2 | 43.8     | 21.8  | 43.2
Comparison of Model and Time Complexities. We compare the model size and training time complexity of GSIP and PointNet in Table 4.8.

Table 4.7: Class-wise semantic segmentation performance (%) of GSIP for S3DIS.

Class    | Area 1 | Area 2 | Area 3 | Area 4 | Area 5 | Area 6 | mean
ceiling  | 91.9   | 88.9   | 95.2   | 89.9   | 89.2   | 95.6   | 91.8
floor    | 93.7   | 58.5   | 97.7   | 95.2   | 97.0   | 96.8   | 89.8
wall     | 71.5   | 70.9   | 73.3   | 72.5   | 72.2   | 77.5   | 73.0
beam     | 21.7   | 5.9    | 64.9   | 0.2    | 0.1    | 65.2   | 26.3
column   | 38.7   | 12.8   | 20.9   | 15.9   | 18.4   | 37.0   | 24.0
window   | 77.5   | 18.3   | 39.7   | 15.1   | 37.3   | 79.5   | 44.6
door     | 71.5   | 46.0   | 69.0   | 51.7   | 22.5   | 73.8   | 55.8
table    | 61.7   | 23.3   | 62.5   | 49.8   | 64.3   | 71.3   | 55.5
chair    | 49.4   | 21.8   | 60.4   | 49.3   | 59.5   | 66.3   | 51.1
sofa     | 20.4   | 1.7    | 6.4    | 4.4    | 3.4    | 24.7   | 10.2
bookcase | 41.2   | 22.8   | 58.1   | 34.4   | 47.2   | 58.9   | 43.8
board    | 29.4   | 5.0    | 22.8   | 14.2   | 22.9   | 36.7   | 21.8
clutter  | 46.6   | 27.6   | 51.3   | 42.4   | 35.7   | 55.7   | 43.2
mIoU     | 55.0   | 31.0   | 55.6   | 41.2   | 43.8   | 64.5   | 48.5
OA       | 81.1   | 68.8   | 83.9   | 78.8   | 79.9   | 86.5   | 79.8
PointNet takes 22 hours to train on a single GeForce GTX TITAN X GPU. GSIP takes around 40 minutes for feature extraction with an Intel(R) Xeon(R) CPU and 20 minutes to train the XGBoost classifier with 4 GeForce GTX TITAN X GPUs. The total training time is around 1 hour for each fold, which is much faster than PointNet. If we use a single GPU for XGBoost training, the overall training time is still significantly less than that of PointNet. To justify our claim, we analyze the time complexity of the algorithm theoretically in terms of big O. The proposed method is composed of data pre-processing, feature extraction and classification. For each sample with N points,

1. data pre-processing: voxel downsampling takes O(N log N); geometric feature calculation takes O(NK^2), where K is the number of nearest points;

2. feature extraction: random sampling takes O(1); the K-NN search takes O(KND), where D is the feature dimension; relative point position encoding takes O(KND); max pooling takes O(N). To sum up, the feature extractor costs O(KND);

3. XGBoost classifier: the complexity of the XGBoost classifier can be found in its original paper [15] and is not our focus.

For the algorithms involved, brute-force time complexities are used in the calculation. There are better implementations that can optimize the complexity, which are not considered here.

Figure 4.5: Qualitative results of the proposed GSIP method (from left to right: input point cloud, ground truth, our results).
As to the model size, GSIP has 24K parameters while PointNet has 900K parameters. PointNet is an end-to-end supervised method, which uses fully connected layers at the end of its pipeline to predict point labels. We do not break its pipeline into a feature extractor and a classifier for comparison. The parameters of GSIP mainly come from the XGBoost classifier, which has 128 trees with a maximum tree depth of 6. For each tree, there are 2 parameters per intermediate node and 1 parameter per leaf node. The feature extractor only has 2 parameters per hop, the mean and the standard deviation used in point feature standardization.
Table 4.8: Comparison of time and model complexities.

Method          | Device | Training time | Parameter No.
PointNet        | GPU    | 22 hours      | 900,169
GSIP (Feature)  | CPU    | 40 minutes    | 8
GSIP (XGBoost)  | GPU    | 20 minutes    | 24,320
Other Comparisons. As discussed in Sec. 4.3.1, the new feature extractor is more efficient than the PointHop feature extractor in terms of computational and memory costs. It is worthwhile to compare the segmentation performance of the two to see whether there is any performance degradation. We compare the effectiveness of the new feature extractor and the PointHop feature extractor under the same GSIP framework for the S3DIS dataset in Table 4.9. Their performance is comparable, as shown in the table. Actually, the new one achieves slightly better performance. Furthermore, we compare the quantization effect on the extracted features before feeding them into the XGBoost classifier in Table 4.10. We see little performance degradation when features are quantized to 16 or 32 bits. Thus, we can save computation and memory for classifier training and testing by using 16-bit features.
Table 4.9: Performance comparison of two feature extractors (%).
Method mIoU OA
GSIP 48.5 79.8
PointHop 47.9 79.1
Table 4.10: Impact of quantized point-wise features (%).
Quantization mIoU OA
32-bit 64.5 86.5
16-bit 64.9 86.5
4.3.3 Discussion
A green semantic segmentation method for large-scale indoor point clouds, called GSIP, was proposed in this section. It contains two novel ingredients: a new room-style method for data pre-processing and a new point cloud feature extractor, which is extended from PointHop with lower memory and computational costs while preserving the segmentation performance. We evaluated the performance of GSIP against PointNet on the indoor S3DIS dataset and showed that GSIP outperforms PointNet in terms of accuracy, model size and computational complexity. As to future extensions, it is desired to generalize the GSIP method from large-scale indoor point clouds to large-scale outdoor point clouds. The latter have many real-world applications. Furthermore, the data imbalance problem should be carefully examined so as to boost the segmentation and/or classification performance.
4.4 Conclusion
Explainable and green methods, UFF and GSIP, are proposed for point cloud segmentation. UFF is an unsupervised feedforward feature learning scheme for joint classification and part segmentation of 3D point cloud objects. UFF is green since it can perform multiple tasks in one pass and has good generalization ability. It learns global shape features through the encoder and local point features through the concatenated encoder-decoder architecture. The above methods only focus on small-scale point clouds, so we further extend our green learning strategy to real large-scale point cloud segmentation, resulting in a method called GSIP. GSIP is an efficient semantic segmentation method for large-scale indoor point clouds. GSIP is green since it has significantly lower computational complexity and a much smaller model size for processing large-scale point clouds. It contains two novel ingredients: a new room-style method for data pre-processing and a new point cloud feature extractor, which is extended from PointHop with lower memory and computational costs while preserving the segmentation performance.
Chapter 5
Local and Global Aggregation in Point Cloud
Classification and Segmentation
5.1 Introduction
Explainable and green solutions have been proposed for point cloud classification and segmentation in this dissertation. However, we observe two limitations from them. First, further classification performance gains are difficult to obtain for both deep learning and green learning. For instance, PointNet [65] and Point Transformer [101] achieve 47.6% and 73.5% mIoU for semantic segmentation of the S3DIS dataset [5], and 89.2% and 93.7% accuracy for classification of the ModelNet40 dataset [84], respectively. As compared with the 25.9% performance gain in semantic segmentation, the improvement of 4.5% in classification accuracy of Point Transformer over PointNet is unimpressive. Similarly, we only gained 2% in classification accuracy when PointHop [99] was upgraded to PointHop++ [98]. It appears unnecessary to build a large and complex model for point cloud classification. Second, the UFF method [96] and the GSIP method [97] target small-scale objects and large-scale scenes separately. This motivates us to develop a method that can segment both small-scale and large-scale point clouds while keeping the advantages of GSIP.
The rest of this chapter is organized as follows. A green point cloud classification method, SR-PointHop, is proposed in Sec. 5.2. SR-PointHop is built upon a single-resolution point cloud representation and enriches the information of a point cloud object with more geometric aggregations of various local representations. A green point cloud segmentation method, GreenSeg, is proposed in Sec. 5.3. Through a novel local aggregation method, GreenSeg segments both small-scale and large-scale point clouds efficiently and effectively.
Finally, concluding remarks and future research directions are given in Sec. 5.4.
5.2 SR-PointHop Method
Based on observations of deep learning and green learning [46] methods, it appears unnecessary to build a large and complex model for point cloud classification. With this motivation, we propose a new lightweight point cloud classification method, called single-resolution PointHop (SR-PointHop). SR-PointHop advances PointHop by converting multiple hops to a single hop. With a shallow hop, both the computation time and the model size can be reduced. It compensates for the performance loss of a shallow model by aggregating the geometric information of local representations. Examples include global aggregation, cone aggregation and inverted cone aggregation. For each aggregation, seven statistics are calculated: the maximum, minimum, mean, L_1-norm, L_2-norm, standard deviation and variance. The resulting features are further ranked and selected based on their discriminant power. As compared with PointHop and PointHop++, which need at least four hops to learn point cloud representations, SR-PointHop has an extremely simple architecture and comparable performance.
5.2.1 Methodology
An overview of the proposed SR-PointHop is illustrated in Fig. 5.1. The input point cloud scan consists of N points. Each point, p_i, has three Cartesian coordinates (x_i, y_i, z_i) representing its location in the 3D space. To obtain a discriminant point cloud representation for classification, we first construct local representations for each point. Next, we conduct comprehensive geometric aggregations of the local representations. Then we compute the cost of each representation dimension, rank all dimensions from the lowest cost to the largest, select discriminant features based on the elbow point of the curve, and feed them into the classifier to obtain an object label. Finally, the SR-PointHop method is discussed.
Point Cloud Representation Construction
For each point, its local region is constructed based on its K nearest neighbors using the KNN search. As shown in the leftmost block of Fig. 5.1, the red colored point represents the center point and the yellow points represent its local neighbors. We divide the local region into eight octants according to the center point and the relative positions between the neighbors and the center point. The averaged coordinates of the points in each octant are calculated to represent the octant. Hence, the local structure of each point can be described geometrically with 24 (= 3 × 8) dimensions. The eight-octant method is effective for two reasons. First, it ensures that the representation of a point is invariant under the permutation of points in the local region. Second, it encodes more information than the ordinary max pooling operation.

Figure 5.1: An overview of the proposed SR-PointHop method (point cloud representation construction with K-NN and octant features, local representation aggregation over global, cone and inverted cone regions with seven pooling statistics, and discriminant feature classification).
The idea of constructing octant representations was first proposed in PointHop, which was the first green point cloud learning method. It adopts an unsupervised representation learning tool to find point cloud attributes through multiple PointHop units. The earlier PointHop units have a small receptive field while the later PointHop units have a large receptive field. They are effective in capturing the local and global information of a point cloud scan. The responses from multiple PointHop units are aggregated and concatenated to feed into the classifier. The SR-PointHop unit works about the same as described above except for one difference. The PointHop unit uses the Saab transform [48] for feature dimension reduction in each hop. For SR-PointHop, we keep the 24D representation vector for each point. This is because SR-PointHop is a single-hop method and it is desired to keep the geometric information as is for further processing.
Local Representation Aggregation
To compensate for the performance loss of a single-hop architecture, we consider multiple schemes to aggregate local representations, as shown in the middle block of Fig. 5.1. The aggregation schemes include the global aggregation applied to the whole object, and the cone aggregation and the inverted cone aggregation applied to parts of the object.

First, we adopt a global aggregation scheme as done by other methods. Initially, the global aggregation scheme used in PointNet is max pooling. Three more pooling schemes are added in PointHop, namely, the mean, L_1-norm and L_2-norm poolings. These poolings describe point cloud objects from complementary angles. They jointly work better than the max pooling alone. Inspired by this idea, we add three more pooling schemes in SR-PointHop. They are the min, standard deviation and variance poolings. As a result, there are seven pooling schemes in total. Each of them can capture some unique information. We ensemble all of them, leading to the global aggregation strategy of SR-PointHop.
However, the global aggregation may ignore fine-grained structures of an object. A cone-like aggregation, including cone and inverted cone aggregations, is conducted to aggregate features of different parts of the object. For the cone aggregation, we create a double cone along each axis with its apex at the origin and its generatrix making an angle θ with the axis. All points located within the double cone are aggregated using the seven pooling schemes. As shown in the middle block of Fig. 5.1, the three double cones are colored purple, green and yellow, and the points are colored correspondingly. We aggregate three double cones instead of six cones because some parts of the objects are symmetric. Moreover, we invert the cones to capture a different spatial structure for each part. For example, the cone along the positive/negative y axis now has its base on the xz-plane after inverting. There are 6 inverted cones in total. For each one, we aggregate the points inside the inverted cone using the seven pooling schemes. They are marked by 6 different colors in the middle block of Fig. 5.1. We aggregate the six cones independently since they provide complementary geometric information about the object. The angle value, θ, is determined experimentally. We select 75° for the three double cones and 45° for the six inverted cones.
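A possible sketch of the cone-based aggregation is shown below: points are assigned to a double cone around one coordinate axis by the angle between their position vectors and that axis, and the seven statistics (max, min, mean, L1-norm, L2-norm, standard deviation, variance) are computed over the selected 24D representations. The angle test and array names are illustrative assumptions; the inverted cones and the global aggregation can be obtained analogously by changing the point selection mask.

```python
import numpy as np

STATS = [np.max, np.min, np.mean,
         lambda x, axis: np.abs(x).sum(axis=axis),          # L1-norm
         lambda x, axis: np.sqrt((x ** 2).sum(axis=axis)),  # L2-norm
         np.std, np.var]

def pool7(rep, axis=0):
    """The seven pooling statistics over point representations, shape (n, D) -> (7, D)."""
    return np.stack([f(rep, axis=axis) for f in STATS])

def double_cone_pool(xyz, rep, axis=0, theta_deg=75.0):
    """Pool representations of the points inside the double cone around one coordinate axis."""
    cos_angle = np.abs(xyz[:, axis]) / (np.linalg.norm(xyz, axis=1) + 1e-12)
    inside = cos_angle >= np.cos(np.deg2rad(theta_deg))     # angle to the axis <= theta
    return pool7(rep[inside])

# Toy usage: N points with 24D local representations; one of the ten aggregation regions.
rng = np.random.default_rng(0)
xyz, rep = rng.normal(size=(1024, 3)), rng.normal(size=(1024, 24))
print(double_cone_pool(xyz, rep, axis=1).shape)             # (7, 24) for the y-axis double cone
```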
Discriminant Feature Classification
The aggregation of global and local representations provides a total of 1680 (= 24 × 7 × 10) dimensions for each point cloud. Next, we compute their discriminant power, rank them from the best to the worst, and select a subset to train a classifier. A supervised feature selection method, called the discriminant feature test (DFT) [90], is utilized to determine the discriminant power of each dimension independently. The weighted entropy loss of each 1D dimension is calculated at a set of uniformly spaced partition points. The lowest entropy loss is
chosen as the DFT loss of the dimension. We show the DFT loss curve in the right subfigure
of Fig. 5.1, where the x-axis is the sorted feature index and the y-axis is the DFT loss
value. The lower the loss, the higher the discriminant power. Consequently, we can use the
elbow point of the curve as indicated by the red point to determine a subset of discriminant
features. Features with their loss values less than that of the red point are selected. Finally,
the linear least squares regression (LLSR) is taken as the classifier for final decision making.
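For intuition, a minimal sketch of a DFT-style loss for one feature dimension is given below; the number of candidate split points and other details are assumptions, and the exact formulation follows [90].

```python
import numpy as np

def dft_loss_1d(x, y, n_splits=16):
    """Approximate DFT loss of one 1D feature: try uniformly spaced split points,
    compute the weighted entropy of the integer class labels y in the two
    partitions, and keep the minimum. 'n_splits' is an assumed hyperparameter."""
    splits = np.linspace(x.min(), x.max(), n_splits + 2)[1:-1]
    n_classes = int(y.max()) + 1
    best = np.inf
    for s in splits:
        loss = 0.0
        for part in (y[x <= s], y[x > s]):
            if part.size == 0:
                continue
            p = np.bincount(part, minlength=n_classes) / part.size
            p = p[p > 0]
            loss += (part.size / y.size) * (-(p * np.log2(p)).sum())
        best = min(best, loss)
    return best

# Rank all dimensions by their loss and keep those below the elbow of the curve,
# e.g. order = np.argsort([dft_loss_1d(F[:, d], labels) for d in range(F.shape[1])])
```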
Discussion
SR-PointHop constructs a single resolution representation of point clouds. Its architecture is extremely shallow compared with other methods. However, extensive geometric aggregations of both local and global representations compensate for the performance loss of the simple model. In state-of-the-art learning methods, the aggregation of global representations for classification is simple, either max pooling or mean pooling. However, such a global pooling operation ignores the local information of objects. The success of cone-like aggregations points out that attention can be paid to different parts of objects to differentiate object classes.
5.2.2 Experimental Results
We evaluate SR-PointHop on the ModelNet40 [84] benchmark dataset for shape classification. In experiments, we set K = 32 in the KNN search, select 1569 features using DFT, and feed them into the LLSR classifier.
Classification Performance
Classification results on ModelNet40 are compared in Table 5.1. Following [65], all benchmarking methods are evaluated with two metrics: the mean of the per-class accuracy (class-avg) and the overall accuracy over all point cloud scans. The results of state-of-the-art deep learning methods and green learning methods show that the performance gain is limited despite the huge efforts made. Specifically, there is only a 4.5% overall accuracy improvement from PointNet to Point Transformer. SR-PointHop achieves 90.6% overall accuracy, which outperforms PointHop and is comparable with PointHop++. Since SR-PointHop is a single resolution method with only one hop, we also calculate the classification accuracy of PointHop and PointHop++ using only one hop. We see that both of them have a large performance gap against SR-PointHop, which demonstrates the effectiveness of the proposed geometric aggregation schemes of local representations. As compared with deep networks, SR-PointHop outperforms PointNet and is only 0.1% below PointNet++.
Comparison of Model and Time Complexities
Table 5.2 compares the time complexity and the model sizes of several deep learning and green learning methods. To clarify, the deep learning methods are trained on a single GeForce GTX TITAN X GPU, while the green learning methods are trained on a 24-thread Intel(R) Xeon(R) CPU.
Table 5.1: Comparison of classification results on ModelNet40.

Learning Scheme | Method | class-avg Acc. (%) | overall Acc. (%)
Deep Learning | PointNet [65] | 86.2 | 89.2
Deep Learning | PointNet++ [67] | - | 90.7
Deep Learning | PointCNN [55] | 88.1 | 92.5
Deep Learning | PointConv [83] | - | 92.5
Deep Learning | DGCNN [81] | 90.2 | 92.9
Deep Learning | KPConv [78] | - | 92.9
Deep Learning | PCT [28] | - | 93.2
Deep Learning | Point Trans. [101] | 90.6 | 93.7
Green Learning (4 hops) | PointHop [99] | 84.4 | 89.1
Green Learning (4 hops) | PointHop++ [98] | 87 | 91.1
Green Learning (1 hop) | PointHop | 44.8 | 60.7
Green Learning (1 hop) | PointHop++ | 49.7 | 64.5
Green Learning | SR-PointHop | 86 | 90.6
It takes several hours to train the deep learning models. In contrast, the green learning models take only around half an hour, and the new SR-PointHop model completes its training within 4 minutes. Although we test on the CPU, the inference time is still comparable with that of the other deep learning models using a GPU. The efficiency of SR-PointHop is further improved compared with that of PointHop and PointHop++. The numbers of parameters are given to compare different model sizes, where 'M' stands for million. Because of the single resolution representation, SR-PointHop's filter size is 18× smaller than that of PointHop. Its total parameter number is only 62K, which is the smallest among all models listed in the table.
Table 5.2: Comparison of time and model complexities.

Method | Training time (hr) | Inference time (ms) | Filter param. (M) | Classifier param. (M) | Total param. (M)
PointNet [65] | 7 | 10 | - | - | 3.48
PointNet++ [67] | 7 | 14 | - | - | 1.48
DGCNN [81] | 21 | 154 | - | - | 2.63
PointHop [99] | 0.33 | 108 | 0.037 | - | -
PointHop++ [98] | 0.42 | 97 | 0.009 | 0.15 | 0.159
SR-PointHop | 0.08 | 23 | 0.002 | 0.06 | 0.062
More Experiments on ModelNet40
More experiments with various settings are conducted on ModelNet40 to show the contribution of each design in SR-PointHop. We first study the overall classification accuracy using different combinations of the pooling schemes in Fig. 5.2. We start with the best setting of PointHop, which is an ensemble of the max, mean, L1 and L2 poolings. SR-PointHop has worse performance than PointHop under this setting. Then, we add the three new pooling schemes to SR-PointHop and test seven more cases. The results show that the ensemble of all seven aggregation schemes gives the best performance. The robustness of SR-PointHop is also checked by showing the performance with fewer points in the point cloud scans. Even though SR-PointHop uses a single resolution representation, it only performs slightly worse than PointHop with fewer than 300 points.
Furthermore, we conduct ablation studies on the number of nearest neighbors and the aggregations in Table 5.3. In the first three rows, the K value is set to 32, and we get the highest accuracy (90.64%) with both global and cone aggregations. This also shows that the proposed cone aggregation can capture more shape information of objects than the simple global aggregation. This study validates the power of geometric aggregations of local representations.
Figure 5.2: Comparison of classification accuracy with different pooling schemes versus different point numbers of point cloud scans.
In the last three rows, we compare the performance with K = 16, 32, and 64 in the KNN search. We see that K = 32 gives the best performance. Thus, we successfully reduce K from 64 in PointHop to 32 in SR-PointHop, which helps reduce the time complexity.
Table 5.3: Ablation Study

K in KNN | Global Aggregation | Cone Aggregation | Overall Accuracy (%)
32 | ✓ | | 72.37
32 | | ✓ | 89.75
32 | ✓ | ✓ | 90.64
16 | ✓ | ✓ | 89.14
64 | ✓ | ✓ | 89.79
5.2.3 Discussion
The SR-PointHop method, built upon the single resolution point cloud representation for green point cloud classification, was proposed in this section. SR-PointHop simplifies the PointHop model by reducing the model depth to a single hop and enriches the information of a point cloud object with more geometric aggregations of various local representations. SR-PointHop is capable of classifying point clouds with a much smaller model size and can run efficiently on the CPU. It provides an ideal solution for mobile and edge computing.
5.3 GreenSeg Method
A green learning methodology was recently proposed for 3D point cloud classification and segmentation, which has a significantly smaller model size and lower training complexity compared with deep learning. The UFF method and the GSIP method are two green learning based methods, targeting small-scale objects and large-scale scenes, respectively. This motivates us to develop a unified structure that can segment both small-scale and large-scale point clouds efficiently. In this work, we identify the weaknesses of UFF and GSIP in point cloud segmentation and propose a novel green point cloud segmentation method, GreenSeg. GreenSeg adopts a green and simple local aggregation strategy to enrich the local context and provides the option for object-wise segmentation if object labels are available.
5.3.1 Methodology
An overview of the proposed GreenSeg method is given in Fig. 5.3. Depending on the dataset, the input point cloud is first pre-processed and/or enriched.
Figure 5.3: An overview of the proposed GreenSeg method, where the upper left block shows the feature extraction process for point cloud segmentation, and the blocks in solid line and dotted line show the difference between the pipelines for semantic segmentation and part segmentation.
For part segmentation, each point with 3D coordinates is enriched with 9 additional features to capture local geometric patterns of the object, including 3D surface normals and 6D geometric features (planarity, linearity, surface variation, etc.) [30]. For semantic segmentation, we follow the data pre-processing method in [97]. Instead of using the standard block partitioning method to split the raw point cloud (i.e., a room in the indoor scene) into small pieces, we take each room as an input unit to keep the structure of the room. The raw point cloud is first voxel downsampled with a grid size of 0.04, which greatly reduces the total number of points in training. Then, each point, which has XYZ, RGB and normalized XYZ, is augmented with 12 additional features, where 9D is the same as that in part segmentation and the remaining 3D is the HSV color converted from the RGB color to help recognize objects with the same geometric pattern but different colors. Hence, the D in the figure stands for 12 and 24 for semantic and part segmentation, respectively.
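A minimal sketch of the voxel downsampling step is given below, assuming a simple average of the coordinates and features within each occupied 0.04 m voxel; the exact pre-processing in [97] may differ in such details, and the function name is an assumption.

```python
import numpy as np

def voxel_downsample(xyz, feats, grid=0.04):
    """Quantize coordinates to a 'grid'-sized voxel grid and keep one averaged
    point (and feature vector) per occupied voxel. Illustrative sketch only."""
    keys = np.floor(xyz / grid).astype(np.int64)
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)                       # ensure 1-D voxel ids
    counts = np.bincount(inverse).astype(float)[:, None]
    xyz_ds = np.zeros((len(counts), 3))
    feat_ds = np.zeros((len(counts), feats.shape[1]))
    np.add.at(xyz_ds, inverse, xyz)                     # sum per voxel ...
    np.add.at(feat_ds, inverse, feats)
    return xyz_ds / counts, feat_ds / counts            # ... then average
```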
With the initial input features, a commonly used encoder-decoder architecture is adopted to build more powerful point-wise features for segmentation. As shown in the upper left block of Fig. 5.3, the feature extraction consists of four encoding hops and four decoding hops. It is based on our previous works UFF and GSIP, which are designed for small-scale objects only and large-scale scenes only, respectively. To be applicable to both small-scale and large-scale point clouds, the shortcomings of the two methods are analyzed and fixed. Finally, the point-wise features are classified by XGBoost [15]. When object labels exist, object-wise segmentation is provided to boost the performance. The lower right block of Fig. 5.3 shows that part segmentation adopts object-wise segmentation.
Feature Extraction in UFF and GSIP
UFF [96] and GSIP [97] are two green learning based methods proposed for small-scale object part segmentation and large-scale indoor scene semantic segmentation, respectively. They are extended from the fundamental works in point cloud classification [99, 98]. The feature extraction in UFF is also a cascaded encoder-decoder architecture, where PointHop [99] is applied directly as the encoder to build attributes from local to global. It stacks four hops to summarize the structures of 3D neighborhoods at short-, mid- and long-range distances. In each hop, the local neighborhood of each point is partitioned into eight octants, and the point attributes in each octant are aggregated by mean pooling. To control the dimension of the local descriptor, the Saab transform [48] is adopted. For the decoder, UFF uses eight-octant grouping and the Saab transform as well.
Although this local descriptor design is more informative than the naive max pooling in the encoder and the simple interpolation of neighbor features in the decoder, the memory cost of eight-octant grouping and the Saab transform is high. It is not suitable for processing large-scale point cloud data. GSIP was then proposed for segmenting large-scale point clouds with lower memory consumption and faster speed. GSIP first changes the eight-octant grouping and Saab transform in the encoder back to max pooling, which keeps the feature dimension unchanged and helps save memory and time. To compensate for the information loss caused by max pooling, relative point positional encoding is added before max pooling to ensure that point features are aware of their relative spatial positions. In the decoder, the features of the three nearest neighbors are interpolated to the higher resolution.
Summary of Local Aggregation
Here, we summarize the local aggregation in UFF, GSIP and PointNet++ for a point p_i with feature f_i. All three methods first construct its local neighborhood E by K-NN. In UFF, the local aggregation is

f'_i = Saab( mean_{k∈E_1}(f^k_i) ⊕ mean_{k∈E_2}(f^k_i) ⊕ ··· ⊕ mean_{k∈E_8}(f^k_i) ),    (5.1)

where k is the index of a point in each octant of the local set and f'_i is the aggregated feature. In GSIP, the local aggregation is

r^k_i = p_i ⊕ p^k_i ⊕ (p_i − p^k_i) ⊕ ∥p_i − p^k_i∥,
f'_i = max_{k∈E}( f^k_i ⊕ r^k_i ),    (5.2)

where r^k_i is the relative point positional encoding. In PointNet++,

f'_i = max_{k∈E}( MLP(f^k_i) ),    (5.3)

where MLP is a multi-layer perceptron.
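For concreteness, the following numpy sketch implements the GSIP-style aggregation of Eq. (5.2); the function name and shapes are illustrative, and the neighbor indices are assumed to come from a separate K-NN search.

```python
import numpy as np

def gsip_local_aggregation(points, feats, knn_idx):
    """Eq. (5.2) sketch: concatenate each neighbor's feature with the relative
    positional encoding r^k_i = p_i ⊕ p^k_i ⊕ (p_i − p^k_i) ⊕ ||p_i − p^k_i||
    and max-pool over the K neighbors.
    points: (N, 3), feats: (N, C), knn_idx: (N, K) neighbor indices."""
    p_i = points[:, None, :]                              # (N, 1, 3) query points
    p_k = points[knn_idx]                                 # (N, K, 3) neighbor coordinates
    diff = p_i - p_k
    dist = np.linalg.norm(diff, axis=-1, keepdims=True)   # (N, K, 1)
    r = np.concatenate([np.broadcast_to(p_i, p_k.shape), p_k, diff, dist], axis=-1)
    return np.concatenate([feats[knn_idx], r], axis=-1).max(axis=1)   # (N, C + 10)
```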
Proposed Unsupervised Feature Extraction
GSIP’s architecture is more suitable for segmenting both small-scale and large-scale point
clouds than UFF. However, GSIP emphasizes too much on speed and memory saving, ig-
noring a lot of fine-grained details. Although relative point positional encoding helps com-
pensate part of the information loss, max pooling in encoder drops most of the information
107
in the local descriptor. Moreover, the role the decoder in GSIP is only for upsampling. The
proposed unsupervised feature extraction is mainly targeting at improving from the two as-
pects. First, borrowing the idea from UFF, we do feature learning in decoder rather than
only interpolating the feature for upsampling. Each hop of decoder follows the same steps
as encoder: K-NN, relative point positional encoding, pooling, and point feature standard-
ization. Second, a green and simple local aggregation strategy is adopted to enrich the local
context. Specifically, two different pooling schemes are selected for encoder and decoder.
It’s better than a single pooling scheme since different statistics of the local set can be en-
coded. Because the local context is enriched without increasing the feature dimensions, the
computational resources can be saved a lot. The options for local aggregation include
f'_i = L1_{k∈E}( f^k_i ⊕ r^k_i ),    (5.4)
f'_i = L2_{k∈E}( f^k_i ⊕ r^k_i ),    (5.5)
f'_i = mean_{k∈E}( f^k_i ⊕ r^k_i ).    (5.6)
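The pooling options of Eqs. (5.4)-(5.6), plus the max pooling used in some configurations, can be summarized in a single helper as sketched below; the function name and the 'scheme' argument are assumptions for illustration.

```python
import numpy as np

def pool_neighbors(stacked, scheme):
    """Apply one of the pooling options of Eqs. (5.4)-(5.6) (or max) over the K
    neighbors of every point. 'stacked' is an (N, K, C) array of concatenated
    neighbor features and positional encodings."""
    if scheme == "l1":
        return np.abs(stacked).sum(axis=1)
    if scheme == "l2":
        return np.sqrt((stacked ** 2).sum(axis=1))
    if scheme == "mean":
        return stacked.mean(axis=1)
    if scheme == "max":
        return stacked.max(axis=1)
    raise ValueError(f"unknown pooling scheme: {scheme}")
```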
Through experiments, a combination of L1 pooling in the encoder and max pooling in the decoder is adopted for part segmentation, and a combination of L1 pooling in the encoder and mean pooling in the decoder is adopted for semantic segmentation. Since the feature dimension grows slowly, a mean pooling is added to the last hop's local aggregation in the decoder to augment the point-wise feature. The feature dimensions of each hop are shown in Fig. 5.3; the newly proposed feature extractor produces a final feature with N points and dimension 9D+28P, where P represents the number of features in the relative point positional encoding. Between two consecutive hops in the encoder, the point cloud is downsampled to increase the receptive field. For part segmentation, farthest point sampling (FPS), which preserves the structure of the object well, is applied to the point cloud object. However, each input unit in semantic segmentation has 10K to 200K points, so random sampling is more suitable than FPS, since FPS is an iterative algorithm that becomes time and memory consuming as more points participate.
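A minimal FPS sketch is shown below to illustrate why it is costly for large inputs: each of the n_samples iterations scans all N points, so random sampling is preferred for the 10K to 200K-point semantic segmentation units. The implementation details are illustrative.

```python
import numpy as np

def farthest_point_sampling(xyz, n_samples):
    """Iteratively pick the point farthest from the already selected set.
    Cost is O(N * n_samples); the first sample defaults to index 0."""
    selected = np.zeros(n_samples, dtype=np.int64)
    dist = np.full(xyz.shape[0], np.inf)
    for i in range(1, n_samples):
        # distance of every point to the nearest already-selected point
        dist = np.minimum(dist, np.linalg.norm(xyz - xyz[selected[i - 1]], axis=1))
        selected[i] = dist.argmax()
    return selected
```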
Classifier
Because of the point-wise classification, the number of training data points is huge for both semantic segmentation and part segmentation. Some commonly used classifiers, such as Random Forest and Support Vector Machine, have high memory cost and slow speed. Hence, XGBoost [15] is adopted to predict a label for each point. For semantic segmentation, the number of estimators in the XGBoost classifier is 512 and the depth is 6. For part segmentation, object-wise segmentation is adopted to boost the performance since the ground truth object labels exist. First, we perform global aggregation on the point-wise features following [99] to obtain the global object feature, where M in Fig. 5.3 represents the number of pooling schemes used. Then we train an XGBoost classifier for each object category. During the testing phase, linear least squares regression (LLSR) is used for object classification. With the predicted object label, the corresponding XGBoost classifier is selected to do the segmentation. By separating the segmentation into small tasks, the memory cost is further decreased. Moreover, segmenting two to six parts is much easier than segmenting 50 parts. GreenSeg achieves over 98.5% accuracy for ShapeNetPart object classification, which is accurate enough to boost the segmentation performance.
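The object-wise segmentation can be sketched as one XGBoost classifier per object category, as below; the variable names are illustrative, and the (512, 6) hyperparameters quoted above for semantic segmentation are reused here only as placeholders.

```python
import numpy as np
from xgboost import XGBClassifier

def train_objectwise_segmenters(point_feats, part_labels, shape_category):
    """Train one part-segmentation XGBoost classifier per object category.
    point_feats: (P, F) point-wise features, part_labels: (P,) part ids,
    shape_category: (P,) the category of the shape each point belongs to."""
    segmenters = {}
    for cat in np.unique(shape_category):
        mask = shape_category == cat
        # re-index the part labels of this category to 0..n-1 for the classifier
        classes, y = np.unique(part_labels[mask], return_inverse=True)
        clf = XGBClassifier(n_estimators=512, max_depth=6)
        clf.fit(point_feats[mask], y)
        segmenters[cat] = (clf, classes)   # keep the class map to decode predictions
    return segmenters

# At test time, the object label predicted by LLSR selects which per-category
# segmenter (clf, classes) labels the shape's points: classes[clf.predict(feats)].
```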
5.3.2 Experimental Results
We conduct experiments on the ShapeNetPart [92] dataset and the S3DIS dataset [5] to evaluate the learned point features. We follow the experimental settings of GSIP [97] but make changes to adapt to the part segmentation task. In the following tables for performance comparison, the quoted results are taken from the cited papers. In addition to the part segmentation and semantic segmentation results, we also show a comparison of model and time complexities. More experiments, including ablation studies and ModelNet40 [84] classification, are conducted.
Part Segmentation
The ShapeNetPart [92] dataset is a subset of the ShapeNet core dataset; it includes 16,881 CAD models of 16 object categories. There are 50 annotated parts, with two to six parts in each object category. Initially, we sample 2048 points from each CAD model and divide the dataset into a training set and a testing set of 14,007 and 2,874 shapes, respectively. Then we use the proposed GreenSeg pipeline to predict a part label for each point of the objects. For the evaluation metric, we calculate the mean Intersection-over-Union (mIoU) between the ground truth and the prediction. Specifically, the average of the IoUs of all parts in a shape is the shape's mIoU, and the average of the mIoUs of all shapes is the instance mIoU (Ins. mIoU). In Table 5.4, we compare the Ins. mIoU of GreenSeg with both deep learning based methods and green learning based methods. The categorical mIoUs, i.e., the averaged mIoUs of all shapes in the same category, are also listed in Table 5.4 for further analysis.
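A minimal sketch of the shape-level mIoU computation is given below; treating an absent part (empty union) as IoU 1 follows a common ShapeNetPart evaluation convention and is an assumption here.

```python
import numpy as np

def shape_miou(pred, gt, part_ids):
    """mIoU of one shape: average the per-part IoU over the parts of its category.
    pred, gt: (N,) predicted and ground-truth part labels of the shape's points."""
    ious = []
    for p in part_ids:
        inter = np.sum((pred == p) & (gt == p))
        union = np.sum((pred == p) | (gt == p))
        ious.append(1.0 if union == 0 else inter / union)
    return float(np.mean(ious))

# Instance mIoU: mean of shape_miou over all test shapes;
# category mIoU: mean of shape_miou over the shapes of one category.
```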
We first compare GreenSeg with the green learning based methods in Table 5.4, including UFF [96] and GSIP [97]. The UFF method is trained with 5% of the data, which is randomly sampled, while the other methods are trained with 100% of the data. The GSIP method does not provide part segmentation results, so we evaluate it on the ShapeNetPart dataset ourselves. As shown in the table, GreenSeg achieves 83.8% Ins. mIoU, which is the best performance among the green learning based methods. Additionally, we compare GreenSeg with deep learning based methods. GreenSeg is better than Kd-Net [41] and PointNet [65], which achieve 82.3% and 83.7%, respectively. It is worthwhile to point out that GreenSeg has an unsupervised feature extractor which learns the point-wise features without backpropagation. Therefore, it is understandable that GreenSeg has a performance gap to state-of-the-art methods such as DGCNN [81] and PointCNN [55] when the low computational resource requirement is taken into consideration. The analysis of model and time complexities will be discussed later.
Visualizations of GreenSeg's part segmentation results are given in Fig. 5.4, together with the corresponding ground truth and the results of PointNet and UFF for comparison. For example, we can clearly observe a better prediction of the connection between the cup and its handle.
Semantic Segmentation
The S3DIS dataset [5] is a subset of the Stanford 2D-3D-Semantics dataset [4]. It is one of the benchmark datasets for large-scale indoor scene semantic segmentation. S3DIS contains point clouds scanned from 6 indoor areas. Each point cloud is a room; there are 271 rooms in total. For each point, XYZ, RGB and normalized XYZ values are provided. The goal of semantic segmentation is to label each point with one of the 13 object categories, including ceiling, floor, wall, door, etc.
Figure 5.4: Visualization of object part segmentation results on the ShapeNetPart dataset. From top to bottom: ground truth, PointNet, UFF, GreenSeg.
Table 5.4: Comparison of part segmentation performance (%) for ShapeNetPart.

Method | Ins. mIoU | aero | bag | cap | car | chair | earphone | guitar | knife | lamp | laptop | motor | mug | pistol | rocket | skateboard | table
PointNet [65] (Deep) | 83.7 | 83.4 | 78.7 | 82.5 | 74.9 | 89.6 | 73.0 | 91.5 | 85.9 | 80.8 | 95.3 | 65.2 | 93.0 | 81.2 | 57.9 | 72.8 | 80.6
PointNet++ [67] (Deep) | 85.1 | 82.4 | 79.0 | 87.7 | 77.3 | 90.8 | 71.8 | 91.0 | 85.9 | 83.7 | 95.3 | 71.6 | 94.1 | 81.3 | 58.7 | 76.4 | 82.6
Kd-Net [41] (Deep) | 82.3 | 80.1 | 74.6 | 74.3 | 70.3 | 88.6 | 73.5 | 90.2 | 87.2 | 81.0 | 94.9 | 57.4 | 86.7 | 78.1 | 51.8 | 69.9 | 80.3
LocalFeatureNet [74] (Deep) | 84.3 | 86.1 | 73.0 | 54.9 | 77.4 | 88.8 | 55.0 | 90.6 | 86.5 | 75.2 | 96.1 | 57.3 | 91.7 | 83.1 | 53.9 | 72.5 | 83.8
DGCNN [81] (Deep) | 85.2 | 84.0 | 83.4 | 86.7 | 77.8 | 90.6 | 74.7 | 91.2 | 87.5 | 82.8 | 95.7 | 66.3 | 94.9 | 81.1 | 63.5 | 74.5 | 82.6
PointCNN [55] (Deep) | 86.1 | 84.1 | 86.5 | 86.0 | 80.8 | 90.6 | 79.7 | 92.3 | 88.4 | 85.3 | 96.1 | 77.2 | 95.3 | 84.2 | 64.2 | 80.0 | 83.0
UFF [96] (5%) (Green) | 76.9 | 71.9 | 68.8 | 74.9 | 68.0 | 84.4 | 78.2 | 86.2 | 76.1 | 67.7 | 94.5 | 58.0 | 93.2 | 67.5 | 49.9 | 68.0 | 75.6
GSIP [97] (Green) | 77.5 | 75.2 | 61.1 | 61.6 | 72.2 | 85.4 | 55.9 | 89.5 | 76.6 | 75.0 | 91.7 | 64.2 | 90.0 | 75.5 | 41.2 | 61.1 | 72.5
GreenSeg (Green) | 83.8 | 82.6 | 81.7 | 74.0 | 77.7 | 89.2 | 70.8 | 91.7 | 83.5 | 82.3 | 94.9 | 71.3 | 95.0 | 80.6 | 50.5 | 69.5 | 80.6

Table 5.5: Comparison of semantic segmentation performance (%) for S3DIS.

Setting | Method | OA | mIoU | ceiling | floor | wall | beam | column | window | door | table | chair | sofa | bookcase | board | clutter
Area 5 | PointNet [65] | - | 41.1 | 88.0 | 97.3 | 69.8 | 0.05 | 3.9 | 42.3 | 10.8 | 58.9 | 52.6 | 5.9 | 40.3 | 26.4 | 33.2
Area 5 | GSIP [97] | 79.9 | 43.8 | 89.2 | 97.0 | 72.2 | 0.1 | 18.4 | 37.3 | 22.5 | 64.3 | 59.5 | 3.4 | 47.2 | 22.9 | 35.7
Area 5 | GreenSeg | 83.0 | 48.3 | 92.2 | 98.1 | 74.7 | 0.0 | 16.8 | 39.4 | 32.4 | 68.9 | 75.2 | 8.2 | 55.1 | 24.5 | 41.8
6-fold | PointNet [65] | 78.6 | 47.6 | 88.0 | 88.7 | 69.3 | 42.4 | 23.1 | 47.5 | 51.6 | 54.1 | 42.0 | 9.6 | 38.2 | 29.4 | 35.2
6-fold | PointNet++ [67] | 80.9 | 53.2 | 90.2 | 91.7 | 73.1 | 42.7 | 21.2 | 49.7 | 42.3 | 62.7 | 59.0 | 19.6 | 45.8 | 38.2 | 45.6
6-fold | GSIP [97] | 79.8 | 48.5 | 91.8 | 89.8 | 73.0 | 26.3 | 24.0 | 44.6 | 55.8 | 55.5 | 51.1 | 10.2 | 43.8 | 21.8 | 43.2
6-fold | GreenSeg | 82.4 | 53.7 | 92.9 | 91.6 | 74.7 | 29.9 | 25.8 | 45.3 | 62.4 | 61.0 | 59.6 | 26.1 | 51.2 | 27.4 | 49.9
Usually, area 5 is selected as the test data and the remaining areas are used as training data. Following [65], we also provide the result of the standard 6-fold cross validation; that is, each area is taken as the test area in one fold. For the evaluation metrics, we calculate the mIoU and overall accuracy (OA) over the 13 categories.
In Table 5.5, we show the semantic segmentation performance of GreenSeg on the S3DIS dataset and compare it with PointNet and GSIP. The better results are marked in bold. We can observe that GreenSeg obtains much better results than PointNet and GSIP. When using area 5 as the test data, GreenSeg outperforms PointNet and GSIP in mIoU by 7.2% and 4.5%, respectively. For the 6-fold result, GreenSeg outperforms PointNet, GSIP and PointNet++ in mIoU by 6.1%, 5.2% and 0.5%, respectively. PointNet++ does not provide the 6-fold result; the numbers in the table are quoted from other papers for reference. The detailed cross validation results are provided in Table 5.6.
Analyzing the categorical mIoUs shows that the green learning based methods perform worse than the deep learning methods on some objects, such as beam, window and board. This is caused by the different data processing techniques in the two kinds of methods. PointNet and PointNet++ use block partitioning to split each room into 1 × 1 meter blocks along the horizontal plane and sample each block to 4096 points as an input unit. In this way, the blocks can be processed in parallel by the GPU. Due to its flexible feature extraction that can be implemented effectively on the CPU, green learning does not have such a requirement. Specifically, each input unit can have a different or the same number of points. It adopts a room-style data processing technique which takes each room after voxel downsampling as an input unit. The room-style data processing generates far fewer points than the block style while keeping the structure of the rooms.
Figure 5.5: Visualization of semantic segmentation results on the S3DIS dataset. From top to bottom: input point cloud with real color, ground truth, GSIP, GreenSeg.
Table 5.6: Class-wise semantic segmentation performance (%) of GreenSeg for S3DIS.

Object \ Area | 1 | 2 | 3 | 4 | 5 | 6 | mean
ceiling | 95.1 | 87.3 | 95.7 | 91.1 | 92.2 | 95.7 | 92.9
floor | 94.8 | 64.4 | 98.0 | 96.9 | 98.1 | 97.3 | 91.6
wall | 71.1 | 73.6 | 74.9 | 74.0 | 74.7 | 79.6 | 74.7
beam | 23.5 | 13.5 | 70.5 | 0.0 | 0.0 | 72.0 | 29.9
column | 25.7 | 20.1 | 16.5 | 35.1 | 16.8 | 40.7 | 25.8
window | 80.6 | 25.5 | 40.7 | 11.5 | 39.4 | 74.3 | 45.3
door | 77.9 | 49.6 | 77.7 | 53.6 | 32.4 | 83.2 | 62.4
table | 64.9 | 36.8 | 69.5 | 51.7 | 68.9 | 74.2 | 61.0
chair | 59.7 | 28.0 | 67.1 | 53.5 | 75.2 | 74.0 | 59.6
sofa | 43.6 | 11.1 | 20.7 | 10.3 | 8.2 | 62.4 | 26.1
bookcase | 48.2 | 31.9 | 64.8 | 39.1 | 55.1 | 68.1 | 51.2
board | 31.7 | 2.7 | 36.0 | 32.6 | 24.5 | 36.8 | 27.4
clutter | 57.7 | 29.1 | 57.1 | 48.8 | 41.8 | 64.8 | 49.9
mIoU | 59.6 | 36.4 | 60.7 | 46.0 | 48.3 | 71.0 | 53.7
OA | 83.4 | 71.8 | 86.0 | 81.1 | 83.0 | 88.9 | 82.4
Despite these advantages, it causes objects that are already infrequent to become even rarer, which worsens the data imbalance problem.
The segmentation results of GreenSeg are visualized in Fig. 5.5; the two rooms are chosen from area 6. To show the inner predictions clearly, we remove the ceiling and view the rooms from the top. As shown in the figure, we also provide the point cloud with RGB color, the ground truth, and the prediction of GSIP for comparison. We can observe an obvious improvement of GreenSeg over GSIP. For instance, GreenSeg predicts the board in the first room better, and its prediction has much less noise on the wall and beam.
Table 5.7: Ablation Study for ShapeNetPart Segmentation.

K-NN Metric | Encoder Aggr. | Decoder LPE | Decoder Aggr. | Obj. Seg. (Pred.) | Obj. Seg. (GT) | merror | mlogloss | mae | mphe | mIoU (%)
L2 | max | | mean | | | ✓ | ✓ | | | 77.50
L2 | max | | mean | ✓ | | ✓ | ✓ | | | 80.10
L2 | max | | max | ✓ | | ✓ | ✓ | | | 81.61
L2 | max | | L1 | ✓ | | ✓ | ✓ | | | 81.92
L2 | max | | L2 | ✓ | | ✓ | ✓ | | | 82.17
L2 | L2 | | L2 | ✓ | | ✓ | ✓ | | | 81.81
L2 | L2 | | L1 | ✓ | | ✓ | ✓ | | | 81.64
L2 | L2 | | max | ✓ | | ✓ | ✓ | | | 82.70
L2 | L1 | | L1 | ✓ | | ✓ | ✓ | | | 81.67
L2 | L1 | | L2 | ✓ | | ✓ | ✓ | | | 82.05
L2 | L1 | | max | ✓ | | ✓ | ✓ | | | 82.90
L2 | L1 | ✓ | max | ✓ | | ✓ | ✓ | | | 83.47
L1 | L1 | ✓ | max | ✓ | | ✓ | ✓ | | | 83.69
L1 | L1 | ✓ | max | ✓ | | ✓ | ✓ | ✓ | ✓ | 83.71
L1 | L1 | ✓ | max | ✓ | | ✓ | ✓ | ✓ | | 83.72
L1 | L1 | ✓ | max | ✓ | | ✓ | ✓ | | ✓ | 83.77
L1 | L1 | ✓ | max | | ✓ | ✓ | ✓ | | ✓ | 84.35
Ablation Study
We further conduct ablation studies for both part segmentation in Table 5.7 and semantic segmentation in Table 5.8. The first row of Table 5.7 is the basic GSIP setting applied to the ShapeNetPart dataset; we then make changes in the following rows to validate our setting. First, the object-wise segmentation helps guide the part segmentation, which improves the basic GSIP setting from 77.5% to 80.1%. Second, L1 pooling in the encoder and max pooling in the decoder gives the best performance among the different combinations of aggregations. Third, adding local positional encoding to the decoder helps the feature extractor capture more representative point-wise features. Fourth, the L1 distance works better than the L2 distance in K-NN. Fifth, the mean pseudo Huber error (mphe) is added to the XGBoost loss functions, which are the multiclass classification error rate (merror) and the multiclass log loss (mlogloss) in GSIP. Finally, we also test guiding the segmentation with the ground truth object label. Comparing the last two rows, we find that our best performance (83.77%) is very close to the ideal case (84.35%). To clarify, 84.35% is not the best possible performance that green learning can achieve; with better feature extraction, it can be pushed further.
A similar ablation study is conducted for semantic segmentation. In Table 5.8, the selection of the K-NN distance metric, adding local positional encoding to the decoder, and adding losses to XGBoost all show their effectiveness. Meanwhile, we find that using mean pooling to aggregate features in the decoder works better than max pooling for semantic segmentation. This is reasonable because semantic segmentation targets large-scale point clouds while part segmentation targets small-scale objects, and mean pooling helps reduce the influence of noise.
Table 5.8: Ablation Study for Semantic Segmentation of S3DIS Area 5.

K-NN Metric | Decoder LPE | Decoder Aggr. | Loss Added: mae | Loss Added: mphe | mIoU (%)
L2 | | max | | | 45.80
L2 | | mean | | | 46.32
L2 | ✓ | mean | | | 47.49
L1 | ✓ | mean | | | 47.78
L1 | ✓ | mean | ✓ | ✓ | 47.88
L1 | ✓ | mean | ✓ | | 48.09
L1 | ✓ | mean | | ✓ | 48.25
Comparison of Model and Time Complexities
The model sizes and time complexities are compared in Table 5.9. For a fair comparison, we retrained GSIP, PointNet and PointNet++ using the same device for consistency. The PointNet and PointNet++ models used are based on a third-party implementation (https://github.com/yanx27/Pointnet_Pointnet2_pytorch). They are trained for 100 epochs using a single A100 GPU. For GSIP and GreenSeg, the feature extraction uses the CPU only while XGBoost is trained on an A100 GPU. As can be seen in the table, the training of GreenSeg can be done within 38 minutes, which is much less than that of PointNet and PointNet++. Moreover, GreenSeg's parameter number is 35 times smaller than that of PointNet and 10 times smaller than that of PointNet++.
ModelNet40 Classification
To show the effectiveness of GreenSeg's features, we evaluate its classification accuracy on the ModelNet40 dataset and compare it with other methods in Table 5.10. For object classification, we simply extract a global feature from the point-wise features that we obtained for segmentation.
Table 5.9: Comparison of time and model complexities for semantic segmentation of S3DIS Area 5.

Method | Stage | Device | Training time | Parameter #
PointNet [65] (100 epochs) | - | A100 GPU | 26 hours | 3529558
PointNet++ [67] (100 epochs) | - | A100 GPU | 42 hours | 968269
GSIP [97] | feature extraction | CPU | 18 minutes | 8
GSIP [97] | XGBoost | A100 GPU | 20 minutes | 24320
GreenSeg | feature extraction | CPU | 3 minutes | 8
GreenSeg | XGBoost | A100 GPU | 35 minutes | 97280
Then linear least squares regression is adopted to predict a label for each object using the global feature. As shown in the table, GreenSeg outperforms PointNet, PointHop and the unsupervised deep learning based methods, which is consistent with its performance in the segmentation task. GreenSeg is not as powerful as PointHop++ and SR-PointHop in classification; the reason is that GreenSeg focuses on local aggregation, which is good for learning fine-grained details, while the other two methods focus on global aggregation, which is beneficial for learning the global structure of an object.
5.3.3 Discussion
A novel green point cloud segmentation method, GreenSeg, was proposed in this work. GreenSeg is developed to segment both small-scale and large-scale point clouds efficiently. A green and simple local aggregation strategy is adopted to enrich the local context and learn fine-grained details. We evaluated the segmentation performance of GreenSeg on the ShapeNetPart dataset and the S3DIS dataset, showing that GreenSeg outperforms deep learning methods such as PointNet that demand larger model sizes and higher computational costs. As for future work, we can extend GreenSeg to more complex and noisy point cloud environments that are closer to real-world settings.
Table 5.10: ModelNet40 Classification Performance (%).

Learning Scheme | Method | class-avg | overall
Deep, Supervised | PointNet [65] | 86.2 | 89.2
Deep, Supervised | PointNet++ [67] | - | 90.7
Deep, Supervised | PointCNN [55] | 88.1 | 92.5
Deep, Supervised | DGCNN [81] | 90.2 | 92.9
Deep, Supervised | Point Trans. [101] | 90.6 | 93.7
Deep, Unsupervised | LFD-GAN [2] | - | 85.7
Deep, Unsupervised | FoldingNet [88] | - | 88.9
Deep, Unsupervised | PointCapsNet [102] | - | 88.9
Deep, Unsupervised | MultiTask [31] | - | 89.1
Green, Unsupervised | UFF [96] | - | 90.4
Green, Unsupervised | PointHop [99] | 84.4 | 89.1
Green, Unsupervised | PointHop++ [98] | 87 | 91.1
Green, Unsupervised | SR-PointHop | 86.0 | 90.6
Green, Unsupervised | GreenSeg | 86.9 | 89.6
Moreover, to compete with state-of-the-art deep learning based methods, it is desirable to further enrich the feature set.
5.4 Conclusion
We rethink the local and global aggregation in point cloud classification and segmentation in this chapter. Two green learning based methods, SR-PointHop and GreenSeg, are proposed. SR-PointHop is built upon the single resolution point cloud representation for point cloud classification. It simplifies the PointHop model by reducing the model depth to a single hop and enriches the information of a point cloud object with more geometric aggregations of various local representations. GreenSeg is designed for segmenting both small-scale and large-scale point clouds efficiently. The shortcomings of UFF and GSIP are analyzed, and a green and simple local aggregation strategy is adopted to enrich the local context. Object-wise segmentation is provided if object labels are available.
Chapter 6
Conclusion and Future Work
6.1 Summary of the Research
This dissertation focuses on two problems: point cloud classification and segmentation.
In the first work, we developed explainable and green methods for point cloud classifica-
tion. First, a new and explainable machine learning method, called the PointHop method,
was proposed for 3D point cloud classification. PointHop is mathematically transparent and
its feature extraction is an unsupervised procedure since no class labels are needed in this
stage. PointHop is green, it has an extremely low training complexity because it requires
only one forward pass to learn parameters of the system. Second, we further proposed a
lightweight learning model for point cloud classification, called PointHop++, by overcom-
ing the shortcomings of the PointHop method. The lightweight comes in two flavors: 1)
its model complexity is reduced in terms of the model parameter number by constructing
a tree-structure feature learning system and 2) features are ordered and discriminant ones
are selected automatically based on the cross-entropy criterion. Experimental results on
the ModelNet40 dataset showed that PointHop and PointHop++ offer classification per-
formance that is comparable with state-of-the-art methods while demanding much lower
training complexity.
In the second work, we extended the PointHop method to explainable and green point cloud segmentation, since point cloud segmentation is more complex and requires much more computational resources. First, an unsupervised feedforward feature (UFF) learning scheme for joint classification and part segmentation of 3D point cloud objects was proposed. UFF is green since it can perform multiple tasks in one pass and has good generalization ability. It learns global shape features through the encoder and local point features through the concatenated encoder-decoder architecture. Experimental results on the ShapeNet and ShapeNetPart datasets showed that the UFF method has good generalization ability. Second, an efficient semantic segmentation method for large-scale indoor point clouds, called GSIP, was proposed. GSIP is green since it has significantly lower computational complexity and a much smaller model size. It contains two novel ingredients: a new room-style method for data pre-processing and a new point cloud feature extractor which is extended from PointHop with lower memory and computational costs while preserving the segmentation performance. Experiments on the S3DIS dataset showed that GSIP outperforms the pioneering work in terms of performance accuracy, model size and computational complexity.
In the third work, we rethought the local and global aggregation in point cloud classification and segmentation. First, we proposed the SR-PointHop method for green point cloud classification, which is built upon the single resolution point cloud representation. SR-PointHop simplifies the PointHop model by reducing the model depth to a single hop and enriches the information of a point cloud object with more geometric aggregations of various local representations. Experimental results on the ModelNet40 dataset showed that SR-PointHop is capable of classifying point clouds using a much smaller model size and can run efficiently on the CPU. Second, we proposed a novel green point cloud segmentation method, called GreenSeg. Different from UFF and GSIP, which are designed for small-scale objects only and large-scale scenes only, GreenSeg is developed to segment both small-scale and large-scale point clouds efficiently. GreenSeg adopts a green and simple local aggregation strategy to enrich the local context and provides the option for object-wise segmentation if object labels are available. Extensive experiments were conducted on the ShapeNetPart dataset and the S3DIS dataset, showing that GreenSeg is comparable with deep learning methods that need very complex local aggregation and backpropagation.
We summarize the comparison between traditional, deep learning and green learning
[47, 57] methods qualitatively in several aspects in Table 6.1. As shown in the table, green
point cloud learning methods have several advantages:
1. data-driven but not data-eager (i.e. robust with less data),
2. no supervision needed for feature extraction,
3. good generalization ability, where the extracted features can be used for multiple tasks such as object classification, part segmentation and registration,
4. mathematically interpretable,
5. smaller model sizes and lower computation complexity.
Table 6.1: Comparison of traditional, deep learning (DL) and green learning (GL) methods.

Feature | Traditional | DL | GL
Data eagerness | low | high | middle
Supervision | low | high | middle
Model size | small | large | middle
Time complexity | low | high | middle
Interpretation | easy | hard | easy
Performance | poor | good | good
6.2 Future Research Topics
The proposed explainable and green point cloud learning solutions have advantages in terms of mathematical interpretability, model size and computational complexity, so they have great potential for further generalization. Beyond point cloud classification and segmentation, successful attempts have been made in other tasks such as point cloud registration, odometry and pose estimation [35, 36, 34, 39, 38]. As for future extensions, it is desirable to generalize the GSIP method from large-scale indoor point clouds to large-scale outdoor point clouds, which requires both high accuracy and high efficiency. The latter has many real-world applications. We bring up two challenging research problems:
• Semantic segmentation of large-scale outdoor point clouds. Given a point cloud, the goal of semantic segmentation is to label every point as one of the semantic categories. As shown in Fig. 6.1, outdoor scenes are more complex than indoor scenes and the requirement for real-time processing is also higher.
• Fast 3D object detection. Given a point cloud, the goal of object detection is to localize the 3D objects in the scene and estimate their shapes and semantic categories. As shown in Fig. 6.2 (b),
the outputs are 3D oriented bounding boxes and the corresponding class labels for the
3D objects. (A 3D oriented box has seven parameters: 3D location of the box center,
length, height and width of the box, and the yaw.)
6.2.1 Semantic segmentation of large-scale outdoor point clouds
We proposed the GSIP method for efficient semantic segmentation of large-scale indoor point clouds in this thesis. We evaluated the performance of GSIP against PointNet [65] on the indoor S3DIS dataset and showed that GSIP outperforms PointNet in terms of performance accuracy, model size and computational complexity. This is guaranteed by GSIP due to its feedforward strategy and subspace approximation nature. More importantly, it takes real large-scale semantic segmentation into consideration and can segment 10K to 200K points at one time. However, GSIP has several shortcomings:
• the unsupervised feature learning is very fast compared with deep learning methods, but it is not as powerful as deep learning methods;
• the data imbalance problem is not well examined in GSIP.
To boost the segmentation, we need to overcome these weaknesses of GSIP first.
Moreover, outdoor scenes are even harder than indoor scenes. The environment is complex: we may encounter unknown objects, noise and an uneven distribution of points, and the environment can change a lot over time. Outdoor objects have more complex structures than indoor scene objects and they are no longer static. For example, a moving object may generate different point clouds. SemanticKITTI [7] is a dataset composed of large-scale outdoor scene point clouds, which is a famous benchmark for semantic segmentation.
(a) PointNet++ (2.4s). (b) SPG (10.8s). (c) RandLA-Net (0.04s). (d) Ground Truth.
Figure 6.1: Semantic segmentation results of PointNet++ [67], SPG [50] and RandLA-Net [32] on SemanticKITTI [7]. RandLA-Net takes only 0.04s to directly process a large point cloud with 10^5 points over 150 × 130 × 10 meters in 3D space, which is up to 200× faster than SPG. Red circles highlight the superior segmentation accuracy of RandLA-Net. The figure is from [32].
An example of an outdoor scene is shown in Fig. 6.1. It is quite different from an indoor scene, and sparsity is also a core problem to solve in this task.
Besides, outdoor scene semantic segmentation has higher requirements on efficiency due to its application in intelligent systems such as autonomous driving. Some qualitative semantic segmentation results on the SemanticKITTI dataset [7] of state-of-the-art works, i.e., PointNet++ [67], SPG [50] and RandLA-Net [32], together with their processing times, are shown in Fig. 6.1. RandLA-Net [32] takes only 0.04s to directly process a large point cloud with 10^5 points over 150 × 130 × 10 meters in 3D space, which is up to 200× faster than SPG [50]. We will use the green point cloud learning strategy for semantic segmentation of large-scale outdoor point clouds and aim at solving the problems mentioned above.
6.2.2 Fast 3D Object Detection
Detecting 3D objects in urban environments is a fundamental and challenging problem for motion planning in order to plan a safe route in autonomous driving. Specifically, autonomous vehicles (AVs) need to detect and track moving objects such as pedestrians, cyclists and vehicles in real time. Therefore, the computation speed is critical. AVs carry a variety of sensors such as cameras and LiDAR (Light Detection and Ranging). Recent approaches for 3D object detection either fuse the RGB image from the camera with the point cloud from the LiDAR or use the point cloud alone. Point clouds are irregular and computationally demanding, but they are crucial for accurate 3D estimation compared with 2D images. Therefore, converting and utilizing point cloud data more efficiently and effectively has become the primary problem in the detection task, which is also quite interesting and challenging for us.
There are two main streams in dealing with unstructured point cloud data: voxel-based and point-based methods. Voxel-based methods convert the sparse point cloud into compact representations with a regular shape so that existing 2D detection methods can be adopted without extra effort. The conversion is conducted either by projecting into images [16] or by dividing into equally distributed voxels [103, 85, 87, 51]. The features fed from each voxel to the backbone 2D CNN are either handcrafted [87] or generated by PointNet-like backbones [65, 67]. Voxel-based methods are straightforward and efficient while suffering from information loss and a performance bottleneck.
(a) Comparison of efficiency and performance. (b) Qualitative results of PP.
Figure 6.2: Comparison of some state-of-the-art 3D object detection methods in efficiency and performance. M: MV3D [16], A: AVOD [42], C: ContFuse [56], V: VoxelNet [103], F: Frustum PointNet [64], S: SECOND [85], P+: PIXOR++ [87], PP: PointPillars [51]. The figure is from [51].
Point-based methods [91, 75] take the point cloud as input and output bounding boxes on each point, which is usually more accurate but less efficient than voxel-based methods. Set abstraction and feature propagation are two basic modules of point-based methods, where the former is for downsampling and extracting context features and the latter is for upsampling and broadcasting features to points. The sampling strategies are crucial to point-based methods and are an active research field. A comparison of the efficiency and performance of some state-of-the-art methods on the KITTI object detection dataset [26] is shown in Fig. 6.2 (a).
The KITTI object detection dataset [26] is composed of large-scale outdoor scene point clouds and is a famous benchmark for object detection. Examples are shown in Fig. 6.2 (b). It has 7481 training point clouds and 7518 testing point clouds, comprising 80,256 labeled objects in total. There are mainly three types of objects: car, pedestrian and cyclist. As one may notice, the dataset is extremely imbalanced; there are far more background points than points on the objects of interest. Moreover, the objects' points are sparse when they are farther from the LiDAR.
Since green point cloud learning has much lower time complexity and model size, it is suitable to apply it to the object detection task to reduce computational costs, since efficiency does matter in object detection. We will apply the green point cloud learning strategy to large-scale outdoor point clouds for 3D object detection and aim at solving the problems mentioned above. Our green point cloud learning can be incorporated into the feature construction process while retaining the detection head.
Bibliography
[1] Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas Guibas. Learn-
ing representations and generative models for 3d point clouds. arXiv preprint
arXiv:1707.02392, 2017.
[2] Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas Guibas. Rep-
resentation learning and adversarial generation of 3d point clouds. arXiv preprint
arXiv:1707.02392, 2(3):4, 2017.
[3] Mikaela Angelina Uy and Gim Hee Lee. Pointnetvlad: Deep point cloud based retrieval
for large-scale place recognition. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 4470–4479, 2018.
[4] Iro Armeni, Sasha Sax, Amir R Zamir, and Silvio Savarese. Joint 2d-3d-semantic data
for indoor scene understanding. arXiv preprint arXiv:1702.01105, 2017.
[5] Iro Armeni, Ozan Sener, Amir R Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer,
and Silvio Savarese. 3d semantic parsing of large-scale indoor spaces. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, pages 1534–1543,
2016.
[6] Mathieu Aubry, Ulrich Schlickewei, and Daniel Cremers. The wave kernel signature: A
quantum mechanical approach to shape analysis. In 2011 IEEE international confer-
ence on computer vision workshops (ICCV workshops), pages 1626–1633. IEEE, 2011.
[7] Jens Behley, Martin Garbade, Andres Milioto, Jan Quenzel, Sven Behnke, Cyrill Stach-
niss, and Jurgen Gall. Semantickitti: A dataset for semantic scene understanding of
lidar sequences. In Proceedings of the IEEE/CVF International Conference on Com-
puter Vision, pages 9297–9307, 2019.
[8] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
[9] Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. Generative and
discriminative voxel modeling with convolutional neural networks. arXiv preprint
arXiv:1608.04236, 2016.
[10] Michael M Bronstein and Iasonas Kokkinos. Scale-invariant heat kernel signatures for
non-rigid shape recognition. In 2010 IEEE Computer Society Conference on Computer
Vision and Pattern Recognition, pages 1704–1711. IEEE, 2010.
[11] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang,
Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An
information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
[12] Nesrine Chehata, Li Guo, and Clément Mallet. Airborne lidar feature selection for
urban classification using random forests. In Laserscanning, 2009.
[13] Ding-Yun Chen, Xiao-Pei Tian, Yu-Te Shen, and Ming Ouhyoung. On visual similarity
based 3d model retrieval. In Computer graphics forum, volume 22, pages 223–232.
Wiley Online Library, 2003.
[14] Hong-Shuo Chen, Mozhdeh Rouhsedaghat, Hamza Ghani, Shuowen Hu, Suya You,
and C-C Jay Kuo. Defakehop: A light-weight high-performance deepfake detector.
In 2021 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6.
IEEE, 2021.
[15] Tianqi Chen, Tong He, Michael Benesty, Vadim Khotilovich, Yuan Tang, Hyunsu Cho,
et al. Xgboost: extreme gradient boosting. R package version 0.4-2, 1(4):1–4, 2015.
[16] Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. Multi-view 3d object de-
tection network for autonomous driving. In Proceedings of the IEEE conference on
Computer Vision and Pattern Recognition, pages 1907–1915, 2017.
[17] Yueru Chen and C-C Jay Kuo. Pixelhop: A successive subspace learning (ssl) method
for object recognition. Journal of Visual Communication and Image Representation,
70:102749, 2020.
[18] Yueru Chen, Mozhdeh Rouhsedaghat, Suya You, Raghuveer Rao, and C-C Jay Kuo.
Pixelhop++: A small successive-subspace-learning-based (ssl-based) model for image
classification. In 2020 IEEE International Conference on Image Processing (ICIP),
pages 3294–3298. IEEE, 2020.
[19] Yueru Chen, Zhuwei Xu, Shanshan Cai, Yujian Lang, and C-C Jay Kuo. A saak
transform approach to efficient, scalable and robust handwritten digits recognition. In
2018 Picture Coding Symposium (PCS), pages 174–178. IEEE, 2018.
[20] Yueru Chen, Yijing Yang, Wei Wang, and C-C Jay Kuo. Ensembles of feedforward-
designed convolutional neural networks. arXiv preprint arXiv:1901.02154, 2019.
[21] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning,
20(3):273–297, 1995.
[22] Thomas G Dietterich. Ensemble methods in machine learning. In International work-
shop on multiple classifier systems , pages 1–15. Springer, 2000.
[23] Yuval Eldar, Michael Lindenbaum, Moshe Porat, and Yehoshua Y Zeevi. The far-
thest point strategy for progressive image sampling. IEEE Transactions on Image
Processing, 6(9):1305–1315, 1997.
[24] Yifan Feng, Zizhao Zhang, Xibin Zhao, Rongrong Ji, and Yue Gao. Gvcnn: Group-
view convolutional neural networks for 3d shape recognition. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pages 264–272, 2018.
[25] Yutong Feng, Yifan Feng, Haoxuan You, Xibin Zhao, and Yue Gao. Meshnet: Mesh
neural network for 3d shape representation. arXiv preprint arXiv:1811.11424, 2018.
[26] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous
driving? the kitti vision benchmark suite. In 2012 IEEE conference on computer
vision and pattern recognition, pages 3354–3361. IEEE, 2012.
[27] Meng-Hao Guo, Jun-Xiong Cai, Zheng-Ning Liu, Tai-Jiang Mu, Ralph R Martin, and
Shi-Min Hu. Pct: Point cloud transformer. arXiv preprint arXiv:2012.09688, 2020.
[28] Meng-Hao Guo, Jun-Xiong Cai, Zheng-Ning Liu, Tai-Jiang Mu, Ralph R Martin, and
Shi-Min Hu. Pct: Point cloud transformer. Computational Visual Media, 7(2):187–
199, 2021.
[29] Timo Hackel, Nikolay Savinov, Lubor Ladicky, Jan D Wegner, Konrad Schindler, and
Marc Pollefeys. Semantic3d. net: A new large-scale point cloud classification bench-
mark. arXiv preprint arXiv:1704.03847, 2017.
[30] Timo Hackel, Jan D Wegner, and Konrad Schindler. Fast semantic segmentation of
3d point clouds with strongly varying density. ISPRS annals of the photogrammetry,
remote sensing and spatial information sciences, 3:177–184, 2016.
[31] Kaveh Hassani and Mike Haley. Unsupervised multi-task feature learning on point
clouds. In Proceedings of the IEEE International Conference on Computer Vision,
pages 8160–8171, 2019.
[32] Qingyong Hu, Bo Yang, Linhai Xie, Stefano Rosa, Yulan Guo, Zhihua Wang, Niki
Trigoni, and Andrew Markham. Randla-net: Efficient semantic segmentation of large-
scale point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 11108–11117, 2020.
[33] Mingyang Jiang, Yiran Wu, Tianqi Zhao, Zelin Zhao, and Cewu Lu. Pointsift: A
sift-like network module for 3d point cloud semantic segmentation. arXiv preprint
arXiv:1807.00652, 2018.
[34] Pranav Kadam, Min Zhang, Jiahao Gu, Shan Liu, and C-C Jay Kuo. Greenpco: An
unsupervised lightweight point cloud odometry method. In 2022 IEEE 24th Interna-
tional Workshop on Multimedia Signal Processing (MMSP), pages 01–06. IEEE, 2022.
[35] Pranav Kadam, Min Zhang, Shan Liu, and C-C Jay Kuo. Unsupervised point cloud
registration via salient points analysis (spa). In 2020 IEEE International Conference
on Visual Communications and Image Processing (VCIP), pages 5–8. IEEE, 2020.
[36] Pranav Kadam, Min Zhang, Shan Liu, and C-C Jay Kuo. Gpco: An unsupervised
green point cloud odometry method. arXiv preprint arXiv:2112.04054, 2021.
[37] Pranav Kadam, Min Zhang, Shan Liu, and C-C Jay Kuo. R-pointhop: A
green, accurate and unsupervised point cloud registration method. arXiv preprint
arXiv:2103.08129, 2021.
[38] Pranav Kadam, Min Zhang, Shan Liu, and C-C Jay Kuo. R-pointhop: A green,
accurate, and unsupervised point cloud registration method. IEEE Transactions on
Image Processing, 31:2710–2725, 2022.
[39] Pranav Kadam, Qingyang Zhou, Shan Liu, and C-C Jay Kuo. Pcrp: Unsupervised
point cloud object retrieval and pose estimation. arXiv preprint arXiv:2202.07843,
2022.
[40] Ioannis Katsavounidis, C-C Jay Kuo, and Zhen Zhang. A new initialization technique
for generalized lloyd iteration. IEEE Signal processing letters, 1(10):144–146, 1994.
[41] Roman Klokov and Victor Lempitsky. Escape from cells: Deep kd-networks for the
recognition of 3d point cloud models. In Proceedings of the IEEE international con-
ference on computer vision, pages 863–872, 2017.
[42] Jason Ku, Melissa Mozifian, Jungwook Lee, Ali Harakeh, and Steven L Waslander.
Joint 3d proposal generation and object detection from view aggregation. In 2018
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages
1–8. IEEE, 2018.
[43] C-C Jay Kuo. Understanding convolutional neural networks with a mathematical
model. Journal of Visual Communication and Image Representation, 41:406–413, 2016.
[44] C-C Jay Kuo. The cnn as a guided multilayer recos transform [lecture notes]. IEEE
signal processing magazine, 34(3):81–89, 2017.
[45] C-C Jay Kuo and Yueru Chen. On data-driven saak transform. Journal of Visual
Communication and Image Representation, 50:237–246, 2018.
[46] C-C Jay Kuo and Azad M Madni. Green learning: Introduction, examples and outlook.
arXiv preprint arXiv:2210.00965, 2022.
[47] C-C Jay Kuo and Azad M Madni. Green learning: Introduction, examples and outlook.
Journal of Visual Communication and Image Representation, 90:103685, 2023.
[48] C-C Jay Kuo, Min Zhang, Siyang Li, Jiali Duan, and Yueru Chen. Interpretable con-
volutional neural networks via feedforward design. Journal of Visual Communication
and Image Representation, 60:346–359, 2019.
[49] Loic Landrieu, Hugo Raguet, Bruno Vallet, Clément Mallet, and Martin Weinmann.
A structured regularization framework for spatially smoothing semantic labelings of 3d
point clouds. ISPRS Journal of Photogrammetry and Remote Sensing, 132:102–118,
2017.
[50] Loic Landrieu and Martin Simonovsky. Large-scale point cloud semantic segmentation
with superpoint graphs. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 4558–4567, 2018.
[51] Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar
Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages
12697–12705, 2019.
[52] Xuejing Lei, Wei Wang, and C-C Jay Kuo. Genhop: An image generation method
based on successive subspace learning. In 2022 IEEE International Symposium on
Circuits and Systems (ISCAS), pages 3314–3318. IEEE, 2022.
[53] Xuejing Lei, Ganning Zhao, Kaitai Zhang, and C-C Jay Kuo. Tghop: an explainable,
efficient, and lightweight method for texture generation. APSIPA Transactions on
Signal and Information Processing, 10, 2021.
[54] Jiaxin Li, Ben M Chen, and Gim Hee Lee. So-net: Self-organizing network for point
cloud analysis. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 9397–9406, 2018.
[55] Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen.
Pointcnn: Convolution on x-transformed points. Advances in neural information pro-
cessing systems, 31:820–830, 2018.
[56] Ming Liang, Bin Yang, Shenlong Wang, and Raquel Urtasun. Deep continuous fusion
for multi-sensor 3d object detection. In Proceedings of the European Conference on
Computer Vision (ECCV), pages 641–656, 2018.
[57] Shan Liu, Min Zhang, Pranav Kadam, and Chung-Chieh Jay Kuo. 3D Point Cloud
Analysis: Traditional, Deep Learning, and Explainable Machine Learning Methods.
Springer, 2021.
[58] David G Lowe. Distinctive image features from scale-invariant keypoints. International
Journal of Computer Vision, 60(2):91–110, 2004.
[59] Clément Mallet, Frédéric Bretar, Michel Roux, Uwe Soergel, and Christian Heipke.
Relevance assessment of full-waveform lidar data for urban area classification. ISPRS
Journal of Photogrammetry and Remote Sensing, 66(6):S71–S84, 2011.
[60] Daniel Maturana and Sebastian Scherer. Voxnet: A 3d convolutional neural network
for real-time object recognition. In 2015 IEEE/RSJ International Conference on In-
telligent Robots and Systems (IROS), pages 922–928. IEEE, 2015.
[61] Carsten Moenning and Neil A Dodgson. Fast marching farthest point sampling. Tech-
nical report, University of Cambridge, Computer Laboratory, 2003.
[62] Panagiotis Papadakis, Ioannis Pratikakis, Theoharis Theoharis, and Stavros Peranto-
nis. Panorama: A 3d shape descriptor based on panoramic views for unsupervised 3d
object retrieval. International Journal of Computer Vision, 89(2-3):177–192, 2010.
[63] Florent Poux and Roland Billen. Voxel-based 3d point cloud semantic segmentation:
unsupervised geometric and relationship featuring vs deep learning methods. ISPRS
International Journal of Geo-Information, 8(5):213, 2019.
[64] Charles R Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J Guibas. Frustum point-
nets for 3d object detection from rgb-d data. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pages 918–927, 2018.
[65] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learn-
ing on point sets for 3d classification and segmentation. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.
[66] Charles R Qi, Hao Su, Matthias Nießner, Angela Dai, Mengyuan Yan, and Leonidas J
Guibas. Volumetric and multi-view cnns for object classification on 3d data. In Pro-
ceedings of the IEEE conference on computer vision and pattern recognition, pages
5648–5656, 2016.
[67] Charles R Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical
feature learning on point sets in a metric space. arXiv preprint arXiv:1706.02413,
2017.
[68] Dario Rethage, Johanna Wald, Jurgen Sturm, Nassir Navab, and Federico Tombari.
Fully-convolutional point networks for large-scale point clouds. In Proceedings of the
European Conference on Computer Vision (ECCV), pages 596–611, 2018.
[69] Gernot Riegler, Ali Osman Ulusoy, and Andreas Geiger. Octnet: Learning deep 3d rep-
resentations at high resolutions. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 3577–3586, 2017.
[70] Lior Rokach. Ensemble-based classifiers. Artificial Intelligence Review, 33(1-2):1–39,
2010.
[71] Mozhdeh Rouhsedaghat, Yifan Wang, Xiou Ge, Shuowen Hu, Suya You, and C-C Jay
Kuo. Facehop: A light-weight low-resolution face gender classification method. arXiv
preprint arXiv:2007.09510, 2020.
[72] Radu Bogdan Rusu, Nico Blodow, and Michael Beetz. Fast point feature histograms
(fpfh) for 3d registration. In 2009 IEEE international conference on robotics and
automation, pages 3212–3217. IEEE, 2009.
[73] Radu Bogdan Rusu, Nico Blodow, Zoltan Csaba Marton, and Michael Beetz. Aligning
point cloud views using persistent feature histograms. In 2008 IEEE/RSJ International
Conference on Intelligent Robots and Systems, pages 3384–3391. IEEE, 2008.
[74] Yiru Shen, Chen Feng, Yaoqing Yang, and Dong Tian. Neighbors do help: Deeply
exploiting local structures of point clouds. arXiv preprint arXiv:1712.06760, 1(2),
2017.
[75] Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li. Pointrcnn: 3d object proposal
generation and detection from point cloud. In Proceedings of the IEEE/CVF conference
on computer vision and pattern recognition, pages 770–779, 2019.
[76] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-
view convolutional neural networks for 3d shape recognition. In Proceedings of the
IEEE international conference on computer vision, pages 945–953, 2015.
[77] Jian Sun, Maks Ovsjanikov, and Leonidas Guibas. A concise and provably informative
multi-scale signature based on heat diffusion. In Computer graphics forum, volume 28,
pages 1383–1392. Wiley Online Library, 2009.
[78] Hugues Thomas, Charles R Qi, Jean-Emmanuel Deschaud, Beatriz Marcotegui,
François Goulette, and Leonidas J Guibas. Kpconv: Flexible and deformable con-
volution for point clouds. In Proceedings of the IEEE/CVF international conference
on computer vision, pages 6411–6420, 2019.
[79] Federico Tombari, Samuele Salti, and Luigi Di Stefano. Unique signatures of his-
tograms for local surface description. In European conference on computer vision,
pages 356–369. Springer, 2010.
[80] Kiri Wagstaff, Claire Cardie, Seth Rogers, Stefan Schrödl, et al. Constrained k-means
clustering with background knowledge. In Icml, volume 1, pages 577–584, 2001.
[81] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and
Justin M Solomon. Dynamic graph cnn for learning on point clouds. ACM Transactions
on Graphics (TOG), 38(5):1–12, 2019.
[82] Svante Wold, Kim Esbensen, and Paul Geladi. Principal component analysis. Chemo-
metrics and intelligent laboratory systems, 2(1-3):37–52, 1987.
[83] Wenxuan Wu, Zhongang Qi, and Li Fuxin. Pointconv: Deep convolutional networks
on 3d point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 9621–9630, 2019.
[84] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang,
and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In
Proceedings of the IEEE conference on computer vision and pattern recognition, pages
1912–1920, 2015.
[85] Yan Yan, Yuxing Mao, and Bo Li. Second: Sparsely embedded convolutional detection.
Sensors, 18(10):3337, 2018.
[86] Bin Yang, Ming Liang, and Raquel Urtasun. Hdnet: Exploiting hd maps for 3d object
detection. In Conference on Robot Learning, pages 146–155, 2018.
[87] Bin Yang, Wenjie Luo, and Raquel Urtasun. Pixor: Real-time 3d object detection
from point clouds. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 7652–7660, 2018.
[88] Yaoqing Yang, Chen Feng, Yiru Shen, and Dong Tian. Foldingnet: Point cloud auto-
encoder via deep grid deformation. In Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition, pages 206–215, 2018.
[89] Yijing Yang, Vasileios Magoulianitis, and C-C Jay Kuo. E-pixelhop: An enhanced
pixelhop method for object classification. arXiv preprint arXiv:2107.02966, 2021.
[90] Yijing Yang, Wei Wang, Hongyu Fu, and C-C Jay Kuo. On supervised feature selection
from high dimensional feature spaces. arXiv preprint arXiv:2203.11924, 2022.
[91] Zetong Yang, Yanan Sun, Shu Liu, and Jiaya Jia. 3dssd: Point-based 3d single stage
object detector. In Proceedings of the IEEE/CVF conference on computer vision and
pattern recognition, pages 11040–11048, 2020.
[92] Li Yi, Vladimir G Kim, Duygu Ceylan, I Shen, Mengyan Yan, Hao Su, Cewu Lu,
Qixing Huang, Alla Sheffer, Leonidas Guibas, et al. A scalable active framework for
region annotation in 3d shape collections. ACM Transactions on Graphics (TOG),
35(6):210, 2016.
[93] Haoxuan You, Yifan Feng, Rongrong Ji, and Yue Gao. Pvnet: A joint convolutional
network of point cloud and multi-view for 3d shape recognition. In 2018 ACM Multi-
media Conference on Multimedia Conference, pages 1310–1318. ACM, 2018.
[94] Cha Zhang and Yunqian Ma. Ensemble machine learning: methods and applications.
Springer, 2012.
[95] Kaitai Zhang, Bin Wang, Wei Wang, Fahad Sohrab, Moncef Gabbouj, and C-C Jay
Kuo. Anomalyhop: An ssl-based image anomaly localization method. arXiv preprint
arXiv:2105.03797, 2021.
[96] Min Zhang, Pranav Kadam, Shan Liu, and C-C Jay Kuo. Unsupervised feedforward
feature (uff) learning for point cloud classification and segmentation. In 2020 IEEE
International Conference on Visual Communications and Image Processing (VCIP),
pages 144–147. IEEE, 2020.
[97] Min Zhang, Pranav Kadam, Shan Liu, and C-C Jay Kuo. Gsip: Green semantic
segmentation of large-scale indoor point clouds. Pattern Recognition Letters, 164:9–
15, 2022.
[98] Min Zhang, Yifan Wang, Pranav Kadam, Shan Liu, and C-C Jay Kuo. Pointhop++:
A lightweight learning model on point sets for 3d classification. In 2020 IEEE Inter-
national Conference on Image Processing (ICIP), pages 3319–3323. IEEE, 2020.
[99] Min Zhang, Haoxuan You, Pranav Kadam, Shan Liu, and C-C Jay Kuo. Pointhop: An
explainable machine learning method for point cloud classification. IEEE Transactions
on Multimedia, 22(7):1744–1755, 2020.
[100] Zizhao Zhang, Haojie Lin, Xibin Zhao, Rongrong Ji, and Yue Gao. Inductive multi-
hypergraph learning and its application on view-based 3d object classification. IEEE
Transactions on Image Processing, 27(12):5957–5968, 2018.
[101] Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip HS Torr, and Vladlen Koltun. Point
transformer. In Proceedings of the IEEE/CVF International Conference on Computer
Vision, pages 16259–16268, 2021.
[102] Yongheng Zhao, Tolga Birdal, Haowen Deng, and Federico Tombari. 3d point capsule
networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 1009–1018, 2019.
[103] Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d
object detection. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 4490–4499, 2018.
Abstract
Point cloud processing is a fundamental but challenging research topic in 3D computer vision. In this dissertation, we study two related problems: point cloud classification and point cloud segmentation. Given a point cloud, the goal of classification is to assign the whole point cloud to one of a set of object categories, while the goal of segmentation is to label every point with one of a set of semantic categories. State-of-the-art point cloud classification and segmentation methods are based on deep neural networks. Although deep-learning-based methods offer good performance, their working principle is not transparent. Furthermore, they demand huge computational resources (e.g., long training time even with GPUs). Since it is challenging to deploy them on mobile or terminal devices, their applicability to real-world problems is hindered. To address these shortcomings, we design explainable and green solutions to point cloud classification and segmentation.
We first propose an explainable machine learning method, PointHop, for point cloud classification, and then reduce its model complexity and improve its performance with PointHop++. Next, we extend the PointHop method to explainable and green point cloud segmentation. Specifically, we propose an unsupervised feedforward feature (UFF) learning scheme for joint classification and part segmentation of 3D point clouds, as well as an efficient solution to semantic segmentation of large-scale indoor scene point clouds (i.e., the GSIP method). Finally, we rethink local and global aggregation in point cloud classification and segmentation, and propose SR-PointHop for green point cloud classification with a single-resolution representation and GreenSeg for efficient and effective segmentation of both small-scale and large-scale point clouds.