MACHINE LEARNING METHODS FOR 2D/3D SHAPE RETRIEVAL AND
CLASSIFICATION
by
Xiaqing Pan
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
May 2017
Copyright 2017 Xiaqing Pan
Contents

List of Tables
List of Figures
Abstract

1 Introduction
1.1 Significance of the Research
1.2 Related Previous Work
1.2.1 2D/3D Shape Retrieval
1.2.2 3D Shape Classification
1.3 Contributions of the Research
1.3.1 The Two-Stage Shape Retrieval (TSR) Method
1.3.2 The Irrelevance Filtering and Similarity Ranking (IF/SR) Method
1.3.3 The Volumetric CNN (VCNN) Method
1.4 Organization of the Dissertation

2 Background
2.1 Local Features for 2D Shapes
2.2 Local Features for 3D Shapes
2.3 Diffusion Process
2.4 Machine Learning Techniques
2.4.1 Random Forest
2.4.2 Spectral Clustering
2.4.3 Convolutional Neural Network (CNN)
2.5 Review of Existing Datasets
2.5.1 Performance Measurements
2.5.2 2D Shape Datasets
2.5.3 3D Shape Datasets

3 A Two-Stage Shape Retrieval (TSR) Method with Global and Local Features
3.1 Introduction
3.2 Related Work
3.3 Proposed TSR Method
3.3.1 System Overview
3.3.2 The ICF Stage (Stage I)
3.3.3 The LMR Stage (Stage II)
3.4 Experimental Results
3.4.1 MPEG-7 Shape Dataset
3.4.2 Kimia99 Shape Dataset
3.4.3 Tari1000 Shape Dataset
3.4.4 Unbalanced Shape Datasets
3.4.5 Complexity Analysis
3.5 Conclusion

4 3D Shape Retrieval via Irrelevance Filtering and Similarity Ranking (IF/SR)
4.1 Introduction
4.2 Proposed IF/SR Method
4.2.1 System Overview
4.2.2 Stage I: Irrelevance Filtering
4.2.3 Stage II: Similarity Ranking
4.3 Experimental Results
4.4 Conclusion

5 Design, Analysis and Application of A Volumetric Convolutional Neural Network
5.1 Introduction
5.2 Related Work
5.3 Proposed VCNN Method
5.3.1 System Overview
5.3.2 Shape Anchor Vectors (SAVs)
5.3.3 Network Parameters Selection
5.3.4 Confusion Sets Identification and Re-Classification
5.4 Experimental Results
5.5 Conclusion

6 Summary and Future Work
6.1 Summary of the Research
6.2 Future Work
6.2.1 2D/3D Shape Retrieval
6.2.2 3D Shape Classification

Bibliography
List of Tables

2.1 Bull's eye scores of several state-of-the-art methods for the MPEG-7 dataset.
2.2 Top k consistency of several shape retrieval methods for the Kimia99 dataset.
2.3 Bull's eye scores of several shape retrieval methods for the Tari1000 dataset.
2.4 Overview of SHREC contests from year 2009 to year 2014.
2.5 Retrieval performance of five participants in the SHREC12 generic shape retrieval contest.
2.6 Classification performance measured by ACA and AIA scores of several state-of-the-art methods for the ModelNet40 dataset.
3.1 Skeleton features for six shapes in Fig. 3.4.
3.2 Comparison of bull's eye scores with different cluster numbers for the TSR method.
3.3 Comparison of bull's eye scores of several state-of-the-art methods for the MPEG-7 dataset.
3.4 Comparison of top 20, 25, 30, 35, 40 retrieval accuracy for the MPEG-7 dataset.
3.5 Comparison of top N consistency of several shape retrieval methods for the Kimia99 dataset.
3.6 Comparison of bull's eye scores of several shape retrieval methods for the Tari1000 dataset.
3.7 Comparison of correct retrieval rates with non-uniformly distributed data.
3.8 Computation time of two off-line processes.
3.9 Computation time of four on-line processes.
4.1 Comparison of the First-Tier (FT) scores with different cluster numbers for the IF/SR method on the SHREC12 dataset. The best score is shown in bold.
4.2 Comparison of the NN, FT, ST, E and DCG scores of five state-of-the-art methods, the proposed IF/SR method, and the IF/SR method with LCDP post-processing for the SHREC12 dataset. The best score for each measurement is shown in bold.
4.3 Comparison of top 20, 25, 30, 35, 40 retrieval accuracy for the SHREC12 dataset, where the best results are shown in bold.
5.1 Comparison of network parameters of the proposed VCNN and VoxelNet, where the numbers in each cell follow the format $(m_j)^3 \times K_{j-1}$.
5.2 Comparison of ACA and AIA scores of several state-of-the-art methods for the ModelNet40 dataset.
List of Figures

1.1 An airplane object is represented by (a) a 2D shape, (b) a mesh model, and (c) volumetric data.
1.2 Examples of intra-class variations in 2D shapes: (a) articulation, (b) contour noise, (c) contour deformation, and (d) multiple projection angles. Every pair of shapes represents a class and each subfigure gives two exemplary classes.
1.3 Examples of intra-class variations in 3D shapes: (a) articulation, (b) semantic-level variation. Shapes in each subfigure are from the same class.
1.4 Examples of inter-class similarities in (a) 2D shapes, (b) 3D shapes. Every pair of shapes represents two shapes from two different classes.
1.5 The flow chart of a traditional 2D/3D shape retrieval system.
1.6 Comparison of retrieval results between (a) IDSC and (b) SLVF (first row) and the proposed method (second row). The query is in a red box given in the leftmost column for each row.
1.7 Comparison of retrieval results among AIR (first row), AIR+DP in [25] and the proposed method. The query is in a red box given in the leftmost column for each row.
1.8 The flow chart of a CNN-based method introduced in [60].
1.9 Examples of three sets of confusing classes generated by using the method in [60]: (a) bookshelf and tv stand, (b) chair and stool, and (c) flower pot and plant.
2.1 Shape context example: (a) the log-polar coordinate system at a target point, (b) the shape context feature of the target point.
2.2 Comparison between the inner distance (red curve) and the Euclidean distance (yellow straight line) of an articulated shape under three different poses.
2.3 Interior noise influences the robustness of the inner distance.
2.4 Examples of defects in 3D meshes: (a) gap: disconnected leg parts, (b) holes, (c) self-intersection, (d) interior structure.
2.5 The cylinder coordinate system and its corresponding cylindrical projection in [70].
2.6 The basic flowchart of extracting SLVF from a 3D mesh.
2.7 One-level decision trees for different shapes.
2.8 The flow chart of LeNet [46] for the 2D digit classification problem.
2.9 Examples of three activation functions: (a) Sigmoid, (b) ReLU, and (c) Leaky ReLU.
2.10 Illustration of the relationship between input feature vectors and anchor vectors.
2.11 The flow chart of the MVCNN model [87].
2.12 The saliency maps of two examples, a dresser and a monitor, calculated by [87]. The first and third rows are the original views. The average saliency score is shown for each view at the bottom. The second and fourth rows are the pixel-wise saliency maps for each view. Darker pixels indicate higher saliency values. The top three views with the highest saliency scores are boxed in blue.
2.13 The flow chart of the VoxelNet method [60].
2.14 Examples from the MPEG-7 shape dataset. Every class is represented by two instances.
2.15 Examples from the Kimia99 shape dataset. Every class is represented by two instances.
2.16 Examples from the Tari1000 shape dataset. Every class is represented by two instances.
2.17 Examples from the SHREC12 generic 3D shape dataset. Every class is represented by two instances.
2.18 Examples from the ShapeNet 3D shape dataset. Every class is represented by two instances.
3.1 Comparison of retrieved shapes using AIR (the first row), AIR+DP (the second row) and the proposed TSR method (the third row) with respect to four query shapes: (a) Apple, (b) Key, (c) Cup, and (d) Bone.
3.2 The flow chart of the proposed TSR system.
3.3 Shape normalization results of six classes.
3.4 Illustration of skeleton feature extraction (from left to right in each image): the original shape, the initial skeleton and the pruned skeleton.
3.5 Five Haar-like filters used to extract wavelet features of a normalized shape.
3.6 Several clustered MPEG-7 dataset shapes using the spectral clustering method applied in the AIR feature space.
3.7 Illustration of a shared dominating feature (i.e., the aspect ratio) among all training fish samples outside of the yellow box, which could terminate the decision quickly and reject the testing sword-like fish in the red box.
3.8 Selecting relevant clusters for a query apple shape by thresholding a cost function as shown in Eq. (4.5).
3.9 Comparison of retrieved rank-ordered shapes (left-to-right in the top row followed by left-to-right in the second row within each black stripe). For each query case, retrieved results of IDSC+DP1, AIR+DP2 and TSR are shown in the first, second and third black stripes of all subfigures, respectively.
3.10 Comparison of retrieved rank-ordered shapes (left-to-right in the top row followed by left-to-right in the second row within each black stripe). For each query case, retrieved results of IDSC+DP1, AIR+DP2 and TSR are shown in the first, second and third black stripes of all subfigures, respectively.
3.11 Comparison of precision-and-recall curves of several methods for the MPEG-7 dataset.
3.12 Comparison of the precision and recall curves of several shape retrieval methods for the Kimia99 dataset.
3.13 Comparison of precision and recall curves of several shape retrieval methods for the Tari1000 dataset.
4.1 Illustrations of intra-class variation in each row and inter-class similarity in each pair of rows such as (a) and (b), (c) and (d), (e) and (f).
4.2 Errors generated by the global feature RISH. The query is marked by a red box: (a) chair, (b) desk lamp and (c) human. The errors are marked by yellow boxes in each subfigure.
4.3 Comparison of retrieved shapes using the DG1SIFT method (the first row) and the proposed IF/SR method (the second row) against five query shapes (from top to bottom): (a) bicycle, (b) round table, (c) desk lamp, (d) piano, and (e) home plant.
4.4 The flow chart of the proposed IF/SR system.
4.5 Examples of the original mesh (left) and its voxelization result (right) in each pair for three mesh models: piano, chair and hand.
4.6 Examples of the visualized reflective symmetry descriptors generated by [37] for four mesh models: chair, motorcycle, drum and insect. A point on the surface farther from the origin indicates a larger symmetry value in the corresponding direction.
4.7 Shape normalization results of four 3D shape classes.
4.8 Visualization of surface features of four shapes, where (a) and (b) provide two house shapes while (c) and (d) provide two truck shapes.
4.9 Illustration of the seven-band Haar filters.
4.10 The xyz-invariance (black box), -variance (red box) and rectilinearity (blue box) values of six examples from three classes: apartment house, fish and cup.
4.11 Several clustered SHREC12 shapes using the spectral clustering method with the DG1SIFT feature.
4.12 Selecting relevant clusters for the query desk lamp in Fig. 4.3(c) by thresholding a cost function shown in Eq. (4.4).
4.13 Comparison of the retrieved top 20 rank-ordered shapes. For each query case given in the leftmost column, retrieved results of DG1SIFT and the proposed IF/SR method are shown in the first and second rows of all subfigures, respectively.
4.14 Comparison of the retrieved top 20 rank-ordered shapes. For each query case given in the leftmost column, retrieved results of DG1SIFT and the proposed IF/SR method are shown in the first and second rows of all subfigures, respectively.
4.15 Comparison of precision and recall curves of the proposed IF/SR method and several benchmarking methods for the SHREC12 dataset.
5.1 Two sets of confusing classes: (a) desks (the first row) and tables (the second row), and (b) cups (the first row), flower pots (second row) and vases (third row). The confusion is attributed to the similar global appearance of these 3D shapes.
5.2 The flow chart of the proposed system.
5.3 Illustration of anchor vectors at the last stage of a CNN, where each anchor vector points to a 3D shape class. They are called the shape anchor vectors (SAVs). In this example, one SAV points to the airplane class while another SAV points to the cone class.
5.4 Illustration of 3D shapes in sub-classes obtained from (a) the Chair class, (b) the Mantel class and (c) the Sink class. We provide 16 representative shapes for each sub-class and encircle them with a blue box.
5.5 A mixed or pure set is enclosed by a blue box. Each mixed set contains multiple shape classes which are separated by green vertical bars. Two representative 3D shapes are shown for each class. Each row has one mixed set and several pure sets. The mixed set in the first row contains bookshelf, wardrobe, night stand, radio, xbox, dresser and tv stand; that in the second row contains cup, flower pot, lamp, plant and vase; that in the third row contains bench, desk and table; that in the fourth row contains chair and stool; that in the fifth row contains curtain and door; and that in the sixth row contains mantel and sink.
5.6 The split of the confusion set of bench, desk and table yields three pure subsets: (a) bench, (b) desk and (c) table, and three mixed subsets: (d) bench and table, (e) desk and table and (f) bench, desk and table.
5.7 The split of the confusion set of lamp, flower pot, cup and vase yields three pure subsets: (a) lamp, (b) plant and (c) cup, and three mixed subsets: (d) flower pot and vase, (e) flower pot and plant and (f) cup, flower pot, lamp and vase.
5.8 Three BIC curves measured under three different filter sizes $m_j \in \{3, 5, 7\}$ and seven different filter numbers $K_j \in \{32, 64, 128, 256, 512, 768, 1024\}$ for the first convolutional layer.
5.9 Comparison of per-class accuracy between the VoxelNet method (blue bars) and our VCNN design (yellow bars).
5.10 Four examples of corrected errors: (a) desk, (b) lamp, (c) chair, (d) vase. Each example has a testing sample on the top and its assigned subset on the bottom.
5.11 Four examples of uncorrected errors: (a) cup, (b) desk, (c) dresser, (d) plant. Each example includes a testing sample on the top and the assigned subset on the bottom.
6.1 Examples of some existing confusing pairs.
Abstract
Shape classification and retrieval are two important problems in both computer vision and computer graphics. Robust shape analysis contributes to many applications such as manufactured component recognition and retrieval, sketch-based shape retrieval, medical image analysis, 3D model repository management, etc. In this dissertation, we propose three methods to address three significant problems: 2D shape retrieval, 3D shape retrieval and 3D shape classification, respectively.

First, in the 2D shape retrieval problem, most state-of-the-art shape retrieval methods are based on local feature matching and ranking. Their retrieval performance is not robust since they may retrieve globally dissimilar shapes in high ranks. To overcome this challenge, we decompose the decision process into two stages. In the first irrelevant cluster filtering (ICF) stage, we consider both global and local features and use them to predict the relevance of gallery shapes with respect to the query. Irrelevant shapes are removed from the candidate shape set. After that, a local-features-based matching and ranking (LMR) method follows in the second stage. We apply the proposed TSR system to three shape datasets: MPEG-7, Kimia99 and Tari1000. We show that TSR outperforms all other existing methods. The robustness of TSR is demonstrated by its retrieval performance.

Second, a novel solution for the content-based 3D shape retrieval problem using an unsupervised clustering approach, which does not need any label information of 3D shapes, is presented. The proposed shape retrieval system consists of two modules in cascade: the irrelevance filtering (IF) module and the similarity ranking (SR) module. The IF module attempts to cluster gallery shapes that are similar to each other by examining global and local features simultaneously. However, shapes that are close in the local feature space can be distant in the global feature space, and vice versa. To resolve this issue, we propose a joint cost function that strikes a balance between the two distances. Irrelevant samples that are close in the local feature space but distant in the global feature space can be removed in this stage. The remaining gallery samples are ranked in the SR module using the local feature. The superior performance of the proposed IF/SR method is demonstrated by extensive experiments conducted on the popular SHREC12 dataset.

Third, the design, analysis and application of a volumetric convolutional neural network (VCNN) are studied to address the 3D shape classification problem. Although a large number of CNNs have been proposed in the literature, their design is empirical. In the design of the VCNN, we propose a feed-forward K-means clustering algorithm to determine the filter number and size at each convolutional layer systematically. For the analysis of the VCNN, we focus on the relationship between the filter weights (also known as anchor vectors) from the last fully connected (FC) layer to the output. Typically, the output of the VCNN contains several sets of confusing classes, and the cause of these confusion sets can be well explained by analyzing their anchor vector relationships. Furthermore, a hierarchical clustering method followed by a random forest classification method is proposed to boost the classification performance among confusing classes. For the application of the VCNN, we examine the 3D shape classification problem and conduct experiments on a popular dataset called ModelNet40. The proposed VCNN offers state-of-the-art performance among all volume-based CNN methods.
Chapter 1
Introduction
1.1 Significance of the Research
2D and 3D shapes are often encountered in many computer vision problems such as manufactured component recognition and retrieval, sketch-based shape retrieval, 3D model repository management, antique model search engines, etc. A 2D shape [44], also known as a silhouette image, is a binary image with a black background and a single object as the foreground. A 3D shape usually has two representations: a mesh and volumetric data. A mesh model [89] is composed of vertices and edges. An edge connects two vertices, a face is constructed by connecting edges, and multiple adjacent faces construct a surface for a 3D object. A volumetric model is discrete 3D data consisting of voxels, with binary value 1 for the foreground object and 0 for the background. In Fig. 1.1, an airplane object is represented by a 2D shape, a mesh model and volumetric data. It is worth noting that a 2D shape can be acquired from the projection of a 3D shape, which builds a strong relationship between the 2D and 3D worlds. With the development of advanced acquisition techniques, a dramatically increasing number of 2D and 3D shape models are available on the Internet through sites such as Google Sketchup and Yobi3D [44], [89]. Robust methods to accurately retrieve and classify 2D shapes and 3D shapes are in demand.
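To make these representations concrete, the following minimal Python sketch (NumPy assumed; the shapes below are toy placeholders, not data from this dissertation) shows how each of the three forms is commonly stored:

```python
import numpy as np

# 2D shape: a binary silhouette image (1 = foreground object, 0 = background).
silhouette = np.zeros((64, 64), dtype=np.uint8)
silhouette[20:44, 16:48] = 1  # a toy rectangular "object"

# Mesh model: vertices (3D points) and faces (index triples into the vertex list).
vertices = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
faces = np.array([[0, 1, 2], [0, 1, 3], [0, 2, 3], [1, 2, 3]])  # a tetrahedron

# Volumetric model: a binary voxel grid (1 = occupied, 0 = empty).
voxels = np.zeros((30, 30, 30), dtype=np.uint8)
voxels[10:20, 10:20, 10:20] = 1  # a toy cube
```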
2D/3D Shape Retrieval. Given a query shape, a shape retrieval system retrieves ranked shapes from a gallery set according to a designed similarity measurement between the query shape and shapes in the gallery set, called gallery shapes. The performance of a shape retrieval system is evaluated by the consistency between the retrieved shapes and human interpretation. The shape retrieval problem is challenging due to a wide range of intra-class variations and inter-class similarities. Intra-class variations in 2D shapes are mainly caused by articulation, noise, contour deformation, and multiple projection angles. Fig. 1.2(a)-(d) shows these four types of intra-class variations in 2D shapes. For 3D shapes, intra-class variations generally consist of semantic-level variations and articulation variations. For example, in Fig. 1.3(a) and (b), different gestures of hands indicate articulation variations while various appearances of mugs produce semantic-level variations. Inter-class similarity means that two classes of shapes may happen to have strongly similar appearances. For example, Fig. 1.4(a) and (b) display four examples of inter-class similarities in 2D shapes and 3D shapes, respectively. To overcome intra-class variations and inter-class similarities, many interesting features have been proposed to maximize the margin among classes and minimize the variation within each class. However, purely relying on feature design fails to deal with the large variety of real cases, as will be discussed in Section 1.2. In this dissertation, we propose a Two-Stage Shape Retrieval (TSR) method and an Irrelevance Filtering and Similarity Ranking (IF/SR) method to systematically solve the 2D shape retrieval problem and the 3D shape retrieval problem in an unsupervised manner, respectively. They will be introduced in Chapter 3 and Chapter 4.

Figure 1.1: An airplane object is represented by (a) a 2D shape, (b) a mesh model, and (c) volumetric data.

Figure 1.2: Examples of intra-class variations in 2D shapes: (a) articulation, (b) contour noise, (c) contour deformation, and (d) multiple projection angles. Every pair of shapes represents a class and each subfigure gives two exemplary classes.

Figure 1.3: Examples of intra-class variations in 3D shapes: (a) articulation, (b) semantic-level variation. Shapes in each subfigure are from the same class.

Figure 1.4: Examples of inter-class similarities in (a) 2D shapes, (b) 3D shapes. Every pair of shapes represents two shapes from two different classes.
3D Shape Classification. A 3D shape classifier predicts the class label of a given testing 3D shape. The performance of a 3D shape classifier is determined by comparing predicted labels with ground-truth labels. Traditionally, a 3D classification method is composed of two steps: feature extraction and supervised classifier learning. The 3D shape classification problem shares the same challenges, such as intra-class variations and inter-class similarities, with the 3D shape retrieval problem. Besides, after acquiring the feature space determined by a designed feature, it is essential to learn a robust classifier to separate samples from different classes. However, it is hard for traditional hand-crafted features to strike a balance between feature extraction and supervised classifier learning. In recent years, solutions based on the convolutional neural network (CNN) [41], [45] have been proposed to learn the feature extraction and the classifier simultaneously, which is called end-to-end training. Evidently, CNN-based methods outperform traditional methods based on hand-crafted features in the 3D shape classification problem. However, the performance of a CNN-based classifier heavily relies on its empirically preset parameters. Furthermore, as in any other real-world classification problem, there exist sets of confusing classes in the feature space. This will be discussed in Section 1.2. We propose a Volumetric CNN (VCNN) method that chooses network parameters automatically with theoretical justification. In our VCNN method, we also propose a hierarchical classification method to adaptively analyze the feature space and enhance the classification accuracy. Our method will be explained in Chapter 5.
1.2 Related Previous Work
1.2.1 2D/3D Shape Retrieval
A traditional shape retrieval system starts by extracting features for both the query shape and the gallery shapes. By calculating the distance between the features of the query shape and those of each gallery shape, the gallery shapes are ranked and retrieved. To enhance the retrieval performance, post-processing methods can also be applied to refine the distance matrix. The general flowchart of traditional shape retrieval systems is shown in Fig. 1.5.
In the feature extraction stage, both global features and local features have been considered in past years. Global features for 2D shapes, such as the Zernike moments [39] and the Fourier descriptor [109], capture the basic properties of a shape contour and region. However, they are not effective in capturing local details of shape contours. Similarly, global features for 3D shapes, such as the rotation invariant spherical harmonics method [38] and the shell histogram [2], lose critical local properties on the surface of a 3D shape. Therefore, these global features result in low discriminative retrieval performance.
To overcome the drawbacks of global features, recent research efforts have focused on the development of more powerful local features. Local features are designed to explore properties of local regions on shape contours and surfaces. For 2D shapes, the shape context (SC) method [12] sets a local frame for each sampled point. The local frame is transferred into a local log-polar coordinate system. The other sampled points on the contour that lie in this coordinate system form a histogram according to their locations. Finally, calculating the similarity between two shapes becomes a bipartite matching problem with respect to two sets of points. The similarity is represented by the minimum cost of the matching function. However, the SC method is extremely sensitive to articulation variations because of its use of the Euclidean distance. The inner distance shape context (IDSC) method [54] incorporates the inner distance function, also called the geodesic distance, into the SC algorithm and improves the performance against articulation variations. Furthermore, in order to handle other variations such as multiple projection angles and contour noise, the articulation invariant representation (AIR) method [32] and the aspect shape context (ASC) method [55] were proposed, considering a view angle normalization and a 3D height function, respectively.

Figure 1.5: The flow chart of a traditional 2D/3D shape retrieval system.
Local features for 3D shapes include view-based features, surface features, and so on. The light field descriptor (LFD) method [19], a view-based feature, projects a 3D mesh into multiple views. Each view is represented by a Zernike moments feature and a polar Fourier transform feature. The similarity between two shapes is the minimum cost of matching two sets of views. Extending the LFD method, the salient local visual feature (SLVF) method [68] extracts SIFT [57] points for each view. By collecting all SIFT points in a training pool, a codebook is trained, and the feature for a new shape is computed with the Bag-of-Words method. Other view-based methods [18], [50] use different features for views, such as depth lines, SURF, chain codes and so on. Compared with view-based methods, surface features capture the local surface property directly on a manifold. For example, the spin images method [36] projects oriented vertices on a mesh into local 2D coordinates. Then, the feature for a target point is represented by the histogram of its neighbors. The MeshSIFT [84], 3DSURF [40] and MeshHOG [107] methods extend the SIFT [57], SURF [11] and HOG [21] methods from 2D images to 3D shapes, respectively. The heat kernel signature (HKS) method [16] uses the heat diffusion equation to capture the amount of diffused heat as a function of time for each vertex on a mesh. The HKS method is robust under shape deformation since isometric shapes have the same HKS feature.

Figure 1.6: Comparison of retrieval results between (a) IDSC and (b) SLVF (first row) and the proposed method (second row). The query is in a red box given in the leftmost column for each row.
Although local-features-based methods capture important shape and surface properties, their locality restricts the discrimination among shapes at the global scale. Consequently, their retrieval results may include globally irrelevant shapes in high ranks. To illustrate this claim, two exemplary queries, an octopus as a 2D shape and a chair as a 3D shape, are given in the leftmost column, and their retrieval results based on IDSC and SLVF are displayed from left to right in rank order in Fig. 1.6(a) and (b), respectively. The results show that IDSC and SLVF are not able to capture the basic global properties of an octopus (its tentacles) and a chair (its back), so the two queries are confused with apples and rectangular tables, respectively.

Figure 1.7: Comparison of retrieval results among AIR (first row), AIR+DP in [25] and the proposed method. The query is in a red box given in the leftmost column for each row.
Post-processing techniques such as diffusion processing (DP) [25], [73], [74], [103], [104] have been proposed to compensate for errors arising from local-features-based shape retrieval methods. DP treats each sample as a node and the similarity between any two samples as a weighted edge. All samples form a connected graph, called a manifold, and affinities are diffused along the manifold to improve the measured similarities. However, DP has its limitations. When shapes of two classes are mixed in the feature space, it cannot separate them properly. Also, when a query is far away from the majority of samples in its class, it is difficult to retrieve shapes of the same class at high ranks. We take the case in Fig. 1.7 as an example. AIR retrieves erroneous comma shapes for the query bone because these two classes are mixed in the corresponding feature space. After the diffusion process, the errors are amplified instead.
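To illustrate the mechanics of such a diffusion step, here is a minimal Python sketch of one common random-walk-style variant; the update rule and its parameters are illustrative assumptions, not the exact formulation of [25] or the other cited works:

```python
import numpy as np

def diffuse_affinities(W, alpha=0.9, iterations=20):
    """Smooth a pairwise affinity matrix along the data manifold.

    W: (n, n) symmetric non-negative affinity matrix.
    A random-walk style update: affinities are propagated through
    neighbors so that samples on the same manifold move closer.
    """
    # Row-normalize W into a transition matrix P.
    P = W / W.sum(axis=1, keepdims=True)
    A = W.copy()
    for _ in range(iterations):
        # Mix each sample's affinities with those of its neighbors.
        A = alpha * P @ A @ P.T + (1 - alpha) * W
    return A
```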
1.2.2 3D Shape Classification
Traditional methods for 3D shape classification use classifiers such as the Support Vector Machine (SVM), random forest, AdaBoost, etc., to learn a hand-crafted feature space from training samples. However, it is hard for hand-crafted features to balance the performance of the feature extraction and the classifier learning. A CNN method is composed of several convolutional layers and fully connected layers. The convolutional layers build a hierarchical structure that strikes a balance between global and local analysis to extract robust features. The fully connected layers serve as a classifier to predict class labels. By using the back-propagation technique, a global loss function is optimized simultaneously for both the feature extraction and the classifier learning. A CNN method for the 3D shape classification problem classifies 3D shapes using either view-based [79], [87] or volume-based input data [60], [75], [97]. Fig. 1.8 shows the flow chart of a volume-based CNN method introduced in [60]; its neural network model is composed of two convolutional layers and one fully connected layer. A view-based CNN classifies 3D shapes by analyzing multiple rendered views while a volume-based CNN conducts classification directly on the 3D representation. Currently, the classification performance of view-based CNNs is better than that of volume-based CNNs since the resolution of the volumetric input data has to be much lower than that of the view-based input data due to the higher memory and computational requirements of the volumetric input. On the other hand, since volume-based methods preserve the 3D structural information, they are expected to have greater potential in the long run.

Figure 1.8: The flow chart of a CNN-based method introduced in [60].

Figure 1.9: Examples of three sets of confusing classes generated by using the method in [60]: (a) bookshelf and tv stand, (b) chair and stool, and (c) flower pot and plant.
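As a concrete sketch of the two-convolutional-layer-plus-one-FC-layer architecture described above, the following PyTorch model is an illustrative assumption: the layer widths, kernel sizes and the 30x30x30 grid are placeholders, not the exact configuration of [60]:

```python
import torch
import torch.nn as nn

class TinyVolumeCNN(nn.Module):
    """Illustrative volume-based CNN: two 3D conv layers + one FC classifier."""
    def __init__(self, num_classes=40, grid=30):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=5, stride=2),  # coarse local 3D patterns
            nn.LeakyReLU(0.1),
            nn.Conv3d(32, 32, kernel_size=3, stride=1),
            nn.LeakyReLU(0.1),
            nn.MaxPool3d(2),
        )
        with torch.no_grad():  # infer the flattened feature size for the FC layer
            n = self.features(torch.zeros(1, 1, grid, grid, grid)).numel()
        self.classifier = nn.Sequential(nn.Flatten(), nn.Linear(n, 128),
                                        nn.ReLU(), nn.Linear(128, num_classes))

    def forward(self, voxels):  # voxels: (batch, 1, grid, grid, grid), binary
        return self.classifier(self.features(voxels))
```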
Although CNN-based methods outperform traditional hand-crafted features, they suffer from two critical drawbacks. First, the parameters of each layer, including the number and size of filters, are set empirically before the training process. Finding optimal parameters therefore requires a huge number of blind tests, which have extremely high computational complexity. Second, although the feature space after the convolutional layers accounts for both global and local properties of 3D shapes, it inevitably contains sets of confusing classes that cannot be easily separated by the following fully connected layers. Three exemplary sets of confusing classes generated by the method in [60] are illustrated in Fig. 1.9. Consequently, the classification accuracy for these sets is low.
1.3 Contributions of the Research
In this dissertation, a shape retrieval system, called the two-stage shape retrieval (TSR) method, is proposed to address the 2D shape retrieval problem. It consists of two stages, I) the irrelevant cluster filtering (ICF) stage and II) the local-features-based matching and ranking (LMR) stage, to systematically remove globally irrelevant shapes from the retrieval result. By extending our TSR method, we design an Irrelevance Filtering and Similarity Ranking (IF/SR) method to improve the result of the 3D shape retrieval problem. To solve the 3D shape classification problem, we propose a novel volumetric CNN model (VCNN). It predicts network parameters without supervised training and reinforces the predictions on sets of confusing classes in the feature space. The contributions of the three methods are listed as follows.
1.3.1 The Two-Stage Shape Retrieval (TSR) Method
- We conduct a thorough analysis of local features such as AIR and IDSC to demonstrate that the locality of local features ignores the global coherence between the retrieved shapes and the query. We design three types of global features, namely skeleton features, wavelet features, and geometrical features, to complement existing local features.

- We consider both global and local features in the measurement of the similarity score between a query shape and gallery shapes. We propose a direct assignment to represent the global similarity and an indirect assignment to represent the local similarity. The direct assignment is achieved by using unsupervised clustering and supervised classifier training to transfer a local feature space into a global feature space. To obtain the indirect assignment, we adopt KNN classifiers to consider the neighbor distribution of each gallery sample in a local feature space.

- We propose a joint cost function that strikes a balance between global and local features (see the sketch after this list). Two contradictory cases are resolved: shapes that are close in the local feature space can be distant in the global feature space, and vice versa.

- We conduct experiments on several popular 2D shape datasets using different retrieval measurements. Our method outperforms all state-of-the-art methods. In particular, we achieve significantly better retrieval performance in high ranks. The soundness of our system is demonstrated and the errors are analyzed in detail. For example, in the second row of Fig. 1.6(a), for the query octopus shape, we remove all errors, which are apples, from the high ranks of its retrieval result.
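The following Python sketch illustrates the two-stage idea referenced in the third bullet; the specific joint cost (a weighted mean of cluster-level global and local distances) and the keep ratio are illustrative assumptions, not the dissertation's exact formulation:

```python
import numpy as np

def two_stage_retrieval(d_global, d_local, cluster_of, lam=0.5, keep_ratio=0.3):
    """Schematic two-stage retrieval pipeline.

    d_global, d_local: (n_gallery,) distances from the query to each gallery
        shape in the global and local feature spaces.
    cluster_of: (n_gallery,) cluster label of each gallery shape.
    Stage I: score each cluster with a joint cost mixing the two distances,
    and keep only the lowest-cost (most relevant) clusters.
    Stage II: rank the shapes of the surviving clusters by local distance.
    """
    labels = np.unique(cluster_of)
    cost = np.array([lam * d_global[cluster_of == c].mean()
                     + (1 - lam) * d_local[cluster_of == c].mean() for c in labels])
    kept = set(labels[np.argsort(cost)[:max(1, int(keep_ratio * len(labels)))]])
    candidates = np.where([c in kept for c in cluster_of])[0]
    return candidates[np.argsort(d_local[candidates])]  # final ranking
```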
1.3.2 The Irrelevance Filtering and Similarity Ranking (IF/SR) Method
- We conduct a thorough analysis of local features, i.e., view-based features, for 3D shapes to reveal their weakness: their discriminative power is restricted at the global scale. In particular, they may retrieve globally irrelevant 3D shapes in high ranks.

- We propose a robust shape normalization process to reduce the influence of the translation, rotation, and scaling variances existing in each individual class of shapes. The normalization process helps improve the robustness of extracting global features.

- We design a set of global features for 3D shapes, including surface features, wavelet features, and geometrical features, to compensate for the chosen local feature, DG1SIFT [67]. Our global features cover properties such as surface, frequency, rectilinearity, cut-plane and so on.

- Feature concatenation is often adopted by traditional methods to combine local and global features. However, proper feature weighting and dimension reduction remain a problem. We propose a robust shape retrieval system in cascade: the irrelevance filtering (IF) module and the similarity ranking (SR) module. The IF module attempts to cluster gallery shapes that are similar to each other by examining global and local features simultaneously. However, shapes that are close in the local feature space can be distant in the global feature space, and vice versa. To resolve this issue, we propose a joint cost function that strikes a balance between the two distances. In particular, irrelevant samples that are close in the local feature space but distant in the global feature space can be removed in this stage. The remaining gallery samples are ranked in the SR module using the local feature.

- Our experiments are conducted on a popular dataset, the SHREC12 generic 3D shape dataset [47]. We show that our method has superior performance over all state-of-the-art methods on this dataset. Furthermore, we visually compare our retrieval performance with other methods and demonstrate the effectiveness of our proposed method. An illustration of our retrieval result is shown in Fig. 1.6(b). Taking a chair without wheels as the query, we remove all errors (rectangular tables) from high ranks.
1.3.3 The Volumetric CNN (VCNN) Method
- In contrast with the traditional CNN design that uses empirical rules to choose network parameters, we choose them for the VCNN with theoretical justification. That is, we propose a feed-forward K-means clustering algorithm to identify the optimal filter number and filter size systematically.

- As in any other real-world classification problem, there exist sets of confusing classes in the 3D shape classification problem. We analyze the filter weights that connect the last fully connected layer and the output layer. All filter weights associated with a 3D shape output class define the "shape anchor vector" (SAV) for this class. We show that two shape classes are confusing if the angle between their SAVs is small and a particular class has a relatively wide feature distribution (see the sketch after this list).

- We propose a hierarchical clustering method to determine whether two shape classes belong to the same confusion set without conducting tests. Initially, we identify each confusion set, which is composed of multiple confusing classes, by using the confusion matrix. Then, we split samples of the same confusion set into multiple subsets automatically. The classification of each confusion subset is enhanced by re-classification using a random forest classifier.

- Experiments are conducted on the popular ModelNet40 dataset. Our method offers the state-of-the-art performance among all volume-based CNN methods. Error analyses are discussed and the reasons behind erroneous cases are justified.
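The following Python sketch illustrates the SAV angle analysis mentioned in the second bullet; the angle threshold is an illustrative assumption, and the full method also weighs each class's feature distribution, which is omitted here:

```python
import numpy as np

def confusing_class_pairs(fc_weights, angle_threshold_deg=25.0):
    """Flag potentially confusing class pairs from last-layer filter weights.

    fc_weights: (num_classes, feature_dim) matrix whose rows are the shape
        anchor vectors (SAVs), i.e., the weights connecting the last FC
        layer to each output class.
    Two classes are flagged when the angle between their SAVs is small.
    """
    v = fc_weights / np.linalg.norm(fc_weights, axis=1, keepdims=True)
    cos = np.clip(v @ v.T, -1.0, 1.0)
    angles = np.degrees(np.arccos(cos))
    pairs = [(i, j) for i in range(len(v)) for j in range(i + 1, len(v))
             if angles[i, j] < angle_threshold_deg]
    return pairs
```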
1.4 Organization of the Dissertation
The rest of this dissertation is organized as follows. In Chapter 2, we give a background review of several state-of-the-art local features, diffusion processing approaches, machine learning techniques and popular 2D/3D shape datasets. The proposed TSR system for the 2D shape retrieval problem is introduced in Chapter 3. The IF/SR method that addresses the 3D shape retrieval problem is described in Chapter 4. Our VCNN method for the 3D shape classification problem is explained in Chapter 5. Finally, we summarize the three pieces of our work and propose future work in Chapter 6.
Chapter 2
Background
2.1 Local Features for 2D Shapes
To describe a shape locally, the properties of sampled contour points are usually explored. As the first step, the sampling process is based on a uniform arclength step and a pre-defined sampling number, because uniform arclength sampling preserves the basic topological information of a 2D shape (or contour). After sampling, the contour information is transformed into a 2D signal in clockwise traversal order as p(t), t = 1, 2, ..., N, where N is the sampling number and t is the location of a contour point. The basic requirements for designing a local 2D shape feature are invariance to translation, rotation, scaling, articulation, noise and multiple projection angles. The sampling scheme discussed above already guarantees translational invariance.
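A minimal Python sketch of uniform arclength resampling (NumPy assumed; linear interpolation is an implementation choice, not prescribed by the text):

```python
import numpy as np

def resample_contour(points, n=100):
    """Resample a closed contour at n points with uniform arclength steps.

    points: (m, 2) array of contour points in traversal order.
    Returns p(t), t = 1..n, preserving the basic topology of the contour.
    """
    closed = np.vstack([points, points[:1]])            # close the loop
    seg = np.linalg.norm(np.diff(closed, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])         # cumulative arclength
    targets = np.linspace(0.0, s[-1], n, endpoint=False)
    x = np.interp(targets, s, closed[:, 0])
    y = np.interp(targets, s, closed[:, 1])
    return np.stack([x, y], axis=1)
```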
The curvature scale space (CSS) method [63] first calculates the curvature values of each sampled point at multiple scales σ. Multiple scales are achieved by applying a Gaussian filter to each sampled point recursively. Then, the curvature zero-crossing points at each scale are extracted as the candidate points for that scale. The recursive smoothing operation terminates when no zero-crossing points are detected on the whole contour. Next, the contour of a shape is represented by a 2D signal in (t, σ) space, which is called the curvature scale space. The signal has a response when a point at location t is a zero-crossing point at scale σ. Based on this 2D signal, the peaks above a certain threshold are detected as the salient points. Finally, the magnitude σ is normalized to 1 and the highest peak is cyclically translated to the origin at t = 0 to achieve rotation invariance. To measure the similarity of two shapes, the peaks from the two shapes are first matched. The matching process allows a pre-defined tolerance on the peak location. Second, the similarity is measured by summing up the differences of all matched and unmatched peaks. Although CSS captures the curvature information of a shape, it loses the relationships among contour points, which are related to the region property, so its retrieval performance is unsatisfactory. Moreover, when articulation happens, the locations of the salient points change dramatically, resulting in low discriminative power.
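The following Python sketch locates curvature zero-crossings across scales, which are the raw material of the (t, σ) CSS image; the discrete curvature formula and the scale set are illustrative choices (SciPy assumed):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def css_zero_crossings(contour, sigmas=(1, 2, 4, 8, 16)):
    """Locate curvature zero-crossings of a closed contour across scales.

    contour: (n, 2) uniformly resampled contour points.
    Returns {sigma: indices t where the smoothed curvature changes sign}.
    """
    out = {}
    for sigma in sigmas:
        x = gaussian_filter1d(contour[:, 0], sigma, mode="wrap")
        y = gaussian_filter1d(contour[:, 1], sigma, mode="wrap")
        dx, dy = np.gradient(x), np.gradient(y)
        ddx, ddy = np.gradient(dx), np.gradient(dy)
        kappa = (dx * ddy - dy * ddx) / (dx**2 + dy**2) ** 1.5  # curvature
        out[sigma] = np.where(np.diff(np.sign(kappa)) != 0)[0]
    return out
```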
To consider the relationships among contour points, shape context (SC) [12] constructs a local coordinate system for each sampled point to describe the properties of its neighbors. The coordinate system is designed as a log-polar space centered at the target contour point, as shown in Fig. 2.1(a). Then, the number of other contour points lying in each sector is recorded. By treating each sector as a bin, a 2D histogram, shown in Fig. 2.1(b), is constructed for each sampled contour point with respect to log r and θ, which indicate the location of a sector. To achieve rotation invariance, the direction of each log-polar coordinate system is determined by the tangent orientation at its target point. To measure the similarity between two shapes, SC compares two sets of contour points by treating the problem as a bipartite matching problem. The minimum cost of the matching function becomes the distance between the two shapes. Compared with CSS, SC boosts the retrieval performance because of its consideration of the spatial relationships among contour points. However, the Euclidean distance measured between contour points makes SC sensitive to articulation variations.
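A minimal Python sketch of the SC pipeline (SciPy assumed); the bin counts and the chi-square matching cost are common choices, and the tangent-orientation normalization for rotation invariance is omitted for brevity:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def shape_context(points, r_bins=5, theta_bins=12):
    """Log-polar histogram of the other contour points, per point."""
    d = cdist(points, points)
    r_edges = np.logspace(np.log10(d[d > 0].min()), np.log10(d.max()), r_bins + 1)
    hists = []
    for i, p in enumerate(points):
        rel = np.delete(points, i, axis=0) - p
        r = np.log10(np.linalg.norm(rel, axis=1))
        theta = np.arctan2(rel[:, 1], rel[:, 0]) % (2 * np.pi)
        h, _, _ = np.histogram2d(r, theta, bins=[np.log10(r_edges),
                                 np.linspace(0, 2 * np.pi, theta_bins + 1)])
        hists.append(h.ravel() / max(h.sum(), 1))
    return np.array(hists)

def sc_distance(points_a, points_b):
    """Bipartite matching of the two point sets on chi-square histogram costs."""
    ha, hb = shape_context(points_a), shape_context(points_b)
    cost = 0.5 * np.array([[np.sum((a - b) ** 2 / (a + b + 1e-9)) for b in hb]
                           for a in ha])
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].sum()
```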
Based on the intuition of SC, the inner distance shape context (IDSC) method [54] combines the inner distance and the SC algorithm to solve the articulation problem. It observes that the inner distance between two contour points, which is the length of the shortest path lying inside the shape, varies very little when an articulated shape changes its pose. A comparison between the inner distance and the Euclidean distance measured on an articulated shape is shown in Fig. 2.2. It shows that the Euclidean distances between two points change dramatically under articulation variations, while the inner distances remain robust. Extending the inner distance idea, IDSC designs a new measurement called the inner angle to describe the angular relationship between two contour points. The traditional angle and distance measurements in the log-polar coordinate system are then replaced by the inner distance and the inner angle. The similarity measurement of IDSC is the same as that of SC. The inner distance and the inner angle help IDSC improve the retrieval performance. However, merely using the inner distance cannot resolve the problems of interior noise and multiple projection angle variations.

Figure 2.1: Shape context example: (a) the log-polar coordinate system at a target point, (b) the shape context feature of the target point.
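A minimal Python sketch of the inner distance idea, approximating shortest paths inside a rasterized shape with 4-connected BFS; the pixel-grid approximation is an illustrative simplification of the polygonal shortest-path computation used by IDSC:

```python
import numpy as np
from collections import deque

def inner_distances(mask, sources):
    """Approximate inner distances on a binary shape mask via BFS.

    mask: (h, w) binary image, 1 = inside the shape.
    sources: list of (row, col) contour points.
    Paths are constrained to stay inside the shape, so the returned
    distances approximate the geodesic (inner) distance, which is far
    more stable under articulation than the Euclidean distance.
    """
    h, w = mask.shape
    dists = {}
    for src in sources:
        d = np.full((h, w), np.inf)
        d[src] = 0.0
        q = deque([src])
        while q:
            r, c = q.popleft()
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < h and 0 <= nc < w and mask[nr, nc] \
                        and d[nr, nc] == np.inf:
                    d[nr, nc] = d[r, c] + 1
                    q.append((nr, nc))
        dists[src] = d
    return dists
```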
Interior noise deteriorates IDSC because a small gap dramatically changes the inner distance between two points. An example of this deterioration is shown in Fig. 2.3: although the two samples belong to the same class, they are far away from each other in the inner distance space.

Figure 2.2: Comparison between the inner distance (red curve) and the Euclidean distance (yellow straight line) of an articulated shape under three different poses.

To solve the interior noise problem, the aspect shape context (ASC) method [55] adds one more dimension, height, to the distance function. It piles a 2D shape up into a 3D shape using different height values h. The inner distance function in the aspect space is then written as ASID(p(t_1), p(t_2), h). This inner distance function with the height parameter has two limiting properties:

    \lim_{h \to \infty} ASID(p(t_1), p(t_2), h) = ID(p(t_1), p(t_2)),    (2.1)

    \lim_{h \to 0} ASID(p(t_1), p(t_2), h) = ED(p(t_1), p(t_2)),    (2.2)

where ID(p(t_1), p(t_2)) and ED(p(t_1), p(t_2)) are the inner distance and the Euclidean distance between two contour points, respectively. The histogram of a target point is also expanded by adding the height parameter. The similarity between two shapes is the minimum cost of matching two sets of points across different heights pairwise.
The articulation-invariant representation (AIR) method [32] extends IDSC by considering the transformation in 3D space. To handle multiple projection angle variations, AIR first segments a 2D shape based on approximate convex decomposition; in other words, convex regions have higher probabilities of being segmented.

Figure 2.3: Interior noise influences the robustness of the inner distance.

Motivated by the relationship between the inner distance and the Euclidean distance, namely that two points on a convex shape contour should satisfy ID(p(t_1), p(t_2)) = ED(p(t_1), p(t_2)), the convexity of a set of t points S is measured by:

    CV(S) = 1 - \frac{1}{t^2 - t} \sum_{p(t_1) \in S} \sum_{p(t_2) \in S, t_1 \neq t_2} \left( 1 - \frac{ED(p(t_1), p(t_2))}{ID(p(t_1), p(t_2))} \right).    (2.3)

The function value becomes one if a shape is convex. The segmentation problem is then solved by minimizing a cost function based on the convexity measurement above:

    C(n, s_i) = \sum_{i=1}^{n} \sum_{p(t_1) \in s_i} \sum_{p(t_2) \in s_i, t_1 \neq t_2} \left( 1 - \frac{ED(p(t_1), p(t_2))}{ID(p(t_1), p(t_2))} \right),    (2.4)

where s_i is a segmented part of shape S and n is the optimized number of segments. The cost function above is minimized with the normalized cut technique. After obtaining the segmentation of each shape, each segment is transformed from its minimal enclosing parallelogram to a unit square to conquer the multiple projection angle variations. Finally, IDSC is applied to the transformed shape and the corresponding features are extracted.
AIR achieves the best retrieval performance on the MPEG-7 shape dataset, described in Section 2.5.2, among techniques without post-processing approaches.
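A minimal Python sketch of the convexity measure in the reconstructed Eq. (2.3), assuming the Euclidean and inner distance matrices have been precomputed:

```python
import numpy as np

def convexity(ed, idist):
    """Convexity measure CV(S) from Eq. (2.3), given precomputed matrices.

    ed, idist: (t, t) Euclidean and inner distance matrices over the t
    points of S. On a convex contour ED = ID everywhere, so CV(S) = 1.
    """
    t = ed.shape[0]
    mask = ~np.eye(t, dtype=bool)                  # exclude t1 == t2 pairs
    return 1.0 - np.sum(1.0 - ed[mask] / idist[mask]) / (t * t - t)
```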
2.2 Local Features for 3D Shapes
We discuss two types of state-of-the-art local features for 3D meshes: surface-based features and view-based features. Similar to the design of local features for 2D shapes, local features for 3D shapes are also required to be translation-, rotation- and scale-invariant. Surface-based features, such as MeshSIFT [84], HKS [16], the geodesic distance matrix (GDM) [83], 3DSURF [40] and MeshHOG [107], usually have the prerequisite that the input mesh be a non-degenerate manifold. However, a generic 3D shape retrieval problem contains highly degenerate manifolds among many gallery and query meshes. Common defects are gaps, holes, self-intersections, interior structures and so on. Examples of these defects are shown in Fig. 2.4.
In order to provide a clean mesh to the surface-based feature extraction process, an input mesh with defects needs to be repaired. However, fully automatic mesh repair is still an open problem in computer graphics. In this work, in order to solve the generic 3D shape retrieval problem, we focus on view-based local features.
The light field descriptor (LFD) [19] places the center of mass of a 3D mesh at the center of a light field. Cameras are then placed on the 10 vertices of a regular half dodecahedron. Since the lighting source is uncertain, the light is turned off so that each projected image is a silhouette image, denoted as I_1, I_2, ..., I_10. Then, 35D Zernike moments and a 10D Fourier descriptor are extracted as the features of each image; we denote the feature of the i-th image as f_i. To compare two 3D meshes, one mesh is fixed and the other is rotated 60 times to traverse all possible rotations of a dodecahedron. At each rotation, the distance is calculated by summing up the differences between all paired images. Finally, the minimum distance among all rotations is selected as the similarity between the two 3D meshes. Mathematically, the similarity between two 3D meshes O_1 and O_2 is expressed as:

    S(O_1, O_2) = \min_{j=1,...,60} \sum_{i=1}^{10} d(f_{i1}(O_1), f_{ij}(O_2)),    (2.5)

where f_{ij} denotes the i-th projected image under the j-th rotation. LFD is an early view-based feature that solves the degenerate mesh problem in 3D mesh retrieval. However, its weak per-view features and its comparison method prevent it from obtaining high retrieval performance on a large dataset.

Figure 2.4: Examples of defects in 3D meshes: (a) gap: disconnected leg parts, (b) holes, (c) self-intersection, (d) interior structure.
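A minimal Python sketch of the comparison in Eq. (2.5); the per-view feature distance d is a placeholder (an L1 distance here), and the view features are assumed to be precomputed and aligned rotation by rotation:

```python
import numpy as np

def lfd_similarity(feats_1, feats_2_rotations,
                   d=lambda a, b: np.abs(a - b).sum()):
    """LFD-style dissimilarity following Eq. (2.5).

    feats_1: (10, dim) features of the 10 views of mesh O1.
    feats_2_rotations: (60, 10, dim) features of the 10 views of mesh O2
        under each of the 60 dodecahedral rotations, aligned view-by-view.
    """
    costs = [sum(d(feats_1[i], feats_2_rotations[j][i]) for i in range(10))
             for j in range(60)]
    return min(costs)
```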
Figure 2.5: The cylinder coordinate system and its corresponding cylindrical projection in [70].

Several techniques improve upon the per-view features used in LFD in order to increase discriminability. For example, Multi-View Depth Line (MDLA) [18] uses an N x N depth-buffer image for each view instead of a silhouette image. For each depth-buffer image, 2N depth lines are generated vertically and horizontally. Finally, by treating each depth line as a sequence, two views are compared using a dynamic programming algorithm (the Needleman-Wunsch algorithm).
PANORAMA [70] changes the dodecahedron-based projection method to be a
panoramic projection. Before the projection process, PANORAMA firstly normalizes
the pose of an input mesh by using two methods as NPCA and CPCA. Then, for each
normalization method, three cylindrical range image are extracted by putting the mesh
at the centroid of a cylinder which has height 2R and radiusR. R is the radius of the
minimal enclosing sphere of the input mesh. The projection example is shown in Fig.
2.9. Then, the 2D Fourier coefficients and Discrete Wavelet Transform coefficients are
extracted from each view. Finally, two features for a mesh are formed by concatenating
these coefficients under two normalization results. The similarity between two meshes
are the minimum distance among four pairs of features.
The weighted bipartite graph matching method [29] considers view importance within the framework of LFD. When matching two sets of views, redundancy always occurs when dealing with trivial views. In order to reduce the redundancy, this method uses hierarchical agglomerative clustering (HAC) to cluster the light field views in the Zernike feature space. The views at the cluster centroids are selected as the representative views. The initial importance of a representative view is determined by its cluster size. Afterwards, the importance values are updated using the random walk algorithm to account for the similarity among different clusters of views. Finally, with the importance value as the view weight, the comparison between two sets of views becomes a weighted bipartite matching problem, solved by the Kuhn-Munkres algorithm for maximum-weight bipartite matching.
The salient local visual feature (SLVF) method [68] proposes a new approach to aggregating multiple views using the Scale-Invariant Feature Transform (SIFT) and the Bag-of-Words (BoW) method. Similar to LFD, the first step of SLVF is to project a 3D shape into N views. However, instead of silhouette images, SLVF uses range images that contain depth information for each view. Then, after selecting a training set of 3D shapes, all SIFT features extracted from the views of all training samples form a training feature pool. Afterwards, a visual codebook is trained using the K-means algorithm. Finally, for each testing 3D shape, the histogram of its SIFT features from all views is constructed as its feature. The similarity between two 3D mesh models is calculated by comparing the two feature vectors using the Kullback-Leibler divergence (KLD). The basic flowchart of extracting the SLVF feature from a 3D mesh is shown in Fig. 2.6.
Similar to the SLVF scheme, the depth buffered super-vector coding (DBSVC) method [50] extracts dense power SURF descriptors from the views of a 3D shape. Compared with the traditional SURF descriptor, power SURF has more dimensions and is more discriminative. After generating the codebook with the K-means algorithm, a mesh is
Figure 2.6: The basic flowchart of extracting SLVF from a 3D mesh.
encoded by the super-vector coding technique, which contains both the histogram and the offsets between the features and the words in the codebook.
2.3 Diffusion Process
Traditionally, the rank of the retrieved samples for a query is determined by the similarity measured between the query and each gallery shape. However, using these similarities alone cannot guarantee a reasonable interpretation of a feature space: the difference within a class can be large while the difference among classes can be small. To rank the samples more robustly, diffusion processes introduce the influence among all gallery shapes. To fulfill this idea, diffusion processes [25], [73], [74], [103], [104] usually consider the feature space as a connected graph, which is also interpreted as a manifold. Then, the affinities are diffused along this manifold and converge into a refined affinity measurement.
For example, in the locally constrained diffusion process (LCDP) [103], the graph is represented by an affinity matrix, which is constructed by a Gaussian kernel as

$$A(i,j) = \exp\left(-\frac{\operatorname{dist}(x_i, x_j)^2}{2\sigma^2}\right), \qquad (2.6)$$
where $\operatorname{dist}(x_i, x_j)$ measures the distance between two shapes $x_i$ and $x_j$ in a feature space. Then, the transition matrix is assigned by considering a local K-NN constraint as

$$T_K(i,j) = \frac{A_K(i,j)}{\sum_j A_K(i,j)}, \qquad (2.7)$$

where

$$A_K(i,j) = \begin{cases} A(i,j) & \text{if } x_j \text{ belongs to the K-NN of } x_i, \\ 0 & \text{otherwise.} \end{cases} \qquad (2.8)$$
Finally, the original affinity matrix is updated iteratively by

$$A_{t+1} = T A_t T^T, \qquad (2.9)$$

with the initial status

$$A_0 = A. \qquad (2.10)$$
By using the diffusion process, intuitively, if a gallery sample is close to the query in the defined feature space but far from all other highly ranked gallery samples, it will be pushed farther from the query, and vice versa.
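A minimal sketch of the LCDP update in Eqs. (2.6)-(2.10) is given below, assuming a precomputed pairwise distance matrix; the kernel width `sigma`, neighborhood size `k` and iteration count are hypothetical parameters:

```python
import numpy as np

def lcdp(dist, sigma=0.5, k=10, n_iter=20):
    """Locally constrained diffusion process (Eqs. 2.6-2.10)."""
    A = np.exp(-dist**2 / (2 * sigma**2))            # Eq. (2.6)
    # Local K-NN constraint (Eq. 2.8): keep each row's k nearest neighbors.
    A_K = np.zeros_like(A)
    nn = np.argsort(dist, axis=1)[:, :k]
    rows = np.arange(dist.shape[0])[:, None]
    A_K[rows, nn] = A[rows, nn]
    T = A_K / A_K.sum(axis=1, keepdims=True)         # Eq. (2.7)
    A_t = A.copy()                                   # Eq. (2.10)
    for _ in range(n_iter):
        A_t = T @ A_t @ T.T                          # Eq. (2.9)
    return A_t
```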
Based on the idea of LCDP, diffusion on the Tensor Product Graph (TPG) [104] applies a tensor product graph in its diffusion process to improve the AIR performance on the MPEG-7 shape dataset by considering higher-order information. TPG reports success on 2D shape retrieval, image retrieval and image segmentation problems. The diffusion method using Mutual Nearest Neighbors (MNN) [74] uses a mutual nearest neighbor graph to enhance the correlation between connected samples in the original graph. Diffusion Processes for Retrieval Revisited (DPRR) [25] fuses the initialization methods and transition matrices of different diffusion methods to boost the retrieval performance. Diffusion-based methods are effective at improving an affinity matrix in a manifold domain, but they fail in two cases. First, when shapes from two classes are mixed in a feature space, a diffusion process is incapable of separating them properly. Second, when a query is far away from the majority of shapes in its class, it is hard to retrieve them with high ranks. These issues arise because a diffusion process is restricted by the underlying feature space.
2.4 Machine Learning Techniques
2.4.1 Random Forest
Random Forest [13] is an ensemble-learning algorithm developed by Leo Breiman to solve classification and regression problems. The algorithm is based on a traditional but effective model called the decision tree. Decision rules are closely related to human perception of shapes: a human tends to search for the prominent properties of a shape to differentiate it from other, irrelevant shapes instead of calculating a similarity function. For example, Fig. 2.7 shows some simple single-level decision trees. The first-level prominent properties for an octopus and a camel can be tentacles and humps, respectively. To describe a shape more completely, multiple features are necessary to increase the discriminability, and the Random Forest is able to combine these features reasonably. Briefly speaking, the Random Forest algorithm repeatedly draws a random subset of samples from the training data to build each decision tree. Within each tree, each node selects a random subset of features and then determines the best split of the feature space into the next level based on the Gini impurity criterion. When a testing sample arrives, it goes through every tree and the votes from all trees are aggregated into the prediction result. By using this bootstrapping strategy, Random Forest can deal with outliers in training data without paying heavy over-fitting penalties.
Figure 2.7: One-level decision trees for different shapes.
Its strategy of using multiple trees is able to guarantee a proper combination of different
decision functions in the feature space.
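As a concrete illustration of this voting scheme, the sketch below trains a small random forest with scikit-learn; the feature matrix and labels are random placeholders rather than actual shape features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder data: 100 samples, 13D feature vectors, 4 classes.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 13))
y_train = rng.integers(0, 4, size=100)

# Each tree is grown on a bootstrap sample; each split considers a random
# feature subset and is chosen by the Gini impurity criterion.
forest = RandomForestClassifier(n_estimators=200, criterion="gini",
                                random_state=0)
forest.fit(X_train, y_train)

# Class probabilities are the normalized votes aggregated over all trees.
probs = forest.predict_proba(rng.normal(size=(1, 13)))
```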
2.4.2 Spectral Clustering
Random Forest is more powerful as a supervised learning approach than as an unsupervised one. To reveal the data pattern of an unlabeled dataset, we can learn hidden labels using an unsupervised clustering method. In our scheme, we choose the Spectral Clustering method based on [64]. The reasons are twofold. First, Spectral Clustering is able to deal with an affinity matrix directly, which avoids the problem of missing explicit feature vectors. As we know, much of the work on the 2D shape retrieval task is formulated as a bipartite matching problem instead of projecting a shape into a feature space directly, which makes seed-based clustering methods such as K-means and Mean-shift infeasible. Second, Spectral Clustering can handle non-convex regions well, which is an advantage over traditional clustering methods. Here, we review the algorithm of [64], using a local feature as an example.
Suppose we have $n$ shapes in the dataset, represented as $X = \{x_1, x_2, \dots, x_n\}$, and we want to cluster them into $K$ clusters.

1. Use a local feature to compute the affinity matrix $A \in \mathbb{R}^{n \times n}$. The diagonal elements of $A$ are set to 0.

2. The degree matrix $D$ of $A$ is defined as $D_{ii} = \sum_j A_{ij}$, where $A_{ij}$ is the element at the $i$th row and $j$th column. A normalized affinity matrix is defined as $L = D^{-1/2} A D^{-1/2}$.

3. Choose the $K$ largest eigenvectors of $L$ as the columns of a matrix $Y \in \mathbb{R}^{n \times K}$ and normalize the rows of $Y$.

4. Perform the K-means algorithm on $Y$ by treating each row vector as an observation. If the $i$th row of $Y$ is clustered into cluster $k$, assign the label $k$ to the corresponding sample $x_i$.
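A minimal sketch of these four steps, assuming the symmetric affinity matrix `A` has already been computed from pairwise local-feature distances:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(A, K):
    """Spectral clustering in the style of [64] on an affinity matrix A."""
    A = A.copy()
    np.fill_diagonal(A, 0.0)                          # step 1
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = D_inv_sqrt @ A @ D_inv_sqrt                   # step 2
    # Step 3: the K largest eigenvectors of the symmetric matrix L.
    eigvals, eigvecs = np.linalg.eigh(L)
    Y = eigvecs[:, -K:]
    Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)  # row normalization
    # Step 4: K-means on the rows of Y gives the cluster labels.
    return KMeans(n_clusters=K, n_init=10).fit_predict(Y)
```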
2.4.3 Convolutional Neural Network (CNN)
CNN for 2D Images. CNN techniques were originally proposed to solve the 2D image classification problem. A CNN is a supervised learning method extended from neural networks, which are composed of neurons with learnable weights and biases in each layer. The neurons in one layer are connected to the neurons in the next layer through pre-defined non-linear activation functions. The input of a CNN model is simply the original image and the output is a probabilistic score for each class. A CNN model consists of two types of layers: convolutional layers and fully connected layers. The former serve as an automatic feature extractor while the latter fulfill the classification task. We borrow the flow chart of LeNet [46], which was proposed to solve the 2D character classification problem, for illustration in Fig. 2.8.
The LeNet method consists of two convolutional layers, two fully connected layers and one output layer. In each layer, there are two main operations: convolution and pooling (subsampling). Given a filter, the convolution output for a local patch,
Figure 2.8: The flow chart of LeNet [46] for the 2D digit classification problem.
called a receptive field, in each feature map is the response to the filter. By adopting the activation function, the output of a receptive field is expressed as:
$$o(r) = \varphi\Big(b + \sum_{l=1}^{L} \sum_{m=1}^{M} \sum_{n=1}^{N} w(l,m,n)\, p(l,m,n)\Big). \qquad (2.11)$$
The symbol $\varphi(\cdot)$ is the non-linear activation function, $p$ is an element in the receptive field $r$, and $w$ and $b$ are the weights and bias of the target filter. $L \times M \times N$ is the dimension of the receptive field, the same as that of the filter. It is worth noting that receptive fields at different locations in the same layer share the same filter weights and bias. The activation function introduces non-linearity and removes or suppresses weak responses, i.e., negative convolutional outputs. We list three popular activation functions in Fig. 2.9: the sigmoid function, the ReLU function and the Leaky ReLU function. Specifically, the sigmoid function is expressed as:

$$\varphi(x) = \frac{1}{1 + e^{-x}}. \qquad (2.12)$$
Figure 2.9: Examples of three activation functions: (a) Sigmoid, (b) ReLU, and (c) Leaky ReLU.
The ReLU function is defined by:

$$\varphi(x) = \begin{cases} x & \text{if } x \ge 0, \\ 0 & \text{if } x < 0. \end{cases} \qquad (2.13)$$

The Leaky ReLU extends the ReLU function with a small slope $\alpha$ for negative inputs:

$$\varphi(x) = \begin{cases} x & \text{if } x \ge 0, \\ \alpha x & \text{if } x < 0. \end{cases} \qquad (2.14)$$
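The three activation functions are straightforward to implement; a minimal numpy sketch (the leak slope `alpha` is a hypothetical default):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))            # Eq. (2.12)

def relu(x):
    return np.maximum(x, 0.0)                   # Eq. (2.13)

def leaky_relu(x, alpha=0.01):
    return np.where(x >= 0, x, alpha * x)       # Eq. (2.14)
```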
The activation function from the last fully connected layer to the output layer is popularly chosen to be the softmax function. One important reason is to obtain a probabilistic score for a given training sample; in other words, the scores over all classes sum up to 1. The softmax function is defined by:

$$h(\mathbf{x})_j = \frac{e^{x_j}}{\sum_{k=1}^{K} e^{x_k}}, \quad \text{for } j = 1, \dots, K, \qquad (2.15)$$

where $K$ is the number of classes and $\mathbf{x}$ is the feature vector whose dimensions are the outputs of the neurons in the previous layer.
The pooling operation reduces the spatial dimension of the feature map generated by each convolutional layer. One of its purposes is to reduce the complexity to an affordable extent. The other purpose is to build a hierarchical representation of the input image from the local to the global scale. In LeNet, the original input image has spatial dimension $32 \times 32$ and the feature map before the fully connected layer has spatial dimension $5 \times 5$.
A complete CNN model has a loss function determined at the output layer. The overall goal of training a CNN model is to minimize the target loss function. Popular choices of the loss function are the quadratic cost, expressed as

$$C = \frac{1}{2N} \sum_{x} \| y(x) - f(x) \|^2, \qquad (2.16)$$

where $x$ is a training sample, $N$ is the total number of training samples and $f(\cdot)$ is the overall approximated function, and the cross-entropy cost,

$$C = -\frac{1}{N} \sum_{x} \sum_{j} \big[ y(x)_j \ln(f(x)_j) + (1 - y(x)_j) \ln(1 - f(x)_j) \big]. \qquad (2.17)$$

The loss function can be minimized by the well-known back-propagation method based on the stochastic gradient descent algorithm.
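For reference, a minimal numpy sketch of the softmax output (Eq. 2.15) and the cross-entropy cost (Eq. 2.17), assuming one-hot labels `y` and predicted probabilities `f` over a batch:

```python
import numpy as np

def softmax(x):
    """Row-wise softmax, Eq. (2.15); shifted by the max for numerical stability."""
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(y, f, eps=1e-12):
    """Eq. (2.17) averaged over the batch."""
    f = np.clip(f, eps, 1.0 - eps)
    return -np.mean(np.sum(y * np.log(f) + (1 - y) * np.log(1 - f), axis=1))
```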
Figure 2.10: Illustration of the relationship between input feature vectors and anchor
vectors.
Anchor Vector and RECOS. Besides interpreting CNN models in terms of convolution, two novel interpretations, the "Anchor Vector" and the "REctified-COrrelations on a Sphere" (RECOS), are proposed in [42]. We denote the weight of the $i$th filter in the $j$th layer by $\mathbf{w}_{ji}$ and an input feature to this layer by $\mathbf{x}_{j-1}$. The convolutional term in Eq. (2.11) is the dot product between $\mathbf{x}_{j-1}$ and $\mathbf{w}_{ji}$. Naturally, it can be expressed as a projection if we set $\mathbf{x}_{j-1}$ and $\mathbf{w}_{ji}$ to unit length. The output of the projection decreases with the angle $\theta$ between $\mathbf{x}_{j-1}$ and $\mathbf{w}_{ji}$:

$$\mathbf{w}_{ji}^T \mathbf{x}_{j-1} \propto \cos\big(\theta(\mathbf{w}_{ji}, \mathbf{x}_{j-1})\big). \qquad (2.18)$$
Geometrically, both input feature vectors and filter weight vectors lie on a high-dimensional sphere. The relationship between them is visualized in Fig. 2.10. The angles between the input feature vector $\mathbf{x}_i$ and the three filter weight vectors $\mathbf{w}_1$, $\mathbf{w}_2$ and $\mathbf{w}_3$ are $\theta_1$, $\theta_2$ and $\theta_3$. Obviously, the filter $\mathbf{w}_1$ contributes the greatest filter response among the three filters because of the smallest angle $\theta_1$. If the three filters in Fig. 2.10 correspond to filters in the last layer, the input sample will be classified to the class of $\mathbf{w}_1$. The functionality of filter weight vectors in a multi-layer CNN model is to rectify the input image so as to maximize the inter-class margin and minimize the intra-class variation layer by layer. Intuitively, the filter weight vectors represent the frequently occurring patterns in the input feature maps. Since a filter weight vector serves as a rectification function, each filter is called an "Anchor Vector" and each layer, composed of multiple filters, is called a "REctified-COrrelations on a Sphere" (RECOS).
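The geometry in Fig. 2.10 amounts to cosine similarity against a set of unit-length anchors; a minimal sketch of selecting the best-matching anchor vector:

```python
import numpy as np

def best_anchor(x, W):
    """Index of the anchor with the largest cosine similarity to x (Eq. 2.18).

    x : (d,) input feature vector.
    W : (k, d) matrix whose rows are filter weight (anchor) vectors.
    """
    x = x / np.linalg.norm(x)
    W = W / np.linalg.norm(W, axis=1, keepdims=True)
    # The largest projection corresponds to the smallest angle.
    return int(np.argmax(W @ x))
```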
CNN for 3D Shapes. Inspired by CNN models for 2D images, many CNN models for 3D shapes have been proposed in the recent literature. These methods are based on 2D view data [79], [87] or volumetric input data [60], [75], [97]. The view-based approach renders a 3D shape into multiple views, so that classifying a 3D shape becomes analyzing a bundle of views collectively. However, there are two challenging problems for the view-based approach. First, it needs to remove the rotational variance caused by the order of the views; simply concatenating the features of multiple views cannot resolve this problem. Second, the importance of different views is undefined if features from all views are treated equally. The Multi-View Convolutional Neural Network (MVCNN) method [87] renders a 3D shape into 12 or 80 views. The network is based on the structure of the VGG or AlexNet models [82], [41]. To remove the rotational variance, a view-pooling layer is inserted between the last convolutional layer and the first fully connected layer. The view-pooling layer extracts the element-wise maximum across views so that the feature maps from all views are merged into a single feature map. Obviously, the view pooling is invariant under different orders of the input
Figure 2.11: The flow chart of the MVCNN model [87].
Figure 2.12: The saliency maps of two examples, a dresser and a monitor, calculated by [87]. The first and third rows are the original views, with the average saliency score shown below each view. The second and fourth rows are the pixel-wise saliency maps for each view; darker pixels indicate higher saliency values. The three views with the highest saliency scores are boxed in blue.
views. Finally, MVCNN fine-tunes the VGG or AlexNet network models on the 3D shape dataset. The flowchart of MVCNN is illustrated in Fig. 2.11.
Another advantage of using the view-pooling layer is that it defines the importance of the input views automatically in the back-propagation stage. Following the saliency definition in the back-propagation stage, the pixel-wise saliency of the input views is defined in [87] as
$$[w_1, w_2, \dots, w_K] = \left[ \left.\frac{\partial F_c}{\partial I_1}\right|_S, \left.\frac{\partial F_c}{\partial I_2}\right|_S, \dots, \left.\frac{\partial F_c}{\partial I_K}\right|_S \right], \qquad (2.19)$$
where $\{I_1, I_2, \dots, I_K\}$ are the $K$ views of a 3D shape $S$ and $\{w_1, w_2, \dots, w_K\}$ are the saliency maps of the $K$ views, respectively. The saliency maps of an exemplar dresser and monitor calculated by [87] are shown in Fig. 2.12. The texture on the two examples is detected and given high saliency scores.
View-based methods can preserve the high resolution of 3D shapes since they leverage the view projection and lower the complexity from 3D to 2D. Furthermore, a view-based CNN can be fine-tuned from a CNN pretrained on large amounts of 2D images. However, the view-based CNN has two potential shortcomings. First, the surface of a 3D shape can be affected by shading effects. Second, reconstructing the relationship among views is difficult since the 3D information is lost after the view-pooling process.
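Stripped of the surrounding network, the view-pooling operation itself is an order-invariant element-wise maximum over the per-view feature maps; a minimal sketch:

```python
import numpy as np

def view_pool(view_features):
    """Merge per-view feature maps by element-wise maximum.

    view_features : (V, C, H, W) array of feature maps from V views.
    Returns a (C, H, W) map that is invariant to the order of the views.
    """
    return view_features.max(axis=0)
```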
The volume-based approach voxelizes a 3D mesh model into a 3D representation. Since the input data is extended into the 3D domain, the traditional CNN approach for 2D images can be generalized accordingly. The VoxelNet method [60], the pioneer of the volume-based approach, consists of two convolutional layers and one fully connected layer. In the first convolutional layer, the filter size is $5 \times 5 \times 5$ with 32 filters. The second convolutional layer has 32 filters of dimension $3 \times 3 \times 3 \times 32$. The fully connected layer contains 128 neurons. The flow chart of the VoxelNet method is shown in Fig. 2.13.
Figure 2.13: The flow chart of the VoxelNet method [60].
The volume-based approach has two drawbacks. First, to control the computational complexity, the resolution of a 3D voxel model is much lower than that of its corresponding 2D view-based representation. As a result, the high frequency components of the original 3D mesh are sacrificed. Second, since there are few pretrained CNN models for 3D data, volume-based networks have to be trained from scratch. Although a volumetric representation preserves the 3D structural information of an object, the performance of classifying volumetric models directly is still lower than that of classifying the corresponding view-based 2D representations.
2.5 Review of Existing Datasets
2.5.1 Performance Measurements
Measurements for Retrieval Systems. The performance measurement of a retrieval system examines the consistency between the retrieved objects and human visual experience. Many measurements have been proposed to reflect different user experiences. Suppose a dataset has $N$ samples, represented by $X = \{x_1, x_2, \dots, x_N\}$, and $M$ classes, $C = \{c_1, c_2, \dots, c_M\}$. A class $c_k$ has $m_k$ samples, and the class of a sample $x_i$ is $c_{x_i}$. Furthermore, a testing query set including $P$ query samples is represented by $Q = \{q_1, q_2, \dots, q_P\}$. By sending $q_j$ into a retrieval system, the ranked retrieved samples are $R = \{r_{j1}, r_{j2}, \dots, r_{jN}\}$. Some popular measurements [80], [44], [78] are explained below.
Nearest neighbor: the accuracy of the first retrieved sample belonging to the same class as the query sample. It is expressed as

$$NN(Q) = \frac{\sum_{j} I(r_{j1} \in c_{q_j})}{P}. \qquad (2.20)$$

The best possible score is 100%. This measurement also reflects the performance a feature has in a nearest neighbor classifier.
First Tier: the accuracy of the top $K$ retrieved samples belonging to the same class as the query sample, where $K$ is the number of samples in the query's class $c_{q_j}$. The First Tier score is defined as:

$$FT(Q) = \frac{1}{P} \sum_{j} \frac{\sum_{i=1}^{m_{q_j}} I(r_{ji} \in c_{q_j})}{m_{q_j}}. \qquad (2.21)$$

The best possible First Tier score is 100%.
Second Tier (Bull's eye score): the accuracy of the top $K$ retrieved samples belonging to the same class as the query sample. To be more relaxed than the First Tier score, $K$ is set to twice the number of samples in the query's class $c_{q_j}$. The Second Tier score is defined as:

$$ST(Q) = \frac{1}{P} \sum_{j} \frac{\sum_{i=1}^{2m_{q_j}} I(r_{ji} \in c_{q_j})}{m_{q_j}}. \qquad (2.22)$$

The best possible Second Tier score is also 100%. The Second Tier score is also called the bull's eye score in the 2D shape retrieval problem.
Precision and Recall: precision measures the retrieval accuracy for different values of the top $K$. For a fixed $K$, recall is the percentage of retrieved relevant samples over the total relevant samples in the dataset. Mathematically, they are defined as:

$$P_K(Q) = \frac{1}{P} \sum_{j} \frac{\sum_{i=1}^{K} I(r_{ji} \in c_{q_j})}{K}, \qquad (2.23)$$

$$R_K(Q) = \frac{1}{P} \sum_{j} \frac{\sum_{i=1}^{K} I(r_{ji} \in c_{q_j})}{m_{q_j}}. \qquad (2.24)$$

By altering $K$, precision and recall trace out a 2D curve. The maximum value of the horizontal (recall) axis is fixed at 1.
E-Measure: an alternative, composite measurement of precision and recall under a fixed $K$. It is motivated by the user experience that the first page of retrieved samples is more important than those on later pages. The E-Measure is expressed as:

$$E = \frac{2}{\frac{1}{P_K} + \frac{1}{R_K}}. \qquad (2.25)$$

$K$ is predefined for a dataset as the number of samples designed to be shown on every page. The best possible E-Measure score is 1.0, and a higher score indicates a better retrieval result.
Discounted Cumulative Gain (DCG): motivated by the user experience that retrieved samples at higher ranks are more important than those at lower ranks. Mathematically, the discounted cumulative gain of a query at its $i$th retrieval result is defined as:

$$DCG(q_j, i) = \begin{cases} I(r_{ji} \in c_{q_j}), & \text{if } i = 1, \\ DCG(q_j, i-1) + \dfrac{I(r_{ji} \in c_{q_j})}{\log_2(i)}, & \text{otherwise.} \end{cases} \qquad (2.26)$$

The final DCG of each query is normalized by its maximum possible score:

$$DCG(q_j) = \frac{DCG(q_j, N)}{1 + \sum_{i=2}^{m_{q_j}} \frac{1}{\log_2(i)}}. \qquad (2.27)$$
Top $k$ consistency: the number of samples consistent with the query class at each $k$th retrieval position. This measurement provides a more detailed view for a uniformly distributed dataset. If the number of samples in every class is the same, the $k$th consistency is defined as:

$$C_k = \sum_{j=1}^{P} I(r_{jk} \in c_{q_j}). \qquad (2.28)$$
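A minimal sketch of the rank-based scores above (NN, First Tier and Second Tier), assuming `ranked` holds the gallery labels of each query's ranked retrieval list and `class_size` maps each class to its sample count:

```python
import numpy as np

def retrieval_scores(ranked, q_labels, class_size):
    """NN (Eq. 2.20), First Tier (Eq. 2.21) and Second Tier (Eq. 2.22).

    ranked     : (P, N) array; row j lists gallery labels in rank order.
    q_labels   : (P,) query class labels.
    class_size : dict mapping a class label to its number of samples.
    """
    P = len(q_labels)
    nn = ft = st = 0.0
    for j in range(P):
        m = class_size[q_labels[j]]
        hits = ranked[j] == q_labels[j]
        nn += hits[0]
        ft += hits[:m].sum() / m
        st += hits[:2 * m].sum() / m
    return nn / P, ft / P, st / P
```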
Measurements for Classification Systems. The measurements for a 3D shape classification system are relatively simpler than those for a retrieval system. In the classification problem, a dataset is split into a training set with $I$ samples, $X = \{x_1, x_2, \dots, x_I\}$, and a testing set with $J$ samples, $Y = \{y_1, y_2, \dots, y_J\}$. There are $M$ classes, $C = \{c_1, c_2, \dots, c_M\}$, and the $k$th class has $m_k$ testing samples. The ground-truth class labels for the testing dataset are represented by $L = \{l_1, l_2, \dots, l_J\}$, and the classifier function is expressed by $f(\cdot)$. To handle datasets with non-uniform distributions, two popular measurements are adopted: Average Instance Accuracy (AIA) and Average Class Accuracy (ACA).
The ACA score averages the prediction accuracy over classes. It is expressed as:

$$ACA(Y) = \frac{1}{M} \sum_{c_k \in C} \frac{1}{m_k} \sum_{y_j \in c_k} \mathbb{1}(f(y_j) = l_j), \qquad (2.29)$$

where $\mathbb{1}(\cdot)$ is an indicator function.
The AIA score takes the average of the prediction accuracy over testing samples:

$$AIA(Y) = \frac{1}{J} \sum_{y_j \in Y} \mathbb{1}(f(y_j) = l_j). \qquad (2.30)$$
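Both scores reduce to a few lines given the predicted and ground-truth labels; a minimal sketch:

```python
import numpy as np

def aca_aia(pred, true):
    """Average Class Accuracy (Eq. 2.29) and Average Instance Accuracy (Eq. 2.30)."""
    pred, true = np.asarray(pred), np.asarray(true)
    aia = np.mean(pred == true)                  # average over instances
    per_class = [np.mean(pred[true == c] == c) for c in np.unique(true)]
    aca = float(np.mean(per_class))              # average over classes
    return aca, aia
```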
2.5.2 2D Shape Datasets
MPEG-7 Shape Dataset. The MPEG-7 shape dataset [44] contains 1400 shape samples in 70 independent classes, with 20 shape samples per class. It is a popular 2D shape dataset because of its large variety of intra-class variations and inter-class similarities. The primary intra-class variations in the MPEG-7 shape dataset include contour noise, contour deformation, multiple projection views and articulation. Examples from 48 classes of the MPEG-7 shape dataset are shown in Fig. 2.14, where each class is represented by two samples.
The standard measurement for the MPEG-7 shape dataset is the bull's eye score. The performances of the state-of-the-art methods are shown in Table 2.1.
Kimia99 Shape Dataset. Compared with the MPEG-7 shape dataset, the Kimia99 shape dataset [78] introduces more artificial occlusion and part distortion. It has 99 shapes in total, uniformly distributed into 9 classes. The examples
Figure 2.14: Examples of MPEG-7 shape dataset. Every class is represented by two
instances.
Figure 2.15: Examples of Kimia99 shape dataset. Every class is represented by two
instances.
of all classes in the Kimia99 shape dataset are shown in Fig. 2.15, where each class is represented by two samples.
Kimia99 is a relatively small dataset. Its measurement is the top $k$ consistency, and the performances of several state-of-the-art methods are shown in Table 2.2.
Table 2.1: Bull’s eye scores of several state-of-the-art methods for the MPEG-7 dataset.
Method (without diffusion process) Bull’s eye score
CSS [61] 75.44%
Optimized CSS [62] 81.12%
Contour segment [4] 84.33%
IDSC [54] 85.40%
Triangle area [1] 87.23%
Shape tree [27] 87.70%
ASC [55] 88.30%
HF [95] 89.66%
AIR [32] 93.67%
Method (with diffusion process)
IDSC+GT [10] 91.61%
IDSC+LCDP [103] 93.32%
ASC+LCDP [55] 95.96%
IDSC+SC+Co-Transduction [9] 97.72%
AIR+MNN [74] 99.89%
AIR+TPG [104] 99.90%
AIR+DPRR [25] 100.00%
Tari1000 Shape Dataset. The Tari1000 shape dataset borrows a subset of shapes from the MPEG-7 shape dataset. Beyond these shapes, the dataset adds more shapes with larger articulation-based intra-class variations, so the robustness of a retrieval method against articulation can be examined on this dataset. The Tari1000 shape dataset has 1000 shapes in 50 classes. Similar to MPEG-7, it has 20 shapes per class. Examples from 15 classes of the Tari1000 shape dataset are shown in Fig. 2.16, where each class is represented by two samples.
As with the MPEG-7 shape dataset, the bull's eye score is used to measure the retrieval performance on the Tari1000 shape dataset. The scores of the state-of-the-art approaches are listed in Table 2.3.
Table 2.2: Top k consistency of several shape retrieval methods for the Kimia99 dataset.
Method 1st 2nd 3rd 4th 5th 6th 7th 8th 9th 10th
SC[12] 97 91 88 85 84 77 75 66 56 37
Gen. Model [92] 99 97 99 98 96 96 94 83 75 48
CPDH+EMD [81] 98 94 95 92 90 88 85 84 71 52
Path Similarity [7] 99 99 99 99 96 97 95 93 89 73
HC & KFPs 92 93 95 96 93 96 96 90 96 76
Shock Edit [78] 99 99 99 98 98 97 96 95 93 82
Triangle Area [1] 99 99 99 98 98 97 97 98 94 79
Shape Tree [27] 99 99 99 99 99 99 99 97 93 86
Symbolic Rep. [22] 99 99 99 99 99 99 99 97 93 86
IDSC[54] 99 99 99 98 98 97 97 98 94 79
IDSC+GT[10] 99 99 99 99 99 99 99 99 97 99
Figure 2.16: Examples of Tari1000 shape dataset. Every class is represented by two
instances.
2.5.3 3D Shape Datasets
SHREC Shape Dataset. The 3D Shape Retrieval Contest (SHREC) [31], [93], [14], [47], [50] is a contest held under the workshops of the Eurographics conference. It has been held annually since 2006. Each contest contains several tracks, such as generic 3D shape retrieval,
Table 2.3: Bull’s eye scores of several shape retrieval methods for the Tari1000 dataset.
Method Bull’s eye score
SC [12] 94.17%
IDSC [54] 95.33%
ASC [55] 95.44%
IDSC+GT [10] 99.35%
IDSC+LCDP [9] 99.70%
IDSC+DDGM+Co-Transduction [9] 99.995%
non-rigid shape retrieval, partial model retrieval, sketch-based shape retrieval and so on. We focus on the generic 3D shape retrieval problem. Each year, a new dataset was proposed, focusing on enlarging the number of samples and rearranging shapes to include more shape variations. Table 2.4 gives an overview of the performances and datasets from 2009 to 2014. We use the First-Tier score as the common measurement for all contests.
Table 2.4: Overviews of SHREC from year 2009 to year 2014.
Year Track No. of classes No. of shapes per class
2014 [50] Large-scale 171 53
2012 [47] Generic 60 20
2010 [14] Large-scale 54 Non-uniform
2010 [93] Generic 43 Non-uniform
2009 [31] Generic 40 18
Year Total no. of shapes Best FT Best method
2014 [50] 8987 52% DBSVC
2012 [47] 1200 66% DG1SIFT
2010 [14] 10000 55% CM-BoF
2010 [93] 3168 64% CM-BoF
2009 [31] 720 73% MVDL
In the table above, view-based features using the Bag-of-Words method (DBSVC, DG1SIFT and CM-BoF) achieved the best retrieval performances in the recent
Figure 2.17: Examples of the SHREC12 generic 3D shape dataset. Every class is represented by two instances.
contests. However, the overall FT performances indicate that the current methods cannot resolve the inter-class similarities and intra-class variations effectively.
The SHREC12 generic 3D shape dataset [47] is a popular dataset because of its adequate number of models, diverse shape variations and uniform distribution of samples in each class. Therefore, in this section, we describe this 3D shape dataset in more detail. The SHREC12 generic 3D shape dataset consists of 1200 shapes in 60 classes, as shown in Table 2.4. The meshes are collected from four fundamental datasets: the SHREC11 generic 3D benchmark [26], the SHREC10 generic 3D warehouse [93], the Princeton Shape Benchmark [80] and the SHREC07 watertight shape benchmark [30]. The authors carefully arranged these models and selected 1200 of them to guarantee enough diversity. Examples of this dataset are shown in Fig. 2.17. It is obvious that the shape variations in this 3D shape dataset are much larger than those of the 2D shape examples in Section 2.5.2.
Table 2.5: Retrieval performances of the five participants in the SHREC12 generic shape retrieval contest.
Method NN FT ST E DCG
LSD-sum [8] 0.517 0.232 0.327 0.224 0.565
ZFDR [48] 0.818 0.491 0.621 0.442 0.776
3DSP L2 1000 chi2 [47] 0.662 0.367 0.496 0.346 0.678
DVD+DB+GMR [47] 0.828 0.613 0.739 0.527 0.833
DG1SIFT [67] 0.879 0.661 0.799 0.576 0.871
There are five participants in the SHREC12 generic shape retrieval contest. Some of them use different parameters to measure the performances of their methods. Table 2.5 lists their performances using five measurements: Nearest-Neighbor, First-Tier, Second-Tier, E-Measure and Discounted Cumulative Gain.
ShapeNet Dataset. With the rapid growth of supervised classification methods, larger, more uniform and more general datasets are in demand. The ShapeNet dataset was proposed in [97]. It is constructed from multiple sources, such as the Yobi3D search engine, the PSB dataset and so on. Mis-categorized shapes are corrected using Amazon Mechanical Turk. Consequently, the ShapeNet dataset has 151,128 shapes categorized into 660 classes. To acquire a high-quality subset for the classification task, the authors chose a subset of the dataset, called the ModelNet40 shape dataset, which has 40 common classes with 9843 training samples and 2468 testing samples. Importantly, every shape is normalized along the Z direction. Every shape has two representations, mesh and volumetric data. Examples
Figure 2.18: Examples of ShapeNet 3D shape dataset. Every class is represented by two
instances.
Table 2.6: Classification performance measured by ACA and AIA scores of several
state-of-the-art methods for the ModelNet40 dataset.
Volume-based methods ACA AIA
3DShapeNets [97] 77.30% -
VoxelNet [60] 83.01% 87.40%
3D-Gan [96] 83.30% -
SubV olume[75] 86.00% 89.20%
AniProbing [75] 85.60% 89.90%
View-based methods ACA AIA
DeepPano [79] 77.63% -
GIFT [6] 83.10% -
MVCNN [87] 90.10% -
Pairwise [35] 90.70% -
FusionNet [34] 90.80% -
of the ModelNet40 dataset are shown in Fig. 2.18. Each class is represented by two
volumetric models.
We show the classification performances of both view-based and volume-based methods from the recent literature on the ModelNet40 dataset in Table 2.6. The measurements are the AIA and ACA scores.
Chapter 3
A Two-Stage Shape Retrieval (TSR)
Method with Global and Local
Features
3.1 Introduction
2D shapes, also known as silhouette images, are often encountered in computer vision tasks such as manufactured component recognition and retrieval, sketch-based shape retrieval, medical image analysis, etc. Given a 2D shape as the query, a shape retrieval system retrieves ranked shapes from a gallery set according to a certain similarity measure between the query shape and shapes in the retrieval dataset, called gallery shapes. The performance is evaluated by the consistency between the ranked shapes and human interpretation. The 2D shape retrieval problem is challenging due to a wide range of shape variations, including articulation, noise, contour deformation, topological transformation and multiple projection angles. It is worthwhile to emphasize that our research addresses a retrieval problem with no labeled data at all. It is very different from deep learning architectures, such as that in [41], which rely on a huge amount of labeled data for training.
Traditionally, the similarity between two shapes is measured using global or local features that capture shape properties such as contours and regions. Global features include the Zernike moment [39] and the Fourier descriptor [109]. They are, however, not effective in capturing local details of shape contours, resulting in low discriminative power. Recent research efforts have focused on the development of more powerful local features and post-processing techniques (e.g., diffusion). Substantial progress has been made in this area and will be briefly reviewed below.
The shape context (SC) method [12] describes a contour point by its relationship to other contour points in a local log-polar coordinate system. However, the Euclidean distance used to construct the local coordinate system is sensitive to articulation variations. The inner-distance shape context (IDSC) method [54] attempts to resolve the articulation problem by using the inner-distance between two points on a contour. The aspect shape context (ASC) method [55] and the articulation invariant representation (AIR) method [32] extend IDSC to account for shape interior variations and projection variations, respectively. Fast computation of the elastic geodesic distance in [86] for shape retrieval was recently studied in [23]. Although local-feature-based methods capture important shape properties, their locality restricts discrimination among shapes at the global scale. Consequently, their retrieval results may include globally irrelevant shapes at high ranks. To illustrate this claim, four exemplary query shapes, given in the leftmost column of each row, and their top 10-ranked retrieval results are displayed from left to right in rank order in Figs. 3.1(a)-(d). The results of the AIR method are shown in the first row of each subfigure. Apparently, these retrieved results are against human intuition.
Post-processing techniques such as the diffusion process (DP) [25], [73], [74], [103],
[104] have been proposed to compensate for errors arising from local-features-based
shape retrieval methods. The DP treats each sample as a node and the similarity between
any two samples corresponds to a weighted edge. All samples form a connected graph,
called a manifold, and affinities are diffused along the manifold to improve measured
similarities. However, the DP has its limitations. When shapes of two classes are mixed
in the feature space, it cannot separate them properly. Also, when a query is far away
(a) Apple
(b) Key
(c) Cup
(d) Bone
Figure 3.1: Comparison of retrieved shapes using AIR (the first row), AIR+DP (the
second row) and the proposed TSR method (the third row) with respect to four query
shapes: (a) Apple, (b) Key, (c) Cup, and (d) Bone.
from the majority of samples in its class, it is difficult to retrieve shapes of the same class with high ranks. For example, AIR is confused among several classes, as shown in the first row of Figs. 3.1(a)-(d). Even the best DP does not help the retrieval results much
as shown in the second row of Figs. 3.1 (a)-(d). Clearly, the DP is constrained by the
underlying feature space.
Motivated by the above observations, we develop a more robust shape retrieval system with two main contributions in this paper. First, we consider both global and
local features. In order to obtain more powerful and robust global features, we develop
a new skeleton-based feature and adopt two traditional features. Second, we propose a
two-stage shape retrieval (TSR) system that consists of: I) the irrelevant cluster filtering
(ICF) stage and II) the local-features-based matching and ranking (LMR) stage. For
Stage II, we can adopt any state-of-the-art shape retrieval solution, and our focus is on
the design of Stage I. The robustness of the proposed TSR system can be intuitively
explained below. In the ICF stage (Stage I), we attempt to cluster gallery shapes that are
similar to each other by examining global and local features simultaneously. Two con-
tradictory cases may arise. Shapes that are close in the local feature space can be distant
in the global feature space, and vice versa. Here, we resolve the contradiction with a
joint cost function that strikes a balance between the two distances. It is convenient to
use the ICF stage to filter out unlikely shapes. In particular, shapes that are close in the
local feature space but distant in the global feature space can be removed in this stage. Then,
the TSR system will avoid the same mistake of traditional one-stage matching methods
when it proceeds to its second LMR stage. The retrieved results of the TSR system are
shown in the third row of Figs. 3.1 (a)-(d). All wrongly retrieved results are corrected
by TSR. The novelty of our work lies in two areas: 1) identifying the cause of unreliable
retrieval results in all state-of-the-art methods for the 2D shape retrieval problem, and
2) finding a new system-level framework to solve this problem.
The rest of this paper is organized as follows. The relationship between the TSR method and multi-view learning techniques is discussed in Section 3.2. The TSR method is described in Section 3.3. Experimental results are shown in Section 3.4. Finally, concluding remarks are given in Section 3.5.
3.2 Related Work
In this paper, we consider both global and local features to improve the shape retrieval performance. Multi-view learning techniques [99], which combine multiple features, have been studied and applied to various applications such as image classification and retrieval. LMIB [100] uses the theory of the information bottleneck to model the multi-view learning problem. MISL [101] exploits the latent intact space of multi-view samples using the complementary information among them. LM3FE [59] transforms the original multiple features into multiple latent spaces and combines them with weights to predict the label of an input image; the transformation matrices and combination weights are learned simultaneously. MVMC [58] proposes a multi-view matrix completion method to solve the transductive multi-label image classification problem, in which a weight set is learned to effectively combine the matrix completion outputs from different features. GTDA [90] combines multi-Gabor functions and serves as a pre-processing step for a conventional classifier to solve the gait recognition problem. It achieves convergence in the training stage by integrating the differential scatter discriminant criterion into the tensor representation. WSDR [102] integrates the angle and distance principle to reduce the feature dimension. MHDSC [56] integrates Hessian regularization with discriminative sparse coding to address the image annotation problem. By using multiple features in the training stage, a testing image is coded by multiple trained dictionaries. Multi-modal sparse coding has also been applied to the image retrieval problem in [105]. To predict click features for images in a dataset without the ground truth, this method constructs multiple dictionaries using different visual features. The obtained sparse codes are further refined through an iterative two-stage optimization procedure. Finally, these sparse codes are combined by a voting strategy to predict click features. VCLTR [106] proposes a ranking model to integrate visual features into click features by alternately minimizing the corresponding two cost functions. ABRS-SVM [91] bootstraps both samples and features to create multiple weak classifiers. Two aggregation models, based on majority voting and the Bayes sum, are adopted to improve relevance-feedback-based image retrieval.
Our method follows the principles of consensus and complementarity adopted by multi-view learning [99], yet it differs from previous methods in two aspects. First, previous methods adopt supervised or semi-supervised learning schemes that demand a huge amount of labeled data for training. There is no labeled data in the 2D shape retrieval problem, which creates more challenges. To overcome this difficulty, we first use the local features to cluster unlabeled data and then use the global features to build links among clusters to obtain reliable predictions based on the global properties of 2D shapes. Second, instead of combining multiple features directly to obtain the final score, we develop the ICF method to remove irrelevant shapes in Stage I. To achieve this goal, we design a set of global features to compensate for the weakness of local features. Afterwards, we exploit both global and local features to define a cost function for the final retrieval decision against a query.
3.3 Proposed TSR Method
3.3.1 System Overview
An overview flow chart of the proposed TSR system is given in Fig. 3.2. As shown in
the figure, the system consists of two stages. Stage I of the TSR system is trained in an
off-line process with the following three steps.
1. Initial clustering. All samples in the dataset are clustered using their local
features.
2. Classifier Training. Samples close to the centroid of each cluster are selected as
training data. Their extracted global features are used to train a random forest
classifier.
3. Relevant Clusters Assignment. The trained random forest classifier assigns rel-
evant clusters to all samples in the dataset so that each sample is associated with
a small set of relevant clusters.
In the on-line query process, we extract both global and local features from a query
shape and, then, proceed with the following two steps:
1. Predicting Relevant Clusters. Given a query sample, we assign it a set of rele-
vant clusters based on a cost function. The cost function consists of two negative
log likelihood terms. One likelihood reflects the relevant cluster distribution
of the query sample itself while the other is the mean of the relevant cluster
distributions of its local neighbors. The ultimate relevant clusters are obtained by
thresholding the cost function.
2. Local Matching and Ranking. We conduct matching and ranking for samples
in the relevant clusters with a distance in the local feature space. The diffusion
process can also be applied to enhance the retrieval accuracy.
A query bell shape is given in Fig. 3.2. The traditional shape retrieval algorithm
(with Stage II only) finds birds, a bell and a beetle in the top six ranks. However, the
clusters of birds and beetles are not relevant to the bell shape as predicted by the trained
Figure 3.2: The flow chart of the proposed TSR system.
classifier. Since they are removed from the candidate set in Stage II, the mistake can be
avoided. After the processing of Stage I, the retrieved top 6 samples are all bell shapes.
We will describe the processing in Stages I and II in detail below.
3.3.2 The ICF Stage (Stage I)
Shape Normalization. Each shape is normalized so that it is invariant against trans-
lation, scaling and rotation. Translational invariance is achieved by aligning the shape
Figure 3.3: Shape normalization results of six classes.
centroid with the image center. For rotational invariance, we align the dominant reflection symmetry axis, which passes through the shape centroid and has the maximum symmetry value, vertically. After rotational normalization, we set the larger side (width or height) of the shape to unity for scale invariance. Examples of shape normalization are given in Fig. 3.3, where the first and third rows are the original shapes while the second and fourth rows are their corresponding normalized results. Although normalization based on the dominant reflection symmetry axis works well in general in our experiments, it is worth pointing out that, if samples of a class do not contain a clear dominant reflection axis, their normalized poses may not be well aligned (e.g., the three running men in the third row of Fig. 3.3). However, there still exist rotation-invariant global features in TSR to compensate for the articulation variation. After shape normalization, we perform hole-filling and contour smoothing to remove interior holes and contour noise, respectively.
Global Features. To capture the global layout of a shape, we consider three feature types: 1) skeleton features $f_s$, 2) wavelet features $f_w$, and 3) geometrical features $f_g$.
Figure 3.4: Illustration of skeleton feature extraction (from left to right of each image):
the original shape, the initial skeleton and the pruned skeleton.
For skeleton features, we extract the basic structural information of a shape while ignoring minor details due to contour variations. We first apply the thinning algorithm [72] to obtain the initial skeleton. Then, a pruning process is developed to extract a clean skeleton without over-simplification. We show several input shapes and their initial and pruned skeletons in Fig. 3.4. After obtaining a clean skeleton, we consider four types of salient points: 1) turning points (with a sharp curvature change), 2) end points, 3) T-junction points and 4) cross-junction points. The numbers of these four salient points form a 4D skeleton feature vector denoted by $f_s$. The skeleton features of the six shapes in Fig. 3.4 are given in Table 3.1. They are rotation-, translation- and scale-invariant.
For wavelet features, Haar-like filters were adopted for face detection in [94]. Motivated by this idea, we adopt the five Haar-like filters shown in Fig. 3.5 to extract wavelet features. For a normalized shape, the first two filters are used to capture the 2-band symmetry while the middle two filters are used to capture the 3-band symmetry
58
Table 3.1: Skeleton features for the six shapes in Fig. 3.4.
Nos. of Salient pts bone beetle chicken device0 dog fish
turning pts 2 4 3 0 1 0
end pts 2 6 4 5 5 2
T-junction pts 0 2 2 0 3 0
cross-junction pts 0 1 0 1 0 0
Figure 3.5: Five Haar-like filters used to extract wavelet features of a normalized shape.
horizontally and vertically. The last one is used to capture the cross-diagonal symmetry. The responses of the five filters form a 5D wavelet feature vector denoted by $f_w$.
Furthermore, we incorporate the following four geometrical features [72]: 1) aspect ratio, 2) circularity, 3) symmetry and 4) solidity. The aspect ratio is the ratio of the width and height of the bounding box of a shape. The circularity is set to $4\pi A/P^2$, where $A$ is the area and $P$ is the perimeter of the shape. The aspect ratio and circularity are closer to one if a shape is closer to a square or a circle; if a shape is closer to a long bar, its aspect ratio and circularity are closer to zero. The symmetry is computed based on the dominant reflection symmetry axis of a shape. The solidity of a shape is the ratio of its area to the area of its convex hull. If a shape is a convex set, its solidity is unity; otherwise, it is less than one. These four geometric features form a 4D geometrical feature vector denoted by $f_g$.
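A minimal sketch of three of these features computed on a binary shape mask is given below; the pixel-count perimeter estimate and the min/max convention for the aspect ratio are simplifying assumptions, and the symmetry feature is omitted since it depends on the dominant-axis search described above:

```python
import numpy as np
from scipy.spatial import ConvexHull

def geometric_features(mask):
    """Aspect ratio, circularity (4*pi*A/P^2) and solidity of a binary mask."""
    ys, xs = np.nonzero(mask)
    width = xs.max() - xs.min() + 1
    height = ys.max() - ys.min() + 1
    aspect = min(width, height) / max(width, height)

    area = float(mask.sum())
    # Crude perimeter estimate: foreground pixels with a background 4-neighbor.
    padded = np.pad(mask.astype(bool), 1)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                padded[1:-1, :-2] & padded[1:-1, 2:])
    perimeter = float((mask.astype(bool) & ~interior).sum())
    circularity = 4 * np.pi * area / perimeter**2

    hull = ConvexHull(np.column_stack([xs, ys]))
    solidity = area / hull.volume    # ConvexHull.volume is the area in 2D
    return aspect, circularity, solidity
```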
Shape Clustering. In the traditional 2D shape retrieval formulation, the shapes in the dataset are not labeled. Under this extreme case, we use the spectral clustering algorithm [64] to reveal the underlying relationships between gallery shapes. The local feature is strong at grouping locally similar shapes but it is sensitive to local variances as discussed
Figure 3.6: Several clustered MPEG-7 dataset shapes using the spectral clustering
method applied in the AIR feature space.
in Section 3.1. In contrast, the global feature is powerful at differentiating globally dissimilar shapes but weak at finding locally similar shapes. Thus, combining the two at an early stage (say, initial clustering and classifier training) tends to cause confusion and lower the performance. For this reason, we use the local feature in the initial clustering but the global features in classifier training. For the MPEG-7 dataset, shapes in several clusters obtained using the AIR feature are shown in Fig. 3.6. Some clusters look reasonable while others do not. In fact, any unsupervised clustering method encounters the following two challenges. First, uncertainty occurs near cluster boundaries, so samples near boundaries have a higher probability of being wrongly clustered. Second, the total number of shape classes is unknown. When the cluster number is larger than the class number in the database, the clustering algorithm creates sub-classes or even mixed classes. The relationship between them has to be investigated.
To address the first challenge, we extract the $N_i$ samples closest to the centroid of the $i$th cluster and assign them a cluster label. Clearly, samples sharing the same cluster label are close to each other in the feature space. There is a trade-off in choosing a proper value of $N_i$: a smaller $N_i$ guarantees higher clustering accuracy but fewer gallery samples are assigned cluster labels. Empirically, we set the value of $N_i$ to one half of the size of the $i$th cluster. To address the second challenge, we use local features to conduct clustering and assign cluster labels to a subset of samples. These labeled samples are used to train a random forest classifier [13] on their global features. Finally, all gallery shapes are treated as testing samples. The random forest classifier is used to predict the probability of each cluster type for them by voting. In this way, samples that are clustered in the local feature space can be linked to multiple clusters probabilistically due to similarities in the global feature space.
The overfitting problem is already suppressed by our low-dimensional global feature vector (13D) and the bootstrapping algorithm used in the random forest classifier. However, when the training data size is small, it is difficult to predict outliers in testing samples accurately. It is possible for the classifier to terminate a decision process at an early stage due to a dominating feature shared between the testing sample and the training samples. One example is shown in Fig. 3.7, where the "sword-like" fish sample in the red box is clearly an outlier with respect to the other fish samples. If all fish samples except the outlying one are used as training samples, the aspect ratio is an important feature for the fish class since its value is consistent among all training fish samples. When we use the sword-like fish as a testing sample, this feature dominates and terminates the decision process early with a wrong prediction; namely, it would be a sword rather than a fish. To overcome this problem, we train multiple random forest classifiers using four different feature subsets, namely $\{f_s, f_w, f_g\}$, $\{f_s, f_w\}$, $\{f_s, f_g\}$ and $\{f_w, f_g\}$, to suppress the impact of a dominating feature. Finally, we combine the results of these classifiers by the sum rule to obtain the final prediction.
Cluster Relevance Assignment. The output of the ICF stage includes: 1) a set of indexed clusters and 2) a soft classification (or multi-labeling) of all gallery samples. For item 1, we use the unsupervised spectral clustering algorithm to generate clusters as described above. If the class number is known (or can be estimated), it is desirable that
Figure 3.7: Illustration of a shared dominating feature (i.e., the aspect ratio) among
all training fish samples outside of the yellow box, which could terminate the decision
quickly and reject the testing sword-like fish in the red box.
the cluster number be larger than the class number. Each of these clusters is indexed by a cluster ID. For item 2, we adopt a soft classification so that each sample can be associated with multiple clusters. This is done for two reasons. First, if two sub-classes belong to the same ground-truth class, we need a mechanism to regroup them together; clearly, a hard classification process does not allow this to happen. Second, a hard classification error cannot be easily compensated, while a soft classification error is not as fatal and is likely to be fixed in the LMR stage (Stage II) of the TSR system.
We consider two relevant cluster assignment schemes below.
1) Direct Assignment
We apply the random forest classifier to both training and testing samples based on their global features. Then, the probability that the $i$th shape sample (denoted by $y_i$) belongs to the $k$th cluster (denoted by $c_k$) can be estimated by the following normalized voting result:

$$P_{rf}(y_i \in c_k) = \frac{v_k}{\sum_j v_j}, \qquad (3.1)$$

where $v_k$ is the number of votes claiming that $y_i$ belongs to $c_k$. Eq. (3.1) associates $y_i$ with its relevant clusters directly.
2) Indirect Assignment
Intuitively, a good cluster relevance assignment scheme should take both global and local features into account. This can be achieved as follows. For a query sample $y_i$, we find its $K$ nearest neighbors (denoted by $x_j$) using a certain distance function in a local feature space (e.g., the same feature space used in IDSC or AIR). Then, the probability of $y_i$ belonging to $c_k$ can be estimated by the weighted sum of the probabilities in Eq. (3.1) in the form of

$$P_{knn}(y_i \in c_k) = \frac{\sum_{x_j \in knn(y_i)} P_{rf}(x_j \in c_k)}{\sum_{c_m} \sum_{x_j \in knn(y_i)} P_{rf}(x_j \in c_m)}. \qquad (3.2)$$

Eq. (3.2) associates $y_i$ with its relevant clusters indirectly. That is, the assignment is obtained by averaging the relevant cluster assignments of its $K$ nearest neighbors. Empirically, we choose $K$ to be 1.5 times the average cluster size in the experiments.
We show an example that assigns a query apple shape to its relevant clusters in Fig. 3.8(a), whose x-axis and y-axis are the negative log functions of Eqs. (3.1) and (3.2), respectively. Every dot in Fig. 3.8(a) represents a cluster after the shape clustering process. To visualize the shapes represented by a dot, we plot a representative sample of each cluster in Fig. 3.8(b).
We see that the distance between the bat cluster and the apple cluster is short along the x-axis but long along the y-axis. This is because samples of the apple and bat clusters are interleaved in the local feature space, as evidenced by the retrieval results of AIR in Fig. 3.1(a). However, the apple and bat clusters have little intersection in the global feature space. On the other hand, the cup and apple clusters have a large intersection in the global feature space, yet their distance is far in the local feature space. It is apparent that Eqs. (3.1) and (3.2) provide complementary relevance assignment strategies for a query sample $y_i$. It is best to integrate the two into one assignment scheme. For example,
Figure 3.8: Selecting relevant clusters for a query apple shape by thresholding the cost function as in Eq. (3.4).
we can draw a line to separate relevant and irrelevant clusters with respect to the query
apple shape in this plot.
Mathematically, we define a cost function as follows:

$$J(y_i, c_k) = -\log\big(P_{knn}(y_i \in c_k)\, P_{rf}(y_i \in c_k)\big) = -\big[\log(P_{knn}(y_i \in c_k)) + \log(P_{rf}(y_i \in c_k))\big]. \qquad (3.3)$$

We compute $J(y_i, c_k)$ for all clusters $c_k$. If

$$J(y_i, c_k) < \tau, \qquad (3.4)$$

where $\tau$ is a pre-selected threshold, we say that cluster $c_k$ is a relevant cluster for query $y_i$. Otherwise, it is irrelevant.
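A minimal sketch of the relevance test in Eqs. (3.3)-(3.4), assuming the two probability vectors over clusters have already been computed for a query:

```python
import numpy as np

def relevant_clusters(p_rf, p_knn, tau=7.0):
    """Eqs. (3.3)-(3.4): keep clusters whose joint negative-log cost is below tau.

    p_rf, p_knn : (M,) arrays of P_rf(y in c_k) and P_knn(y in c_k).
    Returns the indices of the relevant clusters.
    """
    eps = 1e-12                                        # guard against log(0)
    J = -(np.log(p_knn + eps) + np.log(p_rf + eps))    # Eq. (3.3)
    return np.flatnonzero(J < tau)                     # Eq. (3.4)
```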
3.3.3 The LMR Stage (Stage II)
In the LMR stage, we rank the similarity of shapes in the retrieved relevant clusters by
using a local-features-based matching scheme (e.g., AIR) including a diffusion process.
We adopt the Local Constrained Diffusion Process (LCDP) from [25] in the TSR system.
The diffusion process is slightly modified to exploit the availability of relevant clusters
in the TSR system, since diffusion can then be conducted on a more reasonable manifold
thanks to the processing in Stage I.
3.4 Experimental Results
We demonstrate the retrieval performance of the proposed TSR system by conduct-
ing experiments on three shape datasets: MPEG-7, Kimia99 and Tari1000. We set the
threshold, $\epsilon$, of the cost function in Eq. (3.4) to 7 empirically in all experiments. We
also consider the incorporation of two diffusion processes in various schemes. They are
denoted by:
Table 3.2: Comparison of bull’s eye scores with different cluster numbers for the TSR
method.
Cluster Numbers (M) 16 32 48 64 80 96 112 128
TSR (ICF+AIR) 96.00% 97.54% 98.81% 99.51% 99.62% 99.85% 99.92% 99.90%
TSR (ICF+AIR+DP1) 98.32% 98.90% 99.72% 99.99% 99.99% 99.99% 100.00% 99.99%
DP1: the diffusion process proposed in [103],
DP2: the diffusion process proposed in [25].
3.4.1 MPEG-7 Shape Dataset
The MPEG-7 shape dataset [44], which remains the largest and most challenging,
contains 1400 shape samples in 70 independent classes. Samples are uniformly
distributed so that each class has $C = 20$ shape samples. The retrieval performance
is measured by the bull’s eye score, i.e., the percentage of shapes sharing the
same class with a query among the top $2C = 40$ retrieved shapes. AIR and DP1 are used
as the local feature space and the diffusion process in the TSR method, respectively.
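For reference, the bull’s eye score can be computed as in the following sketch. It is a generic implementation of the standard metric rather than code from the TSR system; dist is an assumed precomputed pairwise distance matrix and labels holds the class labels.

import numpy as np

def bulls_eye(dist: np.ndarray, labels: np.ndarray, C: int = 20) -> float:
    # Fraction of same-class shapes among the top 2C retrievals, averaged
    # over all queries (each query retrieves itself as well).
    hits = 0
    for i in range(len(labels)):
        top = np.argsort(dist[i])[:2 * C]
        hits += int(np.sum(labels[top] == labels[i]))
    return hits / (len(labels) * C)  # a perfect score equals 1.0 (100%)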
We first show the bull’s eye scores of two TSR schemes using different cluster
numbers $M$ in the shape clustering step in Table 3.2. Both TSR methods adopt the
AIR features for the distance computation. However, one of them uses DP1 while the
other does not. Since TSR (ICF+AIR+DP1) offers better performance, we choose it as
the default TSR configuration for the MPEG-7 shape dataset. Generally speaking, the
performance degrades when $M$ is small due to the loss of discriminability caused by larger
cluster sizes. The retrieval performance improves as the cluster number increases up to
112. After that, the performance saturates and could even drop slightly, which means that
we lose the advantage of clustering when the cluster size becomes too small. For the remaining
MPEG-7 dataset experimental results, we choose $M = 112$.
Table 3.3: Comparison of bull’s eye scores of several state-of-the-art methods for the
MPEG-7 dataset.
Method Bull’s eye score
CSS [61] 75.44%
IDSC [54] 85.40%
ASC [55] 88.30%
HF [95] 89.66%
AIR [32] 93.67%
IDSC+DP1 [103] 93.32%
ASC+DP1 [55] 95.96%
IDSC+SC+Co-Transduction [9] 97.72%
AIR+TPG [104] 99.90%
AIR+DP2 [25] 100.00%
Proposed TSR (ICF+AIR+DP1) 100.00%
Table 3.4: Comparison of top 20, 25, 30, 35 and 40 retrieval accuracy for the MPEG-7 dataset.
N 20 25 30 35 40
IDSC 77.21% 80.44% 82.61% 84.16% 85.40%
IDSC+DP1 88.53% 90.78% 92.03% 92.73% 93.32%
AIR 88.17% 89.99% 91.28% 92.64% 93.67%
AIR+DP2 94.42% 97.92% 98.66% 99.38% 100%
Proposed TSR 98.46% 99.09% 99.40% 99.71% 100%
The bull’s eye scores of the TSR method and several state-of-the-art methods are
compared in Table 3.3. Both TSR and AIR+DP2 reach 100%. The bull’s eye score is
one of the popular retrieval performance measures for the 2D shape retrieval problem.
However, since each MPEG-7 shape class contains 20 shape samples, the measure
of correctly retrieved samples from the top 40 ranks cannot reflect the true power of
the proposed TSR method. To push the retrieval performance further, we compare the
accuracy of retrieved results from the top 20, 25, 30, 35 and 40 ranks of TSR and several
state-of-the-art methods in Table 3.4, whose last column corresponds to the bull’s eye scores
reported in Table 3.3. The superiority of TSR stands out clearly in this table.
(a) Bird
(b) Device9
Figure 3.9: Comparison of retrieved rank-ordered shapes (left-to-right in the top row
followed by left-to-right in the second row within each black stripe). For each query
case, retrieved results of IDSC+DP1, AIR+DP2 and TSR are shown in the first, second
and third black stripes of all subfigures, respectively.
When $N = 20$, TSR can retrieve the 20 shapes of the entire class correctly for most
query samples. However, it still makes mistakes occasionally. It is worthwhile to
show these erroneous cases to gain further insights. For this reason, we
(a) Guitar
(b) Octopus
Figure 3.10: Comparison of retrieved rank-ordered shapes (left-to-right in the top row
followed by left-to-right in the second row within each black stripe). For each query
case, retrieved results of IDSC+DP1, AIR+DP2 and TSR are shown in the first, second
and third black stripes of all subfigures, respectively.
conduct error analysis in Figs. 3.9(a)-(b) and Figs. 3.10(a)-(b). The performance of
IDSC+DP1 is clearly worse than that of AIR+DP2 and TSR. AIR+DP2 makes mistakes
between bird/bell, circle/device8, guitar/frog and octopus/fish as shown in the second
Figure 3.11: Comparison of precision-and-recall curves of several methods for the
MPEG-7 dataset.
black stripes of all subfigures. This type of mistake is not consistent with human visual
experience. In contrast, TSR makes mistakes between bird/truck, device9/device3, gui-
tar/spoon and octopus/device2, as shown in the third black stripes of all subfigures. These
mistakes are closer to human visual experience. Actually, the wrongly retrieved
shapes are similar to the query shape in their global attributes as a result of the spe-
cial design of the TSR system.
For further performance benchmarking, we show the precision-and-recall curves of
TSR and several methods in Fig. 3.11. We see from the figure that TSR outperforms all
other methods by a significant margin except for AIR+DP2.
Table 3.5: Comparison of top N consistency of several shape retrieval methods for the
Kimia99 dataset.
Method 1st 2nd 3rd 4th 5th 6th 7th 8th 9th 10th
SC[12] 97 91 88 85 84 77 75 66 56 37
Gen. Model [92] 99 97 99 98 96 96 94 83 75 48
Path Similarity [7] 99 99 99 99 96 97 95 93 89 73
Shock Edit [78] 99 99 99 98 98 97 96 95 93 82
Triangle Area [1] 99 99 99 98 98 97 97 98 94 79
Shape Tree [27] 99 99 99 99 99 99 99 97 93 86
IDSC[54] 99 99 99 98 98 97 97 98 94 79
IDSC+GT[10] 99 99 99 99 99 99 99 99 97 99
Proposed TSR 99 99 99 99 99 99 99 99 99 99
3.4.2 Kimia99 Shape Dataset
The MPEG-7 shape dataset contains primarily articulation and contour deformation
variations. We also conduct experiments on the Kimia99 shape dataset [78] that con-
tains other variations such as occlusions and distorted parts. However, this dataset is
relatively small. It contains 99 shapes in total, which are classified into 9 classes. Each
class has 11 shapes. For this dataset, we choose IDSC as the local feature extraction and
ranking method and DP1 as the diffusion process in the LMR stage as the default TSR
method. The number of clusters is set to 15.
The common evaluation criterion for this dataset is the top $N$ (with $N = 1, 2, \ldots, 10$)
consistency, which measures the consistency of the $N$th retrieved shape against each
query. Note that the best possible value is 99, obtained by summing the consistency over
all 99 query samples. The top $N$ consistency results of several methods are compared
in Table 3.5. The TSR method can exclude around 75% of irrelevant shapes for each query
and effectively improve the IDSC result. The TSR method achieves the highest
consistency, namely 99, for all possible $N$ values. The precision-and-recall curves of
IDSC, IDSC+DP1 and TSR are shown in Fig. 3.12. We conclude that the TSR method
Figure 3.12: Comparison of the precision and recall curves of several shape retrieval
methods for the Kimia99 dataset.
does not make any mistake in shape retrieval on the Kimia99 shape dataset, as
supported by the data in Table 3.5 and Fig. 3.12.
3.4.3 Tari1000 Shape Dataset
We test the TSR method on a new dataset called Tari1000 [3]. Tari1000 consists of 1000
shapes classified into 50 classes. Each class has 20 shapes. As compared with the MPEG-
7 dataset, Tari1000 contains more deformation and articulation variations. Here, we
adopt IDSC as the local feature extraction and ranking method and DP1 as the diffusion
process in the LMR stage as the default TSR method. The number of clusters is set to
75. The bull’s eye scores of several methods are compared in Table 3.6 and the precision
Figure 3.13: Comparison of precision and recall curves of several shape retrieval meth-
ods for the Tari1000 dataset.
and recall curves are shown in Fig. 3.13. We see that TSR achieves a perfect bull’s
eye score (100%), which demonstrates its robustness against severe articulations.
3.4.4 Unbalanced Shape Datasets
There is no non-uniformly distributed dataset available for the 2D shape retrieval prob-
lem. To test the performance on an unbalanced dataset, we adopt the scheme introduced
in [9] for the MPEG7 shape dataset. We randomly remove 10%, 20%, 40% samples
from the original dataset. For the 10% removal case, we randomly partition the dataset
into 5 subsets, where each subset has 14 classes. We randomly remove 0%, 5%, 10%,
15% and 20% for the five subsets, respectively, to achieve the target 10% removal. We
Table 3.6: Comparison of bull’s eye scores of several shape retrieval methods for the
Tari1000 dataset.
Method Bull’s eye score
SC [12] 94.17%
IDSC [54] 95.33%
ASC [55] 95.44%
IDSC+GT [10] 99.35%
IDSC+LCDP [9] 99.70%
IDSC+DDGM+Co-Transduction [9] 99.995%
Proposed TSR 100.00%
Table 3.7: Comparison of correct retrieval rates with non-uniformly distributed data.
Percentage of Random Removal 10% 20% 40%
IDSC 84.99% 84.85% 85.31%
IDSC+DP1 92.36% 91.83% 90.88%
AIR 93.21% 93.19% 93.76%
AIR+DP2 99.87% 99.69% 99.31%
Proposed TSR 99.91% 99.84% 99.64%
repeat the removal process 10 times and average the bull’s eye scores. A similar proce-
dure is adopted for the 20% and 40% removal cases. We compare the proposed method
with several state-of-the-art methods in Table 3.7, which demonstrates the robustness of
our method.
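The removal protocol can be sketched as follows. This is our illustrative re-implementation of the scheme described above; labels holds the per-sample class labels of the MPEG-7 dataset.

import numpy as np

def removal_mask(labels, rates=(0.0, 0.05, 0.10, 0.15, 0.20), seed=0):
    # Partition the classes into len(rates) subsets at random and drop the
    # given fraction of samples from each subset (10% removal on average).
    rng = np.random.default_rng(seed)
    groups = np.array_split(rng.permutation(np.unique(labels)), len(rates))
    keep = np.ones(len(labels), dtype=bool)
    for group, rate in zip(groups, rates):
        idx = np.flatnonzero(np.isin(labels, group))
        drop = rng.choice(idx, size=int(rate * len(idx)), replace=False)
        keep[drop] = False
    return keep  # boolean mask of the retained samples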
3.4.5 Complexity Analysis
Our method consists of on-line and off-line processes. We show the computational time
for two major off-line modules, i.e. initial clustering and classifier training, in Table 3.8.
Here, we use the MPEG-7 shape dataset as the example.
As to the on-line retrieval process, we adopt an existing local feature extraction and
diffusion process (items #1 and #4 in Table 3.9) and the additional cost lies in the global
feature extraction and the direct and indirect assignment processes (items #2 and #3 in
Table 3.8: Computation time of two off-line processes.
Computational time (seconds)
Initial clustering 7.65
Classifier training 13.05
Table 3.9: Computation time of four on-line processes.
Computational time (seconds/query)
Local feature extraction (IDSC) + distance calculation 0.97 + 6.75
Global feature extraction 0.29
Direct and indirect assignment 0.03
Diffusion (DP1) 4.29
Table 3.9). We show the computational time for these modules in Table 3.9 by using the
MPEG-7 shape dataset as the example. Our experiments are conducted on an Intel
Core i5-3470S CPU running at 2.9 GHz. It is apparent that the computational time of the two
additional modules together is much less than that of existing modules such as IDSC
and DP1.
3.5 Conclusion
A robust two-stage shape retrieval (TSR) method was proposed to solve the 2D shape
retrieval problem. In the ICF stage, the TSR method explores the underlying global
properties of 2D shapes. Irrelevant shape clusters are removed for each query shape. In
the LMR stage, the TSR method only needs to focus on matching and ranking within a
much smaller subset of shapes. We conducted a thorough retrieval performance evaluation
on three popular datasets: MPEG-7, Kimia99 and Tari1000. The TSR method retrieves
more globally similar shapes and achieves the highest retrieval accuracy as compared
with its benchmarking methods.
Chapter 4
3D Shape Retrieval via Irrelevance
Filtering and Similarity Ranking
(IF/SR)
4.1 Introduction
Content-based 3D shape retrieval [89] has received a lot of attention in recent years due
to a rapidly increasing number of 3D models on the Internet (e.g., Google SketchUp
and Yobi3D). Applications of 3D shape retrieval technologies include 3D model repos-
itory management, mechanical component retrieval, medical organ model analysis, etc.
Given a 3D shape model as the query, a content-based 3D shape retrieval system ana-
lyzes the query shape and retrieves ranked 3D shapes from the gallery set according to a
similarity measure. Its performance is evaluated by consistency between ranked shapes
and human interpretation. A robust and efficient 3D shape retrieval system is needed for
users to access and exploit large 3D datasets effectively.
Recently, convolutional neural network (CNN) based solutions have achieved impressive
performance by training a network using either multiple views of 3D shapes [5, 77,
79, 87, 98] or the 3D volumetric data [60, 75, 97]. However, their training procedure
demands a large amount of labeled data, which is labor-intensive. In this work, we
address the 3D shape retrieval problem using an unsupervised learning approach. It has
broader applications since no labeled data are needed.
(a) Bird
(b) Airplane
(c) Chair
(d) Wheel chair
(e) Bike
(f) Motorcycle
Figure 4.1: Illustration of intra-class variation within each row and inter-class similarity
between each pair of rows: (a) and (b), (c) and (d), (e) and (f).
The main challenge in 3D shape retrieval lies in a wide range of shape variations.
A generic 3D shape dataset such as SHREC12 [47] includes both rigid and non-rigid
shapes. Shape variations can be categorized into inter-class similarities and intra-class
variations. For the latter, we have articulation, surface deformation, noise, etc. For
example, in Fig. 4.1 (a) and (b), shapes in the bird class and shapes in the airplane
class share strong similarities while the difference within each class is also significant.
Similar inter-class similarities and intra-class variations can be observed in Fig. 4.1 (c)
and (d) for the chair and wheel chair classes, and in Fig. 4.1 (e) and (f) for the bike
and motorcycle classes.
Global and/or local features can be used to measure the similarity between two 3D
shapes. The rotation invariant spherical harmonics (RISH) [38] and the D2 shape distri-
bution [69] are two representative global features. They capture object surface properties
using the frequency decomposition and the vertex distance histogram, respectively. The
retrieval performance using global features only may degrade due to the loss of fine
shape details. The corresponding errors made by the global feature RISH are illustrated
in Fig. 4.2. Obviously, they are against human intuition. To overcome this limitation,
research efforts in recent years have focused on developing more discriminative local
features. They can be categorized into surface-based and view-based features. Surface-
based local features [15, 16, 28, 76, 84] describe a local surface region to achieve pose
oblivion, scale and orientation invariance. Although surface-based retrieval methods are
effective in handling non-rigid shape retrieval [51, 53], they are not robust against shape
artifacts that do occur in generic shape datasets. Retrieval methods using view-based
local features are favored for this reason.
View-based methods project a 3D shape into multiple views. Generally speaking,
an adequate number of view samples can represent a 3D shape well. The light field
descriptor (LFD) method [19] and the multi-view depth line approach (MDLA) [18]
represent each view by Zernike moments plus polar Fourier descriptors and depth lines,
respectively. The similarity between two shapes is measured by enumerating multiple
rotation setups. The salient local visual feature (SLVF) method [68] extracts the SIFT
points [57] from each range view. After constructing a codebook from the training pool,
one feature of a 3D shape can be represented by the histogram of SIFT points from all
views using the Bag of Words (BoW) approach. The DG1SIFT method [67] extends the
SLVF method by extracting three types of SIFT points from each view. They are the dense,
(a) Chair
(b) Desk lamp
(c) Human
Figure 4.2: Errors generated by the global feature RISH. The query is marked by a red
box as (a) chair, (b) desk lamp and (c) human. The errors are marked by yellow boxes
in each subfigure.
grid and one SIFTs. The depth buffered super vector coding (DBSVC) method [50]
uses a dense power SURF feature and the super vector coding algorithm to improve the
feature discriminability.
Although local features achieve a better performance than global features when
being tested against several generic 3D shape datasets [14, 26, 47, 49], their discrim-
inative power is restricted in the global scale. In particular, they may retrieve globally
irrelevant 3D shapes in high ranks. To illustrate this point, we show five retrieval results
by applying the DG1SIFT method to the SHREC12 3D shape dataset in Fig. 4.3 (a)-
(e). With the five query shapes in the leftmost column, the top 10 retrieved results are
presented in the first row of each subfigure. Obviously, errors in these retrieval results
are counter to human intuition. Motivated by this observation, we propose a more
robust 3D shape retrieval system which is called the irrelevance filtering and similarity
ranking (IF/SR) method. Its retrieved results are shown in the second row of each sub-
figure. All mistakes generated by DG1SIFT are corrected by our method. Clearly, the
proposed IF/SR system has a more robust performance as shown in these examples.
(a) Bicycle
(b) Round table
(c) Desk lamp
(d) Piano
(e) Home plant
Figure 4.3: Comparison of retrieved shapes using the DG1SIFT method (the first row)
and the proposed IF/SR method (the second row) against five query shapes (from top to
bottom): (a) bicycle, (b) round table, (c) desk lamp, (d) piano, and (e) home plant.
There are two main contributions of this work. First, we develop more powerful
and robust global features to compensate for the weaknesses of local features. Feature
concatenation is often adopted by traditional methods to combine local and global
features. However, proper feature weighting and dimension reduction remain a
problem. For the second contribution, we propose a robust shape retrieval system that
consists of two modules in cascade: the irrelevance filtering (IF) module and the sim-
ilarity ranking (SR) module. The IF module attempts to cluster gallery shapes that are
similar to each other by examining global and local features simultaneously. However,
shapes that are close in the local feature space can be distant in the global feature space,
and vice versa. To resolve this issue, we propose a joint cost function that strikes a bal-
ance between two distances. In particular, irrelevant samples that are close in the local
feature space but distant in the global feature space can be removed in this stage. The
remaining gallery samples are ranked in the SR module using the local feature.
The rest of this chapter is organized as follows. The proposed IF/SR method is
explained in Section 4.2. Experimental results are shown in Section 5.4. Finally, con-
cluding remarks are given in Section 5.5.
4.2 Proposed IF/SR Method
4.2.1 System Overview
The flow chart of the proposed IF/SR method is shown in Fig. 4.4. The IF module is
trained in an off-line process with the following three steps.
1. Initial label prediction. All gallery samples are assigned an initial label in their
local feature space using an unsupervised clustering method.
2. Local-to-global feature association. Samples close to each cluster centroid
are selected as the training data. A random forest classifier is trained based on
their global features. All gallery samples are re-predicted by the random forest
classifier to build an association from the local feature space to the global feature
Figure 4.4: The flow chart of the proposed IF/SR system.
space.
3. Label refinement. We assign every gallery sample a set of relevant cluster indices
based on a joint cost function. The joint cost function consists of two assignment
scores. One score reflects the relevant cluster distribution of the query sample
itself while the other is the mean of the relevant cluster distributions of its local
neighbors. The ultimate relevant cluster indices are obtained by thresholding the
cost function.
In the on-line query process, we extract both global and local features from a query
shape and proceed with the following two steps.
1. Relevance prediction. We adopt the same scheme as in the label refinement step to
assign relevant cluster indices to a given query.
2. Similarity ranking. The similarity between the query and all relevant gallery
samples is measured in the local feature space. In this step, a post-processing
technique can also be adopted to enhance retrieval accuracy.
An exemplary query, desk lamp, is given in Fig. 4.4 to illustrate the on-line retrieval
process. In the dashed box “retrieval without Stage I”, the traditional local feature
(DG1SIFT) retrieves erroneous shapes such as butterfly, desk phone and keyboard in
the top five ranks. They are apparently irrelevant to the query shape and successfully
removed in the relevance prediction step in the IF stage (Stage I). The retrieved top
five samples in the SR stage (Stage II) are all desk lamp shapes. We will explain the
processing of Stages I and II in detail below.
4.2.2 Stage I: Irrelevance Filtering
3D Shape Preprocessing. We have two preprocessing steps: 1) model representation
conversion and 2) 3D shape normalization. For model representation conversion, since
we extract features from both mesh models and volumetric models, we adopt the parity
count method [65] to convert a mesh model into a volumetric model. Each volumetric
model has a resolution of $256 \times 256 \times 256$. We show three voxelization results converted from
three original mesh models such as piano, truck and hand in Fig. 4.5. 3D shape nor-
malization aims to align shapes of the same class consistently to achieve translational,
scaling and rotational invariance.
Figure 4.5: Examples of the original mesh (left) and its voxelization result (right) in
each pair for three mesh models: piano, chair and hand.
Translational invariance is achieved by aligning the center of mass with the origin.
For scale invariance, we re-scale a shape to fit a unit sphere. For rotational invariance,
we adopt the reflective symmetry axial descriptor [37] to calculate the nearly complete
symmetry function for each 3D shape. We show the visualization of the symmetry axial
descriptor for four examples (chair, motorcycle, drum and insect) in Fig. 4.6. The PCA
on the symmetry function extracts three principal axes to form three principal planes.
To determine the order of the three principal planes, we project the shape onto each plane;
the projection views with the first and second largest areas are aligned with the
XOY plane and the ZOX plane, respectively. Finally, the YOZ plane is determined
automatically. Fig. 4.7 shows some normalization results using the above-mentioned
method.
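A minimal sketch of the translation and scale normalization is given below. The rotational alignment in our pipeline relies on the PCA of the reflective symmetry function of [37]; for brevity, the sketch substitutes a plain PCA on the vertex cloud, which is only a rough stand-in for that step.

import numpy as np

def normalize_shape(vertices: np.ndarray) -> np.ndarray:
    # vertices: (V, 3) mesh vertex coordinates.
    v = vertices - vertices.mean(axis=0)   # vertex mean approximates the center of mass
    v /= np.linalg.norm(v, axis=1).max()   # re-scale to fit the unit sphere
    # Rough rotational alignment via the principal axes of the vertex cloud
    # (the actual system applies PCA to the reflective symmetry descriptor).
    _, _, rot = np.linalg.svd(v, full_matrices=False)
    return v @ rot.T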
Global Features. To capture the global properties of a 3D shape, we describe it
using three feature types: 1) surface features ($f_s$), 2) wavelet features ($f_w$) and 3) geo-
metrical features ($f_g$).
The 3D surface features, denoted by $f_s$, are a generalization of the 2D polar Fourier
descriptor [108]. $N$ rays are emitted from the origin of a normalized 3D shape. Each
ray has the orientation $r = (\cos\theta\cos\phi, \cos\theta\sin\phi, \sin\theta)$ with two directional parameters
$(\theta, \phi)$, where $\theta$ and $\phi$ are uniformly sampled from the intervals $[0, \pi)$ and $[0, 2\pi)$, respec-
tively, with step size $\pi/6$. For each ray, the Euclidean distance from the origin to its inter-
sected point on a face forms a function $g(\theta, \phi)$. If a ray intersects multiple faces,
Figure 4.6: Examples of the visualized reflective symmetry descriptors generated
by [37] for four mesh models: chair, motorcycle, drum and insect. A point on the
surface farther from the origin indicates a larger symmetry value in the corresponding direction.
we consider the farthest one only. In this way, we convert the original surface function
$f(x, y, z)$ into a 2D distance function parameterized by $g(\theta, \phi)$. Then, we calculate the
Fourier coefficients of the 2D distance function. The magnitude information forms a
72-D feature vector denoted by $f_s$. The Fourier descriptors of four shapes belonging to
two classes are visualized in Fig. 4.8, where each subfigure contains the original shape on
the left and its surface feature on the right. We see intra-class consistency and inter-class
discrimination in this figure.
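A sketch of the surface feature computation is given below. It assumes a hypothetical helper ray_mesh_distance(theta, phi) that returns the farthest ray-face intersection distance; the ray casting itself depends on the mesh library used.

import numpy as np

def surface_feature(ray_mesh_distance, n_theta=6, n_phi=12):
    # Sample g(theta, phi) on a regular grid with step pi/6 and keep the
    # magnitudes of its 2D Fourier coefficients as the descriptor f_s.
    thetas = np.arange(n_theta) * np.pi / n_theta    # [0, pi)
    phis = np.arange(n_phi) * 2.0 * np.pi / n_phi    # [0, 2*pi)
    g = np.array([[ray_mesh_distance(t, p) for p in phis] for t in thetas])
    return np.abs(np.fft.fft2(g)).ravel()            # 6 x 12 = 72 values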
For the wavelet features, denoted by $f_w$, we adopt the generalized 3D Haar-like filters
[20]. Seven bands of 3D Haar-like filters, as shown in Fig. 4.9, are applied to a normalized
and voxelized model. The first three filters capture the left-right, top-bottom and front-back
symmetry properties. The last four filters analyze diagonal sub-regions. The responses
of these seven filters form a 7-D wavelet feature vector.
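A minimal sketch of the seven filter responses on a voxel grid is shown below. The exact sub-region layout of [20] may differ; the sign patterns here (three half-space symmetries plus four octant checkerboards) are an illustrative reading of the description above.

import numpy as np

def haar7(vox: np.ndarray) -> np.ndarray:
    # vox: (S, S, S) occupancy grid with even S; returns the 7-D vector f_w.
    s = vox.shape[0] // 2
    # Sum of occupied voxels in each of the eight octants.
    octant = np.array([[[vox[i*s:(i+1)*s, j*s:(j+1)*s, k*s:(k+1)*s].sum()
                         for k in (0, 1)] for j in (0, 1)] for i in (0, 1)])
    i, j, k = np.indices((2, 2, 2))
    return np.array([
        (octant * (-1.0) ** i).sum(),        # left-right symmetry
        (octant * (-1.0) ** j).sum(),        # top-bottom symmetry
        (octant * (-1.0) ** k).sum(),        # front-back symmetry
        (octant * (-1.0) ** (i + j)).sum(),  # four diagonal patterns
        (octant * (-1.0) ** (j + k)).sum(),
        (octant * (-1.0) ** (i + k)).sum(),
        (octant * (-1.0) ** (i + j + k)).sum(),
    ])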
Furthermore, we incorporate four geometrical features: 1) the aspect ratio, 2) the xyz-
variance, 3) the $\theta$-variance and 4) rectilinearity [52]. The aspect ratio is a 3D feature
(a) Cup
(b) Monoplane
(c) Non-flying insect
(d) Guitar
Figure 4.7: Shape normalization results of four 3D shape classes.
based on the three side lengths $l_x$, $l_y$, $l_z$ of the bounding box of a normalized shape. It is
expressed as
$$AR = \left[\frac{l_x}{l_x + l_y + l_z},\; \frac{l_y}{l_x + l_y + l_z},\; \frac{l_z}{l_x + l_y + l_z}\right]. \qquad (4.1)$$
Figure 4.8: Visualization of the surface features of four shapes, where (a) and (b) show
two house shapes while (c) and (d) show two truck shapes.
Figure 4.9: Illustration of the seven-band Haar filters.
Figure 4.10: The xyz-variance (black box), $\theta$-variance (red box) and rectilinearity
(blue box) values of six examples from three classes: apartment house, fish and cup.
The xyz-variance and the $\theta$-variance are adopted to examine the variance of cut-planes
of a normalized volumetric model. To measure the xyz-variance, we extract all cut-
planes orthogonal to the X-axis, the Y-axis and the Z-axis, respectively. The variances
of the three groups of cut-planes form a 3D feature. Similarly, the $\theta$-variance measures
the variance of groups of rotated cut-planes centered at the X-axis, the Y-axis and the
Z-axis, respectively. The robust rectilinearity measure from [52] is used to obtain the
rectilinearity feature. It calculates the ratio between the total surface area and the sum
of the projected triangle areas on the XOY, the ZOX and the YOZ planes. Finally, the
geometrical feature, denoted by $f_g$, is a 10-D feature vector. The geometrical features
of six examples are shown in Fig. 4.10 in boxes of black (xyz-variance), red ($\theta$-
variance) and blue (rectilinearity), respectively.
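A short sketch of the aspect ratio and the cut-plane variance features follows. Eq. (4.1) is implemented directly; for the variances, taking the occupied area of each cut-plane as the measured quantity is our assumption.

import numpy as np

def aspect_ratio(lx, ly, lz):
    # Eq. (4.1): bounding-box side lengths normalized by their sum (3-D).
    s = lx + ly + lz
    return np.array([lx / s, ly / s, lz / s])

def xyz_variance(vox):
    # Variance of cut-plane occupancy along each axis of the voxel grid (3-D).
    areas = [vox.sum(axis=(1, 2)),  # cut-planes orthogonal to the X-axis
             vox.sum(axis=(0, 2)),  # cut-planes orthogonal to the Y-axis
             vox.sum(axis=(0, 1))]  # cut-planes orthogonal to the Z-axis
    return np.array([a.var() for a in areas])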
Initial Label Prediction. In the traditional 3D shape retrieval formulation, none of the shapes
in the dataset are labeled. Under this extreme case, we select the spectral clustering
algorithm [64] to reveal the underlying relationship between gallery samples. The local
Figure 4.11: Several clustered SHREC12 shapes obtained by the spectral clustering method
with the DG1SIFT feature.
feature is strong at grouping locally similar shapes, but it is sensitive to local variances,
as discussed in Section 4.1. In contrast, the global feature is powerful at differentiating
globally dissimilar shapes but weak at finding locally similar ones. Thus, combining
the two at this early stage tends to cause confusion and lower the performance.
For this reason, we use the local feature only to perform clustering.
For the SHREC12 dataset, shapes in several clusters using the DG1SIFT feature
are shown in Fig. 4.11. Some clusters look reasonable while others do not. Actually,
any unsupervised clustering method will encounter two challenges. First, uncertainty
occurs near cluster boundaries so that samples near boundaries have a higher probability
of being wrongly clustered. Second, the total number of shape classes is unknown.
When the cluster number is larger than the class number in the database, the clustering
algorithm creates sub-classes or even mixed classes. We address the first challenge in the
local-to-global feature association step and the second challenge in the label refinement
step.
Local-to-Global Feature Association. We extract the $N_k$ samples closest to the cen-
troid of the $k$th cluster and assign them a cluster label. Clearly, samples sharing the
same cluster label are close to each other in the feature space. There is a trade-off in
choosing a proper value of $N_k$. A smaller $N_k$ guarantees higher clustering accuracy, but
fewer gallery samples will be assigned cluster labels. Empirically, we set the value of
$N_k$ to one half of the size of the $k$th cluster. Then, we convert the gallery samples from
cluster. Then, we convert the gallery samples from
the local feature space to a global feature space. We will correct clustering errors in
the global feature space at a later stage. Furthermore, samples that come from the same
class but are separated in the local feature space can be merged by their global features.
To build the association, labeled samples are used to train a random forest classifier [13]
with global features. Finally, all gallery shapes are treated as test samples. The ran-
dom forest classifier is used to predict the probability of each cluster type by voting. In
this way, samples clustered in the local feature space can be linked to multiple clusters
probabilistically due to the similarity in the global feature space.
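A minimal scikit-learn sketch of this association step is given below. The names are illustrative, and for simplicity the per-cluster centroid is computed in the global feature space, whereas the selection described above operates in the clustering (local) feature space.

import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.ensemble import RandomForestClassifier

def associate(local_affinity, global_feat, n_clusters):
    # 1) Cluster the gallery in the local feature space.
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity="precomputed").fit_predict(local_affinity)
    # 2) Train a random forest on the half of each cluster nearest its centroid.
    train_idx, train_lab = [], []
    for k in range(n_clusters):
        members = np.flatnonzero(labels == k)
        d = np.linalg.norm(global_feat[members] - global_feat[members].mean(axis=0), axis=1)
        nearest = members[np.argsort(d)[:max(1, len(members) // 2)]]  # N_k = half
        train_idx.extend(nearest)
        train_lab.extend([k] * len(nearest))
    rf = RandomForestClassifier(n_estimators=200).fit(global_feat[train_idx], train_lab)
    # 3) Softly re-predict every gallery sample in the global feature space.
    return rf.predict_proba(global_feat)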
Label Refinement. The output of the IF module includes: 1) a set of indexed clus-
ters, and 2) soft classification (or multi-labeling) of all gallery samples. For item #1, we
use unsupervised spectral clustering to generate clusters as described above. If the class
number is known (or can be estimated), it is desired that the cluster number is larger
than the class number. Each of these clusters is indexed by a cluster ID. For item #2,
we adopt soft classification so that each sample can be associated with multiple clusters.
This is done for two reasons. First, if two sub-classes belong to the same ground truth class,
we need a mechanism to re-group them together; clearly, a hard classification process
does not allow this to happen. Second, a hard classification error cannot be easily com-
pensated for, while a soft classification error is less severe and is likely to be fixed in the
SR module (Stage II).
We consider two relevant cluster assignment schemes below.
1) Direct Assignment
We apply the random forest classifier to both training and testing samples based on
their global features. Then, the probability that the $i$th shape sample (denoted by $y_i$)
Figure 4.12: Selecting relevant clusters for the query desk lamp in Fig. 4.3(c) by thresh-
olding the cost function shown in Eq. (4.4).
belonging to the $k$th cluster (denoted by $c_k$) can be estimated by the following normal-
ized voting result:
$$P_{rf}(y_i \in c_k) = \frac{v_k}{\sum_j v_j}, \qquad (4.2)$$
where $v_k$ is the number of votes claiming that $y_i$ belongs to $c_k$. Eq. (4.2) associates $y_i$ with
its relevant clusters directly.
2) Indirect Assignment
Intuitively, a good cluster relevance assignment scheme should take both global
and local features into account. For query sample $y_i$, we find its $K$ nearest neigh-
bors (denoted by $x_j$) using a certain distance function in a local feature space (e.g., the
same feature space used in DG1SIFT). Then, the probability of $y_i$ belonging to $c_k$ can
be estimated by the weighted sum of the probability in Eq. (4.2) in the form of
$$P_{knn}(y_i \in c_k) = \frac{\sum_{x_j \in knn(y_i)} P_{rf}(x_j \in c_k)}{\sum_{c_m} \sum_{x_j \in knn(y_i)} P_{rf}(x_j \in c_m)}. \qquad (4.3)$$
Eq. (4.3) associates $y_i$ with its relevant clusters indirectly. That is, the assignment is
obtained by averaging the relevant cluster assignments of its $K$ nearest neighbors.
Empirically, we choose $K$ to be 1.5 times the average cluster size in the experiments.
We show an example that assigns a query desk lamp shape to its relevant clusters in
Fig. 4.12(a), whose x-axis and y-axis are the negative log functions of Eqs. (4.2) and
(4.3), respectively. Every dot in Fig. 4.12(a) represents a cluster after shape clustering.
To visualize shapes represented by a dot, we plot a representative sample of each cluster
in Fig. 4.12(b).
We see that the distance between the hand cluster and the desk lamp cluster is small
along the x-axis but large along the y-axis. This is because samples of the desk lamp and
hand clusters are interleaved in the local feature space, as shown in the retrieval results
of DG1SIFT in Fig. 4.3(c). However, the desk lamp and hand clusters have little
intersection in the global feature space. In contrast, the wheel chair and desk lamp
clusters have a large intersection in the global feature space, yet their distance is large in
the local feature space. It is apparent that Eqs. (4.2) and (4.3) provide complementary
relevance assignment strategies for query sample $y_i$. It is best to integrate the two into
one assignment scheme. For example, we can draw a line to separate relevant and
irrelevant clusters with respect to the query desk lamp shape in this plot.
Mathematically, we can define the following cost function:
$$J(y_i, c_k) = -\log\big(P_{knn}(y_i \in c_k)\, P_{rf}(y_i \in c_k)\big) = -\big[\log P_{knn}(y_i \in c_k) + \log P_{rf}(y_i \in c_k)\big]. \qquad (4.4)$$
We compute $J(y_i, c_k)$ for all clusters $c_k$. If
$$J(y_i, c_k) \le \epsilon, \qquad (4.5)$$
where $\epsilon$ is a pre-selected threshold, we say that cluster $c_k$ is a relevant cluster for query
$y_i$. Otherwise, it is irrelevant.
4.2.3 Stage II: Similarity Ranking
In the SR module, we rank the similarity between a given query and gallery samples
in the retrieved relevant clusters using a local-features-based matching scheme (e.g.,
DG1SIFT). Additionally, we adopt the Local Constrained Diffusion Process (LCDP)
[103] in the post-processing step. The diffusion process is slightly modified to exploit the
availability of relevant clusters in the IF/SR system, since diffusion can then be
conducted on a more reasonable manifold thanks to the processing in Stage I.
4.3 Experimental Results
We demonstrate the retrieval performance of the proposed IF/SR method by conducting
experiments on the generic 3D shape dataset of SHREC12 [47]. It contains 1200 3D
shapes in 60 independent classes. Samples are uniformly distributed so that each class
has 20 shape samples. The retrieval performance is measured by five standard metrics:
Nearest-Neighbor (NN), First-Tier score (FT), Second-Tier score (ST), E-measure (E),
and Discounted Cumulative Gain (DCG).
We compare the proposed IF/SR method with five state-of-the-art methods:
LSD-sum [8]. It uses a local surface-based feature that considers local geodesic
distance distribution and Bag-of-Words.
ZFDR [48]. It adopts a hybrid feature that integrates the Zernike moment, the
Fourier descriptor and ray-based features.
3DSP L2 1000 chi2 [47]. It employs a local surface-based feature that computes
the 3D SURF descriptor under the spatial pyramid matching scheme.
DVD+DB+GMR [47]. It adopts a hybrid feature that contains a dense voxel spec-
trum descriptor and a depth-buffer shape descriptor.
DG1SIFT [67]. It uses a view-based feature that extracts three types of SIFT
features (Dense SIFT, Grid SIFT and One SIFT) per view.
The IF/SR method adopts DG1SIFT as the local feature for shape clustering. We
show the first-tier (FT) scores of the IF/SR method using different cluster numbers $M$
for shape clustering in Table 4.1. Generally speaking, the performance degrades when
$M$ is small due to the loss of discriminability caused by larger cluster sizes. The retrieval perfor-
mance improves as the cluster number increases up to 64. After that, the performance
saturates and could even drop slightly, which means that we lose the advantage of clus-
tering when the cluster size becomes too small. For the remaining experimental results, we
choose $M = 64$.
Table 4.1: Comparison of the First-Tier (FT) scores with different cluster numbers for
the IF/SR method in the SHREC12 dataset. The best score is shown in bold.
M 16 32 48 64 80 96 112
FT 0.666 0.672 0.709 0.720 0.717 0.717 0.715
Table 4.2: Comparison of the NN, FT, ST, E and DCG scores of five state-of-the-art
methods, the proposed IF/SR method, and the IF/SR method with LCDP postprocessing
for the SHREC12 dataset. The best score for each measurement is shown in bold.
Method NN FT ST E DCG
LSD-sum 0.517 0.232 0.327 0.224 0.565
ZFDR 0.818 0.491 0.621 0.442 0.776
3DSP L2 1000 chi2 0.662 0.367 0.496 0.346 0.678
DVD+DB+GMR 0.828 0.613 0.739 0.527 0.833
DG1SIFT 0.879 0.661 0.799 0.576 0.871
IF/SR 0.896 0.720 0.837 0.608 0.891
IF/SR+LCDP 0.893 0.734 0.858 0.620 0.899
Table 4.3: Comparison of top 20, 25, 30, 35, 40 retrieval accuracy for the SHREC12
dataset, where the best results are shown in bold.
N 20 25 30 35 40
LSD-sum 0.232 0.260 0.286 0.310 0.327
ZFDR 0.491 0.539 0.575 0.603 0.621
3DSP L2 1000 chi2 0.367 0.411 0.446 0.476 0.496
DVD+DB+GMR 0.613 0.656 0.691 0.719 0.739
DG1SIFT 0.661 0.718 0.756 0.783 0.799
IF/SR 0.720 0.775 0.802 0.824 0.837
IF/SR+LCDP 0.734 0.786 0.817 0.841 0.858
We compare the performance of seven 3D shape retrieval methods with five mea-
sures in Table 4.2. Clearly, the proposed IF/SR method (with or without LCDP post-
processing) outperforms the other five benchmarking methods. The IF/SR method with
postprocessing improves the result of DG1SIFT by around 7% in the First-Tier score.
(a) Door
(b) Bus
Figure 4.13: Comparison of the retrieved top 20 rank-ordered shapes. For each query case
given in the leftmost column, the retrieved results of DG1SIFT and the proposed IF/SR
method are shown in the first and second rows of all subfigures, respectively.
Since DG1SIFT adopts the manifold ranking process in its similarity measurement, the
gap between the IF/SR method before and after LCDP is relatively small.
Since each SHREC12 shape class contains 20 shape samples, the measure of cor-
rectly retrieved samples from the top 20 (FT) and 40 (ST) ranks cannot reflect the true
power of the proposed IF/SR method. To push the retrieval performance further, we
compare the accuracy of retrieved results from the top 20, 25, 30, 35 and 40 ranks of
the IF/SR method and five benchmarking methods in Table 4.3, whose first and last
(a) Non-wheel chair
(b) Guitar
(c) Bed
Figure 4.14: Comparison of the retrieved top 20 rank-ordered shapes. For each query case
given in the leftmost column, the retrieved results of DG1SIFT and the proposed IF/SR
method are shown in the first and second rows of all subfigures, respectively.
Figure 4.15: Comparison of precision and recall curves of the proposed IF/SR method
and several benchmarking methods for the SHREC12 dataset.
columns correspond to the FT and ST scores reported in Table 4.2. The superiority of the
IF/SR method stands out clearly in this table.
According to the top 20 retrieval performance, the IF/SR method still makes mis-
takes for some queries. We conduct error analysis and show the results of DG1SIFT
and the IF/SR method in Figs. 4.13(a)-(b) and Figs. 4.14(a)-(c). For each query case
given in the leftmost column, retrieved results of DG1SIFT and the IF/SR method are
shown in the first and second rows of all subfigures, respectively. Each erroneous result
is enclosed by a thick frame. The errors of DG1SIFT are obvious; they are far from
human visual experience. The IF/SR method makes mistakes between door/keyboard,
bus/truck, non-wheel chair/wheel chair, guitar/violin and bed/rectangle table (see the
second rows of all subfigures). These mistakes are more excusable since the confused
classes are closer to each other based on human judgment.
Finally, we show the precision-and-recall curves of the IF/SR method and several
methods in Fig. 4.15. We see from the figure that the IF/SR method outperforms all
other methods by a significant margin.
4.4 Conclusion
The IF/SR method was proposed to solve the unsupervised 3D shape retrieval prob-
lem. In the IF stage, irrelevant shape clusters are removed for each query shape. In the
SR stage, the system can focus on matching and ranking within a much smaller subset
of shapes. Its superior retrieval performance was demonstrated on the popular SHREC12
dataset.
Chapter 5
Design, Analysis and Application of A
Volumetric Convolutional Neural
Network
5.1 Introduction
3D shape classification [80] is an important yet challenging task that has arisen in recent years.
Large repositories of 3D models [85], such as Google SketchUp and Yobi3D, have been
built for many applications, including 3D printing, game design, mechanical manu-
facturing and medical analysis. To handle the increasing size of these repositories, an
accurate 3D shape classification system is in demand.
Quite a few hand-crafted features [47], [49], [89] were proposed to solve the 3D shape
classification problem in the past. Interesting properties of a 3D model are explored from
different representations such as views [18], [19], [68], volumetric data [38], [66], [88],
and mesh models [15], [16], [28], [84]. However, these features are not discriminative
enough to overcome large intra-class variation and strong inter-class similarity. In recent
years, solutions based on the convolutional neural network (CNN) [41], [45] have been
developed for numerous computer vision applications with significantly better perfor-
mance. As evidenced by recent publications [17], [77], [98], CNN solutions also
outperform traditional methods relying on hand-crafted features in the 3D shape classifi-
cation problem.
A CNN method classifies 3D shapes using either view-based [79], [87] or volume-
based input data [60], [75], [97]. A view-based CNN classifies 3D shapes by analyzing
multiple rendered views while a volume-based CNN conducts classification directly
in the 3D representation. Currently, the classification performance of the view-based
CNN is better than that of the volume-based CNN since the resolution of the volumetric
input data has to be much lower than that of the view-based input data due to higher
memory and computational requirements of the volumetric input. On the other hand,
since volume-based methods preserve the 3D structural information, they are expected to
have greater potential in the long run.
In this work, we attempt to improve volume-based CNN methods in two aspects and
call our solution the “Volumetric CNN” (VCNN) for short. The architecture of the
VCNN is similar to that of VoxelNet [60]. However, in contrast with the traditional
CNN design, which uses empirical rules to choose network parameters, we choose them for
the VCNN with theoretical justification. That is, we propose a feed-forward K-means
clustering algorithm to identify the optimal filter number and filter size systemati-
cally. As in many other real-world classification problems, there exist sets of
confusing classes in the 3D shape classification problem [24]. Two confusion sets are
shown in Fig. 5.1. We propose a method to determine whether two shape classes belong
to the same confusion set without conducting the test. Instead, we analyze the filter
weights that connect the last fully connected (FC) layer and the output layer. All filter
weights associated with a 3D shape output class define the “shape anchor vector” (SAV)
for this class. We show that two shape classes are confusing if the angle between their SAVs
is small and a particular class has a relatively wide feature distribution. To enhance the
classification performance within a confusion set, we propose a hierarchical clustering
method to split samples of the same set into multiple subsets automatically. Then, we
can reclassify samples in a subset using a random forest (RF) classifier.
(a) Desk and table
(b) Cup, flower pot and vase
Figure 5.1: Two sets of confusing classes: (a) desks (the first row) and tables (the second
row), and (b) cups (the first row), flower pots (second row) and vases (third row). The
confusion is attributed to the similar global appearance of these 3D shapes.
The rest of this chapter is organized as follows. Related work is reviewed in Sec. 5.2.
The proposed VCNN system is presented in Sec. 5.3. Experimental results are given
to demonstrate the superior performance of the VCNN method in Sec. 5.4. Finally,
concluding remarks are given in Sec. 5.5.
5.2 Related Work
There are two main approaches proposed to classify 3D shape models: the view-based
approach and the volume-based approach. They are reviewed below. The view-based
approach renders a 3D shape into multiple views as the representation. Classifying a 3D
shape becomes analyzing a bundle of views collectively. The Multi-View Convolutional
Neural Network (MVCNN) [87] method renders a 3D shape into 12 or 80 views. By
adding a view-pooling layer in the VGG network model [82], views of the input shape
are merged before the fully connected layers to identify salient regions and overcome
the rotational variance. The DeepPano [79] method constructs panoramic views for a
3D shape. A row-wise max-pooling layer is proposed to remove the shift variance. The
MVCNN-Sphere method [75] builds a resolution pyramid for each view using sphere
rendering based on the MVCNN structure. The classification performance is improved
by combining decisions from inputs of different resolutions. View-based methods can
preserve the high resolution of 3D shapes since they leverage the view projection and
lower the complexity from 3D to 2D. Furthermore, a view-based CNN can be fine-tuned
from a pretrained CNN that was trained by 2D images. However, the view-based CNN
has two potential shortcomings. First, the surface of a 3D shape can be affected by the
shading effect. Second, reconstructing the relationship among views is difficult since
the 3D information is lost after the view-pooling process.
The volume-based approach voxelizes a 3D mesh model for a 3D representation.
Several CNN networks have been proposed to classify 3D shapes directly. Exam-
ples include the 3D ShapeNet [97], the VoxelNet [60], and the SubVolume supervision
method [75]. Volume-based methods have two drawbacks. First, to control computa-
tional complexity, the resolution of a 3D voxel model is much lower than that of its
corresponding 2D view-based representation. As a result, high-frequency components
of the original 3D mesh are sacrificed. Second, since there are few pretrained CNN mod-
els on 3D data, volume-based networks have to be trained from scratch. Although a
volumetric representation preserves the 3D structural information of an object, the per-
formance of classifying 3D mesh models directly is still lower than that of classifying
the corresponding view-based 2D models.
5.3 Proposed VCNN Method
5.3.1 System Overview
An overview of the proposed VCNN solution is shown in Fig. 5.2. Before the supervised
network training phase, we first perform a feed-forward unsupervised clustering proce-
dure to determine the proper size and number of anchor vectors at each layer. Then, in
the training phase, the end-to-end backpropagation is conducted, and all filter weights in
the network are fine-tuned. The feature vector of the last layer of the VCNN, called the
the VCNN feature, is acquired after the training process. Afterwards, a confusion matrix
based on SA Vs is used to identify confusing sets categorized into pure sets and mixed
sets. Each pure set contains a single class of shapes. Each mixed set, which includes
multiple classes of samples, is split further into pure subsets and mixed subsets by a
proposed tree-structured hierarchical clustering algorithm. Each mixed subset is trained
by a random forest classifier. In the testing phase, a sample is assigned to a subset based
on its VCNN feature. If the sample is assigned to a pure subset, its label is immediately
output. Otherwise, it is assigned to a mixed subset and its label will be determined by
the random forest classifier associated with that subset.
Figure 5.2: The flow chart of the proposed system.
5.3.2 Shape Anchor Vectors (SA Vs)
One key tool adopted in our work is the RECOS (Rectified-COrrelations on a Sphere)
model for CNNs proposed in [43]. The weights of a filter at an intermediate layer
Figure 5.3: Illustration of anchor vectors at the last stage of a CNN, where each anchor
vector points to a 3D shape class. They are called the shape anchor vectors (SAVs). In
this example, one SAV points to the airplane class while another SAV points to the cone
class.
are interpreted as a cluster centroid of the corresponding inputs. Thus, these weights
define an “anchor vector”. In the RECOS model, the convolution and nonlinear activation
operations at one layer are viewed as a projection onto a set of anchor vectors followed by
rectification. The number of anchor vectors is related to the approximation capability
of a certain layer. The larger the number of anchor vectors, the better the approximation
capability, yet the higher the computational complexity. We will find a way to balance
approximation accuracy and computational complexity in Sec. 5.3.3.
The anchor vectors at the last stage of a CNN connect the last FC layer to the output.
They have a special physical meaning, as shown in Fig. 5.3, where each anchor vec-
tor points to a 3D shape class. Thus, they are called the shape anchor vectors (SAVs).
In this example, one SAV points to the airplane class while another SAV points to the
cone class. There are four blue dots and four red crosses along the surface of a high-
dimensional sphere. They represent the feature vectors of 3D shape samples from the
cone and airplane classes, respectively. The classification of each sample to a par-
ticular class is determined by the shortest geodesic distance between the sample and the
tip of the SAV of the selected class (or, equivalently, the maximum correlation between
the sample vector and the SAV). Thus, sample $y_i$ is classified to the cone class as shown
in this figure.
The relationship between two SAVs and the sample distribution of a particular class
plays a critical role in determining whether two classes are confusing or not. If the angle
between two SAVs is small and the samples of a particular class are distributed over a
wide range, we will get confusing classes. This will be elaborated in Sec. 5.3.4.
5.3.3 Network Parameters Selection
The determination of network parameters such as the filter size and number per layer
is often conducted in an ad hoc manner in the current literature. Here, we propose a
systematic method to decide these parameters based on a feed-forward unsupervised
splitting approach. It consists of two steps: 1) representative sample selection, and 2)
representative sample clustering.
Problem Formulation. We can formulate the network design problem as follows.
The input to the $j$th RECOS (or the $j$th convolutional layer) has the following dimension:
$$M_{j-1} \times N_{j-1} \times D_{j-1} \times K_{j-1},$$
where $M_{j-1} \times N_{j-1} \times D_{j-1}$ are the spatial dimensions while $K_{j-1}$ represents the spectral
dimension (or the number of anchor vectors) of the previous layer. Its output has the
following dimension:
$$M_j \times N_j \times D_j \times K_j.$$
Since $2 \times 2 \times 2$ to $1 \times 1 \times 1$ maximum pooling is adopted, we have
$$M_j = 0.5\, M_{j-1}, \quad N_j = 0.5\, N_{j-1}, \quad D_j = 0.5\, D_{j-1}.$$
Furthermore, the filter from the input to the output has a dimension of
$$m_j \times n_j \times d_j \times K_j.$$
We set
$$m_j = n_j = d_j,$$
because the input volumetric data is of cubic shape with the same resolution in
all three spatial dimensions. Note that $(M_0, N_0, D_0, K_0) = (30, 30, 30, 1)$ is the dimension of
the input 3D shape. Given $(M_{j-1}, N_{j-1}, D_{j-1}, K_{j-1})$, the question is how to determine
the filter size parameter, $m_j$, and the filter number, $K_j$.
Representative sample selection. The dimension of the input vector is $m_j \times n_j \times d_j \times K_{j-1} = (m_j)^3 K_{j-1}$. Let $Y_{j-1}$ denote the set of all training input samples.
However, these input samples are not equally important; some samples have less dis-
criminant power than others. A good sample selection algorithm can guide the network
to learn a more meaningful way to partition the feature space and to select more discriminant
SAVs for better decisions.
We first normalize all patterns $y_{j-1} \in Y_{j-1}$ to have unit length. Then, the set
$P_{j-1}$ of all normalized inputs can be written as
$$P_{j-1} = \{\, p_{j-1} \;|\; p_{j-1} = y_{j-1} / \|y_{j-1}\|,\; y_{j-1} \in Y_{j-1} \,\}.$$
Next, we adopt the saliency principle in representative pattern selection. That is, we
compute the variance of the elements in $p_{j-1}$ and choose those with a larger variance value. If
the variance of an input is small, it has a flat element-wise distribution and, thus, a
low discriminant power.
We propose a two-stage screening process. In the first stage, a small threshold is
adopted to remove inputs with very small variance values; that is, if the variance of an
input is below this threshold, it is removed from the candidate sample set. In the second stage, we
select the top $T\%$ of samples with the largest variance values in the filtered candidate sample
set to keep all remaining candidates sufficiently salient.
After the above two-stage screening process, we have a smaller candidate sample set
that contains salient samples. If we want to reduce the size of the candidate set further
to simplify the subsequent computations, a uniform random sampling strategy can
be adopted.
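A sketch of the two-stage screening is given below; the threshold names var_min and top_percent are ours, standing in for the unnamed small-variance threshold and for T, respectively.

import numpy as np

def screen_samples(patches: np.ndarray, var_min=1e-4, top_percent=20.0):
    # patches: (N, C_j) matrix of flattened filter inputs.
    unit = patches / (np.linalg.norm(patches, axis=1, keepdims=True) + 1e-12)  # P_{j-1}
    var = unit.var(axis=1)
    keep = np.flatnonzero(var >= var_min)        # stage 1: drop near-flat inputs
    order = keep[np.argsort(var[keep])[::-1]]    # sort the survivors by saliency
    n_top = max(1, int(len(order) * top_percent / 100.0))
    return patches[order[:n_top]]                # stage 2: keep the top T%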
Representative sample clustering. Our objective is to find $K_j$ anchor vectors from
the candidate set of representative samples obtained in the above step. Here, we adopt
an unsupervised clustering algorithm such as the K-means algorithm to group them. The
cluster centroids are then set to the desired anchor vectors for the current RECOS model.
To choose the optimal parameters, $m_j$ and $K_j$, we can maximize the inter-cluster
margin and minimize the intra-cluster variance by adopting the Bayesian Information
Criterion (BIC) function [71]. The approximation of the BIC function in the K-means
algorithm can be expressed as
$$BIC(m_j, K_j, P_{j-1}) = -2 \sum_{k_j=1}^{K_j} L_{k_j} + 2 K_j C_j \log N, \qquad (5.1)$$
where $C_j = (m_j)^3 K_{j-1}$ is the dimension of the input vectors, $N$ is the total number of
elements in $P_{j-1}$, and $L$ is the log-likelihood distance. Under the normal distribution
model, we have
$$L_{k_j} = -\frac{N_{k_j}}{2} \sum_{c_j=1}^{C_j} \log\big(\sigma_{c_j}^2 + \sigma_{c_j k_j}^2\big), \qquad (5.2)$$
where $N_{k_j}$ is the number of samples in the $k_j$th cluster, $\sigma_{c_j}^2$ is the variance of the $c_j$th
feature variable over all samples, and $\sigma_{c_j k_j}^2$ is the variance of the $c_j$th feature variable
in the $k_j$th cluster. By fixing parameter $C_j$, the optimal value of $K_j$ is determined by
detecting the valley point of the BIC function.
The relationship between two consecutive RECOS models is worthy of discussion. The
problem of selecting optimal parameter pairs, $(m_j, K_j)$, across multiple layers is actu-
ally inter-dependent. Here, we adopt a greedy algorithm to predict these parameters in a
feed-forward manner. The centroids obtained after each K-means clustering serve as the anchor
vectors of the corresponding RECOS model, and they can be used to generate input samples for the
next RECOS model.
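A sketch of the BIC-guided selection of $K_j$, following Eqs. (5.1) and (5.2), is given below; the candidate range for $K_j$ is illustrative.

import numpy as np
from sklearn.cluster import KMeans

def bic_kmeans(P, K):
    # Eqs. (5.1)-(5.2): BIC of a K-means partition of P (N x C_j).
    n, c = P.shape
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(P)
    sigma2 = P.var(axis=0)  # per-feature variance over all samples
    log_l = 0.0
    for k in range(K):
        Pk = P[labels == k]
        log_l += -0.5 * len(Pk) * np.sum(np.log(sigma2 + Pk.var(axis=0) + 1e-12))
    return -2.0 * log_l + 2.0 * K * c * np.log(n)

def pick_K(P, candidates=range(2, 65, 2)):
    scores = [bic_kmeans(P, K) for K in candidates]
    return list(candidates)[int(np.argmin(scores))]  # the valley point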
5.3.4 Confusion Sets Identification and Re-Classification
In principle, the network parameter selection algorithm presented in the last subsection
can be used to decide optimal network parameters for the RECOS models in the leading
layers. As the network goes to the end, we need to consider another factor. That is, the
110
number of SA Vs in the last layer has to be equal to the number of classes due to the
supervised learning framework. The discriminant power of the feature space before the
output layer can still be limited and, as a result, SA Vs in the last stage may not be able to
separate 3D shapes of different classes completely. Then, shapes from different classes
can be mixed, leading to confusion sets. In this subsection, we will develop a method
to identify these confusion sets. In particular, we would like to address the following
two questions: 1) how to split 3D shapes of the same class into multiple sub-classes
according to their VCNN features? and 2) what is the relationship between SA Vs and
visually similar shapes?
Generation of Sub-classes. For the first question, we adopt a tree-structured hierarchical clustering algorithm to split a class into sub-classes, where the class can be either the ground-truth or the predicted one. The algorithm initially splits the feature space into two clusters according to the Euclidean distance by using the K-means algorithm with $K = 2$ [33]. Then, for each cluster, the variance is calculated to determine its tightness. A cluster with a large variance is split further. The splitting process terminates when one of the following two criteria is met: 1) the cluster variance is sufficiently small (below a pre-selected variance threshold), or 2) the cluster size is sufficiently small (below a pre-selected size threshold). The number of sub-classes under each class is then determined automatically by the two pre-selected threshold values.
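A minimal sketch of this tree-structured (bisecting) clustering is given below; the two threshold values are placeholders for the pre-selected variance and size thresholds, and the function name is illustrative.

    import numpy as np
    from sklearn.cluster import KMeans

    def split_into_subclasses(features, var_thresh=0.1, size_thresh=20):
        """Recursively bisect a class until each cluster is tight or small enough."""
        queue, subclasses = [np.arange(len(features))], []
        while queue:
            idx = queue.pop()
            cluster = features[idx]
            tightness = cluster.var(axis=0).sum()    # total variance as tightness
            if tightness < var_thresh or len(idx) < size_thresh:
                subclasses.append(idx)               # stop: tight or small enough
                continue
            labels = KMeans(n_clusters=2, n_init=5).fit_predict(cluster)
            queue.append(idx[labels == 0])           # split the loose cluster
            queue.append(idx[labels == 1])
        return subclasses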
We examine the power of the VCNN features and the tree-structured hierarchical clustering algorithm using the ground-truth labels. Results for three classes, namely chair, mantel and sink, are shown in Fig. 5.4. We see that the VCNN features can automatically generate sub-classes that contain visually similar shapes. For example, the chair class can be divided into four sub-classes: round chairs, chairs with four feet, wheel chairs and long chairs. The mantel class can be further partitioned into three sub-classes: thin mantels, thick mantels and hollow mantels. The sink class can be clustered into four sub-classes: deep sinks, flat sinks, sinks with a hose and table-like sinks. This shows the discriminant power of the VCNN features, which are highly optimized by the backpropagation of the CNN, and the power of the clustering algorithm in identifying intra-class differences.

Figure 5.4: Illustration of 3D shapes in sub-classes obtained from (a) the Chair class, (b) the Mantel class and (c) the Sink class. We provide 16 representative shapes for each sub-class and encircle them with a blue box.
Confusion Matrix Analysis and Confusion Set Identification. After running samples through the CNN, we obtain predicted labels (instead of ground-truth labels). Then, samples of globally similar appearance but from different classes can be mixed in the VCNN feature space. This explains the source of confusion. To resolve confusion among multiple confusing classes, we adopt a merge-and-split strategy. That is, we first merge several confusing classes into one confusion set and then split the confusion set into multiple subsets as done before.
We denote the directional confusion score of the $i$-th sample, $y_i$, with respect to the $k$-th class, $c_k$, as $s(y_i, c_k)$. Theoretically, it is reciprocal to the projection distance between $y_i$ and the SAV of $c_k$. By normalizing the projection distances from $y_i$ to the SAVs of all classes using the softmax function, the directional confusion score $s(y_i, c_k)$ is equivalent to the soft decision score. The confusion factor (CF) between two classes $c_k$ and $c_l$ is determined by the average of the two directional confusion scores as

\[ \mathrm{CF}(c_k, c_l) = \frac{1}{2N_k} \sum_{y_i \in c_k} s(y_i, c_l) + \frac{1}{2N_l} \sum_{y_j \in c_l} s(y_j, c_k), \tag{5.3} \]

where $N_k$ and $N_l$ are the numbers of samples in $c_k$ and $c_l$, respectively. This CF value can be computed using training samples.
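As a hedged sketch, Eq. (5.3) reduces to averaging the softmax scores over the two directions, and the resulting CF matrix can feed the spectral clustering step described in the next paragraph. The function and argument names here are illustrative assumptions, and scikit-learn's spectral clustering is assumed.

    import numpy as np
    from sklearn.cluster import SpectralClustering

    def confusion_sets(softmax_scores, labels, num_classes, num_sets):
        """Build the CF affinity matrix (Eq. 5.3) and group classes into sets."""
        cf = np.zeros((num_classes, num_classes))
        for k in range(num_classes):
            for l in range(num_classes):
                if k == l:
                    continue
                s_kl = softmax_scores[labels == k, l].mean()  # mean s(y_i, c_l), y_i in c_k
                s_lk = softmax_scores[labels == l, k].mean()  # mean s(y_j, c_k), y_j in c_l
                cf[k, l] = 0.5 * (s_kl + s_lk)                # Eq. (5.3)
        # The symmetric CF matrix serves as the affinity for spectral clustering.
        return SpectralClustering(n_clusters=num_sets,
                                  affinity='precomputed').fit_predict(cf)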
Since the confusion matrix defines an affinity matrix, we adopt the spectral clustering algorithm [64] to cluster samples into multiple confusion sets that have strong confusion factors. After the spectral clustering step, we obtain either pure or mixed sets. A pure set contains 3D shapes from the same class. Fig. 5.5 shows the results of both mixed and pure sets obtained by the confusion matrix analysis. Each confusion set is enclosed by a blue box. Inside a confusion set, each class is represented by two instances, and two different classes are separated by a green bar. Some pure sets are generated because they are isolated from other classes in the clustering process. It is worthwhile to emphasize that each mixed set contains 3D shapes of similar appearance yet under different class labels. For example, the mixed set in the first row contains seven classes: bookshelf, wardrobe, night stand, radio, xbox, dresser and tv stand. All of them are cuboid-like.

Figure 5.5: A mixed or pure set is enclosed by a blue box. Each mixed set contains multiple shape classes which are separated by green vertical bars. Two representative 3D shapes are shown for each class. Each row has one mixed set and several pure sets. The mixed set in the first row contains bookshelf, wardrobe, night stand, radio, xbox, dresser and tv stand; that in the second row contains cup, flower pot, lamp, plant and vase; that in the third row contains bench, desk and table; that in the fourth row contains chair and stool; that in the fifth row contains curtain and door; and that in the sixth row contains mantel and sink.
In the testing phase, if a test sample is assigned to a pure set, the class label of that set is output as the desired answer. A mixed set contains 3D shapes from multiple classes, so further processing is needed to separate them. This is discussed below.
Confusion Set Re-Classification. We split a confusion set into multiple subsets using the tree-structured hierarchical clustering algorithm. Some of the subsets contain samples from the same class while others still contain samples from multiple classes; they form pure subsets and mixed subsets, respectively. We show the splitting results of two confusion sets in Fig. 5.6 and Fig. 5.7. We see from the two figures that shapes in pure subsets are distinct from those in other subsets, while shapes in the mixed subsets share similar appearances.
Figure 5.6: The split of the confusion set of bench, desk and table yields three pure subsets, (a) bench, (b) desk and (c) table, and three mixed subsets, (d) bench and table, (e) desk and table and (f) bench, desk and table.
Figure 5.7: The split of the confusion set of cup, flower pot, lamp, plant and vase yields three pure subsets, (a) lamp, (b) plant and (c) cup, and three mixed subsets, (d) flower pot and vase, (e) flower pot and plant and (f) cup, flower pot, lamp and vase.
Table 5.1: Comparison of network parameters of the proposed VCNN and the VoxelNet, where the numbers in each cell follow the format $(m_j)^3 \times K_{j-1}$:

                Conv1      Conv2       FC          Output
    VoxelNet    5^3 × 1    3^3 × 32    6^3 × 32    1 × 1 × 128
    Our VCNN    3^3 × 1    3^3 × 256   6^3 × 128   1 × 1 × 1024
To deal with the challenging cases in each mixed subset, we need to train a classifier that is more effective than the softmax classifier. Furthermore, we have to avoid the potential overfitting problem due to the very limited number of training samples. For these two reasons, we choose the random forest classifier [13]. Classes are weighted to overcome the unbalanced sample distribution within each mixed subset. In the testing phase, a sample is first assigned to a subset based on its VCNN feature using the nearest-neighbor rule. If it is assigned to a pure subset, the class label is output as the predicted label. Otherwise, we run the random forest classifier trained for that particular mixed subset to make the final class decision.
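The overall testing-phase routing described above can be summarized in the following sketch. All container names (sets, subset_centroids, subset_labels, forests) are illustrative assumptions for structures built during training; pure subsets are assumed to be marked with their single class label and mixed subsets with None.

    import numpy as np

    def classify(feature, softmax_scores, sets, subset_centroids,
                 subset_labels, forests):
        """Route a test sample: confusion set -> nearest subset -> label or forest."""
        set_id = sets[int(np.argmax(softmax_scores))]      # set of the top class
        centroids = subset_centroids[set_id]               # subsets of that set
        subset_id = int(np.argmin(
            np.linalg.norm(centroids - feature, axis=1)))  # nearest-neighbor rule
        label = subset_labels[set_id][subset_id]
        if label is not None:                              # pure subset: direct answer
            return label
        rf = forests[(set_id, subset_id)]                  # mixed subset: random forest
        return int(rf.predict(feature[None, :])[0])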
5.4 Experimental Results
We test the network parameter selection algorithm and the confusion set re-classification algorithm on the popular ModelNet40 dataset [97]. The ModelNet40 dataset contains 9,843 training samples and 2,468 testing samples categorized into 40 classes. Each original mesh model is voxelized into a 30 × 30 × 30 grid before training and testing. The basic structure of the proposed VCNN is the same as that of the VoxelNet [60]: two convolutional layers followed by one fully connected layer. However, we allow different filter sizes and filter numbers at these layers.
Results of Network Parameter Selection Alone. We compare the BIC scores for three spatial filter sizes, $m_j = 3, 5, 7$, in the two convolutional layers. The filter number is chosen from $K_j = 32, 64, 128, 256, 512, 768, 1024$. For the fully connected layer, the filter number is chosen from $K_f = 128, 256, 512, 1024, 2048$. We show the BIC scores obtained by adjusting $m_j$ and $K_j$ at the first convolutional layer in Fig. 5.8. To keep the BIC scores at the same scale across different filter sizes and training sample numbers, we normalize the BIC function in Eq. (5.1) by $(m_j)^3$ and keep $N$ independent of $m_j$. Three valley points can be easily detected, at $(m_1, K_1) = (3, 256)$, $(5, 512)$ and $(7, 512)$. This result is reasonable since we need more filters to represent larger spatial patterns. Comparing the three curves, the choice $(m_1, K_1) = (3, 256)$ gives the lowest BIC score. The network parameters in all layers of the proposed VCNN, obtained with the greedy search algorithm introduced in Section 5.3.3, are summarized in Table 5.1. We also include the network parameters of the VoxelNet in the table for comparison.

Figure 5.8: Three BIC curves measured under three different filter sizes $m_j = 3, 5, 7$ and seven different filter numbers $K_j = 32, 64, 128, 256, 512, 768, 1024$ for the first convolutional layer.

Table 5.2: Comparison of ACA and AIA scores of several state-of-the-art methods for the ModelNet40 dataset.

    Volume-based methods    ACA       AIA
    3DShapeNets [97]        77.30%    -
    VoxelNet [60]           83.01%    87.40%
    3D-GAN [96]             83.30%    -
    SubVolume [75]          86.00%    89.20%
    AniProbing [75]         85.60%    89.90%
    VCNN w/o ReC            85.66%    89.34%
    VCNN w/ ReC             86.23%    89.78%

    View-based methods      ACA       AIA
    DeepPano [79]           77.63%    -
    GIFT [6]                83.10%    -
    MVCNN [87]              90.10%    -
    Pairwise [35]           90.70%    -
    FusionNet [34]          90.80%    -
To compare the performance of the proposed VCNN with other benchmarking methods, we use two performance measures: the average class accuracy (ACA) and the average instance accuracy (AIA). The ACA score averages the prediction accuracy over classes, while the AIA score averages the prediction accuracy over testing samples.
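For clarity, the two measures can be written compactly as follows; this is a straightforward restatement, with pred and truth denoting the predicted and ground-truth label arrays.

    import numpy as np

    def aia(pred, truth):
        """Average instance accuracy: mean over all testing samples."""
        return np.mean(pred == truth)

    def aca(pred, truth):
        """Average class accuracy: mean of the per-class accuracies."""
        classes = np.unique(truth)
        return np.mean([np.mean(pred[truth == c] == c) for c in classes])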
The results are shown in Table 5.2, where methods are classified into two categories: view-based and volume-based. For the proposed VCNN, we consider two cases: with and without confusion set re-classification. We first focus on the contribution of the network parameter selection algorithm without the confusion set re-classification (w/o ReC) module. The VCNN w/o ReC outperforms the VoxelNet by 2.65% and 1.94% in ACA and AIA, respectively. The gap between volume-based and view-based methods is narrowed. Although the ACA performance of the VCNN w/o ReC is lower than that of the SubVolume method by 0.34% and its AIA performance is lower than that of the AniProbing method by 0.56%, these two methods use more complex network structures such as multi-orientation pooling and subvolume supervision.

Figure 5.9: Comparison of accuracy per class between the VoxelNet method (blue bars) and our VCNN design (yellow bars).
It is worthwhile to examine the classification accuracy for each individual class. We compare the accuracy of the proposed VCNN with that of the VoxelNet in Fig. 5.9. By leveraging the network parameter selection, the performance of most classes is boosted with the VCNN. Furthermore, the classes with low classification accuracy belong to some of the confusion sets. The observed results are consistent with those shown in Fig. 5.5.
Results of Confusion Set Re-Classification. As shown in Table 5.2, the classification accuracy can be boosted further by the proposed re-classification algorithm. The improvements in ACA and AIA are 0.57% and 0.44%, respectively. Since the error cases in mixed subsets are difficult ones, such performance gains are meaningful. We observe two cases in the improvement, where wrongly predicted samples are corrected. First, a sample is assigned to a correct pure subset so that its prediction is corrected directly. Examples are given in Figs. 5.10 (a) and (b), where a desk shape and a lamp shape are correctly assigned to one of their pure subsets, respectively. Second, a sample is assigned to a mixed subset and the random forest classifier helps correct the prediction result. Examples are given in Figs. 5.10 (c) and (d). A chair shape is assigned to a mixed subset containing chairs and stools in Fig. 5.10 (c), and a vase shape is assigned to a mixed subset containing flower pots and vases in Fig. 5.10 (d). Both are eventually classified correctly by the random forest classifier. The power of the confusion set identification and re-classification procedure is demonstrated by these examples.

Figure 5.10: Four examples of corrected errors: (a) desk, (b) lamp, (c) chair, (d) vase. Each example has a testing sample on the top and its assigned subset on the bottom.
Although errors still exist, they can be clearly analyzed based on our clustering results. The uncorrected errors come from strong feature similarity. In Figs. 5.11 (a) and (b), a cup shape and a desk shape are wrongly assigned to a mixed subset of vases and flower pots and a mixed subset of chairs and stools, respectively. In Fig. 5.11 (c), although a dresser shape is assigned to a mixed subset containing dressers, night stands, radios, tv stands and wardrobes, the VCNN features are not distinctive enough to produce a correct classification. Similarly, in Fig. 5.11 (d), a plant shape cannot be differentiated from the flower pots in its mixed subset. These mistakes are due to the high visual similarity of these shapes to shapes in other classes.

Figure 5.11: Four examples of uncorrected errors: (a) cup, (b) desk, (c) dresser, (d) plant. Each example includes a testing sample on the top and the assigned subset on the bottom.
5.5 Conclusion
The design, analysis and application of a volumetric convolutional neural network (VCNN) were presented. We proposed a feed-forward K-means clustering algorithm to determine the filter number and size at each layer. The cause of confusion sets was identified. Furthermore, a hierarchical clustering method followed by a random forest classification method was proposed to boost the classification performance among confusing classes. Finally, experiments were conducted on the popular ModelNet40 dataset. The proposed VCNN offers state-of-the-art performance among all volume-based CNN methods.
Chapter 6
Summary and Future Work
6.1 Summary of the Research
2D Shape Retrieval. We proposed a robust two-stage shape retrieval (TSR) system to address the unsupervised 2D shape retrieval problem. We conducted a thorough review of several state-of-the-art methods, including global and local shape features and diffusion methods. We discovered that the locality of current methods deteriorates the retrieval performance, making them vulnerable to retrieving globally irrelevant samples at high ranks. Recent solutions, such as concatenating a global feature and a local feature, have limitations due to their more complicated feature space. In our solution, we divided the traditional retrieval system into two stages: irrelevant cluster filtering (ICF) and local matching and ranking (LMR). The ICF stage guaranteed that globally dissimilar gallery samples are efficiently removed before their similarities to a query are examined locally. To achieve this goal, we used an unsupervised clustering method to explore the underlying pattern in the local feature space. Then, we designed more robust global features to reassign a prediction score to each gallery sample in the dataset. Next, a joint cost function was proposed to decide the relevance of a gallery sample by considering both the local and global features. Benefiting from the ICF stage, the LMR stage, which finally retrieves the ranked samples from the local feature space, gained more relevant samples at higher ranks, and thus the retrieval performance was improved. We conducted thorough experiments on several state-of-the-art datasets: the MPEG-7, Kimia99 and Tari1000 2D shape datasets. Our retrieval performance showed a significant improvement over the state-of-the-art methods on every dataset, demonstrating the superiority of our TSR system. Moreover, we conducted an error analysis by comparing the retrieval results of our method and the local-features-based methods. The retrieval results indicated that our TSR system retrieved samples that are more consistent with human visual perception.
3D Shape Retrieval. Inspired by the TSR system, we examined the weaknesses of features designed for the 3D shape retrieval problem. Global features for 3D shapes lose discriminative power due to lost details, while local features lack powerful global measurements. Consequently, although local features are more powerful, they may retrieve globally irrelevant 3D shapes at high ranks. Motivated by this observation, we proposed a more robust 3D shape retrieval system called the irrelevance filtering and similarity ranking (IF/SR) method. We developed more powerful and robust global features for 3D shapes to compensate for the weaknesses of local features. The IF module attempts to cluster gallery shapes that are similar to each other by examining global and local features simultaneously. A joint cost function was designed to strike a balance between the global and local features. In particular, irrelevant samples that are close in the local feature space but distant in the global feature space were removed in the IF module. The remaining gallery samples were ranked in the SR module using the local feature. The superior retrieval performance of our system was evaluated on the popular SHREC12 dataset using different measurements.
3D Shape Classification. A thorough survey of convolutional neural network methods for supervised 3D shape classification was conducted. View-based and volume-based methods were compared and analyzed. We identified two important weaknesses in recent volume-based CNN models. First, the traditional CNN design uses empirical rules to choose network parameters such as the filter size and number. Second, as in many other real-world classification problems, there exist sets of confusing classes in the 3D shape classification problem, and simple fully connected layers are not powerful enough to separate them. We attempted to solve these problems by designing a volumetric CNN (VCNN) method. We proposed a feed-forward K-means clustering algorithm to identify the optimal filter number and filter size systematically. After the supervised training, we proposed a method to determine whether two shape classes belong to the same confusion set without conducting the test. To enhance the classification performance within a confusion set, we proposed a hierarchical clustering method to split the samples of the same set into multiple subsets automatically. Finally, we re-classified the samples in a subset using a random forest classifier. Experiments were conducted on the popular ModelNet40 dataset. The proposed VCNN offered state-of-the-art performance among all volume-based CNN methods.
6.2 Future Work
We are pursuing the following directions to improve the solutions proposed in this dissertation:
6.2.1 2D/3D Shape Retrieval
Our current global features for 3D shapes are too coarse to discriminate details of the surfaces and structures of a 3D shape. There are two possible improvements. First, more details can be preserved in the global features. For example, the two confusing classes shown in Fig. 6.1 (a) and (b) can be separated by using local surface descriptors. However, in order to analyze a surface locally, robust 3D mesh repair algorithms are necessary to deal with the defective manifolds existing in most popular generic 3D shape datasets. This is still an open problem in the computer graphics field. Second, we plan to continue our global feature design by considering geon-based model representation. By decomposing shapes as the first step, we can compare two shapes by global structure matching and partial segment matching. For example, in Fig. 6.1 (c) and (d), a mug and a drum can be differentiated by the handle, and a non-wheel chair and a wheel chair can be separated by detecting the shapes of their feet.

Figure 6.1: Examples of existing confusing pairs: (a) door and keyboard, (b) deskphone and monitor, (c) mug and drum, (d) non-wheel chair and wheel chair.
Both our TSR and IF/SR methods begin with an unsupervised clustering algorithm applied to a chosen local feature space. If some classes are severely mixed in the local feature space, the unsupervised clustering method cannot separate them properly. As a result, these erroneous samples fail to be removed in the first stage because they are not separated by the clustering algorithm. A possible solution is to introduce more local features. Multiple features can compensate for each other and produce a more discriminative feature space for the initial clustering algorithm.
Currently, our TSR and IF/SR methods depend on two empirically chosen parameters: the cluster number for the initial clustering and the threshold for the joint cost function. Although the results are not sensitive to these parameters, rough ranges for them need to be determined before testing. We can improve our system by choosing these two parameters adaptively. A hierarchical clustering algorithm is expected to help choose the cluster number automatically. Furthermore, we plan to choose the threshold for the joint cost function by exploring the distribution of samples in the feature space of prediction score vectors.
There exist larger datasets for the generic 3D shape retrieval problem, such as the SHREC14 generic 3D shape dataset, which consists of more than 8,000 3D models. There are two main challenges in handling larger datasets. First, the complexity of our algorithm needs to be reduced; the heavy computational load of our global feature extraction and unsupervised clustering stages should be decreased by a better design. Second, more robust global features are needed to deal with more severe shape variations.
6.2.2 3D Shape Classification
The filter number and filter size are determined by examining BIC scores under different configurations. However, there are two drawbacks. First, since the number of training patches is large, it is time-consuming to perform K-means multiple times on a large training pool. Second, the K-means algorithm minimizes the intra-cluster variation and maximizes the inter-cluster margin globally, so the local properties of each cluster are missed. To address these two problems, a hierarchical clustering algorithm is promising, as it can reduce the time complexity and choose the filter number automatically. Such an algorithm splits clusters with high variance iteratively and terminates when a cluster reaches a sufficiently small variance.
In the testing phase of our current VCNN model, a testing sample is first assigned to a confusion set by using the neural network prediction scores and then allocated to a confusion subset by using the nearest-neighbor rule. However, two errors may occur before the final prediction; we call them "accuracy leakage". First, a testing sample may be wrongly assigned to a confusion set that does not include its ground-truth class. Second, a testing sample may be allocated to a wrong confusion subset. Therefore, instead of using the hard assignment, a soft assignment strategy may be more robust.
The error analysis in Chapter 5 showed that some samples cannot be well classified because the inter-class variation is too small. Our current network, which follows the VoxelNet structure, is clearly not able to discriminate these classes. To push the classification performance further, fine-grained classification can be introduced to deal with these difficult cases. In other words, more details can be extracted to differentiate the minor differences between two classes. To fulfill this idea, a finer resolution of input samples can be adopted. In order to keep the network complexity unchanged, the input can be represented by multiple partial shapes. Then, different network models can be trained on different parts, and the features from all parts can be merged to reach the final prediction.
Bibliography
[1] N. Alajlan, M. S. Kamel, and G. H. Freeman. Geometry-based image retrieval
in binary image databases. Pattern Analysis and Machine Intelligence, IEEE
Transactions on, 30(6):1003–1013, 2008.
[2] M. Ankerst, G. Kastenmüller, H.-P. Kriegel, and T. Seidl. 3d shape histograms for
similarity search and classification in spatial databases. In Advances in Spatial
Databases, pages 207–226. Springer, 1999.
[3] C. Aslan, A. Erdem, E. Erdem, and S. Tari. Disconnected skeleton: Shape at its
absolute scale. Pattern Analysis and Machine Intelligence, IEEE Transactions
on, 30(12):2188–2203, 2008.
[4] E. Attalla and P. Siy. Robust shape similarity retrieval based on contour seg-
mentation polygonal multiresolution and elastic matching. Pattern Recognition,
38(12):2229–2241, 2005.
[5] S. Bai, X. Bai, Z. Zhou, Z. Zhang, and L. Jan Latecki. Gift: A real-time and
scalable 3d shape search engine. In The IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), June 2016.
[6] S. Bai, X. Bai, Z. Zhou, Z. Zhang, and L. J. Latecki. Gift: A real-time and
scalable 3d shape search engine. arXiv preprint arXiv:1604.01879, 2016.
[7] X. Bai and L. J. Latecki. Path similarity skeleton graph matching. Pattern Anal-
ysis and Machine Intelligence, IEEE Transactions on, 30(7):1282–1292, 2008.
[8] X. Bai, L. Li, and S. Zhang. Software for 3d model retrieval using local shape
distributions, 2012.
[9] X. Bai, B. Wang, C. Yao, W. Liu, and Z. Tu. Co-transduction for shape retrieval.
Image Processing, IEEE Transactions on, 21(5):2747–2757, 2012.
[10] X. Bai, X. Yang, L. J. Latecki, W. Liu, and Z. Tu. Learning context-sensitive
shape similarity by graph transduction. Pattern Analysis and Machine Intelli-
gence, IEEE Transactions on, 32(5):861–874, 2010.
[11] H. Bay, T. Tuytelaars, and L. Van Gool. Surf: Speeded up robust features. In
Computer vision–ECCV 2006, pages 404–417. Springer, 2006.
[12] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition
using shape contexts. Pattern Analysis and Machine Intelligence, IEEE Transac-
tions on, 24(4):509–522, 2002.
[13] L. Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
[14] A. Bronstein, M. Bronstein, U. Castellani, B. Falcidieno, A. Fusiello, A. Godil,
L. Guibas, I. Kokkinos, Z. Lian, M. Ovsjanikov, et al. Shrec 2010: robust large-
scale shape retrieval benchmark. Proc. 3DOR, 5:4, 2010.
[15] A. M. Bronstein, M. M. Bronstein, L. J. Guibas, and M. Ovsjanikov. Shape
google: Geometric words and expressions for invariant shape retrieval. ACM
Transactions on Graphics (TOG), 30(1):1, 2011.
[16] M. M. Bronstein and I. Kokkinos. Scale-invariant heat kernel signatures for non-
rigid shape recognition. In Computer Vision and Pattern Recognition (CVPR),
2010 IEEE Conference on, pages 1704–1711. IEEE, 2010.
[17] A. X. Chang, T. A. Funkhouser, L. J. Guibas, P. Hanrahan, Q. Huang, Z. Li,
S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. Shapenet: An
information-rich 3d model repository. CoRR, abs/1512.03012, 2015.
[18] M. Chaouch and A. Verroust-Blondet. A new descriptor for 2d depth image
indexing and 3d model retrieval. In Image Processing, 2007. ICIP 2007. IEEE
International Conference on, volume 6, pages VI–373. IEEE, 2007.
[19] D.-Y. Chen, X.-P. Tian, Y.-T. Shen, and M. Ouhyoung. On visual similarity based
3d model retrieval. In Computer graphics forum, volume 22, pages 223–232.
Wiley Online Library, 2003.
[20] X. Cui, Y. Liu, S. Shan, X. Chen, and W. Gao. 3d haar-like features for pedestrian
detection. In Multimedia and Expo, 2007 IEEE International Conference on,
pages 1263–1266. IEEE, 2007.
[21] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection.
In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer
Society Conference on, volume 1, pages 886–893. IEEE, 2005.
[22] M. R. Daliri and V . Torre. Robust symbolic representation for shape recognition
and retrieval. Pattern Recognition, 41(5):1782–1798, 2008.
[23] G. Dogan, J. Bernal, and C. R. Hagwood. A fast algorithm for elastic shape
distances between closed planar curves. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages 4222–4230, 2015.
[24] J. Dong, Q. Chen, J. Feng, K. Jia, Z. Huang, and S. Yan. Looking inside cate-
gory: subcategory-aware object recognition. IEEE Transactions on Circuits and
Systems for Video Technology, 25(8):1322–1334, 2015.
[25] M. Donoser and H. Bischof. Diffusion processes for retrieval revisited. In Com-
puter Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages
1320–1327. IEEE, 2013.
[26] H. Dutagaci, A. Godil, P. Daras, A. Axenopoulos, G. Litos, S. Manolopoulou,
K. Goto, T. Yanagimachi, Y. Kurita, S. Kawamura, et al. Shrec'11 track: generic
shape retrieval. In Proceedings of the 4th Eurographics conference on 3D Object
Retrieval, pages 65–69. Eurographics Association, 2011.
[27] P. F. Felzenszwalb and J. D. Schwartz. Hierarchical matching of deformable
shapes. In Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE
Conference on, pages 1–8. IEEE, 2007.
[28] R. Gal, A. Shamir, and D. Cohen-Or. Pose-oblivious shape signature. IEEE
transactions on visualization and computer graphics, 13(2):261–271, 2007.
[29] Y. Gao, Q. Dai, M. Wang, and N. Zhang. 3d model retrieval using weighted
bipartite graph matching. Signal Processing: Image Communication, 26(1):39–
47, 2011.
[30] D. Giorgi, S. Biasotti, and L. Paraboschi. Shape retrieval contest 2007: Watertight
models track. SHREC competition, 8, 2007.
[31] A. Godil, H. Dutagaci, C. B. Akg¨ ul, A. Axenopoulos, B. Bustos, M. Chaouch,
P. Daras, T. Furuya, S. Kreft, Z. Lian, et al. Shrec’09 track: Generic shape
retrieval. In 3DOR, pages 61–68, 2009.
[32] R. Gopalan, P. Turaga, and R. Chellappa. Articulation-invariant representation of
non-planar shapes. In Computer Vision–ECCV 2010, pages 286–299. Springer,
2010.
[33] J. A. Hartigan and M. A. Wong. Algorithm as 136: A k-means clustering algo-
rithm. Journal of the Royal Statistical Society. Series C (Applied Statistics),
28(1):100–108, 1979.
[34] V . Hegde and R. Zadeh. Fusionnet: 3d object classification using multiple data
representations. arXiv preprint arXiv:1607.05695, 2016.
[35] E. Johns, S. Leutenegger, and A. J. Davison. Pairwise decomposition of image
sequences for active multi-view recognition. arXiv preprint arXiv:1605.08359,
2016.
[36] A. E. Johnson and M. Hebert. Using spin images for efficient object recognition
in cluttered 3d scenes. Pattern Analysis and Machine Intelligence, IEEE Trans-
actions on, 21(5):433–449, 1999.
[37] M. Kazhdan, B. Chazelle, D. Dobkin, T. Funkhouser, and S. Rusinkiewicz. A
reflective symmetry descriptor for 3d models. Algorithmica, 38(1):201–225,
2004.
[38] M. Kazhdan, T. Funkhouser, and S. Rusinkiewicz. Rotation invariant spherical
harmonic representation of 3 d shape descriptors. In Symposium on geometry
processing, volume 6, pages 156–164, 2003.
[39] A. Khotanzad and Y. H. Hong. Invariant image recognition by zernike moments.
Pattern Analysis and Machine Intelligence, IEEE Transactions on, 12(5):489–
497, 1990.
[40] J. Knopp, M. Prasad, G. Willems, R. Timofte, and L. Van Gool. Hough trans-
form and 3d surf for robust three dimensional classification. In Computer Vision–
ECCV 2010, pages 589–602. Springer, 2010.
[41] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep
convolutional neural networks. In Advances in neural information processing
systems, pages 1097–1105, 2012.
[42] C.-C. J. Kuo. Understanding convolutional neural networks with a mathematical
model. Journal of Visual Communication and Image Representation, 41:406–
413, 2016.
[43] C. J. Kuo. Understanding convolutional neural networks with a mathematical
model. CoRR, abs/1609.04112, 2016.
[44] L. J. Latecki, R. Lakämper, and U. Eckhardt. Shape descriptors for non-rigid
shapes with a single closed contour. In Computer Vision and Pattern Recognition,
2000. Proceedings. IEEE Conference on, volume 1, pages 424–429. IEEE, 2000.
[45] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444,
2015.
[46] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied
to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[47] B. Li, A. Godil, M. Aono, X. Bai, T. Furuya, L. Li, R. J. López-Sastre, H. Johan,
R. Ohbuchi, C. Redondo-Cabrera, et al. Shrec’12 track: Generic 3d shape
retrieval. In 3DOR, pages 119–126, 2012.
[48] B. Li and H. Johan. 3d model retrieval using hybrid features and class informa-
tion. Multimedia tools and applications, 62(3):821–846, 2013.
[49] B. Li, Y. Lu, C. Li, A. Godil, T. Schreck, M. Aono, M. Burtscher, Q. Chen, N. K.
Chowdhury, B. Fang, et al. A comparison of 3d shape retrieval methods based
on a large-scale benchmark supporting multimodal queries. Computer Vision and
Image Understanding, 131:1–27, 2015.
[50] B. Li, Y. Lu, C. Li, A. Godil, T. Schreck, M. Aono, Q. Chen, N. K. Chowdhury,
B. Fang, T. Furuya, et al. Shrec14 track: Large scale comprehensive 3d shape
retrieval. In Eurographics Workshop on 3D Object Retrieval, volume 2014, pages
131–40, 2014.
[51] Z. Lian, A. Godil, B. Bustos, M. Daoudi, J. Hermans, S. Kawamura, Y. Kurita,
G. Lavoué, and P. Suetens. Shape retrieval on non-rigid 3d watertight meshes.
In Eurographics Workshop on 3D Object Retrieval (3DOR), 2011.
[52] Z. Lian, P. L. Rosin, and X. Sun. Rectilinearity of 3d meshes. International
Journal of Computer Vision, 89(2-3):130–151, 2010.
[53] Z. Lian, J. Zhang, S. Choi, H. ElNaghy, J. El-Sana, T. Furuya, A. Giachetti,
R. Guler, L. Isaia, L. Lai, et al. Shrec'15 track: Non-rigid 3d shape retrieval. In
Proc. Eurographics Workshop on 3D Object Retrieval, 2015.
[54] H. Ling and D. W. Jacobs. Shape classification using the inner-distance. Pattern
Analysis and Machine Intelligence, IEEE Transactions on, 29(2):286–299, 2007.
[55] H. Ling, X. Yang, and L. J. Latecki. Balancing deformability and discriminability
for shape matching. In Computer Vision–ECCV 2010, pages 411–424. Springer,
2010.
[56] W. Liu, D. Tao, J. Cheng, and Y. Tang. Multiview hessian discriminative
sparse coding for image annotation. Computer Vision and Image Understand-
ing, 118:50–60, 2014.
[57] D. G. Lowe. Object recognition from local scale-invariant features. In Computer
vision, 1999. The proceedings of the seventh IEEE international conference on,
volume 2, pages 1150–1157. Ieee, 1999.
[58] Y. Luo, T. Liu, D. Tao, and C. Xu. Multiview matrix completion for multilabel
image classification. Image Processing, IEEE Transactions on, 24(8):2355–2368,
2015.
[59] Y. Luo, Y. Wen, D. Tao, J. Gui, and C. Xu. Large margin multi-modal multi-task
feature extraction for image classification. Image Processing, IEEE Transactions
on, 25(1):414–427, 2016.
[60] D. Maturana and S. Scherer. Voxnet: A 3d convolutional neural network for
real-time object recognition. In Intelligent Robots and Systems (IROS), 2015
IEEE/RSJ International Conference on, pages 922–928. IEEE, 2015.
[61] F. Mokhtarian, S. Abbasi, J. Kittler, et al. Efficient and robust retrieval by
shape content through curvature scale space. Series on Software Engineering
and Knowledge Engineering, 8:51–58, 1997.
[62] F. Mokhtarian and M. Bober. Curvature scale space representation: theory,
applications, and MPEG-7 standardization, volume 25. Springer Science &
Business Media, 2013.
[63] F. Mokhtarian and R. Suomela. Robust image corner detection through curvature
scale space. Pattern Analysis and Machine Intelligence, IEEE Transactions on,
20(12):1376–1381, 1998.
[64] A. Y. Ng, M. I. Jordan, Y. Weiss, et al. On spectral clustering: Analysis and an
algorithm. Advances in neural information processing systems, 2:849–856, 2002.
[65] F. S. Nooruddin and G. Turk. Simplification and repair of polygonal models using
volumetric techniques. Visualization and Computer Graphics, IEEE Transactions
on, 9(2):191–205, 2003.
[66] M. Novotni and R. Klein. Shape retrieval using 3d zernike descriptors. Computer-
Aided Design, 36(11):1047–1062, 2004.
[67] R. Ohbuchi and T. Furuya. Distance metric learning and feature combination for
shape-based 3d model retrieval. In Proceedings of the ACM workshop on 3D
object retrieval, pages 63–68. ACM, 2010.
[68] R. Ohbuchi, K. Osada, T. Furuya, and T. Banno. Salient local visual features for
shape-based 3d model retrieval. In Shape Modeling and Applications, 2008. SMI
2008. IEEE International Conference on, pages 93–102. IEEE, 2008.
[69] R. Osada, T. Funkhouser, B. Chazelle, and D. Dobkin. Shape distributions. ACM
Transactions on Graphics (TOG), 21(4):807–832, 2002.
[70] P. Papadakis, I. Pratikakis, T. Theoharis, and S. Perantonis. Panorama: A 3d
shape descriptor based on panoramic views for unsupervised 3d object retrieval.
International Journal of Computer Vision, 89(2-3):177–192, 2010.
[71] D. Pelleg, A. W. Moore, et al. X-means: Extending k-means with efficient esti-
mation of the number of clusters. In ICML, volume 1, 2000.
[72] W. K. Pratt. Digital Image Processing: PIKS Scientific Inside. Wiley-
Interscience, 2007.
[73] V . Premachandran and R. Kakarala. Consensus of k-nns for robust neighborhood
selection on graph-based manifolds. In Computer Vision and Pattern Recognition
(CVPR), 2013 IEEE Conference on, pages 1594–1601. IEEE, 2013.
[74] N. Qadeer, D. Hu, X. Liu, S. Anwar, and M. S. Sultan. Improving shape retrieval
by integrating air and modified mutual nn graph. Advances in Multimedia, 2015,
2015.
[75] C. R. Qi, H. Su, M. Niessner, A. Dai, M. Yan, and L. J. Guibas. Volumet-
ric and multi-view cnns for object classification on 3d data. arXiv preprint
arXiv:1604.03265, 2016.
[76] M. Reuter, F.-E. Wolter, and N. Peinecke. Laplace–Beltrami spectra as "Shape-DNA" of
surfaces and solids. Computer-Aided Design, 38(4):342–366, 2006.
[77] M. Savva, F. Yu, H. Su, M. Aono, B. Chen, D. Cohen-Or, W. Deng, H. Su, S. Bai,
X. Bai, et al. Shrec'16 track: Large-scale 3d shape retrieval from ShapeNet Core55.
[78] T. B. Sebastian, P. N. Klein, and B. B. Kimia. Recognition of shapes by editing
their shock graphs. Pattern Analysis and Machine Intelligence, IEEE Transac-
tions on, 26(5):550–571, 2004.
[79] B. Shi, S. Bai, Z. Zhou, and X. Bai. Deeppano: Deep panoramic representation
for 3-d shape recognition. IEEE Signal Processing Letters, 22(12):2339–2343,
2015.
[80] P. Shilane, P. Min, M. Kazhdan, and T. Funkhouser. The princeton shape bench-
mark. In Shape modeling applications, 2004. Proceedings, pages 167–178. IEEE,
2004.
[81] X. Shu and X.-J. Wu. A novel contour descriptor for 2d shape matching and
its application to image retrieval. Image and vision Computing, 29(4):286–294,
2011.
[82] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale
image recognition. CoRR, abs/1409.1556, 2014.
[83] D. Smeets, T. Fabry, J. Hermans, D. Vandermeulen, and P. Suetens. Isometric
deformation modelling for object recognition. In Computer Analysis of Images
and Patterns, pages 757–765. Springer, 2009.
[84] D. Smeets, J. Keustermans, D. Vandermeulen, and P. Suetens. meshsift: Local
surface features for 3d face recognition under expression variations and partial
data. Computer Vision and Image Understanding, 117(2):158–169, 2013.
[85] S. Spaeth and P. Hausberg. Can open source hardware disrupt manufacturing
industries? the role of platforms and trust in the rise of 3d printing. In The
Decentralized and Networked Future of Value Creation, pages 59–73. Springer,
2016.
[86] A. Srivastava, E. Klassen, S. H. Joshi, and I. H. Jermyn. Shape analysis of elastic
curves in euclidean spaces. Pattern Analysis and Machine Intelligence, IEEE
Transactions on, 33(7):1415–1428, 2011.
[87] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller. Multi-view convolutional
neural networks for 3d shape recognition. In Proceedings of the IEEE Interna-
tional Conference on Computer Vision, pages 945–953, 2015.
[88] H. Sundar, D. Silver, N. Gagvani, and S. Dickinson. Skeleton based shape match-
ing and retrieval. In Shape Modeling International, 2003, pages 130–139. IEEE,
2003.
[89] J. W. Tangelder and R. C. Veltkamp. A survey of content based 3d shape retrieval
methods. Multimedia tools and applications, 39(3):441–471, 2008.
[90] D. Tao, X. Li, X. Wu, and S. J. Maybank. General tensor discriminant analysis
and gabor features for gait recognition. Pattern Analysis and Machine Intelli-
gence, IEEE Transactions on, 29(10):1700–1715, 2007.
[91] D. Tao, X. Tang, X. Li, and X. Wu. Asymmetric bagging and random subspace
for support vector machines-based relevance feedback in image retrieval. Pat-
tern Analysis and Machine Intelligence, IEEE Transactions on, 28(7):1088–1099,
2006.
[92] Z. Tu and A. L. Yuille. Shape matching and recognition–using generative mod-
els and informative features. In Computer Vision-ECCV 2004, pages 195–209.
Springer, 2004.
[93] T. Vanamali, A. Godil, H. Dutagaci, T. Furuya, Z. Lian, and R. Ohbuchi. Shrec’10
track: generic 3d warehouse. In Proceedings of the 3rd Eurographics conference
on 3D Object Retrieval, pages 93–100. Eurographics Association, 2010.
[94] P. Viola, M. J. Jones, and D. Snow. Detecting pedestrians using patterns of motion
and appearance. International Journal of Computer Vision, 63(2):153–161, 2005.
[95] J. Wang, X. Bai, X. You, W. Liu, and L. J. Latecki. Shape matching and clas-
sification using height functions. Pattern Recognition Letters, 33(2):134–143,
2012.
[96] J. Wu, C. Zhang, T. Xue, W. T. Freeman, and J. B. Tenenbaum. Learning a
probabilistic latent space of object shapes via 3d generative-adversarial modeling.
arXiv preprint arXiv:1610.07584, 2016.
[97] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets:
A deep representation for volumetric shapes. In Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition, pages 1912–1920, 2015.
[98] J. Xie, Y . Fang, F. Zhu, and E. Wong. Deepshape: Deep learned shape descriptor
for 3d shape matching and retrieval. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 1275–1283, 2015.
[99] C. Xu, D. Tao, and C. Xu. A survey on multi-view learning. arXiv preprint
arXiv:1304.5634, 2013.
[100] C. Xu, D. Tao, and C. Xu. Large-margin multi-view information bottleneck. Pat-
tern Analysis and Machine Intelligence, IEEE Transactions on, 36(8):1559–1572,
2014.
[101] C. Xu, D. Tao, and C. Xu. Multi-view intact space learning. Pattern Analysis and
Machine Intelligence, IEEE Transactions on, 37(12):2531–2544, 2015.
[102] C. Xu, D. Tao, C. Xu, and Y . Rui. Large-margin weakly supervised dimension-
ality reduction. In Proceedings of the 31st International Conference on Machine
Learning (ICML-14), pages 865–873, 2014.
[103] X. Yang, S. Koknar-Tezel, and L. J. Latecki. Locally constrained diffusion pro-
cess on locally densified distance spaces with applications to shape retrieval. In
Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference
on, pages 357–364. IEEE, 2009.
[104] X. Yang, L. Prasad, and L. J. Latecki. Affinity learning with diffusion on tensor
product graph. Pattern Analysis and Machine Intelligence, IEEE Transactions
on, 35(1):28–38, 2013.
[105] J. Yu, Y. Rui, and D. Tao. Click prediction for web image reranking using mul-
timodal sparse coding. Image Processing, IEEE Transactions on, 23(5):2019–
2032, 2014.
[106] J. Yu, D. Tao, M. Wang, and Y. Rui. Learning to rank using user clicks and visual
features for image retrieval. Cybernetics, IEEE Transactions on, 45(4):767–779,
2015.
[107] A. Zaharescu, E. Boyer, K. Varanasi, and R. Horaud. Surface feature detection
and description with applications to mesh matching. In Computer Vision and
Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 373–380.
IEEE, 2009.
[108] D. Zhang and G. Lu. An integrated approach to shape based image retrieval. In
Proceedings of 5th Asian Conference on Computer Vision (ACCV), Melbourne,
Australia, 2002.
[109] D. Zhang and G. Lu. Shape-based image retrieval using generic fourier descriptor.
Signal Processing: Image Communication, 17(10):825–848, 2002.
Abstract

Shape classification and retrieval are two important problems in both computer vision and computer graphics. A robust shape analysis contributes to many applications such as manufacturing component recognition and retrieval, sketch-based shape retrieval, medical image analysis, 3D model repository management, etc. In this dissertation, we propose three methods to address three significant problems: 2D shape retrieval, 3D shape retrieval and 3D shape classification, respectively.

First, in the 2D shape retrieval problem, most state-of-the-art shape retrieval methods are based on local feature matching and ranking. Their retrieval performance is not robust since they may retrieve globally dissimilar shapes at high ranks. To overcome this challenge, we decompose the decision process into two stages. In the first, irrelevant cluster filtering (ICF) stage, we consider both global and local features and use them to predict the relevance of gallery shapes with respect to the query. Irrelevant shapes are removed from the candidate shape set. After that, a local-features-based matching and ranking (LMR) method follows in the second stage. We apply the proposed TSR system to three shape datasets: MPEG-7, Kimia99 and Tari1000. We show that TSR outperforms all other existing methods. The robustness of TSR is demonstrated by the retrieval performance.

Second, a novel solution for the content-based 3D shape retrieval problem using an unsupervised clustering approach, which does not need any label information of 3D shapes, is presented. The proposed shape retrieval system consists of two modules in cascade: the irrelevance filtering (IF) module and the similarity ranking (SR) module. The IF module attempts to cluster gallery shapes that are similar to each other by examining global and local features simultaneously. However, shapes that are close in the local feature space can be distant in the global feature space, and vice versa. To resolve this issue, we propose a joint cost function that strikes a balance between the two distances. Irrelevant samples that are close in the local feature space but distant in the global feature space can be removed in this stage. The remaining gallery samples are ranked in the SR module using the local feature. The superior performance of the proposed IF/SR method is demonstrated by extensive experiments conducted on the popular SHREC12 dataset.

Third, the design, analysis and application of a volumetric convolutional neural network (VCNN) are studied to address the 3D shape classification problem. Although a large number of CNNs have been proposed in the literature, their design is empirical. In the design of the VCNN, we propose a feed-forward K-means clustering algorithm to determine the filter number and size at each convolutional layer systematically. For the analysis of the VCNN, we focus on the relationship between the filter weights (also known as anchor vectors) from the last fully connected (FC) layer to the output. Typically, the output of the VCNN contains a couple of sets of confusing classes, and the cause of these confusion sets can be well explained by analyzing their anchor vector relationships. Furthermore, a hierarchical clustering method followed by a random forest classification method is proposed to boost the classification performance among confusing classes. For the application of the VCNN, we examine the 3D shape classification problem and conduct experiments on the popular ModelNet40 dataset. The proposed VCNN offers state-of-the-art performance among all volume-based CNN methods.