Advanced Techniques for Object Classification: Methodologies
and Performance Evaluation
by
Yijing Yang
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
August 2022
Copyright 2022 Yijing Yang
Acknowledgements
I acknowledge the Center for Advanced Research Computing (CARC) at the University of South-
ern California for providing computing resources that have contributed to part of the research results reported in this thesis.
I would like to express my deepest gratitude to my supervisor, Prof. C.-C. Jay Kuo, for his
continued guidance and invaluable supervision throughout my Ph.D. study. I could not have un-
dertaken this journey without his encouraging and motivating inspiration and guidance. His endless energy, passion, and enthusiasm for research also set a role model for me. I see many good characteristics in him, for example, his efficiency at work, his attitude toward and curiosity about research, and his patience with students. I hope to carry these characteristics I learned from Prof. Kuo
forward throughout my career.
I would also like to thank my committee members who generously provided suggestions,
knowledge and expertise.
Besides, I would like to thank all of my labmates in Media Communications Lab at USC for
their kind support during my Ph.D. journey. As Prof. Kuo mentioned when I first came here, MCL
is like a big family. Labmates here are always willing to help each other, giving advice on research and daily life, having discussions and getting inspired by each other, sharing resources,
and providing mental support.
Last but not least, my warm and heartfelt thanks go to all of my friends and family members,
especially my parents and my boyfriend, for encouraging and supporting me throughout the entire
process and every day.
Table of Contents
Acknowledgements ii
List of Tables vi
List of Figures viii
Abstract xi
Chapter 1: Introduction 1
1.1 Significance of the Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Contributions of the Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 Advanced Techniques for Multi-Class Object Classification . . . . . . . . 4
1.2.2 Advanced Techniques for Resolving Binary Confusing Sets . . . . . . . . 5
1.2.3 On Supervised Feature Selection from High Dimensional Feature Spaces . 6
1.2.4 Design of Supervision-Scalable Learning Systems: Methodology and Per-
formance Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Chapter 2: Research Background 9
2.1 Multi-scale Features for Object Classification . . . . . . . . . . . . . . . . . . . . 9
2.2 Boosting Methods in Machine Learning . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Attention Mechanism for Image Classification . . . . . . . . . . . . . . . . . . . . 11
2.4 Hierarchical Classification Strategy . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.5 Feature Selection in Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . 13
2.6 Weakly Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.7 Successive Subspace Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Chapter 3: Advanced Techniques for Multi-Class Object Classification 18
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3 Soft-Label Smoothing Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3.1 Intra-Hop Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3.2 Soft-Label Smoothing (SLS) . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.4 Hard Sample Mining Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.4.1 Cross-validation-based Hard Examples Identification . . . . . . . . . . . . 26
3.4.2 Classification with Hard Sample Mining . . . . . . . . . . . . . . . . . . . 27
3.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.5.2 Effects of Different Color Spaces . . . . . . . . . . . . . . . . . . . . . . 29
3.5.3 Effects of Soft-Label Smoothing (SLS) . . . . . . . . . . . . . . . . . . . 30
3.5.4 Effects of Hard Sample Mining . . . . . . . . . . . . . . . . . . . . . . . 32
3.5.5 Ablation Study and Performance Benchmarking . . . . . . . . . . . . . . 36
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Chapter 4: Advanced Techniques for Resolving Binary Confusing Sets 39
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2 Confusing Set Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3 Attention Localization Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3.1 Preliminary Annotation Generation . . . . . . . . . . . . . . . . . . . . . 43
4.3.2 Weakly-supervised Attention Heatmap Refinement . . . . . . . . . . . . . 46
4.3.3 Adaptive Bounding Box Generation . . . . . . . . . . . . . . . . . . . . . 48
4.4 Integrated System of Multi-Class Classification and Confusing Set Resolution . . . 49
4.4.1 Overview of the integrated framework: Att-PixelHop Method . . . . . . . 49
4.4.2 Attentive Binary Class Object Classification . . . . . . . . . . . . . . . . . 50
4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.5.2 Weakly Supervised Attention Localization . . . . . . . . . . . . . . . . . 52
4.5.3 Attentive Binary Object Classification . . . . . . . . . . . . . . . . . . . . 55
4.5.4 Attentive Multi-class Object Classification . . . . . . . . . . . . . . . . . 56
4.5.5 Performance Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Chapter 5: On Supervised Feature Selection from High Dimensional Feature Spaces 60
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.2 Proposed Feature Selection Methods . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.2.1 Discriminant Feature Test (DFT) . . . . . . . . . . . . . . . . . . . . . . . 62
5.2.1.1 Training Sample Partitioning . . . . . . . . . . . . . . . . . . . 63
5.2.1.2 DFT Loss Measured by Entropy . . . . . . . . . . . . . . . . . 63
5.2.1.3 Feature Selection Based on Optimized Loss . . . . . . . . . . . 64
5.2.2 Relevant Feature Test (RFT) . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.2.2.1 Training Sample Partitioning . . . . . . . . . . . . . . . . . . . 65
5.2.2.2 RFT Loss Measured by Estimated Regression MSE . . . . . . . 65
5.2.2.3 Feature Selection Based on Optimized Loss . . . . . . . . . . . 65
5.2.3 Robustness Against Bin Numbers . . . . . . . . . . . . . . . . . . . . . . 66
5.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.3.1 Image Datasets with High Dimensional Feature Space . . . . . . . . . . . 66
5.3.2 DFT for Classification Problems . . . . . . . . . . . . . . . . . . . . . . . 72
5.3.2.1 DFT offers an obvious elbow point . . . . . . . . . . . . . . . . 72
5.3.2.2 Features selected by DFT achieve comparable and stable classification performance . . . . . . . . . . . . . . . . . . . . . 74
5.3.2.3 Comparison between DFT and mRMR . . . . . . . . . . . . . . 74
5.3.2.4 DFT requires less running time . . . . . . . . . . . . . . . . . . 75
5.3.2.5 DFT with feature pre-processing . . . . . . . . . . . . . . . . . 76
5.3.3 RFT for Regression Problems . . . . . . . . . . . . . . . . . . . . . . . . 77
5.3.3.1 RFT offers a more obvious elbow point . . . . . . . . . . . . . . 77
5.3.3.2 Features selected by RFT achieve comparable and stable performance . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Chapter 6: Design of Supervision-Scalable Learning Systems: Methodology and Perfor-
mance Benchmarking 80
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.2 Design of Learning Systems with HOG Features . . . . . . . . . . . . . . . . . . . 81
6.2.1 Design of Three Modular Components . . . . . . . . . . . . . . . . . . . . 82
6.2.1.1 Representation Learning . . . . . . . . . . . . . . . . . . . . . . 82
6.2.1.2 Feature Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.2.1.3 Decision Learning . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.2.2 HOG-I and HOG-II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.3 Design of Learning Systems with SSL Features . . . . . . . . . . . . . . . . . . . 90
6.3.1 SSL-based Representation Learning . . . . . . . . . . . . . . . . . . . . . 90
6.3.2 IPHop-I and IPHop-II . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.4.2 Performance Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Chapter 7: Conclusion and Future Work 106
7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
7.2 Future Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Bibliography 111
List of Tables
3.1 Hyper parameters of the PixelHop++ feature extraction architecture. . . . . . . . . 29
3.2 Comparison of test accuracy (%) with different combinations of P and Q channels . 30
3.3 Comparison of test accuracy (%) using different color spaces . . . . . . . . . . . . 30
3.4 Comparison of image-level test accuracy (%) for four confusion sets with and with-
out SLS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.5 Summary of test accuracy for the LFE-based retraining process. L/H represent different resolutions of the feature map used in SLS. . . . . . . . . . . . . . . . . . 34
3.6 Ablation study of E-PixelHop++’s components for CIFAR-10 . . . . . . . . . . . 37
3.7 Comparison of testing accuracy (%) of LeNet-5, PixelHop, PixelHop+ and Pixel-
Hop++ for CIFAR-10. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.8 Hyper parameters of the modified LeNet-5 network as compared with those of the
original LeNet-5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.1 The confusion matrix for the CIFAR-10 dataset, where the first row shows the
predicted object labels and the first column shows the ground truth . . . . . . . . . 41
4.2 Comparison of image-level test accuracy (%) of the binary object classification
with and without attention localization . . . . . . . . . . . . . . . . . . . . . . . . 55
4.3 Comparison of conditional test accuracy (%) between stage-1 and stage-2 for the
four most confusing groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.4 Ablation study of Att-PixelHop components for CIFAR-10 . . . . . . . . . . . . . 59
4.5 Comparison of testing accuracy (%) of LeNet-5, PixelHop, PixelHop+, Pixel-
Hop++ and Att-PixelHop for CIFAR-10. . . . . . . . . . . . . . . . . . . . . . . . 59
5.1 Classification accuracy (%) of LeNet-5 on MNIST and Fashion-MNIST. . . . . . . 66
5.2 Comparison of classification performance (%) on Clean MNIST between different
feature selection methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.3 Comparison of classification performance (%) on Noisy MNIST between different
feature selection methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.4 Comparison of classification performance (%) on Clean Fashion-MNIST between
different feature selection methods. . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.5 Comparison of classification performance (%) on Noisy Fashion-MNIST between
different feature selection methods. . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.6 Comparison of classification performance (%) on MultiFeat between different fea-
ture selection methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.7 Comparison of number of errors on Colon cancer dataset between different feature
selection methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.8 Running time (sec.) comparison of different feature selection methods. . . . . . . . 73
5.9 Regression MSE comparison for MNIST (clean/noisy) images with features se-
lected by four methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.10 Regression MSE comparison for Fashion-MNIST (clean/noisy) images with fea-
tures selected by four methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.1 Comparison of the mean test accuracy (%) and standard deviation on MNIST un-
der weak, middle and strong Supervision degree, where the best performance is
highlighted in bold. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.2 Comparison of the mean test accuracy (%) and standard deviation on Fashion-
MNIST under weak, middle and strong Supervision degree, where the best perfor-
mance is highlighted in bold. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
List of Figures
2.1 Illustration of the tree-decomposed feature representation in channel-wise Saab
transform presented in [18]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1 Illustration of the E-PixelHop++ classification baseline. . . . . . . . . . . . . . . . 20
3.2 Illustration of the proposed Soft-Label Smoothing (SLS) which is an iterative pro-
cess. We show an example with three Hop units. Each node represents a pixel.
The dashed lines between two Hop units indicate the children-parent relation in
the local graph construction which is used for cross-Hop label update. . . . . . . . 23
3.3 Illustration of local graph construction based on c/w Saab transforms. . . . . . . . 24
3.4 Illustration of cross-hop label updates. . . . . . . . . . . . . . . . . . . . . . . . . 25
3.5 Illustration of intermediate label maps for different methods, where each subfigure
shows label heatmaps in hop-2, hop-3 and hop-4 (from left to right). . . . . . . . . 32
3.6 Pixel-level test accuracy in hop-2, hop-3 and hop-4 and the image-level ensemble
as a function of iteration numbers for P-channel images only. . . . . . . . . . . . . 33
3.7 Test accuracy w.r.t. different difficult sample numbers N^- (left) and total number of selected training samples N_total (right) . . . . . . . . . . . . . . . . . . . . . . . 34
3.8 Test accuracy w.r.t. different test sample selection thresholds T_te . . . . . . . . . . 35
3.9 Conditional test accuracy of samples with maximum probability higher than T_te . . 36
4.1 Att-PixelHop is a two-stage sequential classification method where stage 1 is for
multi-class baseline and stage 2 is confusion set resolution. . . . . . . . . . . . . . 40
4.2 Overall pipeline for the proposed attention detection mechanism. . . . . . . . . . . 42
4.3 Illustration of the modified feature extraction architecture for preliminary annota-
tion generation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.4 Cascaded feature maps from different Hop units form a feature pyramid. . . . . . . 44
4.5 Examples of approximated preliminary attention for training images indicated in
red bounding boxes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.6 Illustration of weakly supervised attention heatmap prediction. . . . . . . . . . . . 47
4.7 Feature extraction pipeline for weakly supervised attention heatmap prediction. . . 48
4.8 Examples of predicted attention heatmaps through weakly supervision. . . . . . . . 48
4.9 Illustration of the proposed adaptive bounding box generation . . . . . . . . . . . 50
4.10 Overall pipeline for binary object classification with classification on both full
frame images and cropped attention regions. . . . . . . . . . . . . . . . . . . . . . 51
4.11 Examples of non-animal images with the predicted attention heatmaps, the gener-
ated bounding boxes and cropped regions. . . . . . . . . . . . . . . . . . . . . . . 53
4.12 Examples of animal images with the predicted attention heatmaps, the generated
bounding boxes and cropped regions. . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.13 Test accuracy for another 11 confusion sets among the Top 15, besides the 4 groups
included in Table 4.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.14 The plot of CIFAR-10 test accuracy (%) as a function of the cumulative number
of resolved confusion sets, with comparison between full frame, cropped attention
and ensemble classification. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.1 An overview of the proposed feature selection methods: DFT and RFT. For the i-th feature, DFT measures the class distribution in S^i_L and S^i_R to compute the weighted entropy as the DFT loss, while RFT measures the weighted estimated regression MSE in both sets as the RFT loss. . . . . . . . . . . . . . . . . . . . . . . 61
5.2 Comparison of two binning schemes with B= 16 and B= 64: (a) DFT and (b) RFT. 67
5.3 Comparison of distinct feature selection capability among four feature selection
methods for the classification task on the Fashion-MNIST dataset. . . . . . . . . . 68
5.4 Error rate comparison on the MultiFeat dataset between mRMR and DFT. . . . . . 70
5.5 Comparison of the number of errors on the Colon dataset between mRMR and DFT. 70
5.6 Error rate comparison on the ARR dataset between mRMR and the DFT. . . . . . . 71
5.7 Performance comparison of DFT feature selection with and without PCA feature
pre-processing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.8 Histogram comparison of feature indices ranked by the energy with 20 and 40
selected feature numbers before and after PCA pre-processing. The smaller the
ranking index in the x-axis, the higher the feature energy. . . . . . . . . . . . . . . 71
5.9 Comparison of relevant feature selection capability among four feature selection
methods for the regression task on the Fashion-MNIST dataset. . . . . . . . . . . . 78
6.1 An overview of the HOG-based learning system, where the input image is of size
28× 28. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.2 Performance comparison of HOG-based learning systems on MNIST and Fashion-
MNIST datasets under four different combinations among two feature learning
methods (variance thresholding and DFT) and two classifiers (KNN and XGBoost)
as a function of the training sample number per class in the log scale. . . . . . . . . 85
6.3 Performance comparison of spatial, spectral and joint HOG features on MNIST
under strong and weak supervision conditions. . . . . . . . . . . . . . . . . . . . . 86
6.4 Performance comparison of spatial, spectral and joint HOG features on Fashion-
MNIST under strong and weak supervision conditions. . . . . . . . . . . . . . . . 87
6.5 An overview of the SSL-based learning system, where the input image is of size
32× 32 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.6 Performance comparison of SSL-based learning systems on MNIST and Fashion-
MNIST datasets under four different combinations among two feature learning
methods (variance thresholding and DFT) and two classifiers (KNN and XGBoost)
as a function of n_c = log_2(N_c). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.7 Performance comparison of spatial, spectral and joint SSL features on MNIST
under strong and weak supervision conditions. . . . . . . . . . . . . . . . . . . . . 93
6.8 Performance comparison of spatial, spectral and joint SSL features on Fashion-
MNIST under strong and weak supervision conditions. . . . . . . . . . . . . . . . 94
6.9 Comparison of test accuracy between hybrid HOG, hybrid IPHop, and LeNet-5 for
MNIST and Fashion-MNIST. For hybrid HOG and IPHop, type I is adopted when
N_c ≤ 8 and type II is adopted for N_c ≥ 16. . . . . . . . . . . . . . . . . . . . . . . 100
6.10 The plot of Frobenius norms of difference matrices between the covariance matri-
ces learned under different supervision levels and the one learned from the full set
for MNIST. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.11 IoU scores between the feature sets selected using full training size and using N_c on MNIST and Fashion-MNIST. . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.12 Learning curve using LeNet-5 on MNIST dataset with selected supervision levels. . 103
7.1 An example of different sequences for brain MRI: (a) T1-weighted, (b) T2-weighted and (c) FLAIR. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
7.2 An example of different types of brain tumors: (a) Normal; (b) Glioma Grade
I; (c) Glioma Grade III; (d) Glioma Grade IV; (e) Meningioma; (f) Metastatic
Adenocarcinoma; (g) Metastatic Bronchogenic carcinoma; (h) Sarcoma. . . . . . . 110
Abstract
Based on the recently developed successive subspace learning (SSL) methodology, advanced techniques for image classification problems are proposed in this thesis. Specifically, the research can be decomposed into four parts: 1) improving the performance of multi-class classification, 2)
improving the performance of resolving confusing sets using attention localization, 3) enhancing
the quality of the learnt feature space by conducting a novel supervised feature selection, and 4)
designing supervision-scalable learning systems.
The first two parts focus on improving the image classification performance by proposing three
advanced techniques: soft-label smoothing, hard sample mining, and weakly supervised attention
localization based on successive subspace learning methodology. The image classification prob-
lem is decomposed into two stages: multi-class classification in the first stage and the confusing
set resolution in the second stage. In the first stage, soft-label smoothing (SLS) is proposed to en-
hance the performance and ensure consistency of intra-hop and inter-hop pixel-level predictions.
The effectiveness of SLS shows the importance of prediction agreement at different scales. To process
the RGB color images, we introduce a new color space named PQR based on principal component
analysis (PCA) where different channels are classified separately. Besides, a hard sample min-
ing-based retraining strategy is developed, which encourages the classification framework to focus more on recognizing difficult cases. In the second stage, we propose a new SSL-based weakly
supervised attention localization method for low resolution images. Starting with no bounding box
information, preliminary attention predictions are generated based on image-level labels in a feed-
forward manner. A light human-in-the-loop step helps select good examples as weak annotations for the enhanced attention learning. Experiments show that, with only 1% of images selected, discriminative regions can be well localized through adaptive bounding box generation. Furthermore, it is important to handle confusing classes carefully. The classification architecture is generalized from multi-class to binary classification in confusing set resolution, and the predictions from attention regions are ensembled with those from the original full-frame image for higher performance and better robustness.
The third part presents a novel supervised feature selection methodology inspired by infor-
mation theory and the decision tree. The resulting tests are called the discriminant feature test
(DFT) and the relevant feature test (RFT) for classification and regression tasks, respectively. Our
proposed methods belong to the filter methods, which give a score to each dimension and select
features based on feature ranking. As compared with other existing feature selection methods, DFT
and RFT are effective in finding distinct feature subspaces by offering obvious elbow regions in
DFT/RFT curves. Intensive experiments show that they provide feature subspaces of significantly lower dimensions while maintaining near-optimal classification/regression performance, and that they are computationally efficient and robust to noisy input data. The proposed methods work not only on deep features derived from images, but also on general datasets such as handcrafted features and gene expression data.
In the last part, the supervision-scalable learning systems are studied. Two families of modu-
larized systems are proposed based on HOG features and SSL features, respectively. We discuss
ways to adjust each module so that the design is more robust against the number of training sam-
ples. Specifically, the DFT feature selection proposed in the third work is applied to the robust
learning systems. Experiments and analysis show that both HOG-based and SSL-based learning
systems work better than LeNet-5 under weak supervision and have performance comparable with
LeNet-5 under strong supervision.
Chapter 1
Introduction
1.1 Significance of the Research
Object classification has been studied for many years as a fundamental problem in computer vi-
sion. With the development of convolutional neural networks (CNNs) and the availability of larger
scale datasets, we have seen rapid success in classification using deep learning for both low- and high-resolution images [62], [45], [47], [91], [95]. Deep learning networks use backpropagation to optimize an objective function and find the optimal network parameters. Although effective, deep learning demands a high computational cost. As the network goes deeper, the model
size increases dramatically. Research on light-weight neural networks [50], [46], [96] has received
attention to address the complexity issue. One major challenge associated with deep learning is
that its underlying mechanism is not transparent.
Another challenge is the amount of accessible labeled data. Supervised learning is the main
stream in pattern recognition, computer vision and natural language processing nowadays due to
the great success of deep learning. On one hand, the performance of a learning system should
improve as the number of training samples increases. On the other hand, some learning systems
may benefit more than others from a large number of training samples. For example, deep neural
networks (DNNs) often work better than classical learning systems that consist of two stages, feature extraction and classification. How the quantity of labeled samples affects the performance of learning systems is an important question in the data-driven era. It is known that humans can learn effectively in a weakly supervised setting. In contrast, deep learning networks often need more labeled data to achieve good performance. What makes weak supervision and strong supervision different? Is it possible to design a supervision-scalable learning system? There is little study on the design of supervision-scalable learning systems. Thus, we attempt to shed light on these
questions by choosing the image classification problem as an illustrative example in this thesis.
Recently, based on the development of the successive subspace learning (SSL) methodology introduced by a series of papers [58, 59, 60, 61], the PixelHop [17] and the PixelHop++ [18] meth-
ods have been proposed for image classification. Both follow the traditional pattern recognition
paradigm and partition the classification problem into two cascaded modules: 1) feature extrac-
tion and 2) classification. Powerful spatial-spectral features can be extracted from PixelHop and
PixelHop++ in an unsupervised manner. Then, they are fed into a trained classifier for final deci-
sion. Every step in PixelHop/PixelHop++ is explainable, and the whole solution is mathematically
transparent.
Following the SSL methodology, our research is conducted from three aspects that influence the classification performance: 1) framework architecture, 2) feature quality, and 3) the availability of labeled data.
For the first aspect, we identify some limitations of the current SSL-based classification frame-
works, which motivate the research in this thesis. First, for image classification, only the global
coarse-scale information may not be sufficient. Mid-range to local fine-scale information can help
characterize discriminant information of an region of interest. Since the object size varies a lot
from images to images, the preference of different scales also differs. Different receptive fields are
needed for various object sizes and the context around objects. In both PixelHop and PixelHop++,
the feature representations of all the scales from fine to coarse are used for the final image-level
classification. Specifically, a feature transformation called label-assisted regression (LAG) unit
was proposed to transform the pixel-level features into a more separable subspace. However, the
2
feature transformation in different scales are processed independently. The relation between fea-
tures from the similar spatial locations are not fully explored. Motivated by this, we propose a
Soft-label Smoothing (SLS) method to enhance the usage of local features.
Besides, being built upon the SSL-based feedforward design, it is also worthwhile to think
about whether any iteration-based learning can gradually enhance the performance by resolving
confusing classes or difficult samples. The existing frameworks are one-pass only and do not have a confusing set resolution scheme for progressive learning. Classification with hard sample selection is studied to solve this problem. Also, in computer vision, attention localization mimics the human vision system, which places more emphasis on the foreground regions or the discriminative parts of the objects, in order to force the classification model to learn more representative patterns across different object classes. Although the backpropagation of deep learning networks makes the DNN less transparent, the end-to-end optimization using image-level labels makes the resulting representations contain more semantic meaning. Responses are highlighted on the attention regions naturally after the training because those regions give the best discriminant power to distinguish between different classes. Following this direction, to further improve the performance of SSL-based feedforward classification frameworks, attention localization needs to be considered in the design. Thus, a weakly supervised attention localization method is proposed in this research for
low resolution images, and is applied to the object classification pipeline to resolve binary confus-
ing sets. In that way, pixels from the discriminative regions contribute more to the recognition.
Besides the framework architecture, feature quality is the second aspect that influences the
classification performance. Traditional machine learning algorithms are susceptible to the curse
of feature dimensionality [42]. Their computational complexity increases with high dimensional
features. Redundant features may not be helpful in discriminating classes or reducing regression
error, and they should be removed. Sometimes, redundant features may even produce negative effects as their number grows; their detrimental impact should be minimized or controlled. To deal with these problems, feature selection techniques [97, 74, 103] are commonly applied as a data pre-processing step or as part of the data analysis to simplify the complexity of the model. Feature selection techniques involve the identification of a subspace of discriminant features from the input, which describe the input data efficiently, reduce effects from noise or irrelevant features, and provide good prediction results [40]. In the SSL methodology, we follow the traditional pattern recognition paradigm, which consists of a feature extraction module and a classification head. The features easily reach a very high dimension and can contain redundant and less discriminative components. Thus, a novel supervised feature selection method is proposed in this research to solve the
feature dimensionality problem, inspired by information theory and the decision tree.
Finally, we study the influence from the quantity of labeled data. The design of supervision-
scalable learning systems is studied in this thesis. We choose the image classification problem
as an illustrative example and focus on the design of modularized systems that consist of three
learning modules: representation learning, feature learning and decision learning. Based on these
ideas, we propose two families of learning systems. One adopts the classical histogram of oriented
gradients (HOG) features [26] while the other uses successive-subspace-learning (SSL) features.
We discuss ways to adjust each module so that their design is more robust against the number of
training samples. Our proposed novel supervised feature selection method is also applied to build
the scalable systems.
1.2 Contributions of the Research
1.2.1 Advanced Techniques for Multi-Class Object Classification
In our research, we first study the multi-class object classification problem based on the methodology of successive subspace learning. An enhanced PixelHop++ method called E-PixelHop++ is proposed with several advanced techniques, summarized as follows.
(1) To address the importance of multi-scale features, we conduct pixel-level classification at
each hop which corresponds to a patch of various sizes in the input image. To further im-
prove pixel-level classification accuracy, we develop a soft-label smoothing (SLS) scheme
to ensure intra-hop and inter-hop pixel-level prediction consistency.
(2) To improve the image-level classification performance, we propose a hard sample mining
strategy in E-PixelHop++ which includes the resampling of hard examples in both training
and inference stages as an iterative training solution for SSL in the multi-class scenario.
(3) To decouple the color channels of a color image, principal component analysis (PCA) is performed at the pixel level to project the three RGB color channels onto two principal subspaces, named the P and Q channels, which are processed separately for classification in parallel. Pixel-level predictions from both projected subspaces are ensembled for the image-level prediction (a minimal sketch of this color decoupling is given after this list).
(4) Experimental results show the effectiveness of our proposed techniques which improve the
overall classification performance by a significant margin.
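A minimal Python sketch of the color decoupling in item (3) is given below. It learns pixel-level PCA kernels from the RGB values of all training pixels and projects an image onto the resulting P, Q (and R) coordinates; array shapes and function names are assumptions for illustration, not the thesis implementation.

```python
import numpy as np

def fit_pqr_kernels(images):
    """Learn pixel-level PCA kernels from RGB values; images: (N, H, W, 3) floats."""
    pixels = images.reshape(-1, 3)                 # pool RGB vectors from all pixels
    mean = pixels.mean(axis=0)
    cov = np.cov(pixels - mean, rowvar=False)      # 3x3 color covariance
    eigvals, eigvecs = np.linalg.eigh(cov)         # ascending eigenvalues
    kernels = eigvecs[:, ::-1]                     # columns: P, Q, R directions
    return mean, kernels

def rgb_to_pqr(image, mean, kernels):
    """Project one (H, W, 3) RGB image onto the decoupled PQR coordinates."""
    return (image - mean) @ kernels                # (H, W, 3): P, Q, R channels

# Usage: the P and Q channels (first two principal components) are classified separately.
# mean, kernels = fit_pqr_kernels(train_images)
# pqr = rgb_to_pqr(test_image, mean, kernels)
# p_channel, q_channel = pqr[..., 0], pqr[..., 1]
```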
1.2.2 Advanced Techniques for Resolving Binary Confusing Sets
To further improve the performance, we study the binary object classification problem for confus-
ing set resolution in the second part of the research. Our key contributions can be summarized as
follows.
(1) To highlight the discriminative regions of an image, we propose an effective weakly supervised attention region localization method for low-resolution images which requires no bounding box information from the input data.
(2) To resolve confusing classes for further performance boosting, we propose an attentive PixelHop++ (Att-PixelHop) method, which is formulated as a two-stage pipeline. In the first stage, multi-class classification is performed to get a soft decision for each class; the top-2 classes with the highest probabilities, called the confusing classes, are resolved in the second stage as a binary object classification problem (a schematic sketch of this routing is given after this list).
(3) We apply our proposed attention localization method to the binary object classification to
further improve the performance, which is integrated into the confusing set resolution in the
two-stage pipeline of Att-PixelHop. Detected attention regions are cropped and resized for
classification which is then ensembled with the classification on the original image.
(4) Experiments show the effectiveness of the confusing set resolution and the contribution from
the detected attention region.
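The two-stage routing referenced in item (2) can be summarized by the schematic sketch below. The classifier objects and their interfaces (predict_proba, predict) are hypothetical placeholders; only the top-2 selection and the hand-off to a binary confusing-set classifier follow the description above.

```python
import numpy as np

def two_stage_predict(x, multiclass_clf, binary_clfs):
    """Stage 1: multi-class soft decision; stage 2: resolve the top-2 confusing classes.

    `multiclass_clf` and the entries of `binary_clfs` are hypothetical classifier
    objects; `binary_clfs[(a, b)]` is assumed to be trained on the class pair (a, b).
    """
    probs = multiclass_clf.predict_proba(x)         # soft decision over all classes
    top2 = tuple(sorted(np.argsort(probs)[-2:]))    # the two most likely (confusing) classes
    if top2 in binary_clfs:                         # a confusing set with a stage-2 model
        winner = binary_clfs[top2].predict(x)       # 0 or 1 within the pair
        return top2[int(winner)]
    return int(np.argmax(probs))                    # otherwise keep the stage-1 decision
```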
1.2.3 On Supervised Feature Selection from High Dimensional Feature Spaces
Effective feature selection techniques identify a discriminant feature subspace that lowers compu-
tational and modeling costs with little performance degradation. We propose a simple but effective
feature selection methodology, whose key contributions can be summarized as follows.
(1) Inspired by information theory and the decision tree, two supervised feature selection methods, the discriminant feature test (DFT) and the relevant feature test (RFT), are proposed for classification and regression problems, respectively (a rough sketch of the DFT scoring idea is given after this list).
(2) Intensive experiments are conducted using deep features obtained by LeNet-5 for MNIST
and Fashion-MNIST datasets as illustrative examples. Other datasets with handcrafted and
gene expressions features are also included for performance evaluation.
(3) The combination with feature pre-processing, such as feature decorrelation, is also studied.
(4) It is shown by experimental results that DFT and RFT can select a lower dimensional fea-
ture subspace distinctly and robustly while maintaining high decision performance and fast
running time on CPU only, compared with other popular feature selection methods.
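The sketch below gives a rough, assumed rendition of the DFT scoring idea mentioned in item (1): each feature dimension is scored by the minimum weighted class entropy over candidate partitions of the training samples, and features are then ranked by this loss. The exact binning and partitioning details in Chapter 5 may differ.

```python
import numpy as np

def weighted_entropy(labels_left, labels_right):
    """Weighted class entropy of a left/right partition (DFT loss for one split)."""
    def entropy(labels):
        if len(labels) == 0:
            return 0.0
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))
    n_l, n_r = len(labels_left), len(labels_right)
    n = n_l + n_r
    return (n_l / n) * entropy(labels_left) + (n_r / n) * entropy(labels_right)

def dft_score(feature, labels, num_bins=16):
    """DFT loss of one feature: minimum weighted entropy over candidate bin boundaries."""
    edges = np.linspace(feature.min(), feature.max(), num_bins + 1)[1:-1]
    losses = [weighted_entropy(labels[feature <= t], labels[feature > t]) for t in edges]
    return min(losses)

# Rank features by ascending DFT loss and keep those before the elbow of the curve:
# scores = np.array([dft_score(X[:, i], y) for i in range(X.shape[1])])
# ranked = np.argsort(scores)
```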
1.2.4 Design of Supervision-Scalable Learning Systems: Methodology and
Performance Benchmarking
To investigate the behavior of different networks with different amounts of labeled data, we study the design of supervision-scalable learning systems, choosing the image classification problem as an illustrative example to shed some light on this question. The main contributions of this work can be summarized
in the following aspects.
(1) Two families of modularized learning systems that demonstrate excellent scalable performance with respect to various supervision degrees are proposed. The first one adopts classical histogram of oriented gradients (HOG) [26] features, while the second one uses successive-subspace-learning (SSL) features (a small HOG extraction sketch follows this list).
(2) We discuss ways to adjust each module so that their design is more robust against the number
of training samples.
(3) Experimental results on MNIST and Fashion-MNIST image classification show that the two
families of modularized learning systems have more robust performance than LeNet-5. They
both outperform LeNet-5 by a large margin for small training sizes and have performance
comparable with that of LeNet-5 for large training sizes.
(4) The robustness of each module is discussed which reveals the source of scalability of the
designed learning systems.
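As a small illustration of the HOG-based family referenced in item (1), the snippet below extracts HOG features for a 28x28 image with scikit-image; the cell and block parameters are illustrative guesses, not necessarily the settings used in Chapter 6.

```python
import numpy as np
from skimage.feature import hog

image = np.random.rand(28, 28)      # stand-in for an MNIST digit

# Histogram of oriented gradients: the raw representation fed to
# feature learning (e.g., variance thresholding or DFT) and a classifier.
features = hog(
    image,
    orientations=8,
    pixels_per_cell=(7, 7),
    cells_per_block=(2, 2),
    feature_vector=True,
)
print(features.shape)               # a fixed-length feature vector per image
```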
1.3 Organization of the Thesis
The rest of this thesis is organized as follows. Background related to the research is reviewed
in Chapter 2. Two advanced techniques including Soft-label Smoothing (SLS) and hard sample
selection for multi-class classification are proposed in Chapter 3. The proposed weakly super-
vised attention localization and its application on resolving binary confusing sets are presented in
Chapter 4. Then, the proposed novel supervised feature selection methods including Discriminant
Feature Test (DFT) and Relevant Feature Test (RFT) are presented in Chapter 5. The design of the supervision-scalable systems is presented in Chapter 6. Finally, concluding remarks and future research directions are given in Chapter 7.
Chapter 2
Research Background
2.1 Multi-scale Features for Object Classification
Handcrafted features were extracted before the deep learning era. To obtain multi-scale features,
Mutch et al. [79] applied Gabor filters to all positions and scales. Scale-invariant features can
be derived by alternating template matching and max-pooling operations. Schnitzspan et al. [87]
proposed a hierarchical random field that combines the global-feature-based methods with the
local-feature-based approaches in one consistent multi-layer framework. In deep learning, the
decision is usually made using features from the deepest convolutional layer.
Recently, more investigations have been made to exploit outputs from shallower layers to improve
the classification performance. For example, Liu et al. [68] proposed a cross-convolutional-layer
pooling operation that extracts local features from one convolutional layer and pools the extracted
features with the guidance of the next convolutional layer. Jetley et al. [52] extracted features from
shallower layers and combined them with global features to estimate attention maps for further
classification performance improvement, where global features are used to derive attention in local
features that are consistent with the semantic meaning of the underlying objects.
2.2 Boosting Methods in Machine Learning
Boosting is a family of machine learning algorithms that fit a model as a combination of basis learners. It can be expressed by Eq. 2.1, where φ_m represents the m-th basis learner [78]. It originally comes from a decomposition in the function space, where φ_m is a basis function:

f(x) = w_0 + ∑_{m=1}^{M} w_m φ_m(x),    (2.1)
Usually for a classification problem, the basis learners can be weak classifiers such as decision
stumps or shallow decision trees. It was proved in [85, 36, 35] that boosting can improve the performance of any weak learning algorithm to a significantly higher level in theory, requiring only that the weak learner perform at least slightly better than random guessing. The weak learner is repeatedly applied to the training data with various weights based on the previous boosting rounds, where more weight is put on the misclassified data than on the correctly classified data. The final decision comes from the combination of the M weak learners into a single classifier.
Friedman et al. [37] extended the general boosting method to handle different loss functions.
The objective function of boosting is expressed as

min_f ∑_{n=1}^{N} L(y_n, f(x_n)),    (2.2)
where L is the loss function and f is the model expressed in Eq. 2.1. One can apply vari-
ous loss functions to get different boosting algorithms [44, 9]. For example, for exponential loss,
squared error and log-loss, the derived boosting methods are AdaBoost [36], L2Boosting [10] and
LogitBoost [37], respectively. Originally, these boosting algorithms are designed for a binary clas-
sification problem. Particularly, in AdaBoost with exponential loss, different weights are applied
to each instance in the dataset in the (m+1)-th boosting round, expressed by

w_{n,m+1} = w_{n,m} · exp( log[(1 − ε_m)/ε_m] · I(y_n ≠ φ_m(x_n)) ),    (2.3)
where ε_m is the weighted error rate calculated by Eq. 2.4. Thereby, more weight is put on the misclassified samples in each boosting round when fitting the classifier φ_m(x):

ε_m = [ ∑_{n=1}^{N} w_{n,m} · I(y_n ≠ φ_m(x_n)) ] / [ ∑_{n=1}^{N} w_{n,m} ],    (2.4)
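To make Eqs. 2.3 and 2.4 concrete, the following minimal Python sketch performs one AdaBoost-style boosting round with a decision stump as the weak learner. It is an illustrative rendition of the equations above (with a small clipping safeguard added), not code from the thesis.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_round(X, y, w):
    """One boosting round following Eqs. 2.3 and 2.4 (illustrative sketch)."""
    stump = DecisionTreeClassifier(max_depth=1)        # decision stump as the weak learner
    stump.fit(X, y, sample_weight=w)
    miss = (stump.predict(X) != y).astype(float)       # indicator I(y_n != phi_m(x_n))
    eps = np.sum(w * miss) / np.sum(w)                 # weighted error rate, Eq. 2.4
    eps = np.clip(eps, 1e-12, 1 - 1e-12)               # avoid division by zero / log(0)
    alpha = np.log((1.0 - eps) / eps)                  # log[(1 - eps) / eps]
    w_new = w * np.exp(alpha * miss)                   # up-weight misclassified samples, Eq. 2.3
    return stump, alpha, w_new / np.sum(w_new)         # normalize weights for the next round
```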
Considering the exponential growth of weights on the misclassified samples because of the
exponential loss used in AdaBoost, Friedman et al. [37] proposed LogitBoost by replacing the exponential loss with the log-loss. Instead of exponential growth, the log-loss has a linear relation w.r.t. the error so that it is less sensitive to outliers. It was also generalized to the multi-class classification problem in [37].
Besides boosting algorithms solving different loss functions, Friedman et al. [38] proposed the
Gradient Boosting from a new perspective, which treats the boosting process as a gradient descent
in the function space. To generalize from finite data points, the base learner is fit to be most parallel to the negative gradient direction, i.e., most correlated with the negative gradient signal. The advantage of the gradient boosting algorithm is that it does not need to apply weights to different data points, and it generalizes more flexibly to the multi-class scenario.
2.3 Attention Mechanism for Image Classification
For humans, given a natural image for recognition, regions that are discriminant in the content draw more attention than the others. With only those attention regions, the objects can still be recognized correctly with high probability, which shows the concentrated amount of information contained in those discriminant regions. This is even more essential for high-resolution images, where processing only the informative regions is more efficient than processing the full frame. Because of the attention mechanism in the human cognition system, several works in computer vision study how to explain the decisions made by an object recognition system by visualizing the discriminant regions it captures [113, 123, 119]. To further improve the classification performance, other works focus on how to use attention or object localization in the classification pipeline [52, 66, 105, 122, 113, 66].
Learning Soft Attention. Several existing works try to estimate attention maps in the classification pipeline from image-level labels. Since soft attention maps are differentiable, CNN-based methods tend to learn them in an end-to-end manner. A progressive attention process proposed by Seo et al. [88] gradually suppresses irrelevant regions in an input image, where the local context of each location is used to estimate a better attention probability map. Jetley et al. [52] connect the attention estimation from local features and the classification from global features using a compatibility score function. This also makes their framework able to produce different predicted attention patterns by changing the global feature used as a query vector. Zhou et al. [122]
proposed CAM which uses global average pooling to localize the discriminative regions learnt
from the CNNs. It is also shown that by cropping the most important regions, the classification
performance can be improved. Wang et al. [105] use class-specific attention to resolve confu-
sion between similar classes, which extracts signals for end-to-end model training with attention
separability and cross-layer consistency.
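To make the CAM idea of Zhou et al. [122] mentioned above concrete, a minimal sketch is given below: with global average pooling, a class score is a weighted sum of channel-averaged activations, so applying the same class weights per spatial location yields a localization map. Shapes and variable names are illustrative assumptions.

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """feature_maps: (C, H, W) last-conv activations; fc_weights: (num_classes, C).

    Under global average pooling, a class score is a weighted sum of channel-wise
    averaged activations, so the same weights applied per spatial location
    highlight the discriminative regions for that class.
    """
    weights = fc_weights[class_idx]                          # (C,)
    cam = np.tensordot(weights, feature_maps, axes=(0, 0))   # weighted sum over channels -> (H, W)
    cam -= cam.min()
    return cam / (cam.max() + 1e-8)                          # normalize to [0, 1]
```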
Object and Discriminative Part Localization. To improve the classification performance,
several works aim at combining object or part localization with the classification head. This direction has been widely explored in fine-grained recognition. Also, for small objects or low-resolution images, attention detection can be approximated by object localization. Given that the available label is usually only at the image level, several works [118, 49] rely on human annotation of object parts under a strong supervision assumption. However, the annotation process can be expensive, and the annotated key points or boxes are also biased across different annotators. From another perspective, some works [7, 28, 31] design human-in-the-loop frameworks in which humans either click on or mark the most discriminative regions, or answer questions for visual recognition.
Several other works solve the object localization problem with weakly supervised learning [104, 25, 80, 22, 5]. Zhang et al. [120] localize the discriminative parts by training from positive and negative image patches. Bazzani et al. [5] propose a self-taught object localization method which identifies the object regions by analyzing the change in the recognition scores when masking out different parts of the image.
2.4 Hierarchical Classification Strategy
It is easier to distinguish between dissimilar classes than between similar ones. For example,
one should distinguish between cats and cars better than between cats and dogs. Along this line,
one can build a hierarchical relation among multiple classes based on their semantic meaning to
improve classification performance [69, 89, 127]. Alsallakh et al. [6] investigated ways to exploit
the hierarchical structure to improve classification accuracy of CNNs. Yan et al. [109] proposed
a hierarchical deep CNN (HD-CNNs) which embeds deep CNNs into a category hierarchy. It
separates easy classes using a coarse category classifier and distinguishes difficult classes with a
fine category classifier. Chen et al. [13] proposed to merge images associated with confusing
anchor vectors into a confusion set and split the set to create multiple subsets in an unsupervised
manner. A random forest classifier is then trained for each confusion subset to boost the scene
classification performance. An identification and resolution scheme of confusing categories was
proposed by Li et al. [65] based on binary-tree-structured clustering. It can be applied to any
CNN-based object classification baseline to further improve its performance. Zhu et al. [126]
defined different levels of categories and proposed a Branch Convolutional Neural Network (B-
CNN) that outputs multiple predictions ordered from coarse to fine along concatenated
convolutional layers. This procedure is analogous to a hierarchical object classification scheme
and enforces the network to learn human understandable concepts in different layers.
2.5 Feature Selection in Machine Learning
Traditional machine learning algorithms are susceptible to the curse of feature dimensionality [42].
Their computational complexity increases with high dimensional features. Redundant features may
not be helpful in discriminating classes or reducing regression error, and they should be removed.
Sometimes, redundant features may even produce negative effects as their number grows; their detrimental impact should be minimized or controlled. To deal with these problems, feature se-
lection techniques [97, 74, 103] are commonly applied as a data pre-processing step or part of
the data analysis to simplify the complexity of the model. Feature selection techniques involve
the identification of a subspace of discriminant features from the input, which describe the in-
put data efficiently, reduce effects from noise or irrelevant features, and provide good prediction
results [40].
Feature selection methods can be categorized into three types: unsupervised [75, 11, 82, 92], semi-supervised [90, 121], and supervised [48]. Unsupervised methods focus on the statistics of input features while ignoring the target class or value. Straightforward unsupervised methods can be fast, e.g., removing redundant features using correlation or removing features of low variance. However,
their power is limited and less effective than supervised methods. More advanced unsupervised
methods adopt clustering. Examples include [63, 43, 2]. Their complexity is higher, their behavior
is not well understood, and their performance is not easy to evaluate systematically. Overall, this
is an open research field.
Existing semi-supervised and supervised feature selection methods can be classified into three classes: wrapper, filter, and embedded [90]. Wrapper methods [56] create multiple models with
different subsets of input features and select the model containing the features that yield the best
performance. One example is recursive feature elimination (RFE) [41]. This process can be com-
putationally expensive. Filter methods involve evaluating the relationship between input and target
variables using statistics and selecting those variables that have the strongest relation with the target
ones. One example is the analysis of variance (ANOVA) [86]. This approach is computationally
efficient with robust performance. Another example is feature selection based on linear discrim-
inant analysis (LDA). It finds the most separable projection directions. The objective function of
LDA is used to select discriminant features from the existing feature dimensions by measuring the
ratio between the between-class scatter matrix and the within-class scatter matrix. It can be gen-
eralized from the 2-class problem to the multi-class problem. Embedded methods perform feature
selection in the process of training and are usually specific to a single learner. One example is
“feature importance” (FI) obtained from the training process of the XGBoost classifier/regressor
[15], which is also known as “feature selection from model” (SFM). Our proposed Discriminant
Feature Test (DFT) and Relevant Feature Test (RFT) belong to the filter class.
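For concreteness, the filter paradigm described above (score each feature independently against the target, then rank) can be exercised with a generic ANOVA-based example in scikit-learn; this snippet is only an illustration of a filter method on toy data, not code or settings from the thesis.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Toy data: 200 samples, 50-dimensional features, 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = rng.integers(0, 3, size=200)

# Filter method: score every feature independently with the ANOVA F-statistic,
# then keep the 10 features with the strongest relation to the class label.
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)                       # (200, 10)
print(selector.get_support(indices=True))     # indices of the retained features
```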
2.6 Weakly Supervised Learning
Strong supervision is costly in practice since data labeling demands a lot of time and resource.
Besides, it is unlikely to collect and label desired training samples in all possible scenarios. Even
with a huge amount of labeled data in place, it may still be substantially less than the need. Weak
supervision can appear in different forms, e.g., inexact supervision, inaccurate supervision, and
incomplete supervision. Labels are provided at the coarse grain (instead of the instance level) in
inexact supervision. One example is multi-instance learning [33, 106]. For inaccurate supervision,
labels provided may suffer from labeling errors, leading to the noisy label problem in supervised
learning [4, 34]. Only a limited number of labeled data is available to the training process [124] in
incomplete supervision. Here, we consider the scenario of incomplete supervision.
To improve learning performance under incomplete supervision, solutions such as semi-supervised
learning and active learning have been developed. In semi-supervised learning, both labeled and
unlabeled data are utilized to achieve better performance [125]. It is built upon several assumptions
such as smoothness, low-density, and manifold assumptions [102]. Active learning attempts to expand the labeled data set by identifying important unlabeled instances that will boost the learning performance the most. Another related technology is few-shot learning (FSL) [32], which learns from a very limited number of labeled samples without the help of unlabeled data. For example, an N-way, K-shot classification task refers to a labeled set with K samples from each of the N classes.
Meta learning is often used to solve the FSL problem [93, 16].
Figure 2.1: Illustration of the tree-decomposed feature representation in channel-wise Saab trans-
form presented in [18].
2.7 Successive Subspace Learning
Being inspired by deep learning, the successive subspace learning (SSL) methodology was pro-
posed by Kuo et al. in a sequence of papers [58, 59, 60, 61]. SSL-based methods learn fea-
ture representations in an unsupervised feedforward manner using multi-stage principal component analysis (PCA). Joint spatial-spectral representations are obtained at different scales through
multi-stage transforms. Three variants of the PCA transform were developed. They are the Saak
transform [60, 19], the Saab transform [61], and the channel-wise (c/w) Saab transform [18]. The
c/w Saab transform is the most effective one among the three since it can reduce the model size and
improve the transform efficiency by learning filters in each channel separately. This is achieved
by exploiting the weak correlation between different spectral components in the Saab transform.
These features can be used to train classifiers in the training phase and provide inference in the
test phase. Figure 2.1 illustrates the tree-decomposed feature representation in channel-wise Saab
transform.
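The following Python sketch conveys, in a highly simplified single-hop form, the flavor of a Saab-like transform (local patch collection, a DC/mean component, and PCA-derived AC kernels). It is an assumed illustration under these simplifications, not the reference implementation of the Saab or channel-wise Saab transforms in [61, 18].

```python
import numpy as np

def fit_saab_kernels(images, patch=5, stride=1, num_ac=8):
    """Learn one hop of Saab-like kernels from (N, H, W) single-channel images."""
    patches = []
    for img in images:
        for i in range(0, img.shape[0] - patch + 1, stride):
            for j in range(0, img.shape[1] - patch + 1, stride):
                patches.append(img[i:i + patch, j:j + patch].ravel())
    X = np.asarray(patches)                        # (num_patches, patch*patch)
    dc = X.mean(axis=1, keepdims=True)             # DC component of each patch
    ac = X - dc                                    # remove the patch mean
    ac -= ac.mean(axis=0)                          # center before PCA
    cov = (ac.T @ ac) / (ac.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)         # ascending eigenvalues
    return eigvecs[:, ::-1].T[:num_ac]             # leading AC kernels (PCA directions) as rows

def saab_response(patch_vec, ac_kernels):
    """Joint DC + AC response of one flattened patch under the learned kernels."""
    dc = patch_vec.mean()
    return np.concatenate([[dc], ac_kernels @ (patch_vec - dc)])
```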
Two object classification pipelines, PixelHop [17] using the Saab transform and PixelHop++
[18] using the c/w Saab transform, were designed by Chen et al. Since the feature extraction mod-
ule is unsupervised, a label-assisted regression (LAG) unit was proposed in PixelHop as a feature
transformation, aiming at projecting the unsupervised feature to a more separable subspace. In
PixelHop++, cross-entropy based feature selection is applied before each LAG unit to select task-
dependent features, which provides a flexible tradeoff between model size and accuracy. Ensemble
methods were used in [20] to boost the classification performance. SSL has been successfully ap-
plied to many application domains as well. Examples include [116, 117, 115, 53, 73, 98, 84, 64,
83, 14, 114, 54, 71].
Chapter 3
Advanced Techniques for Multi-Class Object Classification
3.1 Introduction
In this chapter, we study the object classification problem in the context of multiple classes.
Based on the foundations of successive subspace learning (SSL) methodologies for object classifi-
cation including PixelHop and PixelHop++, we propose an enhanced solution called E-PixelHop++
which consists of the following main steps.
(1) To decouple the color channels of a color image, E-PixelHop++ applies principal component analysis (PCA) to project the three RGB color channels onto two principal subspaces, which are processed separately for classification.
(2) To address the importance of multi-scale features, we conduct pixel-level classification at
each hop which corresponds to a patch of various sizes in the input image.
(3) To further improve pixel-level classification accuracy, we develop a soft-label smoothing
(SLS) scheme to ensure intra-hop and inter-hop pixel-level prediction consistency.
(4) Pixel-level decisions from each hop are fused together for image-level decision.
(5) To improve the image-level classification performance, we propose a hard sample mining strategy that resamples hard examples from both the training and the test images as part of an iterative training process for SSL.
The main contributions of E-PixelHop++ lie in Steps 1, 3 and 5.
The rest of this chapter is organized as follows. The system overview is presented in Sec. 3.2.
The label smoothing procedure is detailed in Sec. 3.3. The hard sample mining based re-training
strategy is presented in Sec. 3.4. Experimental setup and results are detailed in Sec. 3.5. Conclu-
sions are finally drawn in Sec. 3.6.
3.2 System Overview
We follow the traditional pattern recognition paradigm by dividing the problem into two separate modules: feature extraction and classification. Multi-hop c/w Saab transforms are used to extract
joint spatial-spectral features in an unsupervised manner. For classification, we adopt pixel-based
classification in each hop that predicts objects of different scales at different spatial locations,
where a pixel in a deeper hop denotes a patch of a larger size in the input image. In the training,
ground truth labels at pixels follow image labels. Pixel-based classification based on the intra-hop
information only tends to be noisy. To address this problem, a soft-label smoothing (SLS) method
is proposed to reduce prediction uncertainty.
We use the CIFAR-10 dataset [57] as an example to explain the architecture of the multi-class
classification baseline. It consists of four modules: 1) Color decomposition, 2) PixelHop++ feature
extraction, 3) Pixel-level classification with label smoothing, 4) Meta classification. The system
diagram for the baseline classification is shown in Fig. 3.1.
1) Color decomposition. For input color images with three RGB channels, the feature dimension is three times the spatial dimension in a local neighborhood. To reduce the dimension, we apply principal component analysis (PCA) to the 3-D color vectors of all pixels. That is, we collect the 3-D RGB color vectors from pixels in all input images and learn the PCA kernels from the collected samples. They project the RGB color coordinates to decoupled color coordinates, which are named the PQR color coordinates, where the P and Q channels correspond to the 1st and 2nd principal components. Experimental results show that the P and Q channels contain approximately 98.5% of the energy of the three RGB channels. Specifically, the P channel has 91.08% and the Q channel 7.38%. Since the P and Q channels are uncorrelated with each other, we can process them separately in modules 2 and 3.

Figure 3.1: Illustration of the E-PixelHop++ classification baseline.
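As an illustration of this color decomposition, below is a minimal sketch (the function name is hypothetical) of learning a PQR transform with scikit-learn's PCA over all per-pixel RGB vectors.

    import numpy as np
    from sklearn.decomposition import PCA

    def learn_pqr(images):
        """Learn a PQR color transform from RGB images (sketch).

        images: (N, H, W, 3) RGB array. A PCA over all per-pixel RGB vectors
        yields three decorrelated channels; P and Q (the first two principal
        components) carry most of the energy and are kept.
        """
        rgb = images.reshape(-1, 3)                     # every pixel is a 3-D sample
        pca = PCA(n_components=3).fit(rgb)
        pqr = pca.transform(rgb).reshape(images.shape)  # decorrelated PQR channels
        energy = pca.explained_variance_ratio_          # e.g., roughly [0.91, 0.07, ...]
        return pqr[..., 0], pqr[..., 1], energy         # P and Q channels

    # Usage sketch: P, Q, energy = learn_pqr(train_images)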
2) PixelHop++ feature extraction. As shown in Fig. 3.1, we use multiple PixelHop++ units [18] to extract features from the P and Q channels, respectively. One PixelHop++ unit consists of two cascaded operations: 1) neighborhood construction with a defined window size in the spatial domain, and 2) the channel-wise (c/w) Saab transform. The input to the first PixelHop++ unit is a tensor of dimension (S_0 × S_0) × K_0, where S_0 × S_0 is the spatial dimension, K_0 is the spectral dimension, and subscript 0 indicates that it is the hop-0 representation. For raw RGB images, K_0 = 3. Through the color transformation, we have K_0 = 1 for each individual P or Q channel. The output of the i-th PixelHop++ unit is denoted by (S_i × S_i) × K_i, which is the hop-i representation. The hyper parameter K_i is determined by the total number of c/w Saab coefficients kept in hop-i. As i increases, the receptive field associated with a pixel in hop-i becomes larger and more global
information of the input image is captured. To enable faster expansion of the receptive field, we
apply max-pooling with stride 2 in shallower hops.
3) Pixel-level classification with label smoothing. The output from each PixelHop++ unit can
be viewed as a feature tensor. The feature vector associated with a pixel location is fed into a pixel-
level classifier to get a probability vector whose element indicates the probability of belonging to
a certain object class. This probability vector can be viewed as a soft decision or a soft label.
The soft label can be noisy, and we propose a soft-label smoothing (SLS) mechanism to make the
soft decision at neighboring spatial locations and hops more consistent. This is one of the main
contributions of E-PixelHop++. It will be elaborated in Sec. 3.3.
4) Meta classification. The soft decision maps from different hops of P and Q channels are
concatenated and used to train a meta classifier, which gives the baseline prediction.
5) Hard sample mining. The above components are used for one training round. By training
the classification models with all the training images in the first round, the easy images can be
predicted correctly with a high confidence level. In contrast, the images with a lower confidence
level in the previous round of classification are regarded as hard cases. With a similar motivation
to boosting methods, we develop a simple method to effectively select hard cases from all the
training images for another round of classification, instead of assigning weights to all the samples.
This module is called hard sample mining in this work. In hard sample mining, the selection
of difficult examples is performed not only for the training images, but also on the test set in the
inference stage. It will be detailed in Sec. 3.4.
3.3 Soft-Label Smoothing Technique
Soft label prediction at each pixel of a certain hop tends to be noisy. Label smoothing attempts to
convert noisy decision to a clean one using adjacent label prediction from the same hop or from
adjacent hops. The uncertainty of intra-hop prediction is first discussed in Sec. 3.3.1. The soft-label
smoothing (SLS) scheme is then presented in Sec. 3.3.2.
3.3.1 Intra-Hop Prediction
For image classification, the global coarse-scale information alone may not be sufficient. Mid-range
to local fine-scale information can help characterize the discriminant information of a region of interest.
However, the object size varies a lot from image to image. Different receptive fields are
needed for various object sizes and the context around objects. To exploit the characteristics of
different scales, we do pixel-wise classification at each hop and ensemble predicted class probabil-
ities from all hops. One naive approach is to use the extracted c/w Saab features of a single hop for
classification, called intra-hop prediction. For example, the c/w Saab features at hop-i are of dimension (S_i × S_i) × K_i, where S_i × S_i and K_i represent the spatial and spectral dimensions, respectively. N training images contribute N × S_i × S_i training samples for intra-hop prediction. Each sample has a feature vector of dimension K_i. For the n-th training image, all S_i × S_i pixels share the same object label y_n. For a test image, the output of intra-hop prediction at hop-i is a tensor of predicted class probabilities with the same spatial resolution as the input. The class probability vectors are the soft decision labels of the pixels at hop-i; each pixel corresponds to a spatial region of the input image whose size equals the receptive field of the pixel.
Pixel-wise classification can be noisy since its training label is set to the image label. For a dog
image, only part of the image contains discriminant dog characteristics while other parts belong to
the background and have nothing to do with dogs. On one hand, the shared background of dog and cat
images has two different labels in the training so that its prediction of dog or cat becomes less
confident. On the other hand, the intra-hop classifier will be more confident in its predictions in
regions that are uniquely associated with a particular class. For example, for a binary classification
between a cat and a dog, contents in background such as grass and sofa are usually common so
that the classifier is less confident in these regions as compared with the foreground regions that
contain cat/dog faces.
3.3.2 Soft-Label Smoothing (SLS)
Since the decision at each pixel is made independently of others in pixel-level classification, there
are local prediction fluctuations. We expect consistency in predictions in a spatial neighborhood
as well as regions of a similar receptive field across hops that form a pyramid. To reduce the
prediction fluctuation, we propose a soft-label smoothing scheme that iteratively uses predicted
labels to enhance classification performance while keeping intra-hop and inter-hop consistency. It
consists of three components: a) local graph construction; b) label initialization; and c) iterative
cross-hop label update. The process of steps b) and c) is illustrated in Figure 3.2. A similar
high-level idea was proposed in GraphHop [108], yet the details are different.
Figure 3.2: Illustration of the proposed Soft-Label Smoothing (SLS) which is an iterative process.
We show an example with three Hop units. Each node represents a pixel. The dashed lines between
two Hop units indicate the children-parent relation in the local graph construction which is used
for cross-Hop label update.
Local Graph Construction. The tree-decomposed structure of multi-hop c/w Saab transforms
corresponds to a directed graph. Each spatial location is a node. As shown in Fig. 3.3, we define
three node types – child nodes, parent nodes, and sibling nodes. Hop-i features are generated by
an input of size 5× 5 through the c/w Saab transform from hop-(i− 1). The 5× 5 locations in
hop-(i− 1) form 25 child nodes for the orange node at hop-i. The orange node is then covered by a
window of size 3× 3 to generate the green node in hop-(i+ 1). Thus, it is one of the 9 child nodes
of the green node and the green node is the parent node of the orange one. As to sibling nodes,
Figure 3.3: Illustration of local graph construction based on c/w Saab transforms.
they are neighboring pixels at the same hop. Specifically, we consider eight nearest neighbors. The
union of the eight neighbors and the center pixel form a 3× 3 window.
Label Initialization. For the hop-i feature tensor of dimension S_i × S_i × K_i, it consists of S_i^2 nodes. We treat each node at the same hop as a sample and train a classifier to generate a soft decision as its initial label. In the training, the image-level object class is propagated to all nodes at all hops. Only the c/w Saab features are used for label prediction initially, which is given by

    Z^{(0)}_{i,u,v} = h^{(0)}_i(f_{i,u,v}),    (3.1)

where Z^{(0)}_{i,u,v} is the initial label prediction for node (u,v) at hop-i, f_{i,u,v} is the corresponding c/w Saab feature vector and h^{(0)}_i denotes the classifier. For the hops whose child nodes already have label initialization, we concatenate the c/w Saab features with the label predicted in the previous hop to train the classifier. That is,

    Z^{(0)}_{i,u,v} = h^{(0)}_i\big( f_{i,u,v} \oplus \bar{Z}^{(0)}_{i-1,u,v} \big),    (3.2)

where the initial predicted labels of the M × N child nodes constructed by the local graph for node (u,v) are averaged as

    \bar{Z}^{(0)}_{i-1,u,v} = \frac{1}{M \times N} \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} Z^{(0)}_{i-1,m,n}.    (3.3)
This differs from intra-hop prediction because of the aggregation with the child nodes, through which inter-hop information is exchanged.
Cross-hop Label Update. Besides the intra-hop label update as described above, we conduct
the inter-hop label update iteratively. The label is updated from the shallow hop to the deep hop as
shown in Fig. 3.4 at each iteration. To update the label at the yellow location in hop-i, we gather
the neighborhood that forms a pyramid. The labels are averaged among sibling and child nodes in
blue and green regions, respectively. They are aggregated with labels of its parent node and itself.
For a C-class classification scenario, the degree of freedom of the predicted soft label is C− 1 so
that the aggregated label dimension is equal to 4× (C− 1). For example, in the Cat/Dog confusion
set with C = 2, only the dimension corresponding to dog in the current label is extracted. This
gives an aggregated feature of dimension 4. To control the model size, we do not use the c/w Saab features for the label update. A classifier is trained on the aggregated labels. The output soft decisions are the updated labels at iteration k, in the form of

    Z^{(k)}_{i,u,v} = h^{(k-1)}_i\big( Z^{agg,(k-1)}_{i,u,v} \big),    (3.4)

where Z^{agg,(k-1)}_{i,u,v} denotes the aggregated features of the cross-hop neighborhood.
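To illustrate one cross-hop update iteration, the sketch below aggregates the self, sibling-average, child-average, and parent labels into the 4-dimensional feature of the binary case (C = 2) and feeds it to an XGBoost classifier. The function name is hypothetical, and the parent and child score maps are assumed to be already spatially aligned with the current hop.

    import numpy as np
    from xgboost import XGBClassifier

    def cross_hop_update(z_self, z_parent, z_child_avg, y=None, clf=None):
        """One cross-hop label-update iteration for the binary case (sketch).

        z_self, z_parent, z_child_avg: (S, S) soft-score maps for the current hop,
        its parent hop and the averaged child nodes. If labels y are given, a new
        classifier is fit; otherwise the provided classifier is used for prediction.
        """
        S = z_self.shape[0]
        # Sibling average over each 3x3 neighborhood (edge padding).
        padded = np.pad(z_self, 1, mode="edge")
        z_sib = np.stack([padded[du:du + S, dv:dv + S]
                          for du in range(3) for dv in range(3)]).mean(axis=0)
        # Aggregate self / sibling / child / parent scores: 4 x (C - 1) = 4 features.
        feats = np.stack([z_self, z_sib, z_child_avg, z_parent], axis=-1).reshape(-1, 4)
        if y is not None:
            clf = XGBClassifier(n_estimators=100, max_depth=3)
            clf.fit(feats, y.reshape(-1))
        z_new = clf.predict_proba(feats)[:, 1].reshape(S, S)
        return z_new, clf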
Figure 3.4: Illustration of cross-hop label updates.
3.4 Hard Sample Mining Technique
When doing classification based on the feature space constructed for the images, we expect the test
set feature space to have the same distribution as the training set used to train the classifiers.
Meanwhile, the decision boundaries learned by the classifiers are influenced by the
distribution of the training samples.
In our work, instead of training the classifier on all the available training images with
different weights, we propose a hard sample mining method that improves upon the classification
performance of the previous training round at inference time. The entire hard sample mining based
retraining strategy can be divided into four steps: (1) cross-validation based hard example identi-
fication; (2) training sample pool construction; (3) second round retraining using selected training
samples; (4) test sample selection in the inference stage.
3.4.1 Cross-validation-based Hard Examples Identification
In order to estimate the hardness level of each sample in the training set, we use cross-validation, where the union of all the validation sets covers the whole training set. Suppose the cross-validation is K-fold; the designed classification framework is then applied K times so that each of the N_train training images receives a predicted soft decision from the classification model in whose validation set it falls.
To identify the hard examples, a loss thresholding strategy is applied. Usually, the hard examples are not only distributed on the wrong side of the decision boundary, but can also be located on the correct side very close to the decision boundary. We measure the cross-entropy loss for each training image as expressed in Eq. (3.5), based on the predicted soft decisions from the cross-validation, where C is the number of classes, p(x) is the predicted soft decision for sample x, and q(x) is the ground-truth one-hot vector:

    CE(x) = -\sum_{i=1}^{C} q(x)_i \log\big(p(x)_i\big).    (3.5)
A threshold T_tr for hard examples is defined on the cross-entropy loss instead of the 0-1 loss, where the former takes into account both the errors and the confidence level of the predictions for each individual sample. We denote the samples with CE(x) > T_tr as the subset S^-, which contains all the identified hard examples.
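A minimal sketch of this identification step is given below. All names are illustrative, and the default threshold merely mirrors the T_tr = 0.6 value reported later in the experiments.

    import numpy as np

    def identify_hard_examples(soft_decisions, labels, t_tr=0.6):
        """Flag hard training samples by cross-entropy loss (sketch).

        soft_decisions: (N, C) out-of-fold predicted probabilities gathered from
                        K-fold cross-validation, one row per training image.
        labels:         (N,) integer ground-truth classes.
        t_tr:           cross-entropy threshold; samples above it form S^-.
        """
        n = soft_decisions.shape[0]
        p_true = soft_decisions[np.arange(n), labels]     # probability of the true class
        ce = -np.log(np.clip(p_true, 1e-12, 1.0))         # one-hot CE reduces to -log p_true
        hard_idx = np.where(ce > t_tr)[0]                 # subset S^- of hard examples
        easy_idx = np.where(ce <= t_tr)[0]
        return hard_idx, easy_idx, ce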
3.4.2 Classification with Hard Sample Mining
Training Sample Pool Construction. With only the hard example subset S^- as training samples, the machine learning model is less able to learn well-defined decision boundaries. On the other hand, if all the other easy examples participate in the model training with equal weights, hard examples will contribute less to learning the decision boundaries. Thus, we randomly sample a subset of the images with CE(x) ≤ T_tr as the easy example subset S^+. The resulting training sample pool S is the union of the hard and easy example subsets. That is,

    S = S^- \cup S^+.    (3.6)

To control the contributions of S^- and S^+, a budget on the number of selected training samples in S is defined and denoted as N_total. Given the predicted soft decisions of the training samples and the threshold T_tr, the number of samples in subset S^- is fixed at N^-. Thus, the randomly sampled easy subset S^+ has N^+ = N_total - N^- images. This is equivalent to using a dual hyperparameter r that controls the ratio between the numbers of easy and hard examples, expressed by

    N^+ = r \cdot N^-.    (3.7)
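A possible implementation of the pool construction, assuming the hard/easy index sets from the previous step, is sketched below; all names and the default budget are hypothetical.

    import numpy as np

    def build_training_pool(hard_idx, easy_idx, n_total=30000, rng=None):
        """Construct the resampled training pool S = S^- U S^+ (sketch).

        All identified hard examples are kept; easy examples are randomly sampled
        to fill the remaining budget N^+ = N_total - N^-.
        """
        rng = np.random.default_rng(rng)
        n_minus = len(hard_idx)
        n_plus = max(n_total - n_minus, 0)
        sampled_easy = rng.choice(easy_idx, size=min(n_plus, len(easy_idx)), replace=False)
        return np.concatenate([hard_idx, sampled_easy])

    # Usage sketch: pool = build_training_pool(hard_idx, easy_idx, n_total=30000)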
Classification Model with Selected Training Samples. In the classification pipeline, we
first perform the regular training process using all the available training images as the first round.
Then hard sample mining is applied for the second round of training. The training sample
pool is constructed based on the defined T_tr and N_total (or r). A new classification module is then fit
using the resampled training images. Here, a new feature space can be used in the second round
of training. The easiest feature space substitution is to use a different color space, such as
RGB, CIELAB or our proposed PQR space. Taking advantage of the retraining process, the
change of feature space increases the diversity of information used to differentiate between the
difficult cases.
Test Sample Selection. In the first round of classification, easy test examples are predicted
correctly with high confidence. Thereby, instead of predicting all the test images with the new
classification model in the second round, only difficult test images are passed through the second
round of prediction. Unlike the training images, for which labels are provided, the cross-entropy
loss is unavailable for test images. Thus, a threshold T_te is defined on the confidence level, which
is the highest probability among all C classes. Test images with confidence less than T_te are
regarded as the difficult test samples and are predicted by the second-round classification.
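The corresponding test-time routing can be sketched as follows, with T_te = 0.5 (the value used later in the experiments) as an illustrative default; the function name is hypothetical.

    import numpy as np

    def select_hard_test_samples(soft_decisions, t_te=0.5):
        """Route low-confidence test images to the second-round model (sketch).

        soft_decisions: (N, C) first-round predicted probabilities.
        t_te:           confidence threshold; images whose maximum class
                        probability is below it are re-predicted in round 2.
        """
        confidence = soft_decisions.max(axis=1)
        hard_test_idx = np.where(confidence < t_te)[0]
        easy_test_idx = np.where(confidence >= t_te)[0]   # keep first-round predictions
        return hard_test_idx, easy_test_idx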
3.5 Experiments
3.5.1 Experimental Setup
We evaluate the classification performance of E-PixelHop++ using the CIFAR-10 [57] dataset that
contains 10 classes of tiny color images of spatial resolution 32× 32. There are 50,000 training
images and 10,000 test images. The hyper parameters of the PixelHop++ feature extraction archi-
tecture are listed in Table 3.1. The spectral dimensions are decided by energy thresholds in the c/w
Saab transform.
We adopt different schemes for the color spaces in different training rounds with the hard
sample mining. Specifically, PQR color space is used in the first round and RGB color space is
for the second training round. Comparison between different color spaces is detailed in Sec. 3.5.2.
As for the pixel-based prediction schemes in the first training round, prediction is conducted on three
output feature maps, namely, hop-2 after max-pooling, hop-3 and hop-4. The SLS method without
cross-hop label update is performed among these three hops. Different from the first training
round, the use of hop-2 either before or after max-pooling is compared in Sec. 3.5.4. Yet, hop-1 features are
not used since the receptive field is too small. We use the XGBoost (extreme gradient boosting)
classifier for all classification tasks.

Table 3.1: Hyper parameters of the PixelHop++ feature extraction architecture.

                     Filter                    Output Feature
                     Window Size    Stride     Spatial    Spectral (P/Q)
    Hop-1            5×5 (pad 2)    1×1        32×32      24 / 22
    Max-pooling 1    3×3            2×2        15×15      24 / 22
    Hop-2            5×5 (pad 2)    1×1        15×15      144 / 114
    Max-pooling 2    3×3            2×2        7×7        144 / 114
    Hop-3            3×3            1×1        5×5        203 / 174
    Hop-4            3×3            1×1        3×3        211 / 185
By following [91, 45], we adopt data augmentation in both training and test for performance
improvement. The augmented set includes the original input images, random square/rectangular croppings, contrast
manipulation, and horizontal flipping. Eight variants of each input are created. Lanczos interpolation
is applied to cropped images to resize them back to the 32×32 resolution. In the test phase, the
soft decisions from the eight variants are ordered by confidence level and concatenated
together to obtain the final image-level prediction using logistic regression.
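A rough sketch of how the eight variants per image could be generated with Pillow is shown below. The specific crop sizes and contrast factor are assumptions, since only the types of manipulations are specified above.

    import numpy as np
    from PIL import Image, ImageEnhance

    def make_variants(image, num_crops=5, rng=None):
        """Create eight augmented variants of a 32x32 image (sketch).

        Returns the original plus random crops (resized back with Lanczos),
        a contrast-manipulated copy and a horizontally flipped copy.
        """
        rng = np.random.default_rng(rng)
        img = Image.fromarray(image)
        w, h = img.size
        variants = [img]
        for _ in range(num_crops):                       # random square/rectangular crops
            cw, ch = int(rng.integers(20, w)), int(rng.integers(20, h))
            x0, y0 = int(rng.integers(0, w - cw + 1)), int(rng.integers(0, h - ch + 1))
            crop = img.crop((x0, y0, x0 + cw, y0 + ch))
            variants.append(crop.resize((w, h), Image.LANCZOS))
        variants.append(ImageEnhance.Contrast(img).enhance(1.5))   # contrast manipulation
        variants.append(img.transpose(Image.FLIP_LEFT_RIGHT))      # horizontal flip
        return [np.array(v) for v in variants]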
In the following experiments, the key components, including color decomposition and SLS, are
studied within the first training round in Sec. 3.5.2 and Sec. 3.5.3, respectively. The hard sample
mining based second-round retraining is detailed in Sec. 3.5.4.
3.5.2 Effects of Different Color Spaces
The first training round conducts classification in P and Q channels separately and then ensembles
the classification results as shown in Fig. 3.1. We compare different ways to handle color channels
for the baseline classifier in Table 3.2. Since the Q channel contains much less energy than the P
channel, its classification performance is worse. The ensemble of P and Q channels outperforms
the P channel alone by 3.46% and 2.83% in top-1 and top-2 accuracies, respectively. It shows that
P and Q channels are complementary to each other.
Besides, the same classification architecture is applied to the RGB color space and the PQ-Compound
color channels for comparison, as presented in Table 3.3. The difference between the PQ-Compound color
channels and the PQ-Ensemble used in Table 3.2 lies in where the P and Q channels are combined.
Instead of classifying each channel independently and ensembling the pixel-level decisions as in PQ-Ensemble,
PQ-Compound extracts features from the P and Q channels jointly and classifies them in a single
classification pipeline. Since the information presented by different color spaces is expected to
be similar, we control the feature dimensions of these three color manipulations to be
similar to each other after feature selection. Specifically, the dimension for PQ-Ensemble is the
summation of the feature dimensions of the P and Q channels.
Table 3.3 shows that PQ-Ensemble performs the best among the three settings. Although
its top-1 classification accuracy is similar to that of RGB features, its complexity is
much lower. The feature dimension of either the P or the Q channel is much lower, and the two channels
can be trained in parallel, which shows the advantage of color decomposition and ensembling.
Table 3.2: Comparison of test accuracy (%) with different combinations of P and Q channels.

    P channel   Q channel   Top-1   Top-2
    ✓                       70.26   84.92
                ✓           58.80   76.13
    ✓           ✓           73.72   87.75
Table 3.3: Comparison of test accuracy (%) using different color spaces.

                     Feature Dimension
    Color Space      Hop-2   Hop-3   Hop-4   Test Accuracy
    RGB              183     246     271     73.42
    PQ-Compound      168     263     290     72.62
    PQ-Ensemble      180     264     277     73.72
3.5.3 Effects of Soft-Label Smoothing (SLS)
To demonstrate the power of SLS, we study the binary classification problem and show the results
in Table 3.4 for four frequent confusion sets: Cat vs Dog, Airplane vs Ship, Automobile vs Truck,
and Deer vs Horse. The training set includes 5,000 images while the test set contains 1,000 images
from each of the two classes. For pixel-based classification, we examine test accuracies of intra-
hop prediction and the addition of soft-label smoothing (SLS), where results of P and Q channels
are ensembled. We see from the table that the addition of SLS outperforms that without SLS by
2%∼ 3% in all four confusing sets.
Table 3.4: Comparison of image-level test accuracy (%) for four confusion sets with and without SLS.

    Confusing Group         intra-hop only   intra-hop and SLS
    Cat vs Dog              76.45            79.1
    Airplane vs Ship        92.1             93.75
    Automobile vs Truck     89.8             92.95
    Deer vs Horse           87.8             90.95
To further understand the behavior of SLS, we show intermediate label maps as heat
maps from hop-2 to hop-4 for three Dog images in Fig. 3.5 with three schemes: a) intra-hop
prediction, b) one iteration of SLS label initialization only and c) SLS with another iteration of
cross-hop label update. The heatmaps indicate the confidence score at each pixel location for the
Dog class. For the hop-2 heatmaps, regions composed by the dog face have higher Dog confidence
than other regions. By comparing Figs. 3.5(a) and (b), we see an increase in the highest confidence
level in hop-3 using one round of SLS label initialization. Clearly, SLS makes label maps more
distinguishable. Furthermore, heatmaps are more consistent across hops in Fig. 3.5(b). After one
more SLS iteration of cross-hop update, the confidence scores become more homogeneous in a
neighborhood as shown in Fig. 3.5(c), where more locations agree with each other and with the
ground truth label.
Since SLS is an iterative process, we show the pixel-level accuracy and the final image-level
accuracy as a function of the iteration number for the four confusion sets discussed earlier in Fig.
3.6. The first point is the initial label in SLS. We see that the pixel-level accuracy at each hop
increases rapidly in the first several iterations. This is especially obvious in shallow hops. Then, these
curves converge at a certain level. In the experiments, we set the maximum number of iterations
to num iter = 3 for the one-versus-one confusion set resolution to avoid over-smoothing. In the
Figure 3.5: Illustration of intermediate label maps for different methods, where each subfigure
shows label heatmaps in hop-2, hop-3 and hop-4 (from left to right).
10-class baseline, we set num iter = 1 (i.e. only the initialization of SLS) by taking both inter-hop
guidance and model complexity into account.
Although the pixel-level accuracy saturates at the same level for all three hops, the ensemble
performance (in red dashed line) is better than that of each individual hop as shown in Fig. 3.6. As
the iteration number increases, the ensemble result also increases and saturates in 3 to 5 iterations
although the increment is smaller than that of the pixel-level accuracy.
3.5.4 Effects of Hard Sample Mining
Based on the foundations of the 1-st training round, we study the effects of the hard sample mining
based 2-nd training round. We use the PQ-Ensemble results for the 1-st training round considering
both the performance and complexity. In the 2-nd round, RGB features are used to increase the
diversity brought by the retraining.
We show the test accuracy as a function of N^- and N_total, respectively, in Figure 3.7 to illustrate its characteristics. Generally, with the same N_total, a model fit with more difficult samples achieves better performance than one fit with more easy samples. Also, although the performance decreases with a smaller total number of training samples, the degradation is not large; it stays within 1.5% in all cases. This shows the power of the hard sample resampling strategy, which can identify the most informative training samples.
Figure 3.6: Pixel-level test accuracy in hop-2, hop-3 and hop-4 and the image-level ensemble as a
function of iteration numbers for P-channel images only.
Besides the training sample pool construction, we evaluate the effect of another key component of our proposed hard sample mining based retraining, namely the test sample selection. Figure 3.8 shows the overall top-1 and top-2 test accuracy after the second round of prediction as a function of different T_te. When T_te = 0, the results are identical to those from the 1st round, while when T_te = 1 the classification results come purely from the 2nd round. It is clear that when T_te is between 0.3 and 0.6, the test performance is improved substantially by the second round of classification compared to the first round. The results are also robust to T_tr: in the same plot, different curves share the same trend with only a small gap in between.
To explain this, Figure 3.9 presents the conditional test accuracy for the first training round.
It evaluates the test accuracy among the test samples whose confidence for the most likely class
is larger than T_te, where the number marked above each point in the plot represents the absolute
number of such samples among all 10K test images. The plot shows that there exists a
Figure 3.7: Test accuracy w.r.t. different difficult sample numbers N^- (left) and total numbers of selected training samples N_total (right).
large number of easy test samples which do not need to be predicted by the model in the second
round. For example, 46.79% of the test images have confidence higher than 0.7, and their conditional
accuracy reaches 95%. Meanwhile, the interval between 0.3 and 0.6 gives the fastest growth in the
conditional accuracy. It provides a good trade-off range for the resampling in the inference stage.
In our design, we set N_total = 30,000, T_tr = 0.6, and T_te = 0.5. The difference between different runs is very small.
Table 3.5: Summary of test accuracy for the LFE-based retraining process. L/H represent different resolutions of the feature map used in SLS.

    Setting                   Top-1 (%)   Top-2 (%)
    Round-1  PQ-Ensemble-L    73.72       87.75
    Round-2  RGB-L            76.36       88.10
             RGB-H            76.54       88.49
We summarize the top-1 and top-2 test accuracy using hard sample mining based retraining in
Table 3.5, including a comparison with classification round-1, which does not use resampling.
Here, the postfix “L” represents pixel-based prediction using the hop-2 after max-pooling with a
lower resolution in the SLS process, while “H” means using hop-2 before max-pooling with a
higher resolution. We observe a 2.64% and 0.35% improvement for the top-1 and top-2 accuracy
for RGB-L in round-2, and a 2.82% and 0.74% improvement for RGB-H.
Figure 3.8: Test accuracy w.r.t. different test sample selection thresholds T_te: (a) N_total = 30K, (b) N_total = 35K, (c) N_total = 40K.
Figure 3.9: Conditional test accuracy of samples with maximum probability higher than T_te.
3.5.5 Ablation Study and Performance Benchmarking
The ablation study of adopting various components in E-PixelHop++ is summarized in Table 3.6,
where the image-level test accuracy for CIFAR-10 is given in the last column. The study in-
cludes: data augmentation, color channel decomposition, the methods of pixel-level classification
in the baseline and the hard sample mining based retraining. The first four rows show the base-
line performance without the second round retraining while the last two rows have both training
rounds. For pixel-level classification, the intra-hop column does not have label smoothing while
the SLS init column has SLS initialization only. Moreover, the pixel-level classification does not
use augmented images to train classifiers by considering the time and model complexity. Data
augmentation is only used for the meta classification from pixel-level to image-level decision.
The E-PixelHop++ baseline achieves test accuracy of 73.72% without hard sample mining
using both P/Q channels with SLS initialization for pixel-level classification and data augmentation
when training the meta classifier. By further adding the 2nd-round hard sample mining based
retraining, E-PixelHop++ achieves a test accuracy of 76.54% with higher-resolution pixel-based
prediction and 76.36% with a lower resolution, as shown in the sixth and fifth rows, respectively.
We compare the performance of seven methods in Table 3.7. They are the modified LeNet-5 [62],
PixelHop [17], PixelHop+ [17] and PixelHop++ [18], the E-PixelHop++ baseline without hard
sample mining, and E-PixelHop++ with two different settings of the hard sample mining based retraining. To handle
color images, we modify the LeNet-5 network architecture slightly; its hyper parameters are
given in Table 3.8. Both the E-PixelHop++ baseline and the complete E-PixelHop++ outperform
the other benchmarking methods. Improvements of 9.73% over PixelHop++ and 3.88% over PixelHop+ are observed, respectively.
Table 3.6: Ablation study of E-PixelHop++'s components for CIFAR-10

                                Round-1                              Round-2
    Augmentation   P channel   Q channel   intra-hop   SLS init   RGB-L   RGB-H   Test Accuracy
    ✓              ✓                       ✓                                      68.69
    ✓              ✓                                   ✓                          70.26
    ✓              ✓           ✓           ✓                                      72.29
    ✓              ✓           ✓                       ✓                          73.72
    ✓              ✓           ✓                       ✓          ✓               76.36
    ✓              ✓           ✓                       ✓                  ✓       76.54
Table 3.7: Comparison of test accuracy (%) of LeNet-5, PixelHop, PixelHop+ and PixelHop++ for CIFAR-10.

    Method                                         Test Accuracy (%)
    LeNet-5                                        68.72
    PixelHop [17]                                  71.37
    PixelHop+ [17]                                 72.66
    PixelHop++ [18]                                66.81
    E-PixelHop++ (Ours) w/o Hard Sample Mining     73.72
    E-PixelHop++ (Ours) RGB-L                      76.36
    E-PixelHop++ (Ours) RGB-H                      76.54
3.6 Conclusion
An enhanced PixelHop++ method for multi-class object classification called E-PixelHop++ was
proposed in this chapter. It consists of two key components to enhance the classification perfor-
mance. The soft-label smoothing (SLS) technique was proposed to enhance the performance and
ensure consistency of intra-hop and inter-hop pixel-level predictions. Effectiveness of SLS shows
Table 3.8: Hyper parameters of the modified LeNet-5 network as compared with those of the
original LeNet-5.
                          Original LeNet-5   Modified LeNet-5
    Conv-1 Kernel Size    5x5x1              5x5x3
    Conv-1 Kernel No.     6                  32
    Conv-2 Kernel Size    5x5x6              5x5x32
    Conv-2 Kernel No.     16                 64
    FC-1                  120                200
    FC-2                  84                 100
    Output Layer          10                 10
the importance of prediction agreement at different scales. Besides, a hard sample mining tech-
nique was proposed so that the classifier is forced to learn more from the difficult cases. A second
round of classification training based on the proposed hard sample mining is shown to enhance
the performance beyond the first round of predictions.
Chapter 4
Advanced Techniques for Resolving Binary Confusing Sets
4.1 Introduction
To further improve the classification performance, we study the resolution of binary confusing sets
in this chapter. An attentive PixelHop method called Att-PixelHop is proposed. We formulate
Att-PixelHop as a two-stage sequential classification pipeline, as shown in Figure 4.1, to resolve
confusing classes for further performance boosting. Stage 1 serves as a baseline that performs
classification among all object classes. A multi-class classification problem is solved to get a
soft decision for each class, and the top-2 classes with the highest probabilities, called the confusing
classes, are resolved in Stage 2 as a binary object classification problem.
Since the pixels in an image are not equally important, we propose a weakly supervised at-
tention localization technique so that discriminative regions can be highlighted for low resolution
images. We apply our proposed attention localization method to the binary confusing set resolu-
tion in the second stage of Att-PixelHop. The decisions from the classification of cropped attention
regions are further fused with the classification based on the original full frame image which yields
a better and more robust classification performance.
The rest of this chapter is organized as follows. The method for confusing set identification in
Att-PixelHop is presented in Sec. 4.2. The proposed attention localization technique is detailed in
Sec. 4.3. The integrated Att-PixelHop method is presented in Sec. 4.4, which applies the attention
localization in the confusing set resolution. Experimental setup and results are detailed in Sec. 4.5.
Conclusions are finally drawn in Sec. 4.6.
Figure 4.1: Att-PixelHop is a two-stage sequential classification method where stage 1 is for multi-
class baseline and stage 2 is confusion set resolution.
4.2 Confusing Set Identification
Instead of constructing a hierarchical learning structure manually as done in [126], Att-PixelHop
identifies confusion sets using the predictions of the baseline classifier in Stage 1. The baseline
classifier outputs a C-dimensional soft decision vector for each image. Generally, the M classes
with the highest probabilities can define a confusion set. For a C-class classification problem, we
have at most N_cs = C! / (M!(C-M)!) confusion sets. In practice, only a portion of the N_cs sets contribute significantly to the performance gain since some confusion sets have few members. Thus, we argue that we do not need to resolve all N_cs confusion sets. In the designed system, we rank the N_cs groups by the number of test images that fall into each confusion set according to their C-dimensional soft decision vectors, which approximates the priority of each confusion set. Only the first N_cand candidate sets are pushed through Stage 2, and they can be processed in parallel.
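For the M = 2 case, the ranking of confusion sets can be sketched as follows; the function name is hypothetical.

    import numpy as np
    from collections import Counter

    def rank_confusion_sets(soft_decisions, n_cand=4):
        """Rank binary (M = 2) confusion sets by test-set population (sketch).

        soft_decisions: (N, C) soft decision vectors from the Stage-1 baseline.
        Each test image votes for the unordered pair of its top-2 classes; the
        n_cand most populated pairs are sent to Stage 2.
        """
        top2 = np.argsort(soft_decisions, axis=1)[:, -2:]               # top-2 classes per image
        pairs = [tuple(sorted(int(c) for c in p)) for p in top2]        # order-independent pairs
        counts = Counter(pairs)
        return counts.most_common(n_cand)                               # [(pair, num_images), ...]

    # Usage sketch: candidates = rank_confusion_sets(baseline_probs, n_cand=4)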
In the following, we focus on the case with M = 2. The confusion matrix of the baseline
classifier for CIFAR-10 is shown in Table 4.1. We see from the table that “Cat” and “Dog” are
more likely to be confused with each other.
Without loss of generality, we use the “Cat and Dog” two confusing classes from CIFAR-10
as an example to explain our approach for confusion resolution. In the training, we collect all
5,000 Cat images and 5,000 Dog images from the training set. For test images that have the top-2
Table 4.1: The confusion matrix for the CIFAR-10 dataset, where the first row shows the predicted
object labels and the first column shows the ground truth
airplane automobile bird cat deer dog frog horse ship truck
airplane 0.745 0.032 0.061 0.023 0.013 0.009 0.013 0.014 0.057 0.033
automobile 0.020 0.863 0.007 0.005 0.002 0.009 0.008 0.006 0.014 0.066
bird 0.047 0.003 0.684 0.066 0.073 0.040 0.054 0.022 0.009 0.002
cat 0.022 0.014 0.056 0.604 0.044 0.154 0.051 0.032 0.012 0.011
deer 0.014 0.003 0.048 0.042 0.724 0.039 0.045 0.076 0.005 0.004
dog 0.009 0.005 0.032 0.159 0.036 0.671 0.033 0.046 0.003 0.006
frog 0.006 0.007 0.044 0.047 0.027 0.031 0.829 0.002 0.004 0.003
horse 0.023 0.003 0.022 0.043 0.033 0.059 0.007 0.801 0.000 0.009
ship 0.046 0.024 0.014 0.004 0.003 0.004 0.005 0.002 0.876 0.022
truck 0.031 0.059 0.008 0.009 0.001 0.004 0.002 0.008 0.021 0.857
candidates of Cat or Dog, we include them in the Cat/Dog confusion set. If an image belongs to this
set yet its ground truth class is neither Cat nor Dog, its misclassification cannot be corrected. As a
result, the top-2 accuracy of the baseline classifier offers an upper bound for the final performance.
4.3 Attention Localization Technique
For the natural image classification task, images in the datasets are captured by focusing on scenes
that include the objects of interest while also covering the surrounding background. Here,
the “background” refers to regions without any semantic meaning for a particular task. Although
for some classes the background contains concepts that are relevant to the foreground object,
making the decision based on clues from the background regions renders the classification of the
foreground object less interpretable.
Thus, given an image, we can consider removing the background regions and focusing only on
the foreground regions with objects. This aims at forcing the classification model to learn more
informative patterns from the objects it cares about, since the background can vary from image to
image, while the objects share similar patterns for a given class. The localized foreground regions
can be called the “first-level attention” for an image.
Besides the delineation between foreground regions and background scenes, it is worthwhile
to consider whether partial regions of the objects are informative or not. The parts of
Figure 4.2: Overall pipeline for the proposed attention detection mechanism.
an object that contain the most representative semantic meaning are regarded as the discriminant
regions for that class. The highlighted discriminant regions, which can be only part of an object, are
regarded as the “second-level attention”.
Both the first and second level attention distill the discriminant information out of the given
images, regardless of the object size and the background variation. Most of the existing mech-
anisms for trainable attention extraction use deep learning based approaches that require
backpropagation to adjust the predicted attention heatmaps. However, this may not be necessary
once one considers why the discriminative parts are informative to the human recognition
system.
In this section, we present our proposed method for attention detection based on the object
level class labels only. No pre-defined bounding box information is needed. It is shown to
localize the discriminant regions well even on low resolution images, for example the
32×32 tiny images in the CIFAR-10 dataset. The overall framework for the attention region detection
is presented in Figure 4.2. Our proposed algorithm can be decomposed into three main modules:
(1) preliminary annotation generation; (2) weakly supervised attention heatmap refinement; (3)
adaptive bounding box generation. The first module creates an approximate location for the most
discriminant region followed by quality assessment by human in the loop to randomly select good
predictions from the training set. With a very small portion of accepted predictions from Module 1
as the preliminary attention annotations, we treat the attention heatmap prediction as a binary clas-
sification problem in Module 2. The resulting heatmaps are processed by an adaptive bounding
box generation algorithm in Module 3 in order to get the final localized attention region.
In our work, it is applied in the scenario with two object classes only. The cat-vs-dog group
will be used as an example to illustrate the proposed methods in this section, since it is the most
confusing set in the CIFAR-10 dataset.
4.3.1 Preliminary Annotation Generation
Feature Extraction. We first discuss the feature extraction architecture, presented in Figure 4.3, which is modified for attention detection and differs from the one used for object recognition. Cascaded PixelHop++ units are used to form feature representations at different scales from a neighborhood region centered at each center pixel.
Figure 4.3: Illustration of the modified feature extraction architecture for preliminary annotation
generation.
Different from object recognition tasks, where max-pooling layers between PixelHop++
units are needed to reduce spatial redundancy and speed up the growth of the receptive field,
here we remove the pooling layers so that the extracted feature vectors are well aligned across
different Hops. In our design, we have nine PixelHop++ units cascaded together. We set the stride to
be 1 so that we get feature representations for each center pixel. The filter size used in all the Hop
units is 3×3 without any padding. Although the feature maps shrink at deeper layers of the network
because no padding is performed, the calculated receptive field centered at
each location of the 14×14 map in the final layer still covers the full frame of resolution 32×32.
Figure 4.4: Cascaded feature maps from different Hop units form a feature pyramid.
Benefiting from the dense feature extraction, we have 9 sets of feature representations from the 9 hop units at each location of the center 14×14 region. Each set corresponds to a growing receptive field from shallow to deep layers, capturing information from different ranges. Thereby, the concatenated feature vector forms a feature pyramid for each location in the 14×14 region, as illustrated in Figure 4.4. The largest receptive field is 19×19 from Hop-9. The dimension of the resulting feature vector for each location is given by Eq. 4.1:

    K_{cas} = \sum_{i=1}^{9} K_i.    (4.1)
Preliminary Attention Prediction. After we get the concatenated feature vectors for the cen-
ter region of each image, we treat each pixel as a single sample for classification. To reduce the
spatial correlation, we do a 2× 2 max-pooling after the deepest PixelHop++ unit so that each image
generates 7×7 = 49 samples. Similar to Sec. 3.3.1, we propagate the object-level label to each
pixel of that image. For example, all 49 pixels of a cat image are labeled as “Cat”, while they
are labeled as “Dog” if they come from a dog image. We then perform a pixel-wise classification based
on features of dimension K_cas. The predicted soft decision is of resolution 7×7×C, where C = 2
for binary object recognition. To align with the spatial resolution 14×14 of the final Hop
unit, bilinear interpolation by a factor of 2 is applied to each channel of the soft decisions. We
define a degree-from-trivial (DeFT) score to characterize the potential discriminability, expressed
in Eq. 4.2, where C is the number of classes and p_n is the predicted soft decision for sample n. The
pixel with the highest D in the predicted soft decision is the most discriminant one in the center
region of that image. We denote its x-y coordinate w.r.t. the full resolution as (X_peak, Y_peak).

    D_n = \frac{1}{C} \sum_{m=1}^{C} \left| p_{n,m} - \frac{1}{C} \right|.    (4.2)
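A small sketch of computing the DeFT score map and locating its peak is given below; it uses the absolute deviation from the uniform distribution 1/C, and the function name is hypothetical.

    import numpy as np

    def deft_peak(soft_map):
        """Locate the most discriminant pixel from a soft-decision map (sketch).

        soft_map: (H, W, C) per-pixel class probabilities after interpolation.
        The DeFT score measures how far each pixel's prediction deviates from
        the trivial uniform distribution 1/C; the peak gives (X_peak, Y_peak).
        """
        C = soft_map.shape[-1]
        deft = np.mean(np.abs(soft_map - 1.0 / C), axis=-1)   # (H, W) score map
        y_peak, x_peak = np.unravel_index(np.argmax(deft), deft.shape)
        return x_peak, y_peak, deft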
Although the pixel-level labels can be noisy, the classifier can still capture the approximate
attention region out of the full frame image. The intuition is that the samples from the background,
such as grass and floor, can be shared by both cat and dog images, while the cat or dog face is
more representative for each class. The classifier tends to be more confident in those discriminant
regions compared to the background regions with irrelevant patterns for the classification.
Since the receptive field can be calculated, a certain region from the original full frame image
centered at (X_peak, Y_peak) is considered as the approximate preliminary attention region. In our
design, the region is of size 19× 19. Figure 4.5 shows some examples from the Dog class with the
detected preliminary attention regions indicated in the red bounding boxes.
Figure 4.5: Examples of approximated preliminary attention for training images indicated in red
bounding boxes.
Annotation Selection. Since the above is a weak classification with noisy labels, the localization of the 19×19 attention region is not perfect for every image because of variations in object orientation and occlusion. In order to evaluate the quality of each predicted attention region, we involve a human in the loop to perform quality assessment. We apply the above-described attention prediction to the training images only. Among the N training images, observers are given the original images with the predicted bounding boxes, indicating the predicted preliminary attention results, in a random order. High-quality attention predictions are accepted as the preliminary annotations, while those with lower quality are rejected. The feeding of images to the human observer is repeated until the number of accepted training images reaches the target percentage of the total N images. Here, the quality is subjective to the observer, where “good quality” means the predicted attention region is consistent with the observer's expected discriminative regions when given the original image. Experiments show that we only need to select a very small percentage of the total training images for a reasonable attention heatmap refinement, say 1% of the total training images of those two classes.
total training images of those two clases.
4.3.2 Weakly-supervised Attention Heatmap Refinement
Pipeline Overview. With the support of the selected attention predictions from the training im-
ages, we design a weakly supervised attention heatmap refinement module to further improve the
localization of the discriminant regions. Figure 4.6 shows the overview of the attention refinement
pipeline. We formulate this module as a binary classification problem between positive class –
attention region, and negative class – background region, where each pixel is treated as a sample
in the classification.
Given the selected attention annotations, we first sample from the regions inside the annotated
bounding boxes. The pixels inside are regarded as the positive class. As for the negative class,
since the areas outside the bounding boxes include many more pixels than those inside, random sampling
is applied in those regions to obtain the same number of negative samples as positive samples.
Specifically, we discard all pixels within a 3-pixel band along the boundaries of the bounding
boxes, because the feature vectors of the two classes in the boundary regions can be similar to
each other. They are treated as noisy samples and are not considered in the model training.
Figure 4.6: Illustration of weakly supervised attention heatmap prediction.
Feature Extraction. Feature representations are extracted for each pixel location in
the full frame image. The feature extraction architecture for this module is described in Figure 4.7.
As shown in the figure, there are 4 cascaded PixelHop++ units. We use padding before each Hop
unit to keep the resolution and good alignment. Different from Figure 4.3, here, since a reasonably
large receptive field is needed but there is no need to trace locations back to a center pixel, a
max-pooling layer is introduced for fast receptive field expansion. As a consequence, for an input
image of resolution S_0 × S_0, the feature map has spatial resolution S_i × S_i = S_0 × S_0 for the Hop-1
and Hop-2 units, and S_i × S_i = (S_0 / 2) × (S_0 / 2) for Hop-3 and Hop-4.
In order to get a compound feature representation from different scales for each pixel, upsampling
by a factor of 2 is applied to each channel of the Hop-4 feature maps. These features with
a larger receptive field are propagated to each spatial location at the full resolution. They are then
concatenated with the Hop-2 features, which have a smaller receptive field, through the skip connection layer.
The final output feature vector for each pixel is called its context vector, which is of dimension
K_context = K_2 + K_4.
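To make the skip-connection step concrete, here is a minimal sketch of how such context vectors could be assembled. The function name is hypothetical, and nearest-neighbor upsampling is assumed since the interpolation method is not specified above.

    import numpy as np

    def build_context_vectors(hop2_feat, hop4_feat):
        """Form per-pixel context vectors by a skip connection (sketch).

        hop2_feat: (S0, S0, K2) full-resolution features, small receptive field.
        hop4_feat: (S0//2, S0//2, K4) half-resolution features, larger receptive field.
        Hop-4 features are upsampled by 2 and concatenated with Hop-2 features,
        giving K2 + K4 dimensions per pixel.
        """
        up4 = hop4_feat.repeat(2, axis=0).repeat(2, axis=1)     # x2 upsampling per channel
        context = np.concatenate([hop2_feat, up4], axis=-1)     # (S0, S0, K2 + K4)
        return context.reshape(-1, context.shape[-1])           # one sample per pixel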
Weak Supervision. Experiments show that with a very small percentage of accepted
preliminary attention annotations, our predicted attention can be well refined compared to the
initial attention learnt from the image-level labels in Sec. 4.3.1. In our experiments, we select 100
images for the cat and dog classes, where each class has 50 images. This is only 1% of the entire
10,000 cat and dog training images. Figure 4.8 shows some examples of the predicted heatmaps
for test images using 1% annotated training images.

Figure 4.7: Feature extraction pipeline for weakly supervised attention heatmap prediction.
Figure 4.8: Examples of predicted attention heatmaps through weakly supervision.
4.3.3 Adaptive Bounding Box Generation
Given the predicted attention heatmaps, attention regions are localized by generating an optimal bounding
box that includes the discriminant regions. We propose an adaptive localization
method, which is illustrated in Figure 4.9. It contains the following steps (a code sketch follows the list):
a) Heatmap binarization. The scores in each heatmap represent the probability of each pixel
being a foreground pixel. We set a uniform threshold T_att on all the heatmaps for binarization.
Empirically, we set T_att = 0.5 in our design. To remove isolated spots, we apply a median filter
with a radius of 3 pixels.
b) Bounding box regularization. The bounding boxes are not required to have regular shapes
across different images. We apply a maximum-occupancy-rate pooling strategy that estimates
the tightest bounding box containing the foreground pixels after binarization. The
initially estimated bounding box is regularized by its center location (C_i, C_j), its height H and
its width W.
c) Bounding box refinement. We apply two more steps to refine the bounding box so as to avoid a
box that is too small or has too extreme a ratio between H and W. If max(H,W) is smaller than a threshold, we
enlarge the bounding box so that max(H,W) = 16 while keeping the aspect ratio unchanged. Besides,
if max(H,W) : min(H,W) > 3 : 2, which means the aspect ratio is too large, we set both H and W to
(H+W)/2.
d) Attention region extraction and resizing. Based on the bounding boxes generated for at-
tention localization, the corresponding regions are cropped from the original full frame images so
that the background regions are removed. We further take the cropped regions as the new inputs
for classification. Thus, the cropped regions are resized to the same resolution as the original full
frame images, for example, 32× 32 for CIFAR-10 dataset. In our framework, we apply Lanczos
interpolation for the resizing step. After this step, the attention region for each image has been
extracted.
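The sketch below walks through steps a)-d) above under simplifying assumptions: the maximum-occupancy-rate pooling is approximated by the tightest box around the binarized foreground, the median filter uses a 3x3 window, and the input image is assumed to be a uint8 array; all names are hypothetical.

    import numpy as np
    from scipy.ndimage import median_filter
    from PIL import Image

    def attention_bbox(heatmap, image, t_att=0.5, min_side=16, out_size=32):
        """Adaptive bounding box generation from an attention heatmap (sketch)."""
        mask = median_filter(heatmap > t_att, size=3)            # a) binarize + despeckle
        ys, xs = np.nonzero(mask)
        if len(ys) == 0:                                          # fall back to the full frame
            return image
        top, bottom, left, right = ys.min(), ys.max(), xs.min(), xs.max()  # b) tightest box
        h, w = bottom - top + 1, right - left + 1
        ci, cj = (top + bottom) / 2.0, (left + right) / 2.0
        if max(h, w) < min_side:                                  # c) enlarge small boxes
            scale = min_side / max(h, w)
            h, w = h * scale, w * scale
        if max(h, w) / min(h, w) > 1.5:                           # c) fix extreme aspect ratio
            h = w = (h + w) / 2.0
        top, left = int(round(ci - h / 2)), int(round(cj - w / 2))
        bottom, right = int(round(ci + h / 2)), int(round(cj + w / 2))
        crop = Image.fromarray(image).crop((left, top, right, bottom))     # d) crop region
        return np.array(crop.resize((out_size, out_size), Image.LANCZOS))  # d) resize (Lanczos)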
4.4 Integrated System of Multi-Class Classification and Confusing
Set Resolution
4.4.1 Overview of the integrated framework: Att-PixelHop Method
The overall framework of the Att-PixelHop is illustrated in Fig. 4.1. We formulate Att-PixelHop as
a two-stage sequential classification pipeline to resolve confusing classes for further performance
Figure 4.9: Illustration of the proposed adaptive bounding box generation
boosting. The first stage serves as a baseline that performs classification among all object classes.
A multi-class classification problem is performed to get a soft decision for each class. Its output is
a set of soft decision scores that indicate the probability of each class. The classes with the highest
M probabilities for each image are treated as its confusing classes. Typically, we set M= 2.
In the second stage, a one-versus-one competition is conducted to refine the prediction result.
We integrate the binary object classification into the multi-class scenario where we apply the above
binary classification framework to each confusing set in order to classify the samples sharing the same
confusing classes, regardless of the order in which the two classes are ranked by probability. The
two-stage sequential design first makes a coarse prediction and then focuses on confusing class
resolution for a more accurate prediction.
4.4.2 Attentive Binary Class Object Classification
In this section, we present the binary object classification based on the attention region localization.
Figure 4.10 illustrates the overview of a binary object classification with attention. It consists of
two branches of classification: global classification branch and discriminative classification branch.
Figure 4.10: Overall pipeline for binary object classification with classification on both full frame
images and cropped attention regions.
The former one takes the original full frame image as the input, where the attention regions for both
classes are learnt and cropped as the new input for the latter attentive branch.
Specifically, in each branch, we apply the methodology presented in Sec. 3.2 to the two-class
scenario. Since the predicted soft label for binary classification is a 2-D vector whose elements
sum to unity, we can simplify the label to a scalar in the label smoothing module. Thus, the
label map has a size of S_i × S_i × 1 at hop-i. We use the same ensemble scheme to fuse the P and Q
predicted labels as in the 10-class scenario. The meta classifier finally assigns each test image to
one of the two classes.
Once the image-level soft predictions are generated through the meta classifier in Figure 3.1 for
both classification branches, they are further ensembled through another meta classifier to yield the
final prediction. Different from the first meta classification from pixel-level to image-level which
keeps the spatial coordinate relationship between each pixel-level prediction, we rank the image-
level soft predictions based on their confidence level. This makes the ensemble feature more robust
since there is no preferred order between image-level predictions from different branches. Through
experiments, we show that the ensemble between both the global and discriminative classification
gets better results than either one. The rationale behind it is that some images rely more on the
global shape information while for others the localized attention is more informative. The two
branches form a complementary pair so that the joint prediction is more powerful.
4.5 Experiments
4.5.1 Experimental Setup
Being the same as the 10 class classification, we evaluate the performance using the CIFAR-10
dataset that contains 10 classes of tiny color images of spatial resolution 32× 32. For baseline
classification in stage-1, we reuse the setting and results of the last row in Sec. 3.6, which consists
of P/Q color decomposition, soft-label smoothing (SLS), and hard sample mining based second-round
retraining using RGB features with a higher-resolution feature map. The SLS method without
cross-hop label update is performed among hop-2, hop-3 and hop-4.
For the confusing set resolution in stage-2, besides hop-3 and hop-4, we also consider hop-2 without max-pooling so that its higher resolution can leverage more local information in distinguishing confusing classes. Different from stage-1, the full SLS method with 3 iterations is conducted in stage-2 as the pixel-based prediction. Hop-1 features are not used since their receptive field is too small. We use the XGBoost (extreme gradient boosting) classifier for all classification tasks in stage-2.
4.5.2 Weakly Supervised Attention Localization
We first present some results for the attention region localization. Figure 4.11 and Figure 4.12 include examples of test images with the predicted heatmaps of attention scores and the estimated bounding boxes in red, together with their cropped and resized attention regions. The results are learned from only 1% of selected training images, with the preliminary attention prediction serving as the annotation. The examples show that our proposed weakly supervised method can localize the attention regions of these low-resolution images quite well. The localization is robust to both large and small objects with various shapes and orientations.
Figure 4.11: Examples of non-animal images with the predicted attention heatmaps, the generated
bounding boxes and cropped regions.
Figure 4.12: Examples of animal images with the predicted attention heatmaps, the generated
bounding boxes and cropped regions.
4.5.3 Attentive Binary Object Classification
In this section, we present attentive classification under the binary object classification scenario. Among the 10 classes in CIFAR-10, the four most difficult groups of binary classification are: Cat/Dog, Airplane/Ship, Automobile/Truck, and Deer/Horse. There are 10K training images and 2K test images in total in each group. The high similarity in global shape (e.g., the pose and natural body outline between cats and dogs, or deer and horses), the similar color tone of either the foreground or the background (e.g., the blue sky for airplanes and the blue ocean for ships), and the low resolution of the images together make the classification between these groups more difficult than the others. Therefore, we focus on these four groups to evaluate our proposed attentive binary object classification method.
Table 4.2 summarizes the test accuracy for each group, including the performance for: (1)
the global classification branch based on the original full frame images; (2) the discriminative
classification branch on the cropped attention regions; (3) the ensemble between (1) and (2). It
is shown that for all four groups, the ensemble results consistently outperform the prediction based on the original full-frame image. For the Automobile/Truck group, the classification on the cropped attention regions even performs better than on the full-frame images.
To further explore how much attention localization helps in binary object classification, Figure 4.13 compares the full-frame, cropped-attention and ensemble results for another 11 pairs among the 15 most confusing binary groups. In the plots, the order of points along the x-axis is decided by the 10-class baseline results, i.e., by how many test images have these two classes as their top-2 candidates. It approximately measures
Table 4.2: Comparison of image-level test accuracy (%) of the binary object classification with and
without attention localization
Full Frame Cropped Attention Ensemble
Cat vs Dog 79.10 77.90 80.05
Airplane vs Ship 93.75 92.20 94.10
Automobile vs Truck 92.95 93.45 93.90
Deer vs Horse 90.95 90.30 92.30
how difficult each pair of classes is. The performance is calculated on all the test images in each
group from the CIFAR-10 dataset. The plots indicate that, for a great majority of the binary groups, the classification based on the cropped attention regions outperforms the classification on the full-frame images. For all the groups included, the ensemble results perform the best.
Figure 4.13: Test accuracy for another 11 confusion sets among the Top 15, besides the 4 groups
included in Table 4.2
4.5.4 Attentive Multi-class Object Classification
To integrate the binary object classification into the multi-class classification scenario, we apply it
to the confusion set resolution based on the multi-class baseline. For a 10-class classification problem,
we have at most 45 confusion sets for one-versus-one competition. Images with the same top-2
candidates of classes in the baseline decision form one confusion set. The image number in each
confusion set varies. For the second stage classification, we give a higher priority to confusion sets
that have a large number of images.
Table 4.3 shows the conditional accuracy evaluated on the four confusion sets with the highest priority: Cat/Dog, Airplane/Ship, Automobile/Truck, and Deer/Horse. The performance is calculated only on the test images that have the two classes as their top-2 candidates in the baseline decision
Table 4.3: Comparison of conditional test accuracy (%) between stage-1 and stage-2 for the four most confusing groups

                       # of test images   Stage-1   Stage-2 Round-1   Stage-2 Round-2
Cat vs Dog             1089               77.69     79.89             82.46
Airplane vs Ship       967                91.21     93.07             93.27
Automobile vs Truck    896                90.07     93.42             94.64
Deer vs Horse          646                88.54     91.33             93.19
and that are originally from either of the two classes in each set. The conditional accuracy of the baseline predictions for the same set of images is also included in the table. It is shown that the one-versus-one competition in stage-2 improves the classification performance by a margin of 2.55% on average.
Besides, we plot the overall test accuracy as a function of the total resolved confusing sets in
Fig. 4.14. It is shown in the plot that the test accuracy increases as more confusion sets are resolved
in the order of priority. The performance saturates after resolving around 27 confusion sets. We set a budget of 25 on the total number of confusion sets to be resolved as a trade-off between performance and complexity. Also, the ensemble of the cropped-attention and full-frame classifications makes stage-2 more powerful and more robust.
4.5.5 Performance Benchmark
The ablation study of adopting various components in Att-PixelHop is summarized in Table 4.4. The study focuses on stage-2 and covers the effects of attention localization, hard sample mining in the top 4 confusing sets, and the number of resolved confusing sets, respectively. The image-level test accuracy for CIFAR-10 is given in the last column. The overall comparison with other methods is given in Table 4.5 as a benchmark, including the methods shown in Table 3.7, namely the modified LeNet-5 [62], PixelHop [17], PixelHop+ [17], PixelHop++ [18], and E-PixelHop++. Besides, we show the complete Att-PixelHop with stage-2.
Figure 4.14: The plot of CIFAR-10 test accuracy (%) as a function of the cumulative number of resolved confusion sets, with comparison between full frame, cropped attention and ensemble classification.
Att-PixelHop with either a small or a large number of resolved confusion sets outperforms the other benchmarking methods. Improvements of 11.98% over PixelHop++ and 6.13% over PixelHop+ are observed, respectively. With the confusion set resolution, we observe a 2.25% improvement over E-PixelHop++.
4.6 Conclusion
An attentive PixelHop++ method called Att-PixelHop was proposed in this work. First, in order to highlight the discriminative regions in an image to be classified, we proposed a weakly supervised attention localization method that requires no bounding box annotations. Experimental results show that the proposed method can localize the attention regions quite well for objects of various sizes and orientations in low-resolution images. Furthermore, we applied the attention localization to object classification. A two-stage classification pipeline was designed for Att-PixelHop,
Table 4.4: Ablation study of Att-PixelHop components for CIFAR-10

Stage-1   # of resolved confusing sets   Round-1     Round-2   Test Accuracy (%)
✓         -                              -           -         76.54
✓         25                             Full        -         77.62
✓         25                             Crop        -         77.36
✓         25                             Ensemble    -         78.60
✓         25                             Ensemble    ✓         79.13
✓         45                             Full        -         77.72
✓         45                             Crop        -         77.49
✓         45                             Ensemble    -         78.78
✓         45                             Ensemble    ✓         79.26
Table 4.5: Comparison of testing accuracy (%) of LeNet-5, PixelHop, PixelHop+, PixelHop++ and Att-PixelHop for CIFAR-10.

Method                                      Test Accuracy (%)
LeNet-5                                     68.72
PixelHop [17]                               71.37
PixelHop+ [17]                              72.66
PixelHop++ [18]                             66.81
E-PixelHop++                                76.54
Att-PixelHop (Ours) - 25 Resolved Sets      79.13
Att-PixelHop (Ours) - 45 Resolved Sets      79.26
with the multi-class object classification as the baseline and the confusing set resolution as the sec-
ond stage. It is shown that it is important to handle confusing classes carefully so that the overall
classification performance can be boosted.
Chapter 5
On Supervised Feature Selection from High Dimensional
Feature Spaces
5.1 Introduction
For machine learning with image/video data, the deep learning technology, which adopts a pre-
defined network architecture and optimizes the network parameters using an end-to-end optimiza-
tion procedure, is dominating nowadays. Yet, an alternative that returns to the traditional pattern
recognition paradigm, based on two modules in cascade (feature extraction and classification), has
also been studied, e.g., [19, 17, 18, 60, 61, 71, 73, 84, 117, 116]. The feature extraction module
contains two steps: unsupervised representation learning and supervised feature selection. Exam-
ples of unsupervised representation learning include multi-stage Saab [61] and Saak transforms
[19]. Here, we focus on the second step; namely, supervised feature selection from a high dimen-
sional feature space.
Inspired by information theory and the decision tree, a novel supervised feature selection
method is proposed in this work. The resulting tests are called the discriminant feature test (DFT)
and the relevant feature test (RFT), respectively, for the classification and regression problems. The
DFT and RFT procedures are described in detail. Our proposed methods belong to the filter meth-
ods, which give a score to each dimension and select features based on feature ranking. The scores
Figure 5.1: An overview of the proposed feature selection methods: DFT and RFT. For the i-th feature, DFT measures the class distributions in $S^i_L$ and $S^i_R$ to compute the weighted entropy as the DFT loss, while RFT measures the weighted estimated regression MSE in both sets as the RFT loss.
are measured by the weighted entropy and the weighted MSE for DFT and RFT, which reflect the
discriminant power and relevance degree to classification and regression targets, respectively.
To demonstrate the power of DFT and RFT, we conduct performance benchmarking between
DFT/RFT, ANOVA and feature importance (FI) from XGBoost in the experimental section. To this end, we use deep features obtained by LeNet-5 for the MNIST and Fashion-MNIST datasets as illustrative examples. Other datasets with handcrafted features and gene expression features are also used for perfor-
mance benchmarking. Comparison with the minimal-redundancy-maximal-relevance (mRMR)
criterion [81, 29], which is a more advanced feature selection method, is also conducted. It is
shown by experimental results that DFT and RFT can select a lower dimensional feature subspace
distinctly and robustly while maintaining high decision performance.
The rest of this chapter is organized as follows. DFT and RFT are presented in Sec. 5.2.
Intensive experimental results are shown in Sec. 5.3. Finally, concluding remarks are given in Sec.
5.4.
5.2 Proposed Feature Selection Methods
Being motivated by the feature selection process in the decision tree classifier, we propose two
feature selection methods, DFT and RFT, in this section as illustrated in Fig. 5.1. They will be
detailed in Sec. 5.2.1 and Sec. 5.2.2, respectively. Finally, robustness of DFT and RFT will be
discussed in Sec. 5.2.3.
5.2.1 Discriminant Feature Test (DFT)
Consider a classification problem with $N$ data samples, $P$ features and $C$ classes. Let $f_i$, $1 \le i \le P$, be a feature dimension whose minimum and maximum are $f^i_{\min}$ and $f^i_{\max}$, respectively. DFT is used to measure the discriminant power of each feature dimension out of a $P$-dimensional feature space independently. If feature $f_i$ is a discriminant one, we expect data samples projected onto it to be classified more easily. To check this, one idea is to partition $[f^i_{\min}, f^i_{\max}]$ into $M$ nonoverlapping subintervals and adopt the maximum likelihood rule to assign the class label to samples inside each subinterval. Then, we can compute the percentage of correct predictions. The higher the prediction accuracy, the higher the discriminant power. Although prediction accuracy may serve as an indicator of purity, it does not tell the distribution of the remaining $C-1$ classes if $C > 2$. Thus, it is desired to consider other purity measures.
In our design, we use the weighted entropy of the left and right subsets as the DFT loss to
measure the discriminant power of each dimension. The reason of choosing the weighted entropy
as the cost is that it considers the probability distribution of all classes instead of the maximum
likelihood rule in prediction accuracy as stated above. A lower entropy value is obtained from a
more biased distribution of classes, indicating the subinterval is dominated by fewer classes.
By following the practice of a binary decision tree, we consider the case $M = 2$, as shown in the left subfigure of Fig. 5.1, where $f^i_t$ denotes the threshold position of the two sub-intervals. If a sample has its $i$th dimension $x^i_n < f^i_t$, it goes to the subset associated with the left subinterval.
Otherwise, it will go to the subset associated with the right subinterval. Formally, the procedure of
DFT consists of three steps for each dimension as detailed below.
5.2.1.1 Training Sample Partitioning
For the $i$th feature, $f_i$, we need to search for the optimal threshold, $f^i_{op}$, between $[f^i_{\min}, f^i_{\max}]$ and partition training samples into two subsets $S^i_L$ and $S^i_R$ via
if $x^i_n < f^i_{op}$, $x_n \in S^i_L$; (5.1)
otherwise, $x_n \in S^i_R$, (5.2)
where $x^i_n$ represents the $i$-th feature of the $n$-th training sample $x_n$, and $f^i_{op}$ is selected automatically to optimize a certain purity measure. To limit the search space of $f^i_{op}$, we partition the entire feature range, $[f^i_{\min}, f^i_{\max}]$, into $B$ uniform segments and search for the optimal threshold among the following $B-1$ candidates:
$f^i_b = f^i_{\min} + \frac{b}{B} \left[ f^i_{\max} - f^i_{\min} \right], \quad b = 1, \cdots, B-1,$ (5.3)
where $B = 2^j$, $j = 1, 2, \cdots$, is examined in Sec. 5.2.3.
5.2.1.2 DFT Loss Measured by Entropy
Samples of different classes belong to $S^i_L$ or $S^i_R$. Without loss of generality, the following discussion is based on the assumption that each class has the same number of samples in the full training set; namely, $S^i_L \cup S^i_R$. To measure the purity of subset $S^i_L$ corresponding to the partition point $f^i_t$, we use the following entropy metric:
$H^i_{L,t} = -\sum_{c=1}^{C} p^i_{L,c} \log(p^i_{L,c}),$ (5.4)
where $p^i_{L,c}$ is the probability of class $c$ in $S^i_L$. Similarly, we can compute entropy $H^i_{R,t}$ for subset $S^i_R$. Then, the entropy of the full training set against partition $f^i_t$ is the weighted average of $H^i_{L,t}$ and $H^i_{R,t}$ in the form of
$H^i_t = \frac{N^i_{L,t} H^i_{L,t} + N^i_{R,t} H^i_{R,t}}{N},$ (5.5)
where $N^i_{L,t}$ and $N^i_{R,t}$ are the sample numbers in subsets $S^i_L$ and $S^i_R$, respectively, and $N = N^i_{L,t} + N^i_{R,t}$ is the total number of training samples. The optimized entropy $H^i_{op}$ for the $i$-th feature is given by
$H^i_{op} = \min_{t \in T} H^i_t,$ (5.6)
where $T$ indicates the set of partition points.
5.2.1.3 Feature Selection Based on Optimized Loss
We search for the optimized entropy values, $H^i_{op}$, of all feature dimensions, $f_i$, $1 \le i \le P$, and order the values of $H^i_{op}$ from the smallest to the largest. The lower the $H^i_{op}$ value, the more discriminant the $i$th-dimensional feature, $f_i$. Then, we select the top $K$ features with the lowest entropy values as discriminant features. To choose the value of $K$ with little ambiguity, it is critical that the rank-ordered curve of $H^i_{op}$ satisfy one important criterion: it should have a distinct and narrow elbow region. We will show that this is indeed the case in Sec. 5.3.
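As a concrete illustration of the three steps above, the following is a minimal Python sketch of the DFT loss computation for one feature dimension. The function name dft_loss and its arguments are placeholders introduced here for illustration only; this is not the exact experimental code.

```python
import numpy as np

def dft_loss(x, y, num_classes, B=16):
    """Discriminant Feature Test loss for a single feature dimension (a sketch).

    x: (N,) values of one feature over all training samples.
    y: (N,) integer class labels in [0, num_classes).
    Returns the minimum weighted entropy over the B-1 candidate thresholds
    of Eq. (5.3), i.e., H_op in Eq. (5.6).
    """
    def entropy(labels):
        if labels.size == 0:
            return 0.0
        p = np.bincount(labels, minlength=num_classes) / labels.size
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    f_min, f_max = x.min(), x.max()
    best = np.inf
    for b in range(1, B):                      # candidate thresholds, Eq. (5.3)
        t = f_min + (b / B) * (f_max - f_min)
        left, right = y[x < t], y[x >= t]
        # weighted entropy of the two subsets, Eq. (5.5)
        h = (left.size * entropy(left) + right.size * entropy(right)) / y.size
        best = min(best, h)
    return best

# Rank all P dimensions by their DFT loss and keep the K most discriminant ones.
# scores = np.array([dft_loss(X[:, i], y, C) for i in range(X.shape[1])])
# selected = np.argsort(scores)[:K]
```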
5.2.2 Relevant Feature Test (RFT)
For regression tasks, the mapping between an input feature and a target scalar function can be
more efficiently built if the feature dimension has the ability to separate samples into segments
with smaller variances. This is because the regressor can use the mean of each segment as the
target value, and its corresponding variance indicates the prediction mean squared-error (MSE) of
the segment. Motivated by this observation and the binary decision tree, RFT partitions a feature dimension into left and right subintervals and evaluates the total MSE over them. We use this approximation error as the RFT loss function. The smaller the RFT loss, the better the feature dimension. Again, the RFT loss depends on the threshold $f^i_t$. The process of selecting
DFT, RFT has three steps. They are elaborated below. Here, we adopt the same notations as those
in Sec. 5.2.1.
5.2.2.1 Training Sample Partitioning
By following the first step in DFT, we search for the optimal threshold, $f^i_{op}$, between $[f^i_{\min}, f^i_{\max}]$ and partition training samples into two subsets $S^i_L$ and $S^i_R$ for the $i$th feature, $f_i$. Again, we partition the feature range, $[f^i_{\min}, f^i_{\max}]$, into $B$ uniform segments and search for the optimal threshold among the $B-1$ candidates given in Eq. (5.3).
5.2.2.2 RFT Loss Measured by Estimated Regression MSE
We use $y$ to denote the regression target value. For the $i$th feature dimension, $f_i$, we partition the sample space into two disjoint sets $S^i_L$ and $S^i_R$. Let $y^i_L$ and $y^i_R$ be the means of the target values in $S^i_L$ and $S^i_R$; we use $y^i_L$ and $y^i_R$ as the estimated regression values of all samples in $S^i_L$ and $S^i_R$, respectively. Then, the RFT loss is defined as the sum of the estimated regression MSEs of $S^i_L$ and $S^i_R$. Mathematically, we have
$R^i_t = \frac{N^i_{L,t} R^i_{L,t} + N^i_{R,t} R^i_{R,t}}{N},$ (5.7)
where $N = N^i_{L,t} + N^i_{R,t}$, and $N^i_{L,t}$, $N^i_{R,t}$, $R^i_{L,t}$ and $R^i_{R,t}$ denote the sample numbers and the estimated regression MSEs in subsets $S^i_L$ and $S^i_R$, respectively. Feature $f_i$ is characterized by its optimized estimated regression MSE over the set $T$ of candidate partition points:
$R^i_{op} = \min_{t \in T} R^i_t.$ (5.8)
5.2.2.3 Feature Selection Based on Optimized Loss
We order the optimized estimated regression MSE values, $R^i_{op}$, across all feature dimensions, $f_i$, $1 \le i \le P$, from the smallest to the largest. The lower the $R^i_{op}$ value, the more relevant the $i$th-dimensional feature, $f_i$. Afterwards, we select the top $K$ features with the lowest estimated regression MSE values as relevant features.
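A corresponding minimal sketch of the RFT loss is given below; it differs from the DFT sketch of Sec. 5.2.1 only in replacing the weighted entropy with the weighted estimated regression MSE of Eqs. (5.7)-(5.8). The function name rft_loss and its arguments are placeholders.

```python
import numpy as np

def rft_loss(x, y, B=16):
    """Relevant Feature Test loss for a single feature dimension (a sketch).

    x: (N,) values of one feature; y: (N,) regression targets.
    Each candidate threshold splits the samples into S_L and S_R; the loss
    is the weighted estimated regression MSE of Eq. (5.7), and the feature
    score is its minimum over all thresholds, Eq. (5.8).
    """
    f_min, f_max = x.min(), x.max()
    best = np.inf
    for b in range(1, B):
        t = f_min + (b / B) * (f_max - f_min)
        left, right = y[x < t], y[x >= t]
        weighted_sse = 0.0
        for part in (left, right):
            if part.size > 0:
                # MSE around the segment mean, weighted by the segment size
                weighted_sse += part.size * np.mean((part - part.mean()) ** 2)
        best = min(best, weighted_sse / y.size)
    return best
```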
Table 5.1: Classification accuracy (%) of LeNet-5 on MNIST and Fashion-MNIST.
Clean Noisy
MNIST 99.18 98.85
Fashion-MNIST 90.19 86.95
5.2.3 Robustness Against Bin Numbers
For smooth DFT/RFT loss curves with a sufficiently large bin number (say, $B \ge 16$), the optimized loss value does not vary much as $B$ increases further, as shown in Fig. 5.2. Figs. 5.2(a) and (b) show the DFT and RFT loss functions for an exemplary feature, $f_i$, under two binning schemes, i.e., $B = 16$ and $B = 64$, respectively. We see that the binning $B = 16$ is fine enough to locate the optimal partition $f^i_{op}$. If $B = 2^j$, $j = 1, 2, \cdots$, the set of partition points of a smaller $B$ value is a subset of that of a larger $B$ value. Generally, we have the following observations. The difference of the DFT/RFT loss between adjacent candidate points changes smoothly. Since the global minimum has a flat bottom, the loss function is low for a range of partition thresholds. A feature thus achieves a similar loss level under multiple binning schemes. For example, Fig. 5.2(a) shows that $B = 16$ reaches the global minimum at $f_i = 5.21$ while $B = 64$ reaches the global minimum at $f_i = 5.78$. The difference is about 3% of the full dynamic range of $f_i$. Similar characteristics are observed for all feature dimensions in DFT/RFT, indicating the robustness of DFT/RFT. For lower computational complexity, we typically choose $B = 16$ or $B = 32$.
5.3 Experimental Results
5.3.1 Image Datasets with High Dimensional Feature Space
To demonstrate the power of DFT and RFT, we consider several classical datasets. They include
MNIST [62], Fashion-MNIST [107], the Multiple Features (MultiFeat) dataset, the Arrhythmia
(ARR) dataset [39] from the UCI machine learning archive [30], and the Colon cancer dataset [3].
The latter three are used to measure DFT in the classification problem setting.
Figure 5.2: Comparison of two binning schemes with B= 16 and B= 64: (a) DFT and (b) RFT.
Figure 5.3: Comparison of distinct feature selection capability among four feature selection meth-
ods for the classification task on the Fashion-MNIST dataset.
Dataset-1: MNIST and Fashion-MNIST. Both datasets contain grayscale images of resolu-
tion 28× 28, with 60K training and 10K test images. MNIST has 10 classes of hand-written digits
(from 0 to 9) while Fashion-MNIST has 10 classes of fashion products. In order to get deep fea-
tures for each dataset, we train the LeNet-5 network [62] for the two corresponding classification
problems and adopt the 400-D feature vector before the two FC layers as raw features to evaluate
several feature selection methods. Besides original clean training images, we add additive zero-
mean Gaussian noise with different standard deviation values to evaluate the robustness of feature
selection methods against noisy data. The LeNet-5 network is re-trained for these noisy images
and the corresponding deep features are extracted. For the performance benchmarking purpose,
we list the test classification accuracy of the trained LeNet-5 for MNIST and Fashion-MNIST in
Table 5.1 to illustrate the quality of the deep features.
Dataset-2: MultiFeat [101, 100, 51]. This dataset contains features of hand-written digits
(from 0 to 9) extracted from a collection of Dutch utility maps [30], including 649 dimensional
features for 200 images per class. Different from the deep features in Dataset-1, the 649 features are
Table 5.2: Comparison of classification performance (%) on Clean MNIST between different feature selection methods.

Selected Dimension          Method        LR      SVM     RF      XGBoost
Early Elbow Point (30-D)    ANOVA         94.21   95.07   95.77   96.58
                            Corr.         88.73   92.47   94.04   95.11
                            Feat. Imp.    92.61   93.55   94.89   95.71
                            DFT (Ours)    94.49   95.45   96.29   96.92
Late Elbow Point (100-D)    ANOVA         98.24   98.22   97.98   98.66
                            Corr.         97.61   97.78   97.35   98.57
                            Feat. Imp.    98.24   98.15   98.18   98.78
                            DFT (Ours)    97.93   97.83   97.81   98.52
Full Set (400-D)            -             98.89   98.77   98.61   99.14
Table 5.3: Comparison of classification performance (%) on Noisy MNIST between different feature selection methods.

Selected Dimension          Method        LR      SVM     RF      XGBoost
Early Elbow Point (40-D)    ANOVA         94.22   94.62   95.60   96.04
                            Corr.         90.97   93.06   93.64   95.21
                            Feat. Imp.    92.59   93.35   94.48   95.34
                            DFT (Ours)    94.03   95.22   95.78   96.59
Late Elbow Point (100-D)    ANOVA         96.81   96.87   97.16   97.99
                            Corr.         96.87   97.13   96.83   97.93
                            Feat. Imp.    97.22   97.20   97.36   97.97
                            DFT (Ours)    97.08   97.36   97.49   98.18
Full Set (400-D)            -             98.04   98.17   98.15   98.76
extracted from six perspectives such as Fourier coefficients of character shapes and morphological
features. Since the number of samples is small, we use 10-fold cross-validation and compute the
mean accuracy to evaluate the classification performance.
Dataset-3: Colon. This gene expression dataset contains 62 samples with 2000 features each.
It has a binary classification label; namely, the normal tissue or the cancerous tissue. There are
22 normal tissue and 40 cancer tissue samples. Considering its small sample size, we use the
leave-one-out validation to get the classification predictions for each sample.
Dataset-4: ARR. This cardiac arrhythmia dataset has binary labels for 237 normal and 183
abnormal samples. Each sample contains 278 features. The 10-fold cross-validation is adopted to
evaluate the classification performance.
Table 5.4: Comparison of classification performance (%) on Clean Fashion-MNIST between different feature selection methods.

Selected Dimension          Method        LR      SVM     RF      XGBoost
Early Elbow Point (30-D)    ANOVA         78.85   80.44   83.33   83.11
                            Corr.         76.57   80.16   82.69   83.04
                            Feat. Imp.    78.96   80.49   82.99   82.96
                            DFT (Ours)    79.59   81.48   84.03   84.09
Late Elbow Point (150-D)    ANOVA         87.06   86.61   87.69   89.08
                            Corr.         86.99   86.96   87.36   88.81
                            Feat. Imp.    87.47   87.62   88.28   89.33
                            DFT (Ours)    87.60   87.02   87.71   89.13
Full Set (400-D)            -             89.05   88.18   88.74   90.07
Figure 5.4: Error rate comparison on the MultiFeat dataset between mRMR and DFT: (a) SVM, (b) XGBoost.
Figure 5.5: Comparison of the number of errors on the Colon dataset between mRMR and DFT: (a) SVM, (b) XGBoost.
Figure 5.6: Error rate comparison on the ARR dataset between mRMR and DFT: (a) SVM, (b) XGBoost.
Figure 5.7: Performance comparison of DFT feature selection with and without PCA feature pre-processing: (a) MNIST, (b) Fashion-MNIST.
Figure 5.8: Histogram comparison of feature indices ranked by energy before and after PCA pre-processing: (a) 20 selected features, (b) 40 selected features. The smaller the ranking index on the x-axis, the higher the feature energy.
Table 5.5: Comparison of classification performance (%) on Noisy Fashion-MNIST between different feature selection methods.

Selected Dimension          Method        LR      SVM     RF      XGBoost
Early Elbow Point (40-D)    ANOVA         75.35   76.41   77.94   78.62
                            Corr.         75.55   77.94   79.22   80.50
                            Feat. Imp.    75.73   77.10   78.06   78.63
                            DFT (Ours)    76.35   77.92   79.23   79.69
Late Elbow Point (150-D)    ANOVA         81.84   81.98   82.42   84.10
                            Corr.         82.26   82.90   82.59   84.72
                            Feat. Imp.    83.19   83.43   83.54   84.91
                            DFT (Ours)    82.08   82.40   82.61   84.31
Full Set (400-D)            -             84.35   84.24   84.23   85.91
5.3.2 DFT for Classification Problems
We compare the effectiveness of four feature selection methods: 1) F scores from ANOVA (ANOVA F Scores), 2) the absolute correlation coefficient w.r.t. the class labels (Abs. Corr. Coeff.), 3) feature importance (Feat. Imp.) from a pre-trained XGBoost classifier, and 4) DFT. We adopt four classifiers
to validate the classification performance. They are the Logistic Regression (LR) classifier [24],
the Support Vector Machine (SVM) classifier [23], the Random Forest (RF) classifier [8], and the
XGBoost classifier [15]. We have the following two observations.
5.3.2.1 DFT offers an obvious elbow point
Fig. 5.3 compares the ranked scores of four feature selection methods on the Fashion-MNIST dataset. The lower the DFT loss, the higher the importance of a feature. The other three have a reversed relation, namely, the higher the score, the higher the importance. Thus, we search for the elbow point for DFT but the knee point for the other methods. Clearly, the feature importance curve from the pre-trained XGBoost classifier has a clearer knee point and the DFT curve has a more obvious elbow point. In contrast, ANOVA and correlation-coefficient-based methods are not as effective in selecting discriminant features since their knee points are less obvious.
Table 5.6: Comparison of classification performance (%) on MultiFeat between different feature selection methods.

Classifier   Method        10-D    20-D    100-D   All Features (649-D)
LR           ANOVA         93.90   96.00   98.55   98.75
             Corr.         84.70   91.65   98.50
             Feat. Imp.    86.35   97.35   98.90
             DFT (Ours)    92.80   96.65   98.60
SVM          ANOVA         93.90   96.20   98.55   98.45
             Corr.         88.15   93.75   98.70
             Feat. Imp.    89.65   97.60   98.80
             DFT (Ours)    93.70   96.70   98.35
RF           ANOVA         93.75   96.20   98.90   98.60
             Corr.         85.35   92.30   98.55
             Feat. Imp.    87.50   97.15   99.05
             DFT (Ours)    92.70   96.75   98.70
XGBoost      ANOVA         94.00   96.45   98.40   98.45
             Corr.         86.80   93.00   98.45
             Feat. Imp.    88.80   96.95   98.60
             DFT (Ours)    93.55   96.40   98.50
Table 5.7: Comparison of number of errors on Colon cancer dataset between different feature selection methods.

                           Selected Dimension                    Full Set
Classifier   Method        5     10    20    50    80    100     (2000-D)
LR           ANOVA         6     9     10    10    7     7       10
             Corr.         6     9     10    10    7     7
             Feat. Imp.    8     6     5     5     9     9
             DFT (Ours)    6     7     9     8     7     7
SVM          ANOVA         6     7     7     9     8     9       9
             Corr.         6     7     7     9     8     9
             Feat. Imp.    6     6     5     6     9     8
             DFT (Ours)    6     6     8     8     6     6
Table 5.8: Running time (sec.) comparison of different feature selection methods.
ANOVA Feat. Imp. mRMR DFT (B=8) DFT (B=16)
MultiFeat 0.011 363.39 15.19 2.74 5.78
Colon 0.003 58.99 23.15 1.55 3.19
ARR 0.002 55.34 10.64 0.23 0.46
5.3.2.2 Features selected by DFT achieve comparable and stable classification performance
Tables 5.2 - 5.5 summarize the classification accuracy of four classifiers at two reduced dimensions selected by the DFT loss curve based on the early and late elbow points on Dataset-1. The RBF kernel is used for SVM. We see that DFT can achieve comparable (or even the best) performance among the four methods at the same selected feature dimension. The accuracy gaps between the late elbow point and the full feature set (400-D) are very small: 0.62% and 0.94% using the XGBoost classifier for clean MNIST and Fashion-MNIST, respectively. The late elbow point only uses 25-35% of the full feature set. The gaps in classification accuracy on noisy images are 0.58% and 1.6% for MNIST and Fashion-MNIST, respectively, indicating the robustness of the DFT feature selection method against input perturbation.
Table 5.6 summarizes the classification performance for the MultiFeat dataset at two early elbow points (10-D and 20-D) and one late elbow point (100-D). The elbow points are selected based on the sorted DFT loss curve. DFT can achieve comparable or even the best accuracy at the early and late elbow points using different classifiers. The performance gaps between 100 selected features and all 649 features are very small, namely 0.15% and 0.1% for LR and SVM, respectively. The classification accuracies even improve by 0.1% and 0.05% using RF and XGBoost, respectively. This shows that the proposed DFT can eliminate less discriminant features while maintaining or even improving the classification performance.
We show the classification performance on the Colon dataset using LR and SVM in Table 5.7,
where the linear kernel is used in SVM. DFT has the minimum or a comparable number of errors
in leave-one-out validation. Furthermore, DFT can always achieve fewer errors as compared to the
setting of using all 2000 features.
5.3.2.3 Comparison between DFT and mRMR
The minimal-redundancy-maximal-relevance (mRMR) criterion [81, 29] aims at finding a feature set with high relevance to the class while keeping the redundancy among the selected features small, leading to a compact yet effective subset of features. It formulates this as an optimization problem, with both the relevance to the class and the redundancy between selected features measured by mutual information. In this experiment, we compare DFT with mRMR using its incremental selection scheme.
Figs. 5.4-5.6 compare the performance of mRMR and DFT on MultiFeat, Colon and ARR
datasets with SVM and XGBoost classifiers. For MultiFeat and Colon, DFT can achieve very
competitive performance with mRMR. Specifically, the most discriminant feature of MultiFeat
selected by DFT is identical to the first feature selected by mRMR. The error rate with the top 5
features selected by mRMR is smaller than that of DFT. Yet, the performance gap is substantially
narrowed after selecting more than 10 features out of the total 649 features. Overall, the error rates of DFT and mRMR converge at similar reduced dimensions, as shown in Figs. 5.4 and 5.5. On the other hand, the error rate of DFT on the ARR dataset is much lower than that of mRMR, with around 2.5% and 5% gaps at 100 selected features for SVM and XGBoost, respectively, as shown in Fig. 5.6.
5.3.2.4 DFT requires less running time
We compare the time efficiency of DFT, ANOVA, mRMR and feature importance from the XGBoost model. Table 5.8 summarizes the running time for the MultiFeat, Colon and ARR datasets. All methods are run on the same CPU. The pre-trained XGBoost classifier uses a maximum depth of one with 300 trees. For filter methods such as ANOVA and DFT, the time is evaluated on all features without parallel computing. For mRMR, we set the maximum number of incrementally selected features to 100, which is smaller than the full feature set. ANOVA is the fastest and DFT ranks second on all three datasets. DFT with $B = 16$ is about 2.6, 7.3 and 23.1 times faster than mRMR on MultiFeat, Colon and ARR, respectively. To further reduce the running time, our proposed DFT can easily be accelerated by parallel computing since it processes each feature independently before the feature ranking.
Table 5.9: Regression MSE comparison for MNIST (clean/noisy) images with features selected by four methods.

Method               Early Elbow Point (30-D / 50-D)   Late Elbow Point (100-D / 100-D)
Var.                 1.45 / 1.23                       0.90 / 0.99
Abs. Corr. Coeff.    1.43 / 1.37                       0.90 / 1.06
Feat. Imp.           1.55 / 1.47                       1.04 / 1.23
RFT (Ours)           1.37 / 1.36                       0.91 / 1.04
Full Set (400-D)     0.70 / 0.83
5.3.2.5 DFT with feature pre-processing
DFT assigns a score to each feature and selects a subset without any pre-processing. Yet, there
might be correlation between features so that a redundant feature subset might be selected based
on feature ranking [12]. Instead of adding a redundancy measure to the DFT loss, we study the effect
of combining DFT with feature pre-processing, such as PCA for feature decorrelation. We choose
clean MNIST and Fashion-MNIST datasets as examples and perform PCA on the 400-D deep
features without energy truncation. The DFT loss is calculated for each of the 400 PCA-decorrelated
features. After feature selection, the XGBoost classifier is applied. Fig. 5.7 compares the test
accuracy under different selected dimensions for each setting. We see that PCA pre-processing
improves the classification performance with the same selected dimension.
Furthermore, PCA pre-processing allows a smaller feature dimension for the same perfor-
mance. For example, the accuracy on Fashion-MNIST saturates at around 15-D and 30-D with
and without pre-processing, respectively. This can be explained by the energy compaction capa-
bility of PCA. Fig. 5.8 shows the histogram of energy ranking of the feature subset selected by
DFT with and without PCA preprocessing. The raw features are first sorted by decreasing energy
(variance) prior to feature selection. We see that the selected subset tends to gather in the first 20 to
50 principal components with PCA pre-processing, while the selected features are more widely
distributed without PCA.
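The pre-processing study above can be summarized by the following minimal sketch, which uses scikit-learn's PCA and the XGBoost classifier together with the dft_loss routine sketched in Sec. 5.2.1. The variable names (X_train, y_train, X_test, K) are placeholders, and the snippet is illustrative rather than the exact experimental code.

```python
import numpy as np
from sklearn.decomposition import PCA
from xgboost import XGBClassifier

# PCA decorrelation of the 400-D deep features without energy truncation,
# followed by DFT-based selection and XGBoost classification (a sketch).
pca = PCA(n_components=None)                 # keep all components
Z_train = pca.fit_transform(X_train)
Z_test = pca.transform(X_test)

# dft_loss() is the per-feature routine sketched in Sec. 5.2.1.
scores = np.array([dft_loss(Z_train[:, i], y_train, num_classes=10)
                   for i in range(Z_train.shape[1])])
selected = np.argsort(scores)[:K]            # K chosen at the elbow of the sorted loss curve

clf = XGBClassifier().fit(Z_train[:, selected], y_train)
pred = clf.predict(Z_test[:, selected])
```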
Table 5.10: Regression MSE comparison for Fashion-MNIST (clean/noisy) images with features selected by four methods.

Method               Early Elbow Point (30-D / 50-D)   Late Elbow Point (150-D / 150-D)
Var.                 2.08 / 1.98                       1.46 / 1.73
Abs. Corr. Coeff.    1.95 / 1.96                       1.49 / 1.75
Feat. Imp.           2.00 / 2.06                       1.62 / 1.86
RFT (Ours)           1.97 / 1.96                       1.48 / 1.73
Full Set (400-D)     1.35 / 1.62
5.3.3 RFT for Regression Problems
We convert the discrete class labels arising from the classification problem to floating numbers so
as to formulate a regression problem. We compare the effectiveness of four feature selection methods: 1) variance (Var.), 2) the absolute correlation coefficient w.r.t. the regression target (Abs. Corr. Coeff.),
3) feature importance (Feat. Imp.) from a pre-trained XGBoost regressor (of 50 trees), and 4) RFT.
Again, we can draw two conclusions.
5.3.3.1 RFT offers a more obvious elbow point
Fig. 5.9 compares the ranked scores of different feature selection methods. The lower the RFT loss, the higher the feature importance, while the other three have a reversed relation. RFT has a more obvious elbow point than the knee points of the variance- and correlation-based methods. The feature
importance from the pre-trained XGBoost regressor saturates very fast (up to 24-D) and the differ-
ence among the remaining features is small. In contrast, RFT has a more distinct and reasonable
elbow point, ensuring the performance after dimension reduction. A larger XGBoost model with more trees can help increase the number of features with higher importance. Yet, it is not clear what model size would be suitable for a particular regression problem.
5.3.3.2 Features selected by RFT achieve comparable and stable performance
Table 5.9 summarizes the regression MSE at two reduced dimensions selected by the RFT loss
curves using early and late elbow points. The proposed RFT can achieve comparable (or even
the best) performance among the four methods at the same selected feature dimension, regardless of whether the input images are clean or noisy. By employing only 25-37.5% of the total feature dimensions, the regression MSEs obtained at the late elbow point of RFT are 20-30% and 5-10% higher than those of the full feature set for MNIST and Fashion-MNIST, respectively. This demonstrates the effectiveness of the RFT feature selection method.
Figure 5.9: Comparison of relevant feature selection capability among four feature selection methods for the regression task on the Fashion-MNIST dataset.
5.4 Conclusion
Two feature selection methods, DFT and RFT, were proposed for general classification and regression tasks in this work. As compared with other existing feature selection methods, DFT and RFT are effective in finding distinct feature subspaces by offering obvious elbow regions in the DFT/RFT loss curves. They provide feature subspaces of significantly lower dimensions while maintaining near-optimal classification/regression performance. They are computationally efficient. They are also robust to noisy input data.
Recently, there is an emerging research direction that targets unsupervised representation learning [19, 17, 18, 71, 73, 117, 116]. Through this process, it is easy to obtain high-dimensional feature spaces (say, 1000-D or higher). We plan to apply DFT/RFT to them and find discriminant/relevant feature subspaces for specific tasks.
Chapter 6
Design of Supervision-Scalable Learning Systems:
Methodology and Performance Benchmarking
6.1 Introduction
Supervised learning is the mainstream in pattern recognition, computer vision and natural language processing nowadays due to the great success of deep learning. On one hand, the performance of a learning system should improve as the number of training samples increases. On the other hand, some learning systems may benefit more than others from a large number of training samples. For example, deep neural networks (DNNs) often work better than classical learning systems that contain two stages, feature extraction and classification.
samples affects the performance of learning systems is an important question in the data-driven era.
Is it possible to design a supervision-scalable learning system? We attempt to shed light on these
questions by choosing the image classification problem as an illustrative example in this work.
Strong supervision is costly in practice since data labeling demands a lot of time and resource.
Besides, it is unlikely to collect and label desired training samples in all possible scenarios. Even
with a huge amount of labeled data in place, it may still be substantially less than the need. Humans
can learn effectively in a weakly supervised setting. In contrast, deep learning networks often need
more labeled data to achieve good performance. What makes weak supervision and strong super-
vision different? There is little study on the design of supervision-scalable learning systems. In
this work, we show the design of two learning systems that demonstrate an excellent scalable per-
formance with respect to various supervision degrees. The first one adopts the classical histogram
of oriented gradients (HOG) [26] features while the second one uses successive-subspace-learning
(SSL) features. We discuss ways to adjust each module so that their design is more robust against
the number of training samples. To illustrate their robust performance, we compare with the per-
formance of LeNet-5, which is an end-to-end optimized neural network, for MNIST and Fashion-
MNIST datasets. The number of training samples per image class goes from the extremely weak
supervision condition (i.e., 1 labeled sample per class) to the strong supervision condition (i.e.,
4096 labeled samples per class) with a gradual transition in between (i.e., $2^n$, $n = 0, 1, \cdots, 12$). Experimental results show that the two families of modularized learning systems have more robust
performance than LeNet-5. They both outperform LeNet-5 by a large margin for small n and have
performance comparable with that of LeNet-5 for large n.
The rest of the chapter is organized as follows. The design of HOG-based learning systems is
examined in Sec. 6.2, where two methods, called HOG-I and HOG-II, are proposed. The design
of SSL-based learning systems is investigated in Sec. 6.3, where two methods, called IPHop-I and
IPHop-II, are presented. Performance benchmarking of HOG-I, HOG-II, IPHop-I, IPHop-II and
LeNet-5 is conducted in Sec. 6.4. Discussion on experimental results is given in Sec. 6.5. Finally,
concluding remarks and future work are given in Sec. 6.6.
6.2 Design of Learning Systems with HOG Features
Classical pattern recognition methods consist of two steps: feature extraction and classification. One well-known feature extraction method is the histogram of oriented gradients (HOG) [26].
Before the big data era, most datasets are small in terms of the numbers of training samples and test
samples. As a result, HOG-based solutions are typically applied to small datasets. To make HOG-
based solutions scalable to larger datasets, some modifications have to be made. In this section, we
propose two HOG-based learning systems, HOG-I and HOG-II. They are suitable for small and large training sizes, respectively.
Figure 6.1: An overview of the HOG-based learning system, where the input image is of size 28× 28.
6.2.1 Design of Three Modular Components
As mentioned earlier, we focus on the design of a modularized system that can be decomposed into three modules: representation learning, feature learning and decision learning. We will examine them one by one below.
6.2.1.1 Representation Learning
HOG was originally proposed for human detection in [26]. It measures the oriented gradient
distribution in different orientation bins evenly spaced over 360 degrees at each local region of
the input image. Modifications are made to make the HOG representation more powerful for
multi-class recognition. Images in the MNIST and the Fashion-MNIST datasets have resolution
of 28× 28 without padding. The proposed HOG representation scheme for them is illustrated in
Fig. 6.1. As shown in the figure, both spatial and spectral HOG representations are considered.
Hyper-parameters used in the experiments are specified below.
First, we decompose an input image into 7× 7 overlapping patches with stride 3, leading to
8× 8= 64 patches. HOG is computed within each patch and the number of orientation bins is set
to 8. For each orientation bin, there are 8× 8 responses in the spatial domain. Thus, each image
has a 512-D spatial HOG representation vector. It is called the spatial HOG representation since
each element in the vector captures the probability of a certain oriented gradient in a local region.
Next, for each bin, we apply the 2D discrete cosine transform (DCT) to 8× 8 spatial responses to
derive the spectral representation. The DCT converts 64 spatial responses to 64 spectral responses.
It is called the spectral HOG representation. Each image has a 512-D spectral HOG representa-
tion vector, too. We combine spatial and spectral HOG representations to yield a vector of 1024
dimensions as the joint spatial/spectral HOG features.
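The following minimal sketch illustrates the joint spatial/spectral HOG representation described above for a single 28×28 image. Details such as histogram normalization and gradient weighting are simplifying assumptions here (a magnitude-weighted histogram is used), and the function name joint_hog_features is a placeholder rather than the exact experimental code.

```python
import numpy as np
from scipy.fft import dctn

def joint_hog_features(img, patch=7, stride=3, bins=8):
    """Spatial + spectral HOG representation for a 28x28 image (a sketch).

    7x7 overlapping patches with stride 3 form an 8x8 grid; each patch gets
    an 8-bin orientation histogram (512-D spatial part); a 2D DCT over the
    8x8 grid of each bin gives the 512-D spectral part; the two are joined
    into a 1024-D vector.
    """
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)           # orientation in [0, 2*pi)

    grid = (img.shape[0] - patch) // stride + 1            # 8 for a 28x28 input
    hist = np.zeros((grid, grid, bins))
    for r in range(grid):
        for c in range(grid):
            m = mag[r*stride:r*stride+patch, c*stride:c*stride+patch].ravel()
            a = ang[r*stride:r*stride+patch, c*stride:c*stride+patch].ravel()
            idx = np.minimum((a / (2 * np.pi) * bins).astype(int), bins - 1)
            for k in range(bins):                          # magnitude-weighted histogram
                hist[r, c, k] = m[idx == k].sum()

    spatial = hist.reshape(-1)                             # 8*8*8 = 512-D
    spectral = np.stack([dctn(hist[:, :, k], norm='ortho')
                         for k in range(bins)], axis=-1).reshape(-1)   # 512-D
    return np.concatenate([spatial, spectral])             # 1024-D joint feature
```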
6.2.1.2 Feature Learning
The size of HOG feature set from Module 1 is large. It is desired to select discriminant features
to reduce the feature dimension before classification. We adopt two feature selection methods in
Module 2 as elaborated below.
When the training size is small, we may consider unsupervised feature selection. One common
method is to use the variance of a feature. Intuitively speaking, if one feature has a smaller variance
value among all training samples, it is not able to separate different classes well as compared with
features that have higher variance values. Thus, we can rank order features from the largest to the
smallest variance values and use a threshold to select those of larger variance.
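A minimal sketch of this unsupervised selection step is given below; in practice, an equivalent off-the-shelf option is scikit-learn's VarianceThreshold. The function name select_by_variance is a placeholder.

```python
import numpy as np

def select_by_variance(X, k):
    """Unsupervised feature selection by variance ranking (a sketch).

    X: (N, P) training feature matrix. Returns the indices of the k
    features with the largest variance over the training samples.
    """
    var = X.var(axis=0)
    return np.argsort(var)[::-1][:k]
```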
When the training size becomes larger, we can exploit class labels for better feature selection.
The advantage of semi-supervised feature selection over unsupervised becomes more obvious as
the supervision level increases. Here, we adopt a newly developed method, called Discriminant
Feature Test (DFT) [111], for semi-supervised feature selection. DFT computes the discriminant
power of each 1D feature by partitioning its range into two non-overlapping intervals and searching
for the optimal partitioning point that minimizes the weighted entropy loss. Mathematically, we
have the entropy function of the left interval as
$H^i_{L,t} = -\sum_{c=1}^{C} p^i_{L,c} \log(p^i_{L,c}),$ (6.1)
where $p^i_{L,c}$ is the probability of class $c$ in the left interval of the $i$th feature and $t$ is a threshold. Similarly, we can compute entropy $H^i_{R,t}$ for the right interval. Then, the entropy of the whole range is the weighted average of $H^i_{L,t}$ and $H^i_{R,t}$, denoted by $H^i_t$. Then, the optimized entropy $H^i_{op}$ for the $i$th feature is given by
$H^i_{op} = \min_{t \in T} H^i_t,$ (6.2)
where T indicates a set of discrete partition points. The lower the weighted entropy, the higher the
discriminant power. Top K features with the lowest DFT loss are selected as discriminant features.
6.2.1.3 Decision Learning
We consider two classifiers: the k-nearest-neighbor (KNN) classifier and the XGBoost classifier. In a weakly supervised setting with a small number of training samples, the choice is very limited and a distance-based classifier seems to be a reasonable choice. When the training set becomes larger, we can use a more powerful supervised classifier to yield better classification performance. The XGBoost classifier is a representative one.
6.2.2 HOG-I and HOG-II
Based on the three modules introduced in Sec. 6.2.1, we propose two HOG-based learning systems
below.
(1) HOG-I
• Objective: targeting at weaker supervision
• Representation Learning: HOG features
• Feature Learning: variance thresholding
• Decision Learning: KNN
(2) HOG-II
• Objective: targeting at stronger supervision
• Representation Learning: HOG features
• Feature Learning: DFT
• Decision Learning: XGBoost
Figure 6.2: Performance comparison of HOG-based learning systems on the MNIST and Fashion-MNIST datasets under four different combinations among two feature learning methods (variance thresholding and DFT) and two classifiers (KNN and XGBoost) as a function of the training sample number per class in the log scale: (a) MNIST, (b) Fashion-MNIST.
Figure 6.3: Performance comparison of spatial, spectral and joint HOG features on MNIST: (a) weak supervision with HOG-I, (b) strong supervision with HOG-II.
Figure 6.4: Performance comparison of spatial, spectral and joint HOG features on Fashion-MNIST: (a) weak supervision with HOG-I, (b) strong supervision with HOG-II.
Figure 6.5: An overview of the SSL-based learning system, where the input image is of size 32× 32.
To justify these two designs, we conduct experiments on MNIST and Fashion-MNIST to gain
more insights. By adopting the joint HOG features, we consider four combinations of two fea-
ture learning methods (variance thresholding and DFT) and two classifiers (KNN and XGBoost).
The classification accuracy of test data against MNIST and Fashion-MNIST, as a function of the
number of training data per class, $N_c$, is compared in Fig. 6.2, whose x-axis is in the unit of $\log_2 N_c = n_c$.
The primary factor in the performance is the classifier choice. When $N_c \le 2^3 = 8$, the KNN classifier outperforms the XGBoost classifier by an obvious margin. If $N_c \ge 2^6 = 64$, the XGBoost classifier offers better performance. There is a cross-over region between $N_c = 2^4 = 16$ and $N_c = 2^5 = 32$. Furthermore, by zooming into the weak supervision region with $N_c \le 8$, the variance thresholding feature selection method is better than the DFT feature selection method. On the other hand, in the stronger supervision region with $N_c \ge 32$, the DFT feature selection method is better than the variance thresholding feature selection method. For this reason, we use the combination of variance thresholding and KNN in HOG-I and adopt the combination of DFT and XGBoost in HOG-II. They target the weaker and stronger supervision cases, respectively.
The phenomenon observed in Fig. 6.2 can be explained below. When the supervision de-
gree is weak, it is difficult to build meaningful data models (e.g., low-dimensional manifolds in
a high-dimensional representation space). Variance thresholding and KNN are classical feature
selection and classification methods derived from the Euclidean distance, respectively. When the
supervision degree becomes stronger, it is feasible to build more meaningful data models. The
Euclidean distance measure is too simple to capture the data manifold information. Instead, DFT
and XGBoost can leverage the manifold structure for better feature selection and decision making,
respectively.
Next, we compare the performance of spatial, spectral and joint HOG features for HOG-I and
HOG-II under their preferred supervision range for MNIST and Fashion-MNIST. Their results
are shown in Figs. 6.3 and 6.4, respectively. We have the following observations. First, the
performance gap between spatial and spectral HOG features is small under weak supervision with $2^4 = 16$ training samples per class. The performance gap becomes larger if the training sample number per class is greater than or equal to $2^5 = 32$. Second, the joint HOG features provide the best overall performance. This is not a surprise since the set of joint HOG features contains the spatial and spectral HOG features as its two subsets. Here, we would like to point out that the performance gap between the joint HOG features and the spatial HOG features is larger for smaller $N_c$. It means that the spectral HOG features do complement the spatial HOG features and contribute to the performance gain. The value of the spectral HOG features diminishes as $N_c$ becomes sufficiently large.
We may give the following explanation to the phenomena observed in Figs. 6.3 and 6.4. By
performing the DCT on the histogram of each bin over 8× 8 patches, the values of low-frequency
DCT coefficients are larger due to energy compaction. Their values for the same object class are
relatively stable regardless of the supervision degree. In contrast, the spatial HOG features are
distributed over the whole image. They are more sensitive to the local variation of each individual
sample. Thus, when the supervision is weak, HOG-I can benefit from spectral HOG features.
Second, as the supervision becomes stronger, the situation is different. Although the HOG of a single patch provides only local information, we can obtain both local and global information by concatenating spatial HOG features across all patches. The HOG at the same patch location could be noisy (i.e., varying from one sample to another). Yet, the variation can be filtered out by
DFT and XGBoost. On the other hand, the values of high-frequency DCT coefficients are small
and many of them are close to zero because of energy compaction. Thus, spectral HOG features
are not as discriminant as spatial HOG features under strong supervision.
6.3 Design of Learning Systems with SSL Features
Successive subspace learning (SSL) was recently introduced in [58, 59, 60, 61]. The technique has
been applied to many applications such as point cloud classification, segmentation and registration
[117, 116, 115, 53, 55], face recognition [84, 83], deepfake detection [14], anomaly detection
[114], etc. SSL-based object classification work can be found in [17, 18, 110]. We propose an
improved PixelHop (IPHop) learning system in this section.
The system diagram of IPHop is shown in the left subfigure of Fig. 6.5. It consists of three
modules: 1) unsupervised representation learning based on SSL features, 2) semi-supervised fea-
ture learning, and 3) supervised decision learning. Since its modules 2 and 3 are basically the same
as those in HOG-based learning systems, we will primarily focus on the representation learning in
Sec. 6.3.1. Afterwards, we compare the performance of IPHop-I and IPHop-II under weak and
strong supervision scenarios in Sec. 6.3.2.
6.3.1 SSL-based Representation Learning
We describe the processing procedure in Module 1 of the left subfigure of Fig. 6.5 below. The
input is a tiny image of spatial resolution 32× 32. The processing procedure can be decomposed
into two cascaded units, called Hop-1 and Hop-2, respectively. We first extract the spatial Saab
features at Hop-1 and Hop-2. For each hop unit, we apply filters of spatial size 5× 5. At Hop-1, a neighborhood of size 5× 5 centered at each of the interior 28× 28 pixels is constructed. The Saab transform is conducted on each neighborhood to yield $K_1 = 25$ channel responses at each pixel. Afterwards, a 2× 2 absolute max-pooling is applied to each channel. It reduces the spatial resolution from 28× 28 to 14× 14. As a result, the input to Hop-2 is of size 14× 14× 25. Similarly, we apply the channel-wise Saab transform with $K_2$ filters to the interior 10× 10 points to get $K_2$ responses at each point. Here, we set $K_2 = 256$ and $K_2 = 204$ for MNIST and Fashion-MNIST, respectively, based on the energy thresholding criterion introduced in [18].
The above design is basically the standard PixelHop++ pipeline as described in [18]. The only modification in IPHop is that we change the max-pooling in PixelHop++ to absolute max-pooling. Note that the responses from Hop-1 can be either positive or negative since no nonlinear activation is implemented at Hop-1. Instead of clipping negative values to zero, we find it advantageous to take the absolute value of the response first and then conduct the max-pooling operation.
The spatial filter responses extracted at Hop-1 and Hop-2 only have a local view on the object
due to the limited receptive field. They are not discriminant enough for semantic-level understand-
ing. Since there exist correlations among these local filter responses, we can conduct another Saab transform across all local responses of each individual channel. Such a processing step provides the global spectral Saab features at Hop-1 and Hop-2, as shown in the right subfigure of Fig. 6.5. To explain the procedure in detail, we use Hop-1 as an example. For each of the $K_1 = 25$ channels, the 14× 14 = 196 spatial Saab features are flattened and then passed through a one-stage Saab
transform. All responses are kept without truncation. Thus, the dimension of the output spectral
Saab features is 196 for each channel. As compared to features learned by gradually enlarging
the neighborhood range, the spectral Saab features capture the long range information from a finer
scale. Finally, the spatial and spectral Saab features are concatenated at Hop-1 and Hop-2 to form
the joint-spatial-spectral Saab features.
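The two IPHop-specific steps above, absolute max-pooling and the channel-wise spectral transform, can be sketched as follows. For simplicity, PCA stands in for the one-stage Saab transform (the Saab DC/bias handling is omitted), the transform would in practice be fitted on training data only, and all names are placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA

def abs_max_pool_2x2(resp):
    """Absolute max-pooling used in IPHop (a sketch).

    resp: (H, W, K) Hop-1 responses. The absolute value is taken first and
    a 2x2 max-pooling is then applied, halving the spatial resolution.
    """
    a = np.abs(resp)
    H, W, K = a.shape
    a = a.reshape(H // 2, 2, W // 2, 2, K)
    return a.max(axis=(1, 3))

def spectral_features(resp, n_keep=None):
    """Per-channel spectral features (a sketch; PCA approximates the
    one-stage Saab transform applied to the flattened spatial responses).

    resp: (N, H, W, K) spatial responses of one hop for N images.
    Returns an (N, H*W*K) array of channel-wise spectral responses.
    """
    N, H, W, K = resp.shape
    out = []
    for k in range(K):
        flat = resp[:, :, :, k].reshape(N, H * W)       # 196-D per channel at Hop-1
        out.append(PCA(n_components=n_keep).fit_transform(flat))
    return np.concatenate(out, axis=1)
```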
Figure 6.6: Performance comparison of SSL-based learning systems on the MNIST and Fashion-MNIST datasets under four different combinations among two feature learning methods (variance thresholding and DFT) and two classifiers (KNN and XGBoost) as a function of $n_c = \log_2(N_c)$: (a) MNIST, (b) Fashion-MNIST.
Figure 6.7: Performance comparison of spatial, spectral and joint SSL features on MNIST under weak and strong supervision conditions: (a) weak supervision with IPHop-I, (b) strong supervision with IPHop-II.
Figure 6.8: Performance comparison of spatial, spectral and joint SSL features on Fashion-MNIST under weak and strong supervision conditions: (a) weak supervision with IPHop-I, (b) strong supervision with IPHop-II.
6.3.2 IPHop-I and IPHop-II
Since the two hops have different combinations of spatial and spectral information, it is desired to treat them differently. For this reason, we partition the IPHop features into two sets:
• Feature Set no. 1: spatial and spectral features of Hop-1,
• Feature Set no. 2: spatial and spectral features of Hop-2.
Feature learning is used to select a subset of discriminant features from the raw representation. Following HOG-I and HOG-II, we consider two choices, variance thresholding and DFT, apply them to feature sets no. 1 and no. 2, and select the same number of optimal features from each set individually. Furthermore, the same two classifiers are used for decision learning: KNN and XGBoost. For KNN, we concatenate the optimal features from feature sets no. 1 and no. 2 and compute the distance in this joint feature space. For XGBoost, we apply it to feature sets no. 1 and no. 2 and make a soft decision for each hop separately. Afterwards, we average the two soft decisions and use the maximum likelihood principle to yield the final decision, as sketched below.
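The XGBoost fusion step can be written as in the following sketch (Python; it assumes the scikit-learn-style XGBClassifier interface, and the function and variable names are placeholders of our own):

import numpy as np
from xgboost import XGBClassifier

def fused_xgboost_decision(X1_train, X2_train, y_train, X1_test, X2_test):
    # One XGBoost classifier per feature set (hop); each yields a soft decision.
    clf1 = XGBClassifier().fit(X1_train, y_train)
    clf2 = XGBClassifier().fit(X2_train, y_train)
    # Average the two soft decisions and pick the class of highest averaged probability.
    prob = 0.5 * (clf1.predict_proba(X1_test) + clf2.predict_proba(X2_test))
    return np.argmax(prob, axis=1)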
We propose two SSL-based learning systems below.
(1) IPHop-I
• Objective: targeting weaker supervision
• Representation Learning: Joint SSL features (i.e. both feature set nos. 1 and 2)
• Feature Learning: variance thresholding
• Decision Learning: KNN
(2) IPHop-II
• Objective: targeting stronger supervision
• Representation Learning: Joint SSL features (i.e. both feature set nos. 1 and 2)
• Feature Learning: DFT
• Decision Learning: XGBoost
To justify these two designs, we consider all four possible combinations of feature and decision learning choices and compare their performance in Fig. 6.6. We use the Fashion-MNIST dataset as an example in the following discussion. We see from Fig. 6.6(b) that KNN outperforms XGBoost under weak supervision (i.e., the training sample number per class N_c ≤ 4). On the other hand, XGBoost outperforms KNN under stronger supervision (N_c ≥ 16). There is a transition point at N_c = 8. For the weak supervision scenario, variance thresholding feature selection is better than DFT. This is particularly obvious when N_c = 1, where the performance gap is around 25%. Thus, we adopt the combination of variance thresholding and KNN in IPHop-I for weaker supervision. For the stronger supervision case, DFT is slightly better than variance thresholding with both KNN and XGBoost. Therefore, we adopt the combination of DFT and XGBoost in IPHop-II for stronger supervision.
Next, we conduct an ablation study on different representations for IPHop-I and IPHop-II in their preferred operating ranges to understand the impact of each feature type. Fig. 6.7 compares the test accuracy of individual spatial and spectral features of hop-1 and hop-2 and of the joint feature set for MNIST under different supervision levels.
Under weak supervision, we see from Fig. 6.7(a) that spectral features are more powerful than spatial features, while spectral features of hop-2 are slightly better than those of hop-1. This is attributed to the different diversity of spatial features between hop-1 and hop-2.
Under stronger supervision, we see from Fig. 6.7(b) that spatial features are more powerful than spectral features, since spatial features can capture more detailed information without energy compaction, and this detailed information does help the classification performance as the number of labeled samples increases. Furthermore, features of hop-2 are more useful than those of hop-1. The main differences between hop-1 and hop-2 features lie in two factors:
• spatial features are determined by the receptive field of Saab filters,
• spectral features are determined by spatial aggregation of Saab responses over the entire set
of grid points.
For the former, the cascaded filters in hop-2 offer a larger receptive field, which has stronger discriminant power than hop-1. For the latter, hop-1 has 28 × 28 grid points while hop-2 has only 14 × 14 grid points. The content in hop-1 has larger diversity than that in hop-2. Although the spatial Saab transform can achieve energy compaction, the percentages of stable and discriminant spectral features in hop-1 tend to be lower than those in hop-2. Yet, hop-1 and hop-2 do provide complementary features so that the joint feature set gives the best performance.
Finally, we show the test accuracy with individual spatial and spectral features of hop-1 and
hop-2 and joint feature sets under different supervision levels for Fashion-MNIST in Fig. 6.8. The
same observations and discussion apply to Fashion-MNIST.
6.4 Experiments
Experiments are conducted on MNIST [62] and Fashion-MNIST [107] datasets for performance
benchmarking of HOG-I/II, IPHop-I/II and LeNet-5 learning systems against a wide range of su-
pervision levels. For HOG and IPHop, we also introduce a hybrid solution: type I is used when N_c ≤ 8 while type II is used for N_c ≥ 16. They are called hybrid HOG and hybrid IPHop, respectively.
6.4.1 Experimental Setup
The Adam optimizer is used for backpropagation in the training of the LeNet-5 network. The number of epochs is set to 50 for all N_c values. Both MNIST and Fashion-MNIST contain grayscale images of resolution 28 × 28, with 60K training and 10K test images. MNIST contains 10 hand-written digits (from 0 to 9) while Fashion-MNIST has 10 fashion classes. The training sample number per class is around 6K. Among the 6K training samples per class, we randomly choose a subset of size N_c = 2^{n_c}, n_c = 0, 1, ..., 12, as the training set. All classes have the same training sample number. In other words, we go from the extremely weak supervision condition with 1 labeled sample per class to the strong supervision condition with 4,096 labeled samples per class, with a gradual transition in between. Experiments with random training sample selection are performed with multiple runs.
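The per-class subset construction can be sketched as below (Python with NumPy; the function name and the random seed handling are illustrative):

import numpy as np

def sample_per_class(X, y, n_c, num_classes=10, seed=0):
    # Randomly pick N_c = 2**n_c labeled samples from each of the 10 classes.
    rng = np.random.default_rng(seed)
    N_c = 2 ** n_c
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=N_c, replace=False)
        for c in range(num_classes)
    ])
    return X[idx], y[idx]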
6.4.2 Performance Benchmarking
We conduct performance benchmarking of HOG-I, HOG-II, IPHop-I, IPHop-II, and LeNet-5 in
this subsection. The mean test accuracy and standard deviation values for MNIST and Fashion-
MNIST under different supervision levels are reported in Table 6.1 and Table 6.2, respectively,
based on results from 10 runs. We have the following observations.
• Under weak supervision with N_c = 1, 2, 4, 8, HOG-I and IPHop-I outperform LeNet-5 by a large margin. Specifically, when there is only one labeled image per class (N_c = 1), or 10 labeled samples for the whole dataset, HOG-I and IPHop-I can still reach an accuracy of around 50% for both datasets. For MNIST, HOG-I and IPHop-I surpass LeNet-5 by 12.51% and 10.67%, respectively. For Fashion-MNIST, the performance gains of HOG-I and IPHop-I are 8.16% and 5.55%, respectively. It shows that the performance of the HOG-based and IPHop-based learning systems is more robust as the number of labeled samples decreases.
• Under middle supervision with N_c = 16, 32, 64, 128, HOG-I, HOG-II, IPHop-I and IPHop-II still outperform LeNet-5. Furthermore, we start to see the advantage of HOG-II over HOG-I and the advantage of IPHop-II over IPHop-I. Besides the comparison of mean accuracy scores, we see that the standard deviation of LeNet-5 is significantly higher than that of HOG-I, HOG-II, IPHop-I and IPHop-II under the weak and middle supervision levels.
• Under strong supervision with N_c ≥ 1,024 (or a total training sample number of more than 10,240 since there are 10 classes), the advantage of LeNet-5 starts to show up. Yet, IPHop-II still outperforms LeNet-5 on Fashion-MNIST while the performance difference between IPHop-II and LeNet-5 on MNIST is very small.
• When the full training dataset (i.e., N_c = 6K) is used, each of HOG-I, HOG-II, IPHop-I and IPHop-II has a single test accuracy value and the standard deviation value is zero since the training set is the same. In contrast, even with the same input, LeNet-5 can yield different accuracy values due to the stochastic optimization nature of backpropagation.
It is natural to consider hybrid HOG and IPHop schemes, where type I is adopted when N_c ≤ 8 and type II is adopted for N_c ≥ 16. For ease of visual comparison, we plot the mean accuracy curves as well as the standard deviation values (indicated by vertical bars) of hybrid HOG, hybrid IPHop and LeNet-5 as a function of N_c in Fig. 6.9. Clearly, hybrid IPHop provides the best overall performance among the three. Hybrid IPHop outperforms LeNet-5 by a significant margin when N_c ≤ 128 on MNIST and throughout the whole range of N_c on Fashion-MNIST. As to hybrid HOG, it outperforms LeNet-5 with N_c ≤ 128, underperforms LeNet-5 with N_c ≥ 512 and has a crossover point with LeNet-5 at N_c = 256 on MNIST. On Fashion-MNIST, hybrid HOG has higher accuracy than LeNet-5 when N_c ≤ 1,024 while its performance is comparable with LeNet-5 when N_c = 2,048 or 4,096.
Table 6.1: Comparison of the mean test accuracy (%) and standard deviation on MNIST under weak, middle and strong supervision degrees, where the best performance is highlighted in bold.
Supervision   N_c   LeNet-5   HOG-I   HOG-II   IPHop-I   IPHop-II
Weak
1 40.07 (± 5.78) 52.58 (± 3.89) 9.80 (± 0.00) 50.74 (± 4.13) 9.80 (± 0.00)
2 54.43 (± 6.62) 58.94 (± 3.33) 38.40 (± 2.14) 59.96 (± 3.42) 45.82 (± 3.23)
4 63.19 (± 3.52) 66.55 (± 2.42) 43.48 (± 4.18) 71.28 (± 2.22) 56.00 (± 3.98)
8 72.41 (± 2.50) 74.12 (± 1.50) 63.39 (± 3.21) 79.40 (± 1.18) 73.90 (± 2.62)
Middle
16 73.38 (± 3.69) 78.61 (± 1.04) 77.35 (± 1.80) 84.78 (± 0.87) 85.47 (± 1.02)
32 82.51 (± 3.62) 82.87 (± 0.40) 85.60 (± 0.99) 88.87 (± 0.44) 90.69 (± 0.85)
64 83.92 (± 5.94) 86.01 (± 0.72) 90.47 (± 0.34) 91.60 (± 0.32) 93.60 (± 0.21)
128 90.92 (± 5.52) 88.34 (± 0.30) 93.14 (± 0.43) 93.49 (± 0.32) 95.27 (± 0.24)
Strong
256 94.87 (± 2.61) 90.29 (± 0.23) 95.09 (± 0.26) 94.99 (± 0.23) 96.49 (± 0.08)
512 97.17 (± 0.26) 91.77 (± 0.20) 96.16 (± 0.15) 95.93 (± 0.14) 97.44 (± 0.09)
1024 98.18 (± 0.16) 93.02 (± 0.12) 97.04 (± 0.14) 96.59 (± 0.08) 98.04 (± 0.07)
2048 98.64 (± 0.17) 93.95 (± 0.12) 97.68 (± 0.04) 97.23 (± 0.08) 98.55 (± 0.07)
4096 98.95 (± 0.09) 94.70 (± 0.13) 98.08 (± 0.04) 97.66 (± 0.06) 98.90 (± 0.06)
Full 99.07 (± 0.07) 95.03 98.20 97.84 99.04
(a) MNIST
(b) Fashion-MNIST
Figure 6.9: Comparison of test accuracy between hybrid HOG, hybrid IPHop, and LeNet-5 for MNIST and Fashion-MNIST. For hybrid HOG and IPHop, type I is adopted when N_c ≤ 8 and type II is adopted for N_c ≥ 16.
Table 6.2: Comparison of the mean test accuracy (%) and standard deviation on Fashion-MNIST under weak, middle and strong supervision degrees, where the best performance is highlighted in bold.
Supervision   N_c   LeNet-5   HOG-I   HOG-II   IPHop-I   IPHop-II
Weak
1 41.18 (± 5.06) 49.80 (± 4.29) 10.00 (± 0.00) 46.73 (± 4.87) 10.00 (± 0.00)
2 50.65 (± 5.36) 54.43 (± 4.42) 39.85 (± 2.07) 56.57 (± 2.16) 47.17 (± 2.42)
4 56.22 (± 4.23) 60.42 (± 1.99) 41.53 (± 2.21) 59.21 (± 3.39) 52.48 (± 2.73)
8 60.54 (± 3.85) 64.25 (± 1.58) 54.29 (± 2.28) 62.90 (± 1.91) 65.44 (± 1.10)
Middle
16 61.34 (± 3.17) 68.22 (± 1.71) 66.38 (± 1.33) 69.37 (± 1.47) 73.93 (± 1.41)
32 67.49 (± 2.88) 71.60 (± 0.75) 73.02 (± 0.74) 71.47 (± 0.89) 77.86 (± 0.88)
64 71.58 (± 2.33) 73.33 (± 0.48) 77.60 (± 0.66) 74.44 (± 0.52) 80.88 (± 0.89)
128 75.04 (± 1.65) 75.49 (± 0.68) 80.12 (± 0.57) 76.81 (± 0.28) 83.54 (± 0.37)
Strong
256 76.81 (± 3.05) 77.57 (± 0.52) 82.78 (± 0.41) 78.74 (± 0.37) 85.59 (± 0.33)
512 82.38 (± 2.19) 79.05 (± 0.37) 84.19 (± 0.17) 80.29 (± 0.31) 87.41 (± 0.25)
1024 84.51 (± 2.34) 80.56 (± 0.38) 86.00 (± 0.25) 81.99 (± 0.26) 88.81 (± 0.20)
2048 87.13 (± 2.06) 81.91 (± 0.24) 87.14 (± 0.17) 83.59 (± 0.26) 89.93 (± 0.13)
4096 88.97 (± 0.56) 83.06 (± 0.18) 88.35 (± 0.08) 85.01 (± 0.11) 91.03 (± 0.16)
Full 89.54 (± 0.33) 83.52 88.84 85.77 91.37
(a) Local Saab filters in Hop-1 (b) Local Saab filters in Hop-2
(c) Global Saab filters in Hop-1 (d) Global Saab filters in Hop-2
Figure 6.10: The plot of Frobenius norms of difference matrices between the covariance matrices
learned under different supervision levels and the one learned from the full set for MNIST.
(a) MNIST
(b) Fashion-MNIST
Figure 6.11: IoU scores between the feature sets selected using the full training size and using N_c on MNIST and Fashion-MNIST.
Figure 6.12: Learning curve using LeNet-5 on MNIST dataset with selected supervision levels.
6.5 Discussion
The superiority of hybrid IPHop under both weak and strong supervision conditions is clearly
demonstrated in Fig. 6.9. We would like to provide some explanations in this section.
• Robustness in Representation Learning
The IPHop representation is determined by Saab filters. Saab filters are obtained by PCA, which is an eigen-analysis of the covariance matrix of input vectors. If the covariance matrix converges fast as the training sample number increases, then IPHop's feature learning is robust with respect to the supervision degree. We show the Frobenius norm of the difference matrix between the covariance matrix derived from N_c training images and that derived from the full training size in Fig. 6.10. There are four cases; namely, the local and global Saab filters in Hop-1 and Hop-2, respectively. The results are averaged over 5 runs. We see that the Frobenius norm of the difference covariance matrices is already small even for N_c = 1. This is because one image contains many small patches which contribute to a robust covariance matrix.
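A sketch of this measurement is given below (Python with NumPy; patch collection is abstracted away and the variable names are placeholders):

import numpy as np

def cov_frobenius_gap(patches_subset, patches_full):
    # patches_*: (num_patches, dim) arrays of flattened neighborhoods used to
    # learn the Saab filters, collected from N_c images and from the full set.
    diff = np.cov(patches_subset, rowvar=False) - np.cov(patches_full, rowvar=False)
    return np.linalg.norm(diff, ord='fro')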
• Robustness in Feature Learning
To demonstrate the robustness of DFT, we measure the overlap between the feature set selected based on N_c training samples and that selected based on the full training size (N_c = 6K). These two sets are denoted by {F}_{N_c} and {F}_{full}, respectively. We define an intersection-over-union (IoU) score as

IoU_{N_c} = \frac{ |\{F\}_{full} \cap \{F\}_{N_c}| }{ |\{F\}_{full} \cup \{F\}_{N_c}| },    (6.3)

where the numerator represents the number of features agreed upon by the two subsets while the denominator represents the number of features selected by at least one of the two subsets. For each N_c, there exists randomness in selecting a subset of labeled samples. To eliminate the randomness, we calculate the averaged IoU values over 10 runs. The IoU values of selecting the top 200-D and 400-D from the 1024-D HOG features for MNIST and Fashion-MNIST, respectively, are shown in Fig. 6.11. We see that, as the number of labeled data increases, the IoU score increases. With a small N_c value (say, 32), the IoU score can already reach 90%. As compared with the full training set (i.e., N_c = 6K), this corresponds to only about 0.5% of the labeled data. It clearly shows that DFT is a semi-supervised feature selection tool and that it can work well under very weak supervision conditions.
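The IoU of Eq. (6.3) over selected feature indices can be computed as in the sketch below (Python; the index sets are placeholders):

def feature_iou(selected_subset, selected_full):
    # selected_*: iterables of selected feature indices, e.g. the top 200-D chosen
    # by DFT with N_c samples versus with the full training set.
    a, b = set(selected_subset), set(selected_full)
    return len(a & b) / len(a | b)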
• Robustness in Decision Learning
The KNN classifier is used when the training sample number is small. It is an exemplar-based classifier. Instead of minimizing a loss using labeled data, it finds the most similar training samples based on the Euclidean distance in the feature space. However, it cannot capture the data manifold that often lies in a higher dimensional feature space. As the number of labeled data increases, XGBoost becomes more powerful. It minimizes the cross-entropy loss with a gradient boosting technique. XGBoost is a decision tool based on ensemble learning, which explains its robust decision behavior.
On the other hand, it is desired to understand the behavior of LeNet-5 under different supervision levels. We show the learning curves of LeNet-5 on MNIST using N_c = 2^{n_c}, n_c = 0, 1, ..., 12, in Fig. 6.12, which are expressed as the cross-entropy loss as a function of the epoch number. The batch size in each epoch is set to the total number of labeled images if N_c ≤ 16 and to 128 if N_c > 16. The loss curves are averaged over 5 random runs. The loss decreases slowly and converges to a higher loss value when N_c is small. In contrast, it decreases faster and converges to a lower loss value when N_c is larger. Clearly, the learning performance of LeNet-5, in terms of both convergence rate and converged loss value, is highly dependent on the number of labeled samples.
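For reference, the LeNet-5 training setup described above can be sketched as follows (Python with PyTorch; the model and dataset objects are placeholders and the function name is ours):

import torch
from torch.utils.data import DataLoader

def train_lenet5(model, train_set, N_c, epochs=50):
    # Batch size rule from the experiments: all labeled images if N_c <= 16, else 128.
    batch_size = len(train_set) if N_c <= 16 else 128
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters())
    loss_fn = torch.nn.CrossEntropyLoss()
    losses = []
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()
        losses.append(loss.item())  # last-batch cross-entropy loss per epoch
    return losses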
6.6 Conclusion
In this work, we compared the supervision scalability of three learning systems; namely, the HOG-based and IPHop-based learning systems and LeNet-5, which is a representative deep-learning system. Both HOG-based and IPHop-based learning systems work better than LeNet-5 under weak supervision. As the supervision degree goes higher, the performance gap narrows. Yet, IPHop-II still outperforms LeNet-5 on Fashion-MNIST under strong supervision.
It is well known that it is essential to have a sufficient amount of labeled data for deep learning
systems to work properly. Data augmentation and adoption of pre-trained networks are two com-
monly used techniques to overcome the problem of insufficient training data at the cost of larger
model sizes and higher computational cost. Our performance benchmarking study is only pre-
liminary. In the future, we would like to conduct further investigation by jointly considering supervision scalability, the tradeoff of accuracy, model size and computational complexity.
Chapter 7
Conclusion and Future Work
7.1 Summary
In this thesis, four completed topics were presented for natural object classification based on the successive subspace learning (SSL) methodology.
The first two parts focused on improving the image classification performance by proposing three advanced techniques: soft-label smoothing, hard sample mining, and weakly supervised attention localization based on the successive subspace learning methodology. The image classification problem was decomposed into two stages: multi-class classification in the first stage and confusion set resolution in the second stage. In the first stage, soft-label smoothing (SLS) was proposed to enhance the performance and ensure consistency of intra-hop and inter-hop pixel-level predictions. The effectiveness of SLS showed the importance of prediction agreement at different scales. To process RGB color images, we introduced a new color space named PQR, obtained by a color transform based on principal component analysis (PCA), where different channels are classified separately. Besides, a hard sample mining based retraining strategy was developed which encourages the classification framework to focus more on the recognition of difficult cases so that the performance can be further improved. In the second stage, we proposed a novel SSL-based weakly supervised attention localization method for low resolution images. Starting with no bounding box information, initial attention predictions were generated based on image-level labels in a feedforward manner. Light human-in-the-loop helped to select good examples as the weak annotation for the enhanced attention learning. Experiments showed that with only 1% of images selected, discriminative regions can be well localized through adaptive bounding box generation. The classification architecture was generalized from multi-class to binary classes in confusion set resolution, and the predictions from the attention region were ensembled with those from the original full-frame image for higher performance and better robustness.
The third work presented a novel supervised feature selection methodology inspired by information theory and the decision tree. The resulting tests are called the discriminant feature test (DFT) and the relevant feature test (RFT) for classification and regression tasks, respectively. Our proposed methods belong to the filter methods, which give a score to each dimension and select features based on feature ranking. As compared with other existing feature selection methods, DFT and RFT are effective in finding distinct feature subspaces by offering obvious elbow regions in DFT/RFT curves. Intensive experiments showed that they provide feature subspaces of significantly lower dimensions while maintaining near-optimal classification/regression performance, and that they are computationally efficient and robust to noisy input data. The proposed methods not only work on deep features generated from images, but also work on general datasets such as handcrafted features and gene expressions.
In the last work, the supervision-scalable learning systems were studied. Two families of mod-
ularized systems were proposed based on HOG features and SSL features, respectively. We dis-
cussed ways to adjust each module so that their design is more robust against the number of training
samples. Specifically, the DFT feature selection proposed in the third work was applied to the ro-
bust learning systems. Experiments and analysis showed that both HOG-based and IPHop-based
learning systems work better than LeNet-5 under weak supervision.
7.2 Future Extensions
The above study has paved some ways to improve SSL-based image classification. As for future extensions, we would like to conduct further investigation by jointly considering supervision scalability, accuracy, model size and computational complexity. Besides, our study mainly focuses on image classification with no more than 10 classes and tiny images of resolution 32 × 32. To expand our research, attention localization is of vital importance in handling datasets with high-resolution images and more than 100 classes, such as CIFAR-100 and ImageNet. Extension of the proposed attention mechanism is expected to enhance the recognition performance of SSL-based frameworks.
Besides the natural image classification studied in this thesis, AI for healthcare also requires image classification as an important step in the computer-aided diagnosis (CAD) system. The study in this field includes different medical image modalities, such as histological images, MRI scans and CT scans. AI assistance can provide doctors with higher diagnosis efficiency, while requiring the AI algorithms to remain transparent and explainable in their predictions. We would like to extend our research to medical image classification based on successive subspace learning (SSL) as future work.
For example, brain tumors draw a lot of attention in the medical field. The brain is one of the most complex organs in the human body, working with billions of cells. Instead of histopathological images, we will handle magnetic resonance imaging (MRI) data for brain tumors. Brain MRI is one of the best imaging techniques for researchers to detect tumors and model tumor progression [76]. It is very informative for the brain structure and abnormalities within the brain tissues due to the high resolution of the images [112, 67, 1]. Figure 7.1 shows an example of different sequences in brain MRI [99], including T1-weighted, T2-weighted and Fluid Attenuated Inversion Recovery (Flair).
The American Association of Neurological Surgeons (AANS) has shown that the World Health Organization (WHO) has categorized brain tumors into 4 grades (Grade I–Grade IV) based on their malignancy or benignity [72]. Benign tumors are less aggressive and are usually associated with long-term survival. Malignant tumors grow faster and are more difficult and critical to treat. They can originate as primary malignant tumors in the brain, or be secondary malignant tumors which originate from other parts of the body and spread to the brain. They can be further divided into different classes and sub-classes. For example, glioma, which is one type of primary tumor, can be classified into 17 more classes. Among all brain tumors, glioma, meningioma and pituitary tumors account for the highest incidence rates [94].
Due to the large number of tumor classes, classifying them into subtypes is a challenging research problem [27], because brain tumors exhibit high variations with respect to shape, size and intensity [21] as well as similar appearances among different types. Among them, the classification between primary and secondary gliomas, as well as between low- and high-grade gliomas, are two examples of classification problems that are of high importance for estimating clinical significance. Treatments can be adapted according to different types.
The existing work on brain tumor classification can be divided into two streams: traditional methods using radiomics features and deep learning based approaches. The former is more transparent than the latter, yet its performance is usually lower than that of deep learning. On the other hand, deep learning approaches are black boxes with large model sizes and long training times, which are not practical for clinical use. Therefore, in future work, we intend to continue exploring the successive subspace learning method for brain tumor classification. Some successful examples of the application of SSL to medical image analysis can be found in [71, 77, 70].
Figure 7.1: An example of different sequences for brain MRI: (a) T1-weighted, (b) T2-weighted and (c) Flair.
Figure 7.2: An example of different types of brain tumors: (a) Normal; (b) Glioma Grade I; (c)
Glioma Grade III; (d) Glioma Grade IV; (e) Meningioma; (f) Metastatic Adenocarcinoma; (g)
Metastatic Bronchogenic carcinoma; (h) Sarcoma.
As for feature extraction, we can apply PixelHop++ to the 2D problem or VoxelHop to the 3D space. Different from existing filter banks for image processing, the PixelHop++ filters are learnt from the data statistics. A more transparent classification framework with high performance is expected to be developed to solve brain tumor classification and assist doctors in diagnosis.
Besides, in the medical field, the rule of big data is not always applicable. This is because the collection of labeled data is very expensive and laborious, and the privacy of data from different institutions must also be considered. Thus, the strong supervision condition sets a very high bar for the study and application of many existing deep learning frameworks. In contrast, our study in this thesis has shed some light on the design of supervision-scalable systems that can still work well under middle and weak supervision. It is worthwhile to extend our work to the healthcare field so that it adapts to tasks with less labeled data while providing a transparent and lightweight solution.
Bibliography
[1] Nyoman Abiwinanda, Muhammad Hanif, S Tafwida Hesaputra, Astri Handayani, and
Tati Rajab Mengko. Brain tumor classification using convolutional neural network. In
World congress on medical physics and biomedical engineering 2018, pages 183–189.
Springer, 2019.
[2] Charu C Aggarwal, Joel L Wolf, Philip S Yu, Cecilia Procopiuc, and Jong Soo Park. Fast
algorithms for projected clustering. ACM SIGMoD Record, 28(2):61–72, 1999.
[3] Uri Alon, Naama Barkai, Daniel A Notterman, Kurt Gish, Suzanne Ybarra, Daniel Mack,
and Arnold J Levine. Broad patterns of gene expression revealed by clustering analysis
of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the
National Academy of Sciences, 96(12):6745–6750, 1999.
[4] Dana Angluin and Philip Laird. Learning from noisy examples. Machine Learning,
2(4):343–370, 1988.
[5] Loris Bazzani, Alessandra Bergamo, Dragomir Anguelov, and Lorenzo Torresani. Self-
taught object localization with deep networks. In 2016 IEEE winter conference on applica-
tions of computer vision (WACV), pages 1–9. IEEE, 2016.
[6] Alsallakh Bilal, Amin Jourabloo, Mao Ye, Xiaoming Liu, and Liu Ren. Do convolutional
neural networks learn class hierarchy? IEEE transactions on visualization and computer
graphics, 24(1):152–162, 2017.
[7] Steve Branson, Catherine Wah, Florian Schroff, Boris Babenko, Peter Welinder, Pietro Per-
ona, and Serge Belongie. Visual recognition with humans in the loop. In European Confer-
ence on Computer Vision, pages 438–451. Springer, 2010.
[8] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
[9] Peter Bühlmann and Torsten Hothorn. Boosting algorithms: Regularization, prediction and model fitting. Statistical science, 22(4):477–505, 2007.
[10] Peter Bühlmann and Bin Yu. Boosting with the L2 loss: regression and classification. Journal of the American Statistical Association, 98(462):324–339, 2003.
[11] Deng Cai, Chiyuan Zhang, and Xiaofei He. Unsupervised feature selection for multi-cluster
data. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge
discovery and data mining, pages 333–342, 2010.
[12] Girish Chandrashekar and Ferat Sahin. A survey on feature selection methods. Computers
& Electrical Engineering, 40(1):16–28, 2014.
[13] Chen Chen, Shangwen Li, Xiang Fu, Yuzhuo Ren, Yueru Chen, and C-C Jay Kuo. Ex-
ploring confusing scene classes for the places dataset: Insights and solutions. In 2017
Asia-Pacific Signal and Information Processing Association Annual Summit and Confer-
ence (APSIPA ASC), pages 550–558. IEEE, 2017.
[14] Hong-Shuo Chen, Mozhdeh Rouhsedaghat, Hamza Ghani, Shuowen Hu, Suya You, and C-
C Jay Kuo. Defakehop: A light-weight high-performance deepfake detector. In 2021 IEEE
International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2021.
[15] Tianqi Chen, Tong He, Michael Benesty, Vadim Khotilovich, Yuan Tang, Hyunsu Cho,
Kailong Chen, et al. Xgboost: extreme gradient boosting. R package version 0.4-2, 1(4):1–
4, 2015.
[16] Yinbo Chen, Zhuang Liu, Huijuan Xu, Trevor Darrell, and Xiaolong Wang. Meta-baseline:
exploring simple meta-learning for few-shot learning. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, pages 9062–9071, 2021.
[17] Yueru Chen and C-C Jay Kuo. Pixelhop: A successive subspace learning (ssl) method
for object recognition. Journal of Visual Communication and Image Representation,
70:102749, 2020.
[18] Yueru Chen, Mozhdeh Rouhsedaghat, Suya You, Raghuveer Rao, and C-C Jay Kuo. Pix-
elhop++: A small successive-subspace-learning-based (ssl-based) model for image classifi-
cation. In 2020 IEEE International Conference on Image Processing (ICIP), pages 3294–
3298. IEEE, 2020.
[19] Yueru Chen, Zhuwei Xu, Shanshan Cai, Yujian Lang, and C-C Jay Kuo. A saak transform
approach to efficient, scalable and robust handwritten digits recognition. In 2018 Picture
Coding Symposium (PCS), pages 174–178. IEEE, 2018.
[20] Yueru Chen, Yijing Yang, Wei Wang, and C-C Jay Kuo. Ensembles of feedforward-
designed convolutional neural networks. In 2019 IEEE International Conference on Image
Processing (ICIP), pages 3796–3800. IEEE, 2019.
[21] Jun Cheng, Wei Huang, Shuangliang Cao, Ru Yang, Wei Yang, Zhaoqiang Yun, Zhijian
Wang, and Qianjin Feng. Enhanced performance of brain tumor classification via tumor
region augmentation and partition. PloS one, 10(10):e0140381, 2015.
[22] Ramazan Gokberk Cinbis, Jakob Verbeek, and Cordelia Schmid. Weakly supervised ob-
ject localization with multi-fold multiple instance learning. IEEE transactions on pattern
analysis and machine intelligence, 39(1):189–203, 2016.
[23] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning,
20(3):273–297, 1995.
[24] David R Cox. The regression analysis of binary sequences. Journal of the Royal Statistical
Society: Series B (Methodological), 20(2):215–232, 1958.
[25] David J Crandall and Daniel P Huttenlocher. Weakly supervised learning of part-based
spatial models for visual object recognition. In European conference on computer vision,
pages 16–29. Springer, 2006.
[26] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection.
In 2005 IEEE computer society conference on computer vision and pattern recognition
(CVPR’05), volume 1, pages 886–893. Ieee, 2005.
[27] S Deepak and PM Ameer. Brain tumor classification using deep cnn features via transfer
learning. Computers in biology and medicine, 111:103345, 2019.
[28] Jia Deng, Jonathan Krause, and Li Fei-Fei. Fine-grained crowdsourcing for fine-grained
recognition. In Proceedings of the IEEE conference on computer vision and pattern recog-
nition, pages 580–587, 2013.
[29] Chris Ding and Hanchuan Peng. Minimum redundancy feature selection from microarray
gene expression data. Journal of bioinformatics and computational biology, 3(02):185–205,
2005.
[30] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
[31] Kun Duan, Devi Parikh, David Crandall, and Kristen Grauman. Discovering localized at-
tributes for fine-grained recognition. In 2012 IEEE conference on computer vision and
pattern recognition, pages 3474–3481. IEEE, 2012.
[32] Michael Fink. Object classification from a single example utilizing class relevance metrics.
Advances in neural information processing systems, 17, 2004.
[33] James Foulds and Eibe Frank. A review of multi-instance learning assumptions. The knowl-
edge engineering review, 25(1):1–25, 2010.
[34] Benoît Frénay and Michel Verleysen. Classification in the presence of label noise: a survey. IEEE transactions on neural networks and learning systems, 25(5):845–869, 2013.
[35] Yoav Freund, Robert Schapire, and Naoki Abe. A short introduction to boosting. Journal-
Japanese Society For Artificial Intelligence , 14(771-780):1612, 1999.
[36] Yoav Freund, Robert E Schapire, et al. Experiments with a new boosting algorithm. In icml,
volume 96, pages 148–156. Citeseer, 1996.
[37] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a
statistical view of boosting (with discussion and a rejoinder by the authors). The annals of
statistics, 28(2):337–407, 2000.
[38] Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals
of statistics, pages 1189–1232, 2001.
[39] H Altay Guvenir, Burak Acar, Gulsen Demiroz, and Ayhan Cekin. A supervised machine
learning algorithm for arrhythmia analysis. In Computers in Cardiology 1997, pages 433–
436. IEEE, 1997.
[40] Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. Journal of machine learning research, 3(Mar):1157–1182, 2003.
[41] Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. Gene selection
for cancer classification using support vector machines. Machine learning, 46(1):389–422,
2002.
[42] PC Hammer. Adaptive control processes: a guided tour (r. bellman), 1962.
[43] John A Hartigan. Direct clustering of a data matrix. Journal of the american statistical
association, 67(337):123–129, 1972.
[44] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The elements of statistical learning. Springer, 2009.
[45] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for
image recognition. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 770–778, 2016.
[46] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias
Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural
networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[47] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely
connected convolutional networks. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 4700–4708, 2017.
[48] Samuel H Huang. Supervised feature selection: A tutorial. Artif. Intell. Res., 4(2):22–37,
2015.
[49] Shaoli Huang, Zhe Xu, Dacheng Tao, and Ya Zhang. Part-stacked cnn for fine-grained vi-
sual categorization. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 1173–1182, 2016.
[50] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.
[51] Anil Jain and Douglas Zongker. Feature selection: Evaluation, application, and small
sample performance. IEEE transactions on pattern analysis and machine intelligence,
19(2):153–158, 1997.
[52] Saumya Jetley, Nicholas A Lord, Namhoon Lee, and Philip HS Torr. Learn to pay attention.
arXiv preprint arXiv:1804.02391, 2018.
[53] Pranav Kadam, Min Zhang, Shan Liu, and C-C Jay Kuo. Unsupervised point cloud regis-
tration via salient points analysis (spa). In 2020 IEEE International Conference on Visual
Communications and Image Processing (VCIP), pages 5–8. IEEE, 2020.
[54] Pranav Kadam, Min Zhang, Shan Liu, and C-C Jay Kuo. R-pointhop: A green, accurate
and unsupervised point cloud registration method. arXiv preprint arXiv:2103.08129, 2021.
[55] Pranav Kadam, Min Zhang, Shan Liu, and C-C Jay Kuo. R-pointhop: A green, accurate,
and unsupervised point cloud registration method. IEEE Transactions on Image Processing,
31:2710–2725, 2022.
[56] Ron Kohavi and George H John. Wrappers for feature subset selection. Artificial intelli-
gence, 97(1-2):273–324, 1997.
[57] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny im-
ages. Technical report, University of Toronto, Toronto, Ontario, 2009.
[58] C-C Jay Kuo. Understanding convolutional neural networks with a mathematical model.
Journal of Visual Communication and Image Representation, 41:406–413, 2016.
[59] C-C Jay Kuo. The cnn as a guided multilayer recos transform [lecture notes]. IEEE signal
processing magazine, 34(3):81–89, 2017.
[60] C-C Jay Kuo and Yueru Chen. On data-driven saak transform. Journal of Visual Communi-
cation and Image Representation, 50:237–246, 2018.
[61] C-C Jay Kuo, Min Zhang, Siyang Li, Jiali Duan, and Yueru Chen. Interpretable convolu-
tional neural networks via feedforward design. Journal of Visual Communication and Image
Representation, 2019.
[62] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[63] Daniel D Lee and H Sebastian Seung. Learning the parts of objects by non-negative matrix
factorization. Nature, 401(6755):788–791, 1999.
[64] Xuejing Lei, Ganning Zhao, and C-C Jay Kuo. Nites: A non-parametric interpretable tex-
ture synthesis method. In 2020 Asia-Pacific Signal and Information Processing Association
Annual Summit and Conference (APSIPA ASC), pages 1698–1706. IEEE, 2020.
[65] Shangwen Li, Chen Chen, Yuzhuo Ren, and C-C Jay Kuo. Improving object classification
performance via confusing categories study. In 2018 IEEE Winter Conference on Applica-
tions of Computer Vision (WACV), pages 1774–1783. IEEE, 2018.
[66] Di Lin, Xiaoyong Shen, Cewu Lu, and Jiaya Jia. Deep lac: Deep localization, alignment
and classification for fine-grained recognition. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 1666–1674, 2015.
[67] Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen Awm Van Der Laak, Bram Van Ginneken, and Clara I Sánchez. A survey on deep learning in medical image analysis. Medical image analysis, 42:60–88, 2017.
[68] Lingqiao Liu, Chunhua Shen, and Anton van den Hengel. The treasure beneath convolu-
tional layers: Cross-convolutional-layer pooling for image classification. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4749–4757,
2015.
[69] Weiwei Liu, Ivor W Tsang, and Klaus-Robert Müller. An easy-to-hard learning paradigm for multiple classes and multiple labels. The Journal of Machine Learning Research, 18, 2017.
[70] Xiaofeng Liu, Fangxu Xing, Hanna K Gaggin, Weichung Wang, C-C Jay Kuo, Georges El
Fakhri, and Jonghye Woo. Segmentation of cardiac structures via successive subspace learn-
ing with saab transform from cine mri. arXiv preprint arXiv:2107.10718, 2021.
[71] Xiaofeng Liu, Fangxu Xing, Chao Yang, C-C Jay Kuo, Suma Babu, Georges El Fakhri, Thomas Jenkins, and Jonghye Woo. Voxelhop: Successive subspace learning for als disease classification using structural mri. arXiv preprint arXiv:2101.05131, 2021.
[72] David N Louis, Hiroko Ohgaki, Otmar D Wiestler, Webster K Cavenee, Peter C Burger,
Anne Jouvet, Bernd W Scheithauer, and Paul Kleihues. The 2007 who classification of
tumours of the central nervous system. Acta neuropathologica, 114(2):97–109, 2007.
[73] Abinaya Manimaran, Thiyagarajan Ramanathan, Suya You, and C-C Jay Kuo. Visualiza-
tion, discriminability and applications of interpretable saak features. Journal of Visual Com-
munication and Image Representation, 66:102699, 2020.
[74] Jianyu Miao and Lingfeng Niu. A survey on feature selection. Procedia Computer Science,
91:919–926, 2016.
[75] Pabitra Mitra, CA Murthy, and Sankar K. Pal. Unsupervised feature selection using feature
similarity. IEEE transactions on pattern analysis and machine intelligence, 24(3):301–312,
2002.
[76] Heba Mohsen, El-Sayed A El-Dahshan, El-Sayed M El-Horbaty, and Abdel-Badeeh M
Salem. Classification using deep learning neural networks for brain tumors. Future Com-
puting and Informatics Journal, 3(1):68–71, 2018.
[77] Masoud Monajatipoor, Mozhdeh Rouhsedaghat, Liunian Harold Li, Aichi Chien, C-C Jay
Kuo, Fabien Scalzo, and Kai-Wei Chang. Berthop: An effective vision-and-language model
for chest x-ray disease diagnosis. arXiv preprint arXiv:2108.04938, 2021.
[78] Kevin P Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.
[79] Jim Mutch and David G Lowe. Multiclass object recognition with sparse, localized features.
In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR’06), volume 1, pages 11–18. IEEE, 2006.
[80] Maxime Oquab, Léon Bottou, Ivan Laptev, and Josef Sivic. Is object localization for free? Weakly-supervised learning with convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 685–694, 2015.
[81] Hanchuan Peng, Fuhui Long, and Chris Ding. Feature selection based on mutual informa-
tion criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions
on pattern analysis and machine intelligence, 27(8):1226–1238, 2005.
[82] Mingjie Qian and Chengxiang Zhai. Robust unsupervised feature selection. In Twenty-third
international joint conference on artificial intelligence . Citeseer, 2013.
[83] Mozhdeh Rouhsedaghat, Masoud Monajatipoor, Zohreh Azizi, and C-C Jay Kuo. Succes-
sive subspace learning: An overview. arXiv preprint arXiv:2103.00121, 2021.
[84] Mozhdeh Rouhsedaghat, Yifan Wang, Xiou Ge, Shuowen Hu, Suya You, and C-C Jay Kuo.
Facehop: A light-weight low-resolution face gender classification method. arXiv preprint
arXiv:2007.09510, 2020.
[85] Robert E Schapire. The strength of weak learnability. Machine learning, 5(2):197–227,
1990.
[86] Henry Scheffe. The analysis of variance, volume 72. John Wiley & Sons, 1999.
[87] Paul Schnitzspan, Mario Fritz, and Bernt Schiele. Hierarchical support vector random
fields: Joint training to combine local and global features. In European conference on
computer vision, pages 527–540. Springer, 2008.
[88] Paul Hongsuck Seo, Zhe Lin, Scott Cohen, Xiaohui Shen, and Bohyung Han. Progressive
attention networks for visual attribute prediction. arXiv preprint arXiv:1606.02393, 2016.
[89] Yian Seo and Kyung-shik Shin. Hierarchical convolutional neural networks for fashion
image classification. Expert systems with applications, 116:328–339, 2019.
[90] Razieh Sheikhpour, Mehdi Agha Sarram, Sajjad Gharaghani, and Mohammad Ali Zare
Chahooki. A survey on semi-supervised feature selection methods. Pattern Recognition,
64:141–158, 2017.
[91] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556, 2014.
[92] Saúl Solorio-Fernández, J Ariel Carrasco-Ochoa, and José Fco Martínez-Trinidad. A review of unsupervised feature selection methods. Artificial Intelligence Review, 53(2):907–948, 2020.
[93] Qianru Sun, Yaoyao Liu, Tat-Seng Chua, and Bernt Schiele. Meta-transfer learning for few-
shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 403–412, 2019.
[94] Zar Nawab Khan Swati, Qinghua Zhao, Muhammad Kabir, Farman Ali, Zakir Ali, Saeed
Ahmed, and Jianfeng Lu. Content-based brain tumor retrieval for mr images using transfer
learning. IEEE Access, 7:17809–17822, 2019.
[95] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir
Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper
with convolutions. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 1–9, 2015.
[96] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neu-
ral networks. In International Conference on Machine Learning, pages 6105–6114. PMLR,
2019.
[97] Jiliang Tang, Salem Alelyani, and Huan Liu. Feature selection for classification: A review.
Data classification: Algorithms and applications , page 37, 2014.
[98] Tzu-Wei Tseng, Kai-Jiun Yang, C-C Jay Kuo, and Shang-Ho Tsai. An interpretable com-
pression and classification system: Theory and applications. IEEE Access, 8:143962–
143974, 2020.
[99] MRI: T1 vs T2. https://helpary.wordpress.com/2019/02/26/mri-t1-vs-t2/.
[100] Martijn Van Breukelen and Robert PW Duin. Neural network initialization by combined
classifiers. In Proceedings. Fourteenth International Conference on Pattern Recognition
(Cat. No. 98EX170), volume 1, pages 215–218. IEEE, 1998.
[101] Martijn van Breukelen, Robert PW Duin, David MJ Tax, and JE Den Hartog. Handwritten
digit recognition by combined classifiers. Kybernetika, 34(4):381–386, 1998.
[102] Jesper E Van Engelen and Holger H Hoos. A survey on semi-supervised learning. Machine
Learning, 109(2):373–440, 2020.
[103] B Venkatesh and J Anuradha. A review of feature selection and its methods. Cybernetics
and Information Technologies, 19(1):3–26, 2019.
[104] Chong Wang, Weiqiang Ren, Kaiqi Huang, and Tieniu Tan. Weakly supervised object local-
ization with latent category learning. In European Conference on Computer Vision, pages
431–445. Springer, 2014.
[105] Lezi Wang, Ziyan Wu, Srikrishna Karanam, Kuan-Chuan Peng, Rajat Vikram Singh,
Bo Liu, and Dimitris N Metaxas. Sharpen focus: Learning with attention separability and
consistency. In Proceedings of the IEEE/CVF International Conference on Computer Vi-
sion, pages 512–521, 2019.
[106] Xiu-Shen Wei, Jianxin Wu, and Zhi-Hua Zhou. Scalable algorithms for multi-instance
learning. IEEE transactions on neural networks and learning systems, 28(4):975–987,
2016.
[107] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
[108] Tian Xie, Bin Wang, and C-C Jay Kuo. Graphhop: An enhanced label propagation method
for node classification. arXiv preprint arXiv:2101.02326, 2021.
[109] Zhicheng Yan, Hao Zhang, Robinson Piramuthu, Vignesh Jagadeesh, Dennis DeCoste, Wei
Di, and Yizhou Yu. Hd-cnn: hierarchical deep convolutional neural networks for large scale
visual recognition. In Proceedings of the IEEE international conference on computer vision,
pages 2740–2748, 2015.
[110] Yijing Yang, Vasileios Magoulianitis, and C. C. Jay Kuo. E-pixelhop: An enhanced pixel-
hop method for object classification. 2021 Asia-Pacific Signal and Information Processing
Association Annual Summit and Conference (APSIPA ASC), pages 1475–1482, 2021.
[111] Yijing Yang, Wei Wang, Hongyu Fu, and C-C Jay Kuo. On supervised feature selection
from high dimensional feature spaces. arXiv preprint arXiv:2203.11924, 2022.
[112] Evangelia I Zacharaki, Sumei Wang, Sanjeev Chawla, Dong Soo Yoo, Ronald Wolf, Elias R
Melhem, and Christos Davatzikos. Classification of brain tumor type and grade using mri
texture and shape in a machine learning scheme. Magnetic Resonance in Medicine: An Offi-
cial Journal of the International Society for Magnetic Resonance in Medicine, 62(6):1609–
1618, 2009.
[113] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks.
In European conference on computer vision, pages 818–833. Springer, 2014.
[114] Kaitai Zhang, Bin Wang, Wei Wang, Fahad Sohrab, Moncef Gabbouj, and C-C Jay
Kuo. Anomalyhop: An ssl-based image anomaly localization method. arXiv preprint
arXiv:2105.03797, 2021.
[115] Min Zhang, Pranav Kadam, Shan Liu, and C-C Jay Kuo. Unsupervised feedforward fea-
ture (uff) learning for point cloud classification and segmentation. In 2020 IEEE Interna-
tional Conference on Visual Communications and Image Processing (VCIP), pages 144–
147. IEEE, 2020.
[116] Min Zhang, Yifan Wang, Pranav Kadam, Shan Liu, and C-C Jay Kuo. Pointhop++: A
lightweight learning model on point sets for 3d classification. In 2020 IEEE International
Conference on Image Processing (ICIP), pages 3319–3323. IEEE, 2020.
[117] Min Zhang, Haoxuan You, Pranav Kadam, Shan Liu, and C-C Jay Kuo. Pointhop: An
explainable machine learning method for point cloud classification. IEEE Transactions on
Multimedia, 2020.
[118] Ning Zhang, Jeff Donahue, Ross Girshick, and Trevor Darrell. Part-based r-cnns for fine-
grained category detection. In European conference on computer vision, pages 834–849.
Springer, 2014.
[119] Quanshi Zhang, Ying Nian Wu, and Song-Chun Zhu. Interpretable convolutional neural
networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recog-
nition, pages 8827–8836, 2018.
[120] Xiaopeng Zhang, Hongkai Xiong, Wengang Zhou, Weiyao Lin, and Qi Tian. Picking deep
filter responses for fine-grained image recognition. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pages 1134–1142, 2016.
[121] Zheng Zhao and Huan Liu. Semi-supervised feature selection via spectral analysis. In
Proceedings of the 2007 SIAM international conference on data mining, pages 641–646.
SIAM, 2007.
[122] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning
deep features for discriminative localization. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 2921–2929, 2016.
[123] Bolei Zhou, Yiyou Sun, David Bau, and Antonio Torralba. Interpretable basis decomposi-
tion for visual explanation. In Proceedings of the European Conference on Computer Vision
(ECCV), pages 119–134, 2018.
[124] Zhi-Hua Zhou. A brief introduction to weakly supervised learning. National science review,
5(1):44–53, 2018.
[125] Xiaojin Zhu and Andrew B Goldberg. Introduction to semi-supervised learning. Synthesis
lectures on artificial intelligence and machine learning , 3(1):1–130, 2009.
[126] Xinqi Zhu and Michael Bain. B-cnn: branch convolutional neural network for hierarchical
classification. arXiv preprint arXiv:1709.09890, 2017.
[127] Alon Zweig and Daphna Weinshall. Exploiting object hierarchy: Combining models from
different category levels. In 2007 IEEE 11th International Conference on Computer Vision,
pages 1–8. IEEE, 2007.
Abstract
Based on successive subspace learning (SSL) methodology which was recently developed, advanced techniques for image classification problems are proposed in this thesis. Specifically, it can be decomposed into four parts: 1) improving the performance of multi-class classification, 2) improving the performance of resolving confusing sets using attention localization, 3) enhancing the quality of the learnt feature space by conducting a novel supervised feature selection, and 4) designing supervision-scalable learning systems.
The first two parts focus on improving the image classification performance by proposing three advanced techniques: soft-label smoothing, hard sample mining, and weakly supervised attention localization based on successive subspace learning methodology. The image classification problem is decomposed into two stages: multi-class classification in the first stage and the confusing set resolution in the second stage. In the first stage, soft-label smoothing (SLS) is proposed to enhance the performance and ensure consistency of intra-hop and inter-hop pixel-level predictions. Effectiveness of SLS shows the importance of prediction agreement at different scales. To process the RGB color images, we introduce a new color space named PQR based on principal component analysis (PCA) where different channels are classified separately. Besides, a hard sample mining based retraining strategy is developed which encourages the classification framework to focus more on the recognition of difficult cases. In the second stage, we propose a new SSL-based weakly supervised attention localization method for low resolution images. Starting with no bounding box information, preliminary attention predictions are generated based on image-level labels in a feedforward manner. Light human-in-the-loop helps to select good examples as the weak annotation for the enhanced attention learning. Experiments show that with only 1% of images selected, discriminative regions can be well localized through adaptive bounding box generation. Furthermore, it is important to handle confusing classes carefully. The classification architecture is generalized from multi-class to binary classes in confusing set resolution, and the predictions from attention regions are ensembled with those from the original full-frame image for higher performance and better robustness.
The third part presents a novel supervised feature selection methodology inspired by information theory and the decision tree. The resulting tests are called the discriminant feature test (DFT) and the relevant feature test (RFT) for classification and regression tasks, respectively. Our proposed methods belong to the filter methods, which give a score to each dimension and select features based on feature ranking. As compared with other existing feature selection methods, DFT and RFT are effective in finding distinct feature subspaces by offering obvious elbow regions in DFT/RFT curves. Intensive experiments show that they provide feature subspaces of significantly lower dimensions while maintaining near-optimal classification/regression performance, and that they are computationally efficient and robust to noisy input data. The proposed methods not only work on deep features generated from images, but also work on general datasets such as handcrafted features and gene expressions.
In the last part, the supervision-scalable learning systems are studied. Two families of modularized systems are proposed based on HOG features and SSL features, respectively. We discuss ways to adjust each module so that the design is more robust against the number of training samples. Specifically, the DFT feature selection proposed in the third work is applied to the robust learning systems. Experiments and analysis show that both HOG-based and SSL-based learning systems work better than LeNet-5 under weak supervision and have performance comparable with LeNet-5 under strong supervision.