Efficient Machine Learning Techniques for Low- and High-Dimensional Data Sources
by
Hongyu Fu
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
December 2023
Copyright 2023 Hongyu Fu
Table of Contents
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Significance of the Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Contributions of the Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Novel Tree-based Classifiers/Regressors for Low-Dimensional Data Sources . . . . 3
1.2.2 Complexity Reduction for Tree-based Classifiers/Regressors . . . . . . . . . . . . . 4
1.2.3 Feed-forward Efficient Machine Learning for High-Dimensional Data Sources . . . 4
1.2.4 Efficient SLM with Soft Partitioning for High-Dimensional Data Sources . . . . . . 5
1.3 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Chapter 2: Novel Tree-based Classifiers/Regressors for Low-Dimensional Data Sources . . . . . . . 7
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Subspace Learning Machine (SLM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.2.1 Selection Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.2.2 Probabilistic Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.2.3 Selection of Multiple Projection Vectors . . . . . . . . . . . . . . . . . . . 17
2.2.2.4 SLM Tree Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 SLM Forest and SLM Boost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.1 SLM Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.2 SLM Boost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.4 Performance Evaluation of SLM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.5 Subspace Learning Regressor (SLR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.5.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.5.2 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.6 Comments on Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.6.1 Classification and Regression Tree (CART) . . . . . . . . . . . . . . . . . . . . . . . 33
2.6.2 Random Forest (RF) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.6.3 Gradient Boosting Decision Tree (GBDT) . . . . . . . . . . . . . . . . . . . . . . . 34
2.6.4 Multilayer Perceptron (MLP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.6.5 Extreme Learning Machine (ELM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.8 Appendix: Calculation of the Model Size . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Chapter 3: Complexity Reduction for Tree-based Classifiers/Regressors for Low-Dimensional
Data Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2 Acceleration via Adaptive Particle Swarm Optimization . . . . . . . . . . . . . . . . . . . 41
3.2.1 Introduction to Particle Swarm Optimization (PSO) . . . . . . . . . . . . . . . . . . 42
3.2.2 Adaptive PSO (APSO) in SLM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3 Acceleration via Parallel Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Chapter 4: Feedforward Efficient Machine Learning for High-Dimensional Data Sources . . . . . . 52
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2 Background Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3 Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.3.1 Representation Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3.1.1 Successive Subspace Learning and Filter Banks . . . . . . . . . . . . . . . 57
4.3.1.2 Saab Transform and Saab Filter . . . . . . . . . . . . . . . . . . . . . . . 58
4.3.1.3 Channelwise Saab Transform . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3.1.4 Haar Transform and 2-D Haar Filter . . . . . . . . . . . . . . . . . . . . . 61
4.3.2 Feature Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.3.2.1 Unsupervised Feature Dimension Reduction . . . . . . . . . . . . . . . . 63
4.3.2.2 Supervised Feature Dimension Reduction . . . . . . . . . . . . . . . . . . 64
4.3.3 Decision Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.3.3.1 Supplemental Feature Generation . . . . . . . . . . . . . . . . . . . . . . 65
4.3.3.2 Complementary Feature Learning . . . . . . . . . . . . . . . . . . . . . . 66
4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.4.2 Feature Representation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.4.3 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.4.4 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Chapter 5: Efficient SLM with Soft Partitioning for High-Dimensional Data Sources . . . . . . . . . 79
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.2 Background Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.2.1 Soft Decision Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.2.2 Automatic Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.2.3 Efficient Bottleneck Design for Local Feature Learning . . . . . . . . . . . . . . . . 85
5.3 SLM Tree with Soft Partition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.3.1 Overview of SLM/SP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.3.2 Design of SLM/SP Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.3.2.1 Topology and Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.3.2.2 Hierarchical Tree Determination . . . . . . . . . . . . . . . . . . . . . . . 93
5.3.2.3 Module Parameters Determination . . . . . . . . . . . . . . . . . . . . . 95
5.3.3 Probabilistic Inference with SLM/SP Tree . . . . . . . . . . . . . . . . . . . . . . . 95
5.3.3.1 Full-Path Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.3.3.2 Single-Path Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.4 Image Classification with SLM/SP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.4.1 Design of Efficient Feature Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.4.1.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.4.1.2 Subspace Learning enhAnced Block (SLAB) for Efficient Local Representation Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.4.2 SLM/SP Module Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.4.2.1 SLM/SP for Supervised Decision Learning . . . . . . . . . . . . . . . . . 101
5.4.2.2 SLM/SP with Supervised Local Representation Learning . . . . . . . . . 101
5.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.5.1 Results on SLM/SP decision learning with GL features . . . . . . . . . . . . . . . . 102
5.5.1.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.5.1.2 Experimental Results and Discussion . . . . . . . . . . . . . . . . . . . . 105
5.5.2 Results on SLM/SP with Supervised Local Representation Learning . . . . . . . . . 106
5.5.2.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.5.2.2 Experimental Results and Discussion . . . . . . . . . . . . . . . . . . . . 109
5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Chapter 6: Summary and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6.1 Summary of the Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6.2 Future Research Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.2.1 Efficient Object Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.2.2 Efficient Image Classification on ImageNet Challenge . . . . . . . . . . . . . . . . . 117
6.2.3 Machine Learning for Deficient Data Source . . . . . . . . . . . . . . . . . . . . . . 118
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
List of Tables
2.1 Classification accuracy comparison of 10 benchmarking methods on nine datasets. . . . . . 27
2.2 Model size comparison of five machine learning models on nine datasets, where
the smallest model size for each dataset is highlighted in bold. . . . . . . . . . . . . . . . . 28
2.3 Comparison of regression performance of eight regressors on six datasets. . . . . . . . . . 31
3.1 Comparison of training run-time (in seconds) of 6 settings for 9 classification datasets. . . 46
3.2 Comparison of training run-time (in seconds) of 6 settings for 6 regression datasets. . . . . 47
3.3 Comparison of classification accuracy (in terms of %) of SLM, SLM Forest and SLM Boost
with probabilistic search and APSO acceleration methods on 9 classification datasets. . . . . 49
3.4 Comparison of regression mean-squared errors (MSE) of SLR, SLR Forest and SLR Boost
with probabilistic search and APSO acceleration methods on 6 regression datasets. . . . . . . 50
4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2 Evaluation results on Saab filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.3 Evaluation results on Haar filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.4 Comparison of the original and the modified LeNet-5 architectures on benchmark datasets 73
4.5 Comparison of 32×32 and 96×96 Input for STL10 Datasets for Deep Models . . . . . . . . 74
4.6 CIFAR10 & STL10 Results Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.7 MNIST & Fashion-MNIST Results Comparison . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.8 Ablation Study for SFG and CFL modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.1 Performance Comparison for CIFAR-10 with GL. . . . . . . . . . . . . . . . . . . . . . . . . 103
5.2 Performance Comparison for STL-10 with GL. . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.3 Performance Comparison for MNIST with GL. . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.4 Performance Comparison for Fashion-MNIST with GL. . . . . . . . . . . . . . . . . . . . . 104
5.5 Performance Comparison for MNIST with SLM-SLAB. . . . . . . . . . . . . . . . . . . . . 111
5.6 Performance Comparison for CIFAR10 with SLM-SLAB. . . . . . . . . . . . . . . . . . . . . 112
5.7 Performance Comparison for CIFAR100 with SLM-SLAB. . . . . . . . . . . . . . . . . . . . 112
5.8 Performance Comparison for Tiny-Imagenet with SLM-SLAB. . . . . . . . . . . . . . . . . 112
List of Figures
2.1 Decision combination and feature combination are two pillars for high performance
classical machine learning methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 (a) An illustration of SLM, where space S_0 is partitioned into 4 subspaces with two splits,
and (b) the corresponding SLM tree with a root node and four child nodes. . . . . . . . . . 11
2.3 Illustration of the probabilistic selection process, where the envelope function A_d provides
a bound on the magnitude of coefficients a′_d in the orientation vector. The dimensions
with black dots are selected dimensions, and dots in one vertical line are integers for
selection for each dimension. In this example, the selected dimensions are a′_1, a′_2, a′_4,
a′_5 and a′_6. For each trial, we select one black dot per vertical line to form an
orientation vector. The search can be done exhaustively or randomly with the uniform
distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4 An overview of the SLM system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5 Visualization of 2D feature datasets: (a) Circle & Ring (b) 2-new-moon, (c) 4-new-moon.
One ground truth sample of the training data, the ground truth of the test data and the
SLM predicted results are shown in the first, second and third rows, respectively. . . . . . 26
2.6 Comparison of SLM and DT ensembles for three datasets (a) Wine, (b) B.C.W., and (c)
Pima. Each left subfigure compares the accuracy curves of SLM Forest and RF as a function
of the tree number. Each right subfigure compares the logloss curves of SLM Boost and
XGBoost as a function of the tree number. . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.1 The overview of the Efficient Object Recognition System with example RGB image and
5 × 5 filter size in SSL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2 The block diagram of single stage Saab transform on one spatial location . . . . . . . . . . 59
4.3 (a) Comparison of the traditional Saab transform and the channelwise Saab transform. (b)
Illustration of the tree-decomposed Saab feature . . . . . . . . . . . . . . . . . . . . . . . . 60
4.4 2-D Haar Filters (a) DC component, (b) Horizontal Difference, (c) Vertical Difference, (d)
Diagonal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.5 Illustration of spatial and spectral pooling from feature map . . . . . . . . . . . . . . . . . 63
4.6 Illustration of Discriminant Feature Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.7 Examples from four datasets. (a) MNIST, (b) Fashion-MNIST, (c)CIFAR-10, (d) STL-10 . . . 67
4.8 Correlation between DFT selection and GBDT performance at each stage of the Saab
Transform with 5×5 filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.9 Convergence comparison between SFG and no SFG on CIFAR10 and STL10 datasets . . . . 77
5.1 The comparison of the decision process between a Hard Decision Tree and Soft Decision
Tree. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.2 The overview of an example SLM/SP tree. . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.3 The hierarchical determination of SLM/SP tree . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.4 Image Classification Framework with SLM/SP Tree . . . . . . . . . . . . . . . . . . . . . . . 97
5.5 Overview of the SLAB. The low-dimensional manifold-of-interest feature space is of
dimension H × W × C; the expanded high-dimensional feature space for dense spatial and
spectral local representation learning is of dimension H × W × C′. . . . . . . . . . . . . . .
5.6 Examples from four datasets. (a) MNIST, (b) CIFAR-10, (c)CIFAR-100, (d) Tiny ImageNet . 107
6.1 Overview of the representative object detection methods. (a) YOLO (b) Faster-RCNN (c)
SSD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.2 Example images from ImageNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
Abstract
Classification and regression are two fundamental problems in machine learning and statistics. Both are widely used in various fields, such as natural language processing, audio and speech recognition, and computer vision. While deep learning methods, in which feature extraction and classification/regression are handled jointly, have made significant advancements, feature-based standalone machine learning methods for general classification and regression remain an important area of research. These methods provide a fundamental understanding of how to model data to solve classification and regression problems effectively with given input features, and they can be useful for tasks where the amount of data is limited or when interpretability of the model is crucial. In this thesis, we focus on efficient machine learning methods for low-dimensional and high-dimensional data sources.

For low-dimensional data sources, we propose a novel feature-based, high-performance machine learning method named the Subspace Learning Machine (SLM). SLM aims to solve general classification and regression problems with high performance and low model complexity by combining the efficiency of decision trees and the effectiveness of multi-layer perceptrons. For high-dimensional data sources, we propose an efficient feed-forward machine learning framework and an efficient adaptive SLM design with soft partitioning (SLM/SP), both for image classification. Our proposed methods offer lightweight models with adaptive architecture design and low computational requirements during inference while still achieving high performance.
Chapter 1
Introduction
1.1 Significance of the Research
Classification and regression are two fundamental problems in machine learning and statistics. Classification is a problem of identifying to which set of categories a new observation belongs, on the basis of
a training set of data containing observations or instances whose category membership is known. Regression, on the other hand, is a problem of predicting a continuous target variable based on one or more
predictor variables.
Classification and regression are important in machine learning and statistics and are widely used in a variety of applications. For example, classification is used in natural language processing, image and speech recognition, medical diagnosis, and spam detection. Regression is used in finance, economics, engineering, and beyond. Classification and regression modules are also fundamental building blocks for more complex models. Feature extraction and classification or regression are treated as two separate modules in classical machine learning methods, in which the classification or regression modules are essential to the final performance. Deep learning methods such as neural networks use a combination of classification and regression layers, while ensemble methods combine multiple classifiers or regressors. In this thesis, we focus on novel machine learning methods for low-dimensional data sources and high-dimensional data sources.
For problems with low-dimensional data sources, we propose a novel feature-based, high-performance machine learning method named SLM. SLM aims to solve general classification and regression problems with high performance and low model complexity. Learning from the efficiency of decision trees and the effectiveness of multi-layer perceptrons, we propose subspace learning as an effective hyperplane partitioning process for the feature space, and we perform subspace learning recursively to construct a general tree, the SLM tree. With this general tree as the data structure, we solve general classification and regression problems efficiently and with high performance, using a single SLM tree or an ensemble of SLM trees, either as an SLM forest or as gradient-boosted SLM trees.
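The recursive subspace-partitioning idea can be made concrete with a toy sketch. The code below is an illustrative simplification, not the thesis's actual SLM procedure: it scores random 1-D projections with a Gini criterion, splits at the projection median, and recurses; all function names and parameter choices are hypothetical.

```python
import numpy as np

def gini(labels):
    # Gini impurity of a label array.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def purity_gain(y, mask):
    # Impurity reduction of a binary split given a boolean mask.
    left, right = y[mask], y[~mask]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    n = len(y)
    return gini(y) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

def build_tree(X, y, depth=0, max_depth=3, n_trials=50, rng=None):
    # Recursively partition the feature space with learned 1-D projections.
    if rng is None:
        rng = np.random.default_rng(0)
    if depth == max_depth or len(np.unique(y)) == 1:
        return {"leaf": True, "label": np.bincount(y).argmax()}
    best = None
    for _ in range(n_trials):  # crude stand-in for SLM's probabilistic search
        w = rng.standard_normal(X.shape[1])
        proj = X @ w
        t = np.median(proj)    # simple threshold choice for this sketch
        gain = purity_gain(y, proj < t)
        if best is None or gain > best[0]:
            best = (gain, w, t)
    gain, w, t = best
    mask = X @ w < t
    if gain == 0.0 or not mask.any() or mask.all():
        return {"leaf": True, "label": np.bincount(y).argmax()}
    return {"leaf": False, "w": w, "t": t,
            "left": build_tree(X[mask], y[mask], depth + 1, max_depth, n_trials, rng),
            "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth, n_trials, rng)}

def predict(node, x):
    # Route a sample down the tree to a leaf label.
    while not node["leaf"]:
        node = node["left"] if x @ node["w"] < node["t"] else node["right"]
    return node["label"]
```

Because each split acts on a learned combination of features rather than a single axis-aligned feature, even a shallow tree of this kind can carve out oblique decision boundaries, which is the intuition behind wide and shallow SLM trees.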
To reduce the computational complexity and accelerate the training of SLM, we investigate two ways for improvement. First, we adopt the particle swarm optimization (PSO) algorithm to speed up the subspace learning process, which searches for a discriminant dimension as a linear combination of current dimensions. Learning the optimal weights of the linear combination can be computationally heavy with the probabilistic search in the original SLM; acceleration by PSO requires 10-20 times fewer iterations. Second, we leverage parallel processing in the SLM implementation. The accelerated SLM method achieves a speed-up in training time while maintaining classification/regression performance comparable to the original SLM.
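To make the PSO idea concrete, here is a minimal, generic PSO loop that searches for projection weights maximizing a Fisher-style two-class separation score. This is a plain PSO sketch under assumed settings (inertia 0.7, cognitive/social weights 1.5), not the adaptive APSO variant developed in the thesis; the criterion and all names are illustrative.

```python
import numpy as np

def pso_projection(X, y, n_particles=20, n_iters=30, seed=0):
    # Search for a unit vector w whose 1-D projection separates classes 0/1 well.
    rng = np.random.default_rng(seed)
    dim = X.shape[1]

    def score(w):
        # Fisher-style criterion: between-class over within-class spread.
        p = X @ w
        p0, p1 = p[y == 0], p[y == 1]
        within = p0.var() + p1.var() + 1e-12
        return (p0.mean() - p1.mean()) ** 2 / within

    pos = rng.standard_normal((n_particles, dim))      # particle positions
    vel = np.zeros_like(pos)                           # particle velocities
    pbest = pos.copy()                                 # personal bests
    pbest_val = np.array([score(w) for w in pos])
    gbest = pbest[pbest_val.argmax()].copy()           # global best

    w_in, c1, c2 = 0.7, 1.5, 1.5                       # standard PSO coefficients
    for _ in range(n_iters):
        r1, r2 = rng.random((2, n_particles, 1))
        vel = w_in * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        vals = np.array([score(w) for w in pos])
        improved = vals > pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
        gbest = pbest[pbest_val.argmax()].copy()
    return gbest / np.linalg.norm(gbest)
```

In a tree-growing loop, `pso_projection` would replace the probabilistic search at each node: the returned direction defines the 1-D subspace in which a split threshold is then chosen.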
For problems with high-dimensional data sources, we propose efficient machine learning methods for the image classification task, including a feed-forward efficient machine learning framework and an efficient SLM/SP framework. Image classification is one of the main research fields of computer vision and is crucial in many applications such as object recognition, self-driving cars, biometrics, and image retrieval. It allows computers to understand and identify objects in images. In self-driving cars, for instance, image classification is used to identify and track objects such as pedestrians, vehicles, and traffic lights. In biometrics, it can be used to identify people based on their facial features or fingerprints, which supports security systems, access control, and identification of individuals. In image retrieval systems, it is used to categorize and organize large image databases, making it easier to search and retrieve specific images.
Though the advancement of deep learning methods has brought exceptional performance improvements on public datasets and various tasks, deep learning models generally require large training data, huge computational resources, and large model sizes. We propose alternative learning frameworks for the image classification task. For the feed-forward efficient machine learning framework, we propose a learning framework with successive subspace learning as rich unsupervised feature extraction, a novel supervised feature learning process, a supplemental feature learning module inspired by the subspace learning process, and a complementary nonlinear feature learning module. For the efficient SLM/SP framework, we adapt SLM with soft partitioning to high-dimensional data sources, propose a novel topology for the SLM/SP tree leveraging the soft decision tree, and propose a novel efficient local representation learning module for the SLM/SP tree topology. Our proposed methods offer lightweight models with low computational requirements during inference and high performance.
1.2 Contributions of the Research
1.2.1 Novel Tree-based Classifiers/Regressors for Low-Dimensional Data Sources
For problems with low-dimensional data sources, we propose a novel feature-based machine learning
method named SLM for general classification and regression tasks.
• SLM builds upon a sequence of operations that partition an input feature space into multiple discriminant subspaces hierarchically and efficiently.
• SLM yields one or multiple wide and shallow decision trees, each of which is stronger than the
traditional decision tree in classification performance.
• SLM combines feature learning and ensemble learning; with an ensemble of SLM trees, the best predictive performance on the low-dimensional datasets is achieved.
• SLM generally has fewer model parameters, and achieves a more efficient gradient boosting and
ensemble training process compared to decision-tree-based ensembles.
• In summary, SLM is a lightweight and mathematically transparent classifier that offers state-of-the-art performance on general classification and regression tasks with low-dimensional data sources.
1.2.2 Complexity Reduction for Tree-based Classifiers/Regressors
With probabilistic projection as the subspace learning process, the computational cost of training is heavy. To
reduce the training complexity of the SLM tree and speed up the training process, we introduce an optimization-based subspace learning process and implement parallel processing and CUDA support for SLM.
• The optimization-based SLM with PSO requires 10-20 times fewer iterations to achieve similar or
better performance compared to probabilistic-projection-based subspace learning.
• The new implementation of SLM with parallel processing and CUDA support achieves a 40x to 100x
training speed-up for the SLM tree and ensemble SLM.
• The combination of the optimization-based SLM and the new implementation achieves up to 577 times
faster training compared to the naive implementation of SLM with probabilistic projection
learning.
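Parallel ensemble training (one worker per tree, each on its own bootstrap sample) can be sketched as follows. The stump learner is a deliberately trivial placeholder for an SLM tree, and all names are illustrative; the thesis implementation uses process-level parallelism and CUDA rather than the threads shown here.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def train_stump(args):
    # Placeholder learner standing in for one SLM tree:
    # best single-feature mean-threshold rule on a bootstrap resample.
    seed, X, y = args
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X), len(X))        # bootstrap resampling
    Xb, yb = X[idx], y[idx]
    best = None
    for f in range(X.shape[1]):
        t = Xb[:, f].mean()
        pred = (Xb[:, f] >= t).astype(int)
        acc = (pred == yb).mean()
        acc, flip = max((acc, False), (1.0 - acc, True))
        if best is None or acc > best[0]:
            best = (acc, f, t, flip)
    return best[1:]                              # (feature, threshold, flip)

def train_forest(X, y, n_trees=8, workers=4):
    # Train the ensemble members concurrently.
    jobs = [(seed, X, y) for seed in range(n_trees)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(train_stump, jobs))
```

Since ensemble members are trained independently, the wall-clock speed-up scales with the number of workers until data loading or memory bandwidth becomes the bottleneck.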
1.2.3 Feed-forward Efficient Machine Learning for High-Dimensional Data Sources
Deep learning based methods have dominated the image classification field in the past decade, but they require ever larger training data, model sizes, and computational complexity during training and inference. To explore an efficient machine learning alternative to deep learning, we propose a novel image classification framework.
• We utilize successive subspace learning for rich feature extraction and feature learning. Different
subspace learning approaches, including the Saab transform, channelwise Saab transform, and Haar transform, are adopted for a diversified unsupervised feature extraction process.
• We incorporate unsupervised and supervised feature learning processes to conduct effective dimension
reduction from the extracted rich features.
• With a gradient boosting decision tree as the main classifier, we include a supplemental feature learning module and a complementary nonlinear feature learning module, achieving a lighter model with less
computational cost during inference and high performance.
• Our method results in a lightweight model with low computational requirements during inference
and high performance, compared to traditional deep learning models, which often require large training data, computational resources, and model sizes.
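One supervised dimension-reduction step of the kind described above (akin to a discriminant feature test) can be sketched as ranking each extracted feature by the weighted impurity of its best 1-D split and keeping the lowest-loss features. The ranking rule and all names are illustrative assumptions, not the thesis's exact DFT module.

```python
import numpy as np

def split_loss(f, y, n_bins=16):
    # Minimum weighted Gini impurity over candidate thresholds of one feature.
    def gini(labels):
        if len(labels) == 0:
            return 0.0
        _, c = np.unique(labels, return_counts=True)
        p = c / c.sum()
        return 1.0 - np.sum(p ** 2)
    best = np.inf
    # Interior points of an evenly spaced threshold grid.
    for t in np.linspace(f.min(), f.max(), n_bins + 2)[1:-1]:
        left, right = y[f < t], y[f >= t]
        loss = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        best = min(best, loss)
    return best

def select_discriminant(X, y, keep=2):
    # Rank features by best-split impurity; keep the most discriminant ones.
    losses = np.array([split_loss(X[:, i], y) for i in range(X.shape[1])])
    return np.argsort(losses)[:keep]
```

Such a per-feature test is embarrassingly parallel and label-aware, which is why it can discard most of the rich unsupervised feature set at low cost before the classifier is trained.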
1.2.4 Efficient SLM with Soft Partitioning for High-Dimensional Data Sources
To further explore the potential of SLM on high-dimensional data sources, we propose an improved
SLM method that adopts soft partitioning and denote it SLM/SP. SLM/SP adopts the soft decision
tree (SDT) data structure for decision learning. It starts by learning an adaptive tree structure via local
greedy subspace partitioning. Once the stopping criteria are met for all child nodes and the tree structure
is finalized, all projection vectors are updated globally. This methodology enables efficient training, high
classification accuracy, and a small model size.
• We generalize SLM from hard-partitioning to soft-partitioning and propose a new SLM/SP method.
• We demonstrate the effectiveness of SLM/SP in classification accuracy with several benchmarking
datasets.
• We show the efficiency of SLM/SP in its model size and computational complexity.
• Our method offers a flexible framework design, which allows for adaptable model size and FLOPs
and enables a balance to be struck between efficiency and effectiveness for different classification tasks.
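The soft-partitioning idea behind SLM/SP, and the two inference modes detailed in Chapter 5 (full-path vs. single-path), can be sketched with a toy soft decision tree. The sigmoid gate and dictionary-based node layout below are illustrative assumptions, not the actual SLM/SP modules.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def full_path_inference(x, node):
    # Blend all leaf distributions, weighted by the product of soft routing
    # probabilities along each root-to-leaf path.
    if "dist" in node:                              # leaf: class distribution
        return node["dist"]
    p_right = sigmoid(x @ node["w"] + node["b"])    # soft gate at inner node
    return (1.0 - p_right) * full_path_inference(x, node["left"]) \
           + p_right * full_path_inference(x, node["right"])

def single_path_inference(x, node):
    # Follow only the higher-probability branch at each gate (cheaper).
    while "dist" not in node:
        p_right = sigmoid(x @ node["w"] + node["b"])
        node = node["right"] if p_right >= 0.5 else node["left"]
    return node["dist"]
```

Full-path inference blends every leaf and is the more accurate mode, but it costs a full tree traversal; single-path inference follows one branch per gate, trading a little accuracy for much lower inference cost, which matches the efficiency goals stated above.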
1.3 Organization of the Thesis
The rest of the thesis is organized as follows. In Chapter 2, we propose the novel tree-based machine learning method SLM, with a complete mathematical demonstration, experimental results and analysis on the low-dimensional datasets, and comments on and comparison with related classical machine learning methods. In Chapter 3, we propose the optimization, complexity reduction, and enhanced implementation of SLM; thorough experimental results and analysis are included to demonstrate the speed-up of SLM training. In Chapter 4 and Chapter 5, we propose efficient object recognition methods with lightweight model size, low inference complexity, adaptive architecture design, and high performance, i.e., the feed-forward efficient machine learning framework and the efficient SLM with soft partitioning, respectively. Finally, we conclude the research and discuss future work directions in Chapter 6.
Chapter 2
Novel Tree-based Classifiers/Regressors for Low-Dimensional Data
Sources
2.1 Introduction
Feature-based classification models have been well studied for many decades. Feature extraction and classification are treated as two separate modules in the classical setting. Attention has shifted to deep learning (DL) in recent years, where feature learning and classification are handled jointly. Although the best performance on classification tasks is broadly achieved by DL through back propagation (BP), DL models suffer from a lack of interpretability, high computational cost, and high model complexity. Under the classical learning paradigm, we propose new high-performance classification and regression models with features as the input in this work.
Examples of classical classifiers/regressors include the support vector machine (SVM) [23], decision tree (DT) [7], multilayer perceptron (MLP) [105], feed-forward multilayer perceptron (FF-MLP) [88], and extreme learning machine (ELM) [56]. SVM, DT and FF-MLP share one common idea, i.e., feature space partitioning. Yet, they achieve this objective by different means. SVM partitions the space by leveraging kernel functions and support vectors. DT partitions one space into two subspaces by selecting the most discriminant feature one at a time recursively. DT tends to overfit the training data when the tree depth is high. To avoid this, one can build multiple DTs, each of which is a weak classifier, whose ensemble yields a strong one, e.g., the random forest (RF) classifier [9] and the gradient boosting decision tree (GBDT) [14]. Built upon linear discriminant analysis, FF-MLP uses the Gaussian mixture model (GMM) to capture feature distributions of multiple classes and adopts neuron pairs with weights of oppositely signed vectors to represent partitioning hyperplanes.
The effective feature partitioning capability and high performance of the classical classifiers/regressors depend on two pillars, decision combination and feature combination, as shown in Fig. 2.1. With the simple learning strategy adopted in DT, selecting a partition is easier since it is conducted on a single feature, i.e., no combination of features is considered. Yet, the simplicity comes at a price: the discriminant power of an individual feature is weak, and a DT results in a weak classifier. Moving towards the feature combination pillar, SVM and MLP introduce feature combination for complex feature space partitioning and high performance. The complexity of SVM and FF-MLP depends on the sample distribution in the feature space. It is a nontrivial task to determine suitable partitions when the feature dimension is high and/or the sample distribution is complicated. These challenges could limit the effectiveness and efficiency of SVM and FF-MLP. As proposed in ELM [56], another idea of subspace partitioning is to randomly project a high-dimensional space to a 1D space and find the optimal split point in the associated 1D space. Although ELM works theoretically, it is not efficient in practice when the feature dimension is high: it takes a large amount of trial and error to find good projections. Moving towards the other pillar, decision combination, RF and GBDT introduce ensembles of simple yet weak DT classifiers/regressors to achieve complex feature partitioning and high performance. Though the ensemble of DTs achieves significant improvement over a single DT, the upper bound of its performance is limited by the simple feature partitioning strategy in DT, which does not consider feature combination. A better strategy that considers both feature combination and decision combination is needed.
Figure 2.1: Decision combination and feature combination are two pillars for high performance classical
machine learning methods.
By analyzing pros and cons of SVM, FF-MLP, DT and ELM, we attempt to find a balance between simplicity and effectiveness and propose a new classification-oriented machine learning model in this work. Since it partitions an input feature space into multiple discriminant subspaces in a hierarchical manner, it is named the subspace learning machine (SLM). Its basic idea is sketched below. Let X be the input feature space. First, SLM identifies a subspace S^0 from X. If the dimension of X is low, we set S^0 = X. If the dimension of X is high, we remove less discriminant features from X so that the dimension of S^0 is lower than that of X.
Next, SLM uses probabilistic projections of features in S^0 to yield p 1D subspaces and finds the optimal partition for each of them. This is equivalent to partitioning S^0 with 2p hyperplanes. A criterion is developed to choose the best q partitions that yield 2q partitioned subspaces among them. We assign S^0 to the root node of a decision tree and the intersections of the 2q subspaces to its child nodes of depth one. The partitioning process is recursively applied at each child node to build an SLM tree until stopping criteria are met, and each leaf node then makes a prediction. Generally, an SLM tree is wider and shallower than a DT. The prediction capability of an SLM tree is stronger than that of a single DT since it allows multiple decisions at a decision node. Its performance can be further improved by ensembles of multiple SLM trees obtained by bagging and boosting. The idea can be generalized to regression, leading to the subspace learning regressor (SLR). Extensive experiments are conducted for performance benchmarking among individual SLM/SLR trees, multi-tree ensembles and several classical classifiers and regressors. They show that SLM and SLR offer light-weight, high-performance classifiers and regressors, respectively.
The rest of this chapter is organized as follows. The SLM model is introduced in Sec. 2.2. The ensemble
design is proposed in Sec. 2.3. Performance evaluation and benchmarking of SLM and popular classifiers
are given in Sec. 2.4. The generalization to SLR is discussed in Sec. 2.5. The relationship between SLM/SLR
and other machine learning methods such as classification and regression tree (CART), MLP, ELM, RF and
Gradient Boosting Decision Tree (GBDT) [36] is described in Sec. 2.6. Finally, concluding remarks are
given in Sec. 2.7.
2.2 Subspace Learning Machine (SLM)
2.2.1 Motivation
Consider an input feature space, X, containing L samples, where each sample has a D-dimensional feature vector. A sample is denoted by

x_l = (x_{l,1}, · · · , x_{l,d}, · · · , x_{l,D})^T ∈ R^D,   (2.1)

where l = 1, · · · , L. We use F_d to represent the dth feature set of x_l, i.e.,

F_d = {x_{l,d} | 1 ≤ l ≤ L}.   (2.2)
Figure 2.2: (a) An illustration of SLM, where space S^0 is partitioned into 4 subspaces with two splits, and (b) the corresponding SLM tree with a root node and four child nodes.
For a multi-class classification problem with K classes, each training feature vector has an associated class label in the form of a one-hot vector

y_l = (y_{l,1}, · · · , y_{l,k}, · · · , y_{l,K})^T ∈ R^K,   (2.3)

where

y_{l,k} = 1 and y_{l,k'} = 0, k' ≠ k,   (2.4)

if the lth sample belongs to class k, where 1 ≤ k, k' ≤ K. Our goal is to partition the feature space, R^D, into multiple subspaces hierarchically so that samples at leaf nodes are as pure as possible. That is, the majority of samples at a node are from the same class. Then, we can assign all samples in a leaf node to the majority class. This process is adopted by a DT classifier. The root node is the whole sample space, and an intermediate or leaf node corresponds to a partitioned subspace. We use S^0 to represent the sample space at the root node and S^m, m = 1, · · · , M, to denote subspaces of child nodes of depth m in the tree.
The efficiency of traditional DT methods could potentially be improved by two ideas. They are elaborated below.

1. Partitioning in a flexibly chosen 1D subspace:
We may consider a general projected 1D subspace defined by

F_a = {f(a) | f(a) = a^T x},   (2.5)

where

a = (a_1, · · · , a_d, · · · , a_D)^T, ||a|| = 1.   (2.6)

The DT is a special case of Eq. (2.5), where a is set to the dth basis vector, e_d, 1 ≤ d ≤ D. On one hand, this choice reduces computational complexity, which is particularly attractive if D >> 1. On the other hand, if there is no discriminant feature F_d, the decision tree will not be effective. It is desired to find a discriminant direction, a, so that the subspace, F_a, has more discriminant power at a low computational cost.

2. N-ary split at one node:
One parent node in DT is split into two child nodes. We may allow an n-ary split at the parent. One example is shown in Fig. 2.2(a) and (b), where space S^0 is split into four disjoint subspaces. Generally, the n-ary split gives wider and shallower decision trees so that overfitting can be avoided more easily.

The SLM method is motivated by these two ideas. Although the high-level ideas are straightforward, their effective implementations are nontrivial. They will be detailed in the next subsection.
2.2.2 Methodology
Subspace partitioning in a high-dimensional space plays a fundamental role in the design of powerful
machine learning classifiers. Generally, we can categorize the partitioning strategies into two types: 1)
search for an optimal split point in a projected 1D space (e.g., DT) and 2) search for an optimal splitting
hyperplane in a high-dimensional space (e.g., SVM and FF-MLP). Mathematically, both of them can be
expressed in the form

a^T x − b = 0,   (2.7)

where −b is called the bias and

a = (a_1, · · · , a_d, · · · , a_D)^T, ||a|| = 1,   (2.8)

is a unit vector that points in the surface normal direction. It is called the projection vector. Then, the full space, S, is split into two half subspaces:

S_+ : a^T x ≥ b, and S_− : a^T x < b.   (2.9)

The challenge lies in finding a good projection vector a so that samples of different classes are better separated. This depends on the distribution of samples of different classes. Classifiers of the first type pre-select a set of candidate projection vectors, try them out one by one, and select the best one based on a certain criterion. Classifiers of the second type use some criteria to choose good projection vectors. For example, SVM first identifies support vectors and then finds the hyperplane that yields the largest separation (or margin) between two classes. The complexity of the first type is significantly lower than that of the second type.
In SLM, we attempt to find a middle ground between the two. That is, we generate a new 1D space as given by

F_a = {f(a) | f(a) = a^T x},   (2.10)

where a is a vector on the unit hypersphere in R^D as defined in Eq. (2.8). By following the first type of classifier, we would like to identify a set of candidate projection vectors. Yet, their selection is done in a probabilistic manner. Generally, it is not effective to draw a from the unit hypersphere uniformly. The criterion for effective projection vectors and their probabilistic selection will be presented in Secs. 2.2.2.1-2.2.2.3. Without loss of generality, we primarily focus on the projection vector selection at the root node in the following discussion. The same idea can be easily generalized to child nodes.
2.2.2.1 Selection Criterion
We use the discriminant feature test (DFT) [132] to evaluate the discriminant quality of the projected 1D
subspace as given in Eq. (2.10). It is summarized below.
For a given projection vector, a, we find the minimum and the maximum of the projected values f(a) = a^T x, which are denoted by f_min and f_max, respectively. We partition the interval [f_min, f_max] into B bins uniformly and use the bin boundaries as candidate thresholds. One threshold, t_b, b = 1, · · · , B − 1, partitions interval [f_min, f_max] into two subintervals that define two sets:

F_{a,t_b,+} = {f(a) | a^T x ≥ t_b},   (2.11)
F_{a,t_b,−} = {f(a) | a^T x < t_b}.   (2.12)

The bin number, B, is typically set to 16 [132].
The quality of a split can be evaluated with the weighted sum of loss functions defined on the two subintervals:

L_{a,t_b} = (N_+ / (N_+ + N_−)) L_{a,t_b,+} + (N_− / (N_+ + N_−)) L_{a,t_b,−},   (2.13)

where N_+ = |F_{a,t_b,+}| and N_− = |F_{a,t_b,−}| are the sample numbers in the two subintervals, respectively. One convenient choice of the cost function is the entropy defined by

L = − Σ_{c=1}^{C} p_c log(p_c),   (2.14)

where p_c is the probability that samples in a given set belong to class c and C is the total number of classes. In practice, the probability is estimated using the histogram. Finally, the discriminant power of a projection vector is defined as the minimum cost function across all threshold values:

L_{a,opt} = min_{t_b} L_{a,t_b}.   (2.15)

We search for projection vectors, a, that provide small cost values as given by Eq. (2.15). The smaller, the better.
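As a concrete illustration, the DFT criterion of Eqs. (2.13)-(2.15) can be sketched in a few lines of NumPy. This is a minimal sketch for a single projection, assuming integer class labels; the function names are ours, not part of the DFT reference implementation in [132].

```python
import numpy as np

def entropy(labels, n_classes):
    """Eq. (2.14): entropy of the class histogram of a sample set."""
    if len(labels) == 0:
        return 0.0
    p = np.bincount(labels, minlength=n_classes) / len(labels)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def dft_loss(f, labels, n_classes, n_bins=16):
    """Eq. (2.15): minimum weighted entropy (Eq. (2.13)) over the B-1
    uniform bin-boundary thresholds of the projected values f = a^T x."""
    thresholds = np.linspace(f.min(), f.max(), n_bins + 1)[1:-1]
    best = np.inf
    for t in thresholds:
        right = labels[f >= t]   # F_{a,t_b,+}
        left = labels[f < t]     # F_{a,t_b,-}
        n_r, n_l = len(right), len(left)
        loss = (n_r * entropy(right, n_classes)
                + n_l * entropy(left, n_classes)) / (n_r + n_l)
        best = min(best, loss)
    return best
```

A projection that separates the classes perfectly yields a loss of zero, while a projection that mixes them yields a strictly positive loss, which is exactly the ranking used to compare candidate directions.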
2.2.2.2 Probabilistic Selection
We adopt a probabilistic mechanism to select one or multiple good projection vectors. To begin with, we
express the project vector as
a = a1e1 + · · · , aded + · · · + aDeD, (2.16)
15
where ed, d = 1, · · · , D, is the basis vector. We evaluate the discriminant power of ed by setting a = ed
and following the procedure in Sec. 2.2.2.1. Our probabilistic selection scheme is built upon one observation. The discriminative ability of ed plays an important role in choosing a more discriminant a. Thus,
we rank ed according to their discriminant power measured by the cost function in Eq. (2.15). The newly
ordered basis is denoted by e
′
d
, d = 1, · · · , D, which satisfies the following:
Le
′
1
,opt ≤ Le
′
2
,opt ≤ · · · ≤ Le
′
D,opt. (2.17)
We can rewrite Eq. (2.16) as
a = a
′
1e
′
1 + · · · , a′
de
′
d + · · · + a
′
De
′
D. (2.18)
We use three hyper-parameters to control the probabilistic selection procedure.

• P_d: the probability of selecting coefficient a'_d. Usually, P_d is higher for smaller d. In the implementation, we adopt the exponential density function in the form

P_d = β_0 exp(−βd),   (2.19)

where β > 0 is a parameter and β_0 is the corresponding normalization factor.

• A_d: the dynamic range of coefficient a'_d. To save computation, we limit the values of a'_d and consider integer values within the dynamic range; namely,

a'_d = 0, ±1, ±2, · · · , ±⌊A_d⌋,   (2.20)

where A_d is also known as the envelope parameter. Again, we adopt the exponential density function

A_d = α_0 exp(−αd),   (2.21)

where α > 0 is a parameter and α_0 is the corresponding normalization factor in the implementation. When the search space in Eq. (2.20) is relatively small for the chosen hyperparameters, we exhaustively test all the values. Otherwise, we select a partial set of the orientation vector coefficients probabilistically under the uniform distribution.

• R: the number of selected coefficients, a'_d, in Eq. (2.18). If D is large, we only select a subset of R coefficients, where R << D, to save computation. The dynamic ranges of the remaining (D − R) coefficients are all set to zero.

By fixing parameters β, α and R in one round of generation, the total search space of a to be tested by DFT lies between

U.B. = Π_{d=1}^{R} (2A_d + 1),  L.B. = Π_{d=D+1−R}^{D} (2A_d + 1),   (2.22)

where U.B. and L.B. denote the upper and lower bounds, respectively. To further increase the diversity of a, we may use multiple rounds of the generation process with different β, α and R parameters.
We use Fig. 2.3 as an example to illustrate the probabilistic selection process. Suppose the input feature dimension is D = 10 and R is set to 5; we may apply α_0 = 10 and α = 0.5 to bound the dynamic range of the a'_d selections. During the probabilistic selection, the R = 5 coefficients are selected with the candidate integers for the corresponding a'_d marked as black dots; the unselected D − R coefficients are marked as gray, and their actual coefficients are set to zero. The coefficients of the orientation vector a are uniformly selected among the candidate integers marked as black dots in Fig. 2.3.
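The sampling procedure described above can be sketched as follows. This is a minimal sketch under the stated exponential laws of Eqs. (2.19)-(2.21); the function name and the use of NumPy's random generator are our illustrative choices, and the basis is assumed to be already reordered by discriminant power as in Eq. (2.17).

```python
import numpy as np

def sample_orientation_vector(D, R, alpha0, alpha, beta, rng):
    """Draw one candidate projection vector over the reordered basis."""
    d = np.arange(D)
    # Eq. (2.19): selection probability P_d = beta0 * exp(-beta * d)
    p = np.exp(-beta * d)
    p /= p.sum()                      # beta0 is the normalization factor
    chosen = rng.choice(D, size=R, replace=False, p=p)
    # Eq. (2.21): envelope A_d = alpha0 * exp(-alpha * d)
    A = np.floor(alpha0 * np.exp(-alpha * d)).astype(int)
    a = np.zeros(D)
    for dim in chosen:
        # Eq. (2.20): integer coefficient in {0, ±1, ..., ±floor(A_d)}
        a[dim] = rng.integers(-A[dim], A[dim] + 1)
    # normalize to the unit hypersphere (Eq. (2.8)), if nonzero
    norm = np.linalg.norm(a)
    return a / norm if norm > 0 else a
```

With D = 10, R = 5, α_0 = 10 and α = 0.5 as in the Fig. 2.3 example, each call returns one unit-norm candidate direction with at most five nonzero coefficients.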
2.2.2.3 Selection of Multiple Projection Vectors
Based on the search given in Sec. 2.2.2.2, we often find multiple effective projection vectors

ã_j, j = 1, · · · , J,   (2.23)
Figure 2.3: Illustration of the probabilistic selection process, where the envelope function A_d provides a bound on the magnitude of the coefficients a'_d in the orientation vector. The dimensions with black dots are the selected dimensions, and the dots in one vertical line are the candidate integers for that dimension. In this example, the selected dimensions are a'_1, a'_2, a'_4, a'_5 and a'_6. For each trial, we select one black dot per vertical line to form an orientation vector. The search can be done exhaustively or randomly under the uniform distribution.
which yield discriminant 1D subspaces. It is desired to leverage them as much as possible. However, many of them are highly correlated in the sense that their cosine similarity is close to unity, i.e.,

ã_j^T ã_{j'} ≈ 1.   (2.24)

As a result, the corresponding two hyperplanes have a small angle between them. To avoid this, we propose an iterative procedure to select multiple projection vectors. To begin with, we choose the projection vector in Eq. (2.23) that has the smallest cost as the first one. Next, we choose the second one from Eq. (2.23) that minimizes its absolute cosine similarity with the first one. We repeat the same process by minimizing the maximum cosine similarity with the first two, and so on. The iterative mini-max optimization process is terminated by a pre-defined threshold value, θ_minimax.
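The iterative mini-max selection can be sketched as below. This is a simplified rendering of the procedure, assuming unit-norm candidate vectors; the function name and the default θ_minimax value are illustrative, not taken from the thesis.

```python
import numpy as np

def select_decorrelated(candidates, costs, theta_minimax=0.9):
    """Iterative mini-max selection: start from the smallest-cost vector,
    then repeatedly add the candidate whose maximum absolute cosine
    similarity to the selected set is smallest, stopping once that
    value reaches theta_minimax or candidates run out."""
    remaining = list(np.argsort(costs))   # smallest DFT cost first
    selected = [remaining.pop(0)]
    while remaining:
        # max |cos| of each remaining candidate against the selected set
        maxsim = [max(abs(float(candidates[i] @ candidates[j]))
                      for j in selected) for i in remaining]
        k = int(np.argmin(maxsim))
        if maxsim[k] >= theta_minimax:    # all leftovers are near-parallel
            break
        selected.append(remaining.pop(k))
    return [candidates[i] for i in selected]
```

The returned vectors define hyperplanes with large mutual angles, which is what makes the resulting node splits complementary rather than redundant.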
Figure 2.4: An overview of the SLM system.
2.2.2.4 SLM Tree Construction
We illustrate the SLM tree construction process in Fig. 2.4. It contains the following steps.

1. Check the discriminant power of the D input dimensions and find a discriminant input subspace S^0 with D_0 dimensions among the D.

2. Generate p projection vectors that project the selected input subspace to p 1D subspaces. The projected space is denoted by S̃.

3. Select the best q 1D subspaces from the p candidates based on discriminant ability and correlation, and split the node accordingly; the result is denoted by S^1.
The node split process is recursively conducted to build nodes of the SLM tree. The following stopping
criteria are adopted to avoid overfitting at a node.
1. The depth of the node is greater than user’s pre-selected threshold (i.e. the hyper-parameter for the
maximum depth of an SLM tree).
2. The number of samples at the node is less than user’s pre-selected threshold (i.e. the hyper-parameter
for the minimum sample number per node).
3. The loss at the node is less than user's pre-selected threshold (i.e. the hyper-parameter for the minimum loss per node).
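The construction loop can be summarized as a recursive function. The sketch below is heavily simplified: it draws candidate projection vectors uniformly at random instead of using the probabilistic selection of Sec. 2.2.2.2, and it keeps only a single binary split per node rather than q decorrelated splits, but it mirrors the three stopping criteria above. All names are ours, not the authors' implementation.

```python
import numpy as np

def entropy(y):
    """Entropy of an integer label vector (Eq. (2.14))."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def best_split(f, y, n_bins=16):
    """Best threshold of a 1D projection by weighted entropy (Eq. (2.13))."""
    ts = np.linspace(f.min(), f.max(), n_bins + 1)[1:-1]
    best_t, best_loss = None, np.inf
    for t in ts:
        m = f >= t
        if m.all() or (~m).all():
            continue
        loss = (m.sum() * entropy(y[m]) + (~m).sum() * entropy(y[~m])) / len(y)
        if loss < best_loss:
            best_t, best_loss = t, loss
    return best_t, best_loss

def build_slm_node(X, y, depth, rng, max_depth=4, min_samples=10, min_loss=1e-3):
    """Recursively split a node; the guard mirrors stopping criteria 1-3."""
    if depth >= max_depth or len(y) < min_samples or entropy(y) < min_loss:
        return {"leaf": True, "label": int(np.bincount(y).argmax())}
    # stand-in for probabilistic selection: a few random unit projections
    cands = rng.standard_normal((8, X.shape[1]))
    cands /= np.linalg.norm(cands, axis=1, keepdims=True)
    losses = [best_split(X @ a, y)[1] for a in cands]
    a = cands[int(np.argmin(losses))]
    t, _ = best_split(X @ a, y)
    if t is None:
        return {"leaf": True, "label": int(np.bincount(y).argmax())}
    m = X @ a >= t
    return {"leaf": False, "a": a, "t": t,
            "right": build_slm_node(X[m], y[m], depth + 1, rng,
                                    max_depth, min_samples, min_loss),
            "left": build_slm_node(X[~m], y[~m], depth + 1, rng,
                                   max_depth, min_samples, min_loss)}

def slm_predict(node, x):
    """Route a sample down the tree to a leaf label."""
    while not node["leaf"]:
        node = node["right"] if x @ node["a"] >= node["t"] else node["left"]
    return node["label"]
```

Even this stripped-down variant illustrates the key structural difference from a DT: every split is taken along a learned projection a rather than a coordinate axis.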
2.3 SLM Forest and SLM Boost
Ensemble methods are commonly used in the machine learning field to boost performance. An ensemble model aims to obtain better performance than each constituent model alone. With DTs as the constituent weak learners, the bootstrap aggregating (bagging) method (e.g., RF) and the boosting method (e.g., GBDT) are the most popular ensemble methods. As compared to other classical machine learning
methods, they often achieve better performance when applied to real world applications. In this section,
we show how to apply the ensemble idea to one single SLM tree (i.e., SLM Baseline). Inspired by RF, we
present a bagging method for SLM and call it the SLM Forest. Similarly, inspired by XGBoost [16], we
propose a boosting method and name it SLM Boost. They are detailed in Secs. 2.3.1 and 2.3.2, respectively.
2.3.1 SLM Forest
For traditional DTs, RF is the most popular bagging ensemble algorithm. It consists of a set of tree predictors, where each tree is built based on the values of a random vector sampled independently and with the
same distribution for all trees in the forest [9]. By the Strong Law of Large Numbers, the performance of RF converges as the tree number increases. As compared to individual DTs, significant performance improvement is achieved with the combination of many weak decision trees. Motivated by RF, SLM Forest is developed by learning a series of single SLM tree models to enhance the predictive performance beyond each individual SLM tree. As discussed in Sec. 2.2, SLM is a stronger predictive model than DT. Besides, the probabilistic projection provides diversity between different SLM models. Following RF, SLM Forest takes
the majority vote of the individual SLM trees as the ensemble result for classification tasks, and it adopts
the mean of each SLM tree prediction as the ensemble result for regression tasks.
It is proved in [2] that the predictive performance of RF depends on two key points: 1) the strength
of individual trees, and 2) the weak dependence between them. In other words, a high performance RF
model can be obtained through the combination of strong and uncorrelated individual trees. The model
diversity of RF is achieved through bagging of the training data and feature randomness. For the former,
RF takes advantage of the sensitivity of DTs to the data they are trained on, and allows each individual tree
to randomly sample from the dataset with replacement, resulting in different trees. For the latter, each tree
can select features only from a random subset of the whole input features space, which forces diversity
among trees and ultimately results in lower correlation across trees for better performance.
SLM Forest achieves diversity of SLM trees more effectively through probabilistic selection as discussed
in Sec. 2.2.2.2. For partitioning at a node, we allow a probabilistic selection of D0 dimensions from the D
input feature dimensions by taking the discriminant ability of each feature into account. In Eq. (2.19), β is
a hyper-parameter used to control the probability distribution among input features. A larger β value places higher preference on more discriminant features. Furthermore, the envelope parameter, A_d, in Eq. (2.21) bounds each element of the orientation vector. It also contributes to the diversity of SLM trees since the search space of projection vectors is determined by hyper-parameter α. Similar in spirit to the replacement idea in RF, all training samples and all feature dimensions are kept as the input at each node split to increase the strength of individual SLM trees.
With individual SLM trees stronger than individual DTs and a novel design of decorrelated partitioning planes, SLM Forest achieves better performance and faster convergence than RF. This claim is supported by experimental results in Sec. 2.4.
2.3.2 SLM Boost
With standard DTs as weak learners, GBDT [36] and XGBoost [16] can deal with a large amount of data
efficiently and achieve the state-of-the-art performance in many machine learning problems. They take
the ensemble of standard DTs with boosting, i.e. by defining an objective function and optimizing it with
learning a sequence of DTs. By following the gradient boosting process, we propose SLM Boost to ensemble
a sequence of SLM trees.
Mathematically, we aim at learning a sequence of SLM trees, where the tth tree is denoted by f_t(x). Suppose that we have T trees at the end. Then, the prediction of the ensemble is the sum of all trees, i.e., each of the L samples is predicted as

ŷ_l = Σ_{t=1}^{T} f_t(x_l),  l = 1, 2, · · · , L.   (2.25)
The objective function for the first t trees is defined as

Ω = Σ_{l=1}^{L} γ(y_l, ŷ_l^{(t)}),   (2.26)

where ŷ_l^{(t)} is the prediction of sample l with all t trees and γ(y_l, ŷ_l^{(t)}) denotes the training loss for the model with a sequence of t trees. The log loss and the mean squared error are commonly used as the training loss for classification and regression tasks, respectively. It is intractable to learn all trees at once. To design SLM Boost, we follow the GBDT process and use the additive strategy. That is, by fixing what has been
learned with all previous trees, SLM Boost learns one new tree at a time. Without loss of generality, we initialize the model prediction to 0. Then, the learning process is

ŷ_l^{(0)} = 0,   (2.27)
ŷ_l^{(1)} = f_1(x_l) = ŷ_l^{(0)} + f_1(x_l),   (2.28)
· · ·   (2.29)
ŷ_l^{(t)} = Σ_{i=1}^{t} f_i(x_l) = ŷ_l^{(t−1)} + f_t(x_l).   (2.30)
Then, the objective function to learn the tth tree can be written as

Ψ^{(t)} = Σ_{l=1}^{L} γ(y_l, ŷ_l^{(t−1)} + f_t(x_l)).   (2.31)
Furthermore, we follow the XGBoost process and take the Taylor expansion of the loss function up to the second order to approximate the loss function in general cases. Then, the objective function can be approximated as

Ψ^{(t)} ≈ Σ_{l=1}^{L} [γ(y_l, ŷ_l^{(t−1)}) + g_l f_t(x_l) + (1/2) h_l f_t^2(x_l)] + C,   (2.32)
where g_l and h_l are defined as

g_l = ∂_{ŷ_l^{(t−1)}} γ(y_l, ŷ_l^{(t−1)}),   (2.33)

h_l = ∂²_{ŷ_l^{(t−1)}} γ(y_l, ŷ_l^{(t−1)}).   (2.34)
After removing all constants, the objective function for the tth SLM tree becomes

Σ_{l=1}^{L} [g_l f_t(x_l) + (1/2) h_l f_t^2(x_l)].   (2.35)

With individual SLM trees stronger than individual DTs, SLM Boost achieves better performance and faster convergence than XGBoost, as illustrated by the experimental results in Sec. 2.4.
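To make the additive process concrete: for the squared-error loss γ(y, ŷ) = (1/2)(y − ŷ)², Eqs. (2.33)-(2.34) give g_l = ŷ_l^{(t−1)} − y_l and h_l = 1, so each new tree simply fits the current residuals. The sketch below illustrates this special case with a depth-1 regression stump as a stand-in weak learner; it is not the SLM tree itself, and all names are ours.

```python
import numpy as np

def fit_stump(X, r):
    """Fit a depth-1 regression tree (stand-in weak learner) to residuals r."""
    best = (0, X[:, 0].mean(), r.mean(), r.mean(), np.inf)
    for d in range(X.shape[1]):
        for t in np.unique(X[:, d]):
            m = X[:, d] >= t
            if m.all() or (~m).all():
                continue
            lo, hi = r[~m].mean(), r[m].mean()
            err = np.sum((r[~m] - lo) ** 2) + np.sum((r[m] - hi) ** 2)
            if err < best[4]:
                best = (d, t, lo, hi, err)
    return best[:4]

def predict_stump(stump, X):
    d, t, lo, hi = stump
    return np.where(X[:, d] >= t, hi, lo)

def boost(X, y, n_trees=20, lr=0.5):
    """Additive training, Eqs. (2.27)-(2.30): each round fits the negative
    gradient -g_l = y_l - yhat_l of the squared-error loss."""
    yhat = np.zeros(len(y))          # Eq. (2.27): initialize prediction to 0
    trees = []
    for _ in range(n_trees):
        stump = fit_stump(X, y - yhat)       # residuals = -g
        trees.append(stump)
        yhat = yhat + lr * predict_stump(stump, X)   # Eq. (2.30)
    return trees, yhat
```

SLM Boost follows the same outer loop but fits each round's (g_l, h_l) with an SLM tree instead of a stump, which is why stronger base learners translate directly into faster convergence of the ensemble.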
2.4 Performance Evaluation of SLM
Datasets. To evaluate the performance of SLM, we conduct experiments on the following nine datasets.
1. Circle-and-Ring. It contains an inner circle as one class and an outer ring as the other class as shown
in Fig. 2.5(a) [100].
2. 2-New-Moons. It contains two interleaving new moons as shown in Fig. 2.5(b) [100]. Each new
moon corresponds to one class.
3. 4-New-Moons. It contains four interleaving new moons as shown in Fig. 2.5(c) [100], where each
moon is a class.
4. Iris Dataset. The Iris plants dataset [34, 100] has 3 classes, 4D features and 150 samples.
5. Wine Dataset. The Wine recognition dataset [3, 100] has 3 classes, 13D features, and 178 samples.
6. B.C.W. Dataset. The breast cancer Wisconsin (B.C.W.) dataset [3, 100] has 2 classes, 30D features and
569 samples.
7. Diabetes Dataset. The Pima Indians diabetes dataset [114] is for diabetes prediction. It has 2 classes,
8D features and 768 samples. By following [88], we removed samples with the physically impossible
zero value for glucose, diastolic blood pressure, triceps skin fold thickness, insulin, or BMI and used
the remaining 392 samples for consistent experimental settings.
8. Ionosphere Dataset. The Ionosphere binary classification dataset [3, 44] is used to predict whether
the radar return is good or bad. It has 2 classes, 34D features and 351 instances. For consistency with
[88], we remove the feature dimension that has the zero variance from the data.
9. Banknote Dataset. The banknote authentication dataset [3] classifies whether a banknote is genuine
or forged based on the features extracted from the wavelet transform of banknote images. It has 2
classes, 4D features and 1372 samples.
The feature dimension of the first three datasets is two, while that of the last six datasets is higher than two. The first three are synthetic, where 500 samples per class are generated, with 30% noisy samples near the decision boundary for 2-New-Moons and 20% noisy samples near the decision boundary for Circle-and-Ring and 4-New-Moons. The last six are real-world datasets. Samples in all datasets are randomly split into training and test sets with 60% and 40%, respectively.
Benchmarking Classifiers and Their Implementations. We compare the performance of 10 classifiers in Table 2.1. They include seven traditional classifiers and three proposed SLM variants. The seven classifiers are: 1) MLP designed in a feedforward manner (denoted by FF-MLP) [88], 2) MLP trained by backpropagation (denoted by BP-MLP), 3) linear SVM (LSVM), 4) SVM with the radial basis function kernel (SVM/RBF), 5) Decision Tree (DT), 6) Random Forest (RF), and 7) XGBoost. The three SLM variants are: 1) SLM Baseline using only one SLM tree, 2) SLM Forest, and 3) SLM Boost.
For the two MLP models, the network architectures of FF-MLP and BP-MLP and their performance
results are taken from [88]. FF-MLP has a four-layer network architecture; namely, one input layer, two
hidden layers and one output layer. The neuron numbers of its input and output layers are equal to the
Figure 2.5: Visualization of 2D feature datasets: (a) Circle-and-Ring, (b) 2-New-Moons, (c) 4-New-Moons. The ground truth of the training data, the ground truth of the test data and the SLM predicted results are shown in the first, second and third rows, respectively.
feature dimension and the class number, respectively. The neuron numbers at each hidden layer are hyperparameters determined adaptively for each dataset. BP-MLP has the same architecture as FF-MLP for the same dataset. Its model parameters are initialized by those of FF-MLP and trained for 50 epochs. For
the two SVM models, we conduct grid search for hyperparameter C in LSVM and hyperparameters C
and γ in SVM/RBF for each of the nine datasets to yield the optimal performance. For the DT model, the
weighted entropy is used as the loss function in node splitting. We do not set the maximum depth limit
of a tree, the minimum sample number and the minimum loss decrease required as the stopping criteria.
Instead, we allow the DT for each dataset to split until the purest nodes are reached in the training. For the
ensemble of DT models (i.e., RF and XGBoost), we conduct grid search for the optimal tree depth and the
learning rate of XGBoost to ensure that they reach the optimal performance for each dataset. The number
of trees is set to 100 to ensure convergence. The hyperparameters of SLM Baseline (i.e., with one SLM tree)
include D0, p, Aint, α, β and the minimum number of samples used in the stopping criterion. They are
searched to achieve the performance shown in Table 2.1. The number of trees in SLM Forest is set to 20 due
to the faster convergence of stronger individual SLM trees. The number of trees in SLM Boost is the same
as that of XGBoost for fair comparison of learning curves.
Table 2.1: Classification accuracy comparison of 10 benchmarking methods on nine datasets.

Method       | Circle-and-Ring | 2-New-Moons | 4-New-Moons | Iris  | Wine   | B.C.W. | Pima  | Ionosphere | Banknote
FF-MLP       | 87.25           | 91.25       | 95.38       | 98.33 | 94.44  | 94.30  | 73.89 | 89.36      | 98.18
BP-MLP       | 88.00           | 91.25       | 87.00       | 64.67 | 79.72  | 97.02  | 75.54 | 84.11      | 88.38
LSVM         | 48.50           | 85.25       | 85.00       | 96.67 | 98.61  | 96.49  | 76.43 | 86.52      | 99.09
SVM/RBF      | 88.25           | 89.75       | 88.38       | 98.33 | 98.61  | 97.36  | 75.15 | 93.62      | 100.00
DT           | 85.00           | 87.25       | 94.63       | 98.33 | 95.83  | 94.74  | 77.07 | 89.36      | 98.00
SLM Baseline | 88.25           | 91.50       | 95.63       | 98.33 | 98.61  | 97.23  | 77.71 | 90.07      | 99.09
RF           | 87.00           | 90.50       | 96.00       | 98.33 | 100.00 | 95.61  | 79.00 | 94.33      | 98.91
XGBoost      | 87.50           | 91.25       | 96.00       | 98.33 | 100.00 | 98.25  | 75.80 | 91.49      | 99.82
SLM Forest   | 88.25           | 91.50       | 96.00       | 98.33 | 100.00 | 97.36  | 79.00 | 95.71      | 100.00
SLM Boost    | 88.25           | 91.50       | 96.00       | 98.33 | 100.00 | 98.83  | 77.71 | 94.33      | 100.00
Comparison of Classification Performance. The classification accuracy results of the 10 classifiers are shown in Table 2.1. We divide them into two groups. The first group includes six basic methods: FF-MLP, BP-MLP, LSVM, SVM/RBF, DT and SLM Baseline. The second group includes four ensemble methods: RF, XGBoost, SLM Forest and SLM Boost. The best results for each group are shown in bold. In the first group, SLM Baseline and SVM/RBF often outperform FF-MLP, BP-MLP, LSVM and DT and give the best results. For the three synthetic 2D datasets (i.e., Circle-and-Ring, 2-New-Moons and 4-New-Moons), the gain of SLM over MLP is relatively small due to noisy samples. The difference in sample distributions of training and test data plays a role in the performance upper bound. To demonstrate this point, we show sample distributions of their training and testing data in Fig. 2.5. For the datasets with high-dimensional input features, the SLM methods achieve better performance over the classical methods by bigger margins. The ensemble methods in the second group have better performance than the basic methods in the first group.
The ensembles of SLM are generally more powerful than those of DT. They give the best performance
among all benchmarking methods.
Table 2.2: Model size comparison of five machine learning models on nine datasets, where the smallest model size for each dataset is highlighted in bold.

Datasets        | FF/BP-MLP | LSVM  | SVM/RBF | DT  | SLM Baseline | DT depth | SLM tree depth
Circle-and-Ring | 125       | 2,365 | 825     | 350 | 39           | 14       | 4
2-New-Moons     | 114       | 853   | 705     | 286 | 42           | 15       | 4
4-New-Moons     | 702       | 1,653 | 1,669   | 298 | 93           | 11       | 5
Iris            | 47        | 145   | 253     | 34  | 20           | 6        | 3
Wine            | 147       | 346   | 856     | 26  | 99           | 4        | 2
B.C.W.          | 74        | 1,121 | 2,913   | 54  | 126          | 7        | 4
Pima            | 2,012     | 1,071 | 1,341   | 130 | 55           | 8        | 3
Ionosphere      | 278       | 806   | 1,996   | 50  | 78           | 10       | 2
Banknote        | 22        | 337   | 499     | 78  | 40           | 7        | 3
Comparison of Model Sizes. The model size is defined by the number of model parameters. The model sizes of FF/BP-MLP, LSVM, SVM/RBF, DT and SLM Baseline are compared in Table 2.2. Since FF-MLP and BP-MLP share the same architecture, their model sizes are the same; the size is calculated by summing up the weight and bias numbers of all neurons.

The model parameters of LSVM and SVM/RBF can be computed as

SVM Parameter # = (D + 2) × N_SV + 1,   (2.36)

where D and N_SV denote the feature dimension and the number of support vectors, respectively. The first term in Eq. (2.36) accounts for the support vectors: each support vector stores D feature dimensions, one Lagrange dual coefficient, and one class label. The second term is the bias.

The model size of a DT depends on the number of splits learned during training. Since two parameters are learned at each split, one for the selected feature and one for the split value, the size of a DT is twice its number of splits.
The size of an SLM Baseline model depends on the number of partitioning hyperplanes determined in the training stage. For a given hyper-parameter D_0, each partitioning hyperplane involves one weight vector and a selected splitting threshold, with q_i decorrelated partitions learned for partitioning the ith parent node. Then, the model size of the corresponding SLM can be calculated as

SLM Parameter # = Σ_{i=1}^{M} q_i (D_0 + 1),   (2.37)

where M is the number of split (parent) nodes.
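The three counting rules above translate directly into code; the sketch below is only a restatement of Eq. (2.36), the DT rule, and Eq. (2.37), with illustrative inputs.

```python
def svm_param_count(D, n_sv):
    """Eq. (2.36): (D + 2) parameters per support vector plus the bias."""
    return (D + 2) * n_sv + 1

def dt_param_count(n_splits):
    """Two parameters per split: the selected feature and the threshold."""
    return 2 * n_splits

def slm_param_count(q_per_node, D0):
    """Eq. (2.37): each of the q_i hyperplanes at a split node costs a
    D0-dimensional weight vector plus one threshold."""
    return sum(q * (D0 + 1) for q in q_per_node)
```

For example, by the DT rule, the Iris DT size of 34 in Table 2.2 corresponds to 17 splits.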
Details on the model and computation of each method against each dataset are given in the appendix. It is worthwhile to comment on the tradeoff between classification performance and model size. For SVM, since its training involves learning the dual coefficients and slack variables for each training sample and memorizing the support vectors, the model size grows with the number of training samples and the number of support vectors. When achieving similar classification accuracy, the SVM model is heavier for larger training sets. For MLPs, on high-dimensional classification tasks, the SLM methods outperform the MLP models on all benchmarking datasets. For the datasets with saturated performance, such as Iris, Banknote and Ionosphere, SLM achieves better or comparable performance with less than half of the parameters of MLP. As compared to DTs, the SLM models tend to yield wider and shallower trees, as shown in Table 2.2. The depths of SLM trees are overall smaller than those of the DT models, while the numbers of splits can be comparable for small datasets such as Wine. SLM trees tend to make more splits in order to reach pure leaf nodes at shallower depths. While outperforming the DTs on all datasets, the SLM model sizes are generally smaller than those of the DTs as well, benefiting from the subspace partitioning process.
Convergence Performance Comparison of DT Ensembles and SLM Ensembles. We compare the convergence performance of the ensemble and the boosting methods of DT and SLM for Wine, B.C.W. and Pima
Figure 2.6: Comparison of SLM and DT ensembles for three datasets (a) Wine, (b) B.C.W., and (c) Pima.
Each left subfigure compares the accuracy curves of SLM Forest and RF as a function of the tree number.
Each right subfigure compares the logloss curves of SLM Boost and XGBoost as a function of the tree
number.
datasets in Figs. 2.6(a)-(c). For RF and SLM Forest, which are ensembles of DT and SLM, respectively, we set their maximum tree depths and learning rates to the same values. We show their accuracy curves as a function
of the tree number in the left subfigure. We see that SLM Forest converges faster than RF. For XGBoost
and SLM Boost, which are boosting methods of DT and SLM, respectively, we show the logloss value as
Table 2.3: Comparison of regression performance of eight regressors on six datasets.
Datasets Make-Friedman1 Make-Friedman2 Make-Friedman3 Boston California-Housing Diabetes
LSVR 2.49 138.43 0.22 4.90 0.76 53.78
SVR/RBF 1.17 6.74 0.11 3.28 0.58 53.71
DT 3.10 33.57 0.11 4.75 0.74 76.56
SLR Baseline 2.89 31.28 0.11 4.42 0.69 56.05
RF 2.01 22.32 0.08 3.24 0.52 54.34
XGBoost 1.17 32.34 0.07 2.67 0.48 53.99
SLR Forest 1.88 20.79 0.08 3.01 0.48 52.52
SLR Boost 1.07 18.07 0.06 2.39 0.45 51.27
a function of the tree number in the right subfigure. Again, we see that SLM Boost converges faster than
XGBoost.
2.5 Subspace Learning Regressor (SLR)
2.5.1 Method
A different loss function can be adopted in the subspace partitioning process for a different task. For
example, to solve a regression problem, we can follow the same methodology as described in Sec. 2.2
but adopt the mean-squared-error (MSE) as the loss function. The resulting method is called subspace
learning regression, and the corresponding regressor is the subspace learning regressor (SLR).
Mathematically, each training sample has a pair of input x and output y, where x is a D-dimensional
feature vector and y is a scalar that denotes the regression target. Then, we build an SLR tree that
partitions the D-dimensional feature space hierarchically into a set of leaf nodes. Each of them corresponds
to a subspace. The mean of sample targets in a leaf node is set as the predicted regression value of these
samples. The partition objective is to reduce the total MSE of sample targets as much as possible. In the
partitioning process, the total MSE of all leaf nodes decreases gradually and saturates at a certain level.
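The MSE-driven partition objective can be illustrated with a minimal 1D split search (a simplified sketch; the actual SLR search runs over projected subspaces found via probabilistic projection):

```python
import numpy as np

def best_mse_split(x, y):
    """Scan all candidate thresholds on a (projected) 1D feature x and
    return the split minimizing the summed squared error of the two
    children, each of which predicts the mean of its sample targets."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_sse, best_thr = np.inf, None
    for i in range(1, len(x)):
        left, right = y[:i], y[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best_thr = sse, 0.5 * (x[i - 1] + x[i])
    return best_sse, best_thr

# Two well-separated clusters: the best threshold falls between them
# and the post-split MSE drops to zero.
x = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
y = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])
print(best_mse_split(x, y))
```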
The ensemble and boosting methods are applicable to SLR. The SLR Forest consists of multiple SLR
trees through ensembles. Its final prediction is the mean of predictions from SLR trees in the forest. To
derive SLR Boost, we apply the GBDT process and train a series of additive SLR trees to achieve gradient
boosting, leading to further performance improvement. As compared with a decision tree, an SLR tree is
wider, shallower and more effective. As a result, SLR Forest and SLR Boost are more powerful than their
counterparts, as demonstrated in the next subsection.
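The ensemble logic can be sketched with generic regression trees standing in for SLR trees (scikit-learn trees are used here purely as stand-ins; the dataset and hyper-parameters are illustrative):

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.tree import DecisionTreeRegressor

X, y = make_friedman1(n_samples=400, n_features=10, random_state=0)

# Forest-style ensemble: the final prediction is the mean over trees.
trees = [DecisionTreeRegressor(max_depth=4, random_state=s).fit(X, y)
         for s in range(10)]
forest_pred = np.mean([t.predict(X) for t in trees], axis=0)

# Boost-style ensemble: each additive tree fits the residual of the
# running prediction, scaled by a learning rate.
pred, lr = np.full_like(y, y.mean()), 0.3
for _ in range(20):
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y - pred)
    pred += lr * tree.predict(X)
rmse = float(np.sqrt(np.mean((y - pred) ** 2)))
print(rmse)
```

The boosting loop drives the training RMSE well below that of a constant mean predictor, which is the behavior SLR Boost exploits with stronger SLR trees.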
2.5.2 Performance Evaluation
To evaluate the performance of SLR, we compare the root mean squared error (RMSE) performance of
eight regressors on six datasets in Table 2.3. The five benchmarking regressors are: linear SVR (LSVR),
SVR with RBF kernel, DT, RF and XGBoost. There are three variants of SLR: SLR Baseline (with one SLR
tree), SLR Forest and SLR Boost. The first three datasets are synthesized datasets as described in [37]. We
generate 1000 samples for each of them. The last three datasets are real world datasets. Samples in all six
datasets are randomly split into 60% training samples and 40% test samples.
1. Make Friedman 1. Its input vector, x, contains P (with P > 5) elements, which are independent
and uniformly distributed on interval [0, 1]. Its output, y, is generated by the first five elements of
the input. The remaining (P − 5) elements are irrelevant features and can be treated as noise. We
refer to [37] for details. We choose P = 10 in the experiment.
2. Make Friedman 2-3. Their input vector, x, has 4 elements. They are independent and uniformly
distributed on interval [0, 1]. Their associated output, y, can be calculated by all four input elements
via mathematical formulas as described in [37].
3. Boston. It contains 506 samples, each of which has a 13-D feature vector as the input. An element
of the feature vector is a real positive number. Its output is a real number within interval [5, 50].
4. California Housing. It contains 20640 samples, each of which has an 8-D feature vector. The
regression target (or output) is a real number within interval [0.15, 5].
5. Diabetes. It contains 442 samples, each of which has a 10-D feature vector. Its regression target is
a real number within interval [25, 346].
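The synthesized data and the 60/40 split can be reproduced with scikit-learn (a sketch of the setup, not the exact experiment code):

```python
from sklearn.datasets import load_diabetes, make_friedman1
from sklearn.model_selection import train_test_split

# Make Friedman1: 1000 samples with P = 10 inputs, only the first five
# of which influence the target; then a random 60/40 train/test split.
X, y = make_friedman1(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=0)
print(X_train.shape, X_test.shape)  # (600, 10) (400, 10)

# Real-world example: Diabetes (442 samples, 10-D features).
Xd, yd = load_diabetes(return_X_y=True)
print(Xd.shape)  # (442, 10)
```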
As shown in Table 2.3, SLR Baseline outperforms DT on all datasets. Also, SLR Forest and SLR Boost outperform RF and XGBoost, respectively. For Make Friedman1, Make Friedman3, California Housing, Boston, and Diabetes, SLR Boost achieves the best performance. For Make Friedman2, SVR/RBF achieves the best performance, benefiting from the RBF kernel on that specific data distribution. However, it is worthwhile to emphasize that, to achieve its optimal performance, SVR/RBF needs to overfit the training data by finetuning the soft margin with a large regularization parameter (i.e., C = 1000), which leads to much higher computational complexity. With stronger individual SLR trees and effectively uncorrelated models, SLR ensembles achieve better performance than DT ensembles with higher efficiency.
2.6 Comments on Related Work
In this section, related prior work is reviewed and the relations between various classifiers/regressors and SLM/SLR are discussed.
2.6.1 Classification and Regression Tree (CART)
DT has been well studied for decades, and is broadly applied for general classification and regression
problems. Classical decision tree algorithms, e.g. ID3 [102] and CART [7], are devised to learn a sequence
of binary decisions. One tree is typically a weak classifier, and multiple trees are built to achieve higher
performance in practice such as bootstrap aggregation [9] and boosting methods [14]. DT and its ensembles
work well most of the time. Yet, they may fail due to poor training and test data splitting and training data
overfit. As compared to classic DT, one SLM tree (i.e., SLM Baseline) can exploit discriminant features
obtained by probabilistic projections and achieve multiple splits at one node. SLM generally yields wider
and shallower trees.
2.6.2 Random Forest (RF)
RF consists of multiple decision trees. It is proved in [2] that the predictive performance of RF depends on
the strength of individual trees and a measure of their dependence. For the latter, the lower the better. To
achieve higher diversity, RF training takes only a fraction of training samples and features in building a tree,
which trades the strength of each DT for the general ensemble performance. In practice, several effective
designs are proposed to achieve uncorrelated individual trees. For example, bagging [8] builds each tree
through random selection with replacement in the training set. Random split selection [29] selects a split
at a node among the best splits at random. In [49], a random subset of features is selected to grow each tree.
Generally speaking, RF uses bagging and feature randomness to create uncorrelated trees in a forest, and
their combined prediction is more accurate than that of an individual tree. In contrast, the tree building
process in SLM Forest always takes all training samples and the whole feature space into account. Yet, it
utilizes feature randomness to achieve the diversity of each SLM tree as described in Sec. 2.3.1. Besides
effective diversity of SLM trees, the strength of each SLM tree is not affected in SLM Forest. With stronger
individual learners and effective diversity, SLM Forest achieves better predictive performance and faster
convergence in terms of the tree number.
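The two diversity mechanisms used by RF, bootstrap sampling and random feature subsets, can be sketched as follows (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 100, 13

# Bagging: each tree is grown on a bootstrap sample (drawn with
# replacement), so roughly 63% of the distinct training points appear
# in any one tree's bag.
sample_idx = rng.integers(0, n_samples, size=n_samples)
unique_frac = len(np.unique(sample_idx)) / n_samples

# Feature randomness: each tree (or split) sees only a random feature
# subset; sqrt(n_features) is a common subset size.
k = int(np.sqrt(n_features))  # 3 of 13 features
feat_idx = rng.choice(n_features, size=k, replace=False)

print(round(unique_frac, 2), sorted(feat_idx.tolist()))
```

SLM Forest, by contrast, keeps all samples and all features in every tree, relying on probabilistic projection alone for diversity.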
2.6.3 Gradient Boosting Decision Tree (GBDT)
Gradient boosting is another ensemble method of weak learners. It builds a sequence of weak prediction models, where each new model attempts to compensate for the prediction residual left by the previous models. The gradient boosting decision tree (GBDT) methods include [36], which performs standard gradient
boosting, and XGBoost [16], which takes the Taylor series expansion of a general loss function and defines
a gain so as to perform more effective node splittings than standard DTs. SLM Boost mimics the boosting
process of XGBoost but replaces DTs with SLM trees. As compared with standard GBDT methods, SLM
Boost achieves faster convergence and better performance as a consequence of stronger performance of
an SLM tree.
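The second-order gain that XGBoost uses to score a candidate split, and that SLM Boost inherits through its boosting process, can be written down directly (standard formulation with an L2 regularizer λ; the per-split complexity penalty γ is omitted for brevity):

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0):
    """XGBoost-style gain of splitting a node into (left, right).
    G^2 / (H + lam) is the optimal loss reduction of a leaf whose
    samples have gradient sum G and Hessian sum H."""
    def leaf_score(G, H):
        return G * G / (H + lam)
    return 0.5 * (leaf_score(g_left, h_left)
                  + leaf_score(g_right, h_right)
                  - leaf_score(g_left + g_right, h_left + h_right))

# For squared-error loss, g_i = pred_i - y_i and h_i = 1. A split that
# separates positive from negative residuals has a large positive gain:
print(split_gain(g_left=-6.0, h_left=3, g_right=6.0, h_right=3))  # 9.0
```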
2.6.4 Multilayer Perceptron (MLP)
Since its introduction in 1958 [105], MLP has been well studied and broadly applied to many classification
and regression tasks [1, 28, 113]. Its universal approximation capability is studied in [25, 50, 85, 115].
The design of a practical MLP solution can be categorized into two approaches. First, one can propose an
architecture and fine tune parameters at each layer through back propagation. For the MLP architecture,
it is often to evaluate different networks through trials and errors. Design strategies include tabu search
[43] and simulated annealing [68]. There are quite a few MLP variants. In convolutional neural networks
(CNNs) [71, 80, 81, 82], convolutional layers share neurons’ weights and biases across different spatial
locations while fully-connected layers are the same as traditional MLPs. MLP also serves as the building
blocks in transformer models [32, 123]. Second, one can design an MLP by constructing its model layer
after layer, e.g., [35, 96, 97, 98]. To incrementally add new layers, some works adopt an optimization method similar to the first approach, e.g., [41, 77, 94]; others adjust the parameters of the newly added hidden layer and determine them without back propagation, e.g., [93, 99, 128].
The convolution operation in neural networks changes the input feature space to an output feature
space, which serves as the input to the next layer. Nonlinear activation in a neuron serves as a partition of
the output feature space and only one half subspace is selected to resolve the sign confusion problem caused
by the cascade of convolution operations [73, 88]. In contrast, SLM does not change the feature space at
all. The probabilistic projection in SLM is simply used for feature space partitioning without generating
new features. Each tree node splitting corresponds to a hyperplane partitioning through weights and
bias learning. As a result, both half subspaces can be preserved. SLM learns partitioning parameters in a
feedforward and probabilistic approach, which is efficient and transparent.
2.6.5 Extreme Learning Machine (ELM)
ELM [56] adopts random weights for the training of feedforward neural networks. Theory of random
projection learning models and their properties (e.g., interpolation and universal approximation) have been
investigated in [53, 54, 55, 57]. To build MLP with ELM, one can add new layers with randomly generated
weights. However, the long training time and the large model size due to a large search space impose constraints in practical applications. SLM Baseline does take efficiency into account. It builds a
general decision tree through probabilistic projections, which reduces the search space by leveraging most
discriminant features with several hyper-parameters. We use the term “probabilistic projection” rather than “random projection” to emphasize their difference.
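The ELM training recipe, random hidden weights followed by a closed-form least-squares solve of the output layer, can be sketched as follows (a minimal illustration with assumed layer sizes):

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_fit(X, y, n_hidden=100):
    """Hidden weights are drawn at random and frozen; only the linear
    output layer is solved in closed form (no back propagation)."""
    W = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)                       # random feature map
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# Fit a smooth 3-input target with a wide random hidden layer.
X = rng.uniform(-1.0, 1.0, size=(200, 3))
y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2]
W, b, beta = elm_fit(X, y)
mse = float(np.mean((elm_predict(X, W, b, beta) - y) ** 2))
print(mse)
```

A wide enough random layer fits this smooth target closely, but the width needed (and hence model size) grows quickly with problem difficulty, which is the practical constraint noted above.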
2.7 Conclusion
In this chapter, we proposed a novel machine learning model called the subspace learning machine (SLM). SLM combines the feedforward multilayer perceptron design and the decision tree to learn discriminant subspaces and make predictions. At each subspace learning step, SLM utilizes hyperplane partitioning to evaluate each feature dimension and probabilistic projection to learn perceptron parameters for feature learning. The most discriminant subspace is learned to partition the data into child SLM nodes, the subspace learning process is conducted iteratively, and final predictions are made with pure SLM leaf nodes. SLM is light-weight, mathematically transparent, adaptive to high-dimensional data, and achieves state-of-the-art benchmarking performance. An SLM tree can also serve as the weak classifier in general boosting and bootstrap aggregation methods as a more generalized model.
Recently, unsupervised representation learning for images has been intensively studied, e.g., [17, 18, 19, 74, 76]. The features associated with their labels can be fed into a classifier for supervised classification. Such a system adopts the traditional pattern recognition paradigm with feature extraction and classification modules in cascade. The feature dimensions for these problems are on the order of hundreds. It will be interesting to investigate the effectiveness of SLM for problems with high-dimensional input feature spaces.
2.8 Appendix : Calculation of the Model Size
The sizes of the classification models in Table 2.2 are computed below.
• Circle-and-Ring.
FF-MLP has four Gaussian components for the ring and one Gaussian blob for the circle. As a result,
there are 8 and 9 neurons in the two hidden layers with 125 parameters in total. LSVM has 591
support vectors and one bias, resulting in 2,365 parameters. SVM/RBF kernel has 206 support vectors
and one bias, resulting in 825 parameters. DT has 175 splits to generate a tree of depth 14, resulting
in 350 parameters. SLM utilizes two input features to build a tree of depth 4. The node numbers at
each level are 1, 4, 4, 10 and 8, respectively. It has 13 partitions and 39 parameters.
• 2-New-Moons.
FF-MLP has 2 Gaussian components for each class, 8 neurons in each of two hidden layers, and
114 parameters in total. LSVM has 213 support vectors and one bias, resulting in 853 parameters.
SVM/RBF kernel has 176 support vectors and one bias, resulting in 705 parameters. DT has 143 splits
to generate a tree of depth 15, resulting in 286 parameters. SLM has a tree of depth 4 and the node
numbers at each level are 1, 4, 8, 12 and 4, respectively. It has 14 partitions and 42 parameters.
• 4-New-Moons.
FF-MLP has three Gaussian components for each class, 18 and 28 neurons in two hidden layers,
respectively, and 702 parameters in total. LSVM has 413 support vectors and one bias, resulting in
1653 parameters. SVM/RBF has 417 support vectors and one bias, resulting in 1669 parameters. DT makes 149 splits with a max tree depth of 11, resulting in 298 parameters. SLM has a tree of depth 5 and the node numbers at each level are 1, 4, 16, 22, 16 and 4, respectively. It has
31 partitions and 93 parameters.
• Iris Dataset.
FF-MLP has two Gaussian components for each class and two partitioning hyper-planes. It has 4 and
3 neurons at hidden layers 1 and 2, respectively, and 47 parameters in total. LSVM has 24 support
vectors and one bias, resulting in 145 parameters. SVM/RBF has 42 support vectors and one bias,
resulting in 253 parameters. DT has 17 splits with a tree of max depth 6, resulting in 34 parameters.
SLM uses all 4 input dimensions and has a tree of depth 3. The node numbers at each level are 1, 2,
2 and 4, respectively. It has 4 partitions and 20 parameters.
• Wine Dataset.
FF-MLP assigns two Gaussian components to each class. There are 6 neurons at each of two hidden
layers. It has 147 parameters. LSVM has 23 support vectors and one bias, resulting in 346 parameters.
SVM/RBF has 57 support vectors and one bias, resulting in 856 parameters. DT has 13 splits with a
tree of max depth 4, resulting in 26 parameters. SLM sets D0 = 8 and has a tree of depth 2. There are
1, 8 and 256 nodes at each level, respectively. It has 11 partitions and 99 parameters.
• B.C.W. Dataset.
FF-MLP assigns two Gaussian components to each class. There are two neurons at each of the two
hidden layers. The model has 74 parameters. LSVM has 35 support vectors and one bias, resulting
in 1121 parameters. SVM with RBF kernel has 91 support vectors and 1 bias, resulting in 2913
parameters. DT has 27 splits with a tree of max depth 7, resulting in 54 parameters. SLM sets D0 = 5
and has a tree of depth 4 with 1, 8, 16, 8 and 32 nodes at each level, respectively. It has 21 partitions
and 126 parameters.
• Diabetes Dataset.
FF-MLP has 18 and 88 neurons at two hidden layers, respectively. The model has 2,012 parameters.
LSVM has 107 support vectors and one bias, resulting in 1071 parameters. SVM/RBF has 134 support
vectors and one bias, resulting in 1341 parameters. DT has 65 splits with a tree of max depth 8, resulting in 130 parameters. SLM sets D0 = 4 and has a tree of depth 3 with 1, 2, 16 and 20 nodes at each
level, respectively. It has 11 partitions and 55 parameters.
• Ionosphere Dataset.
FF-MLP has 6 and 8 neurons in hidden layers 1 and 2, respectively. It has 278 parameters. LSVM
has 23 support vectors and one bias, resulting in 806 parameters. SVM/RBF has 211 slack variables,
57 support vectors and one bias, resulting in 1996 parameters. DT has 25 splits with a tree of max
depth 10, resulting in 50 parameters. SLM sets D0 = 5 and has a tree of depth 2 with 1, 4 and 20
nodes at each level, respectively. It has 13 partitions and 78 parameters.
• Banknote Dataset.
FF-MLP has two neurons at each of the two hidden layers. It has 22 parameters. LSVM has 56
support vectors and one bias, resulting in 337 parameters. SVM/RBF has 83 support vectors and
one bias, resulting in 499 parameters. DT has 39 splits with a tree of max depth 7, resulting in 78
parameters. SLM uses all input features and has a tree of depth 3 with 1, 2, 8 and 16 nodes at each
level, respectively. It has 8 partitions and 40 parameters.
Chapter 3
Complexity Reduction for Tree-based Classifiers/Regressors for
Low-Dimensional Data Sources
3.1 Introduction
Built upon the classical decision tree [7] idea, SLM has been proposed in Chapter 2. SLM offers a high-performance machine learning solution to general classification and regression tasks. Both DT and SLM
convert the high-dimensional feature space splitting problem into a 1D feature space splitting problem.
DT compares each feature in the raw feature space and chooses the one that meets a certain criterion
and searches for the optimal split point, which is the threshold value. In contrast, SLM conducts a linear
combination of existing features to find a new discriminant 1D space and then finds the optimal split point accordingly. This demands finding the optimal weights of the linear combination in the training stage, which is computationally heavy.
It was shown in Chapter 2 that SLM achieves better performance than the support vector machine (SVM) [23],
DT [7], multilayer perceptron (MLP) [88, 105] with fewer parameters on several benchmarking machine
learning datasets. The performance improvement is reached at the expense of higher computational complexity during training. In this work, we investigate two ideas to accelerate SLM: 1) proposal of a more
efficient weight learning strategy via particle swarm optimization (PSO) [67], and 2) software implementation optimization with parallel processing on CPU and GPU.
For the first idea, the learning of optimal weights is accomplished by probabilistic search in the original SLM. No optimization is involved. The probabilistic search demands a large number of iterations to achieve desired results, and it is difficult to scale to a high-dimensional feature space. In contrast, PSO simulates the behavior of a flock of birds, where each bird is modeled by a particle. PSO finds the optimal solution in the search space by updating particles jointly and iteratively. The PSO-accelerated SLM requires 10-20 times fewer iterations than the original SLM.
For the second idea, SLM’s training time can be significantly reduced through software implementation
optimization. In the original SLM development stage, its implementation primarily relies on public Python packages such as Scikit-learn [100], NumPy, and SciPy [124]. After its algorithmic details are settled, we follow
the development pathway of XGBoost [15] and LightGBM [66], implement a C++ package, and integrate
it with Cython. Both multithreaded CPU and CUDA are supported.
By combining the above two ideas, the accelerated SLM can achieve a speed-up factor of 577 in training
as compared to the original SLM while maintaining comparable (or even better) classification/regression
performance. The rest of this chapter is organized as follows. The acceleration of SLM through PSO is
introduced in Sec. 3.2. The optimization of software implementation is presented in Sec. 3.3. Experimental
results are shown in Sec. 3.4. Finally, concluding remarks are given in Sec. 3.5.
3.2 Acceleration via Adaptive Particle Swarm Optimization
As described in Sec. 2.2.2.2, the probabilistic search is used in the original SLM to obtain good projection
vectors. As the feature dimension increases, the search space grows exponentially and the probabilistic
search scheme does not scale well. In this section, we adopt a metaheuristic algorithm called the particle
swarm optimization (PSO) [33] to address the same problem with higher efficiency.
3.2.1 Introduction to Particle Swarm Optimization (PSO)
PSO is an evolutionary algorithm proposed by Kennedy and Eberhart in 1995 [67]. It finds the optimal
solution in the search space by simulating the behavior of flock using a swarm of particles. Each particle
in the swarm has a position vector and a velocity vector. Mathematically, the ith particle has the following
position and velocity vectors:
xi = [x1, x2, · · · , xD]^T, (3.1)

vi = [v1, v2, · · · , vD]^T, (3.2)

where D is the dimension of the search space. The velocity and position vectors of the
ith particle are updated in each iteration via
vi,t+1 = ωvi,t + c1r1[pBi,t − xi,t] + c2r2[gB − xi,t], (3.3)
xi,t+1 = xi,t + vi,t+1. (3.4)
Eq. (3.3) has the following parameters:
• ω: the inertia weight, used to control the influence of the particle's current velocity on the update.
• gB: the position vector of the optimal particle in the population.
• pBi,t: the best position of the ith particle in the history update process up to iteration t.
• c1 and c2: two learning factors.
• r1 and r2: two uniformly distributed random values between 0 and 1 used to simulate the stochastic
behavior of particles.
Algorithm 1 Particle Swarm Optimization (PSO)
Input: Population, Dimension and MaxIteration
Output: Best particle’s position vector and loss
1: Initialize all particles
2: for iteration < MaxIteration do
3:   for i < Population do
4:     Update Vi and Xi and compute lossi
5:     if lossi < pbesti then
6:       Update pbesti
7:     if lossi < gbest then
8:       Update gbest
9:     i = i + 1 // next particle
10:  iteration = iteration + 1 // next iteration
11: return gbest
The initial position and velocity vectors of all particles are randomly selected within a preset range.
At each iteration, the velocity and position of each particle are updated according to Eqs. (3.3) and (3.4),
respectively. Furthermore, we calculate the current loss and update gB and pB accordingly. If the algorithm converges, the position vector of the globally optimal particle is the output. The pseudo-code of PSO is provided in Algorithm 1.
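Eqs. (3.3)-(3.4) and Algorithm 1 translate into a compact NumPy sketch (a generic PSO on a toy loss; the hyper-parameter values are illustrative, not those used in SLM):

```python
import numpy as np

def pso(loss, dim, n_particles=30, iters=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, (n_particles, dim))   # positions
    v = np.zeros((n_particles, dim))                 # velocities
    pbest = x.copy()                                 # per-particle best positions
    pbest_loss = np.array([loss(p) for p in x])
    gbest = pbest[pbest_loss.argmin()].copy()        # global best position
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, 1))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)  # Eq. (3.3)
        x = x + v                                                  # Eq. (3.4)
        cur = np.array([loss(p) for p in x])
        improved = cur < pbest_loss
        pbest[improved], pbest_loss[improved] = x[improved], cur[improved]
        gbest = pbest[pbest_loss.argmin()].copy()
    return gbest, float(pbest_loss.min())

# Sphere function: the global minimum is 0 at the origin.
best, best_loss = pso(lambda p: float((p ** 2).sum()), dim=5)
print(best_loss)
```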
3.2.2 Adaptive PSO (APSO) in SLM
The PSO algorithm has several variants [72, 87, 110]. The search space of the projection vector in our problem is complex, and there are quite a few local minima; a standard PSO can easily get stuck in one of them. One variant of PSO, called adaptive particle swarm optimization (APSO) [136], often yields better performance on multimodal problems, so it is adopted as an alternative choice in the SLM implementation.
Before an iteration, APSO calculates the distribution of particles in the space and identifies it as one of the
following four states: Exploration, Exploitation, Convergence, and Jumpout. APSO chooses parameters c1
and c2 in different states dynamically so as to make particles more effective under the current distribution.
They are summarized below.
• In the exploration state, APSO increases the c1 value and decreases the c2 value. This allows particles
to explore in the whole search space more freely and individually to achieve better individual optimal
position.
• In the exploitation state, APSO increases the c1 value slightly and decreases the c2 value. As a result,
particles can leverage the local information more to find the global optimum.
• In the convergence state, APSO increases both c1 and c2 slightly. An elite learning strategy is also
introduced to promote the convergence of globally optimal particles, which makes the search more
efficient.
• In the jumping-out state, APSO decreases the c1 value and increases the c2 value. This allows particles to move away from their current cluster more easily. Thus, the particle swarm can jump out of
the current local optimal solution.
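The state-dependent adjustment of c1 and c2 can be sketched as a simple lookup (the step sizes below are illustrative placeholders; see [136] for the exact adaptation rules):

```python
# Hypothetical per-state updates of the acceleration coefficients,
# mirroring the four cases above; `delta` is an illustrative step size.
def adapt_coefficients(state, c1, c2, delta=0.05):
    if state == "exploration":      # favor individual exploration
        c1, c2 = c1 + delta, c2 - delta
    elif state == "exploitation":   # exploit local information
        c1, c2 = c1 + 0.5 * delta, c2 - delta
    elif state == "convergence":    # promote global convergence
        c1, c2 = c1 + 0.5 * delta, c2 + 0.5 * delta
    elif state == "jumpout":        # help escape the current cluster
        c1, c2 = c1 - delta, c2 + delta
    return c1, c2

print(adapt_coefficients("jumpout", 1.5, 1.5))
```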
For classification tasks in SLM, the particle's position vector is the projection vector, and the DFT loss is utilized as the loss function. In the original SLM, we split samples into several child nodes at each tree level. Since APSO can find the global optimum with the lowest loss, only one globally optimal projection is selected per node in the accelerated version; the resulting SLM tree is therefore a binary tree. Furthermore, in the accelerated SLM algorithm, the top n discriminant features in each node are re-computed and serve as the input dimensions of the APSO algorithm. We will show in the experimental section that APSO demands fewer iterations to find the global optimum than probabilistic search, so the overall search speed is improved. The pseudo-code of SLM tree building based on APSO is given in Algorithm 2.
Algorithm 2 SLM Tree Building with APSO
1: if Node meets the split condition then
2: Run DFT for this node
3: Select top n channels from DFT result
4: Run Adaptive PSO with n Dimension
5: Use gbest vector for partitioning
6: Get two child nodes for next building process
7: else
8: Mark this node as a leaf node
9: return
3.3 Acceleration via Parallel Processing
The SLM algorithm in [40] was implemented in pure Python. While Python code is easy to write, the
running speed is slow. Cython is a superset of Python that bridges the gap between Python and C. It
allows static typing and compilation from Cython to C. Furthermore, there are a few commonly used
math libraries in Python such as SciPy and scikit-learn.
The first step towards run-time acceleration is to convert the computational bottleneck of the SLM
implementation from Python to Cython. The bottleneck is the search of the optimal projection vector. For
a projection vector at a node, we calculate the loss at each split point so as to find the minimum loss in
the 1D space. Since these computations do not depend on each other, they can be computed in parallel
in principle. However, Python does not support effective multithreading for CPU-bound code due to the Global Interpreter Lock (GIL), and multi-processing was deemed unsuitable due to its higher overhead in the current context.
Instead, we implement a multithreaded C++ extension for the optimal projection vector search, and
integrate it into the remaining Cython code. In the C++ extension, we spawn multiple threads, where each
thread covers the computation of a fixed number of splits per selected projection vector. The number of launched threads is capped at the number of physical cores to prevent excessive context switching.
A task is defined as calculating the loss on a single split of a projected 1D subspace. These tasks are evenly
allocated to the threads, and each thread processes its share sequentially.
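The even allocation of split-loss tasks to worker threads can be sketched as pure scheduling logic (the production version is the multithreaded C++ extension described above):

```python
def partition_tasks(n_tasks, n_workers):
    """Evenly divide task indices 0..n_tasks-1 into contiguous chunks,
    one per worker; earlier workers absorb the remainder."""
    base, rem = divmod(n_tasks, n_workers)
    chunks, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < rem else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks

# 10 split-loss tasks over 3 worker threads -> chunk sizes 4, 3, 3.
print([len(c) for c in partition_tasks(10, 3)])  # [4, 3, 3]
```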
Table 3.1: Comparison of training run-time (in seconds) of 6 settings for 9 classification datasets.
Sets Circle-and-ring 2-new-moons 4-new-moons Iris Wine B.C.W Pima Ionosphere Banknote
SLM
A 7.026 4.015 6.019 82.939 65.652 234.597 339.077 248.410 118.000
B 0.287 0.188 0.200 0.465 0.808 2.931 3.920 5.217 0.697
C 0.346 0.0943 0.236 0.295 0.700 2.657 3.263 3.838 0.439
D 4.071 3.159 3.245 3.205 3.572 3.956 4.660 5.021 4.370
E 7.597 3.146 7.200 4.188 3.106 11.605 13.892 8.123 8.588
F 0.270 0.0887 0.278 0.192 0.831 2.837 2.738 1.371 0.619
SLM Forest
A 160.264 107.695 157.307 2333.514 2605.288 5067.356 10024.238 6582.104 3020.587
B 8.507 5.272 8.471 17.142 26.254 134.301 101.631 152.625 40.365
C 7.910 4.714 8.170 13.068 22.176 116.843 88.972 137.390 16.341
D 7.708 4.954 7.678 5.946 15.066 98.969 48.987 80.547 16.511
E 220.021 96.712 219.002 105.768 94.788 360.089 396.989 247.069 259.000
F 8.75 4.058 9.02 5.822 20.081 98.490 31.684 75.964 14.054
SLM Boost
A 143.332 116.156 151.904 2347.904 2591.919 7118.338 9887.949 6395.744 3412.654
B 7.459 5.185 7.533 16.087 21.873 117.01 104.47 148.586 23.443
C 7.987 3.434 8.293 12.144 15.624 115.226 80.273 131.712 14.207
D 7.741 4.894 8.182 6.093 15.573 97.599 46.057 76.589 15.588
E 233.071 97.541 233.347 107.709 95.534 362.295 396.989 247.069 262.546
F 6.837 3.581 8.425 5.915 20.159 94.413 40.413 79.012 15.516
Compared to CPUs, modern GPUs have thousands of cores and context switching occurs significantly
less frequently. Thus, for GPU processing, each thread is responsible for a single task in our design. Once
the task is completed, the GPU core moves on to the next available task, if any. Once the computation is
completed, the loss values are returned to the main process where the minimum loss projection and split
are computed.
3.4 Experiments
Datasets. We follow the experimental setup in [40] and conduct performance benchmarking on 9 classification datasets (i.e., Circle-and-ring, 2-new-moons, 4-new-moons, Iris, Wine, B.C.W, Pima, Ionosphere, and Banknote) and 6 regression datasets (i.e., Make Friedman1, Make Friedman2, Make Friedman3, Boston, California Housing, and Diabetes). For details of these 15 datasets, we refer readers to [40].
Table 3.2: Comparison of training run-time (in seconds) of 6 settings for 6 regression datasets.
Settings Make-Friedman1 Make-Friedman2 Make-Friedman3 Boston California-Housing Diabetes
SLR
A 152.291 60.232 186.535 153.173 756.868 147.933
B 2.136 0.653 2.137 1.255 26.417 1.136
C 0.937 0.331 2.100 0.780 5.625 0.608
D 1.577 0.448 3.832 1.31 5.508 0.722
E 29.742 28.294 44.511 30.608 96.319 27.162
F 0.639 0.332 0.755 0.675 5.891 0.322
SLR Forest
A 5406.271 3346.595 7459.372 6454.979 21092.217 4413.089
B 62.527 19.244 57.117 55.204 1049.859 46.622
C 35.576 13.533 63.551 30.943 254.415 29.606
D 19.408 6.48 42.182 18.04 125.104 17.242
E 1025.801 940.359 1051.498 792.33 3190.336 775.153
F 17.993 13.718 19.898 11.178 193.741 10.879
SLR Boost
A 4678.86 1955.517 6770.808 4097.66 21612.672 4200.828
B 62.156 19.724 59.435 38.154 817.688 34.767
C 26.704 10.546 70.716 29.149 159.434 20.528
D 19.293 6.551 51.222 19.817 124.702 38.897
E 919.874 920.619 1834.926 910.271 3336.018 901.092
F 14.170 13.114 24.759 12.186 195.459 10.594
Benchmarking Algorithms and Acceleration Settings. Ensemble methods are commonly used in the
machine learning field to boost performance. An ensemble model aims to obtain better performance
than each constituent model alone. Inspired by the random forest (RF) [9] and XGBoost for DT, SLM Forest
and SLM Boost are bagging and boosting methods for SLM, respectively. The counterpart of SLM for the
regression task is called the subspace learning regressor (SLR). Again, SLR has two ensemble methods,
namely, SLR Forest and SLR Boost. They are detailed in [40].
We compare the run-time of six settings of three methods (SLM, SLM Forest and SLM Boost) in Table
3.1 and six settings of SLR, SLR Forest and SLR Boost in Table 3.2. They are:
A Probabilistic search in pure Python,
B Probabilistic search in Cython and C++, single-threaded,
C Probabilistic search in Cython and C++, multi-threaded,
D Probabilistic search in CUDA/GPU,
E APSO in pure Python,
F APSO in Cython and C++, multi-threaded.
For a given dataset, settings [A]-[D] provide the same classification (or regression) performance in terms of
classification accuracy (or regression MSE), while settings [E] and [F] offer the same classification accuracy
(or regression MSE). Their main difference lies in the model training time. All settings are implemented on
an Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz with 12 cores / 24 threads and an Nvidia Quadro M6000 GPU. All
hyperparameters are the same. The number of trees for the two SLM/SLR ensemble methods (i.e., SLM/SLR
Forest and SLM/SLR Boost) is set to 30.
Acceleration by Parallel Processing. We compare the run-time of SLM, SLM Forest and SLM Boost
for the 9 classification datasets in Table 3.1 and that of SLR, SLR Forest and SLR Boost for the 6 regression datasets in Table 3.2. The shortest run-time is highlighted in bold face. We see from the tables that
simply changing the implementation from Python to C++ yields a speed-up factor of 40x to 100x across
all datasets. After that, the speed-up is incremental. Multithreading should lead to increased performance
in theory. In practice, however, it is greatly influenced by the state of the machine, for example, by caching,
task scheduling, and whether other cores are occupied by other processes. Moreover, additional overhead
occurs in spawning, managing, and destroying threads. Typically, a thread with a longer computation is more
suitable for multithreading. Thus, the performance improvement on smaller datasets, where computations
take less time, is less pronounced. Similarly, for single SLM tree training, the model training time on
GPU is notably worse due to the overhead of moving data to GPU memory. As the training scales (e.g.,
with ensembles of SLM/SLR trees such as SLM/SLR Forest and SLM/SLR Boost), CUDA/GPU generally
outperforms multithreading.
Table 3.3: Comparison of classification accuracy (in terms of %) of SLM, SLM Forest and SLM Boost with probabilistic search and APSO acceleration methods on 9 classification datasets.

Settings              Circle-    2-new-  4-new-  Iris   Wine   B.C.W  Pima   Iono-   Bank-  #Iter
                      and-ring   moons   moons                               sphere  note
SLM [A,B,C,D]         96.50      100     99.75   98.33  98.61  97.23  77.71  90.07   99.09  1000
SLM [E,F]             96.50      100     99.75   98.33  100    98.68  77.71  92.19   99.64  110
SLM Forest [A,B,C,D]  100        100     100     98.33  100    97.36  79.00  95.71   99.81  1000
SLM Forest [E,F]      100        100     100     98.33  98.61  97.80  77.71  95.03   100    110
SLM Boost [A,B,C,D]   100        100     100     98.33  100    98.83  77.71  94.33   100    1000
SLM Boost [E,F]       100        100     100     98.33  100    97.80  78.34  94.33   100    110
Acceleration by APSO. To demonstrate the effectiveness of APSO, we compare the classification
and regression performance of probabilistic search and APSO in Tables 3.3 and 3.4, respectively. Their
performance in terms of classification accuracy and regression MSE is comparable. As shown in the last
column of both tables, the iteration number of APSO is about 10% (for SLM) and 5% (for SLR) of that of
probabilistic search.
Table 3.4: Comparison of regression mean-squared-errors (MSE) of SLR, SLR Forest and SLR Boost with probabilistic search and APSO acceleration methods on 6 regression datasets.

Settings              Make       Make       Make       Boston  California  Diabetes  #Iter
                      Friedman1  Friedman2  Friedman3          housing
SLR [A,B,C,D]         20.46      125830     0.078      65.68   1.172       5238.8    2000
SLR [E,F]             13.52      12001      0.022      27.88   0.575       4018.7    110
SLR Forest [A,B,C,D]  5.46       1641       0.007      13.18   0.407       2624.4    2000
SLR Forest [E,F]      4.68       2144       0.012      13.48   0.419       2640.7    110
SLR Boost [A,B,C,D]   3.85       670        0.008      13.07   0.365       2710.0    2000
SLR Boost [E,F]       3.95       662        0.005      12.91   0.341       2711.2    110

Joint Acceleration by Parallel Processing and APSO. We can see the effect of joint acceleration by
parallel processing and APSO from setting F in Tables 3.1 and 3.2. We observe that the single-threaded C++
implementation of APSO has a speed comparable to that of the GPU version of probabilistic search. We have not
yet implemented a GPU version of APSO acceleration but expect that it would provide another level of boost in
shortening the training run time.
In summary, APSO acceleration reduces the time complexity of SLM/SLR by reducing the number of
iterations to 5-10% of that of probabilistic search. The C++ implementation and parallel processing reduce the
training time of each iteration by a factor of 40x to 100x. With the combination of both, APSO SLM/SLR with
the C++ implementation gives the best overall performance in training time, classification accuracy and
regression error. As compared to the original SLM/SLR, the training time can be accelerated by a factor of up
to 577.
3.5 Conclusion
Two ways to accelerate SLM were proposed in this work. First, we applied the particle swarm optimization (PSO) algorithm to speed up the search for a discriminant dimension, which was accomplished by
probabilistic search in the original SLM. The acceleration of SLM by PSO requires 10-20 times fewer iterations. Second, we leveraged parallel processing in the implementation. It was shown by experimental
results that, as compared to the original SLM/SLR, the accelerated SLM/SLR can achieve a speed-up factor of up
to 577 in training time while maintaining comparable classification and regression performance.
The datasets considered in Chapter 2 and this work are of lower dimensions. Recently, unsupervised
representation learning for images and videos has been intensively studied. Learned representations
and their labels can be fed into a classifier/regressor for supervised classification and regression. This
leads to the traditional pattern recognition paradigm, where feature extraction and classification modules
are cascaded for various tasks. The feature dimensions for these problems are on the order of hundreds. It
is important to develop SLM for high-dimensional image and video data to explore its full potential.
Chapter 4
Feedforward Efficient Machine Learning for High-Dimensional Data
Sources
4.1 Introduction
Image classification is a task in computer vision and machine learning that involves categorizing images
into predefined classes or labels. For example, an image classification system might be trained to recognize
and classify different types of animals, such as dogs, cats, and birds. The goal of image classification is
to accurately predict the class of an input image based on its visual features. Image classification has a
wide range of applications, including object recognition, scene understanding, and image tagging. It is an
important component of many computer vision systems and has been extensively studied in the field of
machine learning.
There are many different approaches to image classification, including traditional machine learning
algorithms and deep learning models. Traditional machine learning algorithms rely on a two-stage learning
framework with feature extraction and classification: the feature extraction stage relies on hand-crafted
features extracted from the images, and for classification, traditional methods such as SVM are applied
for supervised learning. In contrast, deep learning models utilize end-to-end training and learn to extract
features automatically from the data. Deep learning has achieved remarkable success in the image classification
field; methods like AlexNet [71], VGG [112], ResNet [47], and Vision Transformer [32] have achieved state-of-the-art
results on various public image classification datasets. However, deep learning has also been criticized for its high
computational cost and large model size.
One of the main criticisms of deep learning is the high computational cost of training deep neural networks. Deep learning uses a pre-designed architecture and backpropagation as its learning paradigm. Training
a deep neural network requires a large amount of data and computational resources, such as processing power and memory. This can be a significant barrier for researchers and organizations with limited
resources. Additionally, the training process can be time-consuming, requiring hours or even days to
complete. Another criticism of deep learning is the large model size of trained neural networks. Deep
neural networks can have millions or billions of parameters, which can require a significant amount of
storage space. This can be problematic for deploying deep learning models on devices with limited storage
capacity, such as smartphones or Internet of Things (IoT) devices.
While deep learning has achieved impressive results in many applications, it is important to also consider the computational cost and model size when choosing deep learning for a specific problem. With
the understanding of traditional machine learning algorithms and deep learning methods, green learning (GL)
methods target designing an efficient learning system with a small model size and fast training time. Green
learning focuses not only on the effectiveness of ML models but also evaluates the efficiency of training
and inference, and studies the trade-off between efficiency (e.g., the computational cost and model
size) and effectiveness (e.g., accuracy for classification tasks, mean squared error for regression tasks,
or mean average precision for detection tasks). It is suitable for mobile/edge computing. GL adopts a
modularized design where each module is optimized independently for implementation efficiency.
The model and computational complexity can be evaluated at the training stage and the inference stage. A
series of green learning frameworks [75] focus mainly on the training stage and compare training time
and model size with deep learning methods, while our proposed method focuses more on the inference stage
and deployment environments. Our proposed efficient object recognition method targets applications
with limited computation and storage resources, such as models for mobile devices or Internet of Things
(IoT) devices. To demonstrate efficient object recognition, we focus on the count of floating point
operations (FLOPs) during inference and the model size for the efficiency of the model, and the inference
accuracy for the effectiveness of the model. The efficient object recognition system offers a flexible trade-off
between efficiency and effectiveness and promises ML models with high performance.
4.2 Background Review
Image classification methods can be generally categorized into traditional machine learning methods
and deep learning methods.
Traditional machine learning approaches have been used for image classification for many years. These
approaches typically involve hand-crafted feature extraction from the image data and using these features
as input to a machine learning algorithm, such as a support vector machine (SVM) or a decision tree.
Popular hand-crafted feature extraction techniques used in traditional machine learning for image
classification include the Scale-Invariant Feature Transform (SIFT) [89], Speeded Up Robust Features (SURF) [5],
local binary patterns (LBP) [101], Bag of Visual Words (BoVW) [129], and the Histogram of Oriented
Gradients (HOG) [26]. In addition, machine learning algorithms like random forest, naive Bayes, and k-nearest
neighbors have been used as classifiers trained on the extracted features.
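As a toy illustration of this two-stage paradigm, the sketch below pairs a deliberately simplified, HOG-like orientation histogram (stage one) with a plain 1-nearest-neighbor rule (stage two). Both are our own minimal stand-ins, not the cited implementations.

```python
import numpy as np

def grad_hist_feature(img, n_bins=8):
    """Stage one: a global histogram of gradient orientations weighted by magnitude
    (a much-simplified, HOG-like hand-crafted feature for illustration only)."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi  # orientation folded into [0, pi)
    hist, _ = np.histogram(ang, bins=n_bins, range=(0, np.pi), weights=mag)
    return hist / (hist.sum() + 1e-8)

def knn_predict(train_feats, train_labels, feat):
    """Stage two: 1-nearest-neighbor classification on the extracted features."""
    dists = np.linalg.norm(train_feats - feat, axis=1)
    return train_labels[np.argmin(dists)]
```

The two stages are trained and applied independently, which is exactly the cascade structure of the traditional paradigm described above.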
The field of image classification has witnessed remarkable advancements in performance due to the
emergence of deep learning techniques. These advancements began with the introduction of the ImageNet
dataset [27] and the pioneering work of AlexNet [71], which marked the onset of the deep learning era. Subsequent
models such as VGGNet [112], GoogLeNet [117], and ResNet [47] have achieved impressive
accuracy levels exceeding 90%. Over time, the performance of these models has gradually improved to the
extent that it rivals human capability, as evaluated by top-5 error metrics [6], and ResNet has surpassed
human-level performance on the large-scale ImageNet dataset [108]. The most recent breakthrough in image
classification is the introduction of the Vision Transformer (ViT) model [32]. The ViT model replaces
convolutional layers with self-attention mechanisms, allowing it to capture global dependencies and
relationships within the image. This paradigm shift has garnered significant attention and has shown
promising results. The ViT model has achieved competitive performance on various image classification
benchmarks, demonstrating its capability to handle both large-scale datasets and complex visual tasks. As a
result, attention in image classification research has primarily shifted towards deep learning methods over
the past decade. The availability of large-scale training datasets, end-to-end optimization, and architecture
search encompassing convolutional neural networks (CNNs) and transformers have further propelled the
performance of image classification, continually pushing the boundaries of what is achievable.
Learning from traditional machine learning approaches and deep learning methods, green learning
(GL) was proposed with the aim of creating an efficient learning system with a small model size and fast training
time, suitable for mobile and edge computing. GL uses a modularized design where
each module is optimized independently for efficient implementation. Prof. Jay Kuo first introduced GL
[75] in his research with the goal of understanding deep learning. Later on, several joint spatial-spectral
transforms such as the Saak and Saab transforms were proposed to extract image embeddings without the
need for backpropagation. These transforms were used to build learning systems such as PixelHop [143],
PixelHop++ [18] and E-PixelHop [131]. Recently, a feature selection tool called the discriminant feature test
(DFT) was proposed, which bridges the gap between image embeddings and discriminant features, making
GL more mature. GL has had successful applications in areas including image classification [18, 131,
143], image enhancement [4], image quality assessment [95], 3D medical image analysis [91], point cloud
classification, segmentation, and registration [63, 64, 65], face biometrics [106, 107], deepfake image/video
detection [12, 13, 143, 144], object tracking [140, 141, 142], etc.
Figure 4.1: The overview of the Efficient Object Recognition System with example RGB image and 5 × 5
filter size in SSL
4.3 Proposed Method
The proposed object recognition system takes the traditional two-stage framework for the image classification
task, i.e., a representation learning stage and a decision learning stage. As shown in Fig. 4.1, at the
representation learning stage, we follow a feedforward unsupervised feature learning framework named
successive subspace learning (SSL) [17] to extract a rich feature representation; then, we propose
unsupervised and supervised feature learning schemes to effectively reduce the feature dimension and
generate the feature representation for the decision learning stage. At the decision learning stage, we
propose supplemental feature generation to enhance the efficiency and effectiveness of the main classifier,
a gradient boosting decision tree (GBDT), and we propose a complementary feature learning module that learns
more separable nonlinear features with a single-hidden-layer MLP to efficiently improve the performance
on classes confused by the main classifier.
In this section, the SSL framework for rich feature generation and the proposed feature learning methods
are discussed in Secs. 4.3.1 and 4.3.2, and the supplemental feature generation and complementary feature
learning are discussed in Sec. 4.3.3.
4.3.1 Representation Learning
In this section, we first discuss the SSL framework, which generates multistage feature maps with different
receptive fields that serve as a rich feature representation for images, in Sec. 4.3.1.1. Then, various subspace
learning methods and corresponding filter banks are detailed in the following sections.
4.3.1.1 Successive Subspace Learning and Filter Banks
Subspace learning methods are a class of techniques that have been widely utilized in various fields, including signal and image processing, pattern recognition, and computer vision. These methods are often
referred to by different names, depending on the context in which they are used, such as manifold learning.
In general, subspace methods involve representing the feature space of a specific object class, such as the
subspace of the dog object class, or identifying the dominant feature space by eliminating less important
features, such as through the use of principal component analysis (PCA). The subspace representation
provides a useful tool for signal analysis, modeling, and processing. Subspace learning involves finding
subspace models that enable concise data representation and accurate decision-making based on training
samples.
Many existing subspace methods are implemented in a single stage. However, there are situations
where a multi-stage subspace learning approach may be beneficial. This is the case when the input subspace is not fixed, but rather grows from one stage to the next. This process, referred to as "successive
subspace growing," can occur naturally in convolutional neural network (CNN) architectures, where the
response in deeper layers has a larger receptive field. Additionally, special attention should be paid to the
interface between two consecutive stages in a cascade configuration when implementing a multi-stage
subspace learning approach.
In this section, we discuss subspace learning methods that utilize different filter banks to generate
multi-stage feature maps with varying receptive fields. These feature maps serve as a rich representation
57
of features for images and are produced through a successive process. The use of filter banks allows for
the creation of feature maps with diverse receptive fields, providing a more comprehensive representation
of the input image. These feature maps are an important component of subspace learning methods and
can be utilized for a variety of tasks such as image classification.
4.3.1.2 Saab Transform and Saab Filter
With the effort to understand deep learning by Kuo et al. [73], several joint spatial-spectral transforms such as the Saak [19] and Saab [76] transforms were proposed to extract image embeddings without backpropagation. Based on Saak and Saab filters for multi-stage subspace learning, unsupervised feature learning
systems named PixelHop [17] and PixelHop++ [18] were developed for efficient and effective rich feature
generation for high-dimensional data sources such as images and videos.
The Saab transform is a feature extraction method developed for use in image and signal processing
applications. It is a type of dimensionality reduction technique that aims to reduce the number of features
in a dataset while preserving as much of the original information as possible. The transform is a variant of
PCA that aims to address the nonlinearity of activations in a neural network. This is achieved by adding
a bias vector to the PCA transformation, which helps to annihilate the nonlinearity of the activations.
It is based on the idea of projecting the data onto a set of basis functions, which are chosen to capture
the most important features of the data. These basis functions are constructed using a combination of a
Karhunen-Loève transform and PCA. The resulting basis functions are used to transform the original data
into a lower-dimensional space.
More specifically, for feature learning on image datasets, the Saab transform on a 4-dimensional input
tensor with dimension [N, H, W, K] can be summarized as follows, in which N represents the number
of samples, H and W represent the spatial dimensions of the input feature, and K represents the spectral
dimension of the input feature.
Figure 4.2: The block diagram of single stage Saab transform on one spatial location
1. Neighborhood Construction. For each spatial location in [H, W], with predefined Saab filter
size S, the S×S spatial neighborhood with spectral dimension K is concatenated to form a 1-dimensional
tensor [1, S×S×K].
2. DC and AC Subspaces. For each 1-dimensional tensor, the patch mean is defined as the DC (direct
current) component, and the residual after removing the DC is defined as the AC (alternating current) component;
the original signal space is thus decomposed into two subspaces, DC and AC.
3. Saab Filters. After neighborhood construction and DC/AC subspace separation, the AC subspace
of the input tensor of dimension [N, H, W, K] has 2 dimensions [N×H×W, S×S×K]. PCA is conducted on
the AC subspace; the eigenvectors from PCA are recorded as Saab filters and the eigenvalues as Saab
energies. A Saab-energy-based filter selection is applied to discard low-energy Saab filters for dimension
reduction, and a bias term, defined as the maximum absolute value, is recorded to annihilate the nonlinearity
of the activations.
4. Saab Response. With the preserved Saab filters, the AC subspace of the input tensor is transformed
into a 2-dimensional tensor named the AC response. The AC response is then combined with the DC subspace
to form the Saab response of the current stage with dimensions [N×H×W, K']; the response feature map can
then be reshaped into a 4-dimensional tensor [N, H, W, K'] for the next stage.
The single-stage Saab transform at one spatial location is shown in Fig. 4.2.
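The four steps above can be sketched in numpy as follows. This is a simplified illustration: non-overlapping stride-S neighborhoods and a cumulative-energy threshold are simplifying assumptions, and the function name `saab_fit` is ours.

```python
import numpy as np

def saab_fit(X, S=2, energy_thresh=0.99):
    """One-stage Saab transform on X of shape [N, H, W, K] (simplified sketch)."""
    N, H, W, K = X.shape
    h, w = H // S, W // S
    # 1. Neighborhood construction: non-overlapping SxS patches -> [N*h*w, S*S*K]
    patches = (X[:, :h * S, :w * S, :]
               .reshape(N, h, S, w, S, K)
               .transpose(0, 1, 3, 2, 4, 5)
               .reshape(-1, S * S * K))
    # 2. DC/AC decomposition: the patch mean is the DC component, the residual is AC
    dc = patches.mean(axis=1, keepdims=True)
    ac = patches - dc
    # 3. PCA on the AC subspace; eigenvectors = Saab filters, eigenvalues = energies
    cov = ac.T @ ac / ac.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    energy = np.cumsum(eigvals) / eigvals.sum()
    n_keep = int(np.searchsorted(energy, energy_thresh)) + 1  # drop low-energy filters
    filters = eigvecs[:, :n_keep]
    # 4. Saab response: DC channel plus AC projections, reshaped to [N, h, w, K']
    resp = np.concatenate([dc, ac @ filters], axis=1)
    bias = np.max(np.abs(resp))  # bias term recorded for the next stage
    return resp.reshape(N, h, w, 1 + n_keep), filters, bias
```

The returned filters and bias would be stored for inference, and the 4-dimensional response feeds the next stage.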
Figure 4.3: (a) Comparison of the traditional Saab transform and the channelwise Saab transform. (b)
Illustration of the tree-decomposed Saab feature
4.3.1.3 Channelwise Saab Transform
To reduce the size of the Saab transform, the channelwise Saab transform was proposed in PixelHop++ [18].
Since PCA decorrelates the covariance matrix into a diagonal matrix, all channel components are decoupled. As a variant of PCA, we assume the original Saab coefficients to be weakly correlated in the
spectral domain; thus, during a single-stage Saab transform, each channel of the spectral dimension can be
transformed independently in the channelwise Saab transform. The comparison between the two
Saab transforms is illustrated in Fig. 4.3 (a).
The feature learning structure forms a tree, as shown in Fig. 4.3 (b). With the Saab energies preserved
from Saab filter learning, we define channels with low Saab energy as leaf nodes, discard them from further
Saab transforms, and only preserve them as local features. Channels with high Saab energy are kept as
parent nodes, on which the channelwise Saab transform is conducted.
Compared to the traditional Saab transform, the channelwise Saab transform requires less memory and lower
computational complexity given the same input tensor. When conducting the transform on an input tensor of [N,
H, W, K] with Saab filter size S×S, the PCA takes a tensor with input dimension [N×H×W, S×S×K], while
for the channelwise Saab transform, the PCA takes input of dimension [N×H×W, S×S]; the peak memory
requirement is reduced by a factor of K. As for computational complexity, in the channelwise Saab transform the
selection of parent channels for each stage is based on the Saab energies from the previous stage, while
for the traditional Saab transform no feature selection is applied. Benefiting from the Saab energy threshold for
selecting parent channels and discarding low-energy spectral dimensions, the model size of PixelHop++ can
be flexibly controlled.
To further reduce the computational complexity and model size of the channelwise Saab transform, we
can greedily apply parent channel selection in the PixelHop++ units utilizing only the primary eigenvector
from PCA, i.e., we only pass the DC and AC1 channels to the next stage of PixelHop++.
The Saab transform and its variants have been shown to be effective at extracting features from high-dimensional datasets and have been applied to a variety of applications, including image classification, face
recognition, and speech processing. They have also been used in combination with other techniques, such as
support vector machines (SVMs) [23] or gradient boosting decision trees [15], to improve performance.
4.3.1.4 Haar Transform and 2-D Haar Filter
The Haar transform is a type of wavelet transform that uses a series of sequentially nested Haar
wavelets to represent a signal or an image as a sum of wavelets. It was developed by mathematician
Alfred Haar and is used for various purposes, including image compression and feature detection in image
processing. The Haar transform works by decomposing an image into a series of increasingly fine resolution sub-images, or wavelets, which can be used to represent the image in a more compact form. The
transformation is particularly useful because it can be implemented very efficiently, making it well-suited
for use in real-time applications.
The 2D Haar transform can be used to decompose a two-dimensional image into a series of increasingly
fine resolution sub-images, or wavelets. It works by applying the Haar transform separately to the rows
Figure 4.4: 2-D Haar filters: (a) DC component, (b) horizontal difference, (c) vertical difference, (d) diagonal difference
and columns of the image, resulting in a set of wavelets that can be used to represent the image in a more
compact form. One of the key features of the Haar transform is its ability to capture both the average and
the variation of the pixel values in an image. This makes it useful for tasks such as image compression,
where the goal is to represent the image with as few bits as possible while still maintaining a reasonable
level of quality. The Haar transform is also used in image recognition algorithms, where it can be used to
extract features from an image that are indicative of the content or class of the image.
Fitting into SSL, we utilize the 2×2 Haar transform as a subspace learning method, which captures the
DC component, horizontal difference, vertical difference, and diagonal difference as four responses for each
patch. The set of 2×2 2-D Haar filters is shown in Fig. 4.4. For image datasets, we conduct the Haar transform
channelwise at each stage; for the first stage, we can utilize the RGB input channels, the HSV input channels,
or conduct PCA on the three input channels. For the multistage Haar transform, we only use the DC channel
of each channelwise Haar response as the parent node for the next stage, and we apply 2×2 max pooling to
reduce the spatial dimension at the end of each stage. The Haar transform serves as the most computationally
efficient SSL method.
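One channelwise 2×2 Haar stage can be sketched in numpy as below. The 1/4 averaging normalization is one convention choice (orthonormal Haar uses 1/2), and the name `haar_2x2` is ours.

```python
import numpy as np

def haar_2x2(channel):
    """Apply the four 2x2 Haar filters of Fig. 4.4 to one channel [H, W] (H, W even)."""
    a = channel[0::2, 0::2]  # top-left pixel of each 2x2 patch
    b = channel[0::2, 1::2]  # top-right
    c = channel[1::2, 0::2]  # bottom-left
    d = channel[1::2, 1::2]  # bottom-right
    dc = (a + b + c + d) / 4.0  # (a) DC component: patch average
    hd = (a - b + c - d) / 4.0  # (b) horizontal difference
    vd = (a + b - c - d) / 4.0  # (c) vertical difference
    dg = (a - b - c + d) / 4.0  # (d) diagonal difference
    return np.stack([dc, hd, vd, dg], axis=-1)  # [H/2, W/2, 4]
```

In the multistage scheme above, only the DC channel of this output would be fed to the next stage after 2×2 max pooling; the three difference channels stay as local features.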
Figure 4.5: Illustration of spatial and spectral pooling from feature map
4.3.2 Feature Learning
With SSL as the feature learning paradigm, rich feature maps are extracted as representations for image
datasets. To further reduce the feature dimension and fuse feature maps of different receptive fields
from the learned SSL features, we propose spatial and spectral feature pooling as unsupervised feature
dimension reduction for each stage's feature map, and we adopt DFT [133] as supervised feature selection
both for reducing the dimension of each feature map and for combining different feature maps. Given the
output feature maps from SSL, we first utilize unsupervised feature dimension reduction to transform each
feature map into a 1-D feature vector; then we utilize supervised feature dimension reduction to further reduce
the features and concatenate the features from multiple stages.
4.3.2.1 Unsupervised Feature Dimension Reduction
Given an output feature map from SSL with dimensions [S, S, K], as shown in Fig. 4.5, we perform unsupervised
feature pooling to reduce the feature dimension of the feature map. To extract a diversified set of
features from each stage, we consider multiple aggregation schemes, such as spatial pooling along the
S×S dimensions and spectral pooling along the K dimension.
Figure 4.6: Illustration of Discriminant Feature Test
For spatial pooling, we can take the maximum or the mean of the responses in small non-overlapping
regions for each of the K spectral channels; the spatial size of the features after aggregation is denoted as
P×P (when the pooling region is S×S, P = 1). For spectral pooling, we can take the L2 norm or L1 norm
of the responses at each of the S×S spatial locations, and the feature map dimension after aggregation is [S, S,
1]. Finally, after concatenation of the corresponding aggregated features, the 1-D feature generated from the
feature map is of dimension 2 × S × S + 2 × P × P × K.
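The aggregation step above can be sketched as follows, reproducing the stated dimension count; the function name is ours, and max/mean spatial pooling with L1/L2 spectral pooling matches the schemes listed in the text.

```python
import numpy as np

def pool_features(fmap, pool=2):
    """Reduce an [S, S, K] stage output to a 1-D vector of length 2*S*S + 2*P*P*K."""
    S, _, K = fmap.shape
    P = S // pool
    # spatial pooling: max and mean over non-overlapping pool x pool regions per channel
    r = fmap[:P * pool, :P * pool].reshape(P, pool, P, pool, K)
    sp_max = r.max(axis=(1, 3)).ravel()             # P*P*K values
    sp_mean = r.mean(axis=(1, 3)).ravel()           # P*P*K values
    # spectral pooling: L1 and L2 norms over the K channels at each spatial location
    l1 = np.abs(fmap).sum(axis=-1).ravel()          # S*S values
    l2 = np.sqrt((fmap ** 2).sum(axis=-1)).ravel()  # S*S values
    return np.concatenate([sp_max, sp_mean, l1, l2])
```

For S = 4, K = 8 and a 2×2 pooling region (P = 2), the output length is 2 × 16 + 2 × 4 × 8 = 96, as the formula predicts.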
4.3.2.2 Supervised Feature Dimension Reduction
For supervised feature dimension reduction, DFT is adopted to select discriminant features from the
rich feature set. Similar to Chapter 2, for each 1-D feature with given samples and class labels, we traverse
splitting locations between the maximum and minimum values of the feature dimension and calculate
the weighted entropy loss for each splitting point. Among the possible splittings, we record the minimum
weighted entropy loss as the DFT loss of the given 1-D feature. An illustration is shown in Fig. 4.6. After
evaluating each 1-D feature in this way, the features with the lowest DFT loss are kept as the discriminant
feature subspace. This supervised feature dimension reduction is especially effective for training decision-tree-based classifiers; detailed results and analysis are discussed in Sec. 4.4.2.
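The DFT loss of a single 1-D feature can be sketched as below. Uniformly spaced candidate splits are a simplifying assumption, and the name `dft_loss` is ours.

```python
import numpy as np

def dft_loss(x, y, n_splits=16):
    """Minimum weighted entropy over candidate splits of the 1-D feature x."""
    def entropy(labels):
        if labels.size == 0:
            return 0.0
        p = np.bincount(labels) / labels.size
        p = p[p > 0]
        return -(p * np.log2(p)).sum()
    # traverse splitting locations between the minimum and maximum feature values
    thresholds = np.linspace(x.min(), x.max(), n_splits + 2)[1:-1]
    best = np.inf
    for t in thresholds:
        left, right = y[x <= t], y[x > t]
        loss = (left.size * entropy(left) + right.size * entropy(right)) / y.size
        best = min(best, loss)
    return best
```

Ranking all 1-D features by this loss and keeping the lowest-loss ones yields the discriminant feature subspace described above.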
4.3.3 Decision Learning
With unsupervised SSL for rich feature generation and feature processing for dimension reduction, the
generated feature representations of images are fed into the decision learning module. For decision
learning, we first run DFT to flexibly adjust the feature dimension for efficiency control; then we utilize
GBDT as the main classifier and propose two modules to enhance the efficiency and effectiveness of
decision learning, respectively: supplemental feature generation (SFG) and complementary feature
learning (CFL).
4.3.3.1 Supplemental Feature Generation
As discussed in Chapter 2, the performance of decision tree based classifiers depends primarily on the
input features. With DFT as feature evaluation method, we explored the possibility of generating more
discriminate features from given input features in SLM through linear projection, which contributes to the
faster converge, smaller depth and better performance of SLM tree over decision tree. For classification
tasks, the projection weights and bias can be learned efficiently through optimization based methods as
discussed in Chapter 3.
To achieve faster convergence and to boost the performance of GBDT, we propose SFG, which aims to generate more discriminant features than the original ones. SFG takes the features after DFT dimension reduction and learns a series of new feature dimensions. We evaluate the input and generated features with the DFT loss. To generate new features with lower DFT loss, we can apply probabilistic random projection or optimization methods such as APSO or SGD. As the DFT loss is not differentiable, random projection and APSO fit SFG more naturally, but their training time and computational complexity are unpredictably high. To improve training efficiency, we instead apply SGD by formulating the problem as binary classification: given C classes in the task, we divide the classes into two sets, which admits 2^(C-1) - 1 distinct combinations; e.g., when C = 10, we generate 511 extra features from SFG. This formulation assumes that the low-DFT-loss features arise from linear splittings between two sets of classes, an assumption that holds for tasks with a relatively small number of classes.
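The SGD-based formulation can be sketched as below, with a plain-numpy logistic regression standing in for the actual optimizer. The learning rate, epoch count, and the convention of fixing the first class on one side of every split (to avoid counting mirrored groupings twice) are assumptions, and both function names are illustrative.

```python
import itertools
import numpy as np

def train_binary_sgd(X, y_bin, lr=0.1, epochs=20, seed=0):
    """Logistic regression trained with plain SGD; returns (w, b)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            z = X[i] @ w + b
            p = 1.0 / (1.0 + np.exp(-z))
            g = p - y_bin[i]                 # gradient of the log loss
            w -= lr * g * X[i]
            b -= lr * g
    return w, b

def sfg_features(X, y, classes):
    """For C classes there are 2^(C-1) - 1 ways to split them into two
    non-empty sets. Each split defines a binary problem; the learned
    projection w.x + b supplies one new supplemental feature."""
    C = len(classes)
    rest = classes[1:]
    feats = []
    for r in range(C - 1):                   # subset sizes drawn from `rest`
        for subset in itertools.combinations(rest, r):
            group_a = {classes[0], *subset}  # classes[0] fixed to one side
            y_bin = np.isin(y, list(group_a)).astype(float)
            w, b = train_binary_sgd(X, y_bin)
            feats.append(X @ w + b)
    return np.column_stack(feats)
```

For C = 10 the double loop enumerates 2^9 - 1 = 511 groupings, matching the 511 extra features mentioned above.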
In Sec. 4.4.4 we demonstrate the effectiveness of the SFG-generated features for the GBDT classifier, both on an example pair of confusing classes and on the whole training datasets. SFG serves as a support model for the main GBDT classifier for faster convergence, reducing the FLOPs and model size of the GBDT while maintaining similar or even better performance compared to the original GBDT.
4.3.3.2 Complementary Feature Learning
As the feature learning system and the SFG in decision learning rely on linear transformation and selection, the nonlinear learning ability of the proposed object recognition method rests solely on the depth of the GBDT. When confusing classes and low-confidence predictions are encountered from the linear feature learning and the main GBDT classifier, the proposed CFL module efficiently learns nonlinear features to further enhance performance.
To efficiently learn separable features for the confusing classes, we apply a one-hidden-layer MLP as the CFL classifier. With the input features from DFT, we choose the number of neurons in the hidden layer such that the number of parameters in the CFL is no larger than the number of training samples. The training of the CFL module is based on the main GBDT classifier: we first identify the confusing classes from the confusion matrix through cross-validation during GBDT training, and select the top C' confusing classes for CFL. During CFL training, we first use the whole training dataset to initialize the model parameters and then zoom into the C' confusing classes.
Figure 4.7: Examples from four datasets. (a) MNIST, (b) Fashion-MNIST, (c) CIFAR-10, (d) STL-10
For the ensemble of the main GBDT classifier and the CFL model, we first run inference and make a temporary decision with the main classifier; when the decision falls into the C' classes, the sample is sent to CFL for a decision within the C' classes. The decision for the C' classes is an ensemble of the GBDT and CFL decisions under a confidence-based criterion. For each sample, the GBDT and CFL each predict a vector of length C' as the decision. We utilize entropy to evaluate the prediction confidence, i.e., the lower the entropy of the prediction vector, the higher the prediction confidence. When one of the classifiers has high confidence while the other has low confidence, we take the decision from the high-confidence classifier. When the confidences of the two classifiers are both high or both low, we average the prediction vectors to fuse the results.
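The confidence-based fusion rule can be sketched as follows. The entropy threshold separating "high" from "low" confidence is an assumption (the text does not specify one), and the function names are illustrative.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of a probability vector (natural log)."""
    p = np.clip(p, eps, 1.0)
    return float(-np.sum(p * np.log(p)))

def fuse_predictions(p_gbdt, p_cfl, conf_threshold):
    """Ensemble the GBDT and CFL prediction vectors over the C' classes:
    low entropy means high confidence; if exactly one classifier is
    confident, keep its decision, otherwise average the two vectors."""
    conf_g = entropy(p_gbdt) < conf_threshold
    conf_c = entropy(p_cfl) < conf_threshold
    if conf_g and not conf_c:
        return p_gbdt
    if conf_c and not conf_g:
        return p_cfl
    return 0.5 * (p_gbdt + p_cfl)
```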
4.4 Experiments
To demonstrate the proposed efficient object recognition framework, experiments are conducted on four popular datasets to evaluate the quality of the proposed feature representations and the effectiveness of supplemental feature generation and complementary nonlinear feature learning. The efficiency of the overall proposed method is evaluated through FLOPs and model size, and its effectiveness through prediction accuracy.
4.4.1 Datasets
For the experimental setup, we apply our proposed framework on four datasets: MNIST [83], Fashion-MNIST [125], CIFAR-10 [70], and STL-10 [22]. Examples from the four datasets are shown in Figure 4.7.
MNIST is a dataset of handwritten digits, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image of a digit from 0 to 9. MNIST is widely used for training and evaluating machine learning models for image classification, particularly in computer vision, as it provides a large, standard benchmark for this purpose.
Fashion-MNIST is a dataset of Zalando’s article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label
from 10 classes. Fashion-MNIST is intended to serve as a direct drop-in replacement for the original MNIST
dataset for benchmarking machine learning algorithms, as it shares the same image size and structure. The
images in Fashion-MNIST are higher quality and more diverse than those in the original MNIST dataset,
making it a more challenging and realistic dataset for machine learning tasks.
The CIFAR-10 dataset consists of 60,000 32x32 color images in 10 classes, with 6,000 images per class. The dataset is divided into five training batches and one test batch, each with 10,000 images. The test batch contains exactly 1,000 randomly selected images from each class. The training batches contain the remaining images in random order, so some training batches may contain more images from one class than another; between them, the training batches contain exactly 5,000 images from each class.
The STL-10 dataset is an image classification dataset developed at Stanford University by Adam Coates, Honglak Lee, and Andrew Ng. It is inspired by the CIFAR-10 dataset but with some modifications; in particular, each class has fewer labeled training examples than in CIFAR-10. It consists of 10 classes of images, with a total of 5,000 images in the training set and 8,000 images in the test set. It is considered more challenging and realistic than some other popular image classification datasets, such as MNIST and Fashion-MNIST, due to the larger image size and the greater diversity of the images.
Table 4.1: Datasets
Dataset Input Train Size Test Size #Classes
CIFAR-10 32x32x3 50,000 10,000 10
STL-10 32x32x3 5,000 8,000 10
MNIST 28x28x1 60,000 10,000 10
Fashion-MNIST 28x28x1 60,000 10,000 10
4.4.2 Feature Representation Analysis
Results for SSL analysis In this section, the performance of the proposed feature learning and feature processing models detailed in Sec. 4.3.1 is discussed. The different filters proposed in Sec. 4.3.1.1 are evaluated for both efficiency and effectiveness; results on the CIFAR-10 dataset are summarized to demonstrate the flexibility provided by the various subspace learning methods. More specifically, SSL features with the Saab transform are summarized in Table 4.2 and SSL features with the Haar transform in Table 4.3. In each table, SSL features from different settings at different stages are compared. The efficiency of subspace learning is evaluated with the cumulative FLOPs needed to generate the feature map from the original image, and the effectiveness of the features with the accuracy of a GBDT trained on the provided features. The number of trees in the GBDT is fixed at 2,000, while the other GBDT hyperparameters, including the learning rate and maximum tree depth, are tuned to maximize accuracy; the FLOPs of the GBDT are not included in the tables. For each stage of SSL under each setting, both the original features concatenated from the feature map and the aggregated features are evaluated.
For the Saab transform, shown in Table 4.2, we compare channel-wise Saab results with 5×5 filters, 5×5 filters with greedy selection of the AC1 channel for the parent node only, 3×3 filters, and 3×3 filters with greedy AC1 selection; the four settings are ranked from top to bottom by #FLOPs as the efficiency measurement. The results from 5×5 and 3×3 filters show that the second-stage SSL feature maps are the most effective, while the results from 5×5 AC1 and 3×3 AC1 show decreasing effectiveness of the feature maps, similar to the Haar transform in Table 4.3. Aggregation of the features offers effective dimension reduction; e.g., from the 5×5 results, we find that the aggregated features from stage 1 are as effective as the features from stage 3 with similar feature dimensions. For the Haar transform, shown in Table 4.3, we compare results from different input image channel settings, including RGB, HSV, and PQR (the result of PCA). Comparing Table 4.2 and Table 4.3, the Haar transform provides extra efficiency over the Saab transform while preserving reasonable effectiveness at extremely low feature dimensions.
Across the different SSL settings, including the Saab and Haar transforms, a flexible trade-off between FLOPs, accuracy, and feature dimension is established for the rich feature representation.
Table 4.2: Evaluation results on Saab filters
Stage idx Feature Map Dimension #FLOPs Accuracy Aggregated Feature Dimension Accuracy
5×5
1 14x14x42 1.58M 74.10% 560 69.25%
2 5x5x254 5.27M 76.53% 304 63.36%
3 1x1x535 5.43M 70.97% 537 70.83%
5×5 AC1
1 14x14x52 1.52M 75.15% 600 65.42%
2 5x5x104 4.78M 58.42% 154 44.10%
3 1x1x208 4.85M 42.80% 210 43.51%
3×3
1 15x15x20 245.92K 69.51% 630 63.71%
2 7x7x87 534.40K 73.76% 881 66.78%
3 3x3x244 716.67K 49.72% 262 45.55%
4 1x1x414 736.44K 44.93% 416 45.66%
3×3 AC1
1 15x15x20 245.92K 73.25% 630 65.25%
2 7x7x40 526.46K 66.15% 458 60.49%
3 3x3x80 609.46K 53.81% 98 41.58%
4 1x1x160 615.94K 50.91% 162 50.82%
Table 4.3: Evaluation results on Haar filters
Stage idx Feature Map Dimension #FLOPs Accuracy Aggregated Feature Dimension Accuracy
RGB
1 16x16x12 57.66K 70.32% 560 60.82%
2 8x8x12 111.66K 66.96% 176 58.43%
3 4x4x12 123.42K 58.65% 44 45.85%
4 2x2x12 125.58K 45.33% 20 36.81%
HSV
1 16x16x12 57.66K 68.78% 560 59.13%
2 8x8x12 111.66K 67.55% 176 58.15%
3 4x4x12 123.42K 60.60% 44 45.04%
4 2x2x12 125.58K 46.23% 20 36.08%
PQR
1 16x16x12 57.66K 70.80% 560 62.95%
2 8x8x12 111.66K 69.19% 176 61.01%
3 4x4x12 123.42K 60.81% 44 46.75%
4 2x2x12 125.58K 44.53% 20 35.05%
Results for DFT analysis To evaluate the effectiveness of DFT, we utilize the SSL features from different stages of Table 4.2 and compare the DFT loss curve with the effectiveness of the DFT-selected features, i.e., the accuracy of a GBDT classifier trained on the selected features. The results are shown in Fig. 4.8, where each row plots the correlation between the DFT loss curve on the left and the GBDT accuracy curve on the right, with the number of selected features as the x-axis; the fourth row shows the correlation for the fused feature selection, in which the elbow point from each stage is used to select features from that stage.
From Fig. 4.8 we observe a reasonable correlation between the DFT loss curve and the classification accuracy curve for each individual stage; the elbow point of the accuracy curve occurs slightly later than that of the DFT loss curve in terms of the number of selected features. For each stage, we therefore use DFT to select a post-elbow point as the feature dimension and concatenate the selected features to fuse them. As shown in row 4 of Fig. 4.8, the fused features exhibit strong saturation and an early elbow point in the DFT curve, while the saturation of the accuracy curve is reduced. Comparing the best-performing stage 2 with the fused stages, we reduce the feature dimension from more than 4,000 to 2,500 without losing feature effectiveness. The number of features can thus be reduced effectively when DFT is used as a supervised feature selection method to fuse features from different stages of SSL.
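One way to realize the post-elbow selection described above is sketched below. The elbow detector (the point of the sorted loss curve farthest from the chord joining its endpoints) and the post-elbow margin are assumptions, since the text does not specify how the elbow point is located; `select_post_elbow` is an illustrative name.

```python
import numpy as np

def select_post_elbow(dft_losses, margin=0.1):
    """Select feature dimensions from the elbow of the sorted DFT-loss
    curve. Features are ranked by ascending DFT loss; a post-elbow margin
    (fraction of the total dimension) keeps slightly more features, since
    accuracy saturates a little later than the DFT curve."""
    order = np.argsort(dft_losses)
    curve = np.asarray(dft_losses)[order]
    n = len(curve)
    # Unnormalized distance of each point from the endpoint-to-endpoint chord.
    x = np.arange(n, dtype=float)
    x0, y0, x1, y1 = 0.0, curve[0], float(n - 1), curve[-1]
    dist = np.abs((y1 - y0) * x - (x1 - x0) * curve + x1 * y0 - y1 * x0)
    elbow = int(np.argmax(dist))
    keep = min(n, elbow + 1 + int(margin * n))
    return order[:keep]          # indices of the selected features
```

Running this once per SSL stage and concatenating the selected indices gives the fused feature set.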
Figure 4.8: Correlation between DFT selection and GBDT performance at each stage of the Saab Transform
with 5×5 filters
4.4.3 Results and Analysis
We benchmark deep learning methods, Pixelhop, and our proposed method, evaluated by inference FLOPs, model size, and accuracy on the test dataset. For LeNet-5, the architecture is modified as shown in Table 4.4: for MNIST we follow the settings from [81], while for Fashion-MNIST and CIFAR-10 we modify the architecture to have more parameters for better performance. The results on the four datasets are based on these settings.
Table 4.4: Comparison of the original and the modified LeNet-5 architectures on benchmark datasets
Dataset MNIST Fashion MNIST CIFAR-10 & STL-10
1st Conv. Kernel Size 5 × 5 × 1 5 × 5 × 1 5 × 5 × 3
1st Conv. Kernel No. 6 16 32
2nd Conv. Kernel Size 5 × 5 × 6 5 × 5 × 16 5 × 5 × 32
2nd Conv. Kernel No. 16 32 64
1st FC. Filter No. 120 200 200
2nd FC. Filter No. 84 100 100
Output Node No. 10 10 10
For the CIFAR-10 and STL-10 datasets, we compare methods including ResNet101, VGG16, LeNet5, Pixelhop, and our proposed efficient object recognition, reporting FLOPs, model size, and prediction accuracy for each method. We apply data augmentation with random horizontal flip and random cropping for all compared methods. For ResNet101 and VGG16, we follow the standard architecture designs; for LeNet5, we follow the modified version in Table 4.4. The deep networks are trained with the Adam optimizer; the learning rate is set to 0.01 for LeNet5 and 0.001 for VGG16 and ResNet101, with momentum 0.9 and weight decay 1e-3. For Pixelhop, we apply the design from [143]. For the proposed method, we trade off the efficiency and effectiveness of the SSL filter design and the hyperparameters for SFG and CFL to balance performance. We apply the 5×5 filters for the best-performing SSL features; to fuse the features from multiple stages, we apply the post-elbow feature dimension selection criterion and select 2,000 features in total for decision learning. From these 2,000 features, we use DFT to select the 200 most discriminant features as the base for SFG, and apply
SGD to generate new features traversing the two-set combinations among the 10 classes, i.e., 511 new feature dimensions are generated from SFG; we discuss the effectiveness of SFG in Sec. 4.4.4. We train an XGBoost classifier with the combination of the 2,000 input features and the 511 SFG-generated features, setting the depth to 5 and the number of trees to 1,000. For CFL, we use the 200 most discriminant features selected for SFG to train an MLP for the top 4 confusing classes of the XGBoost model among the animal classes [bird, cat, deer, dog], and we set the number of hidden-layer neurons to 1,000. The results are shown in Table 4.6. Compared to LeNet5, our proposed method achieves nearly 10% higher accuracy with fewer FLOPs and around 60% more model parameters.
Table 4.5: Comparison of 32×32 and 96×96 Input for STL10 Datasets for Deep Models
Input Resolution Metrics Resnet101 VGG16 Lenet5
32x32
FLOPs 324.4 M 875.06 M 28M
Model Size 44.55 M 138.36 M 866K
Accuracy (%) 57.25 65.75 50.89
96x96
FLOPs 2.88 G 5.90 G 233.36 M
Model Size 44.55 M 138.36 M 5.72 M
Accuracy (%) 59.46 68.95 53.95
For the STL-10 dataset, we resize the images from 96×96 to 32×32 using bilinear interpolation. The performance comparison, including FLOPs, model size, and accuracy for 96×96 and 32×32 input resolutions, is summarized in Table 4.5; the lower-resolution input improves the efficiency of the deep models, reducing inference FLOPs by 7 to 9 times while losing 2-3% accuracy. For each method we then follow the same settings and hyperparameter designs as for CIFAR-10; for XGBoost, we set the number of trees to 300 for STL-10. Due to the lack of training data in STL-10, the deep methods cannot leverage their large model sizes as on the other datasets; compared to LeNet5, our proposed method achieves more than 10% higher accuracy with similar model size and 3 times fewer FLOPs.
For the MNIST and Fashion-MNIST datasets, we compare LeNet5, Pixelhop, and our method. The performance comparison includes FLOPs, model size, and inference accuracy.
Table 4.6: CIFAR10 & STL10 Results Comparison
Methods VGG16 Resnet101 Lenet5 Pixelhop Our Method
CIFAR-10
FLOPs 875.06 M 324.4 M 14.74 M 21.30 M 12.82 M
Ratio-to-Lenet5 59.40x 22.01x 1x 1.45x 0.87x
Model Size 138.36 M 44.55 M 395.01 K 1.66 M 644.09 K
Ratio-to-Lenet5 350.27x 112.78x 1x 4.20x 1.63x
Accuracy (%) 93.15 93.75 68.72 71.37 77.5
STL-10
FLOPs 875.06 M 324.4 M 14.74 M 76.72 M 4.44 M
Ratio-to-Lenet5 59.40x 22.01x 1x 5.20x 0.30x
Model Size 138.36 M 44.55 M 395.01 K 8.10 M 427.02 K
Ratio-to-Lenet5 350.27x 112.78x 1x 20.51x 1.08x
Accuracy (%) 65.75 57.25 51.89 56.48 62.07
Table 4.7: MNIST & Fashion-MNIST Results Comparison
Methods Lenet5 Pixelhop Our Method
MNIST
FLOPs 846.08 K 14.23 M 1.49 M
Ratio-to-Lenet5 1x 16.81x 1.76x
Model Size 61.71 K 3.128 M 57.82 K
Ratio-to-Lenet5 1x 50.69x 0.94x
Accuracy (%) 99.04 99.09 99.19
Fashion-MNIST
FLOPs 3.58 M 18.61 M 2.69 M
Ratio-to-Lenet5 1x 5.20x 0.75x
Model Size 194.56 K 5.99 M 233.03 K
Ratio-to-Lenet5 1x 30.79x 1.20x
Accuracy (%) 89.74 89.81 91.37
The results are summarized in Table 4.7. For LeNet5 we use the respective settings in Table 4.4, and the networks are trained with the same hyperparameters as for CIFAR-10 and STL-10. For the Pixelhop models, we follow the settings in [17]. For our proposed method, we apply 5×5 Saab filters for rich feature extraction on both MNIST and Fashion-MNIST. For MNIST, we apply DFT to select 652 features from the different feature maps and train the main XGBoost classifier with a maximum depth of 3 and 200 trees; we then select the 300 most discriminant features and train an MLP with two hidden layers of 50 and 30 neurons. For Fashion-MNIST, we apply DFT to select 769 features from all feature maps and train XGBoost with a maximum depth of 5 and 500 trees; we select the 469 most discriminant features to train an MLP with two hidden layers of 80 and 20 neurons. For both MNIST and Fashion-MNIST, due to the simplicity of the datasets and the much more lightweight GBDT required for performance saturation, applying the SFG module results in similar model size, FLOPs, and inference accuracy compared with no SFG module; for the decision learning block, only the GBDT and CFL modules are applied. As Table 4.7 shows, performance saturates on MNIST and Fashion-MNIST, so model size and FLOPs must increase drastically to gain accuracy. Since Pixelhop utilizes the non-channel-wise Saab transform for SSL and an SVM with RBF kernel for nonlinearity, more than 10 times the FLOPs and 50 times the model size are required to achieve a 0.05% inference accuracy improvement on MNIST, and more than 5 times the FLOPs and 30 times the model size to achieve a 1.56% improvement on Fashion-MNIST, which demonstrates the efficiency of lightweight deep models compared to traditional machine learning methods. Compared to LeNet-5, our proposed method achieves a 0.15% accuracy improvement with a smaller model size and twice the FLOPs on MNIST, and a 1.63% accuracy improvement with comparable FLOPs and model size on Fashion-MNIST.
In summary, our proposed method achieves better accuracy with similar model size and inference FLOPs compared to lightweight deep models such as LeNet5. For data-deficient cases like STL-10, we achieve comparable or better performance than deep models with large model sizes and FLOPs.
Figure 4.9: Convergence comparison between SFG and no SFG on CIFAR10 and STL10 datasets
4.4.4 Ablation Study
We demonstrate the effectiveness of SFG and CFL modules in this section. The proposed models in Table
4.6 for CIFAR10 dataset and STL10 dataset are used to demonstrate the effectiveness of the SFG module.
Given the 2000 features from the feature learning framework, we select the top 10% most discriminate
features using DFT to generate new 511 features. The convergences of the XGBoost models with and
without SFG are shown in Fig. 4.9. The XGBoost models are trained with depth as 5, learning rate as
0.2 and number of trees as 2000. With the support from SFG, we reach the performance saturation with
1000 trees for CIFAR10 and slightly better accuracy when compared to 2000 trees without SFG. For STL10,
we reach a fast performance saturation with 300 trees and similar accuracy compared with the 700 trees
saturation without SFG. For the ablation study on CFL and more details on SFG, results are shown in the
ablation study Table 4.8.
Table 4.8: Ablation Study for SFG and CFL modules
SFG CFL FLOPs Model Size Accuracy (%)
CIFAR10
17.73M 739.89K 75.29
✓ 11.86M 490.09K 76.07
✓ ✓ 12.82M 644.09K 77.50
STL10
8.16M 265.62K 60.59
✓ 4.14M 362.75K 60.45
✓ ✓ 4.44M 427.02K 62.07
MNIST 1.45M 40.93K 99.13
✓ 1.49M 57.82K 99.19
Fashion MNIST 2.61M 193.60K 91.02
✓ 2.69M 233.03K 91.37
4.5 Conclusion
We propose a novel efficient object recognition method. Compared to deep learning methods and traditional two-stage machine learning methods, our method achieves better accuracy on the four selected datasets with comparable FLOPs and model size when training data are plentiful. In data-insufficient settings, our proposed method achieves better inference accuracy with much smaller FLOPs and model size. With various SSL methods and a flexible framework design, including hyperparameter and module selection, our method attains adaptive model size and FLOPs, and offers a flexible trade-off between efficiency and effectiveness for different classification task requirements.
Chapter 5
Efficient SLM with Soft Partitioning for High-Dimensional Data Sources
5.1 Introduction
Feature extraction and decision making are two cascaded modules in the classical pattern recognition (PR) or machine learning (ML) paradigm. We consider this learning paradigm and focus on a novel learning pipeline design that includes specific modules for classification-oriented feature learning and decision making. Classification can be done in a single stage, such as with the support vector machine (SVM), or in multiple stages, such as with the decision tree (DT) and the multilayer perceptron (MLP). The recently proposed subspace learning machine (SLM) [38, 39] adopts the DT architecture. The difference between DT and SLM is that a single feature is used in DT, while multiple features are linearly combined to yield a new variable for decision making at each SLM node.
SLM can be viewed as a generalized version of DT. The linear combination of multiple features can be written as the inner product of a projection vector and a feature vector; when the projection vector is a one-hot vector, SLM reduces to DT. The effectiveness of SLM depends on the selection of good projection vectors. Two projection vector selection methods were studied in [39], namely probabilistic search and optimization-based search. Both SLM and DT apply a hard split to a feature using a threshold at each decision node. In this work, we propose a new SLM method that adopts soft partitioning, denoted SLM/SP.
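The relation between an SLM node and a DT node can be illustrated directly; `slm_split` is an illustrative helper, not code from the SLM implementation.

```python
import numpy as np

def slm_split(X, a, threshold):
    """SLM decision at a node: project each sample onto a learned vector
    `a` and compare with a threshold. With a one-hot `a`, this reduces to
    the axis-aligned split of a standard decision tree."""
    return X @ a <= threshold  # boolean mask: True -> left child

# Axis-aligned DT split on feature 2 is the one-hot special case:
X = np.array([[0.1, 0.5, 0.9],
              [0.4, 0.2, 0.3]])
one_hot = np.array([0.0, 0.0, 1.0])
mask_dt = slm_split(X, one_hot, 0.5)   # identical to X[:, 2] <= 0.5
```

A general projection vector combines all three features before thresholding, which is what gives SLM its extra discriminant power.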
SLM/SP adopts the soft decision tree (SDT) data structure, with a novel topology in which the inner nodes of the SDT perform data routing, the leaf nodes perform local decision making, and the edges between parent and child nodes perform representation learning. Specific modules are designed for the nodes and edges, respectively. Training an SLM/SP tree starts by learning an adaptive tree structure via local greedy exploration between subspace partitioning and feature subspace learning. The tree structure is finalized once the stopping criteria are met for all leaf nodes, and all module parameters are then updated globally. This methodology enables efficient training, high classification accuracy, and a small model size. Experimental results show that an SLM/SP tree offers a lightweight, high-performance classification solution.
The contributions of this work are summarized below.
• We generalize SLM from hard-partitioning to soft-partitioning and propose a new SLM/SP method.
• We demonstrate the effectiveness of SLM/SP in classification accuracy with several benchmarking
datasets.
• We show the efficiency of SLM/SP in its model size and computational complexity.
The rest of this chapter is organized as follows. Background information is reviewed in Sec. 5.2.
The SLM/SP method is proposed in Sec. 5.3. Experimental results are presented to show the excellent
tradeoff between effectiveness and efficiency of SLM/SP in Sec. 5.5. Finally, concluding remarks and future
extensions are given in Sec. 5.6.
5.2 Background Review
5.2.1 Soft Decision Tree
Figure 5.1: Comparison of the decision process between a hard decision tree and a soft decision tree.
Our SLM/SP method is inspired by research on the soft decision tree (SDT). The first SDT was introduced in [116]. The decision processes of a standard hard decision tree (HDT) and an SDT are compared in Fig. 5.1. Research on SDTs in the 1990s included [62, 116]. The work in [116] examined a specific scenario where axis-aligned features are used as the input and parent nodes are either static distributions over classes or linear functions. A similar idea, called the hierarchical mixture of experts (HME) [62], was introduced earlier; there, parent nodes are linear classifiers and the tree structure is fixed [62]. Several follow-up works were conducted in the last decade. A computationally efficient training method capable of directly optimizing hard partitioning through differentiation with stochastic gradient estimators was investigated in [84]. More contemporary SDTs, such as [59, 78], have incorporated MLPs or convolutional layers in parent nodes to enable more complex partitioning of the input space. Another direction is to combine nonlinear data transformations with DTs to enhance model performance. For example, the neural decision forest (NDF) [69] achieved state-of-the-art performance on ImageNet in 2015. A similar idea was developed in [126], where an MLP was used as the root transformer, optimized to minimize a differentiable information-gain loss. However, it is important to point out that the model architectures are predetermined and fixed in all of these methods; the choice of an effective architecture remains an open question.
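Soft-routing inference, common to the SDT variants above, can be sketched as follows, assuming a sigmoid gate at each inner node and plain dictionaries as a stand-in for the tree data structure; the helper names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sdt_predict(x, node):
    """Inference in a soft decision tree: every inner node routes the
    sample to BOTH children with probabilities p and 1-p given by a
    sigmoid gate, and the final prediction is the path-probability
    weighted mixture of the leaf class distributions.

    `node` is a dict: a leaf carries {'dist': class distribution};
    an inner node carries {'w': ..., 'b': ..., 'left': ..., 'right': ...}."""
    if 'dist' in node:
        return node['dist']
    p_left = sigmoid(node['w'] @ x + node['b'])
    return (p_left * sdt_predict(x, node['left'])
            + (1.0 - p_left) * sdt_predict(x, node['right']))
```

An HDT replaces the sigmoid gate with a hard comparison, so each sample follows exactly one root-to-leaf path.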
Architecture growth is a key facet of DTs [24] and is typically performed in a greedy manner with a stopping criterion based on validation-set error [60, 116]. Prior research on DTs has endeavored to enhance this greedy growth strategy. Decision jungles [111] utilize a training mechanism that merges partitioned input spaces between different sub-trees, thereby rectifying suboptimal splits caused by the locality of optimization. Irsoy et al. [60] introduced budding trees, which are incrementally grown and pruned based on global optimization of the existing nodes. Although our training algorithm, for the sake of simplicity, expands the architecture by greedily selecting the best option between deepening and splitting the input space, it is certainly receptive to these advancements. Recently, a novel approach called Adaptive Neural Trees (ANTs) [121] was proposed that unites the paradigms of deep neural networks and decision trees. ANTs incorporate representation learning into the edges, routing functions, and leaf nodes of a decision tree. This is achieved through a backpropagation-based training algorithm that adaptively grows the architecture from primitive modules, such as convolutional layers. The advantages of these neural tree models include lightweight inference via conditional computation, hierarchical separation of task-relevant features, and a mechanism to adapt the architecture to the size and complexity of the training dataset.
5.2.2 Automatic Feature Extraction
As described in Sec. 4.2, image classification methods can generally be categorized into deep learning (DL) methods and traditional machine learning methods.
Conventional ML techniques are composed of two sequential modules: 1) extraction of features from images, and 2) classification based on these extracted features. In contrast, DL techniques perform feature extraction and classification concurrently within a single module. The inception of DL was marked by successive convolutional neural network (CNN) models such as AlexNet [71], VGG [112], GoogLeNet [117], and ResNet [47], which have demonstrated remarkable accuracy. A significant advancement in this field is the Vision Transformer (ViT) model [32], which currently provides unparalleled performance on several image classification benchmark datasets. The success of DL techniques can be attributed to factors such as the availability of extensive training datasets, ample computational resources, end-to-end optimization, neural architecture search, and the adoption of transformers. However, DL techniques are hindered by their lack of interpretability, high computational expense, and complex model structures. In this research, our objective is to devise an interpretable, computationally efficient, and compact learning model by adhering to the conventional ML paradigm. Drawing on both DL methods and traditional methods, green learning (GL) was proposed to address several concerns associated with DL, such as the substantial carbon footprints produced by large DL networks in recent years. GL-based models are characterized by their low carbon footprints, small model sizes, low computational complexity, and logical transparency. GL comprises three sequential modules: 1) unsupervised representation learning, 2) supervised feature learning, and 3) supervised decision learning. Modules 1 and 2 collectively correspond to the "feature extraction" module in the classical PR paradigm, with a significant distinction: Modules 1 and 2 in GL are automated, whereas feature extraction in PR is conducted manually in an ad hoc manner.
In GL, automatic feature extraction is achieved by unsupervised representation learning such as the Saab transform [76]. The Saab transform is a joint spatial-spectral transform that decomposes a local patch into a DC (direct current) component and several AC (alternating current) components. The AC filters are derived via principal component analysis (PCA). Typically, multi-stage Saab transforms are applied; they have been used to build several GL-based image classification methods, including Pixelhop [17], Pixelhop++ [18], and IPHop [130]. The Saab coefficients at various stages offer new representations for patches of different receptive fields and serve as feature candidates. Note that Saab coefficients are not handcrafted, since they are obtained automatically by exploiting the correlation of pixels in a local neighborhood; moreover, no supervision labels are used in the derivation of the Saab transform. The dimension of the representation space is usually very large, so it is essential to reduce it using a supervised learning method. A supervised feature learning method, called the discriminant feature test (DFT), was proposed in [132]. The multi-stage Saab transforms and DFT serve as Modules 1 and 2 of a GL system, respectively.
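A single-stage Saab transform can be sketched as below. For simplicity this omits the bias term of the full Saab transform and keeps only the leading AC components; `saab_transform` is an illustrative name, and the inputs are assumed to be flattened local patches.

```python
import numpy as np

def saab_transform(patches, num_ac):
    """One-stage Saab transform sketch. Each row of `patches` is a
    flattened local patch. The DC filter is the normalized all-ones
    vector (patch mean direction); AC filters are the leading principal
    components of the DC-removed residuals. No class labels are used."""
    n, d = patches.shape
    dc_filter = np.ones(d) / np.sqrt(d)
    dc = patches @ dc_filter
    residual = patches - np.outer(dc, dc_filter)
    # PCA on residuals: eigenvectors of the covariance matrix.
    centered = residual - residual.mean(axis=0)
    cov = centered.T @ centered / n
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]            # descending variance
    ac_filters = eigvecs[:, order[:num_ac]].T    # top-`num_ac` AC filters
    ac = residual @ ac_filters.T
    return np.column_stack([dc, ac])             # DC + AC coefficients
```

Cascading such stages over shrinking spatial grids yields the multi-stage Saab features used as candidates for DFT.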
In DL, automatic feature extraction is accomplished by joint representation and decision learning methods such as CNNs and Vision Transformers (ViTs). CNNs perform a joint spatial-spectral transform that
decomposes an image into a set of feature maps. The filters in CNNs are learned based on the backpropagation algorithm. For CNNs, multi-layer CNNs are applied, leading to the development of several DL-based
image classification methods. The feature maps at various layers provide new representations for different
receptive fields and serve as feature candidates. These feature maps are not handcrafted but are learned
automatically by minimizing a loss function that measures the discrepancy between the prediction and the
ground truth label. For ViTs, the basic structure involves breaking down input images into a series of patches, which are tokenized before being fed to a standard Transformer architecture. An image is split into fixed-size patches, each of which is then linearly embedded. Position embeddings are added to these linear embeddings, and the resulting sequence of vectors is fed to a standard Transformer encoder. The attention mechanism in a ViT repeatedly transforms representation vectors of image patches, incorporating more and more semantic relations between image patches in an image. This is analogous to how, in Natural Language Processing (NLP), representation vectors flowing through a Transformer incorporate more and more semantic relations between words, from syntax to semantics. The dimension of the representation space can be very large, necessitating efficient model and learning diagram designs.
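The patch-tokenization step described above can be sketched in a few lines. All dimensions and weights below are illustrative placeholders; in a real ViT, the embedding matrix and position embeddings are learned end-to-end:

```python
import numpy as np

rng = np.random.default_rng(1)

# A toy 32x32 RGB image split into fixed-size 8x8 patches.
image = rng.normal(size=(32, 32, 3))
P, D = 8, 64                               # patch size, embedding dimension

patches = [image[i:i + P, j:j + P].reshape(-1)   # flatten each 8x8x3 patch
           for i in range(0, 32, P)
           for j in range(0, 32, P)]
tokens = np.stack(patches)                  # (16, 192) sequence of patch vectors

W = rng.normal(size=(tokens.shape[1], D))   # linear patch embedding (learned in practice)
pos = rng.normal(size=(len(tokens), D))     # position embeddings (also learned)
embedded = tokens @ W + pos                 # input sequence to the Transformer encoder
```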
5.2.3 Efficient Bottleneck Design for Local Feature Learning
Contemporary deep neural networks are predominantly constructed by stacking building blocks. These
blocks are typically designed based on either the traditional residual block with a bottleneck structure [47]
or the inverted residual block [109]. In this section, we classify all related networks based on these two
types of building blocks and provide a brief description of each.
Classic residual bottleneck blocks The bottleneck structure was initially introduced in ResNet [47].
A typical bottleneck structure comprises three convolutional layers: a 1 × 1 convolution for channel reduction, a 3 × 3 convolution for spatial feature extraction, and another 1 × 1 convolution for channel expansion.
A residual network is typically constructed by stacking a series of such residual blocks. The bottleneck
structure was further refined in subsequent works by expanding the channels in each convolutional layer
[135], applying group convolutions to the middle bottleneck convolution to aggregate richer feature representations [127], or incorporating attention-based modules to explicitly model interdependencies between
channels [52, 86]. Other works [20, 122] have combined residual blocks with dense connections to enhance
performance. Despite its success in heavy-weight network design, it is seldom used in light-weight networks due to model complexity. Our research demonstrates that by appropriately adjusting the residual
block, this type of traditional bottleneck structure can also be suitable for light-weight networks and can
achieve state-of-the-art results.
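A quick parameter count illustrates why the 1 × 1 reduce / 3 × 3 / 1 × 1 expand pattern is economical. The numbers below are illustrative only (256 channels with a 4× reduction, bias and BN parameters omitted):

```python
# Parameters of a classic ResNet-style bottleneck with C=256 channels and a
# 4x channel reduction, versus two plain 3x3 convolutions at full width.
C, mid = 256, 64
bottleneck = C * mid + mid * mid * 3 * 3 + mid * C   # 1x1 reduce, 3x3, 1x1 expand
plain = 2 * C * C * 3 * 3                            # two full-width 3x3 convs
print(bottleneck, plain)   # 69632 vs 1179648, roughly a 17x saving
```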
Inverted residual blocks The inverted residual block, first introduced in MobileNetV2 [109], inverts
the concept of the traditional bottleneck structure and connects shortcuts between linear bottlenecks. This
significantly enhances performance and optimizes model complexity compared to the classic MobileNet
[51], which consists of a sequence of 3 × 3 depthwise separable convolutions. Due to its high efficiency,
the inverted residual block has been widely adopted in subsequent mobile network architectures. ShuffleNetV2 [137] incorporates a channel split module before the inverted residual block and adds another
channel shuffle module after it. In HBONet [22], down-sampling operations are introduced into inverted
residual blocks to model richer spatial information. MobileNetV3 [16] proposes to search for optimal activation functions and the expansion rate of inverted residual blocks at each stage. More recently, MixNet
[120] proposes to search for optimal kernel sizes of the depthwise separable convolutions in the inverted
residual block. EfficientNet [119] also utilizes the inverted residual block but employs a scaling method to
control the network weight in terms of input resolution, network depth, and network width. Unlike all
the aforementioned approaches, our work enhances the standard bottleneck structure and demonstrates
the superiority of our building block over the inverted residual block in mobile settings.
Sandglass blocks In response to the limitations of the inverted residual block, a novel design called
the Sandglass block has been proposed in [138]. The design principle of the Sandglass block is primarily
based on two insights: (i) to retain more information from the lower layers when transitioning to the upper layers, and to facilitate gradient propagation across layers, shortcuts should be positioned to connect high-dimensional representations; (ii) depthwise convolutions with a small kernel size (e.g., 3 × 3) are lightweight, so a couple of depthwise convolutions can be applied to higher-dimensional features to encode richer spatial information. The Sandglass block rethinks the positions of the expansion and reduction layers. To ensure that shortcuts connect high-dimensional representations, the order of the two pointwise convolutions is reversed, keeping the bottleneck in the middle of the residual path. Instead of placing the shortcut between bottlenecks, the Sandglass block thus places shortcuts between higher-dimensional representations, allowing the shortcut connection to link representations with a large number of channels rather than bottleneck ones. This 'wider' shortcut delivers more information from the input to the output than the inverted residual block does and allows more gradients to propagate across multiple layers. Pointwise convolutions can encode inter-channel information but fail to capture spatial information; in the Sandglass block, depthwise spatial convolutions are therefore adopted, following previous mobile networks, to encode spatial information, whereas the inverted residual block adds its depthwise convolution between the pointwise convolutions to learn expressive spatial context information.
Structural Reparameterization Structural reparameterization, first proposed in RepVGG [30], decouples the training-time and inference-time architectures. This technique ensures model flexibility while reducing computational effort, enhancing the model's feature extraction ability and speeding up computation. It also decreases the memory requirement, enabling the remote deployment of models for industrial applications. The general pipeline of reparameterization involves first training a network with a multi-branch topology and then merging the branches into standard 3 × 3 convolutions for efficient inference. This approach not only enhances the inference speed of the model but also reduces memory requirements. In RepVGG, this technique is used to introduce residual branches and 1 × 1 convolutional branches into the original VGG architecture, with the intention of enabling subsequent structural reparameterization into a single pathway. The RepVGG model used in training comprises three pathways: conventional 3 × 3 convolutions, 1 × 1 convolutions, and an identity pathway, each accompanied by batch normalization (BN) layers. During inference, these three pathways are merged into a unified 3 × 3 convolution unit. Observing the convolution processes of the 1 × 1 and 3 × 3 convolutions, both follow the same path starting from a given position in the input. Therefore, to combine the 3 × 3 and 1 × 1 convolutions, padding is applied to the 1 × 1 kernel to match the shape of the 3 × 3 kernel, followed by an element-wise addition with the 3 × 3 convolution kernel. This innovative approach to structural reparameterization demonstrates that, by reasonably adjusting the residual block, this kind of classic bottleneck structure is also suitable for light-weight networks and can yield state-of-the-art results.
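The branch-merging step described above can be verified numerically: pad the 1 × 1 kernel to 3 × 3, represent the identity branch as a 3 × 3 kernel with a single centered one, and sum the three kernels. The sketch below is single-channel and omits the BN folding that RepVGG also performs:

```python
import numpy as np

def conv3x3(x, k):
    """Single-channel 3x3 'same' convolution (zero padding, stride 1)."""
    xp = np.pad(x, 1)
    h, w = x.shape
    out = np.zeros_like(x)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(xp[i:i + 3, j:j + 3] * k)
    return out

rng = np.random.default_rng(2)
x = rng.normal(size=(6, 6))
k3 = rng.normal(size=(3, 3))            # 3x3 branch
k1 = rng.normal(size=(1, 1))            # 1x1 branch
# Training-time output: sum of the 3x3, 1x1, and identity branches.
multi_branch = conv3x3(x, k3) + x * k1[0, 0] + x
# Inference-time kernel: pad 1x1 to 3x3, add an identity kernel, then sum.
identity_k = np.zeros((3, 3)); identity_k[1, 1] = 1.0
merged = k3 + np.pad(k1, 1) + identity_k
single_branch = conv3x3(x, merged)
```

Because convolution is linear in the kernel, the merged single-pathway output matches the three-branch training-time output exactly.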
Model compression and neural architecture search Model compression algorithms are effective in
eliminating superfluous parameters in neural networks, such as network pruning [11, 46, 92, 103], quantization [21, 58], factorization [61, 139], and knowledge distillation [48]. Despite the efficiency of these methods, the performance of the compressed networks is still heavily influenced by the architectures of
the original networks. Therefore, it is crucial to design more efficient network architectures to produce
efficient models. Neural architecture search accomplishes this by automatically searching for efficient
network architectures [10, 45, 118]. However, the search space necessitates human expertise, and the performance of the searched networks is largely contingent upon the designed search space as highlighted in
[31, 134]. In this chapter, we demonstrate that our proposed building block complements existing search
space design principles and can further enhance the performance of searched networks if incorporated
into existing search spaces.
5.3 SLM Tree with Soft Partition
5.3.1 Overview of SLM/SP
For SLM with Soft Partition (SLM/SP), we follow the settings in Chapter 2 and consider a K-class classification problem. The input feature space X contains N samples, where each sample has a D-dimensional
feature vector. A sample is denoted by
x_n = (x_{n,1}, · · · , x_{n,d}, · · · , x_{n,D})^T ∈ R^D,  n = 1, · · · , N.  (5.1)
The partitioning of the feature space in SLM can be expressed mathematically in the form

a^T x + b = 0,  (5.2)

where b is the bias and

a = (a_1, · · · , a_d, · · · , a_D)^T,  ||a|| = 1,  (5.3)
is the unit vector that points to the surface normal direction. It is also called the projection vector. Then,
the full space, S, is split into two half subspaces:
S_+ : a^T x ≥ −b,  and  S_− : a^T x < −b.  (5.4)
The above process can be conducted recursively to lead to a binary decision-tree that offers a hierarchical
partition of the feature space. One challenge in Eq. (5.4) lies in finding a good projection vector a at
each intermediate (or called inner) node so that samples of different classes are better separated. This is
related to the distribution of samples of different classes at the node. The ultimate objective is to lower the
weighted entropy of all leaf nodes.
The split at the root node yields two child nodes, denoted by S_+ and S_−. With hard partitioning,
an input sample is assigned to one of the two. With soft partitioning, its assignment is a probabilistic one.
For linear soft partitioning, the probabilities of going to S+ and S− are
p_+(x) = σ(a^T x + b),  and  p_−(x) = 1 − σ(a^T x + b),  (5.5)

where σ is the sigmoid (logistic) function. The dimension of x determines the complexity of
each soft partitioning. The soft partitioning in Eq. (5.5) can be generalized to any differentiable linear or nonlinear function. For example, to achieve higher modeling capability, the linear function a^T x + b in Eq. (5.5) can be replaced by a simple MLP with one hidden layer. With nonlinear activation functions such as ReLU or Leaky ReLU, an MLP can be trained via back-propagation. In this work, we use a single-hidden-layer MLP as a pre-processing unit for soft partitioning. Then, the probability of going to S_+ can be written as

p_+(x) = σ(a′^T ReLU(a^T x + b) + b′).  (5.6)
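Eqs. (5.5) and (5.6) can be sketched directly. The dimensions and parameter values below are hypothetical, and the one-hidden-layer MLP's weights are written as a matrix A for concreteness; in SLM/SP all of these parameters are learned:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_partition_linear(x, a, b):
    """Eq. (5.5): probabilities of routing x to S+ and S-."""
    p_plus = sigmoid(a @ x + b)
    return p_plus, 1.0 - p_plus

def soft_partition_mlp(x, A, b, a2, b2):
    """Eq. (5.6): one-hidden-layer MLP preprocessing before the sigmoid."""
    hidden = np.maximum(A @ x + b, 0.0)      # ReLU
    return sigmoid(a2 @ hidden + b2)

rng = np.random.default_rng(3)
x = rng.normal(size=5)
a = rng.normal(size=5); a /= np.linalg.norm(a)   # unit projection vector
p_plus, p_minus = soft_partition_linear(x, a, 0.1)
# Cascaded multiplication: probability of reaching the (+,+) grandchild
# (the child node uses its own, here made-up, parameters).
p_pp = p_plus * soft_partition_linear(x, a, -0.2)[0]
# MLP variant with a 3-unit hidden layer.
A = rng.normal(size=(3, 5)); b_h = rng.normal(size=3); a2 = rng.normal(size=3)
p_mlp = soft_partition_mlp(x, A, b_h, a2, 0.0)
```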
The same soft partitioning process can be repeated at child nodes recursively. For example, we conduct
soft partitioning on S_+ so that the probability for an input to arrive at the (+,+)-grandchild node, p_{+,+}(x), is the cascaded multiplication of the p_+(x) values at the inner nodes along the path. It is straightforward to obtain the mathematical expressions for p_{+,+}(x), p_{+,−}(x), p_{−,+}(x), and p_{−,−}(x). The combination of SLM and soft partitioning leads
to an SLM/SP tree. SLM/SP learns the parameters in the training stage. After training, SLM/SP employs
them to assign an input sample to one of a set of partitioned subspaces with a path probability. The generalization of the SLM/SP process to the tree topology, the probabilistic inference of the SLM/SP trees, and
the application of the SLM/SP tree to image classification are discussed in Sec. 5.3.2, Sec. 5.3.3, and Sec.
5.4 respectively.
5.3.2 Design of SLM/SP Tree
In this section, we formalize the definition of SLM/SP trees, including the topology of the SLM/SP trees,
the determination of the tree structure, and the general formulation of the parameter learning. In general,
the topology of the SLM/SP tree takes the form of a decision tree (DT) enhanced with soft partitioning, and the aim of the SLM/SP tree is to learn the conditional distribution p(y|x) from a set of N labelled samples, denoted in Eq. (5.1), as training
data.
The design of an SLM/SP tree consists of two main choices: 1) determining the hierarchical tree structure, and 2) determining the optimal parameters (i.e., a_i and b_i) at inner nodes. In this section, we first introduce
the topology of the SLM/SP tree, and respective modules and operations corresponding to the topology.
Then the two design choices above are elaborated.
5.3.2.1 Topology and Operations
The overview of the SLM/SP tree is shown in Fig. 5.2. In short, the SLM/SP tree is characterized by a
set of hierarchical partitions of input space X, a series of transformations, and separate predictive models
Figure 5.2: The overview of an example SLM/SP tree.
in the respective component regions. For the topology of the model, we restrict it to be a binary tree, defined as a graph in which every node is either an inner node or a leaf node and is the child of exactly one parent node, except for the root node at the top. We define the topology T of the tree as T := {N, E}, in which N is the set of all nodes and E is the set of edges between them. Nodes with no children are leaf nodes N_leaf, and all the other nodes are inner nodes N_inner. Every inner node i ∈ N_inner has two child nodes, represented as left(i) and right(i). In addition, E contains an edge that connects the input space X to the root node, as shown in Fig. 5.2.
For each node and edge, operations are assigned which then act upon the allocated data samples as
illustrated in Fig. 5.2. The process begins at the root, where each sample undergoes transformation and
navigates through the tree according to the assigned operation. An SLM/SP tree is constructed based on
three fundamental modules of differentiable operations.
Inner Node, I: each inner node i ∈ N_inner is assigned an inner module I_i^θ : X_i → [0, 1], parameterized with θ, in which X_i denotes the representation at node i. Each inner node routes the samples from the incoming edge to either the left or the right child; we use the soft partitioning described in Sec. 5.3.1, where the decision output of the node denotes the probability of routing to the left child.

Edge, E: each edge e ∈ E is assigned one or multiple edge module(s). Each edge module E_e^ψ, parameterized with ψ, transforms samples from the previous module and passes them to the next one.

Leaf Node, L: each leaf node l ∈ N_leaf is assigned a leaf module L_l^φ, parameterized with φ, which estimates the conditional distribution p(y|x). For classification tasks with K classes, for example, we can define L_l^φ as a linear classifier on the transformed feature space X_l that outputs the distribution over classes q_k^l. Each leaf node l has a K-dimensional output vector q^l, whose k-th element is equal to

q_k^l = exp(φ_k^l) / Σ_{k′=1}^{K} exp(φ_{k′}^l),  k = 1, · · · , K,  (5.7)

where q_k^l is the probability of class k at the l-th leaf.
From the input space X, data is passed through edges, inner nodes, and leaf nodes. For example, in Fig. 5.2, to reach the distribution q^l at leaf node l = 1, input X undergoes a series of transformations X → X^{ψ_0} := E^{ψ_0}(X) → X^{ψ_1} := E^{ψ_1}(X^{ψ_0}) → X^{ψ_2} := E^{ψ_2}(X^{ψ_1}), and the leaf module L_1 yields the predicted distribution q^l = p^{φ,ψ}(y) := L_1^φ(X^{ψ_2}). The probability of selecting this path is given by (1 − I_0^θ(X^{ψ_0})) · I_1^θ(X^{ψ_1}), which is the cascaded multiplication of the probability of inner node module I_0^θ routing right and inner node module I_1^θ routing left.
Figure 5.3: The hierarchical determination of the SLM/SP tree.
The specific operations assigned to the SLM/SP tree topology T can be generalized to any differentiable functions, which accounts for the generalization capability of the SLM/SP tree. For example, when the identity transformation is assigned to the edge modules E, the topology of the SLM/SP tree reduces to a standard binary soft decision tree (SDT). Operations commonly used in CNNs and ViTs can also be assigned to the SLM/SP tree to improve the effectiveness of the model. For high-dimensional data sources such as image classification, more details about operation assignments are discussed in Sec. 5.4.
5.3.2.2 Hierarchical Tree Determination
The tree architecture growth process is illustrated in Fig. 5.3. We use a greedy search to find parameters
at each node l to decide whether there is a gain in reducing the loss function by splitting the node, or
extending the node l with an edge. We employ a loss function that minimizes the cross-entropy at each
leaf weighted by its path probability and the target distribution. Mathematically, the loss at the ℓ-th leaf node for sample x_n can be written as

L_ℓ(x_n) = −P^ℓ(x_n) Σ_{k=1}^{K} T_k^ℓ log q_k^ℓ,  (5.8)
where P^ℓ(x_n) is the probability for input x_n to arrive at leaf node ℓ, q_k^ℓ is the probability of class k at leaf node ℓ, and T_k^ℓ is the target distribution of class k at node ℓ. The target distribution T_k^ℓ is obtained by putting all training samples through the tree and finding the statistics of all classes at node ℓ. The loss function evaluates the discrepancy between the predicted distribution q_k^ℓ and the target distribution T_k^ℓ, taking the path probability into account.
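The path-probability-weighted cross-entropy described above can be computed as follows. The numbers are a toy example only; in SLM/SP, T^ℓ holds the empirical class statistics at each leaf and P^ℓ(x_n) comes from the cascaded routing probabilities:

```python
import numpy as np

def slm_sp_loss(path_probs, leaf_dists, target_dists):
    """Cross-entropy at each leaf, weighted by each sample's path probability.

    path_probs:   (n_samples, n_leaves) P_l(x_n), rows sum to 1
    leaf_dists:   (n_leaves, n_classes) predicted q_k^l per leaf
    target_dists: (n_leaves, n_classes) target T_k^l per leaf
    """
    ce_per_leaf = -np.sum(target_dists * np.log(leaf_dists), axis=1)  # (n_leaves,)
    return float(np.sum(path_probs @ ce_per_leaf))

# Toy setting: two leaves, three classes, two samples.
P = np.array([[0.9, 0.1], [0.2, 0.8]])
q = np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])
T = np.array([[0.8, 0.1, 0.1], [0.05, 0.05, 0.9]])
loss = slm_sp_loss(P, q, T)
```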
Starting from the root node, the tree growth process visits one leaf node at a time in breadth-first order and incrementally modifies the hierarchy of the SLM/SP tree. More specifically, as illustrated in Fig. 5.3, we evaluate three cases when processing each leaf node: 1) extend the node, 2) split the node into two child nodes, or 3) keep the node as is. We use the validation dataset to evaluate the effectiveness of the three cases with the loss in Eq. (5.8). That is, we fix the previously optimized parameters, optimize the parameters of the newly added modules, compare the validation-loss improvements of 1) extend and 2) split, and greedily select the case with the lowest validation loss. The intuition behind evaluating the three cases is to utilize the local sample distribution and greedily explore the more effective option between soft feature-space partitioning and richer local representation learning. We stop splitting a particular leaf node when there is no further improvement in the validation loss; this criterion is applied greedily during the tree growth process. We also use a maximum tree depth and a minimum sample number per node as stopping criteria.
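The greedy choice made at each visited leaf can be summarized by a small helper. This is a schematic sketch only: in practice the three loss values come from optimizing the newly added modules and evaluating Eq. (5.8) on the validation set, and the function name and thresholds are ours:

```python
def grow_step(keep_loss, extend_loss, split_loss,
              depth, max_depth, n_samples, min_samples):
    """Greedily choose among keeping, extending, or splitting a leaf.

    Returns 'keep', 'extend', or 'split'; the stopping criteria
    (maximum depth, minimum samples per node) force 'keep'.
    """
    if depth >= max_depth or n_samples < min_samples:
        return 'keep'
    best = min(keep_loss, extend_loss, split_loss)
    if best == keep_loss:          # no validation-loss improvement: stop
        return 'keep'
    return 'extend' if extend_loss <= split_loss else 'split'

choice = grow_step(keep_loss=0.52, extend_loss=0.47, split_loss=0.44,
                   depth=3, max_depth=8, n_samples=500, min_samples=50)
```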
5.3.2.3 Module Parameters Determination
After the tree structure is finalized, we apply global optimization to update the projection vector and the
bias at inner nodes for further performance improvement. The total loss function from all leaf nodes
can be written as

L = Σ_{n=1}^{N} L(x_n) = Σ_{n=1}^{N} Σ_{ℓ∈LeafNodes} L_ℓ(x_n) = − Σ_{n=1}^{N} Σ_{ℓ∈LeafNodes} P^ℓ(x_n) Σ_k T_k^ℓ log q_k^ℓ.  (5.9)
Since L as well as all modules and operations assigned to the nodes and edges of the SLM/SP tree are
differentiable, we use the mini-batch stochastic gradient descent (SGD) method to determine parameters
and minimize L. The global optimization of the module parameters can correct sub-optimal decisions made during the local optimization of the hierarchical tree-structure growth, and it empirically improves the generalization error.
5.3.3 Probabilistic Inference with SLM/SP Tree
The SLM/SP tree models the conditional distribution p(y|x) through root-to-leaf paths in the tree, with the final distribution determined by the leaf nodes. Each input x_n to the SLM/SP tree stochastically traverses the tree based on the decisions of the inner node modules and undergoes a sequence of transformations by the edge modules, until it reaches a leaf node where the corresponding module predicts the label ŷ.
Inference with SLM/SP can be implemented in two schemes based on a trade-off between accuracy
and computation: 1) full-path inference and 2) single-path inference.
5.3.3.1 Full-Path Inference
The full-path inference calculates the probabilistic distributions over all leaves. The predicted class of a
single test sample x_n is given by

ŷ(x_n) = arg max_k Σ_{ℓ=1}^{L} P^ℓ(x_n) q_k^ℓ(x_n).  (5.10)
To get the global estimate ŷ(x_n), we need to traverse the full SLM/SP tree to compute the probability P^ℓ(x_n) for input x_n to arrive at each leaf node ℓ, as well as the local probabilistic distribution q^ℓ(x_n) over the K classes. As the depth of the SLM/SP tree increases, the time and memory complexity grows exponentially and may become too high to be practical if the total number of test samples is large.
5.3.3.2 Single-Path Inference
For a larger test sample size, we must simplify the multi-path inference for higher memory and time efficiency at the expense of lower accuracy. The simplified scheme adopted in our experiments is single-path inference. That is, it greedily traverses the tree in the direction of the highest confidence at each inner node to reach a leaf node ℓ, and then predicts the class based on the maximum likelihood at that node; namely,

ŷ(x_n) = arg max_k q_k^ℓ,  (5.11)

where q_k^ℓ is the probability at leaf node ℓ for class k learned from the training samples.
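Both inference schemes can be illustrated on a toy depth-two tree. The routing functions and leaf distributions below are random stand-ins for the trained inner-node and leaf modules:

```python
import numpy as np

def route_probs(x, nodes):
    """p(go left) at each inner node for input x (stand-in for I modules)."""
    return [1.0 / (1.0 + np.exp(-(a @ x + b))) for a, b in nodes]

rng = np.random.default_rng(4)
x = rng.normal(size=4)
inner = [(rng.normal(size=4), 0.0) for _ in range(3)]   # root, left, right
q = rng.dirichlet(np.ones(3), size=4)                   # 4 leaf distributions, 3 classes

p0, p1, p2 = route_probs(x, inner)
# Full-path inference, cf. Eq. (5.10): mix all leaves by path probability.
path = np.array([p0 * p1, p0 * (1 - p1), (1 - p0) * p2, (1 - p0) * (1 - p2)])
y_full = int(np.argmax(path @ q))
# Single-path inference, cf. Eq. (5.11): follow the most confident branch.
leaf = (0 if p1 >= 0.5 else 1) if p0 >= 0.5 else (2 if p2 >= 0.5 else 3)
y_single = int(np.argmax(q[leaf]))
```

Note that the four path probabilities always sum to one, so full-path inference yields a proper mixture distribution over the classes.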
5.4 Image Classification with SLM/SP
The image classification framework with SLM/SP tree is illustrated in Fig. 5.4. Following the efficient feature learning in Chapter 4, we utilize the SSL feature as input feature space X for the proposed SLM/SP tree
image classification framework. For efficient SLM/SP design, we propose an efficient local representation learning module for the edges E in the SLM/SP tree topology T.

Figure 5.4: Image classification framework with the SLM/SP tree.

To evaluate both the decision learning capability and the full effectiveness of SLM/SP, different module designs are adopted for the SLM/SP tree. The efficient learning module and the SLM/SP module designs are elaborated in the following subsections.
5.4.1 Design of Efficient Feature Learning
In this section, we focus on automatic efficient feature learning. The overall learning diagram of the
SLM/SP image classification framework is a hybrid of unsupervised learning and supervised learning. For
the feature learning specifically, we have the flexibility of using the SSL feature, obtained with unsupervised feature extraction and supervised feature selection, as the global feature, and using supervised efficient local feature learning via back-propagation.
5.4.1.1 Preliminaries
For the supervised efficient local feature learning module, preliminaries for efficient and effective feature
learning designs, including depthwise separable convolutions, squeeze-and-excitation blocks, and linear feature subspace learning, are introduced below.
Depthwise separable convolutions To enhance computational efficiency, depthwise separable convolutions are utilized as a replacement for standard convolutions. As outlined in [51], a convolution with
a weight tensor of dimensions k × k × M × N (where k × k represents the kernel size, and M and N
denote the number of input and output channels, respectively) can be decomposed into two separate convolutions. The first is a depthwise, or channel-wise, convolution with an M-channel k × k kernel, which
independently learns spatial correlations within each channel. The second is a pointwise convolution that
combines channels to generate new features. Since the combination of a pointwise convolution and a k ×
k depthwise convolution significantly reduces the number of parameters and computations, incorporating
depthwise separable convolutions into basic building blocks can drastically cut down on parameters and
computational cost. Our proposed SLAB employs these separable convolutions.
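The savings follow directly from the parameter counts stated above (k × k × M × N for the standard convolution versus k × k × M + M × N for the factorized pair; biases omitted, illustrative sizes):

```python
# Standard k x k convolution vs depthwise separable factorization.
k, M, N = 3, 128, 256
standard = k * k * M * N            # one dense spatial-spectral kernel
separable = k * k * M + M * N       # depthwise k x k + pointwise 1 x 1
print(standard, separable)          # 294912 vs 33920, about an 8.7x reduction
```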
Squeeze-and-excitation block As introduced in [52], the squeeze-and-excitation (SE) block is designed to enhance the representational power of the feature map by enabling dynamic channel-wise feature recalibration. The SE block takes a convolutional feature map as input and models the interdependencies between the channels of the convolutional features. Each channel in the input feature map is squeezed into a single numeric value using global
average pooling, yielding a 1D tensor feature map. Then two pointwise convolution layers are applied to the feature map: the first layer performs a dimension reduction of the input, followed by a ReLU activation
function. The second layer performs a dimension increase, followed by a sigmoid activation function. The
output of the second layer is used to perform a channel-wise multiplication with the original input feature
map. This process allows the network to adaptively adjust the importance of each feature map, thereby
enhancing its representational power. The SE block has been shown to significantly improve performance
with negligible additional computational cost.
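The recalibration just described amounts to only a few operations. Below is a minimal sketch with random weights; the reduction ratio r = 4 is an arbitrary choice for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_block(feat, W1, W2):
    """Squeeze-and-excitation on an (H, W, C) feature map."""
    z = feat.mean(axis=(0, 1))            # squeeze: global average pooling -> (C,)
    s = np.maximum(W1 @ z, 0.0)           # excitation: dimension reduction + ReLU
    gate = sigmoid(W2 @ s)                # dimension increase + sigmoid -> (C,) weights
    return feat * gate                    # channel-wise recalibration

rng = np.random.default_rng(5)
C, r = 16, 4
feat = rng.normal(size=(8, 8, C))
W1 = rng.normal(size=(C // r, C))         # reduction layer weights
W2 = rng.normal(size=(C, C // r))         # expansion layer weights
out = se_block(feat, W1, W2)
```

Since each gate value lies in (0, 1), the block can only attenuate channels, adaptively down-weighting the less informative ones.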
Linear feature subspace learning For a given set of real images as input, the set of features in the
convolutional layers forms a manifold of interest.

Figure 5.5: Overview of the SLAB. The low-dimensional manifold-of-interest feature space is of dimension H × W × C; the expanded high-dimensional feature space for dense spatial and spectral local representation learning is of dimension H × W × C′.

It has been widely assumed that these manifolds of interest in neural networks could be embedded in low-dimensional subspaces. This implies that when
we examine all individual pixels of a deep convolutional layer, the information encoded in those values
actually resides in some manifold, which can be embedded into a low-dimensional subspace. This fact can
be leveraged by simply reducing the dimension of a layer, thus reducing the dimension of the operating
space. This approach has been effectively utilized in [51], and has been incorporated into efficient model
designs of other networks as well. With the standard design of a linear convolution followed by a nonlinear activation in the dimension-reduction embedding, applying an activation function such as ReLU to a certain channel inevitably loses information in that channel. However, if we have numerous channels and there is structure in the activation manifold, the lost information may be redundant and still preserved in other channels.
5.4.1.2 Subspace Learning enhAnced Block (SLAB) for Efficient Local Representation Learning
Inspired by the designs of depthwise separable convolutions, squeeze-and-excitation blocks, and linear feature subspace learning, we propose a novel feature learning block named the Subspace Learning enhAnced Block (SLAB). SLAB utilizes the depthwise separable convolution for efficient convolution design, the squeeze-and-excitation block for enhancing the local representational power of the feature map, and linear feature subspace learning for preserving the information on the manifold of interest.
The overview of the SLAB is illustrated in Fig. 5.5, where we use the relative sizes of the cubic blocks to represent the spectral-dimension differences between feature maps. For the convolution layer design in the SLAB, we utilize the pointwise convolution for feature channel adaptation and the depthwise convolution for channel-wise spatial information learning. In particular, given the input low-dimensional manifold-of-interest feature map, we first apply a pointwise convolution to expand the spectral dimension. As discussed in the preliminaries, the channel expansion introduces redundant information into the high-dimensional feature map, so applying the nonlinear ReLU activation learns an effective nonlinear local representation without losing much information from the manifold of interest. Next, a depthwise convolution follows the pointwise convolution to enable expressive dense spatial feature learning on the high-dimensional feature map. Finally, the high-dimensional feature is embedded into a low-dimensional subspace as the transformed manifold of interest; a linear transformation is applied during the feature dimension reduction to preserve the information in the manifold of interest.
To enhance the local representational power, the SE block is adopted in the proposed SLAB design. As discussed in the preliminaries, the SE block improves performance with negligible additional computational cost. To maximize this benefit, we attach the SE block to the high-dimensional feature map in the SLAB, after the dense spatial feature learning step with the depthwise convolution. For the residual learning setting, the SLAB utilizes the learned low-dimensional subspace embedding for the residual connection, i.e., the skip connection links the input and output subspaces, as shown in Fig. 5.5.
5.4.2 SLM/SP Module Design
As illustrated in Fig. 5.4, we utilize the SSL feature as the input X for the SLM/SP tree, and with the topology T of the SLM/SP tree, we have flexible designs of the tree by adopting different modules for the nodes N and edges E. To evaluate the efficiency and effectiveness of the SLM/SP tree, we propose two learning roles for it in the image classification framework, i.e., 1) supervised decision learning and 2) supervised local representation learning. The two choices are elaborated in the following subsections.
5.4.2.1 SLM/SP for Supervised Decision Learning
To evaluate the decision learning capability of the SLM/SP tree, we adopt the identity function for the edges E of the tree in the image classification framework. The topology of the SLM/SP tree is then reduced to the standard binary SDT graph, without edge modules as connections between pairs of parent and child nodes. Under this restriction, the SLM/SP tree nodes process the fixed global feature learned with SSL, and the goal of the tree is to model the conditional distribution p(y|X) by soft subspace partitioning with the inner nodes I and local conditional distribution estimation with the leaf nodes L.
To achieve effective soft feature space partitioning, we adopt an MLP with two hidden layers for the inner nodes I. For the local conditional distribution estimation, we adopt linear classifiers for the leaf nodes L. We adopt the same designs of the SLM/SP tree as described in Sec. 5.3.1, including the tree growth and inference strategies as well as the stopping criteria. For the SSL feature learning framework, the same settings as described in Chapter 4 are adopted, i.e., the SSL feature serving as the SLM/SP tree input X has a 1D spatial dimension. This setting allows a decision-learning-capability comparison with other classification methods such as SVM and XGBoost. In Sec. 5.5, SLM/SP serves as an efficient and effective supervised decision learning module.
5.4.2.2 SLM/SP with Supervised Local Representation Learning
To evaluate the full capability of the SLM/SP tree for image classification, we adopt the SLAB proposed in
Sec. 5.4.1 for the edges E of the tree in the image classification framework. The SLAB serves as an effective local representation learning module and can be stacked on the edges of the SLM/SP tree via the extension strategy in tree growth. We keep the channel number of the manifold-of-interest subspace feature as a hyper-parameter of the SLM/SP tree design; the feature maps represent the samples from more global to more local as one goes deeper into the partitioned subspaces.
Similar to the SLM/SP for decision learning, the inner nodes I in the SLM/SP tree serve as soft feature space partitioning modules. With local representation learning handled by SLAB, we adopt a much more efficient design for the inner node module, i.e., a global average pooling layer applied to the input feature map followed by a one-hidden-layer MLP for routing the samples. For the leaf node modules, to fit with the SLAB edge modules, we also adopt a global average pooling layer prior to the linear classifier.
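A minimal sketch of this inner/leaf node design, assuming a C x H x W feature map, a hidden dimension of 0.5x the input channel count (as used later in Sec. 5.5.2), and illustrative random weights:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

C_in, H, W, n_cls = 64, 8, 8, 10  # hypothetical feature-map shape

def inner_node(fmap, W1, w2):
    """GAP over space, then a one-hidden-layer MLP emits P(right | x)."""
    v = fmap.mean(axis=(1, 2))  # global average pooling -> (C_in,)
    return sigmoid(w2 @ np.tanh(W1 @ v))

def leaf_node(fmap, Wc):
    """GAP followed by a linear classifier."""
    v = fmap.mean(axis=(1, 2))
    return softmax(Wc @ v)

W1 = rng.normal(0, 0.1, (C_in // 2, C_in))  # hidden dim = 0.5x input dim
w2 = rng.normal(0, 0.1, C_in // 2)
Wc = rng.normal(0, 0.1, (n_cls, C_in))

fmap = rng.normal(size=(C_in, H, W))
p_right = inner_node(fmap, W1, w2)
dist = leaf_node(fmap, Wc)
```

The pooling step is what makes this design cheap: the router and the classifier both operate on a C_in-dimensional vector regardless of the spatial resolution of the feature map.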
For the SSL feature that serves as the global representation, with the SLAB as the main representation learning module, we preserve the spatial dimensions of the input image during SSL, which differs from the SLM/SP for decision learning only. The SSL module in this design serves the role of introducing rich decorrelated spectral information via the Saab transform. As shown in Sec. 5.5, the SLM/SP with SLAB for local representation learning offers a tremendous improvement over the decision-learning-only SLM/SP design in terms of both efficiency and effectiveness.
5.5 Experiments
5.5.1 Results on SLM/SP decision learning with GL features
5.5.1.1 Experimental Setup
Datasets and Performance Metrics. To demonstrate the SLM/SP decision learning capability, we conduct experiments on the four image classification datasets utilized in Sec. 4.4: MNIST [83], Fashion-MNIST
[125], CIFAR10 [70], and STL10 [22]. We adopt the SSL framework for GL features and follow the same
settings applied to the four datasets as in Sec. 4.4. The experimental results in this section show the complex decision learning capability of the SLM/SP tree compared to SVM and XGBoost.
Benchmarking Image Classification Methods. For performance benchmarking, we consider several representative GL- and DL-based image classification methods. As mentioned earlier, GL-based methods have three cascaded modules: 1) unsupervised representation learning, 2) supervised feature learning, and 3) supervised decision learning. Here, we compare three GL-based methods:
• GL-1: PixelHop [17] for module 1 and SVM for module 3;
• GL-2: PixelHop++ [18] for module 1 and XGBoost for module 3;
• GL-3: PixelHop++ for module 1 and SLM/SP for module 3.
For GL-2, we replace the linear classifier in [18] with the XGBoost Classifier, use DFT to select 2000
features for decision learning, and set the tree depth to 5 and the tree number to 1000. For GL-3, we use
DFT to select 512-D features as the input to SLM/SP and set the hidden layer neuron number to 512 and
the maximum depth of the SLM/SP tree to 5. For CIFAR-10 and STL-10 datasets, we include VGG-16 [112]
and LeNet-5 [81] in performance benchmarking. They are representatives of heavy and lightweight neural
networks, respectively. We follow the standard design of VGG-16 and the settings in [18] for LeNet-5. We
adopt the same hyper-parameters in DL networks for CIFAR-10 and STL-10.
Table 5.1: Performance Comparison for CIFAR-10 with GL.

| Methods           | VGG-16    | LeNet-5  | GL-1    | GL-2     | GL-3 (SLM/SP) |
| FLOPs             | 875.06 M  | 14.74 M  | 21.30 M | 17.73 M  | 20.29 M       |
| (Against LeNet-5) | (59.40x)  | (1x)     | (1.45x) | (1.20x)  | (1.38x)       |
| Model Size        | 138.36 M  | 395.01 K | 1.66 M  | 739.89 K | 4.39 M        |
| (Against LeNet-5) | (350.27x) | (1x)     | (4.20x) | (1.87x)  | (11.11x)      |
| Accuracy (%)      | 93.15     | 68.72    | 71.37   | 75.29    | 87.36         |
Table 5.2: Performance Comparison for STL-10 with GL.

| Methods           | VGG-16    | LeNet-5  | GL-1     | GL-2     | GL-3 (SLM/SP) |
| FLOPs             | 875.06 M  | 14.74 M  | 76.72 M  | 8.16 M   | 8.19 M        |
| (Against LeNet-5) | (59.40x)  | (1x)     | (5.20x)  | (0.55x)  | (0.56x)       |
| Model Size        | 138.36 M  | 395.01 K | 8.10 M   | 427.02 K | 4.40 M        |
| (Against LeNet-5) | (350.27x) | (1x)     | (20.51x) | (1.08x)  | (11.14x)      |
| Accuracy (%)      | 65.75     | 51.89    | 56.48    | 62.07    | 66.15         |
Table 5.3: Performance Comparison for MNIST with GL.

| Methods           | LeNet-5  | GL-2    | GL-3 (SLM/SP) |
| FLOPs             | 846.08 K | 1.49 M  | 2.05 M        |
| (Against LeNet-5) | (1x)     | (1.76x) | (2.42x)       |
| Model Size        | 61.71 K  | 57.82 K | 1.12 M        |
| (Against LeNet-5) | (1x)     | (0.94x) | (18.15x)      |
| Accuracy (%)      | 99.04    | 99.19   | 99.30         |
Table 5.4: Performance Comparison for Fashion-MNIST with GL.

| Methods           | LeNet-5  | GL-2     | GL-3 (SLM/SP) |
| FLOPs             | 3.58 M   | 2.69 M   | 3.05 M        |
| (Against LeNet-5) | (1x)     | (0.75x)  | (0.85x)       |
| Model Size        | 194.56 K | 233.03 K | 1.25 M        |
| (Against LeNet-5) | (1x)     | (1.20x)  | (6.41x)       |
| Accuracy (%)      | 89.74    | 91.37    | 92.20         |
5.5.1.2 Experimental Results and Discussion
CIFAR-10 The performance comparison of the five benchmarking methods for CIFAR-10 is shown in Table 5.1. We can categorize them into two groups, lightweight and heavyweight, based on the inference complexity and the model size given in the first two rows. LeNet-5, GL-1, GL-2, and GL-3 belong to the lightweight group, while VGG-16 belongs to the heavyweight group. GL-3, which uses the proposed SLM/SP classifier, outperforms LeNet-5, GL-1, and GL-2 by 18.64%, 15.99%, and 12.07%, respectively. In particular, GL-2 and GL-3 are almost identical except for the last module: GL-2 uses the XGBoost classifier while GL-3 uses the proposed SLM/SP classifier. Their significant performance gap demonstrates the effectiveness of SLM/SP over XGBoost. The additional costs in inference complexity (FLOPs) and model size appear to be well justified. As compared to the heavyweight VGG-16 model, GL-3 is inferior by 5.79% in classification accuracy. However, GL-3 demands far fewer inference FLOPs and fewer model parameters. The savings in memory and computation are attractive for mobile and edge computing.
STL-10 As mentioned in Sec. 4.4, STL-10 is used to study the data deficiency setting. Due to the small
number of training data in STL-10, DL-based methods cannot benefit much from their large model sizes
as compared to the CIFAR-10 dataset. The performance comparison of five benchmarking methods for
STL-10 is shown in Table 5.2. GL-3 achieves the best classification accuracy among all five benchmarking
methods. At the same time, its inference complexity is close to the lowest. For the training data deficiency
case, VGG-16 can only achieve classification accuracy similar to GL-3 but with much higher inference
complexity (106x) and memory requirement (33x). The experimental results show that GL-3 can utilize
the limited training data effectively and provide the best tradeoff between efficiency and effectiveness.
MNIST For the MNIST and Fashion MNIST datasets, we compare the performance of three benchmarking methods, i.e., LeNet-5, GL-2, and GL-3. MNIST is an easy dataset. The results are shown in Table 5.3. The classification accuracy saturates at 99% for most benchmarking methods. The improvements in classification accuracy of GL-3 over LeNet-5 and GL-2 are 0.26% and 0.11%, respectively. Although the gains
are relatively small, they are achieved with additional inference complexity and a larger model size.
Fashion MNIST The results are shown in Table 5.4. Its classification accuracy is around 90% for most
benchmarking methods. The improved classification accuracy rates of GL-3 over LeNet-5 and GL-2 are
2.46% and 0.83%, respectively. The inference complexity of all three methods is comparable. Although the
model size of GL-3 is larger than those of LeNet-5 and GL-2, it is still relatively small (i.e., 1.25M). It can
be well accommodated by mobile and edge devices.
5.5.2 Results on SLM/SP with Supervised Local Representation Learning
5.5.2.1 Experimental Setup
Datasets and Performance Metrics. To demonstrate the SLM/SP with efficient local feature learning, we conduct experiments on four image classification datasets: MNIST [83], CIFAR-10 and CIFAR-100 [70], and Tiny-Imagenet [79]. Examples from the four datasets are shown in Fig. 5.6.
CIFAR-100 is introduced in [70] along with CIFAR-10. It is a subset of the Tiny Images Dataset and
consists of 60,000 32x32 color images. The dataset is divided into 100 classes, each containing 600 images.
These 100 classes are further grouped into 20 super classes. Each image in the dataset comes with two
labels: a fine label that is the class it belongs to and a coarse label which is the super class it belongs to.
There are 500 training images and 100 testing images per fine class. The CIFAR-100 dataset is widely used
for benchmarking in the field of machine learning, particularly for tasks related to image classification. It
provides a challenging test bed due to its fine-grained classification tasks and relatively small size of the
images. We use CIFAR-100 to show the capability of the proposed classification framework for larger class numbers, and it offers a direct comparison with the CIFAR-10 dataset since the input images come from a similar domain.
Figure 5.6: Examples from four datasets. (a) MNIST, (b) CIFAR-10, (c) CIFAR-100, (d) Tiny ImageNet
Tiny-Imagenet was proposed in [79]; it is a subset of the original ImageNet dataset [27]. It consists of 100,000 images across 200 classes, with each class containing 500 training images, 50 validation images, and 50 test images. The images are downsized to a resolution of 64x64 pixels, which makes the dataset more challenging for information extraction and image classification tasks. This dataset has since been used in numerous benchmarks. We use Tiny-Imagenet as a more challenging image classification task, with more classes and a higher input image resolution compared with CIFAR-100.
Benchmarking Image Classification Methods. As discussed in Sec. 5.4.1, for the experimental setup of the SLM/SP with SLAB for local representation learning (SLM-SLAB), we utilize the optimization in SLAB for the main global-to-local representation learning at each node in the SLM/SP tree, and SSL for introducing unsupervised, rich, decorrelated spectral information via the Saab transform.
For the overall SLAB setting, we utilize a 3x3 filter size for the depthwise convolution, set the squeeze ratio of the SE block to 0.0625, and set the pointwise expansion ratio to 6. For the optimization, we apply the Adam optimizer and set the learning rate to 1e-3 and the weight decay to 1e-6. We use a step learning rate scheduler in which the learning rate drops to 0.1x every 50 epochs. We set the local optimization for each dataset to 100 epochs and the global optimization to 200 epochs.
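Under the assumption that a SLAB block follows an inverted-residual layout (1x1 pointwise expansion, 3x3 depthwise convolution, SE block, 1x1 pointwise projection) with the hyper-parameters above, its weight count can be estimated as follows. This is a rough sketch that ignores biases and normalization layers, and `slab_block_params` is a hypothetical helper rather than part of the actual implementation:

```python
def slab_block_params(c_in, c_out, k=3, expand=6, se_ratio=0.0625):
    """Rough weight count for an inverted-residual-style block with SE,
    using the stated hyper-parameters (biases and norm layers ignored)."""
    c_mid = c_in * expand             # expanded channel count
    pw_expand = c_in * c_mid          # 1x1 pointwise expansion
    dw = c_mid * k * k                # 3x3 depthwise convolution
    c_se = max(1, int(c_mid * se_ratio))
    se = c_mid * c_se + c_se * c_mid  # squeeze-and-excitation FC pair
    pw_project = c_mid * c_out        # 1x1 pointwise projection
    return pw_expand + dw + se + pw_project

print(slab_block_params(64, 64))  # → 71040
```

The estimate illustrates why such blocks stay lightweight: the depthwise term grows only linearly in the expanded channel count, while the two pointwise layers dominate the parameter budget.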
For MNIST, we apply 5x5 Saab filters to the input 28x28 gray-scale images and utilize DFT to greedily select the 16 most discriminant channels with the lowest spatial global average DFT loss, yielding a 28x28x16 feature map as the global SSL representation. Then we apply SLM/SP to the global representation for joint local representation and decision learning. For SLAB, we set the channel number of the manifold-of-interest subspace feature to 64. To compare with the results in Sec. 5.5.1, we demonstrate the results with three learned SLM/SP tree architectures, named SLM-SLAB-tiny, SLM-SLAB-small, and SLM-SLAB-large, by setting the maximum depth of the SLM/SP tree to 3 for tiny, 4 for small, and 5 for large, respectively. For the inner node module, we set the hidden layer dimension to 0.5x of the input dimension. The batch size is set to 256.
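Since the SLM/SP tree is binary, the depth settings above bound the model size: a tree of maximum depth d has at most 2^d - 1 inner nodes and 2^d leaves. Actual SLM/SP trees are grown adaptively with a stopping criterion, so these are upper bounds rather than the realized architectures:

```python
def full_binary_tree_sizes(depth):
    """Upper bounds on node counts for an SLM/SP tree of maximum `depth`
    (adaptively grown trees may have fewer nodes)."""
    inner = 2 ** depth - 1
    leaves = 2 ** depth
    return inner, leaves

for name, d in [("tiny", 3), ("small", 4), ("large", 5)]:
    print(name, full_binary_tree_sizes(d))
```

For the tiny/small/large settings this gives at most 7/15/31 inner nodes and 8/16/32 leaves, respectively.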
For CIFAR-10 and CIFAR-100, similar to MNIST, we apply 5x5 Saab filters to the input 32x32x3 color images and DFT to greedily select 32 channels, yielding a 32x32x32 global SSL feature map for the SLM/SP tree. For SLAB, we set the channel number of the manifold-of-interest subspace feature to 64 for CIFAR-10 and 128 for CIFAR-100. For the inner node module, we set the hidden layer dimension to 0.5x of the input dimension. For the leaf node modules, linear classifiers are adopted. The batch size is set to 256.
For Tiny-Imagenet, we apply SSL to yield a 64x64x32 feature map for the SLM/SP tree. For SLAB, we set the channel number of the manifold-of-interest subspace feature to 256. For the inner node module, we set the hidden layer dimension to 0.5x of the input dimension. For the leaf node modules, linear classifiers are adopted. The batch size is set to 64.
5.5.2.2 Experimental Results and Discussion
MNIST For MNIST, we compare the performance of the benchmarking methods in Sec. 5.5.1 along with SLM-SLAB. The results are shown in Table 5.5. The classification accuracy saturates at 99% for all the benchmarking methods. We compare the results of the three proposed designs, SLM-SLAB-tiny, SLM-SLAB-small, and SLM-SLAB-large, with the previous results. SLM-SLAB-tiny has the most lightweight model size and inference cost among the benchmarked methods, and it still outperforms LeNet-5 and the GL series. SLM-SLAB-tiny utilizes 0.3x the model parameters and 0.76x the FLOPs compared to LeNet-5, and its classification accuracy improvement over LeNet-5 is 0.31%. SLM-SLAB-tiny also outperforms GL-3 with SLM/SP for decision learning by 0.05%, with a more than 50x smaller model size. Although the gains are marginal, the remaining test error cases are corner cases that are hard to generalize from the training data. For SLM-SLAB-small, the accuracy can be pushed up by another 0.12% with 1.6x the model size and FLOPs, which is still less than 0.5x of the previously most lightweight LeNet-5 model. The best performance on MNIST is achieved with SLM-SLAB-large: with 2.25x the model size and 4.5x the FLOPs compared to LeNet-5, SLM-SLAB-large achieves a 0.52% accuracy improvement, which is significant considering the hard cases and saturated accuracy.
CIFAR-10 The performance comparison of the benchmarking methods for CIFAR-10 is shown in Table 5.6. We propose SLM-SLAB-small and SLM-SLAB-large for the CIFAR-10 dataset and compare the results with the methods in Sec. 5.5.1. GL-3, which uses the proposed SLM/SP classifier, already shows a significant performance gap over the previous GL methods and LeNet-5, which demonstrates the effectiveness of SLM/SP for decision learning over XGBoost. The SLM-SLAB method pushes the performance further, by 3.37% and 4.44% with the small and large settings, respectively. With the proposed SLAB for efficient local representation learning, the overall parameter distribution is more efficient, and the SLM-SLAB-small module uses 0.3x the number of parameters to achieve the performance improvement above. However, by introducing the convolutions in SLAB, the filters must be applied to each spatial location of the feature map, which yields 1.85x the FLOPs compared to GL-3. To match the model size of GL-3 for a more direct comparison, SLM-SLAB-large is proposed. With a similar model size and 3x the FLOPs, SLM-SLAB-large achieves an efficient module design with the best accuracy.
CIFAR-100 The performance benchmarking results are summarized in Table 5.7. With the increased number of classes in the dataset, the GL features fail to represent the distribution of the data under the fine class labelling. For example, with the same GL-2 setting as for CIFAR-10, the model only achieves 25.24% test accuracy with a model size of 3.73 M, which is larger than the MobileNetv2 and SLM-SLAB models listed in Table 5.7. Hence we compare our SLM-SLAB method with representative deep learning methods, i.e., VGG-16, ResNet-18, and MobileNetv2. We utilize VGG-16 as a representative of straightforward CNN design for image classification, ResNet-18 as a representative of effective CNN design, and MobileNetv2 as a representative of efficient CNN design. Compared to VGG-16, SLM-SLAB outperforms the heavy model by 0.18% while utilizing 25x fewer model parameters and 18x fewer FLOPs, which demonstrates the high performance and lightweight design of the method. Compared to MobileNetv2, SLM-SLAB has 1.32% lower accuracy while utilizing 1.8x fewer model parameters and 1.8x fewer FLOPs. ResNet-18 achieves comparable accuracy, with 8.4x more model parameters used for optimization compared to SLM-SLAB. In summary, the results demonstrate that our proposed SLM-SLAB yields a lightweight model and inference compared to both the straightforward and the efficient deep learning designs.
Tiny-Imagenet The performance benchmarking results are summarized in Table 5.8. Similar to CIFAR-100, we compare the proposed SLM-SLAB with representative deep learning methods, i.e., VGG-16, ResNet-18, and MobileNetv2. With Tiny-Imagenet's distribution over 200 classes, the task is much more challenging compared to CIFAR-100. With 43x the model parameters and 20x the FLOPs, the straightforward CNN design achieves the best accuracy among the listed methods. Similar to CIFAR-100, our SLM-SLAB achieves results similar to MobileNetv2 with 2.73x fewer model parameters and 2x fewer FLOPs. Our method outperforms the ResNet-18 model by 6.22% with 12x fewer model parameters and 4.7x fewer FLOPs. In summary, the results demonstrate that our proposed SLM-SLAB yields a lightweight model and inference compared to both the efficient and the effective deep learning designs.
Table 5.5: Performance Comparison for MNIST with SLM-SLAB.

| Methods        | FLOPs    | Model Size | Accuracy (%) |
| LeNet-5        | 846.08 K | 61.71 K    | 99.04        |
| GL-1           | 14.23 M  | 3.128 M    | 99.09        |
| GL-2           | 1.51 M   | 104.28 K   | 99.19        |
| GL-3           | 2.05 M   | 1.12 M     | 99.30        |
| SLM-SLAB-tiny  | 645.62 K | 18.5 K     | 99.35        |
| SLM-SLAB-small | 1.14 M   | 30.5 K     | 99.47        |
| SLM-SLAB-large | 3.8 M    | 139.35 K   | 99.56        |
Table 5.6: Performance Comparison for CIFAR10 with SLM-SLAB.

| Methods        | FLOPs   | Model Size | Accuracy (%) |
| LeNet-5        | 14.74 M | 395.01 K   | 68.72        |
| GL-1           | 21.30 M | 1.66 M     | 71.37        |
| GL-2           | 17.83 M | 1.47 M     | 75.29        |
| GL-3           | 20.29 M | 4.39 M     | 87.36        |
| SLM-SLAB-small | 37.62 M | 1.31 M     | 90.73        |
| SLM-SLAB-large | 60.08 M | 4.89 M     | 91.80        |

Table 5.7: Performance Comparison for CIFAR100 with SLM-SLAB.

| Methods         | FLOPs    | Model Size | Accuracy (%) |
| VGG-16          | 666.34 M | 34.01 M    | 64.48        |
| ResNet-18       | 148.76 M | 11.23 M    | 65.61        |
| MobileNetv2     | 68.4 M   | 2.36 M     | 65.98        |
| SLM-SLAB (Ours) | 37.65 M  | 1.34 M     | 64.66        |

Table 5.8: Performance Comparison for Tiny-Imagenet with SLM-SLAB.

| Methods         | FLOPs    | Model Size | Accuracy (%) |
| VGG-16          | 2.56 G   | 40.71 M    | 38.75        |
| ResNet-18       | 595.24 M | 11.28 M    | 25.90        |
| MobileNetv2     | 253.60 M | 2.57 M     | 33.13        |
| SLM-SLAB (Ours) | 125.58 M | 0.94 M     | 32.12        |

5.6 Conclusion

A tree-based classifier called the Subspace Learning Machine with Soft Partitioning (SLM/SP) was proposed in this work. The main difference between a classic decision tree and the SLM tree is that a combination of input features is conducted at all inner nodes of an SLM tree. The SDT data structure was employed by SLM/SP; it learns an adaptive tree structure with local greedy subspace partitioning. To achieve efficient local representation learning, SLAB is proposed for the edges of the SLM/SP tree. With the proposed SLM/SP, the optimal weights at each edge and node can be obtained by optimizing a differentiable loss function with the mini-batch SGD method. The SLM/SP classifier enables efficient training, high classification accuracy, and a small model size, and it is highly competitive with DL networks on the image classification datasets in this work.
Chapter 6
Summary and Future Work
6.1 Summary of the Research
In this thesis, we focus on machine learning methods for low-dimensional data sources and high-dimensional
data sources.
For low-dimensional data sources, we propose a novel machine learning method named SLM for general classification and regression tasks. SLM adopts the simplicity of the decision tree data structure and the much more effective learning paradigm of the MLP to learn discriminant subspaces and make predictions. SLM is lightweight, mathematically transparent, and achieves state-of-the-art benchmarking performance on low-dimensional data sources. The SLM tree can serve as a weak classifier in general boosting and bootstrap aggregation methods as a more generalized model. To reduce the training complexity of SLM and provide training acceleration, we propose accelerating the projection learning of SLM with optimization-based methods, and we implemented parallel processing and CUDA support to effectively reduce the training time required for SLM models. Experimental results show that, compared to the original SLM/SLR, accelerated SLM/SLR can achieve a speed-up factor of 577 in training time while maintaining comparable classification and regression performance.
For high-dimensional data sources, we propose novel efficient object recognition methods, including a feed-forward efficient machine learning framework and an efficient SLM/SP framework. Our methods demonstrate improved accuracy when compared to both deep learning and traditional two-stage machine learning methods, while requiring lightweight computational resources in terms of FLOPs and model size. Furthermore, our method incorporates a flexible framework design, which allows for adaptable model size and FLOPs and enables striking a balance between efficiency and effectiveness for different classification tasks.
6.2 Future Research Topics
Our proposed methods have demonstrated a much more efficient and flexible learning paradigm for image-based machine learning tasks. To further exploit the potential of this paradigm, we are interested in exploring more image-based tasks, such as efficient object detection, efficient image classification on more challenging datasets such as ImageNet, and effective solutions for data-deficient machine learning tasks such as medical image classification.
6.2.1 Efficient Object Detection
Object detection is a major research field in computer vision and image processing that deals with detecting instances of semantic objects of certain classes (such as humans, buildings, or cars) in images and videos. With the development of deep learning, object detection has made significant progress in recent years and has been applied in various fields such as self-driving cars, surveillance, and robotics. The main goal of object detection is to identify and locate objects of interest in an image or video, and it typically involves three main steps: feature extraction, object proposal generation, and object classification.
Recent object detection methods are predominantly based on deep learning. One of the most popular deep-learning-based object detection methods is the Single Shot MultiBox Detector (SSD) [90], a single deep neural network that can perform object detection and classification. SSD uses a convolutional neural network (CNN) to extract features from the image and then predicts the locations of objects using default boxes of different aspect ratios and scales. Another popular deep-learning-based object detection method is the You Only Look Once (YOLO) [104] algorithm, which uses a single CNN to predict the classes and locations of objects in an image. YOLOv8 is the latest version of YOLO; it is able to detect objects at multiple scales, and it is generally faster than other state-of-the-art object detectors. Another well-known method is the region-based convolutional neural network (R-CNN), which uses region proposal techniques to identify the regions of interest in an image and then classifies the objects in these regions. One of its variants, Faster R-CNN [42], uses a shared convolutional network for feature extraction and region proposal. Faster R-CNN is an extension of the original R-CNN algorithm that uses a Region Proposal Network (RPN) to generate object proposals instead of a separate object proposal algorithm. In addition to the above, RetinaNet and Mask R-CNN are other popular methods that use the concepts of anchor boxes and multitasking to detect objects and perform segmentation at the same time. An overview of the methods above is summarized in Fig. 6.1. Deep learning has made significant progress in object detection in recent years; advancements in architectures and techniques such as anchor-based detection, multitasking, and feature pyramid networks have improved performance and made detection more accurate.

Figure 6.1: Overview of the representative object detection methods. (a) YOLO (b) Faster-RCNN (c) SSD
However, deep-learning-based methods typically require large training datasets and incur huge computational cost during training as well as inference, making them unsuitable for many real-world tasks where these conditions cannot be satisfied. For the future work of this thesis, we are interested in exploring a learning paradigm that is more efficient than deep-learning-based methods in terms of model size and FLOPs, achieving a better trade-off between model performance evaluated with mean average precision (mAP), inference speed, and efficiency.
6.2.2 Efficient Image Classification on ImageNet Challenge
The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is an annual competition organized
by the ImageNet project. The goal of the competition is to evaluate the performance of computer vision
algorithms, specifically object recognition algorithms, on a large and challenging dataset called ImageNet.
The dataset contains over 14 million images labeled with more than 22,000 object categories. The ILSVRC competition has been held annually since 2010, and it has played a major role in advancing the state-of-the-art in object recognition. The competition has attracted entries from leading research institutions and companies, and it has become one of the most prestigious and influential events in the field of computer vision. Example images are shown in Fig. 6.2.
As discussed in Chapters 4 and 5, in recent years, deep learning techniques, particularly convolutional neural networks (CNNs) and transformers, have dominated the competition and have set new records in object recognition performance. Traditional machine learning methods have been used for image classification on the ImageNet dataset as well; they typically involve the use of hand-crafted features, such as SIFT, HOG, and SURF, which are extracted from the images and then used as input to a machine learning algorithm such as a support vector machine (SVM) or a random forest. However, traditional machine learning methods for image classification have several limitations compared to deep-learning-based methods. They often require a large amount of labeled data in order to achieve good performance, and the hand-crafted features may not be the most suitable for the task of image classification.

Figure 6.2: Example images from ImageNet
For the future work of this thesis, we plan to further explore the efficient image classification task on large datasets like ImageNet-1k and compete in the ILSVRC challenge. We aim to provide an adaptive framework with a flexible trade-off between efficiency and performance, providing lightweight models for distributed devices and tasks.
6.2.3 Machine Learning for Deficient Data Source
Despite the success of deep learning methods in a wide range of computer vision tasks, such as object classification, object detection, and image segmentation, it is worth noting that when the amount of labeled training data is limited, e.g., in medical image analysis tasks and tasks with data privacy issues, these methods may not perform as well due to overfitting and a lack of generalization. There are many research topics on data-deficiency tasks for deep learning, including semi-supervised learning (such as self-training and co-training), transfer learning, data augmentation, and architecture designs for these specific tasks such as EfficientNet; with our proposed method, we aim to explore more efficient models with lightweight distribution and flexible building blocks.
Our proposed method has shown impressive potential in efficiency and effectiveness on data-deficient tasks. For future work, we are interested in exploring a distributed model ensemble for specific data-deficient data source tasks. For example, in medical image analysis, labeled data may not be sufficient for deep methods and are usually distributed across local institutions, organizations, and companies with no public access. Our proposed efficient machine learning method makes much better use of training data that is deficient for deep models, and we may distribute our model to each deficient data source for its specific task. For the public data source, instead of using pretrained deep learning models, our efficient method can afford another independent model with only a small increase in computational and memory requirements. With distributed lightweight models for different specific data sources and tasks, we may ensemble the models and make the final decision with a certain criterion such as prediction confidence, which serves as a bridge over the distributed data privacy issue and an alternative solution to single deep models for deficient data source problems.
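A confidence-based ensemble of the kind sketched above can be as simple as selecting, per sample, the member whose softmax output is most confident. A minimal NumPy illustration (the two probability tables are made-up examples, not results from trained models):

```python
import numpy as np

def ensemble_by_confidence(prob_list):
    """Pick, per sample, the member prediction with the highest confidence
    (max class probability), one possible criterion among others."""
    probs = np.stack(prob_list)     # (n_models, n_samples, n_classes)
    conf = probs.max(axis=2)        # confidence of each model per sample
    best = conf.argmax(axis=0)      # most confident model per sample
    chosen = probs[best, np.arange(probs.shape[1])]
    return chosen.argmax(axis=1)    # final class decisions

# Two hypothetical distributed models on three samples, four classes.
m1 = np.array([[0.70, 0.10, 0.10, 0.10],
               [0.30, 0.30, 0.20, 0.20],
               [0.25, 0.25, 0.25, 0.25]])
m2 = np.array([[0.40, 0.30, 0.20, 0.10],
               [0.10, 0.80, 0.05, 0.05],
               [0.10, 0.10, 0.10, 0.70]])
print(ensemble_by_confidence([m1, m2]))  # → [0 1 3]
```

Each member only exchanges its output probabilities, so the raw training data never leaves its local data source.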
Bibliography
[1] Abdul Ahad, Ahsan Fayyaz, and Tariq Mehmood. “Speech recognition using multilayer
perceptron”. In: IEEE Students Conference, ISCON’02. Proceedings. Vol. 1. IEEE. 2002, pp. 103–109.
[2] Yali Amit and Donald Geman. “Shape quantization and recognition with randomized trees”. In:
Neural computation 9.7 (1997), pp. 1545–1588.
[3] Arthur Asuncion and David Newman. UCI machine learning repository. 2007.
[4] Zohreh Azizi, Xuejing Lei, and C-C Jay Kuo. “Noise-aware texture-preserving low-light
enhancement”. In: 2020 IEEE International Conference on Visual Communications and Image
Processing (VCIP). IEEE. 2020, pp. 443–446.
[5] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. “Surf: Speeded up robust features”. In:
European conference on computer vision. Springer. 2006, pp. 404–417.
[6] Ali Borji and Laurent Itti. “Human vs. computer in scene and object recognition”. In: Proceedings
of the IEEE conference on computer vision and pattern recognition. 2014, pp. 113–120.
[7] L Breiman, J Friedman, CJ Stone, and RA Olshen. Classification and Regression Trees. CRC, Boca
Raton, FL, 1984.
[8] Leo Breiman. “Bagging predictors”. In: Machine learning 24.2 (1996), pp. 123–140.
[9] Leo Breiman. “Random forests”. In: Machine learning 45.1 (2001), pp. 5–32.
[10] Han Cai, Ligeng Zhu, and Song Han. “Proxylessnas: Direct neural architecture search on target
task and hardware”. In: arXiv preprint arXiv:1812.00332 (2018).
[11] Mathilde Caron, Ari Morcos, Piotr Bojanowski, Julien Mairal, and Armand Joulin. “Pruning
convolutional neural networks with self-supervision”. In: arXiv preprint arXiv:2001.03554 (2020).
[12] Hong-Shuo Chen, Shuowen Hu, Suya You, C-C Jay Kuo, et al. “Defakehop++: An enhanced
lightweight deepfake detector”. In: APSIPA Transactions on Signal and Information Processing 11.2
(2022).
[13] Hong-Shuo Chen, Mozhdeh Rouhsedaghat, Hamza Ghani, Shuowen Hu, Suya You, and
C-C Jay Kuo. “DefakeHop: A Light-Weight High-Performance Deepfake Detector”. In: 2021 IEEE
International Conference on Multimedia and Expo (ICME). IEEE. 2021, pp. 1–6.
[14] Tianqi Chen and Carlos Guestrin. “Xgboost: A scalable tree boosting system”. In: Proceedings of
the 22nd acm sigkdd international conference on knowledge discovery and data mining. 2016,
pp. 785–794.
[15] Tianqi Chen and Carlos Guestrin. “Xgboost: A scalable tree boosting system”. In: Proceedings of
the 22nd acm sigkdd international conference on knowledge discovery and data mining. 2016,
pp. 785–794.
[16] Tianqi Chen, Tong He, Michael Benesty, Vadim Khotilovich, Yuan Tang, Hyunsu Cho,
Kailong Chen, et al. “Xgboost: extreme gradient boosting”. In: R package version 0.4-2 1.4 (2015),
pp. 1–4.
[17] Yueru Chen and C-C Jay Kuo. “Pixelhop: A successive subspace learning (ssl) method for object
recognition”. In: Journal of Visual Communication and Image Representation 70 (2020), p. 102749.
[18] Yueru Chen, Mozhdeh Rouhsedaghat, Suya You, Raghuveer Rao, and C-C Jay Kuo. “Pixelhop++:
A small successive-subspace-learning-based (ssl-based) model for image classification”. In: 2020
IEEE International Conference on Image Processing (ICIP). IEEE. 2020, pp. 3294–3298.
[19] Yueru Chen, Zhuwei Xu, Shanshan Cai, Yujian Lang, and C-C Jay Kuo. “A saak transform
approach to efficient, scalable and robust handwritten digits recognition”. In: 2018 Picture Coding
Symposium (PCS). IEEE. 2018, pp. 174–178.
[20] Yunpeng Chen, Jianan Li, Huaxin Xiao, Xiaojie Jin, Shuicheng Yan, and Jiashi Feng. “Dual path
networks”. In: Advances in neural information processing systems 30 (2017).
[21] Yoni Choukroun, Eli Kravchik, Fan Yang, and Pavel Kisilev. “Low-bit quantization of neural
networks for efficient inference”. In: 2019 IEEE/CVF International Conference on Computer Vision
Workshop (ICCVW). IEEE. 2019, pp. 3009–3018.
[22] Adam Coates, Andrew Ng, and Honglak Lee. “An analysis of single-layer networks in
unsupervised feature learning”. In: Proceedings of the fourteenth international conference on
artificial intelligence and statistics. JMLR Workshop and Conference Proceedings. 2011,
pp. 215–223.
[23] Corinna Cortes and Vladimir Vapnik. “Support-vector networks”. In: Machine learning 20.3
(1995), pp. 273–297.
[24] Antonio Criminisi and Jamie Shotton. Decision forests for computer vision and medical image
analysis. Springer Science & Business Media, 2013.
[25] George Cybenko. “Approximation by superpositions of a sigmoidal function”. In: Mathematics of
control, signals and systems 2.4 (1989), pp. 303–314.
[26] Navneet Dalal and Bill Triggs. “Histograms of oriented gradients for human detection”. In: 2005
IEEE computer society conference on computer vision and pattern recognition (CVPR’05). Vol. 1. IEEE.
2005, pp. 886–893.
[27] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. “Imagenet: A large-scale
hierarchical image database”. In: 2009 IEEE conference on computer vision and pattern recognition.
IEEE. 2009, pp. 248–255.
[28] A Victor Devadoss and T Antony Alphonnse Ligori. “Forecasting of stock prices using multi layer
perceptron”. In: International journal of computing algorithm 2.1 (2013), pp. 440–449.
[29] Thomas G Dietterich. “An experimental comparison of three methods for constructing ensembles
of decision trees: Bagging, boosting, and randomization”. In: Machine learning 40.2 (2000),
pp. 139–157.
[30] Xiaohan Ding, Xiangyu Zhang, Ningning Ma, Jungong Han, Guiguang Ding, and Jian Sun.
“Repvgg: Making vgg-style convnets great again”. In: Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition. 2021, pp. 13733–13742.
[31] Xuanyi Dong and Yi Yang. “Nas-bench-201: Extending the scope of reproducible neural
architecture search”. In: arXiv preprint arXiv:2001.00326 (2020).
[32] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai,
Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al.
“An image is worth 16x16 words: Transformers for image recognition at scale”. In: arXiv preprint
arXiv:2010.11929 (2020).
[33] Russell C Eberhart, Yuhui Shi, and James Kennedy. Swarm intelligence. Elsevier, 2001.
[34] Ronald A Fisher. “The use of multiple measurements in taxonomic problems”. In: Annals of
eugenics 7.2 (1936), pp. 179–188.
[35] Marcus Frean. “The upstart algorithm: A method for constructing and training feedforward
neural networks”. In: Neural computation 2.2 (1990), pp. 198–209.
[36] Jerome H Friedman. “Greedy function approximation: a gradient boosting machine”. In: Annals of
statistics (2001), pp. 1189–1232.
[37] Jerome H Friedman. “Multivariate adaptive regression splines”. In: The annals of statistics 19.1
(1991), pp. 1–67.
[38] Hongyu Fu, Yijing Yang, Yuhuai Liu, Joseph Lin, Ethan Harrison, Vinod K Mishra, and
C-C Jay Kuo. “Acceleration of Subspace Learning Machine via Particle Swarm Optimization and
Parallel Processing”. In: arXiv preprint arXiv:2208.07023 (2022).
[39] Hongyu Fu, Yijing Yang, Vinod K Mishra, and C-C Jay Kuo. “Classification via Subspace Learning
Machine (SLM): Methodology and Performance Evaluation”. In: ICASSP 2023-2023 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2023, pp. 1–5.
[40] Hongyu Fu, Yijing Yang, Vinod K Mishra, and C-C Jay Kuo. “Subspace Learning Machine (SLM):
Methodology and Performance”. In: arXiv preprint arXiv:2205.05296 (2022).
[41] Stephen I Gallant et al. “Perceptron-based learning algorithms”. In: IEEE Transactions on neural
networks 1.2 (1990), pp. 179–191.
[42] Ross Girshick. “Fast r-cnn”. In: Proceedings of the IEEE international conference on computer vision.
2015, pp. 1440–1448.
[43] Fred Glover. “Future paths for integer programming and links to artificial intelligence”. In:
Computers & operations research 13.5 (1986), pp. 533–549.
[44] Jan Philip Göpfert, Heiko Wersing, and Barbara Hammer. “Interpretable Locally Adaptive Nearest
Neighbors”. In: Neurocomputing 470 (2022), pp. 344–351.
[45] Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun.
“Single path one-shot neural architecture search with uniform sampling”. In: Computer
Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part
XVI 16. Springer. 2020, pp. 544–560.
[46] Song Han, Huizi Mao, and William J Dally. “Deep compression: Compressing deep neural
networks with pruning, trained quantization and huffman coding”. In: arXiv preprint
arXiv:1510.00149 (2015).
[47] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep residual learning for image
recognition”. In: Proceedings of the IEEE conference on computer vision and pattern recognition.
2016, pp. 770–778.
[48] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. “Distilling the knowledge in a neural network”. In:
arXiv preprint arXiv:1503.02531 (2015).
[49] Tin Kam Ho. “The random subspace method for constructing decision forests”. In: IEEE
transactions on pattern analysis and machine intelligence 20.8 (1998), pp. 832–844.
[50] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. “Multilayer feedforward networks are
universal approximators”. In: Neural networks 2.5 (1989), pp. 359–366.
[51] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang,
Tobias Weyand, Marco Andreetto, and Hartwig Adam. “Mobilenets: Efficient convolutional
neural networks for mobile vision applications”. In: arXiv preprint arXiv:1704.04861 (2017).
[52] Jie Hu, Li Shen, and Gang Sun. “Squeeze-and-excitation networks”. In: Proceedings of the IEEE
conference on computer vision and pattern recognition. 2018, pp. 7132–7141.
[53] Guang-Bin Huang and Lei Chen. “Convex incremental extreme learning machine”. In:
Neurocomputing 70.16-18 (2007), pp. 3056–3062.
[54] Guang-Bin Huang and Lei Chen. “Enhanced random search based incremental extreme learning
machine”. In: Neurocomputing 71.16-18 (2008), pp. 3460–3468.
[55] Guang-Bin Huang, Lei Chen, Chee Kheong Siew, et al. “Universal approximation using
incremental constructive feedforward networks with random hidden nodes”. In: IEEE Trans.
Neural Networks 17.4 (2006), pp. 879–892.
[56] Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew. “Extreme learning machine: theory and
applications”. In: Neurocomputing 70.1-3 (2006), pp. 489–501.
[57] Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew. “Extreme learning machine: theory and
applications”. In: Neurocomputing 70.1-3 (2006), pp. 489–501.
[58] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. “Quantized
neural networks: Training neural networks with low precision weights and activations”. In: The
Journal of Machine Learning Research 18.1 (2017), pp. 6869–6898.
[59] Yani Ioannou, Duncan Robertson, Darko Zikic, Peter Kontschieder, Jamie Shotton,
Matthew Brown, and Antonio Criminisi. “Decision forests, convolutional networks and the
models in-between”. In: arXiv preprint arXiv:1603.01250 (2016).
[60] Ozan Irsoy, Olcay Taner Yildiz, and Ethem Alpaydin. “Budding trees”. In: 2014 22nd international
conference on pattern recognition. IEEE. 2014, pp. 3582–3587.
[61] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. “Speeding up convolutional neural
networks with low rank expansions”. In: arXiv preprint arXiv:1405.3866 (2014).
[62] Michael I Jordan and Robert A Jacobs. “Hierarchical mixtures of experts and the EM algorithm”.
In: Neural computation 6.2 (1994), pp. 181–214.
[63] Pranav Kadam, Min Zhang, Shan Liu, and C-C Jay Kuo. “R-PointHop: A Green, Accurate and
Unsupervised Point Cloud Registration Method”. In: arXiv preprint arXiv:2103.08129 (2021).
[64] Pranav Kadam, Min Zhang, Shan Liu, and C-C Jay Kuo. “R-PointHop: A Green, Accurate, and
Unsupervised Point Cloud Registration Method”. In: IEEE Transactions on Image Processing 31
(2022), pp. 2710–2725.
[65] Pranav Kadam, Min Zhang, Shan Liu, and C-C Jay Kuo. “Unsupervised Point Cloud Registration
via Salient Points Analysis (SPA)”. In: 2020 IEEE International Conference on Visual
Communications and Image Processing (VCIP). IEEE. 2020, pp. 5–8.
[66] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and
Tie-Yan Liu. “Lightgbm: A highly efficient gradient boosting decision tree”. In: Advances in neural
information processing systems 30 (2017).
[67] James Kennedy and Russell Eberhart. “Particle swarm optimization”. In: Proceedings of
ICNN’95-international conference on neural networks. Vol. 4. IEEE. 1995, pp. 1942–1948.
[68] Scott Kirkpatrick, C Daniel Gelatt Jr, and Mario P Vecchi. “Optimization by simulated annealing”.
In: Science 220.4598 (1983), pp. 671–680.
[69] Peter Kontschieder, Madalina Fiterau, Antonio Criminisi, and Samuel Rota Bulo. “Deep neural
decision forests”. In: Proceedings of the IEEE international conference on computer vision. 2015,
pp. 1467–1475.
[70] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images.
Tech. rep. Toronto, Ontario: University of Toronto, 2009.
[71] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “Imagenet classification with deep
convolutional neural networks”. In: Advances in neural information processing systems 25 (2012).
[72] Renato A Krohling and Leandro dos Santos Coelho. “Coevolutionary particle swarm optimization
using Gaussian distribution for solving constrained optimization problems”. In: IEEE Transactions
on Systems, Man, and Cybernetics, Part B (Cybernetics) 36.6 (2006), pp. 1407–1416.
[73] C-C Jay Kuo. “Understanding convolutional neural networks with a mathematical model”. In:
Journal of Visual Communication and Image Representation 41 (2016), pp. 406–413.
[74] C-C Jay Kuo and Yueru Chen. “On data-driven saak transform”. In: Journal of Visual
Communication and Image Representation 50 (2018), pp. 237–246.
[75] C-C Jay Kuo and Azad M Madni. “Green learning: Introduction, examples and outlook”. In:
Journal of Visual Communication and Image Representation (2022), p. 103685.
[76] C-C Jay Kuo, Min Zhang, Siyang Li, Jiali Duan, and Yueru Chen. “Interpretable convolutional
neural networks via feedforward design”. In: Journal of Visual Communication and Image
Representation (2019).
[77] Tin-Yan Kwok and Dit-Yan Yeung. “Objective functions for training new hidden units in
constructive neural networks”. In: IEEE Transactions on neural networks 8.5 (1997), pp. 1131–1148.
[78] Dmitry Laptev and Joachim M Buhmann. “Convolutional decision trees for feature learning and
segmentation”. In: Pattern Recognition: 36th German Conference, GCPR 2014, Münster, Germany,
September 2-5, 2014, Proceedings 36. Springer. 2014, pp. 95–106.
[79] Ya Le and Xuan Yang. “Tiny imagenet visual recognition challenge”. In: CS 231N 7.7 (2015), p. 3.
[80] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. “Deep learning”. In: Nature 521.7553 (2015),
pp. 436–444.
[81] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard,
Wayne Hubbard, and Lawrence D Jackel. “Backpropagation applied to handwritten zip code
recognition”. In: Neural computation 1.4 (1989), pp. 541–551.
[82] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. “Gradient-based learning applied
to document recognition”. In: Proceedings of the IEEE 86.11 (1998), pp. 2278–2324.
[83] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. “Gradient-based learning applied
to document recognition”. In: Proceedings of the IEEE 86.11 (1998), pp. 2278–2324.
[84] Aurélia Léon and Ludovic Denoyer. “Policy-gradient methods for Decision Trees.” In: ESANN.
2016.
[85] Moshe Leshno, Vladimir Ya Lin, Allan Pinkus, and Shimon Schocken. “Multilayer feedforward
networks with a nonpolynomial activation function can approximate any function”. In: Neural
networks 6.6 (1993), pp. 861–867.
[86] Xiang Li, Wenhai Wang, Xiaolin Hu, and Jian Yang. “Selective kernel networks”. In: Proceedings of
the IEEE/CVF conference on computer vision and pattern recognition. 2019, pp. 510–519.
[87] Jing J Liang, A Kai Qin, Ponnuthurai N Suganthan, and S Baskar. “Comprehensive learning
particle swarm optimizer for global optimization of multimodal functions”. In: IEEE transactions
on evolutionary computation 10.3 (2006), pp. 281–295.
[88] Ruiyuan Lin, Zhiruo Zhou, Suya You, Raghuveer Rao, and C-C Jay Kuo. “From two-class linear
discriminant analysis to interpretable multilayer perceptron design”. In: arXiv preprint
arXiv:2009.04442 (2020).
[89] Tony Lindeberg. “Scale invariant feature transform”. In: Scholarpedia 7.5 (2012), p. 10491.
[90] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and
Alexander C Berg. “Ssd: Single shot multibox detector”. In: European conference on computer
vision. Springer. 2016, pp. 21–37.
[91] Xiaofeng Liu, Fangxu Xing, Chao Yang, C-C Jay Kuo, Suma Babu, Georges El Fakhri,
Thomas Jenkins, and Jonghye Woo. “VoxelHop: Successive Subspace Learning for ALS Disease
Classification Using Structural MRI”. In: arXiv preprint arXiv:2101.05131 (2021).
[92] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang.
“Learning efficient convolutional networks through network slimming”. In: Proceedings of the
IEEE international conference on computer vision. 2017, pp. 2736–2744.
[93] Mario Marchand. “Learning by minimizing resources in neural networks”. In: Complex Systems 3
(1989), pp. 229–241.
[94] FM Frattale Mascioli and Giuseppe Martinelli. “A constructive algorithm for binary neural
networks: The oil-spot algorithm”. In: IEEE Transactions on Neural Networks 6.3 (1995),
pp. 794–797.
[95] Zhanxuan Mei, Yun-Cheng Wang, Xingze He, and C-C Jay Kuo. “GreenBIQA: A Lightweight
Blind Image Quality Assessment Method”. In: 2022 IEEE 24th International Workshop on
Multimedia Signal Processing (MMSP). IEEE. 2022, pp. 1–6.
[96] Marc Mézard and Jean-P Nadal. “Learning in feedforward layered networks: The tiling
algorithm”. In: Journal of Physics A: Mathematical and General 22.12 (1989), p. 2191.
[97] Rajesh Parekh, Jihoon Yang, and Vasant Honavar. “Constructive neural network learning
algorithms for multi-category real-valued pattern classification”. In: Dept. Comput. Sci., Iowa State
Univ., Tech. Rep. ISU-CS-TR97-06 (1997).
[98] Rajesh Parekh, Jihoon Yang, and Vasant Honavar. “Constructive neural-network learning
algorithms for pattern classification”. In: IEEE Transactions on neural networks 11.2 (2000),
pp. 436–451.
[99] Rajesh Parekh, Jihoon Yang, and Vasant Honavar. “Constructive neural-network learning
algorithms for pattern classification”. In: IEEE Transactions on neural networks 11.2 (2000),
pp. 436–451.
[100] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion,
Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al.
“Scikit-learn: Machine learning in Python”. In: Journal of Machine Learning Research 12 (2011),
pp. 2825–2830.
[101] Matti Pietikäinen, Abdenour Hadid, Guoying Zhao, and Timo Ahonen. Computer vision using
local binary patterns. Vol. 40. Springer Science & Business Media, 2011.
[102] J. Ross Quinlan. “Induction of decision trees”. In: Machine learning 1.1 (1986), pp. 81–106.
[103] Valentin Radu, Kuba Kaszyk, Yuan Wen, Jack Turner, José Cano, Elliot J Crowley, Björn Franke,
Amos Storkey, and Michael O’Boyle. “Performance aware convolutional neural network channel
pruning for embedded GPUs”. In: 2019 IEEE International Symposium on Workload
Characterization (IISWC). IEEE. 2019, pp. 24–34.
[104] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. “You only look once: Unified,
real-time object detection”. In: Proceedings of the IEEE conference on computer vision and pattern
recognition. 2016, pp. 779–788.
[105] Frank Rosenblatt. “The perceptron: a probabilistic model for information storage and
organization in the brain.” In: Psychological review 65.6 (1958), p. 386.
[106] Mozhdeh Rouhsedaghat, Yifan Wang, Xiou Ge, Shuowen Hu, Suya You, and C-C Jay Kuo.
“Facehop: A light-weight low-resolution face gender classification method”. In: arXiv preprint
arXiv:2007.09510 (2020).
[107] Mozhdeh Rouhsedaghat, Yifan Wang, Shuowen Hu, Suya You, and C-C Jay Kuo. “Low-resolution
face recognition in resource-constrained environments”. In: Pattern Recognition Letters 149 (2021),
pp. 193–199.
[108] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma,
Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. “Imagenet large scale
visual recognition challenge”. In: International journal of computer vision 115.3 (2015), pp. 211–252.
[109] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen.
“Mobilenetv2: Inverted residuals and linear bottlenecks”. In: Proceedings of the IEEE conference on
computer vision and pattern recognition. 2018, pp. 4510–4520.
[110] Yuhui Shi and Russell C Eberhart. “Fuzzy adaptive particle swarm optimization”. In: Proceedings
of the 2001 congress on evolutionary computation (IEEE Cat. No. 01TH8546). Vol. 1. IEEE. 2001,
pp. 101–106.
[111] Jamie Shotton, Toby Sharp, Pushmeet Kohli, Sebastian Nowozin, John Winn, and
Antonio Criminisi. “Decision jungles: Compact and rich models for classification”. In: Advances in
neural information processing systems 26 (2013).
[112] Karen Simonyan and Andrew Zisserman. “Very deep convolutional networks for large-scale
image recognition”. In: arXiv preprint arXiv:1409.1556 (2014).
[113] K Sivakumar and Uday B Desai. “Image restoration using a multilayer perceptron with a
multilevel sigmoidal function”. In: IEEE transactions on signal processing 41.5 (1993),
pp. 2018–2022.
[114] Jack W Smith, James E Everhart, WC Dickson, William C Knowler, and Robert Scott Johannes.
“Using the ADAP learning algorithm to forecast the onset of diabetes mellitus”. In: Proceedings of
the annual symposium on computer application in medical care. American Medical Informatics
Association. 1988, p. 261.
[115] M Stinchcombe. “Universal approximation using feed-forward networks with nonsigmoid hidden
layer activation functions”. In: Proc. IJCNN, Washington, DC, 1989 (1989), pp. 161–166.
[116] Alberto Suárez and James F Lutsko. “Globally optimal fuzzy decision trees for classification and
regression”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 21.12 (1999),
pp. 1297–1311.
[117] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov,
Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. “Going deeper with convolutions”.
In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2015, pp. 1–9.
[118] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and
Quoc V Le. “Mnasnet: Platform-aware neural architecture search for mobile”. In: Proceedings of
the IEEE/CVF conference on computer vision and pattern recognition. 2019, pp. 2820–2828.
[119] Mingxing Tan and Quoc Le. “Efficientnet: Rethinking model scaling for convolutional neural
networks”. In: International Conference on Machine Learning. PMLR. 2019, pp. 6105–6114.
[120] Mingxing Tan and Quoc V Le. “Mixconv: Mixed depthwise convolutional kernels”. In: arXiv
preprint arXiv:1907.09595 (2019).
[121] Ryutaro Tanno, Kai Arulkumaran, Daniel Alexander, Antonio Criminisi, and Aditya Nori.
“Adaptive neural trees”. In: International Conference on Machine Learning. PMLR. 2019,
pp. 6166–6175.
[122] Hugo Touvron, Andrea Vedaldi, Matthijs Douze, and Hervé Jégou. “Fixing the train-test
resolution discrepancy”. In: Advances in neural information processing systems 32 (2019).
[123] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. “Attention is all you need”. In: Advances in neural information
processing systems 30 (2017).
[124] Pauli Virtanen, Ralf Gommers, Travis E Oliphant, Matt Haberland, Tyler Reddy,
David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, et al.
“SciPy 1.0: fundamental algorithms for scientific computing in Python”. In: Nature methods 17.3
(2020), pp. 261–272.
[125] Han Xiao, Kashif Rasul, and Roland Vollgraf. “Fashion-mnist: a novel image dataset for
benchmarking machine learning algorithms”. In: arXiv preprint arXiv:1708.07747 (2017).
[126] Han Xiao and Ge Xu. “Neural decision tree towards fully functional neural graph”. In: Unmanned
Systems 8.03 (2020), pp. 203–210.
[127] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. “Aggregated residual
transformations for deep neural networks”. In: Proceedings of the IEEE conference on computer
vision and pattern recognition. 2017, pp. 1492–1500.
[128] Jihoon Yang, Rajesh Parekh, and Vasant Honavar. “DistAl: An inter-pattern distance-based
constructive learning algorithm”. In: Intelligent Data Analysis 3.1 (1999), pp. 55–73.
[129] Yi Yang and Shawn Newsam. “Bag-of-visual-words and spatial extensions for land-use
classification”. In: Proceedings of the 18th SIGSPATIAL international conference on advances in
geographic information systems. 2010, pp. 270–279.
[130] Yijing Yang, Hongyu Fu, and C-C Jay Kuo. “Design of supervision-scalable learning systems:
Methodology and performance benchmarking”. In: arXiv preprint arXiv:2206.09061 (2022).
[131] Yijing Yang, Vasileios Magoulianitis, and C.-C. Jay Kuo. “E-PixelHop: An Enhanced PixelHop
Method for Object Classification”. In: 2021 Asia-Pacific Signal and Information Processing
Association Annual Summit and Conference (APSIPA ASC). 2021, pp. 1475–1482.
[132] Yijing Yang, Wei Wang, Hongyu Fu, and C-C Jay Kuo. “On Supervised Feature Selection from
High Dimensional Feature Spaces”. In: arXiv preprint arXiv:2203.11924 (2022).
[133] Yijing Yang, Wei Wang, Hongyu Fu, and C-C Jay Kuo. “On Supervised Feature Selection from
High Dimensional Feature Spaces”. In: arXiv preprint arXiv:2203.11924 (2022).
[134] Chris Ying, Aaron Klein, Eric Christiansen, Esteban Real, Kevin Murphy, and Frank Hutter.
“Nas-bench-101: Towards reproducible neural architecture search”. In: International conference on
machine learning. PMLR. 2019, pp. 7105–7114.
[135] Sergey Zagoruyko and Nikos Komodakis. “Wide residual networks”. In: arXiv preprint
arXiv:1605.07146 (2016).
[136] Zhi-Hui Zhan, Jun Zhang, Yun Li, and Henry Shu-Hung Chung. “Adaptive Particle Swarm
Optimization”. In: IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39.6
(2009), pp. 1362–1381. doi: 10.1109/TSMCB.2009.2015956.
[137] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. “Shufflenet: An extremely efficient
convolutional neural network for mobile devices”. In: Proceedings of the IEEE conference on
computer vision and pattern recognition. 2018, pp. 6848–6856.
[138] Daquan Zhou, Qibin Hou, Yunpeng Chen, Jiashi Feng, and Shuicheng Yan. “Rethinking bottleneck
structure for efficient mobile network design”. In: Computer Vision–ECCV 2020: 16th European
Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16. Springer. 2020, pp. 680–697.
[139] Mingyi Zhou, Yipeng Liu, Zhen Long, Longxi Chen, and Ce Zhu. “Tensor rank learning in CP
decomposition via convolutional neural network”. In: Signal Processing: Image Communication 73
(2019), pp. 12–21.
[140] Zhiruo Zhou, Hongyu Fu, Suya You, Christoph C Borel-Donohue, and C-C Jay Kuo. “UHP-SOT:
An Unsupervised High-Performance Single Object Tracker”. In: 2021 International Conference on
Visual Communications and Image Processing (VCIP). IEEE. 2021, pp. 1–5.
[141] Zhiruo Zhou, Hongyu Fu, Suya You, and C-C Jay Kuo. “GUSOT: Green and Unsupervised Single
Object Tracking for Long Video Sequences”. In: arXiv preprint arXiv:2207.07629 (2022).
[142] Zhiruo Zhou, Hongyu Fu, Suya You, C-C Jay Kuo, et al. “UHP-SOT++: An Unsupervised
Lightweight Single Object Tracker”. In: APSIPA Transactions on Signal and Information Processing
11.1 (2022).
[143] Yao Zhu, Xinyu Wang, Hong-Shuo Chen, Ronald Salloum, and C-C Jay Kuo. “A-PixelHop: A
Green, Robust and Explainable Fake-Image Detector”. In: ICASSP 2022-2022 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2022, pp. 8947–8951.
[144] Yao Zhu, Xinyu Wang, Ronald Salloum, Hong-Shuo Chen, C-C Jay Kuo, et al. “RGGID: A Robust
and Green GAN-Fake Image Detector”. In: APSIPA Transactions on Signal and Information
Processing 11.2 (2022).
Asset Metadata
Creator: Fu, Hongyu (author)
Core Title: Efficient machine learning techniques for low- and high-dimensional data sources
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Degree Conferral Date: 2023-12
Publication Date: 11/28/2023
Defense Date: 11/14/2023
Publisher: Los Angeles, California (original); University of Southern California (original); University of Southern California. Libraries (digital)
Tags: green learning, image classification, machine learning, OAI-PMH Harvest, subspace learning
Format: theses (aat)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Kuo, Jay C.-C. (committee chair); Nakano, Aiichiro (committee member); Ortega, Antonio (committee member)
Creator Email: fuhongyu19941122@gmail.com, hongyufu@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC113777705
Unique Identifier: UC113777705
Identifier: etd-FuHongyu-12496.pdf (filename)
Legacy Identifier: etd-FuHongyu-12496
Document Type: Dissertation
Rights: Fu, Hongyu
Internet Media Type: application/pdf
Type: texts
Source: 20231129-usctheses-batch-1109 (batch); University of Southern California (contributing entity); University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email: cisadmin@lib.usc.edu