OBJECT CLASSIFICATION BASED ON NEURAL-NETWORK-INSPIRED IMAGE TRANSFORMS

by Yueru Chen

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

December 2019

Copyright 2019 Yueru Chen

Contents

List of Tables
List of Figures
Abstract

Chapter 1 Introduction
1.1 Significance of the Research
1.2 Contributions of the Research
1.2.1 Robust Image Representation with Saak Transform
1.2.2 Ensembles of Saab-Transform-Based Decision System
1.2.3 Semi-Supervised Learning Using the Saab-Transform-Based Decision System
1.2.4 PixelHop System Based on Successive Subspace Learning
1.3 Organization of the Dissertation

Chapter 2 Research Background
2.1 Image Feature Representation for Object Classification
2.1.1 Conventional Handcrafted Features
2.1.2 Convolutional Neural Networks
2.1.3 Interpretation of Convolutional Neural Networks
2.2 Neural-Network-Inspired Image Transforms
2.2.1 Saak Transform
2.2.2 Saab Transform
2.3 Ensemble Method

Chapter 3 Robust Image Representation with Saak Transform
3.1 Introduction
3.2 Saak Transform
3.2.1 Lossless Saak Transform
3.2.2 Lossy Saak Transform
3.3 Image Representation and Reconstruction
3.3.1 Distributions of Saak Coefficients
3.3.2 Image Synthesis via Inverse Saak Transform
3.4 Application to Handwritten Digits Recognition
3.4.1 Efficiency
3.4.2 Scalability
3.4.3 Robustness
3.5 Conclusion

Chapter 4 Ensembles of Saab-Transform-Based Decision System
4.1 Introduction
4.2 Saab-Transform-Based Decision System
4.2.1 Construction of Convolutional Layers
4.2.2 Construction of FC Layers
4.2.3 Case Study: LeNet-5
4.3 Enhancing Diversity of the Ensemble System
4.3.1 Three Types of Diversity
4.3.2 Hard Examples Mining
4.4 Experiments
4.4.1 Performance of Individual FF-CNN
4.4.2 Performance of Ensemble Systems
4.4.3 Hard Examples Mining
4.4.4 Discussion
4.5 Conclusion

Chapter 5 Semi-Supervised Learning via Saab-Transform-Based Decision System
5.1 Introduction
5.2 Semi-Supervised Learning and Feedforward-Designed CNNs
5.3 Semi-Supervised Learning System
5.3.1 Design of the Semi-Supervised Learning System
5.3.2 Proposed Ensemble System
5.4 Experimental Results
5.4.1 Individual Semi-Supervised FF-CNN
5.4.2 Ensembles of Multiple Semi-Supervised FF-CNNs
5.5 Conclusion

Chapter 6 PixelHop: A Successive Subspace Learning (SSL) Method for Object Recognition
6.1 Introduction
6.2 Successive Subspace Learning (SSL)
6.3 PixelHop Method
6.3.1 System Overview
6.3.2 Design of PixelHop Units
6.3.3 Label-Assisted Regression (LAG)
6.4 Experimental Results
6.4.1 Experiment Setup
6.4.2 Ablation Study
6.4.3 Error Analysis, Color Spaces and Scalability
6.4.4 Benchmarking between PixelHop and LeNet-5
6.5 Discussion
6.6 Conclusion

Chapter 7 Conclusion and Future Work
7.1 Summary of the Research
7.2 Future Research Directions

References

List of Tables

3.1 The percentages of the leading 256 last-stage Saak coefficients that pass the normality test for each digit class before and after outlier removal with 100 and 1000 samples per digit class, where S denotes the number of samples.
3.2 The classification accuracy (%) of the Saak transform approach for the MNIST dataset, where the first column indicates the kernel numbers used in stages 1-5 of the feature selection module, and the second to fifth columns indicate the reduced feature dimensions. The cutoff energy thresholds for the 2nd to 5th rows are 1%, 3%, 5% and 7% of the total energy, respectively.
3.3 The cosine similarity of transform kernels obtained with subsets and the whole set of MNIST training data, where the first column indicates the number of images used for training and the second to sixth columns indicate the cosine similarity in each stage.
3.4 The effect of the training set size on the MNIST classification accuracy, where the first row indicates the number of images used in transform kernel training and the second row indicates the classification accuracy (%).
3.5 The cosine similarity of transform kernels trained on fewer classes versus all ten classes, where the first column indicates the number of object classes used in training, and the second to sixth columns indicate the cosine similarity in each stage.
3.6 The classification accuracy (%) on noisy images. All methods are trained on clean images. The first to fourth columns report the results of adding salt-and-pepper noise at increasing noise levels. The fifth to eighth columns display the results of adding speckle noise, adding Gaussian noise, replacing the background with uniform noise, and replacing the background with texture images, respectively.
4.1 Ablation study of the three modules in the FF design: (1) multiple FC layer design (MFC), (2) adaptation of C-PCA, and (3) modification of the pseudo-label encoding (MPE), assessed on the CIFAR-10 testing dataset.
4.2 Comparison between the BP-CNN and FF-CNNs on the MNIST and CIFAR-10 datasets in terms of classification accuracy (%).
4.3 The classification accuracy (%) on the MNIST and CIFAR-10 datasets, where the first to fourth columns indicate the four types of FF-designed CNNs and the fifth column presents the ensemble results.
4.4 The testing classification accuracy (%) on the MNIST and CIFAR-10 datasets. The first to fifth columns indicate using feature subsets which are the entire F_conv2 set, two subsets collected following the third rule, the subset selected following the first rule, and the subset selected following the second rule described in Section 4.2, respectively. We examine different ensemble combinations: ED-1: combination of Conv1, Conv1-1, Conv1-2; ED-2: combination of six different Conv1-RD; ED-3: combination of twelve different Conv2-RD; ED-4: combination of six different Conv1-RD and twelve different Conv2-RD.
4.5 The testing classification accuracy (%) on the MNIST and CIFAR-10 datasets, where L1 to L9 denote the maps filtered by the L3L3, E3E3, S3S3, L3S3, S3L3, L3E3, E3L3, S3E3, and E3S3 Laws filters, respectively. The last column indicates the ensemble results.
4.6 The classification accuracy (%) on the MNIST and CIFAR-10 datasets, where the first, second and fourth columns indicate results of the original ensemble system evaluated on the easy, hard and entire sample sets, respectively. The third and fifth columns present the results of the new ensemble system evaluated on the hard set and the entire set, respectively.
4.7 Diversity measurements on the CIFAR-10 training set. The three types of diversity sources are evaluated separately in the first to third columns, and the last column reports the measurements over all base classifiers in an ensemble.
5.1 Network architectures with respect to different input types and different conv layer parameter settings. The second to fourth columns indicate inputs from MNIST, RGB inputs, and single-channel inputs from SVHN and CIFAR-10, respectively.
5.2 Testing accuracy (%) comparison under three settings: 1) without using unlabeled data; 2) using the entire set of unlabeled data; and 3) using a subset of unlabeled data based on quality scores defined by Eq. (5.1). 1/256 of the labeled data is used on the MNIST and SVHN datasets, and 1/128 of the labeled data is used on the CIFAR-10 dataset.
6.1 Ablation study for the Fashion MNIST dataset.
6.2 Comparison of the classification accuracy (%) using features from an individual PixelHop unit, the PixelHop system with the default setting, and the PixelHop system with the advanced aggregation setting for the Fashion MNIST dataset.
6.3 The confusion matrix for the MNIST dataset, where the first row shows the predicted object labels and the first column shows the true object labels.
6.4 The confusion matrix for the Fashion MNIST dataset, where the first row shows the predicted object labels and the first column shows the true object labels.
6.5 The confusion matrix for the CIFAR-10 dataset, where the first row shows the predicted object labels and the first column shows the true object labels.
6.6 Comparison of classification accuracy (%) using different color representations on the CIFAR-10 dataset.
6.7 Comparison of the original and the modified LeNet-5 architectures.
6.8 Comparison of testing accuracy (%) of the LeNet-5 and the PixelHop method on the MNIST, Fashion MNIST and CIFAR-10 datasets.
6.9 Comparison of training time of the LeNet-5 and the PixelHop method on the MNIST, Fashion MNIST and CIFAR-10 datasets.

List of Figures

1.1 Overview of feature extraction in machine learning.
1.2 Feature extraction in convolutional neural networks.
2.1 Importance of corners and junctions in visual recognition [2] and an image example with key points provided by a corner detector [73].
2.2 Multilayer neural networks and backpropagation. (a) and (b) illustrate the equations used for computing the forward pass and backward pass in a neural net, respectively. [48]
2.3 Building blocks of networks.
2.4 A 2D example to illustrate the need for correlation rectification in the unit circle.
2.5 Three fundamental reasons why an ensemble may work better than a single classifier.
2.6 The relationship between a pair of classifiers.
3.1 The block diagram of forward and inverse Saak transforms, where f_p, g_s and g_p are the input in position format, the output in sign format and the output in position format, respectively.
3.2 Illustration of the kernel augmentation idea.
3.3 Illustration of the forward and inverse multi-stage Saak transforms, showing the set of spatial-spectral representations in the pth stage and the forward and inverse Saak transforms, S_p and S_p^{-1}, between stages (p-1) and p.
3.4 The distributions of Saak coefficients for the (a) first, (b) second and (c) third components for the 10 digit classes.
3.5 Illustration of image synthesis with multi-stage inverse Saak transforms (from top to bottom): six input images are shown in the first row, and reconstructed images based on the leading 100, 500, 1000 and 2000 last-stage Saak coefficients are shown from the second to the fifth rows. Finally, reconstructed images using all last-stage Saak coefficients are shown in the last row. Images of the first and last rows are identical, indicating lossless full reconstruction.
3.6 Illustration of the proposed Saak transform approach for pattern recognition. First, a subset of Saak coefficients is selected from each stage. Next, the feature dimension is further reduced. Finally, the reduced feature vector is sent to an SVM classifier.
3.7 The classification results of using fewer classes in training, where the blue line indicates the Saak transform approach and the green line indicates the LeNet-5 method.
3.8 Examples of noisy test samples, where the first to fourth rows are noisy images with added salt-and-pepper noise at an increasing noise level, while the fifth to eighth rows are noisy images with added speckle noise, Gaussian noise, background replaced by uniform noise, and background replaced by texture images, respectively.
4.1 The LeNet-5 architecture [49], where the conv layers are enclosed by a blue parallelogram.
4.2 Visualization of the correlation matrix with respect to each filter in the last conv layer.
4.3 The LeNet-5 architecture [49], where the FC layers are enclosed by a blue parallelogram.
4.4 Illustration of intra-class variabilities [44].
4.5 Summary of the FF design of the LeNet-5.
4.6 Overview of the proposed ensemble method.
4.7 The relation between testing accuracy (%) and the number of FF-designed CNNs included in the ensemble on the MNIST dataset. Three types of diversity sources are indicated as: S1: different learning parameter settings; S2: subsets of the features; S3: different input representations.
4.8 The relation between testing accuracy (%) and the number of FF-designed CNNs included in the ensemble on the CIFAR-10 dataset. Three types of diversity sources are indicated as: S1: different learning parameter settings; S2: subsets of the features; S3: different input representations.
4.9 The confusion matrices for the testing set of the MNIST dataset.
4.10 The confusion matrices for the testing set of the CIFAR-10 dataset.
4.11 Visualization of training and testing samples of the CIFAR-10 dataset using t-SNE, where different colors indicate different class labels.
5.1 The proposed semi-supervised learning (SSL) system.
5.2 An ensemble of multiple SSL decision systems.
5.3 The comparisons of testing accuracy (%) using BP-CNNs and semi-supervised FF-CNNs on the MNIST, SVHN and CIFAR-10 datasets, respectively.
5.4 The comparisons of testing accuracy (%) using BP-CNNs, semi-supervised FF-CNNs, and ensembles of semi-supervised FF-CNNs on the small labeled portions of the MNIST, SVHN and CIFAR-10 datasets, respectively.
5.5 The relation between test accuracy (%) and the number of semi-supervised FF-CNNs in the ensemble, where three diversity types are indicated as T1, T2 and T3, and T0 indicates the individual semi-supervised FF-CNN. The experiments are conducted on CIFAR-10 with 1/16 of the whole labels.
6.1 An overview of the PixelHop method.
6.2 (a) Illustration of selected positions of a 5x5 neighborhood. (b) Illustration of the PixelHop unit and subspace-based dimensionality reduction.
6.3 Illustration of non-overlapping and overlapping multi-stage Saab transforms.
6.4 Visualization of Saab kernels of spatial sizes 5x5, 6x6 and 8x8.
6.5 The plot of Pearson correlation coefficients with respect to different spatial distances along 0-, 45-, and 90-degree lines for the MNIST, Fashion MNIST, and CIFAR-10 datasets.
6.6 Convergence of the covariance matrix and Saab kernels for the MNIST dataset: the first row indicates the convergence of the covariance matrix, the second to fourth rows indicate the convergence curves of the top 18 kernels from the first stage, and the fifth to seventh rows indicate the convergence curves of the top 18 kernels from the second stage.
6.7 The cosine similarity of the first nine transform kernels obtained with subsets and the whole training set on the MNIST, Fashion MNIST, and CIFAR-10 datasets, respectively.
6.8 The cumulative energy distribution of last-stage Saab coefficients on the MNIST, Fashion MNIST, and CIFAR-10 datasets, where all images are first converted to the Y channel. The second plot focuses on the first ten AC coefficients.
6.9 Illustration of image reconstruction with non-overlapping inverse Saab transforms. The first four rows indicate the reconstruction results using a one-stage Saab transform with kernel numbers 8, 16, 24 and 32. The last four rows indicate the reconstruction results using a two-stage Saab transform with kernel numbers 50, 150, 250 and 350.
6.10 The classification accuracy as a function of the total energy preserved by AC filters, tested on the Fashion MNIST dataset.
6.11 The log energy plot as a function of the number of AC filters, tested on the Fashion MNIST dataset, where the yellow, purple, green and red dots indicate cumulative energy ratios of 95%, 96%, 97%, and 98%, respectively.
6.12 Representative misclassified images in three benchmark datasets, where the first and second rows show erroneous predictions in the MNIST dataset, the third and fourth rows show erroneous predictions of the "shirt" class in the Fashion MNIST dataset, and the fifth and sixth rows show two confusing pairs, "dog vs. cat" and "ship vs. airplane", in the CIFAR-10 dataset.
6.13 Comparison of testing accuracy (%) of the LeNet-5 and the PixelHop method with different training sample numbers for (a) the MNIST and (b) the Fashion MNIST datasets.
7.1 Sample images from CIFAR-100 [40].
7.2 High-resolution sample images from ImageNet [21].
7.3 Examples of hierarchical labeling of ImageNet [21].

Abstract

Convolutional neural networks (CNNs) have recently demonstrated impressive performance in image classification and have changed the way feature extractors are built, from careful handcrafted design to automatic deep learning from a large labeled dataset. However, a great majority of the current CNN literature is application-oriented, and there is no clear understanding or theoretical foundation to explain the outstanding performance and indicate ways to improve it. In this thesis, we focus on solving the image classification problem with neural-network-inspired transforms.

Motivated by the multilayer RECOS (REctified-COrrelations on a Sphere) transform [43], [44], a data-driven signal transform called the "Subspace approximation with augmented kernels" (Saak) transform [45] was proposed, corresponding to the convolutional layers in CNNs. Along this direction, and without concern for the inverse operation, Kuo [46] proposed an interpretable feedforward (FF) design of CNN models that requires no backpropagation. To construct the convolutional layers, a new signal transform called the Saab (Subspace approximation with adjusted bias) transform was developed. The fully-connected layers are constructed using a cascade of multi-stage linear least-squares regressors.

Based on the lossy Saak transform, we first propose an efficient, scalable and robust approach to the handwritten digits recognition problem. We conduct a comparative study on the performance of the LeNet-5 and the Saak-transform-based solutions in terms of scalability and robustness, as well as the efficiency of the lossless and lossy Saak transforms at a comparable accuracy level.

We also develop an ensemble method that fuses the output decision vectors of Saab-transform-based decision systems (i.e., FF-CNN models) to solve the image classification problem. To increase the diversity of FF-CNN models, we introduce three strategies: 1) different parameter settings in convolutional layers, 2) flexible feature subsets fed into the fully-connected (FC) layers, and 3) multiple image embeddings of the same input source. Furthermore, we partition input samples into easy and hard ones based on their decision confidence scores. As a result, we can develop a new ensemble system tailored to hard samples to further boost classification accuracy.

After that, a semi-supervised learning framework using the FF-CNN model is proposed for image classification. Since unlabeled data may not always enhance semi-supervised learning [88], we define an effective quality score and use it to select a subset of unlabeled data in the training process. Furthermore, we develop an ensemble system that combines the output decision vectors of different semi-supervised FF-CNNs to boost classification accuracy. The ensemble systems achieve further performance gains on all three benchmarking datasets.

Finally, we propose a unified framework called successive subspace learning (SSL).
With this new viewpoint, the whole CNN pipeline consists of multiple subspace processing modules in cascade. To illustrate the SSL principle in the context of image-based object recognition, we propose a novel PixelHop method. The PixelHop method provides a rich set of representations for image classification. To further decrease the complexity of the PixelHop system, we propose a new label-assisted dimension reduction method. Extensive experiments are conducted to demonstrate the superior performance of the PixelHop method.

Chapter 1 Introduction

1.1 Significance of the Research

Object classification is one of the core problems in machine learning and, more specifically, in digital image processing. Image classification refers to assigning an input image one label from a number of predefined categories. The image classification pipeline includes image sensing, image pre-processing, feature extraction and object classification.

Figure 1.1: Overview of feature extraction in machine learning

Among all these steps, image feature extraction is usually the most important one. It learns from a set of raw images in digital form to abstract valuable information that is intended to be informative and non-redundant. Feature extraction also plays a very important role in other digital image processing problems, e.g., object detection, image segmentation and image restoration, as illustrated in Figure 1.1. Overall, it is crucial to develop effective image feature representations.

Convolutional neural networks (CNNs) have recently demonstrated impressive image classification performance and have changed the way image feature representations are developed, from careful handcrafted design to automatic deep learning from large labeled datasets, as demonstrated in Figure 1.2. However, a great majority of the current CNN literature is application-oriented, and the theoretical explanation of this outstanding performance remains an open question. A good understanding of CNNs is worthwhile because it provides guidance for improvements and innovations.

Figure 1.2: Feature extraction in convolutional neural networks

It is also well known that CNN-based methods have weaknesses in terms of efficiency, scalability and robustness. First, CNN training is computationally intensive. There are many hyper-parameters to be fine-tuned in the backpropagation process for new datasets and/or different network architectures [30, 86]. Second, trained CNN models are not scalable to changes in the number of object classes or the dataset size [57, 27]. If a network is trained for certain object classes, we need to re-train the network even if the number of object classes increases or decreases by one. Similarly, if the training dataset size increases by a small percentage, we cannot predict its impact on the final performance and need to re-train the network as well. Third, these CNN models are not robust to small perturbations due to their excessive dependence on the end-to-end optimization methodology [23, 58, 61]. It is desirable to develop an alternative solution that overcomes these shortcomings.

In this thesis, we focus on developing effective and efficient image feature representations based on an understanding of CNN methods from the viewpoint of signal transforms. We then adopt the proposed image representations to solve the image classification problem.
1.2 Contributions of the Research

1.2.1 Robust Image Representation with Saak Transform

Lossy Saak transform. Motivated by CNNs, the Saak (Subspace approximation with augmented kernels) transform was recently proposed by Kuo and Chen in [45]. To achieve efficient computation in practical applications, the lossy Saak transform is proposed in this thesis. Many useful properties are preserved in the lossy Saak transform, such as the orthonormality of the transform kernels and the capability to provide a family of spatial-spectral representations.

Handwritten digits recognition. The inverse Saak transform can be used for image synthesis/generation, and the MNIST dataset is used as an example for demonstration. We can synthesize an input image with multi-stage inverse Saak transforms using a subset of its last-stage Saak coefficients. The multi-stage Saak transform and the lossy Saak transform provide a family of spatial-spectral representations that can be utilized as powerful features for handwritten digits recognition. We propose an effective approach to the handwritten digits recognition problem based on the Saak transform and conduct a comparative study on the performance of the LeNet-5 and the Saak-transform-based solutions in terms of scalability and robustness, as well as the efficiency of the lossless and lossy Saak transforms at a comparable accuracy level.

1.2.2 Ensembles of Saab-Transform-Based Decision System

Saab-transform-based decision system. The Saab-transform-based decision system (i.e., the FF-CNN) was recently proposed by Kuo et al. in [46]. It derives the network parameters of the current layer based on data statistics from the output of the previous layer in a one-pass manner, without any BP as a reference. It not only provides valuable insights into CNN architectures but also offers an efficient approach to the image classification task. We make one modification to the FF-CNN design to achieve higher classification performance: we apply channel-wise PCA to the spatial outputs of the conv layers to remove spatial-dimension redundancy. This further reduces the dimension of the feature vectors.

Ensemble methods. Ensembles are often used to integrate multiple weak classifiers into a stronger one [83]. To improve the classification performance of a single FF-CNN, we develop an ensemble system that adopts multiple FF-CNNs as base classifiers. Ensemble methods do not necessarily yield better classification performance than individual classifiers; diversity is critical to the success of an ensemble system [9, 42]. Another main contribution of this thesis is to introduce three kinds of diversity into the proposed ensemble system: 1) different parameter settings in the conv layers, 2) different subsets of features, and 3) different representations (or embeddings) of the input images. Furthermore, we define a confidence score based on the final decision vector of the ensemble classifier and use it to separate easy examples from hard ones. A new ensemble system targeting hard examples can then be constructed to tackle them specifically.

1.2.3 Semi-Supervised Learning Using the Saab-Transform-Based Decision System

When sufficient labeled data are not available, we resort to semi-supervised learning (SSL). SSL is essential in real-world applications, where labeling is expensive and time-consuming. It attempts to boost performance by leveraging unlabeled data.
Its main challenge is how to use unlabeled data effectively to enhance decision boundaries that have been obtained from a small amount of labeled data. It is easier to formulate an SSL framework with the Saab-transform-based decision system because the system is less dependent on data labels than CNN-based methods. In this thesis, we propose a semi-supervised learning framework using the Saab-transform-based decision system for image classification. Because unlabeled data may not always be useful for SSL [88], we define an effective quality score and use it to eliminate unhelpful unlabeled data in the training process. The experimental results show that the proposed semi-supervised FF-CNN solution outperforms the CNN trained by backpropagation (BP-CNN) when the amount of labeled data is reduced on the MNIST, SVHN, and CIFAR-10 datasets. We also adopt the ensemble idea to achieve further performance gains; the designed ensemble system combines the output decision vectors of different semi-supervised FF-CNNs to boost classification accuracy.

1.2.4 PixelHop System Based on Successive Subspace Learning

Successive subspace learning (SSL). Inspired by interpretable deep learning research, we introduce a new machine learning paradigm called successive subspace learning (SSL) in Chapter 6. The subspace technique has been widely used in signal/image processing, pattern recognition, computer vision, etc. [72], [5], [39], [53], [29], [76]. Different from existing subspace methods, which are conducted in a single stage, the SSL framework performs subspace learning in multiple stages. This research is the sequel to previous work in [43], [44], [45], [46], which is integrated here into one unified framework. With this new viewpoint, the whole CNN pipeline can be interpreted as multiple subspace processing modules in cascade, with the model parameters determined layer by layer in a feedforward manner.

PixelHop method. Instead of using the SSL principle to interpret CNNs, we propose a new SSL-based object classification method and call it the PixelHop method. The PixelHop method has several unique ingredients. First, each basic PixelHop unit contains a subspace concatenation step and a subspace approximation step. The former merges several smaller subspaces into a larger one, while the latter reduces the dimension of the exact subspace to that of an approximate one. Second, it has multiple PixelHop units in cascade so as to capture the characteristics of neighborhoods of varying sizes. Third, it contains an efficient way to aggregate attributes from all PixelHop units and to further reduce the dimension using a novel Label-Assisted reGression (LAG).

1.3 Organization of the Dissertation

The rest of this thesis is organized as follows. The research background, including traditional handcrafted feature extraction methods, convolutional neural networks and their interpretation, two neural-network-inspired image transforms, and the ensemble approach, is reviewed in Chapter 2. The proposed robust image representation with the Saak transform is presented in Chapter 3. Next, ensembles of the Saab-transform-based decision system for image classification are described in Chapter 4. Then, the semi-supervised system based on FF-CNNs is introduced in Chapter 5. After that, a new SSL-based object classification method, called the PixelHop method, is presented in Chapter 6. Finally, concluding remarks and future research directions are given in Chapter 7.
Chapter 2 Research Background

2.1 Image Feature Representation for Object Classification

2.1.1 Conventional Handcrafted Features

It usually requires considerable engineering and domain expertise to design an effective image feature representation for machine learning problems such as image classification. The raw data (e.g., the pixel values of an image) are transformed into an informative and non-redundant representation or feature vector. Generally, image representations can be classified into two categories: global representations and local representations.

Global representations. Global representations describe the image as a whole to generalize the entire object. Common global visual features include color features, texture features and topological features. Color features are defined on a particular color space or model, such as the RGB, YCbCr, and Lab color spaces. Once a color space is chosen, numerous color features have been proposed in the literature, e.g., the color histogram [35], color moments (CM) [24], and the color coherence vector (CCV) [62]. As for texture features, many of them are computed in the frequency domain of an image, for example with Gabor filters [56]. The image is filtered with a Gabor filter bank or Gabor wavelets of different preferred spatial frequencies and orientations; each filter map captures energy information at a specific frequency and direction. Turk and Pentland [72] and later Murase and Nayar [59] proposed feature extractors based on principal component analysis (PCA).

Local representations. Local representations describe the image as a collection of independent patches or key points. Local representations are learned in a bottom-up manner: key points are detected first, local patches or feature vectors are extracted around these points, and they are then combined into the final representation of the image. Figure 2.1 shows some examples of local features. Local features can be points, edge pixels or small image patches that differ from their immediate surroundings in texture, color, or intensity. They can be used for various applications, such as image registration, object detection and classification, and tracking. Compared with global representations, local representations are less sensitive to noise and partial occlusion, and do not strictly require background subtraction or tracking.

Figure 2.1: Importance of corners and junctions in visual recognition [2] and an image example with key points provided by a corner detector [73].

2.1.2 Convolutional Neural Networks

In the past several years, deep convolutional neural networks have achieved great success in the machine learning field. Generally speaking, a CNN architecture consists of a few convolutional layers followed by several fully connected (FC) layers. A convolutional layer is composed of multiple convolutional operations on vectors defined on a cuboid, a nonlinear activation function such as the rectified linear unit (ReLU), and response pooling; it is used for feature extraction. The FC layers serve the function of a multilayer perceptron (MLP) classifier. All CNN parameters (also called filter weights) are learned by the stochastic gradient descent (SGD) algorithm through backpropagation. A convolutional neural network is a deep feedforward network that learns to map a fixed-size input (for example, an image) to a fixed-size output (for example, a probability for each of several categories).
Figure 2.2 illustrates the equations for computing the forward pass and the backward pass in a simple multilayer neural network.

Feedforward pass. At each layer, we first compute the total input $z$ to each unit, which is a weighted sum of the outputs of the units in the layer below. A nonlinear function $f(\cdot)$, such as the rectified linear unit (ReLU) or the more conventional sigmoid, is then applied to $z$ to obtain the output of the unit.

Backward pass. At each hidden layer, we compute the error derivative with respect to the output of each unit, which is a weighted sum of the error derivatives with respect to the total inputs of the units in the layer above. We then convert the error derivative with respect to the output into the error derivative with respect to the input by multiplying it by the gradient of $f(z)$. At the output layer, the error derivative with respect to the output of unit $l$ is obtained by differentiating the cost function; assuming the cost function for unit $l$ is $0.5(y_l - t_l)^2$, this derivative is $y_l - t_l$, where $t_l$ is the target value. Once $\partial E / \partial z_k$ is known, the error derivative for the weight $w_{jk}$ entering unit $k$ is computed as $y_j\, \partial E / \partial z_k$.

Figure 2.2: Multilayer neural networks and backpropagation. (a) and (b) illustrate the equations used for computing the forward pass and backward pass in a neural net, respectively. [48]
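As a minimal numerical illustration of the forward and backward passes just described (a sketch under simplifying assumptions, not the implementation used in this work; the array names x, W1, W2 and t are hypothetical), consider one hidden layer with ReLU activation and the squared-error cost:

```python
import numpy as np

# One forward and backward pass for a single hidden layer with ReLU and
# the cost E = 0.5 * sum((y - t)^2).
rng = np.random.default_rng(0)
x = rng.standard_normal(8)                 # input vector
W1 = rng.standard_normal((16, 8)) * 0.1    # hidden-layer weights
W2 = rng.standard_normal((4, 16)) * 0.1    # output-layer weights
t = rng.standard_normal(4)                 # target vector

# Feedforward pass: total input z, then nonlinearity f(z).
z1 = W1 @ x
y1 = np.maximum(z1, 0.0)                   # ReLU
z2 = W2 @ y1
y2 = z2                                    # linear output units
E = 0.5 * np.sum((y2 - t) ** 2)

# Backward pass: dE/dy at the output is (y - t); convert dE/dy to dE/dz by
# multiplying with f'(z); the weight gradient is the outer product y_j * dE/dz_k.
dE_dz2 = y2 - t                            # output units are linear, so f'(z) = 1
dW2 = np.outer(dE_dz2, y1)
dE_dy1 = W2.T @ dE_dz2                     # weighted sum of upper-layer dE/dz
dE_dz1 = dE_dy1 * (z1 > 0)                 # ReLU gradient
dW1 = np.outer(dE_dz1, x)
```

An SGD step would then update the weights as W1 -= lr * dW1 and W2 -= lr * dW2 for some learning rate lr.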
Applications. CNNs have been applied to various computer vision problems. In image classification, popular architectures include AlexNet [41], VGG [68], ResNet [31] and the Inception network [71]. Networks have become deeper and deeper, which tends to yield better performance. AlexNet and VGG have a linear architecture, while ResNet and the Inception network do not: the latter two contain residual blocks and inception modules, respectively, to enable deeper network architectures, as demonstrated in Figure 2.3. In image segmentation, researchers have developed networks that keep a reasonable output resolution. For example, the FCN [52] uses the deconvolution operation to upsample step by step. In another well-known architecture, DeepLab [12], some pooling layers are removed to avoid loss of resolution.

Figure 2.3: Building blocks of networks.

2.1.3 Interpretation of Convolutional Neural Networks

A great majority of the current CNN literature is application-oriented, yet efforts have been made to build a theoretical foundation for CNNs. Cybenko [18] and Hornik et al. [32] proved in the late 1980s that the multi-layer perceptron (MLP) is a universal approximator. Later studies on the analysis of CNNs include visualization of filter responses at various layers [67, 82, 87], scattering networks [55, 10, 78], tensor analysis [15], generative modeling [19], and over-parameterized shallow neural network optimization [69]. More recently, CNN interpretability has been examined by researchers from various angles, including interpretable knowledge representations [85], identification of critical nodes and data routing paths [77], the role of nonlinear activation [43], and convolutional filters as signal transforms [44, 45].

RECOS transform. To offer an explanation of CNNs, Kuo [43, 44] modeled the compound operation of convolution followed by nonlinear activation in a CNN with the RECOS (REctified-COrrelations on a Sphere) transform, and interpreted the whole CNN as a multi-layer RECOS transform. In the context of image processing, the forward RECOS transform defines a mapping from a real-valued function defined on a three-dimensional (3D) cuboid to a one-dimensional (1D) rectified spectral vector, where the transform kernels (or filter weights) are optimized by backpropagation. Because the filters in the RECOS transform are not orthogonal to each other, the inverse RECOS transform demands the solution of a linear system of equations.

Reference [43] also explains the necessity of the rectification. As illustrated in Figure 2.4, let $\mathbf{x}$ and $\mathbf{a}_k$ ($k = 1, 2, 3$) denote an input and three anchor vectors on the unit circle, respectively. The geodesic distance between $\mathbf{x}$ and $\mathbf{a}_k$ can be computed as

$$\theta(\mathbf{x}, \mathbf{a}_k) = \cos^{-1}(\mathbf{x}^T \mathbf{a}_k). \qquad (2.1)$$

The correlations between $\mathbf{x}$ and $\mathbf{a}_1$ and between $\mathbf{x}$ and $\mathbf{a}_2$ are positive, while the correlation between $\mathbf{x}$ and $\mathbf{a}_3$ is negative. For positive correlations, the larger the correlation, the shorter the distance. For a negative correlation, the two vectors are far apart in terms of the geodesic distance, yet their correlation is strong (although negative); hence a negative correlation does not serve as a good indicator of the geodesic distance, and it is reasonable to set it to zero.

Figure 2.4: A 2D example to illustrate the need for correlation rectification in the unit circle.
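The argument around Eq. (2.1) can be checked with a few lines of code (an illustrative sketch with hypothetical anchor vectors, not part of the original analysis): for unit-norm vectors the geodesic distance is the angle arccos(x . a_k), and ReLU discards exactly the negative correlations that are poor distance indicators.

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

x = unit(np.array([1.0, 0.5]))
anchors = [unit(np.array([1.0, 0.0])),    # a_1: strongly positively correlated with x
           unit(np.array([0.0, 1.0])),    # a_2: weakly positively correlated
           unit(np.array([-1.0, -0.3]))]  # a_3: negatively correlated

for k, a in enumerate(anchors, start=1):
    corr = float(x @ a)                                # x^T a_k
    theta = np.arccos(np.clip(corr, -1.0, 1.0))        # geodesic distance, Eq. (2.1)
    rectified = max(corr, 0.0)                         # ReLU keeps positive correlations only
    print(f"a_{k}: correlation={corr:+.3f}, distance={theta:.3f} rad, rectified={rectified:.3f}")
```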
Interpretable feedforward design. Given a CNN architecture, the determination of the network parameters can be formulated as a non-convex optimization problem and solved by backpropagation (BP). Yet, since non-convex optimization of deep networks is mathematically intractable [69], a new methodology called interpretable feedforward (FF) design was proposed in [46] to tackle the interpretability problem. It derives the network parameters of a target layer based on statistics of the output data from the previous layer in a one-pass manner; no BP is adopted at all. This complementary methodology not only offers valuable insights into the CNN operational mechanism but also enriches CNN research from the angles of linear algebra, statistics and signal processing. Besides, under the same network architecture, its training complexity is significantly lower than that of the BP-designed CNN.

An FF-CNN consists of two modules in cascade: 1) the module of convolutional (conv) layers and 2) the module of fully-connected (FC) layers. They are designed using completely different strategies. To construct the conv layers, Kuo et al. [46] proposed a new signal transform called the Saab (Subspace approximation with adjusted bias) transform. Each conv layer contains one Saab transform followed by a maximum spatial pooling operation. The design of the FC layers is cast as a multi-stage linear least-squares regression (LSR) problem in [46].

2.2 Neural-Network-Inspired Image Transforms

2.2.1 Saak Transform

Motivated by CNNs, the Saak (Subspace approximation with augmented kernels) transform was recently proposed by Kuo and Chen in [45]. The Saak transform consists of the following three steps: 1) build the optimal linear subspace approximation with an orthonormal basis from input samples, 2) augment each transform kernel with its negative and adopt both as basis functions, and 3) apply the rectified linear unit (ReLU) to the transformed output to yield the final Saak coefficients. Through this construction, it is straightforward to conduct the inverse Saak transform as well. To transform images of larger sizes, one cascades multiple Saak transforms, which leads to the multi-stage Saak (m-Saak) transform. The m-Saak transform offers a wide range of joint spatial-spectral representations for images between two extremes, namely, the full spatial-domain representation and the full spectral-domain representation, which can be utilized for image classification and synthesis.

The Saak transform has several interesting and desirable features. First, it has orthonormal transform kernels that facilitate the computation in both forward and inverse transforms. Second, the Saak transform can eliminate the rectification loss to achieve lossless conversion between different spatial-spectral representations. Third, the distance between two vectors before and after the Saak transform is preserved to a certain degree. Fourth, the kernels of the Saak transform are derived from the second-order statistics of the input random vectors; neither data labels nor backpropagation is demanded in kernel determination.

2.2.2 Saab Transform

Kuo et al. [46] proposed a new signal transform called the Saab (Subspace approximation with adjusted bias) transform, in which a bias vector is added to annihilate the nonlinearity of the activation function. The Saab transform is a variant of PCA, and it contributes to dimension reduction. Similar to the Saak transform, the Saab transform is totally unsupervised, since no labels are needed in the construction process.

The transform kernels of the Saab transform (also called anchor vectors) can be computed from the statistics of the input vector set. We separate the anchor vectors into two categories: the DC anchor vector and the AC anchor vectors. The DC anchor vector can be simply computed as $\mathbf{a}_0 = \frac{1}{\sqrt{K}}(1, \ldots, 1)^T$, which preserves the mean information of the input data; here, $K$ is the dimension of the anchor vector. As for the AC anchor vectors, we conduct principal component analysis (PCA) on the set of mean-removed inputs and choose the first $M$ principal components as the AC anchor vectors $\mathbf{a}_m$, $m = 1, \ldots, M$.

Different from the Saak transform, the Saab transform adopts an affine transformation instead of a linear transformation. By choosing the bias terms carefully, the rectification operation can be discarded and the sign confusion problem can be handled. Two constraints are designed for the bias term selection: first, the $m$th bias $b_m$ should be chosen to guarantee that the $m$th response is non-negative; second, all bias terms are equal for simplicity of computation. In this way, we introduce a nonlinearity into the transform that performs a function similar to the rectified linear unit (ReLU) activation in a CNN.
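The Saab kernel construction just described can be summarized in a few lines. The sketch below is an illustration under my own simplifying assumptions rather than the authors' implementation: the DC kernel is the constant unit vector, the AC kernels are the leading principal components of the mean-removed inputs, and a single shared bias is estimated from the training data so that all responses stay non-negative. The input matrix X of flattened patches is hypothetical.

```python
import numpy as np

def saab_fit(X, num_ac):
    """Derive Saab anchor vectors (DC + AC) and one shared non-negativity bias."""
    n, k_dim = X.shape
    a0 = np.ones(k_dim) / np.sqrt(k_dim)          # DC anchor vector
    dc = X @ a0
    residual = X - np.outer(dc, a0)               # remove per-sample mean (DC part)
    residual -= residual.mean(axis=0)             # center before PCA
    _, _, vt = np.linalg.svd(residual, full_matrices=False)
    ac = vt[:num_ac]                              # leading AC anchor vectors
    kernels = np.vstack([a0, ac])
    raw = X @ kernels.T
    bias = max(0.0, -raw.min())                   # one bias shared by all kernels
    return kernels, bias

def saab_transform(X, kernels, bias):
    return X @ kernels.T + bias                   # affine transform; responses >= 0 on training data

rng = np.random.default_rng(1)
patches = rng.random((1000, 16))                  # e.g., flattened 4x4 patches (hypothetical data)
kernels, bias = saab_fit(patches, num_ac=5)
features = saab_transform(patches, kernels, bias)
assert features.min() >= 0.0
```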
2.3 Ensemble Method

Ensemble methods are learning algorithms that construct a set of first-stage base classifiers and then make the final decision by combining all base classifier predictions. There is a well-established literature on ensemble learning [83]. Dietterich [22] provided reasons for the better performance of ensembles from the statistical, computational, and representational angles, as illustrated in Figure 2.5. A learning algorithm can be viewed as searching a space H of hypotheses to identify the best hypothesis in the space; by constructing an ensemble out of the base classifiers and averaging their outputs, the algorithm can reduce the risk of choosing the wrong hypothesis.

Tumer and Ghosh (1996a, 1996b, 1999) derived an expression for the added classification error:

$$E_{add}^{ave} = E_{add}\,\frac{1 + \delta(L-1)}{L}, \qquad (2.2)$$

where $E_{add}$ is the added error of the $L$ individual classifiers (all assumed to have the same error) and $\delta$ is a correlation coefficient. They claimed that positively correlated classifiers only slightly reduce the added error, uncorrelated classifiers reduce the added error by a factor of $1/L$, and negatively correlated classifiers reduce the error even further.

Figure 2.5: Three fundamental reasons why an ensemble may work better than a single classifier.

Ensemble Approaches

For decades, researchers have explored different ensemble approaches. In most cases, ensemble systems can be categorized into two general settings: classifier selection and classifier fusion [83]. In classifier selection, each classifier is trained in some local neighborhood of the entire feature space. Given a new instance, we select the classifier trained with data closest to the vicinity of this instance and make the final decision based on this selected classifier, or assign it the highest weight in contributing to the final decision [26][80]. In classifier fusion, all classifiers are trained over the entire feature space and then combined to obtain a composite classifier; examples include bagging [7], random forests (an ensemble of decision trees) [8], boosting/AdaBoost [66, 25], and stacked generalization [79]. Combining the individual classifiers can be based on labels or on continuous-valued decision vectors. In the latter case, simple algebraic combination rules can be used for fusion, e.g., simple or weighted majority voting, maximum/minimum/sum/product, or other combinations of class-specific outputs [83].

Diversity Issues

Fusion of different base classifier outputs does not necessarily lead to better classification performance than the individual base classifiers. Diversity is very important for the success of an ensemble system [9, 42], even though an explicit relationship between diversity and ensemble performance has not been established. There are two types of diversity measures: pairwise and non-pairwise diversity measures.

Figure 2.6: The relationship between a pair of classifiers.

The Q statistic. The Q statistic is one of the pairwise measurements. Let $Z = \{z_1, \ldots, z_N\}$ be a labeled data set. We can represent the output of a classifier $D_i$ as an $N$-dimensional binary vector whose $j$th element is

$$y_{j,i} = \begin{cases} 1, & \text{if } D_i \text{ predicts } z_j \text{ correctly}, \\ 0, & \text{otherwise}, \end{cases} \qquad i = 1, 2, \ldots, L. \qquad (2.3)$$

Yule's Q statistic (1900) for two classifiers, $D_i$ and $D_k$, is

$$Q_{i,k} = \frac{N^{11}N^{00} - N^{01}N^{10}}{N^{11}N^{00} + N^{01}N^{10}}, \qquad (2.4)$$

where $N^{ab}$ is the number of elements $z_j \in Z$ satisfying the row and column conditions in Figure 2.6. The averaged Q statistic over all pairs of classifiers is

$$Q_{ave} = \frac{2}{L(L-1)} \sum_{i=1}^{L-1} \sum_{k=i+1}^{L} Q_{i,k}. \qquad (2.5)$$

For statistically independent classifiers, the expectation of $Q_{i,k}$ is 0. Q varies between $-1$ and $1$. Classifiers that tend to recognize the same objects correctly have positive values of Q, and those which commit errors on different objects render Q negative. A lower Q value indicates higher possible diversity.

The entropy measure E. The entropy measure is one of the non-pairwise measurements. The highest diversity among $L$ base classifiers for a particular $z_j \in Z$ is manifested by $\lfloor L/2 \rfloor$ of the votes on $z_j$ taking the same value (0 or 1) and the other $L - \lfloor L/2 \rfloor$ taking the alternative value. The entropy measure E is defined as

$$E = \frac{1}{N} \sum_{j=1}^{N} \frac{2}{L}\, \min\{\,l(z_j),\; L - l(z_j)\,\}, \qquad (2.6)$$

where $l(z_j)$ denotes the number of classifiers that correctly predict $z_j$. E varies between 0 and 1, where 0 indicates no difference and 1 indicates the highest possible diversity.
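Both diversity measures translate directly into code. The following sketch (an illustration, not the implementation used in this work) computes Yule's Q statistic, its pairwise average in Eq. (2.5), and the entropy measure in Eq. (2.6) from a hypothetical binary correctness matrix whose entry [j, i] is 1 when classifier D_i predicts sample z_j correctly.

```python
import numpy as np

def q_statistic(ci, ck):
    """Yule's Q for two binary correctness vectors, Eq. (2.4)."""
    n11 = np.sum((ci == 1) & (ck == 1))
    n00 = np.sum((ci == 0) & (ck == 0))
    n01 = np.sum((ci == 0) & (ck == 1))
    n10 = np.sum((ci == 1) & (ck == 0))
    return (n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10)

def averaged_q(correct):
    """Average of Q over all classifier pairs, Eq. (2.5)."""
    L = correct.shape[1]
    pairs = [(i, k) for i in range(L - 1) for k in range(i + 1, L)]
    return 2.0 / (L * (L - 1)) * sum(q_statistic(correct[:, i], correct[:, k])
                                     for i, k in pairs)

def entropy_measure(correct):
    """Non-pairwise entropy diversity measure, Eq. (2.6)."""
    N, L = correct.shape
    l_z = correct.sum(axis=1)                     # number of correct votes per sample
    return float(np.mean((2.0 / L) * np.minimum(l_z, L - l_z)))

rng = np.random.default_rng(2)
correct = (rng.random((500, 5)) < 0.8).astype(int)   # 5 base classifiers, ~80% accuracy each
print(averaged_q(correct), entropy_measure(correct))
```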
Chapter 3 Robust Image Representation with Saak Transform

3.1 Introduction

To offer an explanation of convolutional neural networks (CNNs), Kuo [43, 44] modeled the convolutional operation at each CNN layer with the RECOS (REctified-COrrelations on a Sphere) transform, and interpreted the whole CNN as a multi-layer RECOS transform. The RECOS transform is derived from labeled training data, and its transform kernels (or filter weights) are optimized by backpropagation. The RECOS transform has two loss terms: the approximation loss and the rectification loss. The approximation loss is caused by the use of a limited number of transform kernels; this error can be reduced by increasing the number of filters at the cost of higher computational complexity and larger storage memory. The rectification loss is due to the nonlinear activation. Furthermore, since the filters in the RECOS transform are not orthogonal to each other, the inverse RECOS transform demands the solution of a linear system of equations.

It is meaningful to develop a new data-driven transform that has neither the approximation loss nor the rectification loss of the RECOS transform. Besides, it should have a set of orthonormal transform kernels so that its inverse transform can be performed in a straightforward manner. To achieve these objectives, we propose the Saak (Subspace approximation with augmented kernels) transform. As indicated by its name, the Saak transform has two main ingredients: 1) subspace approximation and 2) kernel augmentation. To seek the optimal subspace approximation to a set of random vectors, we analyze their second-order statistics and select orthonormal eigenvectors of the covariance matrix as transform kernels. This is the well-known Karhunen-Loève transform (KLT) [70]. When the dimension of the input space is very large, it is difficult to conduct a one-stage KLT. Then, we may decompose a high-dimensional vector into multiple lower-dimensional sub-vectors, and this process can be repeated recursively to form a hierarchical representation. For example, given a set of images of size NxN, the total number of variables in these images is $N^2$ and their covariance matrix is of dimension $N^2 \times N^2$. It is not practical to conduct the KLT on the full image for a large N. Instead, we may decompose images into smaller blocks and conduct the KLT on each block.
The integration of kernel aug- mentation and ReLU is equivalent to the sign-to-position (S/P) format conversion of the Saak transform outputs, which are called the Saak coefficients. By converting the KLT to the Saak transform stage by stage, we can cascade multiple Saak transforms to transform images of a large size. The Saak coefficients in intermediate stages indicate the spectral component values in the corresponding covered spatial regions. The Saak coefficients in the last stage 22 represent spectral component values of the entire input vector (i.e., the whole image). We can use the Saak coefficient in the last stage to achieve image reconstruction through inverse Saak transform. The multi-stage Saak transforms offer a family of joint spatial-spectral representa- tions between two extremes - the full spatial-domain representation and the full spectral- domain representation. We can adopt this representations to solve Handwritten digits recognition and conduct a comparative study on the performance of the LeNet-5 and the Saak-transform-based solutions in terms of their accuracy, efficiency, scalability, and robustness. Handwritten digits recognition is one of the important tasks in pattern recognition. It has numerous applications such as mail sorting, bank check processing, etc. Methods based on the convolutional neural network (CNN) offer the state-of-the- art performance in handwritten digits recognition nowadays. Besides handwritten digits recognition, we have seen a resurgence of the CNN methodology [41, 50, 48, 74] and its applications to many computer vision problems in recent years. It is well known that CNNs-based methods have weaknesses in terms of efficiency, scalability and robustness. First, the CNN training is computationally intensive. There are many hyper-parameters to be finetuned in the backpropagation process for new datasets and/or different network architectures [30, 86]. Second, trained CNN models are not scalable to the change of object class numbers and the dataset size [57, 27]. If a network is trained for a certain object classes, we need to re-train the network again even if the number of object classes increases or decreases by one. Similarly, if the training dataset size increases by a small percentage, we cannot predict its impact to the final performance and need to re-train the network as well. Third, these CNN models are not robust to small perturbations due to their excess dependence on the end-to-end optimization methodology [23, 58, 61]. It is desired to develop an alternative solution to overcome these shortcomings. 23 Being different with CNNs, all transform kernels in multi-stage Saak transforms are computed by one-pass feedforward process. Neither data lables nor backpropagation is needed for kernel computation. To achieve higher efficiency, we adopt the lossy Saak transform using the principal component analysis (PCA) for subspace approximation so as to control the number of transform kernels (or the principal component num- ber). As to scalability, the feature extraction process in the Saak transform approach is an unsupervised one, it is not sensitive to the class number. Furthermore, the lossy Saak transform can alleviate the influence of small perturbations by focusing on princi- pal components only. We will conduct an extensive set of experiments on the MNIST dataset [49] to demonstrate the above-mentioned properties. 
3.2 Saak Transform

The Saak transform is a mapping from a real-valued function defined on a 3D cuboid to a 1D rectified spectral vector. Both forward and inverse Saak transforms can be well defined. The 3D cuboid consists of two spatial dimensions and one spectral dimension. Typically, the spatial dimensions are set to 2 × 2. As to the spectral dimension, it can grow very fast when we consider the lossless Saak transform. In practice, we should leverage the energy compaction property of the KLT and adopt the lossy Saak transform by replacing the KLT with the truncated KLT (or PCA).

The Saak transform has several interesting and desirable properties. First, it has orthonormal transform kernels that facilitate the computation in both forward and inverse transforms. Second, the Saak transform eliminates the rectification loss, achieving lossless conversion between different spatial-spectral representations. Third, the distance between two vectors before and after the Saak transform is preserved to a certain degree. Fourth, the kernels of the Saak transform are derived from the second-order statistics of input random vectors; neither data labels nor backpropagation is demanded in kernel determination.

Figure 3.1: The block diagram of forward and inverse Saak transforms, where f_p, g_s and g_p are the input in position format, the output in sign format and the output in position format, respectively.

3.2.1 Lossless Saak Transform

The block diagram of the forward and inverse Saak transforms is given in Figure 3.1. The forward Saak transform consists of three steps: 1) building the optimal linear subspace approximation with orthonormal bases using the Karhunen-Loève transform (KLT) [70], 2) augmenting each transform kernel with its negative, and 3) applying the rectified linear unit (ReLU) to the transform output. The second and third steps are equivalent to the sign-to-position (S/P) format conversion. The inverse Saak transform is conducted by performing the position-to-sign (P/S) format conversion before the inverse KLT. Generally speaking, the forward Saak transform converts spatial variation to spectral variation, and the inverse Saak transform converts spectral variation back to spatial variation.

Forward Saak Transform

It is desired to have orthonormal transform kernels to facilitate forward and inverse transforms. We apply the Karhunen-Loève transform (KLT) to a set of mean-removed input vectors. Consider an anchor vector (transform kernel) set [43, 44]:

A = \{a_0, a_1, \cdots, a_k, \cdots, a_K\}, \quad \|a_k\| = 1, \quad a_k \in R^N.   (3.1)

We divide anchor vectors into two types. The vector

a_0 = \frac{1}{\sqrt{N}} (1, 1, \cdots, 1)^T   (3.2)

is the DC anchor vector, while the remaining ones, a_1, \cdots, a_K, are the AC anchor vectors. The projection of input f \in R^N onto each anchor vector is

p_k = a_k^T f, \quad k = 0, 1, \cdots, K.   (3.3)

Since the projection onto the DC anchor vector gives the mean of f, the residual

\tilde{f} = f - p_0 a_0   (3.4)

is a zero-mean random vector. Then, we compute the correlation matrix of \tilde{f}, R = E[\tilde{f} \tilde{f}^T] \in R^{N \times N}. Matrix R has (N-1) positive eigenvalues and one zero eigenvalue. The eigenvector associated with the zero eigenvalue is the constant vector. The remaining (N-1) unit-length eigenvectors define the Karhunen-Loève (KL) basis functions, which are the KLT's kernel vectors. It is well known that the KLT basis functions are orthonormal. To address the information loss caused by the rectification operation, we propose to augment each kernel vector with its negative vector.
That is, if a_k is a kernel, we choose -a_k to be another kernel. An example is shown in Figure 3.2.

Figure 3.2: Illustration of the kernel augmentation idea.

In this figure, we show the projection of input f onto two AC anchor vectors a_1 and a_2 with p_1 > 0 and p_2 < 0 as their respective projection values. The ReLU preserves p_1 but clips p_2 to 0. By augmenting them with two more anchor vectors, -a_1 and -a_2, we obtain projection values -p_1 < 0 and -p_2 > 0. The ReLU clips -p_1 to 0 and preserves -p_2. The kernel augmentation idea is powerful in two aspects: 1) eliminating the rectification loss, and 2) annihilating the nonlinear effect of the ReLU to greatly simplify the analysis. The projection onto the augmented kernel set

\tilde{A} = \{a_0, a_1, -a_1, \cdots, a_k, -a_k, \cdots, a_K, -a_K\}

yields

p_k = \tilde{a}_k^T \tilde{f}, \quad k = 1, \cdots, 2K,   (3.5)

where \tilde{a}_{2k-1} = a_k and \tilde{a}_{2k} = -a_k for k = 1, \cdots, K, together with the DC projection p_0 from Eq. (3.3). The projection vector is

p = (p_0, p_1, \cdots, p_{2K})^T \in R^{2K+1}.   (3.6)

Then, we introduce non-linearity by applying the ReLU to every element of the projection vector except the first one to yield the final output

g = (g_0, g_1, \cdots, g_{2K})^T \in R^{2K+1},   (3.7)

where g_0 = p_0 and, for k = 1, \cdots, K,

g_{2k-1} = p_{2k-1} \;\text{and}\; g_{2k} = 0, \quad \text{if } p_{2k-1} > 0,   (3.8)

g_{2k-1} = 0 \;\text{and}\; g_{2k} = p_{2k}, \quad \text{if } p_{2k} > 0.   (3.9)

Inverse Saak Transform

The inverse Saak transform can be written as

f_p = \sum_{k=0}^{N-1} g_{s,k} a_k,   (3.10)

where a_k is a basis function. Given the position-format vector g_p, we perform the position-to-sign (P/S) format conversion to obtain the sign-format vector g_s and then feed it to the inverse KLT. This is illustrated in the lower branch of Figure 3.1.

It is straightforward to verify the following. The input and the output of the forward/inverse KLT have the same l_2 distance. The input and the output of the S/P format conversion and the P/S format conversion have the same l_1 distance. Besides, the l_1 distance is always no less than the l_2 distance. Let g_{p,1} and g_{p,2} be the two output vectors of input vectors f_{p,1} and f_{p,2}, respectively. It is easy to verify that

\|g_{p,1} - g_{p,2}\|_2 \le \|g_{s,1} - g_{s,2}\|_2 = \|f_{p,1} - f_{p,2}\|_2.   (3.11)

As derived above, the l_2-distance of any two outputs of the forward Saak transform is bounded above by the l_2-distance of their corresponding inputs. This bounded-output property is important when we use the outputs of the Saak transform (i.e., the Saak coefficients) as features. Furthermore, we have

\|f_{p,1} - f_{p,2}\|_2 = \|g_{s,1} - g_{s,2}\|_2 \le \|g_{s,1} - g_{s,2}\|_1 = \|g_{p,1} - g_{p,2}\|_1,   (3.12)

which is important for the inverse Saak transform since the roles of inputs and outputs are reversed. In this context, the l_2-distance of any two outputs of the inverse Saak transform is bounded above by the l_1-distance of their corresponding inputs.

Multi-stage Saak Transforms

As shown in Figure 3.3, multi-stage Saak transforms are developed to transform images of a larger size. To begin with, we decompose an input image into four quadrants recursively to form a quad-tree structure, with its root being the full image and each leaf being a small patch of size 2 × 2. Then, we conduct the Saak transform by merging four child nodes into one parent node, stage by stage, from the leaves to the root. The whole process terminates when we reach the last stage (the root of the tree), which has a spatial dimension of 1 × 1. The signed KLT coefficients in each stage are called the Saak coefficients, and they can serve as discriminant features of the input image. Multi-stage Saak transforms provide a family of spatial-spectral representations. They are powerful representations to be used in many applications such as handwritten digit recognition, object classification, and image processing.
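To make a single stage concrete, the following is a minimal NumPy sketch of one forward Saak stage applied to a set of flattened input cuboids, following the steps above (DC removal, KLT kernels from second-order statistics, kernel augmentation, ReLU). The function names are ours, and passing a reduced kernel count corresponds to the lossy transform discussed in Sec. 3.2.2.

```python
# Minimal sketch of one forward Saak stage (illustrative; function names are ours).
# X has shape (num_samples, N), each row a flattened 2x2xK input cuboid.
import numpy as np

def saak_stage_fit(X, num_ac_kernels=None):
    """Derive DC + AC kernels from second-order statistics (no labels, no BP)."""
    n_samples, N = X.shape
    dc = np.ones(N) / np.sqrt(N)                   # DC anchor vector, Eq. (3.2)
    p0 = X @ dc                                    # DC projections
    X_ac = X - np.outer(p0, dc)                    # mean-removed residual, Eq. (3.4)
    if num_ac_kernels is None:                     # lossless: keep all N-1 components
        num_ac_kernels = N - 1
    R = X_ac.T @ X_ac / n_samples                  # correlation matrix of the residual
    w, v = np.linalg.eigh(R)                       # eigenvalues in ascending order
    order = np.argsort(w)[::-1][:num_ac_kernels]   # leading components (lossy if < N-1)
    return dc, v[:, order].T                       # orthonormal AC kernels as rows

def saak_stage_forward(X, dc, ac_kernels):
    """Project, augment each AC kernel with its negative, and apply ReLU (S/P format)."""
    p0 = X @ dc
    p_ac = (X - np.outer(p0, dc)) @ ac_kernels.T   # AC projections p_k
    pos = np.maximum(p_ac, 0.0)                    # response to +a_k
    neg = np.maximum(-p_ac, 0.0)                   # response to the augmented kernel -a_k
    g_ac = np.empty((X.shape[0], 2 * p_ac.shape[1]))
    g_ac[:, 0::2] = pos                            # interleave (+, -) pairs
    g_ac[:, 1::2] = neg
    return np.hstack([p0[:, None], g_ac])          # output dimension 2K + 1

# Example: 1000 random 2x2x1 patches (N = 4) yield 2*(4-1)+1 = 7 coefficients each.
X = np.random.rand(1000, 4)
dc, ac = saak_stage_fit(X)
print(saak_stage_forward(X, dc, ac).shape)         # (1000, 7)
```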
Figure 3.3: Illustration of the forward and inverse multi-stage Saak transforms, where the output of the pth stage is the set of spatial-spectral representations at that stage, and S_p and S_p^{-1} are the forward and inverse Saak transforms between stages (p-1) and p, respectively.

3.2.2 Lossy Saak Transform

The kernel number for a lossless Saak transform is uniquely specified. For the lossless multi-stage Saak transforms, the spatial-spectral dimension in stage p is 2 × 2 × K_p, where K_p can be recursively computed as

K_p = 2 \times 4 \times K_{p-1}, \quad K_0 = 1, \quad p = 1, 2, \cdots.   (3.13)

The right-hand side (RHS) consists of three factors. The first factor, 2, is due to the S/P conversion. The second and third factors, 4 and K_{p-1}, come from the requirement that the degrees of freedom of the input cuboid (of dimension 2 × 2 × K_{p-1}) be preserved by the KLT in the output vector. As a consequence of Eq. (3.13), we have K_p = 8^p. To avoid this exponential growth in the stage number p, we can leverage the energy compaction property of the KLT and adopt the lossy Saak transform by replacing the KLT with the truncated KLT (or PCA).

The properties of lossy Saak transforms. Many useful properties are preserved in the lossy Saak transform, such as the orthonormality of transform kernels and its capability to provide a family of spatial-spectral representations. However, there is a loss in reconstructing the original input, which is why it is called the "lossy" transform. Since we deal with the classification problem, exact reconstruction is of little concern here. To reduce the number of transform kernels, we select the leading K'_p components that have the largest eigenvalues among all K_p components in stage p. Empirically, we have K'_p << K_p.

Feature selection. The total number of Saak coefficients in lossy multi-stage Saak transforms is still large. To use Saak coefficients as image features for classification, we can select a small subset of coefficients by taking their discriminant capability into account. To find Saak coefficients of higher discriminant power, we adopt the F-test statistic (or score) used in the ANalysis Of VAriance (ANOVA), which is of the form

F = \frac{\text{between-group variability (BGV)}}{\text{within-group variability (WGV)}}.   (3.14)

The between-group variability (BGV) and the within-group variability (WGV) can be written, respectively, as

\text{BGV} = \sum_{c=1}^{C} n_c (\bar{S}_c - \bar{S})^2 / (C - 1),   (3.15)

and

\text{WGV} = \sum_{c=1}^{C} \sum_{d=1}^{n_c} (S_{c,d} - \bar{S}_c)^2 / (T - C),   (3.16)

where C is the class number, n_c is the number of samples in the cth class, \bar{S}_c is the sample mean of the cth class, \bar{S} is the mean of all samples, S_{c,d} is the dth sample of the cth class, and T is the total number of samples.

3.3 Image Representation and Reconstruction

In this section, we analyze the Saak coefficients and adopt the inverse multi-stage Saak transform to realize image reconstruction. All experiments are conducted on the MNIST dataset.

3.3.1 Distributions of Saak Coefficients

There are 10 handwritten digit classes, ranging from 0 to 9, in the MNIST dataset. For image samples of the same digit, we expect their Saak coefficients to have a clustering structure. We plot the distributions of the first three last-stage Saak coefficients in Figure 3.4. Each plot has ten curves corresponding to the smoothed histograms of the ten digits, obtained from 1000 image samples per digit. By inspection, each of them looks like a normal distribution.
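As a concrete illustration of the feature-selection score in Eqs. (3.14)-(3.16), the F-score of a single Saak coefficient can be computed directly from labeled samples; the following is a minimal NumPy sketch, and the function and variable names are ours rather than part of the design.

```python
# Minimal sketch of the ANOVA F-score of Eqs. (3.14)-(3.16) for one Saak coefficient.
import numpy as np

def f_score(values, labels):
    """values: 1-D array of one Saak coefficient over all samples; labels: class ids."""
    classes = np.unique(labels)
    C, T = len(classes), len(values)
    grand_mean = values.mean()
    bgv = sum(np.sum(labels == c) * (values[labels == c].mean() - grand_mean) ** 2
              for c in classes) / (C - 1)                        # Eq. (3.15)
    wgv = sum(np.sum((values[labels == c] - values[labels == c].mean()) ** 2)
              for c in classes) / (T - C)                        # Eq. (3.16)
    return bgv / wgv                                             # Eq. (3.14)

# Ranking example (hypothetical feature matrix F and labels y): keep the top 75%.
# scores = np.array([f_score(F[:, j], y) for j in range(F.shape[1])])
# keep = np.argsort(scores)[::-1][: int(0.75 * len(scores))]
```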
To verify normality, we show the percentages of the leading 256 last-stage Saak coefficients for each digit class that pass the normality test with 100 and 1000 samples per class in Table 3.1. We adopt the Jarque-Bera test [36] for the normality test and the Gubbbs’ test [28] for outlier removal in this table. We see from this table that, when each class has 100 samples, 90% or higher of these 256 last-stage Saak coefficients pass the normality test after outlier removal. However, the percentages drop when the sample number increases from 100 to 1000. This sheds 32 (a) The first spectral component (b) The second spectral component (c) The third spectral component Figure 3.4: The distributions of Saak coefficients for the (a) first, (b) second and (c) third components for 10 digit classes. Table 3.1: The percentages of the leading 256 last-stage Saak coefficients that pass the normality test for each digit class before and after outlier removal with 100 and 1000 samples per digit class, where S denotes the number of samples. Digit Class 0 1 2 3 4 5 6 7 8 9 S=100, raw 91 73 96 90 87 89 88 88 90 86 S=100, outlier removed 97 89 98 97 97 96 96 93 98 96 S=1000, raw 63 30 74 68 56 66 53 46 61 53 S=1000, outlier removed 77 52 87 85 80 84 72 75 85 74 light on the difference between small and big datasets. It is a common perception that the Gaussian mixture model (GMM) often provides a good signal model. Even it is valid when the sample number is small, there is no guarantee that it is still an adequate model when the sample number becomes very large. 33 3.3.2 Image Synthesis via Inverse Saak Transform Figure 3.5: Illustration of image synthesis with multi-stage inverse Saak transforms (from top to bottom): six input images are shown in the first row and reconstructed images based on the leading 100, 500, 1000 and 2000 last-stage Saak coefficients are shown from the second to the fifth rows. Finally, reconstructed images using all last- stage Saak coefficients are shown in the last row. Images of the first and the last rows are identical, indicating lossless full reconstruction. We can synthesize an input image with multi-stage inverse Saak transforms with a subset of its last-stage Saak Coefficients. An example is shown in Figure 3.5, where six input images are shown in the first row and their reconstructed images using leading 100, 500, 1000, 2000 last-stage Saak coefficients are shown in the 2nd, 3rd, 4th and 5th rows, respectively. The last row gives reconstructed images with all Saak coefficients (a 34 total of 16; 000 coefficients). Images in the first and the last rows are actually identical. It proves that multi-stage forward and inverse Saak transforms are exactly the inverse of each other numerically. The quality of reconstructed images with 1000 Saak coefficients is already very good. It allows people to see them clearly without confusion. Thus, we expect to get high classification accuracy with these coefficients. Furthermore, it indicates that we can consider lossy Saak transforms in all stages to reduce the storage and the computational complexities, which will be confirmed in Sec. 3.4. 3.4 Application to Handwriten Digits Recognition We adopt Saak transform to handwriten digits recognition and conduct a compara- tive study on the efficiency of lossless and lossy Saak transforms under a comparable accuracy level in Sec. 3.4.1 as well as the performance of the LeNet-5 and the Saak- transform-based solutions in terms of scalability and robustness in Secs. 3.4.2 and 3.4.3, respectively. 
3.4.1 Efficiency The proposed Saak-transform-based pattern recognition approach is illustrated in Figure 3.6. For an input image of size 32 32, we conduct five-stage lossy Saak transforms to compute spatial-spectral representations with spatial resolutions of 16 16(= 256), 8 8(= 64), 4 4(= 16), 2 2(= 4) and 1 1(= 1). The feature selection module in the figure consists of two steps. For the first step, we select transform kernels using the largest eigenvalue criterion. This controls the number of Saak coefficients in each stage effectively while preserving the discriminative power. To give an example, if we choose transform kernels with their energy just greater than 3% of the total energy of 35 Figure 3.6: Illustration of the proposed Saak transform approach for pattern recogni- tion. First, a subset of Saak coefficients is selected from each stage. Next, the feature dimension is further reduced. Finally, the reduced feature vector is sent to an SVM classifier. the PCA, the kernel numbers from Stages 1 to 5 are 4, 5, 8, 7 and 9, respectively. Then, the number of the Saak features can be computed as 4 256 + 5 64 + 8 16 + 7 4 + 9 1 = 1; 509; For the second step, we adopt the F-test statistic (or score) in the ANalysis Of V Ariance (ANOV A) [6] to select features that have the higher discriminant capability. That is, we order the features selected in the first step based on their F-test scores from the largest to the smallest, and select the top 75%. This will lead to a feature vector of 36 dimension 1; 509 0:75 = 1; 132. The feature selection module is followed by the feature reduction module, which is achieved by the PCA. We consider four reduced dimension cases of size 32, 64, 128 and 256, respectively, as shown in Table 3.2. Finally, we feed these reduced-dimension feature vectors to the SVM classifier in the SVM classification module. We compare different cutoff energy thresholds in selecting the number of transform kernels in each stage in Table 3.2. The first row gives the lossless Saak transform results and the best one is 98.58%. We see little performance degradation by employing the lossy Saak transform, yet its complexity is significantly reduced. Table 3.2: The classification accuracy (%) of the Saak transform approach for the MNIST dataset, where the first column indicates the kernel numbers used from stages 1-5 in the feature selection module while the second to the fifth columns indicate dimen- sions of the reduced feature dimension. The cutoff energy thresholds for the 2nd to 5th rows are 1%, 3%, 5% and 7% of the total energy, respectively. #Kernels for each stage 32 64 128 256 All kernels 98.19 98.58 98.53 98.14 (4, 11, 16, 20, 17) 98.24 98.54 98.33 97.84 (4, 5, 8, 7, 9) 98.30 98.54 98.26 97.68 (4, 5, 5, 6, 7) 98.28 98.52 98.21 97.70 (4, 4, 4, 5, 5) 98.22 98.42 98.08 97.58 3.4.2 Scalability Here, we adopt multi-stage lossy Saak transforms with the energy threshold of 3% and reduce the feature dimension to 64 before applying the SVM classifier. We compare the average cosine similarity of the Saak transform kernels using the whole training set and the training subset in each stage in Table 3.3. We see from the table that the transform kernels are very stable, and the cosine similarity decreases slightly as going to higher stages. Even the lossy Saak transform is trained using only 10,000 samples, the cosine similarities between the resulting ones and the one trained using the whole training set 37 consisting of 60,000 samples is still higher than 0.99. 
This shows that the trained kernels are very stable against the training data size. Next, we report the classification accuracy on the MNIST dataset in Table 3.4 and compare results among training sets of different sizes. It is interesting to see that we can obtain almost the same classification accuracy using lossy Saak coefficients based on a training dataset of a much smaller size. Table 3.3: The cosine similarity of transform kernels obtained with subsets and the whole set of MNIST training data, where the first column indicates the number of images using for training and the second to sixth columns indicate the cosine similarity in each stage. Size Stage 1 Stage 2 Stage 3 Stage 4 Stage 5 50000 0.9999 0.9999 0.9999 0.9999 0.9996 40000 0.9999 0.9999 0.9999 0.9999 0.9993 30000 0.9999 0.9999 0.9999 0.9999 0.9988 20000 0.9999 0.9999 0.9999 0.9996 0.9972 10000 0.9999 0.9999 0.9997 0.9992 0.9945 Table 3.4: The effect of the training set sizes on the MNIST dataset classification accu- racy where the first row indicates the number of images used in transform kernel training and the second row indicates the classification accuracy (%). Size 60000 5000 40000 30000 20000 10000 Accuracy 98.54 98.53 98.53 98.53 98.52 98.52 Table 3.5: The cosine similarity of transform kernels using fewer class numbers and the whole ten classes, where the first column indicates the object class number in training, and the second to sixth columns indicate the cosine similarity in each stage. Class No. Stage 1 Stage 2 Stage 3 Stage 4 Stage 5 8 0.9998 0.9996 0.9942 0.9940 0.9550 6 0.9982 0.9983 0.9866 0.9639 0.5586 4 0.9993 0.9990 0.9816 0.9219 0.4557 2 0.9390 0.9672 0.6567 0.6694 0.3294 The experimental results of scalability against a varying object class number are shown in Table 3.5 and Figure 3.7, where the class is arranged in an increasing order. 38 For example, if there are 4 classes in the training, they are digits 0, 1, 2, 3. We derive transform kernels with a subset of 10 digit classes and show the averaged cosine simi- larities in Table 3.5. As shown in the table, the transform kernels are relatively stable in the early stages. For example, the averaged cosine similarities of the first and the second stages are all above 0.9 even we only have one object class for training. The cosine similarities for later stages drop since the Saak coefficients of later stages capture the global view and they are effected more by decreasing the training classes. We see from Figure 3.7 that the lossy Saak coefficients learned from fewer object classes can still handle the classification task of ten digits. In contrast, we train the convolutional layers of the LeNet-5 under the same above-mentioned condition while it fully connected (FC) layers are kept the same with ten output classes. Then, the performance of the LeNet-5 drops quickly when the object class number decreases. The Saak transform approach is more stable and it recognition accuracy is about the same even the number of training classes decreases. In general, the performance of the Saak transform offers a scalable solution to the handwritten digits recognition problem because the class features are obtained from sample class images. The LeNet-5 is more sensitive since it is built upon the end-to-end optimization framework. The network has to be re-trained whenever the object class number changes. 3.4.3 Robustness We may encounter undesirable noisy or low quality images in real-world applications, and need a robust method to handle these situations. 
To simulate the condition, we modify the MNIST dataset using two ways – adding noise and changing background of the images as shown in Figure 3.8. To test the robustness of classification methods fairly, we use only clean images in the training and the unseen noisy images in the testing. Table 3.6 compares the Saak transform based method, the LeNet-5 and the AlexNet. 39 Figure 3.7: The classification results of using fewer classes in training. where the blue line indicates the Saak transform approach and the green line indicates the LeNet-5 method. The LeNet-5 contains 2 CONV layers and 3 FC layers with around 60K parameters. The AlexNet consists of 5 CONV layers and 3 FC layers with around 60N parameters. For the Saak transform-based method, we set the energy threshold to 3% as described earlier and train the SVM classifier using 32D feature vectors. The results in Table 3.6 indicate that the performance of the lossy Saak transform method is less affected by noise, especially the Salt and Pepper noise. Although the Salt and Pepper noise is significantly increased in the column of S&P 4, our method can still achieve 87.49% accuracy. The AlexNet method is more robust to background change. 40 Figure 3.8: Example of noisy test samples, where the first to fourth rows are noisy images with added Salt and pepper noise with an increasing noise level while the fifth to the eighth row are noisy images with added Speckle noise, Gaussian noise, background replaced by uniform noise, and background replaced by texture images, respectively. Table 3.6: The classification accuracy (%) on noisy images. All methods are trained on clean images. The first to fourth columns report the results of adding Salt and pepper noise as increasing noise level. The fifth to eighth columns display the results of adding Speckle noise, adding Gaussian noise, replacing background with uniform noise, and replacing background with texture images, respectively. Method S&P 1 S&P 2 S&P 3 S&P 4 Speckle Gaussian random bg texture bg LeNet-5 89.13 86.12 74.62 67.68 84.10 81.75 94.11 85.59 AlexNet 82.83 84.22 62.49 53.99 75.94 97,63 98.36 98.12 Saak 95.71 95.31 91.16 87.49 83.06 94.08 94.67 87.78 3.5 Conclusion Data-driven forward and inverse Saak transforms were proposed and offer an new angle to explain the convolutional neural networks (CNNs). By applying multi-stage Saak 41 transforms to a set of images, we can derive multiple representations of these images ranging from the pure spatial domain to the pure spectral domain as the two extremes. There is a family of joint spatial-spectral representations with different spatial-spectral trade offs between them. Comparing with CNNs, The Saak transform doesn’t demand data labels nor backpropagation in kernel determination. The MNIST dataset was used as an example to demonstrate this application including image reconstruction and classi- fication. A lossy Saak transform based approach was proposed to solve the handwritten digits recognition problem. This new approach has several advantages such as higher efficiency than the lossless Saak transform, scalability against the variation of training data size and object class numbers and robustness against noisy images . 42 Chapter 4 Ensembles of Saab-Transform-Based Decision System 4.1 Introduction The recent state-of-the-art image classification methods are based on the convolu- tional neural network (CNN) which first introduced by [18], and further improved and refined by [41, 31, 33]. 
All CNN parameters are learned by the stochastic gradient descent (SGD) algorithm through backpropagation(BP) which requires high computa- tional efforts even with the help of fast parallel calculation using graphics cards (GPUs). Several works are proposed to explain the super-performance and functionality of CNNs [85, 43], and to improve the computational efficiency by introducing without BP struc- tures [13, 45]. Interpretable feedforward (FF) design based on the Saab (Subspace approximation with adjusted bias) transform was recently proposed by Kuo in [45], which derives network parameters of the current layer based on data statistics from the output of the previous layer in a one-pass manner without any BP as a reference. It not only provides valuable insights into CNN architectures but also offers an efficient approach to solve image classification task. In this work, we use the FF designed mod- els as the base classifiers in the ensemble system and show the promising capability to solve the image classification problem. Although the ensemble method can apply to both BP and FF designed models, the cost of building an ensemble system on multiple FF designed models is significantly lower than that on multiple BP networks. 43 Generally, the FF design can be separate to two modules, the convolutional (conv) layers construction and the Fully-connected (FC) layers development. For the conv lay- ers, a new signal transform, called the Saab transform is proposed. It is a variant of the principal component analysis (PCA) with an added bias vector to annihilate activa- tion’s nonlinearity. Multiple Saab transforms in cascade yield multiple conv layers. As to FC layers, they are constructed through a cascade of multi-stage linear least squared regressors (LSRs) with pseudo-label assignments. To further achieve higher efficiency and performance, we adopt channel-wised PCA on the final outputs of the conv lay- ers to remove the feature redundancy in the spatial dimenstions, and design the new pseudo-label encoding method leading to more sparse and discriminative features. Ensemble methods are learning algorithms that construct a set of first stage base classifiers and then make final decision by combining all base classifier predictions. There is a well established literature about ensemble learning [83]. Researchers have since exploded different ensemble approaches, such as bagging [7], random forests (an ensemble of decision trees) [8], stacked generalization [79], etc. Fusion of the different base classifier outputs does not necessarily lead to a better classification performance than the individual base classifier. Diversity is very important for the success of the ensemble system [9, 42], even though the explicit relationship between diversity and ensemble performances has not been established. In our ensemble system, we introduce the diversity of the FF designs in following three aspects, using different parameter set- tings of FF design, using different subsets of the available features to train each classifier, and using different representations of input images. In this chapter, we develop an ensemble system built on FF designs to solve image classification problem. We adopt channel-wised PCA and new pseudo-label encoding method in FF design process, and we introduce three types of diversity into the ensemble system to pursue better performance. 
The base classifier predictions are fused through a simple yet efficient method: we concatenate the base classifier predictions and then adopt PCA to reduce the feature dimension before feeding the result into the ensemble classifier. Furthermore, we define a confidence score based on the final decision vector of the ensemble classifier and the prediction results of all base classifiers to select a hard sample set. A new ensemble system targeting the hard set can then be trained, which has a chance to correct the mispredicted hard samples. The proposed method is evaluated on two benchmark datasets, the MNIST dataset [49] and the CIFAR-10 dataset [40]. A comparative study on the performance of BP-optimized and FF-designed models is conducted, and the effects of using different combinations of FF designs and of re-prediction on the hard sample set are reported.

4.2 Saab-Transform-Based Decision System

The Saab-transform-based decision system (i.e., the FF-designed CNN) adopts a data-centric approach without any BP, where the parameters of the current layer are derived from the data statistics collected from the output of the previous layer in a one-pass manner. Generally, the design can be separated into two modules: the conv layers, constructed through subspace approximations and projections, and the FC layers, developed via training sample clustering and remapping.

4.2.1 Construction of Convolutional Layers

As illustrated in Figure 4.1, the conv layers are enclosed by a blue parallelogram. They offer a sequence of spatial-spectral filtering operations realized by the proposed Saab (Subspace approximation with adjusted bias) transform [46].

Figure 4.1: The LeNet-5 architecture [49], where the conv layers are enclosed by a blue parallelogram.

Similar to the Saak transform, the ith-layer transform kernel (also called anchor vector) set B_i = [b_{i0}, \cdots, b_{iK_i}] is computed from the statistics of the input vector set X_i = [x_{i1}, \cdots, x_{iN}], where x_{in} \in R^{C_i \times W_i \times H_i} and b_{ik} \in R^{S_i^2 C_i}; here C_i, W_i and H_i denote the channel number, width and height of the layer input, respectively, while K_i and S_i denote the kernel number and kernel size. We separate the anchor vectors into two categories: the DC anchor vector and the AC anchor vectors. The DC anchor vector is simply

b_{i0} = \frac{1}{\sqrt{C_i S_i^2}} (1, \cdots, 1)^T,

which preserves the mean information of the input data. As for the AC anchor vectors, we first obtain the mean-removed input using Eqs. (3.3) and (3.4), conduct the principal component analysis (PCA) on the set of mean-removed inputs, and then choose the first K_i principal components as the AC anchor vectors b_{ik}, k = 1, \cdots, K_i.

Bias selection. Different from the Saak transform, the Saab transform adopts an affine transformation instead of a linear transformation. We have

y_k = b_k^T \tilde{x} + c_k, \quad k = 1, \cdots, K,   (4.1)

where \tilde{x} is the mean-removed input and c_k is a bias term. By choosing the bias terms carefully, the rectification operation can be discarded and the sign confusion problem can be handled. We impose two constraints on the bias terms.

Positive response constraint. We choose the kth bias, c_k, to guarantee that the kth response is non-negative. Mathematically, we have

y_k = \sum_{n=0}^{N-1} b_{k,n} \tilde{x}_n + c_k = b_k^T \tilde{x} + c_k \ge 0,   (4.2)

for all inputs \tilde{x}.

Constant bias constraint. We demand that all bias terms are equal; namely,

c_1 = c_2 = \cdots = c_K \equiv d \sqrt{K}.   (4.3)

In this way, we introduce non-linearity into the system that performs a similar function as the rectified linear unit (ReLU) activation in a CNN.
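To make the construction concrete, a single Saab unit could be implemented as in the following sketch, which operates on flattened patches; the function names are ours, and the bias shown (the largest residual norm in the training set) is one simple sufficient choice that satisfies the two constraints rather than the exact setting used in the thesis.

```python
# Sketch of one Saab unit: DC + PCA-derived AC anchor vectors plus a constant bias
# chosen so that all AC responses stay non-negative (Eqs. (4.1)-(4.3)); names are ours.
import numpy as np

def fit_saab(X, num_ac_kernels):
    """X: (num_samples, D) flattened S x S x C patches."""
    D = X.shape[1]
    dc = np.ones(D) / np.sqrt(D)                         # DC anchor vector
    X_ac = X - (X @ dc)[:, None] * dc                    # mean-removed input
    R = X_ac.T @ X_ac / len(X)                           # second-order statistics
    w, v = np.linalg.eigh(R)
    B = v[:, np.argsort(w)[::-1][:num_ac_kernels]].T     # AC anchor vectors as rows
    # Sufficient constant bias: |b_k^T x| <= ||x|| (Cauchy-Schwarz), so the largest
    # residual norm keeps every response non-negative on the training set.
    bias = float(np.max(np.linalg.norm(X_ac, axis=1)))
    return dc, B, bias

def apply_saab(X, dc, B, bias):
    X_ac = X - (X @ dc)[:, None] * dc
    return np.hstack([(X @ dc)[:, None], X_ac @ B.T + bias])   # Eq. (4.1), no ReLU needed
```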
After each Saab transform, max-pooling is conducted to further reduce the spatial redundancy. The final output of the conv layers always has a spatial dimension of W_N × H_N, and feature redundancy still exists in the spatial dimensions, as illustrated in Figure 4.2. To further reduce the feature dimension, we propose the channel-wise PCA (C-PCA). For each channel of the feature maps, we reduce the feature dimension from W_N × H_N to L through a PCA transform, which gives a final feature dimension of C_N × L. The whole conv-layer construction can be treated as an unsupervised feature extraction process in which the feature dimension is reduced and the feature redundancy is removed gradually. The cascade of multiple convolutional layers generates a rich set of image patterns of various scales as object signatures. In the FF design, a target pattern is typically represented as a linear combination of responses of a set of orthogonal PCA filters. These responses are signatures of the corresponding receptive field in the input image.

Figure 4.2: Visualization of the correlation matrix with respect to each filter in the last conv layer.

4.2.2 Construction of FC Layers

The network architecture of the LeNet-5 is shown in Figure 4.3, where the FC layers are enclosed by a blue parallelogram. The FC-layer design corresponds to a multi-layer perceptron (MLP), but with supervision on each layer and learning in a one-pass manner. We use the ReLU as the non-linearity after each FC layer. Instead of directly learning the mapping to the target class-label space, we build several intermediate pseudo-label spaces and learn the mappings by solving simple linear least-squares equations.

Figure 4.3: The LeNet-5 architecture [49], where the FC layers are enclosed by a blue parallelogram.

Least-squared regressor (LSR). In the FF design, each FC layer is treated as a linear least-squared regressor. The output of each FC layer is a one-hot vector whose elements are all set to zero except for one element whose value is set to one. The one-hot vector is typically adopted in the output layer of CNNs. Here, we generalize the concept and use it at the output of each FC layer. There is one challenge in this generalization: there is no label associated with an input vector when the output is one of the hidden layers. To address it, we conduct K-means clustering on the input. Then, each input has two labels – the original class label and the new cluster label. We combine the class and cluster labels to create C pseudo-labels for each input. We can then set up a linear least-squared regression problem using the one-hot vectors defined by the C pseudo-labels at this layer, and conduct the label-guided linear least-squared regression in multiple stages (or multiple layers).

To derive a linear least-squared regressor, we set up a set of linear equations that relates the input and output variables. Suppose that x = (x_1, x_2, \cdots, x_n)^T \in R^n and y = (y_1, y_2, \cdots, y_C)^T \in R^C are the input and output vectors. Then we have

\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} & w_1 \\ a_{21} & a_{22} & \cdots & a_{2n} & w_2 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ a_{C1} & a_{C2} & \cdots & a_{Cn} & w_C \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \\ 1 \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_C \end{bmatrix},   (4.4)

where w_1, w_2, \cdots, w_C are scalars accounting for the C bias terms. After nonlinear activation, each FC layer becomes a rectified linear least-squared regressor.
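A minimal sketch of how one such FC layer can be derived in a one-pass manner is given below, combining K-means pseudo-labels with the least-squares solution of Eq. (4.4); the use of scikit-learn and all names are our own choices.

```python
# Sketch: one FF-designed FC layer as a rectified linear least-squared regressor.
# K-means pseudo-labels + least squares (Eq. (4.4)); library choices are ours.
import numpy as np
from sklearn.cluster import KMeans

def fit_fc_layer(X, labels, clusters_per_class, num_classes):
    """X: (num_samples, n) features; labels: original class ids in [0, num_classes)."""
    C = num_classes * clusters_per_class            # number of pseudo-labels
    pseudo = np.empty(len(X), dtype=int)
    for c in range(num_classes):                    # cluster each class separately
        idx = np.where(labels == c)[0]
        km = KMeans(n_clusters=clusters_per_class, n_init=10).fit(X[idx])
        pseudo[idx] = c * clusters_per_class + km.labels_
    Y = np.zeros((len(X), C))                       # one-hot pseudo-label targets
    Y[np.arange(len(X)), pseudo] = 1.0
    Xb = np.hstack([X, np.ones((len(X), 1))])       # append 1 for the bias column
    W, *_ = np.linalg.lstsq(Xb, Y, rcond=None)      # least-squares solution of Eq. (4.4)
    return W

def apply_fc_layer(X, W):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.maximum(Xb @ W, 0.0)                  # rectified output fed to the next layer
```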
Pseudo-label generation. To build least-squared regressors, we need to define labels for the input. Here, we consider the combination of two factors: the original label and an auxiliary label of an input data sample. Each input training sample has its own original label. To create its auxiliary label, we run the K-means clustering algorithm on training samples of the same class. For example, we can divide samples of the same digit into M clusters to generate M pseudo-classes. If we have N original classes, we generate M × N pseudo-categories in total. The centroid of a pseudo-class provides a representative sample for that class. The reason to generate pseudo-classes is to capture the diversity of a single class with more representative samples, as illustrated in Figure 4.4.

Figure 4.4: Illustration of intra-class variabilities [44].

We can further modify the one-hot vector by setting all zero terms to -1/\sqrt{MN}, which maximizes the distance between different label vectors in the output space. Also, because of the ReLU function, sparser outputs will be obtained.

4.2.3 Case Study: LeNet-5

The block diagram of the FF design of the first two convolutional layers of the LeNet-5 is shown in Figure 4.5.

Figure 4.5: Summary of the FF design of the LeNet-5.

In this work, we adopt LeNet-like networks that contain two convolutional layers and three FC layers. The LeNet-like architecture for the MNIST dataset is as follows.
First conv layer: 1 DC and 5 AC filters of kernel size 5x5, followed by max-pooling.
Second conv layer: 1 DC and 15 AC filters of kernel size 5x5, followed by max-pooling and C-PCA.
First FC layer: n_1 = 300 and m_1 = 120.
Second FC layer: n_2 = 120 and m_2 = 80.
Third FC layer: n_3 = 80 and m_3 = 10.
The input to the first FC layer is a data cuboid of dimension 20 × 15 = 300, since the DC responses are removed and C-PCA is adopted. The targets of the third FC layer output are the following ten one-hot vectors of dimension 10:

(1, 0, \cdots, 0)^T, (0, 1, 0, \cdots, 0)^T, \cdots, (0, \cdots, 0, 1, 0)^T, (0, \cdots, 0, 1)^T.   (4.5)

Each one-hot vector denotes the label of a handwritten digit. For example, we can use the above ten one-hot vectors to denote digits "0", "1", ..., "9", respectively.

4.3 Enhancing Diversity of Ensemble System

4.3.1 Three Types of Diversity

Figure 4.6: Overview of the proposed ensemble method.

To further improve the performance of individual FF-CNNs, we develop a simple yet effective ensemble method as illustrated in Figure 4.6. Specifically, we consider ensembles of LeNet-like networks that contain two convolutional layers, two FC layers and one output layer. We compose multiple FF-designed CNNs as first-stage base classifiers into an ensemble, concatenate their output decision vectors (each of dimension equal to the number of classes), and then adopt PCA to reduce the feature dimension before feeding the result into the second-stage ensemble classifier. The success of an ensemble is highly related to the diversity of its base classifiers. Here, we propose several ways to increase the diversity of the ensemble: using different FF-designed CNNs with different learning parameter sets, using different subsets of the available features to train each FF-designed CNN, and using different representations of the input images fed into the FF design. They are elaborated below.

Different Learning Parameter Settings

We change only the filter size parameter in the conv-layer construction.
Originally, the filter sizes for the two conv layers are (5x5, 5x5), mimicking the LeNet-5 architecture. We then modify the filter sizes to (3x3, 5x5), (5x5, 3x3), and (3x3, 3x3), yielding another three types of FF designs. Different filter sizes directly affect the receptive field size of each conv layer and induce different statistics collected from the input data. In this way, distinct feature representations of the input data are extracted. The final decision vector can be formulated as

P(y|x) = S[P_1(y|x), P_2(y|x), P_3(y|x), P_4(y|x)],   (4.6)

where x and y denote the input data and the corresponding target, and S(\cdot) represents the combination function.

Subsets of the Available Features

For each FF-designed CNN, we have an available feature set F = [F_{conv1}, F_{conv2}] for each sample, acquired from the outputs of the two conv layers. F_{conv1} and F_{conv2} can represent the input objects independently, yet with different receptive field scopes. A subset V_i can be selected from F by the following selection rules:
1) For each channel in F_{conv1}, randomly select features with dimension r_0 · W_i H_i and apply C-PCA to acquire the transformed feature F'_{conv1} \in R^{1 \times K_{F1}}; then further randomly select features with dimension r_1 · K_{F1} from F'_{conv1}.
2) First adopt the C-PCA transform on F_{conv2} to generate F'_{conv2} \in R^{1 \times K_{F2}}, and randomly select feature subsets with dimension r_2 · K_{F2} from F'_{conv2}.
3) Partition F_{conv1} in the spatial dimension in a checkerboard pattern, and adopt C-PCA on each part to generate two feature subsets.
Here, r_0, r_1 and r_2 are selection ratios. We can combine the decision vectors from all feature subsets as

P(y|x) = S[P_1(y|V_1), \cdots, P_C(y|V_C)].   (4.7)

Different Representations of Input Images

We also consider adopting different input representations. Researchers have developed various color models to represent color information and have shown their suitable application fields [34]. In this work, we utilize the RGB, YCbCr and Lab color spaces to represent color images from which features are extracted. We also decompose images into a set of feature maps using the 3x3 Laws filter bank [47], where each feature map emphasizes a different characteristic of the input image, for instance brightness, edgeness, etc. The final decision vector combining FF-designed CNNs with different input representations can be devised as

P(y|x) = S[P_1(y|f_1(x)), \cdots, P_C(y|f_C(x))],   (4.8)

where f_c represents a different operation on the input image.

4.3.2 Hard Examples Mining

The ensemble method can easily be applied to find hard samples in the dataset by computing the confidence score (CS) of its decision. We define the confidence score based on the final decision vector of the ensemble classifier and the prediction results of all base classifiers. A larger maximum probability in the final decision vector indicates higher confidence in the prediction, and more base classifiers agreeing on the majority class label in an ensemble indicates a higher possibility of making the correct prediction. The hard samples can therefore be determined by the following rules:

CS1_i = \max(P_{final}(y|x_i)),
CS2_i = N_i / N_{all},
X_{hard} = \{x_i : CS1_i < th_1 \text{ and } CS2_i < th_2\},   (4.9)

where x_i, CS1_i, and CS2_i denote the input sample and its two confidence scores, N_i indicates the number of base classifiers producing the majority class label for input sample x_i, and N_{all} denotes the total number of base classifiers.
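The selection rules of Eq. (4.9) can be applied directly to the stored decision vectors and base-classifier predictions; the following is a minimal sketch, where the array shapes and names are our assumptions.

```python
# Sketch of the hard-sample selection rules in Eq. (4.9); array names are ours.
import numpy as np

def select_hard_samples(P_final, base_preds, th1, th2):
    """P_final: (num_samples, num_classes) ensemble decision vectors (rows sum to 1).
    base_preds: (num_base, num_samples) integer class labels from each base classifier."""
    cs1 = P_final.max(axis=1)                               # CS1: maximum class probability
    num_base = base_preds.shape[0]
    # CS2: fraction of base classifiers agreeing on the majority label per sample.
    majority_counts = np.array([np.bincount(base_preds[:, i]).max()
                                for i in range(base_preds.shape[1])])
    cs2 = majority_counts / num_base
    return np.where((cs1 < th1) & (cs2 < th2))[0]           # indices of X_hard

# Example thresholds from the MNIST experiments: th1 = 0.98, th2 = 0.7.
```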
A new set of FF-designed CNNs targeting the hard sample set X_hard can then be trained, and the proposed ensemble methods described in Section 4.2 can also be adopted to further boost the performance on the hard samples.

4.4 Experiments

We conduct experiments on two popular datasets: the MNIST dataset [49] and the CIFAR-10 dataset [40]. The MNIST dataset contains gray-scale images of handwritten digits 0-9. It has 60,000 training images and 10,000 testing images of size 32 × 32. The CIFAR-10 dataset has 10 classes of tiny color images of size 32 × 32. It has 50,000 training images and 10,000 testing images.

We adopt the LeNet-5 architecture [49] for the gray-scale MNIST dataset. Since the CIFAR-10 dataset is a color image dataset, we set the filter numbers of the first and second conv layers and the first and second FC layers to 32, 64, 200 and 100, respectively. The C-PCA is applied to the output of the second conv layer. We set the reduced feature dimension of the second conv layer outputs to 20 and 12 for the MNIST and CIFAR-10 datasets, respectively. Sometimes, we feed the responses of the first conv layer directly to the FC layers to increase feature diversity. When this happens, we set the reduced feature dimension of the first conv layer to 30 and 20, respectively. We adopt the Radial Basis Function (RBF) SVM classifier as the ensemble classifier in all experiments. Before training the SVM classifier, we apply PCA to the cascaded decision vectors from the base classifiers. The reduced feature dimensions are determined based on the correlation of the decision vectors of the base classifiers in an ensemble.

4.4.1 Performance of Individual FF-CNN

We study the effectiveness of three components of the FF design: the importance of using multiple FC layers to gradually transform to the final output space, the efficiency of adopting C-PCA to reduce the feature dimensions, and the influence of the modified encoding strategy for the pseudo-labels. From the results in Table 4.1, one can see the necessity of constructing multiple FC layers, which leads to a 2.3% performance gain on the CIFAR-10 dataset. Meanwhile, the C-PCA and the modified pseudo-label encoding yield 0.6% and 1.2% performance gains, respectively.

We also compare the performance of individual BP- and FF-designed CNNs with the same network structure. The training and testing accuracies for the MNIST and CIFAR-10 datasets are presented in Table 4.2. We examine the feature extraction part (construction of conv layers) and the feature classification part (development of FC layers) separately. For the first part, we extract outputs from the last conv layer of the BP-CNN as features and feed them into the FC layers of the FF-designed CNN. For the second part, we keep the FF-designed conv layers and use BP optimization to learn the FC-layer parameters. It is observed that the overall performance gaps for MNIST and CIFAR-10 due to the absence of BP optimization are 2.7% and 6%, respectively. Specifically, for the MNIST dataset, the primary error is caused by the feature classification part, while for CIFAR-10, the main degradation comes from the feature extraction part. This can be explained as follows: the conv-layer design has the capability to extract discriminative features from simple input data, while for more complicated inputs, the lack of label information during feature extraction leads to less discriminant power than BP-optimized models.
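The fusion step described above (concatenating the base-classifier decision vectors, applying PCA, then training an RBF SVM) could be sketched as follows; the scikit-learn usage and all names are our choices.

```python
# Sketch of the decision-level fusion used for the ensemble classifier:
# concatenate base-classifier decision vectors, reduce with PCA, classify with an RBF SVM.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

def fit_ensemble(decisions_train, y_train, reduced_dim):
    """decisions_train: list of (num_samples, num_classes) arrays, one per base classifier."""
    Z = np.hstack(decisions_train)                  # cascaded decision vectors
    pca = PCA(n_components=reduced_dim).fit(Z)
    svm = SVC(kernel="rbf").fit(pca.transform(Z), y_train)
    return pca, svm

def predict_ensemble(decisions_test, pca, svm):
    Z = np.hstack(decisions_test)
    return svm.predict(pca.transform(Z))
```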
Table 4.1: Ablation study of the three modules in FF design: (1) multiple FC layer design (MFC), (2) adaptation of C-PCA, and (3) modification on the pseudo-label encoding (MPE), assessed on the CIFAR-10 testing dataset. MFC C-PCA MPE Accuracy(%) 7 7 7 59.6 3 7 7 61.9 +2.3 3 3 7 62.5 +0.6 3 3 3 63.7 +1.2 57 Table 4.2: Comparison between the BP-CNN and FF-CNNs on the MNIST and CIFAR- 10 datasets in terms of classification accuracy (%). FF BP BP+FF FF+BP MNIST Training 97.1 100 97.9 99.6 Testing 97.1 99.1 97.9 98.4 CIFAR-10 Training 69.0 99.6 72.9 81.4 Testing 63.7 68.7 67.0 64.2 4.4.2 Performance of Ensemble Systems To clearly analyze the effectiveness of proposed ensemble method, we have conducted several experiments targeting at the three different sources of diversity. Convolutional-layer parameter settings. In Table 4.3, we report the performance obtained by assembling different learning parameter settings. Four types of FF-design CNNs with different filter size in the conv layers are reported, including (5x5,5x5), (3x3,5x5), (5x5,3x3), and (3x3,3x3). The corresponding filter numbers are (32,64), (24,64), (32,64), (24,48) for RGB images or (16,32), (8,32), (16,32), (8,24) for single channel of the images on CIFAR-10 dataset. As for MNIST dataset, the filter numbers are (6,16) for all different settings. The results show that the ensemble of these four models provides 4% improvement than the best single FF-designed CNN. Different fil- ter size will directly affect the receptive field size of each conv layer and thus induce different statistics collected from the input data. In this way, we can introduce diverse feature representations into the ensemble system. Table 4.3: The classification accuracy (%) on the MNIST and the CIFAR-10 datasets, where the first to the fourth columns indicate the four types of FF-designed CNNs, and the fifth column presents the ensemble results. FF-1 FF-2 FF-3 FF-4 Ensemble Filter Size (5,5) (5,3) (3,5) (3,3) - MINST 97.1 97.0 97.2 97.3 98.2 CIFAR-10 63.7 65.3 64.2 65.9 69.9 58 Feature Subsets. The detailed evaluation results of using the subsets of the available features are shown in Table 4.4. We evaluate the FF-1 design presented in Table 4.3 and set 0 , 1 and 2 to be 75% for the feature subset selections, as described in Section 4.2. It can be seen that the combination of the FF-designed CNNs trained on the randomly selected feature subsets from F conv1 and F conv2 feature sets boost the performance by about 3.7% and 4.2% on CIFAR-10 dataset. In addition, simply combining three clas- sifiers (one trained on F conv1 feature set and two trained on F conv2 feature set) yields 68.7% and 97.7 % accuracy on CIFAR-10 and MNIST dataset respectively, which is one effective and efficient way for the development of the ensemble system. And we adopt this setting when combining with other sources of diversity. Table 4.4: The testing classification accuracy (%) on the MNIST and the CIFAR-10 datasets. The first to the fifth columns indicate using feature subsets which are the entireF conv2 set, two subsets collected following the third rule, the subset selected fol- lowing the first rule, and the subset selected following the second rule described in Section4.2, respectively. We examine different ensemble combinations: ED-1: com- bination of Conv1, Conv1-1, Conv1-2; ED-2: combination of six different Conv1-RD; ED-3: combination of twelve different Conv2-RD; ED-4: combination of six different Conv1-RD and twelve different Conv1-RD. 
Conv2 Conv1-1 Conv1-2 Conv1-RD Conv2-RD MINST 97.1 95.4 95.3 96.8 95.2 CIFAR-10 63.7 64.3 64.4 62.3 64.2 ED-1 ED-2 ED-3 ED-4 MINST 97.7 97.6 97.2 98.0 CIFAR-10 68.7 66.0 68.4 69.3 Different representations of input images. In Table 4.5, we report experiments adopt- ing different input representations to the FF-designed CNNs using FF-1 architecture. Here, we filter greyscale images with 3x3 Laws filters to represent images in different bandwidth. In addition, for color images in CIFAR-10 dataset, we also consider to rep- resent the color information in different color spaces: RGB, YCbCr, and Lab, where we treat the three channels separately for the last two color spaces. We can observe 1.1% 59 and 5.9% performance improvements on the MNIST and CIFAR-10 datasets respec- tively by assembling various input representations, which demonstrates the effectiveness of utilizing different input representations as the diversity source in an ensemble. Figure 4.7: The relation between testing accuracy (%) and the number of FF-designed CNNs included in the ensemble on the MNIST dataset. Three types of diversity sources are indicated as: S1: different learning parameter settings; S2: subsets of the features; S3: different input representations. Overall, we can combine three types of diversity sources to boost the performance. Figure 4.7 and Figure 4.8 illustrate the relation between testing accuracy and the com- plexity of the ensemble methods. It can be observed that generally combine more classifiers gives better performance, and the best performance achieved on MNIST and CIFAR-10 are 98.7% and 74.2% in terms of testing accuracy. Comparing with the single 60 Figure 4.8: The relation between testing accuracy (%) and the number of FF-designed CNNs included in the ensemble on the CIFAR-10 dataset. Three types of diversity sources are indicated as: S1: different learning parameter settings; S2: subsets of the features; S3: different input representations. BP-CNN reported in Table 4.2, the ensemble result is 5.5% higher on CIFAR dataset, but still 0.4% lower on MNIST dataset. 4.4.3 Hard Examples Mining The confidence scores can be derived and the hard sample set can be collected follow- ing the selection rules as described before. Theth1 andth2 are set to 0.98 and 0.7 for MNIST dataset, and set to 0.97 and 0.65 for CIFAR-10 dataset. The results are reported in Table 4.6. As regards the hard set, the new ensemble system trained on hard sample 61 Table 4.5: The testing classification accuracy (%) on the MNIST and the CIFAR-10 datasets, where L1 to L9 denote the filtered maps by L3L3, E3E3, S3S3, L3S3, S3L3, L3E3, E3L3, S3E3, and E3S3 Laws filters, respectively. The last column indicates the ensemble results. RGB Grey YCbCr Lab MINST - 97.1 - - CIFAR-10 63.7 - 54.0/41.4/41.1 53.2/40.0/41.0 L1 L2 L3 L4 L5 L6 L7 L8 L9 ED MINST 97.0 95.1 87.8 92.6 93.7 94.9 95.6 93.1 92.6 98.2 CIFAR-10 50.6 44.8 44.5 46.3 48.3 44.9 47.6 43.0 45.8 69.6 set provides 5.6% and 2.6% enhancements in terms of testing accuracy on MNIST and CIFAR-10 datasets, which indicates that some hard samples can be corrected. Overall, our proposed method outperforms the single BP- and FF-designed CNNs with testing accuracy of 99.3% and 76.2 % on the entire MNIST and CIFAR-10 datasets, respec- tively. The final accuracy is 0.2% and 7.5% higher than the individual BP network on MNIST and CIFAR-10 datasets. 
Table 4.6: The classification accuracy (%) on MNIST and CIFAR-10 datasets, where the first, second and fourth columns indicate results of the original ensemble system evaluating on the easy, hard and the entire sample sets accordingly. The third and fifth columns present the results of the new ensemble system evaluating on the hard set and the entire set, respectively. Easy Hard Hard+ FF FF + MNIST Train 99.9 90.0 98.2 98.9 99.7 Test 99.9 88.0 93.6 98.7 99.3 Cifar-10 Train 99.9 73.5 82.3 80.1 87.2 Test 98.2 66.2 68.8 74.2 76.2 4.4.4 Discussion To better understand the diversity among different FF-designed CNNs, we evaluate the correlation among the output of different classifiers using two diversity measures: 62 (a) Result of the single FF-CNN (b) Result of the ensembles Figure 4.9: The confusion matrices for the testing set of the MNIST dataset. Yules Q-statistic and entropy measure [17]. Those measurements are built on the cor- rect/incorrect decision. The higher Q-statistic or the lower entropy measure indicate the higher possible diversity among the base classifiers. The average measurements among 63 (a) Result of the single FF-CNN (b) Result of the ensembles Figure 4.10: The confusion matrices for the testing set of the CIFAR-10 dataset. 64 different sources of diversity are reported in Table 4.7. The best diversity measurements are achieved by combining all types of base classifiers leading to a large amount of dif- ference into the system, which is consistent with the classification accuracy assessment reported in Figure 4.8. Table 4.7: Diversity measurements on CIFAR-10 training set. Three types of diversity sources are evaluated separately in the first to third columns, and the last column reports the measurements on all base classifiers in an ensemble. S1 S2 S3 ALL Q-statistic 0.88 0.89 0.66 0.61 Entropy measure 0.21 0.24 0.47 0.49 Figure 4.11: Visualization of training and testing samples on CIFAR-10 dataset using t-SNE, where different colors indicate different class labels. In this work, we use a simple SVM ensemble classifier to combine the outputs of different FF-designed CNNs. To examine the efficiency of this ensemble method, we compare with the stacked generalization methods which can be easily applied in our framework. We need to change the training data for the ensemble classifier (SVM) by dividing the entire training set into K subsets, then train the same base classifiers K times on K-1 subsets to predict the data in the rest subset, and finally combine all 65 the predictions from K subsets as the complete outputs of a single base classifier. We adopt the stacked generalization on CIFAR-10 dataset respect to Type 1 diversity source. The ensemble accuracy on testing set is 69.7%, which is almost the same as the results in Table 4.3. Yet the stacked generalization method leads to K-times computational expenses. The result implies that the FF-designed CNN can be well generalized in the testing set. We visualize the samples represented in the final decision space of the single FF design using t-SNE [54], as presented in Figure 4.11. In the 2D embedded space, the training and testing samples from different classes are partitioned similarly which sup- ports the generality of the FF-designed CNN. It is interesting to observe that some diffi- cult classes, for example the bird, is mixed with others, and the cat and dog are heavily confused with each other as well. 
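For reference, both diversity measures used in this discussion can be computed directly from the correct/incorrect decision matrix of the base classifiers; the following is a minimal sketch in which the array layout and names are our assumptions, with the pairwise Yule's Q-statistic averaged over classifier pairs and the entropy measure following Eq. (2.6).

```python
# Sketch of the diversity measures: averaged pairwise Q-statistic and the entropy
# measure of Eq. (2.6). `correct` is a 0/1 matrix of shape (L classifiers, N samples).
import numpy as np
from itertools import combinations

def avg_q_statistic(correct):
    L = correct.shape[0]
    qs = []
    for i, j in combinations(range(L), 2):
        n11 = np.sum((correct[i] == 1) & (correct[j] == 1))   # both correct
        n00 = np.sum((correct[i] == 0) & (correct[j] == 0))   # both wrong
        n10 = np.sum((correct[i] == 1) & (correct[j] == 0))
        n01 = np.sum((correct[i] == 0) & (correct[j] == 1))
        qs.append((n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10 + 1e-12))
    return float(np.mean(qs))

def entropy_measure(correct):
    L, N = correct.shape
    l = correct.sum(axis=0)                                   # correct classifiers per sample
    return float(np.mean((2.0 / L) * np.minimum(l, L - l)))   # Eq. (2.6)
```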
To further analysis the confusion classes, we plot the confusion matrices as presented in Figure 4.9 and Figure 4.10. It can observed that ensembles help to resolve some confusions, e.g. 2 and 8, 7 and 9 in the MNIST dataset, airplane and ship in the CIFAR-10 dataset. 4.5 Conclusion An interpretable CNN design based on the feedforward methodology offers a comple- mentary approach in CNN filter weights selection. We make two modifications on the FF-CNN design to achieve higher classification performance. That is, we apply the channel-wise PCA to spatial outputs of the conv layers to remove spatial-dimension redundancy and we introduce negative values when encoding persudo-labels. The pro- posed ensemble approach built on the FF-designed CNNs has successfully solved the image classification problems, and outperformed the single BP- and FF-designed CNN on two benchmark datasets. To enhance the performance of the ensemble system, we 66 introduce diversities by adopting three strategies: 1) different parameter settings in con- volutional layers, 2) flexible feature subsets fed into the Fully-connected (FC) layers, and 3) multiple image embeddings of the same input source. Furthermore, we parti- tion input samples into easy and hard ones based on their decision confidence scores. As a result, we can develop a new ensemble system tailored to hard samples to further improve the performance. 67 Chapter 5 Semi-Supervised Learning via Saab-Transform-Based Decision System 5.1 Introduction When there are no sufficient labeled data available, we need to resort to semi-supervised learning (SSL) in tackling machine learning problems. SSL is essential in real world applications, where labeling is expensive and time-consuming. It attempts to boost the performance by leveraging unlabeled data. Its main challenge is how to use unlabeled data effectively to enhance decision boundaries that have been obtained by a small amount of labeled data. Convolutional neural networks (CNNs) are typically a fully- supervised learning tool since they demand a large amount of labeled data to represent the cost function accurately. As the amount of labeled data is small, the cost function is less accurate and, thus, it is not easy to formulate an SSL framework using CNNs. Model parameters of CNNs are traditionally trained under a certain optimization framework with the backpropagation (BP) algorithm. They are called BP-CNNs. Recently, a feedforward-designed convolutional neural network (FF-CNN) methodol- ogy was proposed by Kuo et al. [46]. The model parameters of FF-CNNs at a target layer are determined by statistics of the output from its previous layer. Neither an opti- mization framework nor the BP training algorithm is utilized. Clearly, FF-CNNs are 68 less dependent on data labels. We propose an SSL system based on FF-CNNs in this work. This work has several novel contributions. First, we apply FF-CNNs to the SSL con- text and show that FF-CNNs outperforms BP-CNNs when the size of the labeled data set becomes smaller. Second, we propose an ensemble system that fuses the output decision vectors of multiple FF-CNNs so as to achieve even better performance. Third, we con- duct experiments in three benchmarking datasets (i.e., MNIST, SVHN, and CIFAR-10) to demonstrate the effectiveness of the proposed solutions as described above. The rest of this chapter is organized as follows. Both SSL and FF-CNN are reviewed in Sec. 5.2. 
The proposed SSL solutions using a single FF-CNN and ensembles of multiple FF-CNNs are described in Sec. 5.3. Experimental results are shown in Sec.5.4. Finally, concluding remarks are drawn and future work is discussed in Sec. 6.6. 5.2 Semi-Supervised Learning and Feedforward- designed CNNs Semi-Supervised Learning (SSL). There are several well-known SSL methods proposed in the literature. Iterative learning, including self-training [64] and co-training [4], learns from unlabeled data that have high confidence predictions. Transductive SVMs [37] extend standard SVMs by maximizing the margin on unlabeled data as well. Another SSL method is to construct a graph to represent data structures and propagate the label information of a labeled data point to unlabeled ones [89, 3]. More recently, several SSL methods are proposed based on deep generative models, such as the variational auto-encoder (V AE) [38], and generative adversarial networks (GAN) [20, 65]. All parameters in these networks are 69 determined by the stochastic gradient descent (SGD) algorithm through BP and they are trained based on both labeled and unlabeled data. Feedforward-designed CNNs (FF-CNNs). The BP training is computationally intensive, and the learning model of a BP-CNN is lack of interpretability. New solutions have been proposed to tackle these issues. Exam- ples include: interpretable CNNs [85, 43, 44] and feedforward-designed CNNs (FF- CNNs) [13, 45, 46]. FF-CNNs contain two modules: 1) construction of convolutional (conv) layers through subspace approximations, and 2) construction of fully-connected (FC) layers via training sample clustering and least-squared regression (LSR). They are elaborated below. The construction of conv layers is realized by multi-stage Saab transforms [46]. The Saab transform is a variant of the principal component analysis (PCA) with a constant bias vector to annihilate activation’s nonlinearity. The Saab transform can reduce feature redundancy in the spectral domain, yet there still exists correlation among spatial dimen- sions of the same spectral component. This is especially true in low-frequency spectral components. Thus, a channel-wise PCA (C-PCA) was proposed in [14] to reduce spatial redundancy of Saab coefficients furthermore. Since the construction of conv layers is unsupervised, they can be fully adopted in an SSL system. The construction of FC layers is achieved by the cascade of multiple rectified linear least-squared regressors (LSRs) [46]. Let the input and output dimensions of a FC layer beN in andN out (withN in > N out ), respectively. To construct an FC layer, we cluster input samples of dimensionN in intoN out clusters, and assign pseudo-labels based on clustering results. Next, all samples are transformed into a vector space of dimension N out via LSR, where the index of the output space dimension defines a pseudo-label. In this way, we obtain a supervised LSR building module to serve as one FC layer. It accommodates intra-class variability while enhancing discriminability gradually. 70 5.3 Semi-Supervised Learning System 5.3.1 Design of the Semi-supervised Learning System Problem Formulation and Data Pre-processing. The semi-supervised classification problem can be defined as follows. 
We have a set of M unlabeled samples

X^{ul} = \{x^{ul}_1, \dots, x^{ul}_M\},

where x^{ul}_i \in \mathbb{R}^{D_{in}} is the ith input unlabeled sample, and a set of labeled samples that can be written in the form of pairs:

(X^l, Y^l) = \{(x^l_1, y^l_1), \dots, (x^l_N, y^l_N)\},

where x^l_i \in \mathbb{R}^{D_{in}} is the ith input labeled sample and y^l_i \in \{1, \dots, L\} is its class label. We omit the index i whenever the context is clear. We adopt the multi-stage Saab transforms and C-PCA for unsupervised image feature extraction. They can be expressed as

z^l = T_{saab}(x^l) \quad \text{and} \quad z^{ul} = T_{saab}(x^{ul}),

where z^l, z^{ul} \in \mathbb{R}^{D_{out}} and D_{in} > D_{out}. This is used to facilitate classification with more powerful image features in a lower-dimensional space.

Pseudo-Label Assignment.
Next, we propose a semi-supervised method for the design of FC layers using multi-stage rectified LSRs as illustrated in Fig. 5.1. The pseudo-labels should be generated for both labeled and unlabeled samples in solving the LSR problem. To achieve this goal, we conduct K-means clustering on labeled samples of the same class. For example, we can cluster samples of a single original class into M sub-classes to generate M pseudo-categories. If there are N original classes, we will generate MN pseudo-categories in total. Then, pseudo-labels of labeled data can be generated by representing pseudo-categories with one-hot vectors, denoted as

Y^p = \{y^p_1, \dots, y^p_M\}.

Figure 5.1: The proposed semi-supervised learning (SSL) system.

The centroid of the jth pseudo-category, denoted by c_j, provides a representative sample of the corresponding original class. We define the probability vector of the pseudo-category for each unlabeled sample as

p(t_k \mid z^{ul}) = \frac{e^{\gamma d_k}}{\sum_j e^{\gamma d_j}},   (5.1)

where t_k represents the kth pseudo-category, \gamma is a scaling parameter, and

d_k = \frac{z^{ul} \cdot c_k}{\|z^{ul}\| \, \|c_k\|}.   (5.2)

Then, the probability vector in (5.1) can be used as the pseudo-label of each unlabeled sample to set up a system of linear regression equations that relates the input data samples and the pseudo-labels,

[Y^p, P(T \mid Z^{ul})] = W_{fc} [Z^l, Z^{ul}],

where the parameters of the FC layer are given by

W_{fc} = (Z^T Z)^{-1} Z^T Y,   (5.3)

and where

Z = [Z^l, Z^{ul}] \quad \text{and} \quad Y = [Y^p, P(T \mid Z^{ul})].   (5.4)

The final output of a one-stage rectified LSR is

z^l_{out} = f(W_{fc} z^l) \quad \text{and} \quad z^{ul}_{out} = f(W_{fc} z^{ul}),   (5.5)

where f(\cdot) is a non-linear activation function (e.g., ReLU in our experiments), and z^l_{out} and z^{ul}_{out} denote the outputs of labeled and unlabeled data, respectively. The output vectors lie in a lower-dimensional space. They are used as features for the next-stage rectified LSR.

Unlabeled Sample Selection.
Not every unlabeled sample is suitable for constructing FC layers. We define a quality score for each unlabeled sample z^{ul} as

S_i(z^{ul}) = \frac{\sum_{k \in C_i} p(t_k \mid z^{ul})}{\sum_j p(t_j \mid z^{ul})},   (5.6)

where C_i indicates the set of pseudo-categories that belong to the original ith class. A low quality score indicates that the sample is far away from the representative set of examples of a single original class. We exclude those samples in solving the LSR problems.

Multi-stage LSRs.
We repeat several LSRs and finally provide the predicted class labels. In the last stage, we cluster input data based on their original class labels and the LSR is solved using labeled samples only. The multi-stage setting is needed to remove feature redundancy in the spectral dimension and resolve intra-class variability gradually.

5.3.2 Proposed Ensemble System

Ensembles are often used to combine results from multiple weak classifiers to yield a stronger one [83].
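Before detailing the ensemble, the pieces of Sec. 5.3.1 just introduced can be summarized in a short sketch: the soft pseudo-labels of Eqs. (5.1)-(5.2), the quality score of Eq. (5.6), and one rectified LSR stage in the spirit of Eqs. (5.3)-(5.5). The sketch assumes that d_k is the cosine similarity between a sample and a pseudo-category centroid, uses a scaling parameter of 50 as in our experiments, and treats a sample's overall quality score as its best per-class value (one reasonable reading of the selection rule); all function names and toy data are illustrative.

```python
import numpy as np

def pseudo_label_probs(Z, centroids, gamma=50.0):
    """Eqs. (5.1)-(5.2): softmax over the cosine similarity to pseudo-category centroids."""
    zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    cn = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    d = zn @ cn.T                                            # cosine similarity d_k
    e = np.exp(gamma * (d - d.max(axis=1, keepdims=True)))   # numerically stabilized
    return e / e.sum(axis=1, keepdims=True)

def quality_score(P, pseudo_to_class, i):
    """Eq. (5.6): probability mass on the pseudo-categories of the i-th original class."""
    mask = np.asarray(pseudo_to_class) == i
    return P[:, mask].sum(axis=1) / P.sum(axis=1)

def selection_mask(P, pseudo_to_class, keep_fraction=0.7):
    """Keep unlabeled samples whose best per-class quality score is in the top fraction."""
    classes = np.unique(pseudo_to_class)
    scores = np.max([quality_score(P, pseudo_to_class, i) for i in classes], axis=0)
    return scores >= np.quantile(scores, 1.0 - keep_fraction)

def rectified_lsr(Z_l, Y_l, Z_ul, P_ul):
    """Least-squared regression on labeled plus selected unlabeled data, then ReLU."""
    Z = np.vstack([Z_l, Z_ul])
    Y = np.vstack([Y_l, P_ul])
    W = np.linalg.lstsq(Z, Y, rcond=None)[0]   # same solution as Eq. (5.3)
    return W, np.maximum(Z_l @ W, 0), np.maximum(Z_ul @ W, 0)

# Toy data: 6 pseudo-categories (3 per class), 40 labeled and 100 unlabeled samples.
rng = np.random.default_rng(0)
centroids = rng.normal(size=(6, 16))
Z_l, Z_ul = rng.normal(size=(40, 16)), rng.normal(size=(100, 16))
Y_l = np.eye(6)[rng.integers(0, 6, size=40)]       # one-hot pseudo-labels of labeled data
P_ul = pseudo_label_probs(Z_ul, centroids)         # soft pseudo-labels of unlabeled data
keep = selection_mask(P_ul, [0, 0, 0, 1, 1, 1])    # drop low-quality unlabeled samples
W, out_l, out_ul = rectified_lsr(Z_l, Y_l, Z_ul[keep], P_ul[keep])
print(out_l.shape, out_ul.shape)                   # (40, 6) (70, 6)
```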
We use the ensemble idea to achieve better performance in semi- supervised classification. Although both BP-CNNs and FF-CNNs can be improved by ensemble methods, FF-CNNs have much lower training complexity to justify an ensemble solution. The proposed ensemble system is illustrated in Fig. 5.2. Multiple semi-supervised FF-CNNs are adopted as the first-stage base classifiers in an ensemble system. Their output decision vectors are concatenated as new features. Afterwards, we apply the principal component analysis (PCA) technique to reduce feature dimen- sion and then feed the dimension-reduced feature vector to the second-stage ensemble classifier. 74 Figure 5.2: An ensemble of multiple SSL decision systems. High input diversity is essential to an effective ensemble system that can reach a higher performance gain [9, 42]. In the proposed ensemble system, we adopt three strategies to increase input diversity. First, we consider different filter sizes in the conv layers as illustrated in Table 5.1 since different filter sizes lead to different features at the output of the conv layer with different receptive fields. Second, we represent color images in different color spaces [34]. In the experiments, we choose the RGB, YCbCr and Lab color spaces and process three channels separately for the latter two color spaces. Third, we decompose images into a set of feature maps using the 3x3 Laws filters [47], where each feature map focuses on different characteristics of the input image (e.g., brightness, edginess, etc.) 75 Table 5.1: Network architectures with respect to different input types and different conv layer parameter settings. The second to fourth columns indicate inputs from MNIST, RGB inputs and the single channel inputs from SVHN and CIFAR-10, accordingly. Greyscale RGB Single Channel FF-1 Conv1 551, 6 553, 32 551, 16 Conv2 556, 16 5532, 64 5516, 32 FF-2 Conv1 331, 6 333, 24 331, 8 Conv2 556, 16 5524, 64 558, 32 FF-3 Conv1 551, 6 553, 32 551, 16 Conv2 336, 16 3332, 64 3316, 32 FF-4 Conv1 331, 6 333, 24 331, 8 Conv2 336, 16 3324, 48 338, 24 5.4 Experimental Results We conduct experiments on three popular datasets: MNIST [49], SVHN [60] and CIFAR-10 [40]. We randomly select a subset of labeled training data from them, and test the classification performance on the entire testing set. Each object class has the same number of labeled data to ensure balanced training. We adopt the LeNet-5 architecture [49] for the MNIST dataset. Since CIFAR-10 is a color image dataset, we increase the filter numbers of the first and the second conv layers and the first and the second FC layers to 32, 64, 200 and 100, respectively, by following [46]. The C-PCA is applied to the output of the second conv layer and the feature dimension per channel is reduced from 25 to 20 (for MNIST) or 15 (for SVHN) or 12 (for CIFAR-10). The probability vectors are computed using Eq. (5.1), where is set to 50 for all three datasets. The Radial Basis Function (RBF) SVM classifier is used as the second-stage classi- fier in the ensemble systems in all experiments. Before training the SVM classifier, PCA is applied to the cascaded decision vectors of first-stage classifiers. The reduced feature dimension is determined based on the correlation of decision vectors of base classifiers in an ensemble. 76 5.4.1 Individual Semi-Supervised FF-CNN We compare the performance of the BP-CNN and the proposed semi-supervised FF- CNN on three benchmark datasets in Fig. 5.3. 
For MNIST and SVHN, 1/2, 1/4, 1/8, 1/16,1/32, 1/64, 1/128, 1/256, 1/512 of the entire labeled training set are randomly selected to train the networks. As to CIFAR-10, 1/2, 1/4, 1/8, 1/16,1/32, 1/64, 1/128, 1/256 of the whole labeled dataset are used to learn network parameters. We use the labeled data to train the BP-CNNs. We select unlabeled training data with quality scores of top 70%, 70% and 80% in MNIST, SVHN and CIFAR-10, respectively, in the training of the corresponding semi-supervised FF-CNN. Table 5.2: Testing accuracy (%) comparison under three settings: 1) without using unla- beled data; 2) using the entire set of unlabeled data; and 3) using a subset of unlabeled data based on quality scores defined by Eq. (5.1). 1/256 of the labeled data is used on MNIST and SVHN datasets, and 1/128 of the labeled data is used on CIFAR-10 dataset. MNIST SVHN CIFAR-10 Setting 1 57.19 ( 3.4) 25.17 ( 1.10) 24.64 ( 0.25) Setting 2 92.26 ( 0.33) 53.76 ( 0.97) 41.95 ( 0.56) Setting 3 92.65 ( 0.14) 58.58 ( 0.78) 42.53 ( 0.57) 77 Figure 5.3: The comparisons of testing accuracy (%) using BP-CNNs, and semi- supervised FF-CNNs on MNIST, SVHN and CIFAR-10 datasets, respectively. 78 Figure 5.4: The comparisons of testing accuracy (%) using BP-CNNs, semi-supervised FF-CNNs, and ensembles of semi-supervised FF-CNNs on the small labeled portion against MNIST, SVHN and CIFAR-10 datasets, respectively. 79 When using the entire labeled training set, the semi-supervised FF-CNN is exactly the same as the FF-CNN. There is a performance gap between FF-CNN and BP-CNN at the beginning of the plots. However, when the number of labeled training data is reduced, the performance degradation of the semi-supervised FF-CNN is not as severe as that of the BP-CNN and we see cross-over points between these two networks in all three datasets. For the extreme cases, we see that semi-supervised FF-CNNs outperform BP-CNNs by 17.1%, 8.8%, and 6.9% in testing accuracy with 110, 120, and 190 labeled data for MNIST, SVHN and CIFAR-10, respectively. The results show that the proposed semi-supervised FF-CNNs can learn from the unlabeled data more effectively than the corresponding BP-CNNs. Figure 5.5: The relation between test accuracy (%) and the number of semi-supervised FF-CNNs in the ensemble, where three diversity types are indicated as T1, T2 and T3, and T0 indicates the individual semi-supervised FF-CNN. The experiments are con- ducted on CIFAR-10 with 1/16 of the whole labels. To evaluate the effectiveness of several unlabeled data usage ideas, we compare three different settings in Table 5.2. The best results come from using selected unlabeled 80 training data. This is particularly obvious for the SVHN dataset. There is around 5% performance gain by eliminating low quality unlabeled samples. 5.4.2 Ensembles of Multiple Semi-Supervised FF-CNNs The performance of the proposed ensemble system by fusing all diversity types is shown in Fig. 5.4. We see that ensembles can boost classification accuracy by a large gain. There are about 2%, 5% and 8% performance improvements against individual semi- supervised FF-CNNs, and the ensemble results with the smallest labeled portion achieve test accuracy of 89.6%, 49.5%, and 41.4% for MNIST, SVHN and CIFAR-10, respec- tively. We further examine the performance of different diversity types, and show the rela- tion between test accuracy and ensemble complexity in Fig. 
5.5, where we test ensemble systems using different diversity types: 1) an ensemble of four semi-supervised FF- CNNs with varied filter sizes and filter numbers in two conv layers as listed in Table 5.1 (T1); 2) an ensemble of color input images in different color spaces (i.e. RGB, YCbCr, and Lab), where three channels separately for the latter two color spaces are treated sep- arately (T2); and 3) an ensemble of nine semi-supervised FF-CNNs obtained by taking different input images computed from filtered greyscale images with 3x3 Laws filters [63] (T3). As shown in the figure, the most efficient ensemble system among all designs is to fuse four different semi-supervised FF-CNNs with T1 diversity which yields a per- formance gain of 6.4%. In general, an ensemble of more semi-supervised FF-CNNs provides higher testing accuracy. The best performance achieved is 63.1% by combin- ing all three diversity types. As compared with that of the single FF-CNN trained with all labeled data, the performance of the ensemble of semi-supervised FF-CNNs trained with 1/16 labeled data is only slightly lower by 0.6%. 81 5.5 Conclusion A semi-supervised learning framework using FF-CNNs for image classification was proposed. It was demonstrated by experimental results on three benchmark datasets (MNIST, SVHN, and CIFAR-10) that the semi-supervised FF-CNNs offer an effective solution. The ensembles of multiple semi-supervised FF-CNNs can boost the perfor- mance furthermore. Two extensions of this work are under current investigation. One is incremental learning. The other is decision fusion in the spatial domain. FF-CNNs provide a more convenient tool than the BP-CNN in both contexts. We will explore them furthermore in the near future. 82 Chapter 6 PixelHop: A Successive Subspace Learning (SSL) Method for Object Recognition 6.1 Introduction Being inspired by interpretable deep learning research, we introduce a new machine learning paradigm, called successive subspace learning (SSL) in this chapter. The sub- space technique has been widely used in signal/image processing, pattern recognition, computer vision, etc. [72], [5], [39], [53], [29], [76]. It may have different names and slightly different definitions in different contexts such as manifold learning [75], [51]. A subspace may denote a dominant feature space where less relevant features are dropped. One example is the principal component analysis (PCA). A subspace may also refer to a certain object class such as the subspace of digit ”0” in the MNIST dataset. The former can be achieved by exploiting signal statistics (i.e., unsupervised dimension reduction) while the latter can be accomplished by using signal labels (i.e., supervised dimension reduction). Generally speaking, the subspace representation offers a powerful and pop- ular tool for signal analysis, modeling and processing. They exploit statistical properties of underlying signals and/or labeled signals to determine a smaller yet significant sub- space for further processing. 83 Existing subspace methods are conducted in a single stage. It is natural to ask whether there is any advantage to perform subspace learning in multiple stages. Research on generalizing from one-stage subspace methods to multi-stage subspace methods is rare. Two PCA stages are cascaded in the PCAnet [11]. Being motivated by multiple convolutional layers in convolutional neural networks (CNNs), Kuo et al. 
proposed the Saak (subspace approximation via augmented kernels) transform [45] and the Saab (subspace approximation via adjusted bias) transform [46]. The Saak and Saab transforms are variants of PCA. They are carefully designed to handle the sign confusion problem [43], which occurs when multi-stage linear/affine systems are in cascade. It was argued in [43] that the nonlinear activation unit in CNNs can eliminate the sign confu- sion problem. Furthermore, Kuo et al. [46] interpreted a fully-connected (FC) layer as a linear least regression (LLS) regressor. It can be viewed as another dimension-reduction subspace method that exploits labeled training data. This research is the sequel of previous work in [43], [44], [45], [46]. They are inte- grated into one unified framework. With this new viewpoint, the whole CNN pipeline contains multiple subspace processing modules in cascade and the model parameters are determined layer-by-layer in a feedforward manner. This new learning methodol- ogy is called successive subspace learning (SSL). Although there is a strong similarity between the testing procedures in the feedforward CNN path and the SSL, they are fundamentally different in model formulation, training principle and complexity, and trained model properties. We will discuss the similarities and differences between deep learning and SSL furthermore for the rest of the chapter. Instead of using the SSL principle to interpret CNNs, we propose a new SSL-based object classification method and call it the PixelHop method. It is worthwhile to point out that a 3D point cloud classification scheme, called the PointHop method [84], was 84 designed using the SSL principle. To illustrate the principle in the context of image- based object recognition, we propose a PixelHop method. The term “hop” is borrowed from graph theory. For a target node in a graph, its immediate neighboring nodes con- nected by an edge are called its one-hop neighbors. Its neighboring nodes connected to itself through n consecutive edges via the shortest path are n-hop neighbors. The PixelHop method begins with a very localized subspace, where the attributes of each pixel define a pixel-subspace. Next, we concatenate the attributes of a pixel, denoted byp and attributes of its 1-hop neighbors to form a larger subspace formed by one-hop neighborhood denoted byN 1 (p). We can continue to enlarge the subspace by includ- ing larger and larger neighborhoods. If we implement it in a straightforward manner, the dimension of subspaceN l (p) will grow very fast as l becomes larger. To control the rapid dimension growth of subspaceN l (p), we can use the Saab/Saak transform to reduce the dimension ofN l (p) at each stage withl = 1; 2; . This offers an efficient representation of an enlarged receptive field with respect to pixelp. The PixelHop method provides a rich set of representations,N l (p), wherep can be any point in the image and across multiple stages,l = 0; 1;;L, and whereL is the maxi- mum hop number. We need to develop effective aggregation methods to extract features obtained from near- to far-neighborhoods, which is controlled by stage parameterl, of different pixelsp. Furthermore, we propose a new supervised dimension reduction tech- nique to reduce the feature number. After that, we concatenate local-to-global features and then train a classifier, such as the support vector machine (SVM) [16] and the ran- dom forest (RF) [8], to provide the final classification result. 
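To make the dimension-control issue concrete, the sketch below shows how the attribute dimension of N_l(p) would grow if neighbor attributes were simply concatenated at every hop, and how a per-hop reduction keeps it bounded. A plain PCA stands in for the Saab/Saak transform, neighbors are simulated by random indices rather than true spatial neighbors, and all sizes are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

def grow_one_hop(attrs, n_neighbors, keep_dim):
    """Concatenate each pixel's attributes with those of its neighbors, then reduce.

    attrs: (n_pixels, k) attributes from the previous hop.  For simplicity the
    neighbors are drawn at random; a real implementation would use the actual
    spatial neighbors of each pixel.
    """
    n_pixels, _ = attrs.shape
    rng = np.random.default_rng(0)
    neighbor_idx = rng.integers(0, n_pixels, size=(n_pixels, n_neighbors))
    expanded = np.hstack([attrs] + [attrs[neighbor_idx[:, j]] for j in range(n_neighbors)])
    reduced = PCA(n_components=keep_dim).fit_transform(expanded)   # stand-in for Saab/Saak
    print(f"exact subspace dim {expanded.shape[1]:4d} -> approximated dim {keep_dim}")
    return reduced

# Start from 3-dimensional pixel attributes (e.g. RGB) and grow three hops.
attrs = np.random.default_rng(1).normal(size=(1024, 3))
for hop, keep in enumerate([12, 24, 40], start=1):
    attrs = grow_one_hop(attrs, n_neighbors=8, keep_dim=keep)
```

Without the per-hop reduction, the exact subspace dimension would be multiplied by roughly (number of neighbors + 1) at every hop, which is exactly the rapid growth the Saab/Saak approximation is meant to control.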
Extensive experiments are conducted on three datasets (namely, MNIST, Fashion MNIST and CIFAR-10 datasets) to evaluate the performance of the PixelHop method. It is shown by experimental results that the PixelHop method can outperform classical CNNs of similar model complexity while demanding much lower training complexity. 85 The rest of this chapter is organized as follows. The SSL framework is first intro- duced in Sec. 6.2. Then, the PixelHop method is presented in Sec. 6.3. Experimental results and comparison between the deep learning and the SSL are given in Sec. 6.4. Finally, concluding remarks are drawn in Sec. 6.6. 6.2 Successive Subspace Learing (SSL) Deep learning (DL), including convolutional neural networks (CNNs) and recurrent neu- ral networks (RNNs), is a parametric learning method. One selects a network of a fixed architecture to begin with. The network often has a large number of parameters in order to achieve good performance. The DL determines the model parameters using an end-to-end optimization approach. It is implemented by backpropagation (BP). The DL solution has several limitations. First, it demands a lot of computing resources in model training. As the number of layers goes extremely deep, say 100 and 150 layers, inference can be very expensive. Second, it is a heavily supervised learning methodol- ogy. A large number of labeled training data are needed to train DL models. Third, it is challenging to adapt a trained DL model to new data classes/samples. Fourth, the DL model is a black-box tool. Many of its properties are not well understood. Being motivated by the above observations, it is desired to explore a new machine learning paradigm to achieve the following objectives. The learning system is mathematically transparent. The learning system can handle both small and large datasets using an expandable model. 86 The learning system is scalable in the sense that its model complexity can be adjusted flexibly based on hardware constraints with graceful performance trade- off. The superior performance of DL is attributed to an extremely large model size. Tradi- tional parametric learning methods do not have enough model parameters to deal with datasets of a larger number of samples with rich diversity. To design a scalable learning system, we look for a nonparametric approach. Along this direction, we develop an SSL methodology in this work. The SSL contains the following four high-level ideas: 1. successive local-to-global subspace expansion, 2. subspace approximation, 3. label-assisted dimension reduction, 4. aggregation and classification. For idea #1, we intend to compute local-to-global attributes of a set of selected pixels by aggregating the information of their near- and far-neighboring pixels in successive stages. This design is to ensure that only local communication is required at each stage. The information of near and far neighboring pixels can be propagated to the target pixel through local information exchange, where the information can be the gray-scale pixel value of the RGB pixel value for gray-scale and color images, respectively. The infor- mation of the nearest neighbors can be exchanged in one hop while that of far neighbors can be exchanged through multiple hops. For this reason, the image-based SSL method is called the PixelHop method. As the hop number becomes larger, we build a subspace consisting of more pixels (i.e. a larger receptive field). 
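A toy illustration of this local information exchange: after l rounds in which every pixel exchanges information only with its immediate (one-hop) neighbors, the center pixel has effectively seen a (2l+1) x (2l+1) window. The sketch below verifies this with a simple mask-propagation simulation; it is purely illustrative.

```python
import numpy as np

def hop_reach(l, size=15):
    """Simulate l rounds of one-hop (8-neighbor) exchange on a grid and return
    the number of pixels whose information has reached the center pixel."""
    reached = np.zeros((size, size), bool)
    reached[size // 2, size // 2] = True
    for _ in range(l):
        padded = np.pad(reached, 1)
        new = np.zeros_like(reached)
        # A pixel is reached if it or any of its 8 immediate neighbors was reached.
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                new |= padded[1 + dy:1 + dy + size, 1 + dx:1 + dx + size]
        reached = new
    return reached.sum()

for l in range(1, 5):
    print(f"{l}-hop neighborhood covers {hop_reach(l)} pixels")   # (2l+1)^2
```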
87 For idea #2, if we do not perform dimension reduction, the attribute dimension of an exact subspace will grow rapidly. To control the dimension growing speed without hurt- ing the representation power of the subspace, we need to find an approximate subspace. This can be achieved by an unsupervised dimension reduction technique. Specifically, we can exploit correlations between an image pixel and its neighboring pixels to reduce the dimension of the exact subspace formed by them via subspace approximation. The PCA is a one-stage subspace approximation technique. When we consider successive subspace approximation in the context of SSL, we use the Saab and/or the Saak trans- form to eliminate the sign confusion problem. This topic is detailed in [43], [45], [46]. For idea #3, the attribute dimension of a small-hop subspace is lower while that for a large-hop subspace is higher. Yet, more pixels should be selected in an earlier stage since each pixel has a small-hop subspace. As a result, by concatenating attributes of all selected points, the total attribute dimension of each stage will be high. Furthermore, we need to integrate attributes across all stages. The ultimate dimension of aggregated attributes is prohibitively high. It demands further dimension reduction. This can be achieved by exploiting data labels. For an object class, the attributes of its samples follow a certain distribution in a high-dimensional space. Our goal is to find a lower- dimensional manifold to represent samples of each object class. Along this line, we develop a supervised dimension reduction technique in Sec. 6.3.3. For idea #4, we aggregate dimension-reduced attributes from all stages to form the feature vector and train a multi-class classifier for the classification task. The SSL methodology has been applied to various data types, including images, 3D point clouds [84] and graphs. Our current work is concerned with image classification. We call it the PixelHop method and will present it in Sec. 6.3. 88 6.3 PixelHop Method We present a new method to illustrate the SSL methodology for the image classifica- tion application in this section. It is called the PixelHop method. We will first give an overview of the PixelHop system in Sec. 6.3.1. One of the main organization units of the PixelHop system is the PixelHop unit. We will consider several design issues asso- ciated with the PixelHop unit in Sec. 6.3.2. Another important organization unit of the PixelHop system is the Label-Assisted Regressor (LAG) unit, which will be examined in detail in Sec. 6.3.3. Figure 6.1: An overview of the PixelHop method. 6.3.1 System Overview An overview of the PixelHop system is given in Fig. 6.1. The input can be graylevel or color images. They are fed into a sequence ofL PixelHop units in cascade to obtain the l-hop attributes, wherel = 1; 2;L. The attribute vectors froml-hops,l = 1; 2;L, are fed into the Label-Assisted reGressor (LAG) unit for further dimension reduction to generateM attributes per unit. Finally, they are concatenated to form the final feature 89 vector of dimensionML for image classification. The PixelHop system consists of three main modules. Their detailed functions are elaborated below. Module #1: A Sequence of PixelHop Units in Cascade. The main purpose of this module is to derive attributes from near-to-far neighborhoods of a set of selected pixels. Each PixelHop unit can be viewed as an SSL unit which takes a sample point at a spatial position as the input and learns its neighboring subspace. 
To control the dimension of the new subspace, we adopt the Saab transform and a new sampling strategy. A whole-image representation can be created by applying the PixelHop unit at all positions. Finally, we use spatial pooling to further reduce the dimensions of the new image representation, and this pooling process also helps to enlarge the neighborhood size in the following PixelHop unit. The lth PixelHop unit, where l = 1, ..., L, concatenates the (l-1)-hop attributes of a target pixel and its N_l neighboring pixels to form a larger subspace whose dimension is (N_l + 1) times the (l-1)-hop attribute dimension.

Figure 6.2: (a) Illustration of the selected positions in a 5x5 neighborhood. (b) Illustration of the PixelHop unit and subspace-based dimensionality reduction.

A sample point integrates its 5x5 neighbors to learn higher-dimensional attributes. To manage the dimension increase along subspace growing and avoid attribute redundancy, we adopt two strategies: 1) we select neighboring points along eight directions by skipping one pixel, as shown in Fig. 6.2(a), for all PixelHop units except the first one; 2) the Saab transform is used to reduce the dimension of the expanded attributes.

To classify images of size 32x32, we cascade four PixelHop units to generate local-to-global image representations. For each PixelHop unit, we conduct the subspace learning on 5x5 neighborhoods. The first PixelHop unit takes raw pixel values as inputs. Its color information is called the 0-hop attributes, with one dimension for gray-level images and three dimensions for color images. We take all 24 neighbors into account to construct the 1-hop attributes. The Saab transform is learned on 5x5 neighborhoods to build the 1-hop subspace with a dimension of K_1, and the Saab coefficients are the 1-hop attributes. The Saab transform consists of three modules: 1) DC/AC separation, 2) PCA and 3) bias addition. The number of AC Saab filters is determined by the total energy preserved by the PCA coefficients. After applying the PixelHop unit to the whole image, we generate a new image representation with a size of 32x32xK_1. Here, we adopt zero-padding at the boundaries to retain the input spatial size. At each position, the sample point has two pixels overlapping with its adjacent points, so we can apply spatial pooling to reduce this redundancy and capture more discriminative attributes. We use max-pooling in all our experiments. Finally, the output of the first PixelHop unit has a dimension of 16x16xK_1.

For the other PixelHop units, we can repeat the above process to build the (i+1)-hop attributes based on the i-hop attributes. The only difference is that, when counting the neighboring positions, we select only 8 positions instead of 24, as illustrated in Fig. 6.2(a). The selected 8 points cover a different area from that of the center point, and in this way we can further reduce the dimension of the new attributes. The image representations generated after each subsequent PixelHop unit have dimensions of 8x8xK_2, 4x4xK_3, and 2x2xK_4, respectively. We can determine the proper values of K_1, K_2 and K_3 flexibly depending on the statistics of the input images.

Module #2: Spatial Pooling and Supervised Dimension Reduction. After a PixelHop unit, we obtain a new image representation with the same spatial size as the input. We use max-pooling to decrease the spatial size and eliminate the highly correlated attributes among adjacent points. It also helps to cover a larger neighboring space in the next PixelHop unit.
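A minimal sketch of the neighborhood construction and pooling just described is given below: the first unit gathers the full 5x5 window (the center plus 24 neighbors), later units gather the center plus the 8 dilated positions obtained by skipping one pixel, zero-padding keeps the spatial size, and a 2x2 max-pooling halves it afterwards. The Saab reduction itself is omitted here, and the function names are illustrative.

```python
import numpy as np

def gather_neighborhood(fmap, first_unit=True):
    """Concatenate neighboring attributes for every pixel of an (H, W, K) feature map.

    The first unit uses the full 5x5 window; later units use only the center plus
    8 positions at distance 2 along the horizontal, vertical and diagonal
    directions (one pixel skipped), as in Fig. 6.2(a).
    """
    H, W, K = fmap.shape
    padded = np.pad(fmap, ((2, 2), (2, 2), (0, 0)))           # zero-padding, radius 2
    if first_unit:
        offsets = [(dy, dx) for dy in range(-2, 3) for dx in range(-2, 3)]
    else:
        offsets = [(dy, dx) for dy in (-2, 0, 2) for dx in (-2, 0, 2)]
    patches = [padded[2 + dy:2 + dy + H, 2 + dx:2 + dx + W, :] for dy, dx in offsets]
    return np.concatenate(patches, axis=-1)                    # (H, W, K * len(offsets))

def max_pool_2x2(fmap):
    """2x2 max-pooling to halve the spatial size between PixelHop units."""
    H, W, K = fmap.shape
    return fmap[:H - H % 2, :W - W % 2, :].reshape(H // 2, 2, W // 2, 2, K).max(axis=(1, 3))

x = np.random.default_rng(0).normal(size=(32, 32, 3))
print(gather_neighborhood(x).shape)                  # (32, 32, 75): full 5x5 window
print(max_pool_2x2(gather_neighborhood(x)).shape)    # (16, 16, 75)
```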
Conduct supervised dimension reduction on attributes from multiple PixelHop units to obtain the desired feature set. Module #3: Feature Aggregation across Cascaded Units and Classification. We train an SVM classifier with the radial basis function (RBF) kernel based on extracted feature vectors obtained from multiple PointHop units. 6.3.2 Design of PixelHop Units Figure 6.3: Illustration of non-overlapping and overlapping multi-stage Saab transforms. Our idea of unsupervised feature learning is centered around “subspace learning” or “subspace approximation”. Consider a set of image cuboids of size (nn)k, where 92 nn andk i are spatial and spectral dimensions of the input cuboid, respectively. By treating each variable in image cuboids as one dimension, we get a vector spaceV = R n 2 k of dimensionn 2 k. Our objective is to find a subspace,W , that preserves the most significant information of V for further processing or decision. The dimension of W can be chosen flexibly depending on resource availability and performance requirement. Many dimension reduction techniques can be used to achieve this goal. A variant of the principal component analysis (PCA) called the Saab transform is chosen here for its simplicity. When n is larger, it is not a good idea to perform subspace learning in one step. Instead, one can divide images into smaller patches and conduct subspace approxima- tions in multiple steps successively. Special attention should be paid to the cascade of two consecutive subspace learning steps, which is thoroughly discussed in [43], [45], [46]. It is unsupervised since the subspace learning process does not need image labels. Also, subspace learning (especially those in early stages) is relatively stable with respect to the sample size. This will be demonstrated in this section. The feature learning module in the SSL approach is based on Saab transforms. The Saab transform is a variant of the principal component analysis (PCA). It selects a con- stant bias value for all Saab filters to handle the sign confusion problem identified in [44]. All Saab filters are derived from the statistics of the input data in a supervised fashion without BP. To transform images of larger sizes, multi-stage Saab transforms can be utilized. which successively provide a joint spatial-spectral representation. The discriminant power of Saab coefficients from each stage is increased gradually due to a larger receptive field. The multi-stage Saab transforms can be conducted in the non-overlapping and over- lapping manners to transform images of a larger size as demonstrated in Fig. 6.3. In this section, we will use the non-overlapping one to demonstrate this unsupervised feature 93 learning process from four aspects: 1) analyze the spatial size of the anchor vectors; 2) study the depth of a multi-stage Saab transform; 3) examine the convergence behavior of covariance matrices and anchor vectors as a function of the number of training samples; 4) investigate the inverse process of Saab transforms. To begin with, we decompose an input into a set of non-overlapping patches or cuboids, denoted asP i . Then, we conduct the Saab transform on each patch or cuboid to become a spectral vector with a dimension ofK i . The whole process terminates when we reach the last stage that the output has a spatial dimension of 1 1. All intermediate transform coefficients, called Saab coefficients, can serve as discriminant features of the input image. 
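The following sketch illustrates one plausible reading of a single-stage Saab transform on flattened patches: the DC component is separated out, PCA is applied to the AC part, and a constant bias large enough to keep all training responses non-negative is added. It is a simplified illustration (it assumes a unit-normalized DC kernel and a single shared bias), not the exact implementation used in this thesis.

```python
import numpy as np

def fit_saab(patches, n_ac_kernels):
    """One-stage Saab transform fitted on flattened patches of shape (n, d).

    Returns (kernels, bias): the DC kernel plus the leading AC kernels from PCA,
    and a constant bias chosen so that all training responses are non-negative.
    """
    n, d = patches.shape
    dc_kernel = np.ones(d) / np.sqrt(d)                   # DC direction
    dc = patches @ dc_kernel                              # DC coefficients
    ac_part = patches - np.outer(dc, dc_kernel)           # remove DC before PCA
    ac_part = ac_part - ac_part.mean(axis=0)
    cov = ac_part.T @ ac_part / n                         # PCA on the AC part
    eigval, eigvec = np.linalg.eigh(cov)
    order = np.argsort(eigval)[::-1][:n_ac_kernels]       # kernels sorted by energy
    kernels = np.vstack([dc_kernel, eigvec[:, order].T])  # (1 + n_ac, d)
    responses = patches @ kernels.T
    bias = max(0.0, -responses.min())                     # constant bias -> non-negative outputs
    return kernels, bias

def apply_saab(patches, kernels, bias):
    return patches @ kernels.T + bias

rng = np.random.default_rng(0)
train_patches = rng.normal(size=(5000, 25))                # e.g. flattened 5x5 patches
kernels, bias = fit_saab(train_patches, n_ac_kernels=6)
print(apply_saab(train_patches[:3], kernels, bias).shape)  # (3, 7): DC + 6 AC coefficients
```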
Multi-stage Saab transforms can provide a set of spatial-spectral representations. They are powerful representations that can be used in many applications such as object classification, image synthesis and image processing.

Spatial size of the anchor vectors. Fig. 6.4 visualizes the AC anchor vectors of the first-stage Saab transform with different spatial sizes. We can observe that the first several anchor vectors learn low-frequency patterns, such as edges and corners, and that those patterns are stable with respect to different spatial sizes. High-frequency patterns can also be learned, by anchor vectors with small energy portions.

Figure 6.4: Visualization of Saab kernels of spatial sizes 5x5, 6x6 and 8x8.

To understand the proper spatial size of the anchor vectors in each stage, we calculate the Pearson correlation coefficient along three directions: horizontal, vertical and 45-degree. The results are reported in Fig. 6.5. We compute the correlation between two pixel locations at different spatial distances on the MNIST, Fashion MNIST and CIFAR-10 datasets. We can observe that the trends of the correlation curves are different among these three datasets. The correlation converges to zero at a distance of about 7 pixels on the MNIST dataset, but there is long-range correlation on the Fashion MNIST and CIFAR-10 datasets. This indicates that Fashion MNIST and CIFAR-10 are more complicated datasets that require larger receptive fields and deeper architectures when constructing the multi-stage Saab transform.

Figure 6.5: The plot of Pearson correlation coefficients with respect to different spatial distances along 0-, 45-, and 90-degree lines for the MNIST, the Fashion MNIST, and the CIFAR-10 datasets.

Depth of a multi-stage Saab transform. We study the depth of a multi-stage Saab transform from the viewpoint of energy preservation. We plot the cumulative energy distributions of the last-stage Saab coefficients using different depths, as shown in Fig. 6.8. The spatial sizes of the anchor vectors in each stage are: 1) one stage: 32x32; 2) two stages: 8x8, 4x4; 3) three stages: 4x4, 4x4, 2x2; 4) four stages: 4x4, 2x2, 2x2, 2x2; 5) five stages: 2x2, 2x2, 2x2, 2x2, 2x2. We set the total energy threshold to 90% in all five cases, and the energy is evenly distributed over the stages in the multi-stage cases. We see that the first several anchor vectors can cover a large portion of the total energy, and that more transform stages can preserve the energy using fewer Saab coefficients, especially for more complicated datasets such as CIFAR-10.

Stability of learned Saab filters. Since the feature learning process using the Saab transform is unsupervised, it tends to be more stable than supervised approaches. We study the stability of the Saab transform from two perspectives: 1) we investigate the convergence of the covariance matrix of the input data, and 2) we compare the cosine similarity of the Saab transform kernels learned from different training subsets in each stage.

In the first row of Fig. 6.6, we compute the Frobenius norm of the difference between two covariance matrices computed using N_i patch vectors and N_i + 1000 patch vectors. All patch vectors are randomly selected from the whole training patch set, where patches are sampled from each image with overlap. The first-stage covariance matrix converges faster than the second-stage one in terms of the number of images. Overall, we see that the covariance matrix converges when using about 2k training images.
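The covariance-convergence check just described can be reproduced with a few lines: the covariance matrix of randomly sampled patch vectors is re-estimated as more patches are included, and the Frobenius norm of the difference between consecutive estimates is tracked. The sketch below uses random data in place of real image patches and is only meant to show the bookkeeping.

```python
import numpy as np

def covariance_convergence(patches, start=1000, step=1000):
    """Frobenius norm between covariance estimates from N and N + step patch vectors."""
    diffs = []
    prev_cov = np.cov(patches[:start], rowvar=False)
    for n in range(start + step, len(patches) + 1, step):
        cov = np.cov(patches[:n], rowvar=False)
        diffs.append(np.linalg.norm(cov - prev_cov, ord="fro"))
        prev_cov = cov
    return diffs

# Stand-in for patch vectors sampled from training images (e.g. flattened 5x5 patches).
rng = np.random.default_rng(0)
patches = rng.normal(size=(20000, 25))
print(["%.4f" % d for d in covariance_convergence(patches)[:5]])
```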
Figure 6.6: Convergence of covariance matrix and Saab kernels for the MNIST dataset: the first row indicates the convergence of covariance matrix, second to fourth rows indi- cate the convergence curves of the top 18 kernels from the first stage, and fifth to seventh rows indicate the convergence curves of the top 18 kernels from the second stage. 97 As for the learned anchor vectors, we calculate the cosine similarity between each anchor vector against the training data size. We first examined the overlapping learning process, which uses the LetNet-5 like architecture, and the results are shown in the second to seventh rows of Fig. 6.6. It shows that first ten anchor vectors have fast convergence in both stages and the anchor vectors become stable when using about 10k images. For the non-overlapping learning process, we tested the image representations after five stage Saab transforms, which set filter sizes to 2 2, and the total energy threshold to 95% for each stage. More results are shown in Fig. 6.7, and we can observe that the first several AC kernels are quite stable, especially on Fashion MNIST and Cifar- 10 datasets. We can conclude that the trained anchor vectors and covariance matrix are very stable against the training data size, and it converges with a small subset of data. Figure 6.7: The cosine similarity of the first nine transform kernels obtained with subsets and the whole set of MNIST training data on MNIST, Fashion MNIST, and Cifar-10 dataset, respectively. 98 Figure 6.8: The cumulative energy distribution of last stage Saab coefficients on the MNIST, Fashion MNIST, and Cifar-10 datasets, where all images are converted to Y channel firstly. The second plot focus on the first ten AC coefficients. Reconstruction using Inverse Saab transforms. Inverse Saab transform is easily com- puted because of the orthonormal transform kernels. The non-overlapping multi-stage Saab transforms also facilitate the reconstruction process which doesn’t have redun- dancy Saab coefficients. The number of anchor vectors directly related to the approx- imation errors, so it provides us a way to investigate the relation between information preserving and the number of anchor vectors. As illustrated in Fig. 6.9, we can recon- struct an input image with inverse Saab transforms using Saab coefficients. Ten example 99 input images are reconstructed from the first stage and second stage Saab coefficients. In the first four row, the reconstructed images are from first-stage Saab coefficients using 8, 16, 24, 32 anchor vectors, respectively. The last four rows give reconstructed images from second-stage Saab coefficients using 32 anchor vectors in the first stage and 50, 150, 250, 350 anchor vectors in the second stage respectively. The quality of recon- structed images increases when keeping more anchor vectors, and people can recognize the digits easily even when keeping small number of anchor vectors. Thus, we can conduct effectively dimensionality reduction based on this observation. Furthermore, the reconstructed results indicate the information loss during the transforms which can guide the choice of the number of anchor vectors. 100 Figure 6.9: Illustration of image reconstruction with non-overlapping inverse Saab trans- forms. The first four rows indicate the reconstruction results using one stage Saab trans- form with kernel number 8, 16, 24, 32. The last four rows indicate the reconstruction results using two stages Saab transform with kernel number 50,150,250,350. 
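Because the Saab kernels are (nearly) orthonormal, the inverse transform essentially amounts to subtracting the constant bias and multiplying by the transpose of the kernel matrix. The toy sketch below uses a random orthonormal matrix in place of learned Saab kernels to show how the reconstruction error behaves as more kernels are kept; it is illustrative only.

```python
import numpy as np

def inverse_saab(coeffs, kernels, bias):
    """Reconstruct flattened patches from Saab-style coefficients.

    Since the kernel matrix has orthonormal rows, its pseudo-inverse is its
    transpose: subtract the bias and project back to the patch space.
    """
    return (coeffs - bias) @ kernels

# Toy forward transform with an orthonormal kernel matrix (rows = kernels).
rng = np.random.default_rng(0)
patches = rng.normal(size=(4, 25))
q, _ = np.linalg.qr(rng.normal(size=(25, 25)))       # 25 orthonormal directions
for n_kept in (8, 16, 25):                            # keep fewer or more kernels
    kernels = q[:, :n_kept].T                         # (n_kept, 25)
    bias = 10.0
    coeffs = patches @ kernels.T + bias               # forward Saab-style transform
    recon = inverse_saab(coeffs, kernels, bias)
    err = np.linalg.norm(patches - recon) / np.linalg.norm(patches)
    print(f"{n_kept:2d} kernels kept -> relative reconstruction error {err:.3f}")
```

Keeping all kernels reconstructs the patch exactly (up to floating-point error), while truncation introduces exactly the kind of approximation error visualized in Fig. 6.9.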
6.3.3 Label-Assisted Regression (LAG)

Our design of the Label-Assisted reGression (LAG) unit is motivated by two observations. First, each PixelHop unit offers a representation of a neighborhood of a certain size centered at a target pixel. The size varies from a small one to a large one through successive subspace expansion and approximation. The representations are called attributes. We aggregate the local-to-global attributes across multiple PixelHop units at multiple selected pixels to solve the image classification problem. One straightforward aggregation method is to concatenate all attributes to form a long feature vector. Yet, the dimension of the concatenated attributes is too high to be effective. We need another way to lower the dimension of the concatenated attributes.

Second, CNNs use data labels effectively through BP. It is desired to find a way to use data labels in the SSL framework. The attributes of the same object class should reside in a smaller subspace of a high-dimensional attribute space. We attempt to search for the subspace formed by samples of the same class for dimension reduction. This procedure defines a supervised dimension reduction technique. Although presented in a different form, a similar idea was investigated in [46]. To give an interpretation of the first FC layer of the LeNet-5, Kuo et al. [46] proposed to partition samples of each digit class into 12 clusters to generate 12 pseudo-classes that account for intra-class variability. By mimicking the dimension of the first FC layer of the LeNet-5, we have 120 clusters (i.e., 12 clusters per digit for 10 digits) in total. Since each training sample belongs to one of the 120 clusters, we assign it a one-hot vector in a space of dimension 120. Then, we can set up a least-squared regression (LSR) system containing 120 affine equations that map samples in the input feature space to the output space formed by the 120 one-hot vectors. The one-hot vector is used to indicate a cluster with a hard label. In this work, we adopt soft-labeled output vectors in setting up the LSR problem. The learning task is to use the training samples to determine the elements of the regression matrix. Then, we apply the learned regression matrix to testing samples for dimension reduction.

Following the notation in Fig. 6.1, the concatenation of attributes in the ith PixelHop unit yields a vector of dimension S_i x S_i x K_i, where S_i x S_i denotes the number of selected pixels and K_i denotes the attribute number. We study the distribution of these concatenated attribute vectors based on their labels through the following steps:

1. We cluster samples of the same class to create object-oriented subspaces and find the centroid of each subspace.
2. Instead of adopting a hard association between samples and centroids, we adopt a soft association. As a result, the target output vector is changed from a one-hot vector to a probability vector.
3. We set up and solve a linear LSR problem using the probability vectors.

The regression matrix obtained in the last step is the label-assisted regressor. For Step #1, we adopt the k-means clustering algorithm. It applies to samples of the same class only. We partition samples of each class into L clusters. Suppose that there are J object classes, denoted by O_j, j = 1, ..., J, and that the dimension of the concatenated attribute vectors is n. For Step #2, we denote the vector of concatenated attributes of class O_j by x_j = (x_{j,1}, x_{j,2}, \dots, x_{j,n})^T \in \mathbb{R}^n.
Also, we denote the centroids of the L clusters by c_{j,1}, c_{j,2}, \dots, c_{j,L}. Then, we define the probability of sample x_j belonging to centroid c_{j',l} as

\mathrm{Prob}(x_j, c_{j',l}) = 0, \quad \text{if } j \neq j',   (6.1)

and

\mathrm{Prob}(x_j, c_{j,l}) = \frac{e^{-\alpha d(x_j, c_{j,l})}}{\sum_{l=1}^{L} e^{-\alpha d(x_j, c_{j,l})}},   (6.2)

where d(x, y) is the Euclidean distance between vectors x and y, and \alpha is a parameter that determines the relationship between the Euclidean distance and the likelihood of a sample belonging to a cluster. The larger \alpha is, the faster the probability decays with the distance. The shorter the Euclidean distance, the larger the likelihood. Finally, we can define the probability of sample x belonging to the subspace spanned by the centroids of class j as

\mathbf{p}_j(x_{j'}) = \mathbf{0}, \quad \text{if } j \neq j',   (6.3)

where \mathbf{0} is the zero vector of dimension L, and

\mathbf{p}_j(x_j) = \big(\mathrm{Prob}(x_j, c_{j,1}), \dots, \mathrm{Prob}(x_j, c_{j,l}), \dots, \mathrm{Prob}(x_j, c_{j,L})\big)^T.   (6.4)

Finally, we can set up a set of linear LSR equations to relate the input attribute vector and the output probability vector as

\begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} & w_1 \\
a_{21} & a_{22} & \cdots & a_{2n} & w_2 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
a_{M1} & a_{M2} & \cdots & a_{Mn} & w_M
\end{bmatrix}
\begin{bmatrix}
x_1 \\ x_2 \\ \vdots \\ x_n \\ 1
\end{bmatrix}
=
\begin{bmatrix}
\mathbf{p}_1(x) \\ \vdots \\ \mathbf{p}_j(x) \\ \vdots \\ \mathbf{p}_J(x)
\end{bmatrix},   (6.5)

where M = JL is the total number of centroids, the parameters w_1, w_2, \dots, w_M are the bias terms, and \mathbf{p}_j(x) is defined in Eq. (6.4). It is the probability vector of dimension L, which indicates the likelihood of input x belonging to the subspace spanned by the centroids of class j. Since x can belong to one class only, we have zero probability vectors with respect to the remaining J - 1 classes.

6.4 Experimental Results

We organize the experimental results in this section as follows. First, we discuss the experimental setup in Sec. 6.4.1. Second, we conduct an ablation study and examine the effects of different parameters on the Fashion MNIST dataset in Sec. 6.4.2. Third, we perform error analysis, compare the performance of color image classification using different color spaces, and show the scalability of the PixelHop method in Sec. 6.4.3. Finally, we benchmark the PixelHop method against the LeNet-5 network [49], a classical CNN architecture of model complexity similar to that of the PixelHop method, in terms of classification accuracy and training complexity in Sec. 6.4.4.

6.4.1 Experiment Setup

We test the classification performance of the PixelHop method on three popular datasets: the MNIST dataset [49], the Fashion-MNIST dataset [81] and the CIFAR-10 dataset [40]. The MNIST dataset contains gray-scale images of handwritten digits (from 0 to 9). It has 60,000 training images and 10,000 testing images. The original image size is 28x28, and zero-padding is used to enlarge the image size to 32x32. The Fashion-MNIST dataset contains gray-scale fashion images. Its image size and numbers of training and testing images are the same as those of the MNIST dataset. The CIFAR-10 dataset has 10 object classes of color images and the image size is 32x32. It has 50,000 training images and 10,000 testing images.

The following settings are used as the default in our experiments.

1. Four PixelHop units are cascaded in the PixelHop system.
2. To decide the number of Saab AC filters in the unsupervised dimension reduction procedure, we set the total energy ratio preserved by the AC filters to 98% for the MNIST and the Fashion MNIST datasets and 95% for the CIFAR-10 dataset.
3.
To aggregate attributes spatially, we average responses of nonoverlapping patches of sizes 44, 22, and 22 in the first, the second and the third PixelHop units, respectively, to reduce the spatial dimension of attribute vectors. As a result, the first, the second, the third and the fourth PixelHop units have outputs of dimension 105 2 2K 1 , 2 2K 2 , 4 4K 3 , and 4 4K 4 , respectively. Then, we feed all of them to the supervised dimension reduction unit. 4. We set = 10 and the number of clusters for each object class toL = 5 in the supervised dimension reduction unit. Since there areJ = 10 object classes in all three datasets of concern, the dimension is reduced toJL = 50. 5. We use the multi-class SVM classifier with the Radial Basis Function (RBF) as the kernel. Before training the SVM classifier, we normalize each feature dimension to be a zero mean random variable with the unit variance. 6.4.2 Ablation Study Table 6.1: Ablation study for the Fashion MNIST dataset. Feature Used SRD Spatial Pooling Classifier Test ACC (%) ALL Last Unit Yes No Mean Min Max Skip SVM RF X X X X 89.49 X X X X 89.47 X X X X 89.36 X X X X 91.18 X X X X 91.16 X X X X 91.09 X X X X 91.10 We use the Fashion-MNIST dataset as an example for the ablation study. We show the test averaged classification accuracy (ACC) results for the Fashion-MNIST dataset under different settings in Table 6.1. We can reach a classification accuracy of 91.18% with the default setting as shown in the sixth row of the table. It is obtained by aggregat- ing image representations from all four PixelHop units, using mean-pooling to reduce the spatial dimension of attribute vectors, applying supervised dimension reduction (SDR) and the SVM classifier. Effects of several hyper parameters are discussed below. 106 Number of Saab AC filters. We study the relationship between the classification per- formance and the energy preservation ratio of the Saab AC filters in Fig. 6.10, where the x-axis indicates the cumulative energy ratio preserved by AC filters. Although preserv- ing more AC energy can improve the classification performance, the rate of improve- ment is slow. Furthermore, we show the relationship between the energy preservation ratio and the number of Saab AC filters in Fig. 6.11. We see that leading AC filters can capture a large amount of energy while the capability drops rapidly as the index becomes larger. We plot four energy thresholds: 95% (yellow), 96% (purple), 97% (green), and 98% (red) in Fig. 6.11. This suggests that we may use a higher energy ratio in the begin- ning PixelHop units and a lower energy ratio in the latter PixelHop units if we would like to balance the classification performance and the complexity. Figure 6.10: The classification accuracy as a function of the total energy preserved by AC filters tested on the Fashion-MNIST dataset. 107 Figure 6.11: The log energy plot as a function of the number of AC filters tested on the Fashion MNIST dataset, wherer the yellow, purple, green and red dots indicate the cumulative energy ratio of 95%, 96%, 97%, and 98%, respectively. Aggregation. The PixelHop system provides a rich and diverse set of local-to-global features. We compare the classification performance using the output of an individual PixelHop unit and their aggregations in Table 6.2. We see from the table clear advan- tages of aggregating features across all units. The PixelHop method under the default setting can achieve a classification accuracy of 91.18%. 
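As a concrete picture of what aggregating features across all units involves, the sketch below applies a LAG-style label-assisted regression to the (already pooled) output of each of four hypothetical PixelHop units, using soft targets in the spirit of Eqs. (6.1)-(6.4) with alpha = 10 and L = 5 clusters per class, so that each unit contributes JL = 50 features, and then trains an RBF-SVM on the concatenation. The feature dimensions, names and random data are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def lag_unit(X, y, n_classes, clusters_per_class=5, alpha=10.0):
    """LAG-style reduction: soft targets from per-class k-means centroids, then a
    least-squares regression matrix (with a bias column) mapping attributes to an
    (n_classes * clusters_per_class)-dimensional feature vector."""
    M = n_classes * clusters_per_class
    targets = np.zeros((len(X), M))
    for c in range(n_classes):
        idx = np.where(y == c)[0]
        km = KMeans(n_clusters=clusters_per_class, n_init=10, random_state=0).fit(X[idx])
        d = np.linalg.norm(X[idx, None, :] - km.cluster_centers_[None, :, :], axis=2)
        prob = np.exp(-alpha * (d - d.min(axis=1, keepdims=True)))   # stabilized exp(-alpha*d)
        targets[idx, c * clusters_per_class:(c + 1) * clusters_per_class] = \
            prob / prob.sum(axis=1, keepdims=True)                   # soft association
    Xa = np.hstack([X, np.ones((len(X), 1))])                        # append bias column
    W, *_ = np.linalg.lstsq(Xa, targets, rcond=None)                 # regression matrix
    return W

def lag_transform(X, W):
    return np.hstack([X, np.ones((len(X), 1))]) @ W

# Hypothetical pooled outputs of four PixelHop units for 500 training images.
rng = np.random.default_rng(0)
y = rng.integers(0, 10, size=500)
unit_outputs = [rng.normal(size=(500, d)) for d in (100, 120, 160, 200)]
regressors = [lag_unit(Z, y, n_classes=10) for Z in unit_outputs]
features = np.hstack([lag_transform(Z, W) for Z, W in zip(unit_outputs, regressors)])
clf = SVC(kernel="rbf").fit(features, y)     # final classifier on the aggregated features
print(features.shape)                        # (500, 200): 4 units x 50 features each
```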
We can boost the classification accuracy furthermore using feature ensembles. For example, we can adopt two mean- pooling sizes to generate two sets of features in each PixelHop unit. That is, we set the pooling sizes to two and four in the first and the second PixelHop units and to one and two in the third and the fourth PixelHop units, where one is the same as no pooling. The testing accuracy increases to 91.44% when we aggregate eight features as shown in the “advanced” column in Table 6.2. 108 Table 6.2: Comparison of the classification accuracy (%) using features from an indi- vidual PixelHop unit, the PixelHop system with the default setting and the PixelHop system with the advanced aggregation setting for the Fashion MNIST dataset. Dataset HOP-1 HOP-2 HOP-3 HOP-4 default advanced Fashion MINST 89.62 90.01 89.35 89.51 91.18 91.44 Table 6.3: The confusion matrix for the MNIST dataset, where the first row shows the predicted object labels and the first column shows the true object labels. 0 1 2 3 4 5 6 7 8 9 0 0.997 0 0 0 0.001 0 0 0.001 0.001 0 1 0 0.996 0.002 0.001 0 0 0.001 0 0 0 2 0.001 0 0.990 0 0.001 0 0 0.006 0.002 0 3 0 0 0.001 0.994 0 0.002 0 0.002 0.001 0 4 0 0 0.001 0 0.991 0 0.002 0.001 0 0.005 5 0.001 0 0 0.002 0 0.994 0.001 0.001 0 0 6 0.007 0.002 0 0 0.002 0.002 0.985 0 0.001 0 7 0 0.002 0.007 0 0 0 0 0.989 0.001 0.001 8 0.002 0 0.001 0.003 0 0.001 0 0.002 0.989 0.002 9 0.003 0.001 0.001 0.001 0.005 0.001 0 0.006 0.003 0.979 Table 6.4: The confusion matrix for the Fashion MNIST dataset, where the first row shows the predicted object labels and the first column shows the true object labels. T-shirt/top Trouser Pullover Dress Coat Sandal Shirt Sneaker Bag Ankle boot T-shirt/top 0.872 0.001 0.014 0.017 0.006 0 0.082 0 0.008 0 Trouser 0.001 0.98 0 0.013 0.002 0 0.002 0 0.002 0 Pullover 0.022 0.001 0.859 0.006 0.061 0 0.048 0 0.003 0 Dress 0.013 0.003 0.009 0.919 0.026 0 0.027 0 0.003 0 Coat 0 0 0.063 0.023 0.862 0 0.052 0 0 0 Sandal 0 0 0 0 0 0.983 0 0.015 0 0.002 Shirt 0.106 0 0.055 0.023 0.063 0 0.746 0 0.007 0 Sneaker 0 0 0 0 0 0.006 0 0.979 0 0.015 Bag 0.002 0.001 0.001 0.004 0.001 0.002 0.002 0.003 0.983 0.001 Ankle boot 0 0 0 0 0 0.006 0.001 0.032 0 0.961 6.4.3 Error Analysis, Color Spaces and Scalability Error Analysis. We provide the confusion matrices for the MNIST dataset, the fashion MNIST dataset and the CIFAR-10 dataset in Table 6.3, Table 6.4 and Table 6.5, respec- tively. Furthermore, we show some error cases in Fig. 6.12 and have the following observations. 109 Table 6.5: The confusion matrix for the CIFAR-10 dataset, where the first row shows the predicted object labels and the first column shows the true object labels. airplane automobile bird cat deer dog frog horse ship truck airplane 0.774 0.016 0.037 0.02 0.014 0.004 0.012 0.012 0.072 0.039 automobile 0.024 0.816 0.005 0.011 0.004 0.007 0.006 0.003 0.03 0.094 bird 0.074 0.005 0.601 0.065 0.082 0.071 0.058 0.024 0.011 0.009 cat 0.026 0.018 0.069 0.545 0.056 0.164 0.062 0.034 0.011 0.015 deer 0.031 0.004 0.065 0.057 0.682 0.039 0.045 0.056 0.018 0.003 dog 0.011 0.007 0.072 0.19 0.05 0.588 0.029 0.04 0.01 0.003 frog 0.009 0.008 0.05 0.052 0.035 0.029 0.807 0.003 0.005 0.002 horse 0.022 0.005 0.037 0.051 0.054 0.066 0.008 0.749 0.001 0.007 ship 0.061 0.039 0.008 0.018 0.002 0.004 0.002 0.007 0.834 0.025 truck 0.038 0.095 0.012 0.015 0.004 0.005 0.005 0.015 0.031 0.78 For the MNIST dataset, the misclassified samples are truly challenging. 
To handle these hard sample, we may need to turn to a rule-based method. For example, humans often write “4” in two strokes and “9” in one stroke. If we can identify the troke number from a static image, we can use the information to make better prediction. For the Fashion-MNIST dataset, we see that the “shirt” class is the most challeng- ing one. As shown in Fig. 6.12, the shirt class is a general class that overlaps with the “top”, the “pullover” and the “coat” classes. This is the main source of erroneous classifications. For the CIFAR-10 dataset, the “dog” class can be confused with the “cat” class. As compared with other object classes, the “dog” and the “cat” classes share more visual similarities. On one hand, it may demand more distinctive features to dif- ferentiate them. On the other hand, the error is caused by the poor image reso- lution. The “Ship” and the “airplane” classes form another confusing pair. The background is quite similar in these two object classes, i.e. containing the blue sky and the blue ocean. It is a challenging task to recognize small objects in poor resolution images. 110 Figure 6.12: Representative misclassified images in three benchmark datasets, where the first and the second rows show erroneous predictions in the MNIST dataset, the third and the fourth rows erroneous predictions of the “shirt” class in the Fashion-MNIST dataset, and the fifth and the sixth rows show two confusing pairs, “dog vs. cat” and “ship vs. airplane”, in the CIFAR-10 dataset. Performance of Different Color Spaces. We report experimental results on the CIFAR- 10 dataset with different color representations in Table 6.6. We consider three color spaces - RGB, YCbCr, and Lab. The three color channels are combined with different strategies: 1) three channels are taken as the input jointly; 2) each channel is taken as the input individually and all three channels are concatenated at the classification stage; 3) luminance and chrominance components are processed individually and concatenated at the classification stage. We see a clear advantage of processing one luminance channel (L or Y) and two chrominance channels (CbCr or ab) separately and then concatenate 111 the extracted features at the classification stage. This observation is consistent with our prior experience in color image processing. Table 6.6: Comparison of classification accuracy (%) using different color representa- tions on the CIFAR-10 dataset. RGB R,G,B YCbCr Y ,CbCr Lab L,ab Test 68.59 69.96 67.7 71.15 67.21 71.76 Train 83.11 86.18 84.4 86.28 90.88 89.61 Scalability. Since the PixelHop method is a nonparametric learning approach, it is scalable to the number of training samples. In contrast, the LeNet-5 is a parametric learning approach, and its model complexity is fixed regardless of the training data num- ber. We compare the classification accuracies of the LeNet-5 and the PixelHop method in Fig. 6.13, where only 1/4, 1/8, 1/16,1/32, 1/64, 1/128, 1/256 of the original train- ing dataset are randomly selected as the training data for the MNIST and the Fashion MNIST datasets. After training, we apply the learned systems to 10,000 testing data as usual. As shown in Fig. 6.13, when the number of labeled training data is reduced, the classification performance of the LeNet-5 drops faster than the PixelHop method. 
Scalability. Since the PixelHop method is a nonparametric learning approach, it is scalable with respect to the number of training samples. In contrast, the LeNet-5 is a parametric learning approach, and its model complexity is fixed regardless of the size of the training data. We compare the classification accuracies of the LeNet-5 and the PixelHop method in Fig. 6.13, where only 1/4, 1/8, 1/16, 1/32, 1/64, 1/128 and 1/256 of the original training dataset are randomly selected as the training data for the MNIST and the Fashion MNIST datasets. After training, we apply the learned systems to the 10,000 testing data as usual. As shown in Fig. 6.13, when the number of labeled training data is reduced, the classification performance of the LeNet-5 drops faster than that of the PixelHop method. For the extreme case with 1/256 of the original training data size (i.e., 234 training samples), the PixelHop method outperforms the LeNet-5 by 2.3% and 15.8% in testing accuracy for MNIST and Fashion-MNIST, respectively. Clearly, the PixelHop method is more scalable than the LeNet-5 against the training data size.

Figure 6.13: Comparison of testing accuracy (%) of the LeNet-5 and the PixelHop method with different training sample numbers for (a) the MNIST and (b) the Fashion-MNIST datasets.

6.4.4 Benchmarking between PixelHop and LeNet-5

We compare the training complexity and the classification performance of the LeNet-5 and the PixelHop method in this subsection. Note that these two machine learning models have comparable model complexity. As mentioned in Sec. 6.4.2, we use different mean-pooling sizes to generate two representations in each PixelHop unit for the PixelHop method. That is, the pooling sizes are two and four for the first and the second PixelHop units and one and two for the third and the fourth PixelHop units.

We use the Lab color space to represent images in the CIFAR-10 dataset and build the PixelHop units in the luminance and the chrominance spaces separately. For the LeNet-5, we train all networks using TensorFlow [1] with 50 epochs and a batch size of 32. The classic LeNet-5 architecture [49] was designed for the MNIST dataset. We use it for the Fashion MNIST dataset as well. We modify the network architecture slightly to handle color images in the CIFAR-10 dataset. The parameters of the original and the modified LeNet-5 are compared in Table 6.7; a training sketch of the modified network is given after Table 6.9. The modified LeNet-5 was originally proposed in [46].

Table 6.7: Comparison of the original and the modified LeNet-5 architectures.

Architecture                  Original LeNet-5   Modified LeNet-5
1st Conv Layer Kernel Size    5×5×1              5×5×3
1st Conv Layer Kernel No.     6                  32
2nd Conv Layer Kernel Size    5×5×6              5×5×32
2nd Conv Layer Kernel No.     16                 64
1st FC Layer Filter No.       120                200
2nd FC Layer Filter No.       84                 100
Output Node No.               10                 10

We compare the classification performance of the LeNet-5 and the PixelHop method for all three datasets in Table 6.8. As shown in the table, the PixelHop method outperforms the LeNet-5 on all three datasets. The performance gains are 0.05%, 0.36% and 3.04% for the MNIST, the Fashion-MNIST and the CIFAR-10 datasets, respectively. Furthermore, we compare the training time of the LeNet-5 and the PixelHop method for all three datasets in Table 6.9. The PixelHop method takes less training time than the LeNet-5 for all three datasets on a CPU (Intel(R) Xeon(R) CPU E5-2620 v3 at 2.40GHz). Although these comparisons are still preliminary, we do see that the PixelHop method can be competitive in terms of classification accuracy and training complexity.

Table 6.8: Comparison of testing accuracy (%) of the LeNet-5 and the PixelHop method on the MNIST, the Fashion MNIST and the CIFAR-10 datasets.

Method     MNIST   Fashion MNIST   CIFAR-10
LeNet-5    99.04   91.08           68.72
PixelHop   99.09   91.44           71.76

Table 6.9: Comparison of training time of the LeNet-5 and the PixelHop method on the MNIST, the Fashion MNIST and the CIFAR-10 datasets.

Method     MNIST    Fashion MNIST   CIFAR-10
LeNet-5    25 min   25 min          45 min
PixelHop   15 min   15 min          30 min
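For reference, the following is a minimal training sketch of the modified LeNet-5 baseline in Table 6.7, written with the TensorFlow Keras API. The kernel sizes, filter numbers and node numbers follow Table 6.7, and the 50 epochs and batch size of 32 follow the setup described above; the ReLU activations, max pooling and Adam optimizer are assumptions made for illustration and may differ from the exact configuration of [46].

import tensorflow as tf

def build_modified_lenet5(input_shape=(32, 32, 3), num_classes=10):
    # Two convolutional layers and two FC layers, following Table 6.7.
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 5, activation="relu", input_shape=input_shape),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Conv2D(64, 5, activation="relu"),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(200, activation="relu"),
        tf.keras.layers.Dense(100, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

model = build_modified_lenet5()
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# Training setup used in the comparison: 50 epochs, batch size 32.
# model.fit(x_train, y_train, epochs=50, batch_size=32)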
6.5 Discussion

We compare DL and SSL at a high level below.
- DL is a parametric learning methodology while SSL is a nonparametric one.
- DL is a black box while SSL is a white box whose behavior is mathematically explainable.
- DL formulates a learning problem as an end-to-end optimization problem and solves it using BP. SSL does not rely on end-to-end optimization at all. It designs a learning system in a one-pass feedforward manner. No BP is needed.
- The training complexity of DL is significantly higher than that of SSL.
- One can add parameters to or remove them from an SSL model flexibly.

Our interest in the SSL technique for pattern recognition and computer vision arises from research on explainable DL. This work is a sequel to our cumulative research efforts over the last five years [43]-[44]. The sign confusion problem was pointed out in [43]. It was also argued in [43] that nonlinear activation is used to avoid this problem. The intra-class variability problem was discussed in [44]. Two new signal transforms, called the Saak (Subspace approximation with augmented kernels) transform and the Saab (Subspace approximation with adjusted bias) transform, were proposed in [45] and [46], respectively. They paved a solid foundation for SSL.

As compared with previous work, there are several new contributions in this work.
1. Further insights into the unsupervised dimension reduction technique, such as the Saab transform, were given in Sec. 6.3.2. We studied the convergence behavior of covariance matrices as a function of the number of training samples. It indicates that the Saab filters are robust against a smaller number of training samples.
2. We examined the choice of several hyper-parameters in the SSL framework.
3. To the best of our knowledge, the proposed supervised dimension reduction technique, called the LAG and presented in Sec. 6.3.3, which exploits labeled data distributions, is novel.
4. To illustrate the SSL methodology on image-based object recognition, we proposed the PixelHop method and conducted extensive experiments to demonstrate the power of the PixelHop system in Sec. 6.3.

6.6 Conclusion

A successive subspace learning (SSL) methodology was introduced and the PixelHop method was proposed in this work. In contrast with traditional subspace methods, the SSL framework builds a sequence of subspaces successively using the near- and far-neighborhoods of a set of selected pixels. The PixelHop method has several unique ingredients. First, each basic PixelHop unit contains a subspace concatenation step and a subspace approximation step. The former is used to merge several smaller subspaces into a larger one while the latter reduces the dimension of the exact subspace to that of an approximate one. Second, it has multiple PixelHop units in cascade so as to capture the characteristics of neighborhoods of varying sizes. Third, it contains an efficient way to aggregate attributes from all PixelHop units and reduce the dimension further using a novel supervised dimension reduction technique called the label-guided regressor (LGR). Extensive experiments were conducted on the MNIST, the Fashion MNIST and the CIFAR-10 datasets to demonstrate the superior performance of the PixelHop method in terms of classification accuracy and training complexity.

Chapter 7

Conclusion and Future Work

7.1 Summary of the Research

In this thesis, we first developed efficient and robust image classification methods based on two neural-network-inspired image transforms, the Saak transform and the Saab transform, respectively. Then, we extended the Saab-transform-based decision system to semi-supervised learning. Finally, a unified successive subspace learning framework was proposed and used to solve image classification problems.
Data-driven forward and inverse Saak transforms offer a new angle to look at the image representation problem and provide powerful spatial-spectral Saak features for pattern recognition and image synthesis. The MNIST dataset was used as an example to demonstrate the applications of the Saak transform. In particular, a lossy Saak-transform-based approach was proposed to solve the handwritten digits recognition problem. This new approach has several advantages, such as higher efficiency than the lossless Saak transform, scalability against variations of the training data size and the number of object classes, and robustness against noisy images.

To solve image classification problems, an ensemble approach built on the Saab-transform-based decision system (the FF-designed CNN) was proposed, and it outperformed single BP- and FF-designed CNNs on two benchmark datasets. FF-designed CNNs offer a complementary approach to CNN filter weight selection. To enhance the performance of the ensemble system, it is critical to increase the diversity of the FF-CNN models. To achieve this objective, we introduced diversity by adopting three strategies: 1) different parameter settings in the convolutional layers, 2) flexible feature subsets fed into the fully-connected (FC) layers, and 3) multiple image embeddings of the same input source. Furthermore, we partitioned input samples into easy and hard ones based on their decision confidence scores. As a result, we can develop a new ensemble system tailored to hard samples to further boost the classification accuracy.

A semi-supervised learning framework using the FF-designed CNN was proposed for image classification. This work has several novel contributions. First, we applied FF-CNNs in the semi-supervised setting and showed that FF-CNNs outperform BP-CNNs when the size of the labeled data set becomes smaller. Second, we proposed an ensemble system that fuses the output decision vectors of multiple FF-CNNs so as to achieve even better performance. Third, we conducted experiments on three benchmarking datasets (i.e., MNIST, SVHN, and CIFAR-10) to demonstrate the effectiveness of the proposed solutions.

We introduced a unified framework, called successive subspace learning (SSL), based on the Saab transform and the FF-designed CNN. To illustrate the SSL principle in the context of image-based object recognition, we proposed a novel PixelHop method. The PixelHop method has several unique ingredients. First, each basic PixelHop unit contains a subspace concatenation step and a subspace approximation step. The former is used to merge several smaller subspaces into a larger one while the latter reduces the dimension of the exact subspace to that of an approximate one. Second, it has multiple PixelHop units in cascade so as to capture the characteristics of neighborhoods of varying sizes. Third, it contains an efficient way to aggregate attributes from all PixelHop units and reduce the dimension further using a novel supervised dimension reduction technique called the label-guided regressor (LGR). A minimal sketch of one PixelHop unit is given below.
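The following is a minimal sketch of the two steps inside one PixelHop unit. The input is assumed to be an H x W x K array of per-pixel attributes, a 3x3 neighborhood is used for the concatenation step, and plain PCA (fit on the single input for brevity, rather than on a training set) stands in for the Saab transform in the approximation step; all of these choices are illustrative and do not reproduce the exact PixelHop settings.

import numpy as np

def pixelhop_unit(attrs, num_kept, window=3):
    # attrs: H x W x K array of per-pixel attribute vectors.
    H, W, K = attrs.shape
    pad = window // 2
    padded = np.pad(attrs, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")

    # Subspace concatenation: merge the attributes of each pixel and its
    # neighbors into one larger vector.
    patches = np.zeros((H, W, window * window * K))
    for i in range(H):
        for j in range(W):
            patches[i, j] = padded[i:i + window, j:j + window, :].ravel()

    # Subspace approximation: project onto the leading principal components
    # (PCA used here in place of the Saab transform).
    X = patches.reshape(-1, window * window * K)
    X = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    reduced = X @ Vt[:num_kept].T
    return reduced.reshape(H, W, num_kept)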
7.2 Future Research Directions

As future extensions, we would like to apply the SSL-based approach to more challenging datasets in terms of more object classes and/or larger image sizes, such as CIFAR-100 [40] and ImageNet [21].

CIFAR-100. The CIFAR-100 dataset is composed of 100 classes of tiny images, containing 50,000 training images and 10,000 testing images of size 32x32. The number of images in each class is only one-tenth of that in the CIFAR-10 dataset. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image is annotated with a “fine” label (the class to which it belongs) and a “coarse” label (the superclass to which it belongs). To handle 100 object classes, we need to exploit the extracted attributes more effectively. Instead of applying the mean-pooling operation in the spatial domain, we may apply the LGR to the attributes output by each PixelHop unit for supervised dimension reduction. After that, we can apply another LGR to the outputs from all four PixelHop units. The multi-class SVM with 100 object classes can be too slow, so we may adopt the random forest (RF) classifier instead for a speed-up.

Figure 7.1: Sample images from CIFAR-100 [40].

ImageNet. ImageNet is a dataset of over 15 million labeled high-resolution images. Starting in 2010, an annual competition called the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) has been held, which uses a subset of ImageNet with 1000 object categories. Overall, there are about 1.2 million training images and 150,000 validation and testing images. Similar to CIFAR-100, ImageNet also has hierarchical labeling, as illustrated in Figure 7.3. In addition, images in ImageNet have variable resolutions that are too high to fit in the current framework, as demonstrated in Figure 7.2. Therefore, it is a crucial step to reduce the image size before adopting our proposed system. There are several ways to realize spatial dimension reduction. We can downsample the high-resolution images, e.g., by average/max pooling or image scaling. We can also divide the original images into multiple sub-images. To give an example, we can pick the pixel at a fixed location of every 2x2 window; in this way, we acquire four sub-images with half the spatial size along both height and width.

From Figure 7.3, we can observe that objects do not occupy the entire image. It is necessary to zoom into a distinctive spatial region based on some pre-processing techniques. We believe that image segmentation and visual saliency detection techniques play an important role in identifying the salient spatial subspace for further processing. We plan to use the SSL for contour extraction and visual saliency detection. Then, we can crop out a proper region for the object classification task.

Figure 7.2: High-resolution sample images from ImageNet [21].

Figure 7.3: Examples of hierarchical labeling of ImageNet [21].

Another interesting topic for future research is to develop an even more effective SSL system that has performance competitive with advanced neural network systems such as the ResNet and the DenseNet. The LeNet-5 is a classical CNN whose architecture is simple. Yet, its performance is inferior to those of advanced CNNs. It is important to understand the ResNet and the DenseNet better and demystify their superior performance so that we may be inspired to develop a more powerful SSL solution.

References List

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[2] I. Biederman.
Recognition-by-components: a theory of human image understand- ing. Psychological review, 94(2):115, 1987. [3] A. Blum, J. Lafferty, M. R. Rwebangira, and R. Reddy. Semi-supervised learn- ing using randomized mincuts. In Proceedings of the twenty-first international conference on Machine learning, page 13. ACM, 2004. [4] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the eleventh annual conference on Computational learning the- ory, pages 92–100. ACM, 1998. [5] T. Bouwmans. Subspace learning for background modeling: A survey. Recent Patents on Computer Science, 2(3):223–234, 2009. [6] G. E. Box. Non-normality and tests on variances. Biometrika, 40(3/4):318–335, 1953. [7] L. Breiman. Bagging predictors. Machine learning, 24(2):123–140, 1996. [8] L. Breiman. Random forests. Machine learning, 45(1):5–32, 2001. [9] G. Brown, J. Wyatt, R. Harris, and X. Yao. Diversity creation methods: a survey and categorisation. Information Fusion, 6(1):5–20, 2005. 123 [10] J. Bruna and S. Mallat. Invariant scattering convolution networks. IEEE transac- tions on pattern analysis and machine intelligence, 35(8):1872–1886, 2013. [11] T.-H. Chan, K. Jia, S. Gao, J. Lu, Z. Zeng, and Y . Ma. Pcanet: A simple deep learning baseline for image classification? IEEE Transactions on Image Process- ing, 24(12):5017–5032, 2015. [12] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intel- ligence, 40(4):834–848, 2018. [13] Y . Chen, Z. Xu, S. Cai, Y . Lang, and C.-C. J. Kuo. A saak transform approach to efficient, scalable and robust handwritten digits recognition. arXiv preprint arXiv:1710.10714, 2017. [14] Y . Chen, Y . Yang, W. Wang, and C.-C. J. Kuo. Ensembles of feedforward-designed convolutional neural networks. arXiv preprint arXiv:1901.02154, 2019. [15] N. Cohen, O. Sharir, and A. Shashua. On the expressive power of deep learning: a tensor analysis. arXiv preprint arXiv:1509.05009, 556, 2015. [16] C. Cortes and V . Vapnik. Support-vector networks. Machine learning, 20(3):273– 297, 1995. [17] P. Cunningham and J. Carney. Diversity versus quality in classification ensembles based on feature selection. In European Conference on Machine Learning, pages 109–116. Springer, 2000. [18] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathe- matics of Control, Signals and Systems, 2(4):303–314, 1989. [19] J. Dai, Y . Lu, and Y .-N. Wu. Generative modeling of convolutional neural net- works. arXiv preprint arXiv:1412.6296, 2014. [20] Z. Dai, Z. Yang, F. Yang, W. W. Cohen, and R. R. Salakhutdinov. Good semi- supervised learning that requires a bad gan. In Advances in Neural Information Processing Systems, pages 6510–6520, 2017. [21] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large- scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009. [22] T. G. Dietterich. Ensemble methods in machine learning. In International work- shop on multiple classifier systems, pages 1–15. Springer, 2000. 124 [23] A. Fawzi, S. M. Moosavi Dezfooli, and P. Frossard. A geometric perspective on the robustness of deep networks. Technical report, Institute of Electrical and Electronics Engineers, 2017. [24] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. 
Hafner, D. Lee, D. Petkovic, et al. Query by image and video content: The qbic system. computer, 28(9):23–32, 1995. [25] Y . Freund and R. E. Schapire. A decision-theoretic generalization of on-line learn- ing and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997. [26] G. Giacinto and F. Roli. An approach to the automatic design of multiple classifier systems. Pattern recognition letters, 22(1):25–33, 2001. [27] I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y . Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211, 2013. [28] F. E. Grubbs. Sample criteria for testing outlying observations. The Annals of Mathematical Statistics, pages 27–58, 1950. [29] Q. Gu, Z. Li, and J. Han. Joint feature selection and subspace learning. In Twenty- Second International Joint Conference on Artificial Intelligence, 2011. [30] K. He and J. Sun. Convolutional neural networks at constrained time cost. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5353–5360, 2015. [31] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. [32] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989. [33] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2017. [34] N. A. Ibraheem, M. M. Hasan, R. Z. Khan, and P. K. Mishra. Understanding color models: a review. ARPN Journal of science and technology, 2(3):265–275, 2012. [35] A. K. Jain and A. Vailaya. Image retrieval using color and shape. Pattern recogni- tion, 29(8):1233–1244, 1996. 125 [36] C. M. Jarque and A. K. Bera. A test for normality of observations and regres- sion residuals. International Statistical Review/Revue Internationale de Statis- tique, pages 163–172, 1987. [37] T. Joachims. Transductive inference for text classification using support vector machines. In ICML, volume 99, pages 200–209, 1999. [38] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling. Semi-supervised learning with deep generative models. In Advances in neural information process- ing systems, pages 3581–3589, 2014. [39] H.-P. Kriegel, P. Kr¨ oger, and A. Zimek. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data (TKDD), 3(1):1, 2009. [40] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009. [41] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012. [42] L. I. Kuncheva and C. J. Whitaker. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine learning, 51(2):181– 207, 2003. [43] C.-C. J. Kuo. Understanding convolutional neural networks with a mathematical model. Journal of Visual Communication and Image Representation, 41:406–413, 2016. [44] C.-C. J. Kuo. The CNN as a guided multilayer RECOS transform [lecture notes]. 
IEEE Signal Processing Magazine, 34(3):81–89, 2017. [45] C.-C. J. Kuo and Y . Chen. On data-driven Saak transform. Journal of Visual Communication and Image Representation, 50:237–246, 2018. [46] C.-C. J. Kuo, M. Zhang, S. Li, J. Duan, and Y . Chen. Interpretable convolutional neural networks via feedforward design. Journal of Visual Communication and Image Representation, 60:346–359, 2019. [47] K. I. Laws. Rapid texture identification. In Image processing for missile guidance, volume 238, pages 376–382. International Society for Optics and Photonics, 1980. [48] Y . LeCun, Y . Bengio, and G. E. Hinton. Deep learning. Nature, 521:436–444, 2015. 126 [49] Y . Lecun, L. Bottou, Y . Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998. [50] Y . LeCun et al. Lenet-5, convolutional neural networks. URL: http://yann. lecun. com/exdb/lenet, 2015. [51] T. Lin and H. Zha. Riemannian manifold learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(5):796–809, 2008. [52] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015. [53] H. Lu, K. N. Plataniotis, and A. N. Venetsanopoulos. A survey of multilinear subspace learning for tensor data. Pattern Recognition, 44(7):1540–1551, 2011. [54] L. v. d. Maaten and G. Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008. [55] S. Mallat. Group invariant scattering. Communications on Pure and Applied Math- ematics, 65(10):1331–1398, 2012. [56] B. S. Manjunath and W.-Y . Ma. Texture features for browsing and retrieval of image data. IEEE Transactions on pattern analysis and machine intelligence, 18(8):837–842, 1996. [57] M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist net- works: The sequential learning problem. Psychology of learning and motivation, 24:109–165, 1989. [58] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. Deepfool: a simple and accu- rate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2574–2582, 2016. [59] H. Murase and S. K. Nayar. Visual learning and recognition of 3-d objects from appearance. International journal of computer vision, 14(1):5–24, 1995. [60] Y . Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y . Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011. [61] A. Nguyen, J. Yosinski, and J. Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 427–436, 2015. 127 [62] G. Pass and R. Zabih. Histogram refinement for content-based image retrieval. In Applications of Computer Vision, 1996. WACV’96., Proceedings 3rd IEEE Work- shop on, pages 96–102. IEEE, 1996. [63] W. K. Pratt. Digital image processing: PIKS Scientific inside, volume 4. Wiley- interscience Hoboken, New Jersey, 2007. [64] C. Rosenberg, M. Hebert, and H. Schneiderman. Semi-supervised self-training of object detection models. WACV/MOTION, 2, 2005. [65] T. Salimans, I. Goodfellow, W. Zaremba, V . Cheung, A. Radford, and X. Chen. Improved techniques for training gans. 
In Advances in Neural Information Pro- cessing Systems, pages 2234–2242, 2016. [66] R. E. Schapire. The strength of weak learnability. Machine learning, 5(2):197– 227, 1990. [67] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional net- works: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013. [68] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. [69] M. Soltanolkotabi, A. Javanmard, and J. D. Lee. Theoretical insights into the opti- mization landscape of over-parameterized shallow neural networks. IEEE Trans- actions on Information Theory, 2018. [70] H. Stark and J. W. Woods. Probability, random processes, and estimation theory for engineers. Englewood Cliffs: Prentice Hall, 1986, 1986. [71] C. Szegedy, W. Liu, Y . Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V . Van- houcke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015. [72] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of cognitive neuro- science, 3(1):71–86, 1991. [73] T. Tuytelaars, K. Mikolajczyk, et al. Local invariant feature detectors: a survey. Foundations and trends R in computer graphics and vision, 3(3):177–280, 2008. [74] L. Wan, M. Zeiler, S. Zhang, Y . L. Cun, and R. Fergus. Regularization of neural networks using dropconnect. In Proceedings of the 30th international conference on machine learning (ICML-13), pages 1058–1066, 2013. 128 [75] J. Wang, Z. Zhang, and H. Zha. Adaptive manifold learning. In Advances in neural information processing systems, pages 1473–1480, 2005. [76] K. Wang, R. He, L. Wang, W. Wang, and T. Tan. Joint feature selection and subspace learning for cross-modal retrieval. IEEE transactions on pattern analysis and machine intelligence, 38(10):2010–2023, 2015. [77] Y . Wang, H. Su, B. Zhang, and X. Hu. Interpret neural networks by identifying critical data routing paths. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8906–8914, 2018. [78] T. Wiatowski and H. B¨ olcskei. A mathematical theory of deep convolutional neural networks for feature extraction. arXiv preprint arXiv:1512.06293, 2015. [79] D. H. Wolpert. Stacked generalization. Neural networks, 5(2):241–259, 1992. [80] K. Woods, W. P. Kegelmeyer, and K. Bowyer. Combination of multiple classifiers using local accuracy estimates. IEEE transactions on pattern analysis and machine intelligence, 19(4):405–410, 1997. [81] H. Xiao, K. Rasul, and R. V ollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017. [82] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014. [83] C. Zhang and Y . Ma. Ensemble machine learning: methods and applications. Springer, 2012. [84] M. Zhang, H. You, P. Kadam, S. Liu, and C.-C. J. Kuo. Pointhop: An explain- able machine learning method for point cloud classification. arXiv preprint arXiv:1907.12766, 2019. [85] Q. Zhang, Y . Nian Wu, and S.-C. Zhu. Interpretable convolutional neural networks. arXiv preprint arXiv:1710.00935, 2017. [86] X. Zhang, J. Zou, K. He, and J. Sun. Accelerating very deep convolutional net- works for classification and detection. IEEE transactions on pattern analysis and machine intelligence, 38(10):1943–1955, 2016. 
[87] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Object detectors emerge in deep scene CNNs. arXiv preprint arXiv:1412.6856, 2014.

[88] X. Zhu. Semi-supervised learning literature survey. Computer Science, University of Wisconsin-Madison, 2(3):4, 2006.

[89] X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University, 2002.
Abstract
Convolutional neural networks (CNNs) have recently demonstrated impressive performance in image classification and have changed the way feature extractors are built, from carefully handcrafted designs to representations learned automatically from large labeled datasets. However, a great majority of the current CNN literature is application-oriented, and there is no clear understanding or theoretical foundation that explains the outstanding performance and indicates the way to improve it. In this thesis, we focus on solving the image classification problem based on neural-network-inspired transforms.

Motivated by the multilayer RECOS (REctified-COrrelations on a Sphere) transform, two data-driven signal transforms are proposed, called the “Subspace approximation with augmented kernels” (Saak) transform and the “Subspace approximation with adjusted bias” (Saab) transform, corresponding to the convolutional layers in CNNs. Based on the Saak transform, we first propose an efficient, scalable and robust approach to the handwritten digits recognition problem. Next, we develop an ensemble method using the Saab transform to solve the image classification problem. The ensemble method fuses the output decision vectors of Saab-transform-based decision systems. To enhance the performance of the ensemble system, it is critical to increase the diversity of FF-CNN models. To achieve this objective, we introduce diversity by adopting three strategies: 1) different parameter settings in convolutional layers, 2) flexible feature subsets fed into the fully-connected (FC) layers, and 3) multiple image embeddings of the same input source. We also extend our ensemble method to semi-supervised learning. Since unlabeled data may not always enhance semi-supervised learning, we define an effective quality score and use it to select a subset of unlabeled data in the training process. Lastly, we propose a unified framework called successive subspace learning (SSL). With this new viewpoint, the whole CNN pipeline contains multiple subspace processing modules in cascade. To illustrate the SSL principle in the context of image-based object recognition, we introduce a novel PixelHop method. The PixelHop method provides a rich set of representations for image classification. To further decrease the complexity of the PixelHop system, we develop a new label-assisted dimension reduction method. Extensive experiments are conducted to demonstrate the superior performance of the PixelHop method in terms of classification accuracy and training complexity.